

Released

Thesis

Systematic identification of relevant features for the statistical modeling of materials properties of crystalline solids

MPS-Authors

Regler, Benjamin
NOMAD, Fritz Haber Institute, Max Planck Society;

Fulltext (public)

thesis_regler.pdf
(Any fulltext), 9MB

Citation

Regler, B. (2022). Systematic identification of relevant features for the statistical modeling of materials properties of crystalline solids. PhD Thesis, Freie Universität, Berlin.


Cite as: https://hdl.handle.net/21.11116/0000-000A-FA81-A
Abstract
Designing materials with desired properties is essential to developing new materials for today's challenges. Historically, new materials have been discovered through trial and error. Nowadays, materials can be simulated and designed on the computer before they are synthesized in the laboratory. However, despite increasingly powerful computational resources and automated experiments, this process is still comparatively demanding.
Given the anticipated diversity of materials, a brute-force search for candidate materials with desired properties is impractical. In recent years, algorithms for building statistical models, especially machine learning, have been used to estimate properties from available materials data. These models relate a set of materials properties -- the so-called features of the data set -- to a property of interest. Because there is no standardized procedure for selecting a set of features related to a property of interest, materials data sets can have hundreds to thousands of features. As a result, models are often complex and place high demands on computational resources.
This thesis proposes a systematic approach to reduce the number of features prior to statistical modeling and a framework for automatically constructing statistical models and estimating their prediction uncertainty. The information-theoretic approach presented first ranks the identified features by quantifying their relevance in terms of their mutual dependence with the property of interest. Whereas traditional methods work well for discrete data, a method for continuous data is developed for application to materials data. A framework for feature identification is designed that can be applied to information-theoretic methods as well as to machine-learning algorithms. The framework is based on the branch-and-bound algorithm and iteratively combines sets of features with the goal of identifying the features related to a property of interest with either the highest mutual dependence or the best prediction performance.
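To make the idea of dependence-based feature identification concrete, the following is a minimal sketch and not the method developed in the thesis. It assumes equal-width binning and a plug-in mutual information estimate (the thesis develops its own measure for continuous data), and it uses the textbook branch-and-bound variant that starts from all features and removes them one at a time, relying on the fact that this criterion cannot increase when a feature is dropped. All names (`discretize`, `mutual_information`, `subset_score`, `branch_and_bound`) and the synthetic data are hypothetical.

```python
import math
import random
from collections import Counter

def discretize(values, bins=4):
    """Equal-width binning; a simplification assumed here for continuous data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def subset_score(columns, subset, target):
    """Joint dependence of a feature subset with the target."""
    joint = list(zip(*(columns[name] for name in sorted(subset))))
    return mutual_information(joint, target)

def branch_and_bound(columns, target, k):
    """Best k-feature subset (k >= 1) under a criterion that cannot grow on removal."""
    names = sorted(columns)
    best = {"score": -1.0, "subset": None}

    def search(subset, start):
        score = subset_score(columns, subset, target)
        if score <= best["score"]:
            return  # bound: no smaller subset of this node can beat the incumbent
        if len(subset) == k:
            best["score"], best["subset"] = score, subset
            return
        # branch: remove one more feature, in index order so no subset repeats
        for i in range(start, len(names)):
            search(subset - {names[i]}, i + 1)

    search(frozenset(names), 0)
    return best["subset"], best["score"]

# Synthetic example: the property depends on features "a" and "b" only.
rng = random.Random(0)
raw = {name: [rng.random() for _ in range(600)] for name in "abcd"}
target = discretize([a + b for a, b in zip(raw["a"], raw["b"])])
columns = {name: discretize(col) for name, col in raw.items()}

ranking = sorted(columns, reverse=True,
                 key=lambda f: mutual_information(columns[f], target))
best_subset, best_score = branch_and_bound(columns, target, k=2)
print(ranking, sorted(best_subset), round(best_score, 3))
```

The plug-in estimate saturates quickly as more binned features are combined, which is one reason such a sketch needs many samples per feature; it is meant only to show the ranking and pruning logic.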
Examples with known as well as empirically identified feature-property relationships are used to compare the information-theoretic method and the developed framework with established methods. The framework is then applied to actual materials data sets. The information-theoretic method is robust in the presence of inter-correlated features and is stable with increasing numbers of data samples, but requires more data to identify the same set of features than machine-learning algorithms for feature identification. Machine-learning models generated from the features it identified therefore resulted in higher prediction errors. The same framework, but using machine-learning algorithms, required fewer features to achieve a prediction performance comparable to that of the models reported in the literature.
The framework identifies different sets of features that lead to an ensemble of statistical models with similar prediction performance. A number of additional tools are developed to further identify feature inter-correlations and to estimate the prediction error within a probabilistic tolerance. These tools can be used to assess the limitations of the generated models in predicting the desired property of new materials, to determine which materials cannot be predicted, and to find the features related to the property of interest in a model-independent framework for feature identification and model construction.
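The abstract does not say how the prediction error is bounded, so the following is only one possible reading, sketched under stated assumptions: a split-conformal-style quantile of held-out absolute errors, a standard technique swapped in here for illustration. The function name `error_tolerance`, the parameter `alpha`, and the error values are hypothetical.

```python
import math

def error_tolerance(abs_errors, alpha=0.1):
    """Error bound covering about a (1 - alpha) share of held-out absolute errors."""
    ranked = sorted(abs_errors)
    n = len(ranked)
    # Finite-sample rank used in split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few held-out samples for the requested tolerance
    return ranked[rank - 1]

# Hypothetical absolute errors of a model on 19 held-out materials.
held_out_errors = [i / 10 for i in range(1, 20)]
tolerance = error_tolerance(held_out_errors, alpha=0.1)

# Materials whose error exceeds the bound are candidates for "cannot be predicted".
outside = [e for e in held_out_errors if e > tolerance]
print(tolerance, outside)
```

A bound of this kind says nothing about why a material falls outside it; that diagnosis would still rest on the feature inter-correlation tools mentioned above.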