Supplemental Material: A Semi-Supervised Approach to Phase Identification From Combinatorial Sample Diffraction Patterns

Jonathan Kenneth Bunn, Jianjun Hu, Jason R. Hattrick-Simpers

Effect of Repeats

To understand how the expected deviation in the predictions from SS-AutoPhase changes with the number of repeats, the classification was performed with 5, 10, 15, 20, and 25 repeats for the Unknown 1 species using 42 clusters, and the accuracy and standard deviation were recorded in each case, as shown in Fig. S1.

Figure S1: The accuracy and standard deviation of SS-AutoPhase without any data processing as a function of the number of repeats used for 42 clusters. These results are for the Unknown 1 phase.

Fig. S1 shows that the standard deviation in the predictions decreases from ±4.1% to ±3.4% as the number of repeats is increased from 5 to 25. Two notable characteristics of Fig. S1 are that the accuracy does not change significantly as the number of repeats increases, and that each further increase in the number of repeats yields a smaller reduction in the standard deviation.
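The per-repeat summary plotted in Fig. S1 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `summarize_repeats` and the example accuracies are hypothetical, and the sample standard deviation is assumed as the measure of spread.

```python
import statistics


def summarize_repeats(accuracies):
    """Return (mean, sample standard deviation) of per-repeat accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two repeats to estimate a spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Hypothetical accuracies (in %) from five repeats; NOT values from the study.
example = [78.0, 82.0, 80.0, 76.0, 84.0]
mean_acc, std_acc = summarize_repeats(example)
print(f"{mean_acc:.1f}% +/- {std_acc:.1f}%")
```

Applying the same function to the first 5, 10, 15, 20, and 25 recorded accuracies would give one point and one error bar per repeat count.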

Accuracy Without Data Truncation

To evaluate the accuracy of SS-AutoPhase without data truncation, the first stage of SS-AutoPhase was performed with 14, 28, 42, 56, 70, 83, 97, and 125 training data sets, representing 5%, 10%, 15%, 20%, 25%, 30%, 35%, and 45% of the available data, respectively (the same sizes as for the data-truncated evaluation). After the clusters were selected, the second stage of SS-AutoPhase was performed and the accuracy of the predicted phase map was recorded. This process was repeated 25 times for each training sample size, and the average error and standard deviation were determined for each phase, as shown in Fig. S1.
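The correspondence between the training set sizes and the quoted percentages can be checked with a short calculation. The total of 278 diffraction patterns used below is an assumption: it is not stated in this section and was picked only because it is consistent with every listed size after rounding (nearby totals may be as well). The function name `labeled_percent` is likewise hypothetical.

```python
def labeled_percent(n_labeled, n_total):
    """Percentage of the data set that was labeled, rounded to a whole percent."""
    if not 0 < n_labeled <= n_total:
        raise ValueError("n_labeled must be between 1 and n_total")
    return round(100 * n_labeled / n_total)


# Training set sizes listed in the text.
train_sizes = [14, 28, 42, 56, 70, 83, 97, 125]

# ASSUMPTION: total number of diffraction patterns; not given in this section.
n_total = 278

percents = [labeled_percent(n, n_total) for n in train_sizes]
print(percents)
```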

Figure S1: The accuracy of SS-AutoPhase without any data processing as a function of the training set size. The lower x-axis shows the absolute number of samples that were labeled for training, while the upper axis shows what percentage of the total data was labeled.

Fig. S1 shows that when 14 data sets were used for training, the accuracy varied from 74.6% for the Unknown 1 phase to 01.5% for BCC Fe. As the training set size increased to 42 data sets, the overall accuracy increased: the minimum accuracy rose to 79.6%, associated with the Unknown 1 phase, while the maximum remained statistically the same at 96.5% for the Unknown 3 phase. As the number of training data sets was increased from 42 to 125, the accuracy of all species remained unchanged within the standard deviation of each phase's accuracy. The reason for the relatively low accuracy of Unknown 1 is stated in the main text: peak aliasing with (111) FCC FePd and (220) Fe3Si. This double aliasing and the lack of a distinct feature would make expert peak identification difficult, and it created similar ambiguity for SS-AutoPhase.

Comparison between the data-truncated accuracy analysis and the accuracy analysis of SS-AutoPhase without any data processing shows that the ranges of the accuracy are within the standard deviation of the maximum and minimum accuracies from both data sets. One difference between the two analyses is that the prediction accuracy of FCC FePd, Pd9Si and Fe3Si, and Unknown 2 decreased by approximately 5% when the data were not truncated. This could be due to extra peak features outside the data truncation range that mislead either the human or the SS-AutoPhase phase maps. Alternatively, because the untruncated diffraction patterns are larger, the AdaBoost classifier trained on the truncated data sets may focus on the more important features, which would improve its average predictive rate when creating the phase map. This analysis also supports the suggestion that the training set size should be at least 15% of the sample data, as the minimum predictive accuracy at that size was around 80%.

Most Probable Phase Diagram Predictions

Figure S2: Pictorial representations of the phase prediction results for A) BCC Fe, B) FCC Fe, C) FCC FePd, D) Hexagonal FeGa, E) Pd9Si and Fe3Si, and F) Unknown 2. In this figure, one symbol type represents data points at which the human expert identified the phase as not present, and a second symbol type represents data points at which the human expert identified the phase as present. The color of each symbol represents the number of times SS-AutoPhase labeled the phase as present at a given composition, with a white-filled symbol indicating that the phase was never labeled as present by SS-AutoPhase.

Fig. S2 A), C), E), and F) show that SS-AutoPhase correctly identifies the existence of the BCC Fe, FCC FePd, Pd9Si and Fe3Si, and Unknown 2 phases, respectively, for the majority of the phase diagram. Fig. S2 B) and D) show that while SS-AutoPhase correctly identifies the existence of FCC Fe and Hexagonal FeGa for the majority of the phase diagram, it also predicts the presence of these phases at compositions with less than 9.00 at.% Ga and between 25.0 at.% and 50.0 at.% Pd. The misidentification of these phases in this regime is due to the low signal-to-noise ratio of the diffraction patterns in this regime. This issue also seemed to mislead the NMF analysis from Ref. [19]. To overcome this issue, a more intelligently selected threshold value for the AdaBoost classifier should be used. This would make the AdaBoost classifier less sensitive to the lower absolute intensity of the data, so that it does not mistake the high noise levels in this composition regime for the presence of a phase.
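The effect of the decision threshold can be illustrated with a short sketch. It assumes that the classifier emits a real-valued confidence score for each diffraction pattern and that a phase is labeled present when the score exceeds a threshold. The function names, the scores, and the false-positive budget are hypothetical and are not taken from the SS-AutoPhase implementation; selecting the threshold from expert-labeled patterns is one possible strategy, not the one the authors propose.

```python
def label_present(scores, threshold=0.0):
    """Label a phase as present wherever the score exceeds the threshold."""
    return [s > threshold for s in scores]


def pick_threshold(scores, truth, max_false_positives=0):
    """Return the lowest candidate threshold whose false-positive count on
    expert-labeled patterns stays within max_false_positives."""
    if not scores or len(scores) != len(truth):
        raise ValueError("scores and truth must be non-empty and equal in length")
    if max_false_positives < 0:
        raise ValueError("max_false_positives must be non-negative")
    # Candidates: "accept everything", then each observed score in ascending order.
    for t in [float("-inf")] + sorted(set(scores)):
        predicted = label_present(scores, t)
        false_pos = sum(p and not y for p, y in zip(predicted, truth))
        if false_pos <= max_false_positives:
            return t
    # Unreachable: at t = max(scores) nothing is labeled present.
    raise RuntimeError("no admissible threshold found")


# Hypothetical scores; the two weak positives stand in for noisy patterns.
scores = [-0.6, 0.1, 0.2, 0.9, 1.1]
truth = [False, False, False, True, True]  # hypothetical expert labels

default_labels = label_present(scores)   # threshold 0.0 flags the noisy patterns
tuned = pick_threshold(scores, truth)    # raised threshold suppresses them
tuned_labels = label_present(scores, tuned)
print(tuned, tuned_labels)
```

With the default threshold the two weakly positive, noise-dominated patterns are labeled as containing the phase; raising the threshold removes them while keeping the two strong detections.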