Crystal Structure Predictor Probabilities

Hello,

I recently used the crystal structure predictor application, which seems to have completed successfully and provided me with a ranked list of candidate species. I just have a question about the column labeled “Log ( Probability ).”

How is this value calculated? For example, if I have a log probability of 0.003 for a structure, does that mean the probability is 10^0.003=1.007, and if so, does that indicate the probability of that particular structure is only 1%?

Thank you!

This is the relevant publication for crystal structure prediction which goes into more detail: http://pubs.acs.org/doi/abs/10.1021/ic102031h

I’m not personally familiar with that specific probability though, so maybe someone else can comment?

The probability values reported are the output of the SubstitutionPredictor.list_prediction method in our pymatgen.analysis.structure_prediction.substitution_probability module.

You may notice that several predicted structures have the same “probability”. This is because the probabilities are one-to-one with a set of specie (pymatgen terminology for an element + oxidation state) substitutions, e.g.

{
  'probability': 0.007200195085657344,
  'substitutions': {
    Specie Na+: Specie V3+,
    Specie Li+: Specie Li+,
    Specie P5+: Specie P5+,
    Specie O2-: Specie O2-
  }
}

represents the conditional probability of arriving at a target compound with species {V3+, Li+, P5+, O2-} from a source compound with species {Na+, Li+, P5+, O2-} and the same crystal structure prototype, via independent binary substitutions. The joint conditional probability for a predicted structure is thus the product of the conditional probabilities for substituting one ion for another in a particular crystal structure prototype. These substitution probabilities, in turn, are pre-calculated by maximum-likelihood estimation (maximizing the log-likelihood) on ICSD data of experimentally observed ionic structures. There are obvious limitations to this approach, which are discussed in the paper @mkhorton referenced.
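To make the "product of independent binary substitutions" idea concrete, here is a rough sketch in plain Python. It is not the pymatgen implementation: the `PAIR_PROB` table, its numbers, and the `joint_probability` helper are all hypothetical and only illustrate the arithmetic.

```python
from math import prod

# Hypothetical pairwise conditional probabilities p(target | source).
# These numbers are made up for illustration; they are not ICSD-derived.
PAIR_PROB = {
    ("Na+", "V3+"): 0.01,
    ("Li+", "Li+"): 0.90,
    ("P5+", "P5+"): 0.95,
    ("O2-", "O2-"): 0.98,
}


def joint_probability(substitutions, pair_prob):
    """Multiply the pairwise probabilities, treating substitutions as independent."""
    return prod(pair_prob[(source, target)] for source, target in substitutions.items())


# Same source -> target mapping as in the example output above.
substitutions = {"Na+": "V3+", "Li+": "Li+", "P5+": "P5+", "O2-": "O2-"}
joint = joint_probability(substitutions, PAIR_PROB)
print(joint)
```

Because the result depends only on the substitution mapping, any two predicted structures that share the same mapping would get the same value in this sketch, which is why repeated "probabilities" show up in the list.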

Because the probability listed for a structure is not unique to that structure and is only a function of a mapping of binary species substitutions, it is most helpful to treat the probabilities as a rough metric for scoring predictions relative to one another. They also allow a threshold to be applied: below it, we can say with some confidence that the experimental data give scant suggestion that a structure can be observed, and we do not return structures with scores below this threshold. It is still quite important for any predictions to be investigated further, e.g. via DFT, so that they can be analyzed via compositional phase stability diagrams to get a better handle on synthesizability.
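As a small illustration of using the scores only for thresholding and relative ranking, something like the following would do. The prediction entries, the `THRESHOLD` value, and the `rank_predictions` helper are assumptions for the sake of the example, not values or functions from the application.

```python
# Hypothetical predictions; the probabilities are invented for illustration.
predictions = [
    {"probability": 0.0072, "substitutions": {"Na+": "V3+"}},
    {"probability": 0.0500, "substitutions": {"Na+": "K+"}},
    {"probability": 0.0002, "substitutions": {"Na+": "Au3+"}},
]

THRESHOLD = 1e-3  # assumed cutoff; the value used by the application may differ


def rank_predictions(predictions, threshold):
    """Drop predictions below the threshold, then sort from most to least probable."""
    kept = [p for p in predictions if p["probability"] >= threshold]
    return sorted(kept, key=lambda p: p["probability"], reverse=True)


ranked = rank_predictions(predictions, THRESHOLD)
for entry in ranked:
    print(entry["probability"], entry["substitutions"])
```

The ordering is the useful part here; the absolute values should not be read as the chance that a given structure can be synthesized.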

Separately from all of the above, and no doubt confusingly, it turns out that past runs of the structure predictor stored log10(probability) as the output, whereas recent runs store the probability itself. I have pushed a fix, so in your case you should now see that the column header reads “Probability” and not “Log (Probability)”.
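If it helps to tell the two conventions apart, here is a minimal sketch of the conversion, using the probability from the example output above. The helper names `to_log10` and `from_log10` are just illustrative.

```python
import math


def to_log10(probability):
    """Convert a probability in (0, 1] to its base-10 logarithm."""
    return math.log10(probability)


def from_log10(log_probability):
    """Convert a base-10 log probability back to a probability."""
    return 10.0 ** log_probability


p = 0.007200195085657344  # probability from the example output above
log_p = to_log10(p)       # negative, since p < 1
p_back = from_log10(log_p)
print(log_p, p_back)
```

Since a probability is at most 1, its log10 can never be positive. A positive value such as 0.003 in that column is therefore the probability itself (0.3%), not a logarithm to be exponentiated.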

This makes a lot more sense than what I was considering when I made the thread. Thank you both for the excellent explanation of what’s going on in the application!