Scientists have identified human biases in the datasets used to train models for computer-aided synthesis.
copyright by www.chemistryworld.com
They found that models trained on a small randomised sample of reactions outperformed those trained on larger human-selected datasets. The findings underline the importance of including experiments that people might consider unimportant when developing computer programs for chemists.
Machine learning models are a valuable tool in chemical synthesis, but they’re trained on data from the literature where positive results are favoured, whereas the dark reactions – the experiments that were tried but didn’t work – are usually left out. ‘Including these failures is essential for generating predictive
‘We considered extra dark reactions – a class of reactions that humans don’t even attempt, not because of scientific or practical reasons, but simply because it’s humans who make the decisions,’ Schrier says. ‘We found that chemists tend to be stuck in a rut when planning new experiments, and this gets reinforced by social cues. There’s a tendency to follow the
The researchers evaluated over 5000 amine-templated metal oxide structures deposited in the Cambridge Structural Database and found that 17% of the known amine reactants (70 ‘popular’ molecules) occur in 79% of the reported structures, while the remaining 83% (345 ‘unpopular’ molecules) are present in just 21% of the structures. They also analysed unpublished experimental records for hydrothermal vanadium borate reactions from their Dark Reactions Project and found similar biases in the pH and amine quantities used.
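The proportions quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts and percentages given in the article. The "representation ratio" is an illustrative measure introduced here, not one taken from the study, and all variable names are hypothetical.

```python
# Counts of amine reactants quoted in the article.
popular_amines = 70
unpopular_amines = 345
total_amines = popular_amines + unpopular_amines

# Fraction of the known amines that falls into each group.
popular_share = popular_amines / total_amines
unpopular_share = unpopular_amines / total_amines

# Fraction of reported structures containing each group, as quoted.
structures_with_popular = 0.79
structures_with_unpopular = 0.21

# Illustrative assumption (not from the study): a representation ratio,
# defined as the share of structures divided by the share of amines.
# A value above 1 means a group appears more often than its size suggests.
popular_ratio = structures_with_popular / popular_share
unpopular_ratio = structures_with_unpopular / unpopular_share

print(f"Popular amines:   {popular_share:.0%} of amines, ratio {popular_ratio:.2f}")
print(f"Unpopular amines: {unpopular_share:.0%} of amines, ratio {unpopular_ratio:.2f}")
```

Under this rough measure, the 'popular' amines appear in several times more structures than their numbers alone would suggest, while the 'unpopular' ones appear in far fewer.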
‘We removed this bias by intentionally rejecting the standard approach to these exploratory reactions,’ says Alexander Norquist of Haverford College, US, who was also involved in the study. He points out that there was no difference in the reaction performance when the ‘unpopular’ amines were used. ‘We created two