- ¹Mechanical Engineering, School of Engineering, University of Birmingham, Birmingham, B15 2TT, UK
- ²Institute of Nanoscience and Nanotechnology, National Centre for Scientific Research “Demokritos”, Aghia Paraskevi, 15310 Athens, Greece
Predicting the carcinogenic potential of emerging pollutants is vital to safeguarding public health and the environment. Approaches relying on high-fidelity chemical models incur substantial computational costs, while approaches relying solely on mathematical methods often lack robust predictive performance. To address these limitations, Quantitative Structure–Activity Relationship (QSAR) models, which integrate chemical structure with mathematical methods, have been developed in recent years. In carcinogenicity prediction, QSAR supports the three principles of Replacement, Reduction, and Refinement (3Rs) in the context of green and sustainable chemistry. Nevertheless, the prediction accuracy of existing models remains limited by uncertainties in chemical classification databases, limited feature selection, and the complexity of carcinogenic mechanisms.
This study developed a tailored machine learning QSAR platform to predict the carcinogenic potential of Polycyclic Aromatic Hydrocarbons (PAHs), potentially reducing reliance on in vivo testing. The platform employs the Random Forest machine learning method, an ensemble of decision trees, trained on molecular structure features (i.e., descriptors, including constitutional, topological, and geometrical ones) and carcinogenicity classification data. A total of 66 PAHs were selected based on available evidence of their presence in emissions from transport. PAH carcinogenicity classification data were extracted primarily from the International Agency for Research on Cancer (IARC), the Integrated Risk Information System (IRIS), and the European Chemicals Agency (ECHA) databases. PAHs were subsequently classified into carcinogenic (+1) and non-carcinogenic (−1) categories. Of the 66 PAHs, 56 were used for model training and 10 for evaluation against machine learning validation criteria: accuracy, precision, sensitivity (i.e., recall), and the harmonic mean of precision and recall (i.e., F1 score). The optimal combination of model hyperparameters was the one giving the lowest average prediction error (i.e., out-of-bag error). Molecular descriptors were calculated using PaDEL-descriptor software, yielding 1,875 descriptors for each compound. Constant descriptors and highly correlated descriptors (correlation coefficient > 0.96) were removed, reducing the descriptor set to 291.
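As an illustration of two of the steps described above, the following is a minimal sketch of descriptor filtering and of the validation metrics. It is not the platform's code. Several details are assumptions made for the example: the use of the absolute Pearson coefficient, the greedy "keep the first of each correlated pair" rule, the function names `filter_descriptors` and `classification_metrics`, and the toy values in the usage note.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def filter_descriptors(table, threshold=0.96):
    """Drop constant descriptors, then highly correlated ones.

    `table` maps a descriptor name to its list of values (one per compound).
    Assumption: a descriptor is dropped when its absolute correlation with
    any descriptor already kept exceeds `threshold`.
    """
    kept = {}
    for name, values in table.items():
        if len(set(values)) <= 1:
            continue  # constant descriptor carries no information
        if any(abs(pearson(values, other)) > threshold for other in kept.values()):
            continue  # redundant with a descriptor that was already kept
        kept[name] = values
    return kept


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall (sensitivity) and F1 for +1/-1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For example, with made-up values, `filter_descriptors({"nAtom": [10, 12, 14, 16], "MW": [100, 120, 140, 160], "const": [1, 1, 1, 1]})` would keep only `nAtom`, because `const` is constant and `MW` is perfectly correlated with `nAtom`.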
Feature importance analysis further reduced the molecular descriptors to a final set of 12. This reduction is critical for preventing overfitting, given the limited transport-derived PAH carcinogenicity data available. The platform proved robust to uncertainties in the initial categorisation of compounds and captured the most influential molecular characteristics for predicting PAH carcinogenicity. Its high accuracy is evidenced by F1 scores of 0.95 and 0.83 for the training and evaluation sets, respectively. Consequently, this study demonstrates that integrating QSAR with Random Forest can facilitate cost-effective and accurate prediction of the carcinogenic potential of unclassified PAHs, supporting the transition toward New Approach Methodologies (NAMs) by reducing the need for costly in vitro and in vivo testing.
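The selection of hyperparameters by out-of-bag error and the reduction to 12 descriptors by feature importance can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not the study's implementation: the data are synthetic random numbers standing in for PAH descriptors, the hyperparameter grid is arbitrary, the helper name `fit_best_oob` is hypothetical, and the impurity-based `feature_importances_` attribute is assumed as the importance measure (the abstract does not say which measure was used).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 66 "compounds", 30 "descriptors".
rng = np.random.default_rng(0)
X = rng.normal(size=(66, 30))
# Labels in {+1, -1}; by construction they depend on the first three columns only.
y = np.where(X[:, 0] + X[:, 1] - X[:, 2] > 0, 1, -1)


def fit_best_oob(X, y, grid):
    """Fit one forest per hyperparameter set and return the one with the lowest OOB error."""
    best = None
    for params in grid:
        model = RandomForestClassifier(
            oob_score=True, bootstrap=True, random_state=0, **params
        )
        model.fit(X, y)
        oob_error = 1.0 - model.oob_score_  # oob_score_ is the out-of-bag accuracy
        if best is None or oob_error < best[0]:
            best = (oob_error, params, model)
    return best


# Arbitrary illustrative grid (assumption).
grid = [
    {"n_estimators": n, "max_features": m}
    for n in (100, 300)
    for m in ("sqrt", 0.5)
]

# Step 1: choose hyperparameters on the full descriptor set.
oob_error, params, model = fit_best_oob(X, y, grid)

# Step 2: rank descriptors by importance and keep the 12 highest-ranked.
top_k = 12
top_idx = np.argsort(model.feature_importances_)[::-1][:top_k]

# Step 3: refit on the reduced descriptor set, again selecting by OOB error.
reduced_error, reduced_params, reduced_model = fit_best_oob(X[:, top_idx], y, grid)

print("OOB error, all descriptors:", round(oob_error, 3))
print("OOB error, top 12 descriptors:", round(reduced_error, 3))
```

Because out-of-bag samples are the ones left out of each tree's bootstrap sample, the OOB error gives an internal estimate of prediction error without setting aside additional compounds, which matters when only a few dozen labelled compounds are available.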
Acknowledgement: This research was funded by the European Union’s Horizon Europe research and innovation programme within the AEROSOLS project under grant agreement number 101096912 and the UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant numbers 10092043 and 10100997].
How to cite: Najafpour, N., Herreros, J. M., Tsolakis, A., Sideratou, Z., Katsaros, F., and Zeraati-Rezaei, S.: Development of a Machine Learning QSAR Platform to Predict Carcinogenicity Potential of Transport-derived PAHs, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-18990, https://doi.org/10.5194/egusphere-egu26-18990, 2026.