ROBERT: Bridging the Gap Between Machine Learning and Chemistry
Funding: Juan V. Alegre-Requena and David Dalmau acknowledge Gobierno de Aragón-Fondo Social Europeo (Research Groups E07_23R and E17_23R) and the State Research Agency of Spain (MCIN/AEI/10.13039/501100011033/FEDER, UE) for financial support (IJC2020-044217-I, PID2022-140159NA-I00, and PID2019-106394GB-I00). Juan V. Alegre-Requena and David Dalmau acknowledge the computing resources at the Galicia Supercomputing Center, CESGA, including access to the FinisTerrae supercomputer and the Drago cluster facility of SGAI-CSIC. David Dalmau thanks Gobierno de Aragón-FSE for a PhD fellowship (2021–2025).
ABSTRACT
Beyond addressing technological demands, the integration of machine learning (ML) into human societies has also promoted sustainability through the adoption of digitalized protocols. Despite these advantages and the abundance of available toolkits, a substantial implementation gap is preventing the widespread incorporation of ML protocols into the computational and experimental chemistry communities. In this work, we introduce ROBERT, a software carefully crafted to make ML more accessible to chemists of all programming skill levels, while achieving results comparable to those of field experts. We conducted benchmarking using six recent ML studies in chemistry containing from 18 to 4149 entries. Furthermore, we demonstrated the program's ability to initiate workflows directly from SMILES strings, which simplifies the generation of ML predictors for common chemistry problems. To assess ROBERT's practicality in real-life scenarios, we employed it to discover new luminescent Pd complexes with a modest dataset of 23 points, a frequently encountered scenario in experimental studies.
Graphical Abstract
1 Introduction
The world is witnessing a growing interest in applying machine learning (ML) to everyday tasks, primarily driven by substantial savings in time, effort, and resources. Its integration not only fulfills technological needs but also fosters sustainability through the adoption of digitalized procedures, yielding important benefits for a more environmentally conscious future. In particular, the integration of ML in chemistry has opened up new avenues for exploring chemical space and predicting properties of molecules and reaction outcomes. This has led to significant advances in fields such as drug discovery [1-5], materials science [6-9], chemical synthesis [10-17], and catalyst discovery [18-24], among others.
Automated ML workflows in chemistry have also become increasingly popular, enabling researchers to predict outcomes efficiently and accurately [25-28]. Numerous packages and tools are available, including molSimplify [29], pycaret [30], ChemML [31], Chemprop [32], DeepChem [33], Chainer Chemistry [34], Lazy Predict [35], and PREFER [36]. However, despite the increasing interest in this field, there is a vast community of chemists who possess little experience in cheminformatics and find these ML tools nearly impossible to incorporate into their research.
Another major problem that chemical ML faces is the lack of reproducibility and transparency of the studies in this field [37]. Despite prior attempts to establish quality standards, there is an unexpectedly high rate of publications that lack information regarding the database used, the model parameters, or the source code [38]. The problem extends further because there are no standard protocols for most ML techniques, which leads multiple groups to use different code and workflows to carry out the same tasks.
All these arguments underscore the need for a program in chemical ML that produces expert-quality results and is easy to use for researchers regardless of their programming experience, while meeting strict reproducibility and transparency standards. In this work, we present ROBERT, a free and open-source software package that performs automated ML workflows particularly designed for supervised regression and classification problems in chemistry. ROBERT automates protocols commonly conducted in chemical ML (Figure 1) and generates results directly from a CSV file with a single command, yielding performance comparable to that achieved by experts in the field.

2 Overview of ROBERT
ROBERT is a Python program designed to meet the predictive ML needs of many members of the computational and experimental chemistry communities, especially those who lack experience in this field. The software can be used to create predictive models for supervised regression and classification problems related to chemistry, such as experimental yield and selectivity, activation barriers calculated with density functional theory (DFT), and luminescent properties, among others. Using simple databases as starting points, ROBERT allows users with limited or no ML expertise to generate expert-level results within timeframes that allow its integration into modern research workflows.
The installation of this toolkit is performed using conda and took less than 2 min to set up from scratch for a pool of nine testers with various operating systems and different levels of expertise in Python (Table S1). ROBERT then initiates workflows from CSV databases that contain the target values (y values) along with either X values or SMILES strings of compounds (Figure 2, Input). In the latter case, the SMILES strings undergo an AQME [39] workflow to automatically generate atomic and molecular descriptors, which are then used as X values.
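As an illustration of this input format, the following minimal sketch builds such a CSV with pandas (the column names, SMILES, and target values are placeholders chosen for this example, not a prescribed schema):

# Minimal sketch of a ROBERT-style input CSV (placeholder column names and values).
# Only a target column (y) plus either precomputed descriptors or SMILES strings is required.
import pandas as pd

data = pd.DataFrame({
    "name": ["mol_1", "mol_2", "mol_3"],        # optional compound identifiers
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],   # SMILES strings, used when no descriptors are provided
    "target_y": [0.42, 1.87, 0.95],             # measured or computed target values
})
data.to_csv("FILENAME.csv", index=False)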

The user only needs to execute a single command line to couple multiple protocols, including data curation, scikit-learn-based model screening [40, 41], assessment of predictive ability, and predictor generation (Figure 2, Executing ROBERT). The resulting models can be readily used to predict the y values of new data points. As an alternative to command lines, the easyROB graphical user interface (GUI) is available to simplify job setups. The program also provides feature importance analyses, enabling users to identify the most influential parameters affecting the target values and facilitating the identification of data trends (Figure 2, Outputs). Additionally, the outcomes are supplemented with an outlier detector, allowing chemists to recognize potential inaccuracies associated with data flaws, such as experimental measurement errors or imprecise computational calculations. Different ready-to-use examples are accessible in our online documentation (https://robert.readthedocs.io), and a comprehensive explanation of the protocols performed in each module is included in the ROBERT Modules section of Data S1.
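As a minimal sketch, the single-command workflow can also be launched from a Python script with subprocess, using the same command format reported in the benchmarking section below (the column and file names are placeholders):

# Minimal sketch: launch a full ROBERT workflow from Python, equivalent to typing the
# command in a terminal. "target_y" and "FILENAME.csv" are placeholders for the user's
# own column and file names.
import subprocess

subprocess.run(
    ["python", "-m", "robert", "--y", "target_y", "--csv_name", "FILENAME.csv"],
    check=True,
)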
Furthermore, ROBERT uses multiple correlation and feature importance filters to curate datasets by removing noise and irrelevant descriptors, ensuring that only the most significant information is retained in the final models (see the CURATE and GENERATE modules in Data S1). These data curation protocols result in more efficient and human-interpretable predictors. For example, the database shown in Figure 3C initially contained 101 descriptors, but ROBERT generated an optimal predictor using only two descriptors while achieving better results than the original model. Additionally, the program automates the processing of missing data and the standardization of data (see the CURATE module in Data S1).
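A minimal sketch of this type of correlation-based curation, written with pandas and not representing ROBERT's exact implementation or thresholds, is shown below:

# Minimal sketch of a correlation-based descriptor filter
# (illustrative only; not ROBERT's exact implementation or thresholds).
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop descriptors whose absolute Pearson correlation with an earlier column exceeds the threshold."""
    corr = X.corr().abs()
    to_drop = set()
    cols = list(corr.columns)
    for i, col_i in enumerate(cols):
        for col_j in cols[i + 1:]:
            if col_j not in to_drop and corr.loc[col_i, col_j] > threshold:
                to_drop.add(col_j)
    return X.drop(columns=sorted(to_drop))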

General ML and data science software packages, such as pycaret [30] and Lazy Predict [35], often neglect essential protocols for chemical ML. In contrast, ROBERT incorporates a range of innovative features specifically designed for this field. For instance, the program generates a comprehensive report in PDF format that assesses the reliability of the resulting models' predictive capabilities, extending beyond conventional metrics such as coefficient of determination (R2), mean absolute error (MAE), and root mean squared error (RMSE). This is particularly valuable for researchers with limited experience in the field, as these metrics may not accurately reflect the models' predictive performance. The software also performs end-to-end workflows that originate from databases of SMILES strings, including the generation of molecular and atomic descriptors. This functionality offers automated discovery workflows to both experimental and computational chemists directly from routine laboratory data. Other chemical software tools, such as ChemML [31] and DeepChem [33], also have the capability to process molecular data in SMILES format, predominantly using RDKit, molecular fingerprints, and graph networks for descriptor calculation. However, ROBERT offers a unique approach to SMILES-to-descriptor conversion by generating molecular and atomic descriptors with semi-empirical QM methods, which are then combined with RDKit structural and molecular mechanics features (Figure 2). This approach is generally more robust, especially for small datasets and predictions of molecules with marked structural differences from the compounds in the training set.
This toolkit also seeks to enhance the reproducibility and transparency of chemical ML protocols, addressing two long-standing concerns of the scientific community working in this field when publishing in peer-reviewed journals. Multiple authors have emphasized the critical need to establish and enforce standards in this regard [37] since, currently, a significant proportion of publications are challenging or even impossible to replicate [38]. The generated ROBERT_report.pdf files contain a reproducibility section that guides authors on which files they should upload as supporting information and provides other researchers with the exact programs, versions, and commands necessary to replicate the results (Figure 2, Outputs). Furthermore, the PDF reports contain comprehensive information regarding the ML models employed and other pertinent details of the ROBERT protocols, including the data partition scheme and the type of ML problem.
When following the provided instructions, the results starting from databases of descriptors were correctly reproduced by the nine program testers across multiple operating systems.1 When repeating end-to-end workflows starting from SMILES strings, it may not be possible to reproduce the results exactly due to subtle changes in xTB-calculated properties when using more than one processor. Nevertheless, the resulting ROBERT scores and model accuracies are very similar between different ROBERT runs (Table S5). In such cases, the PDF report recommends that authors upload the descriptor database created (i.e., AQME-ROBERT_FILENAME.csv) to facilitate the reproduction of results by other researchers. This reproducibility issue has been resolved by the multiprocessing approach introduced in version 1.1.0.
3 ROBERT's Performance in Real Examples
We ensured that the program is capable of producing publication-quality results by comparing its outcomes with examples from recent literature in chemical ML, spanning various areas such as catalysis and photophysics. This benchmarking required only a single initial CSV database per example, either downloaded from the manuscripts or obtained directly from the authors. Despite the diversity of conditions included in the benchmarking, the command line used in ROBERT remained identical in all cases (i.e., python -m robert --y VALUE --csv_name FILENAME [--ARG], see section Benchmarking with Six Examples in Data S1). To increase the complexity of the challenge, the examples included models designed by research groups that are highly respected in the field, such as the Liu [42], Doyle [43], Sunoj [44], Kulik [45], Aspuru-Guzik and Balcells [46], and Ess [47] groups. We also introduced other parameters to make the benchmarking more demanding, including a wide and realistic range of dataset sizes spanning more than two orders of magnitude (from 18 to 4149 data points), the presence of different ML algorithms, and the inclusion of both regression and classification supervised problems.
By default, the program calculates R2, MAE and RMSE for regression problems, and accuracy, balanced F-score (F1 score) and Matthews correlation coefficient (MCC) for classification tasks. The error type selected for comparison, RMSE or MAE for examples A–E and error in accuracy for F, was the type reported by the authors. In cases where both RMSE and MAE were reported, we opted to use RMSE for comparison.
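These defaults correspond to standard metrics that can be reproduced with scikit-learn; a minimal sketch with illustrative arrays is shown below:

# Minimal sketch of the default ROBERT metrics using scikit-learn (illustrative arrays).
from sklearn.metrics import (r2_score, mean_absolute_error, mean_squared_error,
                             accuracy_score, f1_score, matthews_corrcoef)

# Regression metrics
y_true, y_pred = [1.0, 2.0, 3.5, 4.1], [1.2, 1.8, 3.3, 4.4]
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5

# Classification metrics
c_true, c_pred = [0, 1, 1, 0, 1], [0, 1, 0, 0, 1]
acc = accuracy_score(c_true, c_pred)
f1 = f1_score(c_true, c_pred)
mcc = matthews_corrcoef(c_true, c_pred)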
The results demonstrated that ROBERT performs similarly to experts in the field in three cases (−1.9%, +0.9%, and +0.3% scaled error for examples C, D, and E, respectively), while achieving lower errors in the other instances (−3.3%, −7.0%, and −4.1% for examples A, B, and F, respectively) (Figure 3 and Table S7). Furthermore, the exhaustive algorithm screening conducted by ROBERT led to models that employed similar training sizes (±10% of the data) in five of the examples (B–F). Another advantage of the program is its permutation feature importance (PFI) filter, which generates simpler and more interpretable models. In five examples, ROBERT models contained fewer descriptors than the original ones, with some noteworthy reductions, such as in examples A (from 3 to 1 descriptor), C (from 101 to 2), and E (from 135 to 32). Despite the various protocols integrated into the workflows, the execution times on eight processors ranged from less than a minute to a few hours (Figure 3). These execution times make it feasible to run typical ROBERT processes on personal computers.
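A PFI filter of this kind can be sketched with scikit-learn's permutation_importance function (the toy data, model choice, and importance threshold below are illustrative and do not correspond to ROBERT's defaults):

# Minimal sketch of a permutation-feature-importance (PFI) filter with scikit-learn
# (random toy data; model, repeats, and threshold are illustrative, not ROBERT's defaults).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=100)  # only two informative features

model = RandomForestRegressor(random_state=0).fit(X, y)
pfi = permutation_importance(model, X, y, n_repeats=10, random_state=0)
kept = [i for i, imp in enumerate(pfi.importances_mean) if imp > 0.01]
print("Descriptors kept after the PFI filter:", kept)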
Next, we pondered whether conventional metrics employed for evaluating ML algorithms, such as R2, MAE, and RMSE, sufficed for creating a versatile, general-purpose tool. While experienced ML users can assess whether an ML workflow has yielded favorable results for valid reasons, inexperienced users may encounter difficulties in gauging the reliability of its predictive proficiency. It is widely recognized that ML algorithms can exhibit good metrics in the validation set while harboring questionable predictive ability. For example, a user might assume that a model with good metrics is proficient at prediction, but that very model could yield comparably low errors when the y values are shuffled [48] or when using random numbers as descriptors [49].
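A minimal sketch of such a sanity check, comparing cross-validated errors obtained with the real and shuffled y values (illustrative data; not ROBERT's implementation), is shown below:

# Minimal sketch of a y-shuffle sanity test (illustrative data; not ROBERT's implementation).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=80)

real_rmse = -cross_val_score(Ridge(), X, y, cv=5,
                             scoring="neg_root_mean_squared_error").mean()
shuffled_rmse = -cross_val_score(Ridge(), X, rng.permutation(y), cv=5,
                                 scoring="neg_root_mean_squared_error").mean()

# A trustworthy model should perform much better on the real y values than on the shuffled ones.
print(f"RMSE (real y): {real_rmse:.2f}  vs.  RMSE (shuffled y): {shuffled_rmse:.2f}")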
For this reason, we have included a section in the PDF files generated by the program that contains ROBERT scores. The ROBERT score is a rating out of 10, designed to offer users insights into the predictive capabilities of the protocols selected by ROBERT. We attempted to rank the models following guidelines in line with modern ML research [38], which encompass factors such as the correlation between predictions and measurements, human interpretability, error distribution, sensitivity to features, and avoidance of overfitting and underfitting (Figure 3). A comprehensive explanation of how each of these factors affects the score is provided in the ROBERT Score section of Data S1. Across the six examples examined, the scores ranged from 7 to 10, indicating varying degrees of predictive ability from moderate to strong.
4 Workflows From SMILES Strings
After assessing ROBERT's capability to generate results of a quality comparable to experts in chemical ML, we aimed to advance the program's implementation in real-life scenarios. One significant challenge that chemists often face when integrating ML into their work routines is their limited experience in creating useful databases. This is important because selecting a few relevant descriptors is far more valuable than including many irrelevant ones in the database. ROBERT introduces an innovative workflow enabling users to generate meaningful descriptors from SMILES strings [50]. This 1D representation has gained popularity in the field of chemical ML and is readily available in most chemical databases or easily generated using chemical drawing software such as ChemDraw [51]. Using AQME workflows [39], the program can produce over 200 Boltzmann-weighted descriptors with methods significantly faster than DFT. These workflows include RDKit [52] conformer sampling and xTB [53] geometry optimizations, followed by the generation of descriptors with RDKit, xTB, and DBSTEP [54]. The resulting descriptors encompass electronic and steric properties and can be calculated for the entire molecule and for specific relevant atoms. After the PFI filter, only the most influential descriptors remain.
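A minimal sketch of the RDKit portion of such a SMILES-to-descriptor step is shown below (the full AQME workflow additionally performs conformer sampling, xTB geometry optimizations, Boltzmann weighting, and DBSTEP steric analyses; the descriptors chosen here are only examples):

# Minimal sketch of the RDKit portion of a SMILES-to-descriptor step
# (the descriptors below are examples; the full AQME workflow generates many more).
from rdkit import Chem
from rdkit.Chem import Descriptors

def rdkit_descriptors(smiles: str) -> dict:
    """Return a few molecular descriptors computed directly from a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return {
        "MolWt": Descriptors.MolWt(mol),
        "MolLogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
        "NumRotatableBonds": Descriptors.NumRotatableBonds(mol),
    }

print(rdkit_descriptors("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin as an example input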
We tested the SMILES-to-predictions workflow in two examples from fields that are particularly interesting for ML predictions: property calculation and catalysis. First, we employed ROBERT to create a predictor for the solubility of organic molecules in water. To do this, we accessed the 1124 entries of the ESOL database, which is available in the supporting information of the corresponding publication [55], and executed the command line shown in Figure 4A to generate molecular descriptors. Using eight processors, the predictor was ready in approximately 3 h. The resulting multivariate linear (MVL) model used an 80% training size and 47 descriptors, and can accurately predict the solubility of new organic molecules (ROBERT score of 10). Furthermore, the generated PDF report contains valuable information that can aid in the rational design of solutes. This includes feature importance analysis to understand the key molecular properties influencing solubility and outlier analysis to identify potential measurement errors, among other parameters.
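The final model type can be sketched with scikit-learn as follows (a minimal illustration assuming a hypothetical descriptor CSV named AQME-ROBERT_solubility.csv with a "solubility" target column and purely numeric descriptors; the actual AQME-generated database contains many more columns):

# Minimal sketch of a multivariate linear (MVL) model with an 80/20 split,
# assuming a hypothetical descriptor CSV with a "solubility" target column
# and numeric descriptor columns.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("AQME-ROBERT_solubility.csv")            # hypothetical file name
X = df.drop(columns=["solubility", "smiles"], errors="ignore")
y = df["solubility"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
mvl = LinearRegression().fit(X_train, y_train)
print("Test MAE:", mean_absolute_error(y_test, mvl.predict(X_test)))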

Subsequently, we employed ROBERT to predict the activation barriers of hydrogenations using different Vaska-type catalysts (example E from Figure 3) [46]. In this case, the database contained 1901 SMILES strings of the catalyst–substrate complexes displayed in Figure 4B. We executed the command line depicted in the figure using 16 processors, and the predictor was generated within 9 h. In this case, the optimal model was a neural network (NN) that used an 80% training size and 19 descriptors. This algorithm exhibited robust predictive performance (ROBERT score of 10) and is ready for inclusion within a catalyst discovery workflow, where users can input SMILES representations of new catalyst candidates. It is noteworthy that in this example, ROBERT incorporated atomic properties of the Ir metal center. This feature enables the addition of relevant local electronic and steric environments alongside the default molecular properties.
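A neural-network model of this kind can be sketched with scikit-learn's MLPRegressor (toy random data with 19 descriptors is used below for illustration; the real workflow trains on the AQME-generated descriptors and ROBERT's own hyperparameter screening):

# Minimal sketch of a neural-network (NN) regressor of the kind selected here
# (toy random data; the real workflow trains on AQME-generated descriptors).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 19))                       # 19 descriptors, as in the final model
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)
nn = make_pipeline(StandardScaler(), MLPRegressor(hidden_layer_sizes=(64, 64),
                                                  max_iter=2000, random_state=0))
nn.fit(X_train, y_train)
print("Test R2:", nn.score(X_test, y_test))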
Overall, these findings indicate that employing non-DFT descriptors can yield highly predictive models while utilizing a reasonable number of descriptors (47 and 19 descriptors for 1124 and 1901 entries, respectively). The use of these cost-effective descriptors enables ROBERT to generate predictors in timeframes suitable for integrating ML protocols into real-life research workflows. Even in the two examples involving over 1000 SMILES entries, the predictors were generated in just 3 and 9 h.
5 Data-Driven Discovery of New Pd Luminescent Probes
We then decided to focus on a specific research line within our group, which involves the discovery of Pd-based emitters [56]. The synthesis and design of luminescent compounds have experienced significant growth and recognition in recent years, with applications spanning diverse areas such as sensors, catalysts, biomarkers and light-emitting devices [57]. However, Pd exhibits poor photophysical properties compared to other transition metals due to readily available non-radiative relaxation pathways. In our previous studies [56, 58-60], we were only able to synthesize a very limited number of oxazolone–Pd probes with good luminescence using our chemical intuition, as there were no clear trends in the results. Given that only five structural modifications resulted in quantum yields (QYs) above 10% (1a–e, Figure 5A), our objective was to employ ROBERT for the identification of new Pd complexes with QY values exceeding this threshold.

First, we compiled a dataset of molecules in ChemDraw containing all the oxazolone–Pd complexes reported in our previous studies, although this database was still limited to 20 entries. Next, we generated SMILES strings of the compounds within minutes and executed ROBERT using one command line. The automated workflow generated molecular descriptors, as well as atomic features for the Pd centers. One of ROBERT's unique procedures is to include an evaluation of the predictive capability of the models at the beginning of the PDF report, quantified as the ROBERT score. With the initial 20 data points, the model exhibited weak performance, primarily attributed to an imbalance resulting from the limited number of data points in the region where the QY exceeded 10% (ROBERT score 5, Figure S8, left).
To address this issue, we decided to supplement the dataset with three additional molecules that were chosen for their rapid synthetic availability, with logical human decisions guiding the selection process (Figure S8, middle). For instance, we anticipated that adding a Me group in the para position of a Py ligand should maintain the favorable QY observed in the complexes exceeding 10% QY. Subsequently, we reran ROBERT with the expanded database of 23 points (Figure S9) and observed a significant improvement in the ROBERT score, which reached 7, representing a decent level of predictive ability for that dataset size (Figure 5B, left, and Figure S8, right).
We then initiated the data-driven discovery of Pd probes, where the candidates featured structural modifications whose effects could not be reliably determined using human chemical intuition. The modifications were driven by the availability of ligands and oxazolone cores in our laboratory (Figure 5B, middle), which is arguably one of the most common practices used to expand substrate scopes in experimental chemistry studies. This combinatorial process generated 15 candidates with entirely unknown QYs (Figure S10). We used the SMILES strings of these 15 new compounds to generate descriptors and predict QY, resulting in four predictions (1f–i) with QY values exceeding 10%. Subsequently, we conducted experimental measurements of the QY for these positively predicted candidates, and also assessed the QY of a negative result (1j) for further experimental validation. Despite the limited number of data points in the initial training database, the predictor showed promising results, with four out of the five measured QYs falling within the expected range (80% accuracy, Figure 5B, right).
This digitalized approach enabled us to discover three new luminescent emitters with QYs exceeding our 10% threshold (1f, 1h, and 1i) from a pool of 15 potential candidates while conducting a minimal number of experiments. Finding three new probes marks significant progress in the field of oxazolone–Pd-based emitters, especially considering that only five such examples were previously known. These positive outcomes highlight the potential of ROBERT to facilitate data-driven approaches, even in situations with limited available data.
6 Conclusion
ROBERT has been designed to make ML more accessible to chemists, irrespective of their programming expertise, while producing results comparable to those achieved by experts in the field with just one command line. Another primary objective of ROBERT is to address the long-standing issues of reproducibility and transparency that have plagued chemical ML. Upon executing the program, users receive a comprehensive PDF report evaluating the reliability of the generated models, accompanied by a step-by-step guide for replicating the outcomes. Research efforts are underway to overcome the program's current limitations, such as incorporating new features into the GUI and making the program compatible with GPUs.
To validate the effectiveness of ROBERT, we conducted benchmarking using six recent studies in chemical ML. These examples spanned a wide spectrum of databases, from modest datasets with just 18 entries to extensive ones including more than 4000 entries. The results demonstrated ROBERT's capacity to produce results on par with domain experts, all with the simplicity of a single command line.
Furthermore, this software can initiate workflows directly from SMILES strings, bypassing complications tied to database construction and simplifying the generation of ML predictors for chemistry problems. We illustrated how databases containing only SMILES representations and target outcomes can produce precise predictions for properties and reactivity.
Lastly, we sought to evaluate ROBERT's practicality in real-life scenarios. To this end, we employed the program for discovering new luminescent Pd complexes. Starting with a rather limited dataset of only 23 data points, a situation frequently encountered in experimental studies, ROBERT demonstrated its utility by proposing several luminescent emitters from a pool of potential candidates.
Author Contributions
David Dalmau: data curation (lead), methodology (lead), writing – original draft (equal), writing – review and editing (equal). Juan V. Alegre-Requena: conceptualization (lead), data curation (supporting), methodology (supporting), software (lead), writing – original draft (equal), writing – review and editing (equal).
Acknowledgments
The acronym ROBERT is dedicated to Prof. ROBERT Paton, who was a mentor to Juan V. Alegre-Requena throughout his years at Colorado State University and who introduced him to the field of cheminformatics. David Dalmau thanks Oliver Lee (University of St Andrews) for assisting with the style of the ROBERT_report.pdf file. David Dalmau and Juan V. Alegre-Requena thank the program testers who meticulously evaluated the installation, usage, and reproducibility of the program (listed in chronological order): Prof. David Valiente (Miguel Hernández University), Íñigo Iribarren (Trinity College Dublin), Dr. Heidi Klem (Colorado State University), Dr. Guilian Luchini (Bristol Myers Squibb), Alex Platt (Colorado State University), Xinchun Ran (Vanderbilt University), and Oliver Lee (University of St Andrews). The authors also thank Ignacio Funes-Ardoiz (University of La Rioja) and Dr. Esteban Urriolabeitia (CSIC) for pre-reviewing the manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
Related WIREs Articles
Machine learning methods in chemoinformatics
Making machine learning a useful tool in the accelerated discovery of transition metal complexes
Machine learning activation energies of chemical reactions
AQME: Automated quantum mechanical environments for researchers and educators
Endnotes
Open Research
Open Research Badges
This article has earned an Open Data badge for making publicly available the digitally shareable data necessary to reproduce the reported results. The data is available at https://zenodo.org/records/10798528.
Data Availability Statement
The ROBERT program is freely available on GitHub (https://github.com/jvalegre/robert), along with comprehensive documentation provided on Read the Docs (https://robert.readthedocs.io) and hands-on tutorials on YouTube (https://www.youtube.com/@thealegregroup4964). The Supporting Information (Data S1), all the ROBERT PDF reports, raw data, and CSV databases used in the examples from the paper can be accessed on Zenodo (ROBERT ESI and raw data, https://zenodo.org/records/10798528, DOI: 10.5281/zenodo.10798528). The PDF reports encompass detailed instructions for replicating the results (Reproducibility section), information about the models (Transparency section), ROBERT scores, R2, MAE and RMSE metrics of all datasets (ROBERT Score section), feature importance, outlier analysis, and other relevant data (Predict section).