Resolution limit of data-driven coarse-grained models spanning chemical space

Increasing the efficiency of materials design and discovery remains a significant challenge, especially given the prohibitively large size of chemical compound space. The use of a chemically transferable coarse-grained model enables different molecular fragments to map to the same bead type, while also reducing computational expense. These properties further increase screening efficiency, as many compounds are screened through the use of a single coarse-grained simulation, effectively reducing the size of chemical compound space. Here, we propose new criteria for the rational design of coarse-grained models that allows for the optimization of their chemical transferability and evaluate the Martini model within this framework. We further investigate the scope of this chemical transferability by parameterizing three Martini-like models, in which the number of bead types ranges from five to sixteen for the different force fields. We then implement a Bayesian approach to determining which chemical groups are more likely to be present on fragments corresponding to specific bead types for each model. We demonstrate that a level of performance and accuracy comparable to Martini can be obtained by using a force field with fewer bead types. However, the advantage of including more bead types is a reduction of uncertainty with respect to back-mapping these bead types to specific chemistries. Just as reducing the size of the coarse-grained particles leads to a finer mapping of conformational space, increasing the number of bead types yields a finer mapping of chemical compound space. Finally, we note that, due to the relatively large size of the chemical fragments that map to a single martini bead, a clear resolution limit arises when using the water/octanol partition free energy as the only descriptor when coarse-graining chemical compound space.


I. INTRODUCTION
Molecular design is a cornerstone of materials science, requiring a fundamental understanding of the relationships between molecular structure and the resulting properties. Traditionally, these structure-property relationships 1 only arise after multiple rounds of screening and discovery of new materials. [2][3][4][5][6] These cases are examples of direct molecular design, in which the space of all chemical compounds, known as the chemical compound space (CCS), is explored to determine the most suitable chemistry for the target application. Direct molecular design can be interpreted as identifying a hypersurface in the high-dimensional CCS onto a lower dimensional space defined by certain key molecular descriptors that strongly correlate with the desired property. In contrast, inverse molecular design, in which a structure-property relationship is used to infer a suitable chemical structure from a desired property, remains the holy grail of materials science. The main obstacle to achieving this goal is the inability to quickly establish structure-property relationships that can span broad regions of CCS. This is an exceedingly difficult task, given that the size of CCS was estimated to be 10 60 for drug-like molecules less than 500 Da. 7 Experimentally, this process is inhibited due to both the material and time cost associated with synthesizing and testing a large variety of chemistries that are necessary to infer a relation that is both robust and accurate enough to enable inverse molecular design. a) Electronic mail: kanekal@mpip-mainz.mpg.de nine, c) twelve, and d) sixteen bead types. The number of bead types included in these models defines the degree to which CCS is partitioned on the ∆G W→Ol axis. By varying the number of bead types in each model, we obtain greater insight as to the range of chemistries spanned by a single bead type.
Computationally, recent advancements in processing power and in machine learning have enabled several efficient methods for estimating the electronic properties of a large variety of materials. [8][9][10][11][12][13] These methods have the added benefit of screening molecules that cannot be easily synthe-sized, and can thus motivate (or demotivate) the experimental exploration of these chemistries. However, there has been relatively little success in applying computational highthroughput screening methods to determine the stability of chemical compounds in soft matter systems for which thermal fluctuations play a critical role. 14,15 Force field based methods, such as molecular dynamics simulations, are typically used to account for the immense number of configurations that result from thermal fluctuations in these systems. Unfortunately, due to the extensive computational resources required, a high-throughput scheme based on atomistic molecular dynamics simulations is currently unfeasible for spanning the large regions of CCS needed to obtain broadly applicable structure-property relationships.
Coarse-grained molecular dynamics simulations provide a means to significantly reduce the computational expense relative to fully atomistic simulations while still capturing the relevant physical properties. [16][17][18][19] Coarse-grained representations of molecules result from mapping groups of atoms to coarse-grained "pseudo-atoms" or beads. The governing interactions between beads are determined such that the desired properties of the atomistic system are retained. This usually corresponds to a smoothing of the underlying free-energy landscape, allowing for more efficient sampling. Conventionally, coarse-graining is applied to a single molecule with the goal of efficiently sampling a specific system of interest. The coarse-grained potentials are obtained via one of several possible methods (e.g., iterative Boltzmann inversion 20,21 , forcematching 22,23 ). However these methods are computationally expensive, requiring an initial atomistic simulation that sufficiently explores the underlying free energy landscape of the system of interest. 24 Therefore, adapting coarse-grained molecular dynamics simulations to high-throughput screening of chemical compounds requires flexible yet reliable mapping and force field parameterization methods that do not rely on results from higher resolution simulations for each compound screened.
The coarse-grained Martini force field has become widely used to simulate biological systems as it provides a robust set of transferable force field parameters by constructing biomolecules from a small set of bead types. [25][26][27] The Martini model is a top-down model, which maps an atomistic compound or molecular fragment to a coarse-grained site based on its partitioning between aqueous and hydrophobic environments. In the context of molecular design, the main advantage that Martini provides is its chemical transferability. While the force field was explicitly parameterized for a set of specific molecules, a single Martini bead can represent several different chemistries that share similar oil/water partitioning characteristics. In this context, the main feature captured by the Martini model is hydrophobicity, which can act as a key driving force in the physics of soft-matter systems. Rather than running a single atomistic simulation that yields a single data point in CCS, a Martini coarse-grained molecular dynamics simulation provides a representative point in CCS, corresponding to the average behavior of all the chemistries that lay in the region surrounding that point. Thus, high-throughput coarse-grained (HTCG) simulations that use chemically-transferable force fields, such as Martini, are advantageous because they span vast regions of CCS to quickly infer the structure-property relationships and chemical descriptors that can be used to enable inverse molecular design at any resolution. Menichetti et al. recently demonstrated this by running Martini HTCG simulations to construct a structure-property relationship describing the thermodynamics of the insertion of a small organic molecule into a biological membrane across CCS. 28,29 In doing so, they discovered a linear relationship between the bulk partitioning behavior of the solute and its potential of mean force. They were then able to identify a structure-property hyper surface to obtain membrane permeabilities for these solute molecules. Using the Generated DataBase 30,31 (GDB), a systematically computer-generated set of organic drug-like compounds, as a proxy for CCS, we then related the regions of this surface to regions of CCS that were dominated by specific chemical moieties, enabling inverse molecular design of small molecules given a desired permeability. The question remains: how representative of CCS is the Martini force field? Given that Martini was designed to reproduce the partitioning behavior of certain solvents as well as the properties of lipid-bilayer membranes, is there a way to accurately parameterize a transferable coarse-grained force field with the goal of optimizing its coverage of CCS? In the context of highthroughput coarse-grained simulations that use Martini, creating a structure-property relationship that enables inverse design requires an understanding of the chemistry that is representative of a specific bead type. The metric used in assigning specific chemical fragments to Martini bead types is the water/octanol partition free energy (∆G W→Ol ). Therefore, an intuition for which chemistry maps to a given bead type can only be obtained by understanding how ∆G W→Ol varies as a function of chemistry. Given that the number of heavy (nonhydrogen) atoms that usually map to a Martini bead ranges from three to five, we can think of each bead as representing a small carbon scaffold perturbed to some degree by either replacing carbons with other heavy atom types (e.g., oxygen, nitrogen, or fluorine) or by replacing single bonds with double or triple bonds. We define a functional group as being one or a localized combination of these types of perturbations.
In this work, we quantify the information loss that occurs when a top-down coarse-grained model, like Martini, is used to reduce the resolution of CCS. Additionally, we parameterize three sets of coarse-grained force fields in the Martini framework. In this context, we use the terms "force field" and "model" interchangeably, defined as a set of parameters which describe the interactions between a fixed number of coarsegrained representations called bead types. Each force field developed in this work consists of five, nine, and sixteen neutral bead types, as well as two extra types to account for hydrogen bond donors and acceptors. We observe that Martini does not provide the most efficient reduction of CCS. We show that the nine-bead force field reduces CCS to the same degree as Martini despite having three fewer bead types, and that further increasing the number of bead types yields negligible improvements in the performance of the model. The models are validated by performing coarse-grained simulations to calculate the water/octanol partition free energies of approximately 500 compounds for which experimental data is available. Finally, we demonstrate that the main advantage of a force field with a large number of bead types is the reduction of uncertainty when back-mapping these coarse-grained representations to real chemical functional groups. Just as decreasing the resolution of the CG mapping reduces the resolution of the potential energy landscape, a reduction in the number of bead types of a chemically transferable CG force field allows for an increased degeneracy of chemical fragments that map to a single bead type, illustrated in Fig. 1. Ideally, a well-designed chemically transferable CG force field would contain some number of bead types that can be intuitively back-mapped to single chemical functional groups. However, the size of a single functional group is small relative to the size of a Martini bead, such that many functional groups could be identified within a fragment mapping to a single Martini bead. Here, we demonstrate that this mismatch between the size of a Martini bead and a single functional group requires additional constraints in order to identify the unique chemistry that maps to each bead type. Incorporating these constraints into a Bayesian formalism yields probabilities of specific chemistries mapping to a given bead type, further promoting inverse molecular design. However, even these additional constraints allow for the same functional groups to be present in multiple bead types, indicating a natural resolution limit when using ∆G W→Ol as the sole basis for a chemically-transferable, top-down coarse-grained model.

The Auto-Martini Algorithm
This work relies on the AUTO-MARTINI algorithm initially developed by Bereau and Kremer. 32 The algorithm first determines an optimal mapping for an organic small molecule. The mapping provides the number of coarse-grained beads used to represent the molecule as well as their placement. A mapping cost function is minimized for each molecule so as to optimize both the number and placement of beads used in its coarse-grained representation. The assignment of coarsegrained potentials to each bead (bead-typing) occurs by assigning an existing Martini bead type that has the closest matched water/octanol partition free energy (∆G W→Ol ) with that of the molecular fragment encapsulated by the bead. The partition coefficients of these fragments are obtained by using ALOGPS, 33,34 a neural network algorithm that predicts these values given the chemical structure of the fragment. In this work, we use an updated version of the AUTO-MARTINI algorithm that has three significant changes from the previous version. The first change is an increased energetic penalty for "lonely" atoms (i.e., atoms that fall outside the Van der Waals radius of the placed coarse-grained beads). The second change is a reduction of the multiplicative factor used when assigning bead types to rings for both five and sixmembered rings. Finally, the cutoff value for the ∆G W→Ol for the assignment of donor and acceptor fragments to their corresponding bead types was modified such that the CG and atomistic population distributions more closely matched. All of these changes increased the algorithm's accuracy, which is quantified in the supporting information. Using the refined AUTO-MARTINI algorithm, approximately 3.5 million molecules with ten heavy atoms or less that make up the GDB were mapped to coarse-grained representations for four different force fields. The molecules contain carbon, nitrogen, oxygen, fluorine, and hydrogen atoms only. Of these 3.5 million compounds, approximately 340,000 were successfully mapped to both coarse-grained unimers (1 bead representations) and dimers (2 bead representations) for all of the force fields described in this work. The majority of the remaining compounds were mapped to coarse-grained representations with a higher number of beads, and a small fraction of compounds were unable to be successfully mapped by the algorithm. Histograms comparing the distributions of ∆G W→Ol for each set of atomistic compounds mapping to CG unimers and dimers and their CG counterparts were constructed using the NUMPY histogram function 35 , with the number of bins equal to 1000 and 1050 for unimers and dimers, respectively. These histograms are shown in Fig. 2a-d for Martini, while the other histograms can be found in the SI.

The Jensen-Shannon Divergence
In this work, the main tool used to quantify information loss when going from atomistic to coarse-grained resolution is the relative entropy in the form of a Jensen-Shannon divergence (JSD). 36 The relative entropy framework has been previously established as a useful tool for evaluating the quality of coarse-grained models. 37,38 The JSD is a variation of the well-known Kullback-Leibler divergence 39 used to calculate the relative entropy between two distributions. It offers two advantages over the Kullback-Leibler divergence in that it is symmetric and always has a finite value. Rather than directly relating two distributions, as is the case for the Kullback-Leibler divergence, the JSD computes the relative entropy by comparing each of these distributions to a third distribution which is the average of the other two distributions, as shown in the following equations where and P avg = 1 2 (P CG + P AA ).
Here, we use the JSD to evaluate how well the distribution of the water/octanol partition free energies for the coarsegrained molecules (P CG ) match the corresponding distribution at the atomistic resolution (P AA ). A value of 0 indicates that the two distributions are the same. The use of the average distribution (P avg ) conveniently prevents divisions by zero when comparing histograms like those shown in Fig. 2a-d.

Basin Hopping and Minimization Schemes
In this work, we use multiple methods to optimize the coarse-grained partition free energies to best match the atomistic distribution of free energies. The first such method is the basin-hopping method, 40 which is a variation of Metropolis-Hastings Monte Carlo. The algorithm proceeds in the following steps. Given a set of initial coordinates and objective function, the initial coordinates are first randomly perturbed and subsequently minimized. The results of the minimization are either accepted or rejected based on a predefined Metropolis criterion. These two steps form a single iteration of the algorithm, and a large number of iterations may be required to find the desired minima. Here, we use the JSD as our objective function and a set of possible water/octanol partition free energies for each coarse-grained bead type as our initial coordinates. Each move then corresponds to shifting the values of ∆G W→Ol for each coarse-grained bead type in a given force field. The optimizations were performed in order to define the desired ∆G W→Ol values for the five-bead-type force field, using the basinhopping function provided by SCIPY 41 with a Broyden-Fletcher-Goldfarb-Shanno local minimizer, 42 a Metropolis temperature parameter of 0.008, and a step size of 0.024 kcal/mol. For the reference atomistic distribution, we applied the ALOGPS neural network to predict ∆G W→Ol for molecules in the GDB, restricting the maximum number of heavy atoms per molecule to eight. However, finding the optimal set of ∆G W→Ol values for the sixteen-bead-type force field using this approach proved to be computationally unfeasible, as the dimensionality of the problem scales with M N , where N is the total number of bead types in the force field and M is the range of ∆G W→Ol values spanned by the Martini bead types divided by the step size. To parameterize the sixteenbead-type force field, we used the SCIPY minimize function 41 with the modified Powell method, 42 starting with an initial set of eighteen bead types that were evenly distributed along the ∆G W→Ol axis. The results of the minimization indicated two sets of two bead types that were within 0.1 kcal/mol of each other, and so each pair was combined into a single bead type, resulting in sixteen bead types total.

Clustering the GDB
In addition to optimization of the JSD, a new set of coarse grained water/octanol partition free energies was also proposed by clustering the GDB, leading to the 9-bead-type force field. Specifically, all molecules with eight heavy atoms or less that were known to map to single bead representations using the AUTO-MARTINI algorithm were grouped based on the number and type of hetero-atom substitutions present in the molecule (i.e., the number of times that a C was replaced with N, O, or F). The resulting atomistic molecular populations as well as the mean and standard deviation of their water/octanol partition free energies are shown in Fig. 3. Detailed information on each of the distributions (beyond what is provided in Fig. 3) is available in the SI. The desired water/octanol partition free energies are determined by clustering the points on this graph, starting from the highest populated points and accepting anything that was within plus or minus 0.5 kcal/mol of these points. For example, the first point with the highest population in Fig. 3a is chosen as a starting point for the first bead type. All points that fall within 0.5 kcal/mol are assigned to this bead type and the ∆G W→Ol is determined by taking a population-weighted average of all of these points. The next bead type is determined by selecting the highest point on Fig. 3a that is not already assigned to a bead type and repeating the process.

Functional Group Analysis
A statistical analysis of the functional groups found in the molecular fragments mapping to single beads is necessary in order to obtain a more detailed picture as to which chemistries are representative of specific bead types. The enumeration of functional groups was achieved through the use of the CHECKMOL software developed by Haider. 43 This software uses the 3D coordinates of each atom and the corresponding atom labels in a given molecule to identify common chemical functional groups. A full list of the functional groups identified can be found in the SI. Using CHECKMOL, we determine the degeneracy of specific functional group pairs with respect to single bead types for the set of molecular fragments that mapped to a single bead. This amounts to counting the number of fragments containing a specific functional group pair and mapping to a single bead type. This population is then normalized with respect to the total number of fragments containing that same functional group pair across all bead types. It is useful to frame this statistical analysis in terms of conditional probabilities, as this yields specific information relevant for molecular design applications. For example, the aforementioned counting and normalization is equivalent to calculating the likelihood of assigning a bead type (T ) given a specific functional group pair (F), defined as P(T |F). We use the fragment population distributions for each bead type and each functional group pair to obtain probabilities P(T ) of a bead type and P(F) of a functional group pair. We then calculate the posterior probabilities P(F|T ) of a given bead type back-mapping to a specific functional group pair using Bayes' theorem The results are shown as a series of heat maps for each force field in Fig. 4.

Parameterization of New Bead Types
The new force fields share most of the parameters defined by the Martini force field. 44 For the intra-molecular interactions, bonded, angle, and dihedral force constants remain the same as those prescribed by Martini. The non-bonded interactions only contain one deviation from those in Martini. We linearly interpolate across the interaction matrix defined in Martini, 44 utilizing the distance between the established Martini ∆G W→Ol 32 and the desired ∆G W→Ol for the interpolation. The partition free energies of each bead type was then confirmed by running coarse-grained molecular dynamics simulations of single beads of each new bead type. These results are included in the SI, and show that this method yields an accurate force field without relying on an iterative scheme. Linear interpolation is chosen as it is clear that there is no underlying functional form or smooth landscape that can be derived from this parameter space (see SI for details). The new bead types are named as Ti types, with i ranging from 1 to N where N is the total number of bead types in the force field. The numbering is also ordered by polarity. For example, the T1 bead type for all new force fields is the most polar type. Conversely, the T5, T9, and T16 bead types are the most apolar bead types in the five, nine, and sixteen-bead-type force fields, respectively. The full list of bead types for each force field, their force field parameters, and their corresponding ∆G W→Ol values is available in the SI.

Coarse-Grained Simulations
Coarse-grained molecular dynamics simulations were performed in GROMACS 45 version 4.6.6 using the standard Martini force field parameters as well as the new force field parameters derived in this work. A time step of δt = 0, 03 τ was used for all simulations, where τ is the natural time unit for the propagation of the model defined in terms of the units of energy E , mass M and length L as τ = L M /E . The simulations were run in an NPT ensemble with a Langevin thermostat and Andersen barostat 46 to keep the temperature and pressure at 300 K and 1 bar, respectively. The corresponding coupling constants were τ T = τ and τ P = 12τ.
Water/octanol partition free energies were obtained by simulating approximately 500 coarse-grained molecules in octanol and water. Approximately 250 octanol molecules and 350 Martini water molecules were simulated for their respective systems, with the appropriate number of antifreeze particles. 44 The free energies were computed using the Bennett acceptance ratio method 47 in which the coarse-grained solute was incrementally decoupled from the solvent via the coupling parameter, λ . Twenty-one simulations were run for each molecule at evenly spaced λ values ranging from 0 to 1, with each simulation run for 200,000 time steps. Finally, the partition free energies were calculated using the relation ∆G W→Ol = ∆G W − ∆G Ol .

Quantifying information loss of CG models with varying number of bead types
The updated AUTO-MARTINI algorithm was used to first map and subsequently assign bead types to 3.5 million molecules of the GDB containing ten or fewer heavy atoms using the Martini force field as well as the other three force fields parameterized by interpolating the Martini interaction matrix. Fig. 2 shows a comparison of the atomistic and coarsegrained ∆G W→Ol distributions for molecules mapping to Martini unimers (Fig. 2a,b) and dimers (Fig. 2c,d). The corresponding histograms for the other three force fields, as well as a histogram constructed using the Martini force field but with the original AUTO-MARTINI algorithm can be found in the SI. The width of the coarse-grained bars reflects the range of ∆G W→Ol values within which a molecule must fall in order to be assigned that bead type, or, in the dimer case, a combination of bead types. The height of the bars is set such that the area covered by each bar is equal to the total number of molecules that were assigned that coarse-grained representation. We then calculate the JSD between the coarse-grained and atomistic histograms for each force field to quantify the information loss as a function of the number of bead types present in each force field (Fig. 2e). Increasing the number of bead types reduces the information loss when going from atomistic to CG resolution, though this reduction becomes insignificant after reaching nine bead types. The JSD comparing the unimer histograms (red curve in Fig. 2e) changes negligibly when increasing the number of bead types from nine to sixteen, with only a small increase for the Martini case (12 bead types). This is expected due to the fact that the atomistic histogram of GDB molecules mapping to a single bead is a simple, unimodal distribution with a peak at ∆G W→Ol = 0. Since all of the force fields have at least one amphiphilic bead type with a ∆G W→Ol close to 0, they all capture this defining feature of the histogram, and, comparatively, further information gains are negligible. However, the JSDs calculated from the dimer histograms (blue curve in Fig. 2e) show a variety of interesting features. Both the nine and sixteen-beadtype force fields maintain roughly the same JSD, suggesting that the combinatorial explosion that results from doubling the molecular weight is captured by these force fields. The slight increase seen in the unimer JSD for Martini is noticeably amplified for the dimer case, indicating that careful placement of bead types on the ∆G W→Ol axis is necessary to maximize chemical transferability. Surprisingly, the greatest deviation in the JSD going from the unimer to dimer histogram comes from the five-bead-type force field, dropping well below the values for the higher bead type force fields. The reason for this can be seen in Fig. S3b, which shows that the distribution of atomistic compounds mapping to dimers in the five-beadtype force field is significantly different from its analogs for the other force fields. This indicates that a significant number of molecules that would map to dimers when using one of the other force fields are mapped to trimers or tetramers using the 5-bead-type force field.

Relating chemistry to bead types
As an alternative to purely numerical methods for determining the optimal ∆G W→Ol values for the bead types of a CG force field that best partitions CCS, we cluster the GDB itself and use the weighted average of ∆G W→Ol for each cluster. Fig. 3a shows the two descriptors upon which we project and subsequently cluster the GDB. Each point in Fig. 3a represents the set of molecules in the GDB that have a specific number and type of heavy atom substitutions (i.e., N, O, or F). The points are placed on the ∆G W→Ol axis according to the average of their ∆G W→Ol distribution. The error bars represent the standard deviation of the ∆G W→Ol in each distribution. One of the corresponding distributions is shown in Fig. 3b. Interestingly, all of the distributions with populations of over 1000 molecules are unimodal. The points are clustered hierarchically with respect to population and average as shown in Fig. 3a. The highest-populated points are all chosen as cluster centers as long as they are separated by at least 0.5 kcal/mol, which is an arbitrarily chosen lengthscale for the clustering to ensure a reasonable number of bead types in the final force field. After the points are clustered, the desired ∆G W→Ol of each bead type is determined by taking the population-weighted average of all the points in a cluster. This intuitively provides a basic understanding of the chemistry that maps to a specific bead type. For example, a T4 bead is more likely to back-map to a molecule with one N and one O substitution compared to two N substitutions because of the difference in the GDB populations of each molecule type.
It is important to characterize the degree to which unique chemistries are captured by the bead types of each force field. Using the GDB as a proxy for CCS enables a quantitative understanding of the chemical transferability of each bead type through the calculation of conditional probabilities. Fig. 4 shows a series of heat maps corresponding to each of the four force fields investigated in this work. These heat maps are constructed by counting all fragments containing only five heavy atoms and assigned to a specific bead type, such that two functional groups are detected by the CHECKMOL software package. The fragment population distributions are then used to calculate the Bayesian likelihood P(T |F) and posterior P(F|T ) for each bead type/functional pair combination in every force field. The numbers on the horizontal axis for each heat map denote specific pairs of functional groups found in the chemical fragments that are assigned to a bead type, while the color corresponds to either the likelihood or posterior probabilities. We see the localization of functional group pairs to specific bead types mainly because of the constraint of only including fragments with five heavy atoms. This constraint limits the combinatorics of hetero-atom and bond substitutions that result in functional group pairs. Despite the addition of these constraints, a large number of functional group pairs are still split across multiple bead types. The corresponding heat maps constructed using four-heavy-atomfragments only are included in the SI and show far less degeneracy of functional group pairs across bead types compared to these heat maps, although the general trends observed are the same. Table I provides additional quantification of the trends  T9  T8 T7 T6T5 T4 T3 T2  T1

CG force field validation
While we have demonstrated that the careful placement of bead types on the ∆G W→Ol axis leads to more chemical transferability, the force fields themselves must be validated. Because ∆G W→Ol was used as the target property for the interpolation of the Martini interaction matrix, we must ensure that this property is indeed captured by the resulting models and determine to what extent the accuracy of these models changes as the number of bead types increases. Fig. 5 shows correlation plots comparing ∆G W→Ol values computed from coarse-grained MD simulations with experimental values for approximately 500 ring-less molecules obtained from the National Cancer Institute database. 48 The comparison is made for all four of the models examined in this work. The number of compounds varies for each model, as the AUTO-MARTINI algorithm was able to successfully find mappings for more molecules in the database when using a model with a higher number of bead types, ranging from 479 compounds mapped when using the five-bead-type model to 505 when using the sixteen-bead-type model. The full set of compounds as well as their corresponding coarse-grained representations is provided in the SI. The vertical series of points prominently seen in Fig. 5a are a consequence of the increased degeneracy of CCS for the 5-bead-type model: they represent many compounds mapping to the same coarse-grained representation. As expected, the correlation becomes less discretized as the number of bead types increases. Examining Figs. 5e and 5f, we see corresponding gains and losses in the Pearson correlation coefficients and MAEs, respectively. Surprisingly, the gains in accuracy are very slight as a function of number of bead types-with the correlation coefficient only increasing by 0.01 and the MAE decreasing by 0.2 kcal/mol-despite tripling the number of bead types. Even with the five-beadtype model, we achieve an MAE of 0.8 kcal/mol, within the standard for chemical accuracy.

DISCUSSION
Given the immense size of CCS, the creation of reduced models that efficiently subdivide the space is necessary for  screening applications. Here we demonstrate the use of the water/octanol partition free energy as the parameter used to generate top-down chemically-transferable coarse-grained models of varying numbers of bead types. This choice of descriptor is inspired by the Martini force field, which prescribes the use of ∆G W→Ol when determining the bead type to be used to represent a molecular fragment. Here, we use the GDB as a proxy for CCS 28,29 and apply the AUTO-MARTINI algorithm to compare the populations of the GDB molecules and their corresponding CG representations for four different force fields with varying numbers of bead types. This effectively amounts to a discretization of CCS projected onto ∆G W→Ol at multiple resolutions. Fig. 2e quantifies the level of information loss using the JSD as the resolution is varied, allowing us to determine how effectively each of these force fields, including Martini, represents CCS. The JSD decreases as the number of bead types increase. However, the information retention becomes negligibly greater, essentially plateauing after 9 bead types. Remarkably, despite the fact that the Martini force field was parameterized using a small number of chemical compounds (relative to the large distribution of compounds used to parameterize the other mod-els in this work), it shows only a minuscule increase in the JSD. This is mainly due to the inefficient placement of the P3, P4, and P5 beads within close proximity to each other on the ∆G W→Ol axis. Unfortunately, this increase in the JSD is amplified when comparing the ∆G W→Ol distributions for dimer molecules, whereas for the 9 and 16 bead type models, the JSD seems to converge. The combinatorial explosion that results from doubling the size of molecules (i.e., going from unimer to dimer) is reflected in these histograms as a broadening of the total distribution, since more hydrophobic and hydrophilic values of ∆G W→Ol are possible as molecule size increases. Fig. 2e shows that the 9 and 16 bead type force fields match this combinatorial explosion. On the other hand, Figs. 5e and f clearly demonstrate that a high level of accuracy is already achieved with respect to ∆G W→Ol using the five-bead-type force field. What, then, is the benefit to using a model with more than five bead types? As we show from Figs. 3 and 4, the main advantage is in back-mapping the coarse-grained representations to their likely atomistic counterparts. 49 Specifically, the 9 bead force field is parameterized not by simply optimizing the JSD, but rather by clustering the GDB molecules into sub-distributions based on the type and number of heavy-atom substitutions on the carbon scaffold of each molecule as shown in Fig. 3. As expected, this clustering strategy also results in a minimal value of the JSD, while providing an added convenience. The distributions that were clustered to make this force field provide a method for predicting the chemistries that are most representative of a bead type. Since the standard deviations of these distributions are so large, such that some span across three different bead types, this provides only a rough idea of the probable chemistry accessible to a bead type. Moreover, knowledge of the presence of one or two heavy-atom substitutions on a carbon scaffold of up to 8 heavy atoms is insufficient for back-mapping given the number of ways in which they can be arranged on that scaffold resulting in wildly different chemical properties. Fig. 4 shows how different functional group pairs will map clearly to specific bead types when the scaffold size is reduced to five heavy atoms. This extra constraint enables a clearer understanding of the range of unique chemistries that are accessible to a specific bead type. Decreasing the size of the scaffold from five to four heavy atoms yields correspondingly narrower distributions of ∆G W→Ol , meaning that the same functional group pair can be found in fewer bead types. By no longer requiring functional group pairs and increasing the scaffold size to eight heavy atoms, we begin obtaining distributions similar to those seen in Fig. 3.
Table I also demonstrates that the number of unique functional group pairs that map to a given bead type decreases as the number of bead types increases, to the point where, for Martini as well as the 16 bead type force field, there exist bead types that essentially back-map to a single functional group pair. Here, we see a clear parallel with structural coarsegraining methods: just as decreasing the size of the beads leads to a finer mapping of the configurational space, increasing the number of bead types leads to a finer mapping of CCS. The efficiency of a CG model can be optimized by tuning the mapping function and bead size of a CG model such that the accuracy of the model is balanced with respect to the computational cost of simulating a greater number of particles. By fixing the geometric mapping method and bead size, and only varying the number of bead types possible, we instead balance between the accuracy of representing specific chemical features and the cost of parameterization and validation of the inter-particle potentials. We circumvent this cost by interpolating the Martini interaction matrix to obtain accurate parameters for all of the force fields presented in this work. However, this cost will be significant for models requiring a more rigorous parameterization scheme relying on other molecular descriptors. Separate from this trade-off between accuracy and parameterization cost, a "back-mapping efficiency" can be defined as the average number of functional group pairs that map to a single bead type, indicating a larger region of CCS being captured by a single bead type. Unsurprisingly, Table I shows that the five-bead-type force field clearly has the highest back-mapping efficiency.
This statistical analysis of functional group pairs also suggests a Bayesian approach to computing the probability of a functional group pair, F, given a bead type, T , represented as P(F|T ) in equation 2. P(F), the Bayesian prior, is the probability of finding the specific functional group pair in the set of molecular fragments (made up of five heavy atoms and containing two functional groups as defined by CHECKMOL) that mapped to single beads, and P(T ) is the probability of choosing the given bead type from that same data set. The likelihood, P(T |F), shown in the left side of Fig. 4 prescribes the bead type or types to which a fragment could be assigned based on its chemistry-the equivalent of the Martini "bible" for assigning bead types. As shown in Table I, the number of functional group pairs with likelihoods greater than 0.99 (essentially localized to a single bead type) decreases as the number of bead types increases. The Martini force field deviates slightly from this trend, with two more functional group pairs with high likelihoods as compared to the nine-bead-type force field. This may stem from the parameterization strategy used for Martini that relied on specific molecules and their functional groups rather than aiming to efficiently span chemical space by optimizing the JSD, as proposed in this work. The posterior probabilities, which provide a quantitative description of which chemistries are more representative of each bead type, increase as the number of bead types increases. This effect more easily facilitates the back-mapping of coarse-grained representations. These two quantities, the Bayesian likelihood and posterior, are essential for further exploring CCS covered by specific bead types and enabling both direct and inverse molecular design.
Interestingly, we immediately see a resolution limit with respect to the functional group pairs that map to specific bead types. Because there are certain length scales on the ∆G W→Ol axis that correspond to the distribution of specific functional group pairs, increasing the number of bead types will naturally split these distributions, such that one functional group pair is represented in multiple bead types. Fig. 4a shows that the majority of functional group pairs are encompassed either by a single bead type or one of its neighbors on the ∆G W→Ol axis. Increasing the number of bead types causes these splits to become more exacerbated, spanning multiple bead types for an increased number of functional group pairs. This is the resolution limit of this type of top-down coarse-graining. The large bead sizes of these models leads to a high degree of variability in the chemistry, meaning that it is no longer obvious which functional group/functional group pair belongs to which bead type. The limit is most evident for the functional group pairs mapping to the T3 and T13 bead types in Fig. 4d, indicating that they are placed too close to their neighbors on the ∆G W→Ol axis. These functional group pairs contain some combination of the following functional groups: alkene, alkyne, enamine, hydrazine, hydroxylamine, carboxylic acid derivatives, and fluorine substitution. The placement of these functional groups within a five-carbon scaffold will drastically shift the ∆G W→Ol beyond the range of the next nearest bead type on the ∆G W→Ol axis and highlights the limitations of only using this single descriptor for the projection of CCS. However, determining a suitable orthogonal descriptor and then parameterizing a chemically transferable CG force field to achieve a more direct relation with CCS is outside the scope of this work, and will be addressed subsequently.

CONCLUSION
In this work, we use the JSD to quantify the information loss in chemically transferable top-down coarse-grained models with varying numbers of bead types, with the GDB as our proxy for chemical compound space (CCS). We find that Martini, while not designed to efficiently reduce CCS, only performs slightly worse than the other force-fields explicitly designed to minimize the JSD. All force fields yield roughly the same level of accuracy with respect to ∆G W→Ol , but vary greatly in their coverage of CCS. We used a Bayesian approach to calculate the probabilities of back-mapping given bead-types to fragments containing specific chemical substitutions. Here, we found it necessary to constrain the size of chemical fragments to five heavy atoms and the presence of two functional groups in order to clearly differentiate between the chemical moieties mapping to each bead type. The results of this Bayesian analysis indicate that increasing the number of bead types decreases the range of accessible chemistry while increasing the corresponding posterior probabilities for each chemistry. However, there is a resolution limit when using this approach, as it does not take into account the specific positions of hetero-atom and bond substitutions within a fragment, causing different bead types to appear representative of the same chemistry. Martini, as well as other chemically-transferable coarse-grained models, can be used to quickly build structure-property relationships that span broad regions of CCS. Here we highlight the powerful combination of this method with Bayesian inference, providing an informed mapping of a coarse structure-property relationship to a higher resolution in chemical compound space and further enabling inverse molecular design.

SUPPORTING INFORMATION
The attached supporting information provides details on (i) the changes made to the AUTO-MARTINI code; (ii) the histograms used to calculate JSDs for all force fields described in this work; (iii) statistics for each of the distributions clustered when obtaining the nine-bead-type force field; (iv) the parameterization method for the new force fields; and (v) the specific functional group pairs used in the Bayesian analysis. In addition, we provide databases containing the set of GDB compounds mapping to each unimer and dimer, the force field parameters, and the trajectories simulated for each force field in a repository. 50

I. INTRODUCTION
In this supplementary text, we report additional results referenced in the main paper. In Sec. II we detail the changes made to the previously published version of the auto-martini code. In Sec. III we show the histograms comparing the ∆G W→Ol distributions for each of the force fields studied in this text. In Sec. IV we provide statistics for each of the clusters used in the creation of the nine-bead-type force field. In Sec. V we demonstrate the result of interpolating across the Martini interaction matrix in order to parameterize our new force fields. In Sec. VI, we provide lists of all the functional group pairs included in the calculation of the Bayesian likelihood and posterior distributions. We also include another analysis using this Bayesian approach for fragments that contain five heavy atoms. In addition, we have included text files containing force field parameters for each of the new force fields, the database of GDB compounds mapping to CG unimers and dimers for each force field, and trajectories for each of the simulations referenced in the main text in a zenodo repository which can be accessed via the following link: http://doi.org/10.5281/zenodo.3271766.

II. UPDATES TO AUTO-MARTINI
Several changes were made to the auto-martini code in order to increase its accuracy when applied to a large and varied database such as the GDB. The "lonely atom penalty" 1 , which weights the effect of leaving single heavy atoms outside the van der Waals radii of the Martini beads, was increased slightly from 0.20 to 0.28. Additionally, the "additivity check" was removed for molecules that map to single beads. This additivity check was designed to ensure that the voronoi decomposition of molecules into fragments and the subsequent selection of bead types for each fragment was sensible (the sum of the ∆G W→Ol values for each bead should be within a cutoff value when compared to the ∆G W→Ol of the entire molecule). This was enacted in order to resolve an issue in which molecules that were meant to be mapped to a single bead (e.g. Propanol) were unable to be successfully mapped using the code. The effect of these two changes on the distributions of ∆G W→Ol is shown in that would normally map to a single bead were excluded because they failed the additivity check, which should not be applied for single beads. Note that there is a noticeable dip in the coarse-grained distribution of Fig. S1. This corresponds to the N0 bead type, which is underpopulated when compared to the corresponding region in the atomistic distribution.
We found that this was an artifact due to a cut-off value in the code that caused molecules to be mapped to a donor-acceptor type of bead even if their ∆G W→Ol was closer to the N0 value. By reducing this cut-off value, we were able to obtain the distribution shown in Fig. 2a of the main text, and is also shown in Fig. S3c. The final change has to do with the assignment of ring molecules. The standard approach for ring molecules was to use the entire set of atoms in the ring for each fragment and weight each bead's contribution by a scaling factor. For all ring molecules, this was previously set to 2/3 so as to reproduce the Martini parameterizations for benzene and cyclohexane. 1,2 However, in order to optimize the mappings for the multitudes of ring-containing molecules in the GDB, we found that a factor of 1/2 for 5-membered rings and 1/3 for six-membered rings yielded much better agreement with respect to the alogps predictions for the ring molecules. The results are shown in

III. HISTOGRAMS
In Fig. S3, we show all of the histograms used to compute the JSDs shown in Fig. 2e of the main text. Note the significant differences in the ∆G W→Ol distributions for the molecules mapping to dimers in the five-bead-type model (Fig. S3b). The five-bead-type model is sampling a set of molecules from the GDB that is clearly different from those of the other models. While the other distributions contain populations ranging from 3.3 · 10 5 to 3.4 · 10 5 , the 5-bead-type force field has only 3.0·10 5 molecules. Furthermore, even though the shapes of the distributions for the other three force fields are far more similar to each other than to the five-bead-type force field, the intersection of the sets of atomistic compounds mapping to each force field consists of 2.3 · 10 5 molecules. Including the set of molecules mapping to the 5-bead-type force field reduces this intersection to 1.8 · 10 5 molecules. This explains why the JSD value for the five-bead-type model is significantly lower than all of the others.

S4
(a) S3. Histograms used to calculate JSD values shown in Fig. 2 of the main text. The unimer and dimer distributions are shown for the five-bead-type (a,b), nine-bead-type (c,d), Martini (e,f), and sixteen-bead-type (g,h) force fields.

IV. ∆G W→Ol DISTRIBUTIONS FROM THE GDB
In the zenodo repository linked above, we provide plots of distributions of ∆G W→Ol for molecules in the GDB containing up to eight heavy atoms. The distributions are constructed based on the number and type of heavy atom substitutions that exist in the molecules. For example, the file named "GDB02to08 HAstats fooo subs.pdf" shows the ∆G W→Ol distribution for all molecules containing one fluorine and three oxygen substitutions. Also included in the repository is a single file called "GDB02to08 HAstats.dat" which contains the mean and standard deviation for each of the distributions provided, which were used to make Fig.   3 of the main text.

S10
VI. FUNCTIONAL GROUP ANALYSIS Fig. S6 shows the likelihood and posterior values calculated for fragments containing only four heavy atoms and two functional groups as specified by checkmol. The total number of bead types of each force field is not reflected in these heat maps, with the most apolar bead types missing. This is because all of the fragments that map to these bead types consist of saturated hydrocarbons or single alkene/alkyne substitutions only, and thus are not detected as having a functional group pair by checkmol. Furthermore, there are no values calculated for the T7 beads in the sixteen-bead-type force field because there were no donor/acceptor/donor+acceptor fragments that also had two functional groups within the narrow range of ∆G W→Ol covered by the T7 bead types. Similar reasoning can also be applied to explain the lack of values for the T11 bead type in the same force field.
Over the course of this work, certain idiosyncrasies were discovered when using checkmol. One such issue was the fact that the code tended to double-count some functional groups. For example, fragments with only a single fluorine substitution were counted as both a "halogen derivative" and as a "alkyl fluoride". This was only observed for the aforementioned fluorine substitutions as well as for dialkyl ethers. Other examples were also found for which the software could not correctly identify the functional groups contained in the fragment. This is probably due to the fact that checkmol was not tested on some of the less common chemistries encountered in the GDB. The most egregious example of this was found for fragments containing the smiles string "NC=N" which were incorrectly labeled as a carboxylic acid derivatives by checkmol. For this reason, we did not explicitly label the horizontal axes with their corresponding chemistries in Fig. S6 and Fig. 4 of the main text.
For full transparency, we have included the smiles string for each unique fragment used in the Bayesian analysis for both four-heavy-atom and five-heavy-atom fragments as well as the corresponding values for P (F ), P (T ), P (T |F ), and P (F |T ) in the zenodo repository.
While the functional group labels given by checkmol are incorrect in a few cases, the over-