Our protocol demonstrates how open-source software can allow any researcher to create and curate a computational structure library. This protocol's appeal comes from its openness and flexibility. Anyone can use it and modify it to suit their specific research question.
Versions of this protocol can be applied to drug discovery, quickly creating specific structure libraries for in silico screening. Although the protocol is explained step by step, users who are not familiar with Java or basic coding should review those topics before implementing it. Begin by creating a new directory for the project.
Place all the files and executables in this directory for easy access. Download the latest version of Maygen as a jar file and the package management software Anaconda. On Windows systems, search for Anaconda Prompt and click on the resulting shortcut to run it.
To create an RDKit environment in Anaconda and to download RDKit to that environment, type the command shown on the screen, press Enter to run it, and answer yes to any questions that come up during the installation. Then download the Jupyter Notebooks and the text files of the substructure patterns from Supplemental Files 1 to 5. In the command prompt, navigate to the directory containing the maygen.jar executable file.
For each chemical formula of interest, use the command shown on the screen to run Maygen. If the formula is a fuzzy formula instead of a discrete formula, replace the -f flag with the -fuzzy flag and enclose any element intervals in brackets.
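The Maygen runs described above could be scripted along the following lines. This is only a sketch that builds the command lines and prints them. The `-f` and `-fuzzy` flags come from the protocol; the `-o` (output directory) and `-sdf` (write an sdf file) options, the helper name `build_maygen_command`, and both example formulas are assumptions for illustration and should be checked against the help output of the Maygen version in use.

```python
import shlex


def build_maygen_command(jar_path, formula, fuzzy=False, output_dir="."):
    """Return the argument list for one Maygen run (nothing is executed here)."""
    # -f takes a discrete formula, -fuzzy takes a formula with bracketed intervals.
    formula_flag = "-fuzzy" if fuzzy else "-f"
    return [
        "java", "-jar", jar_path,
        formula_flag, formula,
        "-o", output_dir,  # assumed option: where Maygen writes its output
        "-sdf",            # assumed option: write the structures as an sdf file
    ]


# Hypothetical formulas, used only to show the two flag variants.
commands = [
    build_maygen_command("maygen.jar", "C3H7NO2"),
    build_maygen_command("maygen.jar", "C[3-4]H[7-9]NO2", fuzzy=True),
]

for cmd in commands:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

To actually launch a run, each list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the jar file.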
In an Anaconda prompt, navigate to the folder containing the Jupyter Notebooks and activate the RDKit environment. The downloaded notebooks require RDKit, so any future use of them in this protocol requires opening them in the RDKit environment.
Next, open the Jupyter Notebook for substructure filtering, enclosing the file name in quotes if it contains spaces. In the designated cell at the start of the notebook, enter as strings the full file path of the input sdf file, the full file path of the desired sdf output file, and the file path of the bad list file.
If some substructures need to be retained in the filtered library, create a txt file of SMARTS patterns for those substructures (a good list) and put the good list file path in the designated line at the start of the notebook. From the menu at the top, select Kernel, then Restart & Run All, to restart the notebook kernel and run all cells. An sdf file with the desired name will be created in the specified output folder.
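For readers who want to see the idea behind the filtering step, here is a minimal sketch of bad-list filtering with RDKit. It is not the supplemental notebook itself: the function names and the one-pattern-per-line file layout are assumptions, and the good list is left out because its exact handling in the notebook is not described here.

```python
from pathlib import Path


def parse_smarts_lines(text):
    """Return the non-empty, stripped lines of a pattern file as SMARTS strings."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def filter_sdf(input_sdf, output_sdf, bad_list_txt):
    """Write structures with no bad-list match to output_sdf; return how many were kept."""
    # Imported here so the parsing helper above works without RDKit installed.
    from rdkit import Chem

    patterns = []
    for smarts in parse_smarts_lines(Path(bad_list_txt).read_text()):
        query = Chem.MolFromSmarts(smarts)
        if query is None:
            raise ValueError(f"Could not parse SMARTS pattern: {smarts!r}")
        patterns.append(query)

    kept = 0
    writer = Chem.SDWriter(str(output_sdf))
    try:
        for mol in Chem.SDMolSupplier(str(input_sdf)):
            if mol is None:  # RDKit yields None for records it cannot parse
                continue
            if any(mol.HasSubstructMatch(query) for query in patterns):
                continue  # contains an unwanted substructure, so drop it
            writer.write(mol)
            kept += 1
    finally:
        writer.close()
    return kept
```

A call such as `filter_sdf("input.sdf", "filtered.sdf", "badlist.txt")`, with hypothetical file names, would be run from the activated RDKit environment.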
Repeat these steps for each structure file generated by Maygen. For pseudoatom replacement open an Anaconda prompt, navigate to the folder containing the Jupyter Notebooks and activate the RDKit environment. Then open the Jupyter Notebook for pseudoatom replacement.
In the designated cell at the start of the notebook, enter the full file path of the input sdf file and the full file path of the desired sdf output file as strings. Restart the notebook kernel and run all the cells to get an sdf file with the desired name in the specified output folder. Similarly, open an Anaconda prompt for amino acid N and C termini capping.
Navigate to the folder containing the Jupyter Notebooks and activate the RDKit environment. Open the Jupyter Notebook for amino acid capping. In the designated cell at the start of the notebook enter the full file path of the input sdf file and the full file path of the desired sdf output file as strings.
Restart the notebook kernel and run all the cells to get an sdf file with the desired name in the specified output folder. For descriptor generation, place all sdf files for which descriptors are to be calculated in a single folder. Then download PaDEL-Descriptor, unzip it and extract it to that folder.
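Once PaDEL-Descriptor sits in the same folder as the sdf files, the descriptor run can be prepared as sketched below. The options used (`-dir` for the input folder, `-file` for the output CSV, `-2d` for 2D descriptors) are given from memory of the PaDEL-Descriptor command line and should be treated as assumptions to verify against its own help text. The helper name and the file names are hypothetical.

```python
import shlex


def build_padel_command(jar_path, sdf_dir, output_csv):
    """Return the argument list for one PaDEL-Descriptor run (nothing is executed here)."""
    return [
        "java", "-jar", jar_path,
        "-dir", sdf_dir,      # assumed option: folder holding the sdf files
        "-file", output_csv,  # assumed option: CSV file that receives the descriptors
        "-2d",                # assumed option: calculate 2D descriptors
    ]


# Hypothetical names: the jar and the sdf files share the current folder.
padel_cmd = build_padel_command("PaDEL-Descriptor.jar", ".", "descriptors.csv")
print(shlex.join(padel_cmd))
```

Passing `padel_cmd` to `subprocess.run(padel_cmd, check=True)` would start the calculation for every structure file in the folder.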
Open a command prompt, navigate to the folder containing the PaDEL-Descriptor jar file and run PaDEL-Descriptor for the collected sdf files. The chemical space of all filtered amino acid libraries is shown here. Black markers represent amino acids from the libraries without sulfur and yellow markers represent amino acids from sulfur-enriched libraries.
Here, VAIL and VAIL_S libraries are represented by circles. DEST and DEST_S libraries are represented by squares. Pro and Pro_S libraries are represented by triangles, and stars represent coded amino acids.
The range of possible log P values increases with the molecular volume, even within the libraries that explicitly lack hydrophilic side chains. Coded amino acids with hydrocarbon side chains are more hydrophobic than most other amino acids of a comparable volume from their respective library. This is also the case for methionine and cysteine compared to other members of the VAIL_S library with similar volumes.
Coded amino acids with hydroxyl side chains were among the smallest members of the DEST library, with aspartic acid only slightly larger than threonine. The representative image shows the mean van der Waals volumes of libraries with sulfur and without sulfur. Sulfur substitution led to a slight increase in the molecular volume in all libraries.
The mean partition coefficient values of libraries with and without sulfur are shown here. The effect of sulfur substitution on log P is not as homogenous as for volume. The representative image shows the effects of a trivalent pseudoatom on Maygen structure generation.
Using a pseudoatom in structure generation decreased the number of structures generated by around three orders of magnitude and the total time needed to generate those structures by one to two orders of magnitude. Following this protocol, additional functionalities can be integrated in the future based on the needs of researchers. For example, one could integrate substructure filters into Maygen to avoid the post-processing step.
With some coding knowledge, this general process of library generation, curation and modification can accommodate other molecular structures and modifications, which will allow researchers to explore computational libraries beyond those of alpha amino acids. This protocol will help researchers enhance their computational work in the origins of life field.
Open-source toolkits will greatly assist these efforts.