This page features a series of carefully built codes that accelerate a protocol for creating a protein-ligand MD-ready system using freely available tools and web servers, including ACD/ChemSketch, PLANTS, MATCH, CHARMM-GUI, Chimera. It includes tools for proper ligand parameterization and flexible receptor docking. A paper describing the protocol will be submitted for review and will be linked if and when it gets public.
IMPORTANT There are five script file used to optimize the protocol: two python scripts (cod_flexi.py and cod_match.py), two Chimera scripts (chimera_combine_rec_lig.txt and chimera_combine_monomers.txt) and one file containing Linux bash commands (add_chain_lig.txt). The text file scripts should be run manually, by copy-pasting codes. The protocol is tested on Ubuntu 18.04 and 20.04 machines. Python codes should be run using Python3.
The python script file cod_flexi.py looks like this:
import re
import os
receptor='../trpm8_swiss_renumbered_monomer.pdb' #set name of original receptor pdb
nume='smiles:[O-][N+](=O)c1...' #set name of PLANTS output
out_rec='rec_final'
out_lig='lig_final'
n = 1
#special thanks to Darren Haynes
#https://stackoverflow.com/questions/20768783/python-why-is-re-sub-not-replacing-dict-key-with-dict-value-when-there-is-i
def replacer_factory1(dictionary):
def replacing(match):
if len(dictionary) > 0:
word = match.group()
exchange = dictionary.get(word, word)
#spintax = '{' + exchange + '}'
#create_place_holder = spintax.replace(' ', '#!#')
#return create_place_holder
return exchange
else:
return ""
return replacing
def replacing1(text):
regex_patt_list = r'\b(?:' + '|'.join(patt_list) + r')\b'
replacer = replacer_factory1(dictionary)
return re.sub(regex_patt_list, replacer, text)
while n<10:
with open(nume + '_entry_00001_conf_0' + str(n) + '_protein.mol2') as flexi:
dictionar = {}
lista = []
i = 0
for line in flexi:
attypef = line[8:11]
#print(attypef)
residuef = line[59:62]
#print(residuef)
resnumf = line[62:66]
#print(resnumf)
coordf = line[18:25] + line[27:35] + line[37:45]
#print(resnumf)
with open(receptor) as complete:
for line in complete:
if line.startswith("ATOM"):
attypec = line[13:16]
#print(attypec)
residuec = line[17:20]
#print(residuec)
resnumc = line[22:26]
#print(resnumc)
coordc = line[31:54]
if ((attypef == attypec) and (residuef == residuec) and (resnumf == resnumc)):
i += 1
dictionar[coordc]=coordf
lista.append(coordc)
#print(resnumc)
#print(resnumf)
f = open(receptor,'r')
input = f.read()
f.close()
a = open(out_rec + '_' + str(n) + '.pdb' ,'w+')
a.write(input)
a.close()
patt_list = lista
dictionary = dictionar
with open(out_rec + '_' + str(n) + '.pdb', 'r+') as sent:
read_sent = sent.read()
sent.seek(0)
sent.write(replacing1(read_sent))
os.rename(nume + '_entry_00001_conf_0' + str(n) + '.mol2',out_lig + '_' + str(n) + '.mol2')
n += 1
The python script file cod_match.py looks like this:
import re
import os
initial='1006542' # MATCH job number
final='icilin' #set output name
#the following two ifs remove any output files that may exist
if os.path.exists(final + '.top'):
os.remove(final + '.top')
if os.path.exists(final + '.par'):
os.remove(final + '.par')
#two definitions for replacing strings based on a list and dictionary
#special thanks to Darren Haynes
#https://stackoverflow.com/questions/20768783/python-why-is-re-sub-not-replacing-dict-key-with-dict-value-when-there-is-i
def replacer_factory1(dictionary):
def replacing(match):
if len(dictionary) > 0:
word = match.group()
exchange = dictionary.get(word, word)
#spintax = '{' + exchange + '}'
#create_place_holder = spintax.replace(' ', '#!#')
#return create_place_holder
return exchange
else:
return ""
return replacing
def replacing1(text):
regex_patt_list = r'\b(?:' + '|'.join(patt_list) + r')\b'
replacer = replacer_factory1(dictionary)
return re.sub(regex_patt_list, replacer, text)
##now we modify the *.rtf file and change atom types using kind of a "dictionary"
##we will save the file as a *.top
f = open(initial + '.rtf','r+')
input = f.read()
###create dictionary from atom types###
dictionar = {}
lista = []
i=1
new_attype_ch3 = 'a'
new_attype_ch4 = 'a'
for line in input.splitlines():
if line.startswith("MASS"):
old_attype = line[11:15]
#print(attype)
lista.append(old_attype)
dictionar[old_attype]=('hh' + new_attype_ch3 + new_attype_ch4)
i += 1
if i % 2 == 0:
new_attype_ch3 = chr(ord(new_attype_ch3) + 1)
else:
new_attype_ch4 = chr(ord(new_attype_ch4) + 1)
#add -1 after each of the MASS specs
result = re.sub(r'MASS(\s+)[0-9][0-9][0-9]',r'MASS -1',input)
#print (result)
f.close()
output = open(final + '.top','w+')
output.write(result)
output.close()
patt_list = lista
dictionary = dictionar
with open(final + '.top', 'r+') as sent:
read_sent = sent.read()
sent.seek(0)
sent.write(replacing1(read_sent))
##now we do the same for the *prm file
##and save it as *.par
prm = open(initial + '.prm','r+')
result = prm.read()
##creare fisier temporar1
output = open(final + '_temporar.par','w+')
output.write(result)
output.close()
##some functions to retrieve lines that contain a specific string
##cheers to Jack Romo https://www.quora.com/How-do-I-write-a-python-script-that-will-find-a-specific-string-in-a-txt-file-select-the-rest-of-string-surrounding-it-and-paste-in-another-txt
def get_lines_with(input_str, substr):
lines = []
for line in input_str.strip().split('\n'):
if substr in line:
lines.append(line)
return lines
def txt_lines_with(fname, substr):
f_contents = open(fname, 'r').read()
return get_lines_with(f_contents, substr)
def filter_txt_lines_to(fname_in, substr, fname_out):
filtered_lines = txt_lines_with(fname_in, substr)
joined_lines = '\n'.join(filtered_lines)
open(fname_out, 'a+').write(joined_lines)
##generate a temp file with lines starting with MASS
filter_txt_lines_to(final + '.top', 'MASS', 'temporar.txt')
with open("temporar.txt",'r') as contents:
save = contents.read()
with open("temporar.txt",'w') as contents:
contents.write('* prm file built by MATCH' + '\n' + '*' + '\n' + '\n' + 'ATOMS' +'\n')
with open("temporar.txt",'a') as contents:
contents.write(save)
##remove some lines from the par file
with open(final + '_temporar.par', 'r') as fin:
data = fin.read().splitlines(True)
with open(final + '_temporar.par', 'w') as fout:
fout.writelines(data[3:])
##combine files to obtain the final output
partA = open('temporar.txt','r+')
inputA = partA.read()
partA.close()
partB = open(final + '_temporar.par','r+')
inputB = partB.read()
partB.close()
parfinal = open(final + '.par','a+')
parfinal.write(inputA + '\n' + '\n' + inputB)
parfinal.close()
##the dictionary and list are the same
with open(final + '.par', 'r+') as sent:
read_sent = sent.read()
sent.seek(0)
sent.write(replacing1(read_sent))
##remove temp files
os.remove("temporar.txt")
os.remove(final + "_temporar.par")
The chimera_combine_rec_lig.txt file should contain:
open rec_final_1.pdb
open rec_final_2.pdb
open rec_final_3.pdb
open rec_final_4.pdb
open rec_final_5.pdb
open rec_final_6.pdb
open rec_final_7.pdb
open rec_final_8.pdb
open rec_final_9.pdb
open lig_final_1.mol2
open lig_final_2.mol2
open lig_final_3.mol2
open lig_final_4.mol2
open lig_final_5.mol2
open lig_final_6.mol2
open lig_final_7.mol2
open lig_final_8.mol2
open lig_final_9.mol2
combine #0,9
combine #1,10
combine #2,11
combine #3,12
combine #4,13
combine #5,14
combine #6,15
combine #7,16
combine #8,17
write #18 rec_conf1.pdb
write #19 rec_conf2.pdb
write #20 rec_conf3.pdb
write #21 rec_conf4.pdb
write #22 rec_conf5.pdb
write #23 rec_conf6.pdb
write #24 rec_conf7.pdb
write #25 rec_conf8.pdb
write #26 rec_conf9.pdb
While the chimera_combine_monomers.txt has the following commands:
open ../trpm8_swiss_renumbered.pdb
open rec_conf1.pdb
open rec_conf2.pdb
open rec_conf3.pdb
open rec_conf4.pdb
matchmaker #0:.A #1
matchmaker #0:.B #2
matchmaker #0:.C #3
matchmaker #0:.D #4
combine #1,2,3,4
write #5 complex.pdb
Finally, the Linux bash commands are included in add_chain_lig.txt as follows
sed -i 's/UNK /UNK R/g' rec_conf1.pdb
sed -i 's/UNK /UNK S/g' rec_conf2.pdb
sed -i 's/UNK /UNK T/g' rec_conf3.pdb
sed -i 's/UNK /UNK U/g' rec_conf4.pdb
sed -i 's/UNK /UNK V/g' rec_conf5.pdb
sed -i 's/UNK /UNK W/g' rec_conf6.pdb
sed -i 's/UNK /UNK X/g' rec_conf7.pdb
sed -i 's/UNK /UNK Y/g' rec_conf8.pdb
sed -i 's/UNK /UNK Z/g' rec_conf9.pdb
The protocol consists of several steps, and should allow anyone to build a classical, all-atom MD system that includes a protein-ligand complex. It does not include details about the requirements for a proper protein (pdb) model. A faithful description of the protein model and coordinates (titratable residue charges, glycosylations or other modifications etc.) must be at hand, as input. However, modeling of the ligand is described.
Step 1 – Ligand structure sketching
Model the ligand from scratch in the free version of ACD/ChemSketch, kindly provided by ACDLabs (registration is required for download). Although freeware, it is one of the most powerful small molecule sketcher. The program runs natively in Windows, but I used it embedded under PlayOnLinux. It is imperative to obtain a proper SMILES string, that will be used as an input in the next step. Other free tools that may work, yet untested, are BKChem (offline, outdated but powerful) or CACTUS Chemical Identifier Resolver (online, seems down). Nonetheless, ACD/ChemSketch allows a flawless retrieving of the SMILES string for our molecules. ACD/ChemSketch may also be used to generate enantiomers of our small molecules, if needed. The SMILES string can be obtained by first selecting the desired molecule atoms and using Tools -> Generate -> SMILES Notation. For this step, no script is needed.
Step 2 – Ligand structural modeling
The SMILES string will now be submitted to the Add SMILES string interface in the Build Structure module of the versatile molecular modeling program, Chimera (UCSF RBVI). The resulting molecular coordinates will be saved in mol2 format, and will be named ligand.mol2.
Step 3.1 – Ligand parameterization
Although the CHARMM Generalized Force Field (CGenFF) is a common choice, and readily available in CHARMM-GUI, it is only suitable for geometric assemblies which highly reflect the training set used for its creation. Instead of CGenFF, which may penalize the parameters for many molecules, we will submit the ligand.mol2 file to the Multipurpose Atom-Typer for CHARMM (MATCH) server, developed by the BrooksLab (University of Michigan).
The resulting output from MATCH will include two important files, that will be processed: a topology file, x.rtf, and a parameter file, x.prm, where x is the job number from MATCH.
Atom type renaming and other formatting are required for putting the files in a CHARMM-GUI-accepted format. To accomplish this, we use cod_match.py. The input match job number and output name (e.g. ligand) must be manually included in the first input lines of cod_match.py. This script contains all the necessary lines of code to output two files, in our case: ligand.top and ligand.par, corresponding to the topology and parameter files, respectively.
Step 3.2 – Fully flexible ligand docking to its target
We are going to be using the Protein Ligand ANT System (PLANTS, University of Tübingen). An imperative limitation is that all residue numbers in the original receptor file (receptor.pdb) must have EXACTLY four digits. If the first residue is, say, Ile106, we employ Chimera (Tools -> Structure Editing -> Renumber Residues) to set the starting residue at Ile1106, and we do this for all chains (we basically add 1000 to each and every residue number, for ease). We save the resulting coordinates as receptor_renumbered.pdb. If we deal with a multimer and we need to extract a monomer for docking (for instance chain D) Chimera is also used and we save the file as receptor_renumbered_monomer.pdb. This is due to the fact that, if two identical flexible residues atom name and number (e.g. Thr105) are placed on different chains (e.g. chain A and B have at position 105 a threonine), PLANTS cannot differentiate between them. However, this should only impact docking calculation times, not accuracy. We will go on with this example, based on receptor_renumbered_monomer.pdb.
Next, we use Structure PrOtonation and REcognition System (SPORES, included in the PLANTS package) with the –mode complete tag on the ligand.mol2 file and receptor_renumbered_monomer.pdb file. The resulting outputs will be named ligand_protonated.mol2 and receptor_renumbered_monomer_protonated.mol2.
A plants.conf file has to be built (center of the binding site, its radius, cluster structures 9, write_multi_mol2 0, and flexible_protein_side_chain_string VAL1275, THR3025 etc – these are the flexible amino acids, keeping in mind that we will subtract “1000” to get the actual residue number, since we renumbered them).
The output docking folder will contain nine pairs of files named something as: smiles:[O-1cc…_entry_00001_conf_0x.mol2 and smiles:[O-1cc…_entry_00001_conf_0x_protein.mol2, where x is the cluster number. The former corresponds to the docked ligand coordinates, while the latter corresponds only to the flexible residues coordinates.
We use the supplied cod_flexi.py to combine these nine pairs of coordinates into nine complete *.pdb files, one for each cluster. We copy the cod_flexi.py file in the PLANTS output folder. The upstream folder should contain the receptor_renumbered_monomer.pdb file, the one that has not been modified with SPORES. We specify the name of this file in the cod_flexy.py. We also need to specify in this file the prefix of the output files generated by PLANTS, namely “smiles:[O-1cc…”. Running this code will generate nine files named lig_final_x.mol2 and nine files named rec_final_x.pdb, where x is the cluster number. This code basically takes the flexible coordinates from the PLANTS output and overwrites them on the original pdb coordinates of the monomer when it finds correspondence between atom numbers.
Then, we will combine the final files using Chimera. This can be done manually only for the desired clusters, but, if Chimera is started from the output directory, we will input the commands found in chimera_combine_rec_lig.txt in Chimera console at once and we will get nine files named as rec_confx.pdb, x being the cluster number. To start Chimera from the working directory, the program was added in the Ubuntu path (chmod +x in the Chimera program location).
In the nine files, the UNK residue (the ligand) is missing the chain ID, so we add it using the commands found in add_chain_lig.txt from Ubuntu bash. This will assign a different letter for each complex.
We shall now have nine protein-ligand files which may be sent to CHARMM-GUI. However, since they are monomers and have to be simulated only in the homo-tetrameric form, we use Chimera again to rebuild the multimers. The file chimera_combine_monomers.txt contains the necessary commands that must be copied in Chimera console and ran from the output directory. We need to set the name of the original, homo-tetrameric pdb file, that should also be in the upstream folder, provided all steps have been implemented correctly, receptor_renumbered.pdb. Also, the script can be modified to include any of the nine conformations in the final file. A visual inspection is needed, but clusters 1-4 are the most energetically favorable and should be simulated. Thus, the final file is named complex.pdb and is ready to be used in a complete simulation system assembly, with the aid of CHARMM-GUI.
Step 4 – Using CHARMM-GUI to prepare the system
The last step concerns sending the final files to CHARMM-GUI for actual system build, either using a membrane environment or just an aqueous one. When prompted (i.e. in step 2 of the CHARMM-GUI Membrane Builder interface), when Reading Hetero Chain Residues we select to Upload CHARMM top & par for hetero chain, and feed the ligand.top and ligand.par files.