Preparing your PDB file

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • What should I check before preparing a simulation?

Objectives
  • Read the literature on your biological system.

  • Set the experimental conditions of your simulation (pH, …).

  • Assess the protonation states of your system (including all the elements in it: protein and ligand).

  • PROPKA calculates the pKa values for each of the residues in your protein.

  • Marvinsketch will assist you in predicting the protonation state for your small molecule.

  • Add hydrogen atoms to the GWS ligand with openbabel.

  • Prepare your protein with pdb4amber.

Understanding your PDB file

We are going to use the X-ray structure of Mpro protein bound to a ligand from the PanDDA library. The PDB id we are going to use is 5R84. If you open this entry from the Protein DataBank you will find a lot of information of the protein structure:

Snapshot of PDB page 5R84

We are interested in the quality of the structure and the contents of the structure:

Once we know which molecules from the PDB file we want to model and which should be deleted from the PDB file, we can continue. You can download the PDB file from the website yourself or use the one we are providing in this repository.


Splitting the PDB in protein, ligand and water molecules

First, we will split the PDB in 3 different groups: protein, GWS ligand and water molecules as we need to treat them separately:

grep -v -e "GWS" -e "DMS" -e "CONECT" -e "HOH" 5r84.pdb >  protein.pdb
grep "GWS" 5r84.pdb > GWS.pdb
grep "HOH" 5r84.pdb > waters.pdb

PDB contents split in three groups


Replicating the experimental/target conditions

It is important to take into account experimental conditions of the system we want to simulate such as the pH, salinity… Since Mpro protein from SARS-CoV2 is located in the cytosol of the infected human cell, we are going to assume a pH=7.4. Do not asume that will always be the case in the future, you should always, please check the literature on your target protein before you start running your simulations.


Assessing the protonation states of the protein and preparing the protein for AmberTools

There are several methods to assess the protonation state of the residues in a protein (for further information, here). For example, the H++ server and PROPKA software can help you in the future. We are only going to use PROPKA command-line in this workshop.

propka31 protein.pdb 

PROPKA returns the calculated pKa values for tritable residues in the protein. These are aminoacids, whose sidechain contains a chemical moiety that can change its protonation state depending on the pH of the environment they are found in. You have acidic residues (typically ASP and GLU, but in extreme environments TYR, SER and CYS can act as acids) and basic residues (LYS, ARG and HIS).

Protein forcefield encode these protonation states as different residue names (i.e. GLU and GLH for the two states of glutamic acid). You can find more information about these residues, and how its pKa depends on the chemical environment here .

When preparing a protein, we have to compare each one of the predicted pKa values with the chosen pH: if the pKa is below the chosen pH the residue should be deprotonated (that is acidic residues will be negatively charged and basic residues will be neutral), if above it will be protonated (acidic residues will be neutral and basic residues will be positively charged).

TIP:

This step is especially relevant when you are simulating catalytic cycles, where the reaction mechanism often involves proton transfer events. Please check the literature on your target protein before you start to run your simulations.

This is the summary of the propKa output for our protein:

--------------------------------------------------------------------------------------------------------
SUMMARY OF THIS PREDICTION
       Group      pKa  model-pKa   ligand atom-type
   ASP  33 A     3.79       3.80
   ASP  34 A     3.63       3.80
   ASP  48 A     2.69       3.80
   ASP  56 A     4.00       3.80
   ASP  92 A     3.11       3.80
   ASP 153 A     3.88       3.80
   ASP 155 A     3.00       3.80
   ASP 176 A     3.90       3.80
   ASP 187 A     3.97       3.80
   ASP 197 A     3.81       3.80
   ASP 216 A     3.54       3.80
   ASP 229 A     2.15       3.80
   ASP 245 A     4.03       3.80
   ASP 248 A     3.37       3.80
   ASP 263 A     3.54       3.80
   ASP 289 A     3.28       3.80
   ASP 295 A     3.91       3.80
   GLU  14 A     4.00       4.50
   GLU  47 A     4.67       4.50
   GLU  55 A     4.73       4.50
   GLU 166 A     3.98       4.50
   GLU 178 A     4.88       4.50
   GLU 240 A     4.60       4.50
   GLU 270 A     4.67       4.50
   GLU 288 A     4.42       4.50
   GLU 290 A     5.79       4.50
   HIS  41 A     4.60       6.50
   HIS  64 A     6.26       6.50
   HIS  80 A     5.71       6.50
   HIS 163 A     2.04       6.50
   HIS 164 A     1.44       6.50
   HIS 172 A     5.51       6.50
   HIS 246 A     5.37       6.50
   CYS  16 A    11.91       9.00
   CYS  22 A    10.26       9.00
   CYS  38 A    12.83       9.00
   CYS  44 A    11.04       9.00
   CYS  85 A    11.68       9.00
   CYS 117 A    11.97       9.00
   CYS 128 A    12.86       9.00
   CYS 145 A    11.82       9.00
   CYS 156 A     9.54       9.00
   CYS 160 A    13.16       9.00
   CYS 265 A    12.18       9.00
   CYS 300 A    10.29       9.00
   TYR  37 A    11.88      10.00
   TYR  54 A    15.15      10.00
   TYR 101 A    12.87      10.00
   TYR 118 A    10.28      10.00
   TYR 126 A    13.10      10.00
   TYR 154 A    10.17      10.00
   TYR 161 A    15.42      10.00
   TYR 182 A    13.69      10.00
   TYR 209 A    13.21      10.00
   TYR 237 A    10.15      10.00
   TYR 239 A    13.24      10.00
   LYS   5 A    10.57      10.50
   LYS  12 A    10.41      10.50
   LYS  61 A    10.72      10.50
   LYS  88 A    10.14      10.50
   LYS  90 A    10.56      10.50
   LYS  97 A    10.37      10.50
   LYS 100 A    11.37      10.50
   LYS 102 A    10.76      10.50
   LYS 137 A    10.33      10.50
   LYS 236 A    10.69      10.50
   LYS 269 A    11.49      10.50
   ARG   4 A    12.41      12.50
   ARG  40 A    14.21      12.50
   ARG  60 A    12.46      12.50
   ARG  76 A    12.59      12.50
   ARG 105 A    13.13      12.50
   ARG 131 A    15.69      12.50
   ARG 188 A    12.33      12.50
   ARG 217 A    12.22      12.50
   ARG 222 A    12.45      12.50
   ARG 279 A    12.06      12.50
   ARG 298 A    12.61      12.50
   N+    1 A     7.91       8.00

We can see that most of the residues are in their usual protonation state, therefore we do not need to modify residue names in the PDB. However we should play special attention to the histidines. As Mpro has a neutral Cys-His catalytic dyad embedded in a chymotrypsin-like fold, His41 is delta-protonated (HID). Also, given that His163 stablishes a hydrogen bond with the ligand, His163 is epsilon protonated. The other histidine residues must be checked individually but the most estable conformer when exposed to solvent is epsilon-protonated (HIE).

This table summarises the state you should change the HIS name in the protein.pdb file using a text editor of your preference:

Residue Histidine form Reason
41 HID Catalytic dyad
64 HIE Solvent exposed
80 HIE Solvent exposed
163 HIE H-bonding with the ligand
164 HIE H-bonding with Thr175
172 HIE Solvent exposed
246 HIE Solvent exposed

If you perform the pKa calculation using the H++ server, it will give you the same information and also a new PDB file with the necessary changes applied.

TIP:

Histidines are tricky residues, they have three possible protonation states: HID, HIE and HIP. HIP is the protonated residue. HID and HIE correspond to the two natural tautomers of the neutral histidine, where the proton can be found in delta or epsilon positions. In solution, the most common conformer is HIE but always visualise your structure before assuming any histidine protonation state.

three histidine states: HID HIE and HIP.

Last but not least, we need to clean the PDB file before passing it to AmberTools (removing residues with double ocupation, checking for disulfide bonds, …). So we are going to use the first AmberTools tool: pdb4amber which prepares your PDB file.

pdb4amber -i protein.pdb -o protein4amber.pdb  

It returns the following output:

==================================================
Summary of pdb4amber for: protein.pdb
===================================================

----------Chains
The following (original) chains have been found:
A

---------- Alternate Locations (Original Residues!))

The following residues had alternate locations:
VAL_73
-----------Non-standard-resnames


---------- Mising heavy atom(s)

None
The alternate coordinates have been discarded.
Only the first occurrence for each atom was kept.

The output above gives you a lot of information. First it tells you that your protein is composed of a single chain. Then it informs you that some residues had alternate conformations in the X-ray structure, but that only the first occurrence for each atom was kept.

pdb4amber informs us about the duplicity of several sidechains, disulphide bonds and missing heavy atoms. However, it does not check protonation states for protein residues and therefore we ALWAYS must run a pKa calculation first and apply the required changes to the PDB.


Assessing the protonation states of the ligand

So far we have not paid attention to the ligand. We will now move to check the protonation state of the ligand. We have to check the pKa of the ligand. There are several options to do so, the most reliable one if available is PubChem, where you can find experimental pKa values.

In this example, there are no pKa values (neither experimental nor calculated) reported for this molecule, so we need to use another resource. I like Marvin Sketch because it is free to use (you need to register and cannot use it for comercial purposes), it gives calculated pKa values and it has a nice visualisation of the chemical species at a different pH values.

pKa calculation of the GWS ligand

After assessing the protonation state of the GWS ligand using Marvin Sketch, we can proceed to protonate it. The easiest way to protonate it is using OpenBabel.

obabel -ipdb GWS.pdb -opdb -O GWS.H.pdb -h 

You should always check the result of this step and ammend it if needed. In this case, it gives the protonation state predicted to be the most abundant by Marvinsketch at pH 7.4:

protonated GWS ligand

Key Points

  • pdb4amber does a preliminary check on your PDB file and cleans potential errors in the protein structure.

  • ALWAYS assess the protonation state of your system at the pH of interest.

  • PROPKA and H++ server predict the protonation state of each aminoacid of protein.

  • openbabel is a chemical toolbox designed to search, convert, analyze, or store data from molecular modeling, chemistry, biochemistry, or related areas.