To assist structure-based approaches in drug design,
we have processed the PDB to identify binding sites suitable for the docking of a drug-like ligand
and have thereby created a database called sc-PDB.
The sc-PDB database provides separate MOL2 files for the ligand, its binding site and the corresponding protein chain(s).
Ions and cofactors in the vicinity of the ligand are included in the protein.
More details about the sc-PDB scope, its content and its evolution during the 2004-2009 period are provided in a PDF document.
Web site specifications :
PHP/MySQL for database management
HTML/CSS/AJAX for style and web pages
VTemplate: a PHP/HTML template manager
jQuery for client-side animation.
jQuery UI for client-side animation.
Chosen plugin (v1.1.0) for form style and multiple selection handling
DataTables plugin for enhanced results table with the following plugins :
ColVis extension to dynamically show and hide columns in a table
TableTools for exports
PHPExcel to export results table into xlsx file
Highcharts for interactive charts
OpenAstexViewer to visualize alignment of proteins in complex with ligands
ChemAxon Marvin Sketcher and JChem Manager for sketching, storing, and searching molecular structures
The current release (v.2013) was built from the PDB data frozen on 2014-02-13 11:50:19.
It contains 9283 entries, which correspond to 3678 different proteins and 5608 different ligands.
The full database totals 1.5 GB compressed.
The database is updated annually.
Manual corrections are occasionally applied to the database.
Database Set Up :
The selection of PDB complexes is based on the ligand molecular weight, buried surface area and the chemical structure as well as the volume of the corresponding cavities.
According to our own definition, a drug-like ligand refers to an organic compound (only H, C, N, O, S, P, F, Cl, Br and I allowed) within a defined range for physico-chemical descriptors (e.g., molecular weight, logP, number of H-bond donors and acceptors).
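As an illustration of this definition, here is a minimal sketch of the two checks it implies: an element whitelist and a range test on descriptors. The function names are hypothetical, and the descriptor bounds are left to the caller because the text does not state them.

```python
# Elements allowed in a drug-like ligand, as listed in the text.
ALLOWED_ELEMENTS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"}


def has_only_allowed_elements(elements):
    """Return True if every element symbol belongs to the allowed set.

    Symbols are normalised so that "CL" and "Cl" are treated alike.
    """
    return all(symbol.capitalize() in ALLOWED_ELEMENTS for symbol in elements)


def within_ranges(descriptors, ranges):
    """Return True if each descriptor lies inside its (low, high) range.

    `ranges` is supplied by the caller: the actual sc-PDB bounds are not
    given in the text, so none are hard-coded here.
    """
    return all(low <= descriptors[name] <= high
               for name, (low, high) in ranges.items())


# Example with a made-up molecular weight range (assumption, not sc-PDB's).
print(has_only_allowed_elements(["C", "H", "N", "O", "CL"]))
print(within_ranges({"mw": 300.0}, {"mw": (100.0, 500.0)}))
```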
STEP 1 - Data collection
Non-obsolete PDB codes and structures are downloaded from the RCSB website
Enzyme Commission numbers are downloaded from enzyme.expasy.org
STEP 2 - Filtering
Only X-ray (< 3 Å resolution) or NMR structures are kept
Proteins with fewer than 36 amino acids are discarded
Protein structures described by CA atoms only are discarded
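The three filters above can be sketched as a single predicate. This is only an illustration: the `Entry` record, its field names and the method labels "X-RAY" and "NMR" are hypothetical, and the sketch treats each entry as one protein.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    method: str                  # "X-RAY" or "NMR" (hypothetical labels)
    resolution: Optional[float]  # in Å; None when not applicable (NMR)
    n_amino_acids: int           # number of amino acids in the protein
    ca_only: bool                # True if only CA atoms are described


def passes_step2(entry):
    """Apply the STEP 2 filters described in the text."""
    if entry.method == "X-RAY":
        # X-ray structures need a resolution strictly below 3 Å.
        if entry.resolution is None or entry.resolution >= 3.0:
            return False
    elif entry.method != "NMR":
        # Any other experimental method is rejected.
        return False
    if entry.n_amino_acids < 36:
        return False
    return not entry.ca_only


print(passes_step2(Entry("X-RAY", 2.1, 250, False)))
print(passes_step2(Entry("X-RAY", 3.0, 250, False)))
```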
STEP 3 - Annotation
The UniProt AC identifier is retrieved for each PDB entry from the RCSB RESTful web services
The target name is selected based on the recommended name in the UniProt entry
In the case of UniProt entries describing polyproteins, the name of the appropriate domain is chosen for the PDB chain using the Enzyme Commission number or keywords
Proteins with multiple names are discarded
Chimeric proteins are detected based on keywords describing fusion proteins. The fusion protein is ignored in the annotation
STEP 4 - Modification of PDB files to conform to the sc-PDB format :
If the PDB file consists of several variants of the same molecular structure (e.g., in NMR ensembles), only the first structure is kept
Selenomethionine residues are mutated into methionine residues, and selenocysteine residues into cysteine residues
The less frequent coordinate sets of an alternate location are deleted (or the last ones if several sets have the same occupancy)
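The alternate-location rule can be illustrated with a short sketch. It assumes that "less frequent" means "lower occupancy", which the parenthetical about equal occupancies suggests but the text does not state outright. The function name and input layout are hypothetical.

```python
def pick_altloc(alt_sets):
    """Choose which alternate-location set to keep.

    `alt_sets` is a list of (altloc_id, occupancy) pairs in file order.
    The set with the highest occupancy is kept; on a tie the earliest
    set wins, so the later ones are the ones deleted.
    """
    best_id, best_occupancy = alt_sets[0]
    for alt_id, occupancy in alt_sets[1:]:
        # Strict comparison keeps the first set when occupancies are equal.
        if occupancy > best_occupancy:
            best_id, best_occupancy = alt_id, occupancy
    return best_id


print(pick_altloc([("A", 0.4), ("B", 0.6)]))
print(pick_altloc([("A", 0.5), ("B", 0.5)]))
```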
STEP 5 - Residues description
Chains corresponding to an unwanted functional class (toxin, immunoglobulin, antibody, heme proteins) are removed.
Chains lacking an annotation in STEP 3 are removed from the entry. This step removes all DNA/RNA chains and chains of unknown function
Inter-atomic distances between all PDB residues are systematically computed to build all covalent bonds within a molecule, thereby reconstituting molecular entities (e.g., merging sugar units with a protein to form the glycosylated form of the protein)
All possible residues within PDB entries are compared to the description provided in the Chemical Component Dictionary (wwPDB.org)
Unwanted ligands (e.g., salts, crystallization agents, polymers of oligosaccharides) are discarded based on HET code
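The distance-based reconstruction of molecular entities mentioned in this step can be sketched with a union-find structure: residues that have at least one pair of atoms within bonding distance are merged into the same entity. The single 2.0 Å threshold and the function name are assumptions made for illustration. The text does not say which distance criteria sc-PDB uses.

```python
import math
from itertools import combinations


def merge_residues(residues, max_bond_length=2.0):
    """Group residues into molecular entities by inter-atomic distance.

    `residues` maps a residue id to a list of (x, y, z) atom coordinates.
    `max_bond_length` (Å) is a hypothetical, element-independent cutoff.
    Returns a list of sets of residue ids.
    """
    parent = {rid: rid for rid in residues}

    def find(rid):
        # Follow parent links to the root, compressing the path on the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in combinations(residues, 2):
        bonded = any(math.dist(p, q) <= max_bond_length
                     for p in residues[a] for q in residues[b])
        if bonded:
            parent[find(a)] = find(b)

    groups = {}
    for rid in residues:
        groups.setdefault(find(rid), set()).add(rid)
    return list(groups.values())


# Toy coordinates: an asparagine linked to two sugar units, plus a water.
example = {
    "ASN1": [(0.0, 0.0, 0.0)],
    "NAG2": [(1.4, 0.0, 0.0)],
    "NAG3": [(2.8, 0.0, 0.0)],
    "HOH4": [(10.0, 0.0, 0.0)],
}
print(merge_residues(example))
```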
STEP 6 - Identification of all molecules in the PDB files :
All molecules are identified and assigned to one of the following types:
protein chain (continuous or discontinuous chain of more than 8 residues)
peptide (continuous protein chain of 8 or fewer residues, at least half of which are amino-acid residues)
putative ligand (a HET group or an assembly of HET groups not covalently bonded to any protein chain)
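The three types above can be expressed as a small decision function. This is a rough sketch: the function name and arguments are hypothetical, chain continuity is not checked, and returning `None` for molecules that fit no type is this sketch's own convention.

```python
def classify_molecule(n_residues, n_amino_acids, is_het_assembly,
                      bound_to_protein=False):
    """Assign a molecule to one of the STEP 6 types, or None.

    n_residues       -- total number of residues in the molecule
    n_amino_acids    -- how many of those residues are amino acids
    is_het_assembly  -- True for a HET group or an assembly of HET groups
    bound_to_protein -- True if covalently bonded to a protein chain
    """
    if is_het_assembly:
        # A HET assembly is a putative ligand only when it is free.
        return None if bound_to_protein else "putative ligand"
    if n_residues > 8:
        return "protein chain"
    # 8 residues or fewer: a peptide needs at least half amino acids.
    if 2 * n_amino_acids >= n_residues:
        return "peptide"
    return None


print(classify_molecule(250, 250, False))
print(classify_molecule(6, 4, False))
print(classify_molecule(1, 0, True))
```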
STEP 7 - Identification of the sc-PDB ligand:
The cofactors, peptides and putative ligands are candidates for sc-PDB ligand.
Candidate ligands are further considered if they pass the following filters:
The best candidate is the putative ligand or the peptide with the highest buried surface area.
If no ligand is present, then cofactors become candidates too, and the one with the highest buried surface area is the final sc-PDB ligand.
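The selection rule in this step can be sketched as follows, assuming the candidates have already passed the filters and that each one carries a precomputed buried surface area. The dictionary keys and the function name are hypothetical.

```python
def select_ligand(candidates):
    """Pick the sc-PDB ligand among filtered candidates.

    Each candidate is a dict with the keys "name", "kind" (one of
    "putative ligand", "peptide", "cofactor") and "buried_area".
    Putative ligands and peptides take priority; cofactors are only
    considered when no such candidate exists.
    """
    primary = [c for c in candidates
               if c["kind"] in ("putative ligand", "peptide")]
    pool = primary or [c for c in candidates if c["kind"] == "cofactor"]
    if not pool:
        return None
    return max(pool, key=lambda c: c["buried_area"])


# Made-up values: the cofactor is more buried but is not chosen,
# because a putative ligand is present.
example = [
    {"name": "cofactor_x", "kind": "cofactor", "buried_area": 420.0},
    {"name": "ligand_y", "kind": "putative ligand", "buried_area": 310.0},
]
print(select_ligand(example)["name"])
```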
STEP 8 - Writing sc-PDB Files:
Atom typing is based on MOL2 templates, created from the CACTVS stereo SMILES of the mmCIF dictionary in the remediated PDB as follows:
HET release status is checked, molecules are ionized using Filter (OpenEye Scientific Software), standardized using JChem (ChemAxon), and 3D coordinates are generated using Corina (Molecular Networks). Some templates are manually corrected.
Protein and site:
The protein.mol2 file contains all protein chains, as well as ions and cofactors if at least one of their atoms is within 6.5 Å of any atom of the ligand. Water molecules are kept too, provided they are close to the ligand (inter-atomic distance ≤ 6.5 Å) and make at least two hydrogen bonds with the protein.
The site.mol2 file contains all protein residues, ions, water molecules and cofactors close to the ligand (inter-atomic distance ≤ 6.5 Å).
Hydrogen atoms are added to all protein residues. The tautomerisation states of histidine residues in the vicinity of a metal ion (≤ 2.5 Å) were verified.
The position of polar hydrogen atoms is optimized using hydescorer (BioSolveIT).
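The 6.5 Å distance criterion used to build the site can be illustrated with a brute-force sketch: a residue is selected as soon as one of its atoms lies within the cutoff of any ligand atom. The function name and input layout are hypothetical, and the water hydrogen-bond condition is not modelled here.

```python
import math


def site_residues(protein_atoms, ligand_coords, cutoff=6.5):
    """Return the ids of residues with an atom within `cutoff` Å of the ligand.

    `protein_atoms` is a list of (residue_id, (x, y, z)) pairs and
    `ligand_coords` a list of (x, y, z) tuples. Residue ids are returned
    in order of first appearance.
    """
    selected = []
    for residue_id, xyz in protein_atoms:
        if residue_id in selected:
            continue  # one close atom is enough to keep the residue
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords):
            selected.append(residue_id)
    return selected


# Toy coordinates: ALA1 has one atom 5 Å from the ligand, GLY2 is 15 Å away.
atoms = [("ALA1", (0.0, 0.0, 0.0)),
         ("ALA1", (-8.0, 0.0, 0.0)),
         ("GLY2", (20.0, 0.0, 0.0))]
print(site_residues(atoms, [(5.0, 0.0, 0.0)]))
```

A production pipeline would more likely use a spatial index than this all-pairs loop, but the selection rule is the same.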
Interaction FingerPrint (IFP):
The inter-molecular non-bonded interactions between the protein and the optimized ligand are encoded in a binary fingerprint stored in the
The negative image of the binding site, centered on the ligand, is detected using
Cavities are described by two files: cavity_all.mol2 (full cavity) and cavity_6.mol2 (cavity cut at 6 Å from any ligand atom)
STEP 9 - Annotation of sc-PDB entries :
Physico-chemical properties are calculated using Pipeline Pilot (
logP, number of rotatable bonds, number of rings, logS, polar surface area PSA, number of H-Bond donors/acceptors and the number of violations of the Lipinski rules.
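For the last descriptor in this list, here is a minimal sketch of counting Lipinski violations using the commonly cited rule-of-five thresholds (molecular weight above 500, logP above 5, more than 5 H-bond donors, more than 10 H-bond acceptors). Whether Pipeline Pilot applies exactly these definitions is not stated in the text, and the function name is hypothetical.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four rule-of-five criteria are violated."""
    checks = [
        mol_weight > 500.0,  # molecular weight in Da
        logp > 5.0,          # octanol-water partition coefficient
        h_donors > 5,        # number of H-bond donors
        h_acceptors > 10,    # number of H-bond acceptors
    ]
    return sum(checks)


print(lipinski_violations(350.0, 2.5, 2, 5))
print(lipinski_violations(650.0, 6.1, 2, 5))
```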
The following annotations are retrieved from the PDB file: Resolution, deposition date
The following annotations are retrieved from the UniProt database, provided the cross-reference is found in the RCSB PDB database: sc-PDB target name (the recommended name in the UniProt entry, after removal of tags for location or maturation state; in the case of polyproteins, the name of the appropriate domain is chosen), EC number, UniProt identifier, UniProt accession number, source organism (species, kingdom and TaxId)
Mutations in the binding site are detected by comparing the amino acid sequence in the protein.mol2 file with the SEQRES records from the native PDB file.
The SEQRES sequence is then aligned to the sequence found in the UniProt entry.
Both the residue sequence from the protein.mol2 file and the UniProt sequence are used to classify the binding site as WILD_TYPE or MUTANT
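Once the sequences are aligned, the final comparison can be sketched as below. The sketch assumes two pre-aligned strings of equal length with "-" for gaps, plus the alignment indices of the binding-site residues. The alignment itself is not shown, and skipping gap positions is this sketch's own choice.

```python
def binding_site_status(site_seq, reference_seq, site_positions):
    """Classify a binding site as "WILD_TYPE" or "MUTANT".

    site_seq       -- aligned sequence from the structure (one-letter codes)
    reference_seq  -- aligned reference (e.g. UniProt) sequence
    site_positions -- alignment indices of the binding-site residues
    """
    if len(site_seq) != len(reference_seq):
        raise ValueError("sequences must be aligned to the same length")
    for pos in site_positions:
        observed, expected = site_seq[pos], reference_seq[pos]
        if "-" in (observed, expected):
            continue  # gap: nothing to compare at this position
        if observed != expected:
            return "MUTANT"
    return "WILD_TYPE"


# The substitution at index 2 only matters if it is a binding-site position.
print(binding_site_status("MKALA", "MKVLA", [2]))
print(binding_site_status("MKALA", "MKVLA", [0, 4]))
```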
The average B-factor is calculated for heavy atoms of the residues in the ligand binding site.
The total number of intermolecular interactions is computed using the in-house IFP program. The following counts are given:
protein residues which make hydrophobic contacts with the ligand.
protein-ligand aromatic (face to face) interactions.
protein-ligand aromatic (edge to face) interactions.
protein-ligand H-bond (donor-acceptor) interactions.
protein-ligand H-bond (acceptor-donor) interactions.
protein-ligand ionic (cation-anion) interactions.
protein-ligand ionic (anion-cation) interactions.
protein-ligand metal coordination (metal-H-bond acceptor) interactions.
STEP 10 - Removing redundancy
If several entries have identical:
ligand canonical smiles
Only the entry with the best resolution is kept.
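The rule above amounts to grouping entries by a key and keeping the best-resolved entry in each group. In the sketch below the key fields are supplied by the caller, since only the ligand canonical SMILES criterion is listed here. The field names, entry ids and function name are hypothetical, and "best resolution" is taken to mean the lowest value in Å.

```python
def remove_redundancy(entries, key_fields):
    """Keep, for each distinct key, the entry with the lowest resolution.

    `entries` is a list of dicts that each contain a "resolution" value
    (in Å) plus every field named in `key_fields`.
    """
    best = {}
    for entry in entries:
        key = tuple(entry[field] for field in key_fields)
        # A smaller resolution value means a better-resolved structure.
        if key not in best or entry["resolution"] < best[key]["resolution"]:
            best[key] = entry
    return list(best.values())


entries = [
    {"id": "entry_a", "smiles": "CCO", "resolution": 2.5},
    {"id": "entry_b", "smiles": "CCO", "resolution": 1.8},
    {"id": "entry_c", "smiles": "CCN", "resolution": 2.9},
]
print([e["id"] for e in remove_redundancy(entries, ["smiles"])])
```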
Entries are removed if they belong to an unwanted functional class (toxin, immunoglobulin, antibody, heme proteins).