To assist structure-based approaches in drug design, we have processed the PDB to identify binding sites suitable for docking a drug-like ligand, and have compiled the results into a database called sc-PDB.

The sc-PDB database provides separate MOL2 files for the ligand, its binding site and the corresponding protein chain(s). Ions and cofactors in the vicinity of the ligand are included in the protein.
More details about the sc-PDB scope, its content and its evolution during the 2004-2009 period are provided in a PDF document.

Web site specifications :
  • PHP/MySQL for database management
  • HTML/CSS/AJAX for style and web pages
  • VTemplate: a PHP/HTML template manager
  • jQuery for client-side animation
  • jQuery UI for client-side animation
  • Chosen plugin (v1.1.0) for form style and multiple selection handling
  • DataTables plugin for enhanced results tables, with the following extensions :
    • ColVis extension to dynamically show and hide columns in a table
    • TableTools for exports
  • PHPExcel to export results table into xlsx file
  • Highcharts for interactive charts
  • OpenAstexViewer to visualize alignment of proteins in complex with ligands
  • ChemAxon Marvin Sketcher and JChem Manager for sketching, storing, and searching molecular structures
The current release (v.2013) was built from PDB data frozen on 2014-02-13 11:50:19.
It contains 9283 entries, which correspond to 3678 different proteins and 5608 different ligands.
The full database totals 1.5 GB compressed.
The database is updated annually.
Manual corrections are occasionally applied to the database.

Database Set Up :
The selection of PDB complexes is based on the ligand molecular weight, buried surface area and the chemical structure as well as the volume of the corresponding cavities. According to our own definition, a drug-like ligand refers to an organic compound (only H, C, N, O, S, P, F, Cl, Br and I allowed) within a defined range for physico-chemical descriptors (e.g., molecular weight, logP, number of H-bond donors and acceptors).
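The element restriction in this definition can be illustrated with a minimal sketch. The function name and the input format (a list of element symbols) are assumptions made for illustration. The descriptor ranges (molecular weight, logP, H-bond donors and acceptors) are not covered here.

```python
# Elements allowed in a drug-like ligand, as listed in the text above.
ALLOWED_ELEMENTS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I"}


def has_only_allowed_elements(elements):
    """Return True if the ligand is non-empty and uses only allowed elements.

    `elements` is a list of element symbols; PDB-style upper-case symbols
    such as "CL" are normalised to "Cl" before the comparison.
    """
    normalised = [symbol.strip().capitalize() for symbol in elements]
    return bool(normalised) and all(s in ALLOWED_ELEMENTS for s in normalised)
```

For example, a ligand containing an iron atom would be rejected, while a chlorinated organic compound would pass.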

STEP 1 - Data collection
  • Non-obsolete PDB codes and structures are downloaded from the RCSB website
  • Enzyme Commission numbers are downloaded from

STEP 2 - Filtering
  • Only X-ray (< 3 Å resolution) or NMR structures are kept
  • Proteins with fewer than 36 amino acids are discarded
  • Protein structures described by CA atoms only are discarded
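The three filters above can be sketched as a single predicate. The `Entry` record, its field names and the method labels "X-RAY" and "NMR" are hypothetical and only serve to make the example self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical per-entry summary; field names are assumptions."""
    method: str                  # "X-RAY" or "NMR" (labels assumed)
    resolution: Optional[float]  # in Å; None when not applicable (NMR)
    n_residues: int              # number of amino acids in the protein
    ca_only: bool                # True if only CA atoms are described


def passes_step2(entry):
    """Apply the STEP 2 filters: method/resolution, size, CA-only."""
    if entry.method == "X-RAY":
        # X-ray structures need a resolution strictly below 3 Å.
        if entry.resolution is None or entry.resolution >= 3.0:
            return False
    elif entry.method != "NMR":
        return False
    if entry.n_residues < 36:
        return False
    return not entry.ca_only
```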

STEP 3 - Annotation
  • The UniProt AC identifier is retrieved for each PDB entry from the RCSB RESTful web services
  • The target name is selected based on the recommended name in the UniProt entry
    In the case of UniProt entries describing polyproteins, the name of the appropriate domain is chosen for the PDB chain using the Enzyme Commission number or keywords
  • Proteins with multiple names are discarded
  • Chimeric proteins are detected based on keywords describing fusion proteins. The fusion protein is ignored in the annotation

STEP 4 - Modification of PDB files to conform to the sc-PDB format :
  • If the PDB file consists of several variants of the same molecular structure (e.g., in NMR ensembles), only the first structure is kept
  • Selenomethionine residues are mutated into methionine residues, and selenocysteine residues into cysteine residues
  • For atoms with alternate locations, the coordinate sets with the lower occupancy are deleted (or the last ones if several sets have the same occupancy)
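The alternate-location rule can be sketched as follows, under the assumption that "less frequent" refers to occupancy and that the choice is made atom by atom (a real pipeline may choose per residue). The function name and tuple layout are hypothetical.

```python
def keep_one_altloc(alternatives):
    """Pick the coordinate set to keep for one atom.

    `alternatives` is a list of (altloc_id, occupancy, xyz) tuples in file
    order. The set with the highest occupancy is kept. On ties, Python's
    max() returns the first maximal item, so the later sets are the ones
    dropped, which matches the rule stated above.
    """
    if not alternatives:
        return None
    return max(alternatives, key=lambda alt: alt[1])
```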

STEP 5 - Residues description
  • Chains corresponding to an unwanted functional class (toxin, immunoglobulin, antibody, heme proteins) are removed.
  • Chains lacking an annotation in STEP 3 are removed from the entry. This step removes all DNA/RNA chains and chains of unknown function
  • Inter-atomic distances between all PDB residues are systematically computed to build all covalent bonds within a molecule, thereby reconstituting molecular entities (e.g., merging sugar units with a protein to form the glycosylated form of the protein)
  • All possible residues within PDB entries are compared to the description provided in the Chemical Component Dictionary
  • Unwanted ligands (e.g., salts, crystallization agents, polymers of oligosaccharides) are discarded based on HET code
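The distance-based reconstruction of molecular entities described above can be sketched as follows. This is not the sc-PDB implementation: the bond criterion (sum of covalent radii plus a 0.4 Å tolerance), the radii table and the atom dictionary layout are all assumptions chosen for illustration.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in Å (assumed values, subset only).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "P": 1.07, "S": 1.05}
TOLERANCE = 0.4  # Å, assumed slack on the bond-length criterion


def bonded(a, b):
    """Two atoms are considered bonded if closer than r1 + r2 + tolerance."""
    limit = COVALENT_RADII[a["element"]] + COVALENT_RADII[b["element"]] + TOLERANCE
    return math.dist(a["xyz"], b["xyz"]) <= limit


def group_residues(atoms):
    """Merge residues linked by an inter-residue bond into molecules."""
    parent = {a["residue"]: a["residue"] for a in atoms}

    def find(r):
        # Union-find lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in combinations(atoms, 2):
        if a["residue"] != b["residue"] and bonded(a, b):
            parent[find(a["residue"])] = find(b["residue"])

    groups = {}
    for r in parent:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())
```

With this sketch, a sugar residue whose anomeric carbon lies about 1.45 Å from an asparagine nitrogen ends up in the same molecule as that asparagine, while a distant water molecule stays separate. The all-pairs loop is quadratic, so a spatial grid would be needed for full PDB entries.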

STEP 6 - Identification of all molecules in the PDB files : All molecules are identified and assigned to one of the following types:
  • protein chain (continuous or discontinuous chain of more than 8 residues)
  • prosthetic group
  • cofactor
  • peptide (continuous protein chain of 8 or fewer residues, at least half of which are amino-acid residues)
  • putative ligand (a HET group or an assembly of HET groups not covalently bound to any protein chain)
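The size-based part of this typing (protein chain versus peptide) can be sketched as below. The function name and arguments are hypothetical, and the criteria for prosthetic groups, cofactors and putative ligands are not covered because they are not fully specified here.

```python
def classify_chain(n_residues, n_amino_acids, continuous):
    """Classify a polymer chain as 'protein chain', 'peptide', or None.

    n_residues    -- total number of residues in the chain
    n_amino_acids -- how many of those residues are amino acids
    continuous    -- True if the chain has no break
    """
    if n_residues > 8:
        return "protein chain"
    # A peptide is a continuous chain of at most 8 residues,
    # at least half of which are amino acids.
    if n_residues > 0 and continuous and 2 * n_amino_acids >= n_residues:
        return "peptide"
    return None
```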

STEP 7 - Identification of the sc-PDB ligand: The cofactors, peptides and putative ligands are candidates for sc-PDB ligand. Candidate ligands are further considered if they pass the following filters:
  • Druggable binding site (based on a pharmacophoric grid description of the negative image of the binding site)
  • ≥ 7 standard amino acids in the protein binding site (=all residues at a max distance of 6.5 Å to any ligand atom)
  • Molecular Weight > 140 g/mol
  • Molecular Weight < 800 g/mol
  • Ligand is complete (no missing atoms in residues, as defined in the Chemical Component Dictionary)
  • Either (relative buried surface area ≥ 50% AND 250 ≤ Molecular Weight ≤ 800) OR
    (50% ≤ relative buried surface area ≤ 90% AND 140 ≤ Molecular Weight ≤ 250)
The best candidate is the putative ligand or peptide with the highest buried surface area.
If no such ligand is present, cofactors become candidates too, and the one with the highest buried surface area becomes the final sc-PDB ligand.
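The filters and the selection rule of STEP 7 can be sketched as follows. The `Candidate` record and its field names are hypothetical. Two readings are assumed: the combined surface/weight condition is grouped as (A AND B) OR (C AND D), and "no ligand is present" means that no putative ligand or peptide passes the filters. The druggability flag is taken as a precomputed input.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate record; field names are assumptions."""
    name: str
    kind: str               # "ligand", "peptide" or "cofactor"
    mol_weight: float       # g/mol
    buried_fraction: float  # relative buried surface area, 0.0 to 1.0
    buried_area: float      # absolute buried surface area, Å²
    n_site_residues: int    # standard amino acids within 6.5 Å
    complete: bool          # no missing atoms
    druggable_site: bool    # result of the druggability test


def passes_filters(c):
    """Apply the STEP 7 filters to one candidate."""
    if not (c.druggable_site and c.complete and c.n_site_residues >= 7):
        return False
    if not (140 < c.mol_weight < 800):
        return False
    large = c.buried_fraction >= 0.5 and 250 <= c.mol_weight <= 800
    small = 0.5 <= c.buried_fraction <= 0.9 and 140 <= c.mol_weight <= 250
    return large or small


def pick_scpdb_ligand(candidates):
    """Prefer ligands and peptides; fall back to cofactors if none pass."""
    passed = [c for c in candidates if passes_filters(c)]
    primary = [c for c in passed if c.kind in ("ligand", "peptide")]
    pool = primary or [c for c in passed if c.kind == "cofactor"]
    return max(pool, key=lambda c: c.buried_area, default=None)
```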

STEP 8 - Writing sc-PDB Files:


Atom typing is based on MOL2 templates, created from the CACTVS stereo SMILES of the mmCIF dictionary in the remediated PDB as follows:
HET release status is checked, molecules are ionized using Filter (OpenEye Scientific Software) and standardized using JChem (ChemAxon), and 3D coordinates are generated using Corina (Molecular Networks). Some templates are manually corrected.

Protein and site:

The protein.mol2 file contains all protein chains, ions and cofactors with at least one atom less than 6.5 Å from any atom of the ligand. Water molecules are kept too, provided they are close to the ligand (inter-atomic distance ≤ 6.5 Å) and make at least two hydrogen bonds with the protein. The site.mol2 file contains all protein residues, ions, water molecules and cofactors close to the ligand (inter-atomic distance ≤ 6.5 Å). Hydrogen atoms are added to all protein residues. The tautomerisation state of histidine residues in the vicinity of a metal ion (≤ 2.5 Å) was verified.
The position of polar hydrogen atoms is optimized using hydescorer (BioSolveIT).
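The distance criterion used to build the site can be sketched as below. The function name and the input layout (a dictionary from residue identifier to atom coordinates) are assumptions, and the hydrogen-bond condition on water molecules is not modelled.

```python
import math


def residues_near_ligand(residue_atoms, ligand_atoms, cutoff=6.5):
    """Return ids of residues with at least one atom within `cutoff` Å
    of any ligand atom.

    residue_atoms -- dict mapping residue id to a list of (x, y, z) tuples
    ligand_atoms  -- list of (x, y, z) tuples
    """
    near = []
    for res_id, coords in residue_atoms.items():
        if any(math.dist(a, b) <= cutoff for a in coords for b in ligand_atoms):
            near.append(res_id)
    return near
```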

Interaction FingerPrint (IFP):

The inter-molecular non-bonded interactions between the protein and the optimized ligand are encoded in a binary fingerprint stored in the ifp.txt file.

Cavity files:

The negative image of the binding site, centered on the ligand, is detected using VolSite.
Cavities are described by two files: cavity_all.mol2 (full cavity) and cavity_6.mol2 (cavity cut at 6 Å from any ligand atom).

STEP 9 - Annotation of sc-PDB entries :


Physico-chemical properties are calculated using Pipeline Pilot (Accelrys):
logP, number of rotatable bonds, number of rings, logS, polar surface area PSA, number of H-Bond donors/acceptors and the number of violations of the Lipinski rules.


The following annotations are retrieved from the PDB file: resolution, deposition date.
The following annotations are retrieved from the UniProt database, provided the cross-reference is found in the RCSB PDB database: sc-PDB target name, EC number, UniProt identifier, UniProt accession number, source organism (species, kingdom and TaxId). The sc-PDB target name is the recommended name in the UniProt entry, after removal of tags for location or maturation state; in the case of polyproteins, the name of the appropriate domain is chosen.
Mutations in the binding site are detected by comparing the amino acid sequence in the protein.mol2 file with the SEQRES sequence from the original PDB file. The SEQRES sequence is then aligned to the sequence found in the UniProt entry. Together, the residue sequence from the protein.mol2 file and the UniProt sequence determine the mutation status of the binding site, i.e., WILD_TYPE or MUTANT.
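The final comparison can be sketched as follows, assuming the alignment has already been computed by a separate tool. The function name, the use of "-" for gaps and the choice to ignore gap columns are assumptions.

```python
def site_mutation_status(aligned_pdb, aligned_uniprot, site_columns):
    """Return 'MUTANT' if any binding-site column differs, else 'WILD_TYPE'.

    aligned_pdb, aligned_uniprot -- aligned sequences of equal length,
                                    with '-' marking gaps
    site_columns                 -- alignment columns of binding-site residues
    """
    if len(aligned_pdb) != len(aligned_uniprot):
        raise ValueError("aligned sequences must have the same length")
    for i in site_columns:
        a, b = aligned_pdb[i], aligned_uniprot[i]
        # Gap columns are skipped: a missing residue is not a substitution.
        if a != "-" and b != "-" and a != b:
            return "MUTANT"
    return "WILD_TYPE"
```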


The average B-factor is calculated for heavy atoms of the residues in the ligand binding site.
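A minimal sketch of this average, assuming the binding-site atoms are available as (element, B-factor) pairs and that heavy atoms are all atoms other than hydrogen or deuterium:

```python
def mean_heavy_atom_bfactor(atoms):
    """Average B-factor over non-hydrogen atoms.

    `atoms` is an iterable of (element, b_factor) pairs for the residues
    of the binding site. Returns None if there is no heavy atom.
    """
    values = [b for element, b in atoms if element.strip().upper() not in ("H", "D")]
    if not values:
        return None
    return sum(values) / len(values)
```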
The total number of intermolecular interactions is computed using the in-house IFP program. The following counts are given:
  • protein residues which make hydrophobic contacts with the ligand.
  • protein-ligand aromatic (face to face) interactions.
  • protein-ligand aromatic (edge to face) interactions.
  • protein-ligand H-bond(donor-acceptor) interactions.
  • protein-ligand H-bond(acceptor-donor) interactions.
  • protein-ligand ionic(cation-anion) interactions.
  • protein-ligand ionic(anion-cation) interactions.
  • protein-ligand metal coordination (metal-H-bond acceptor) interactions.
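To illustrate how such counts can be derived from a binary fingerprint, here is a sketch that assumes one bit per binding-site residue and per interaction type, in the order of the list above. The actual layout of the in-house IFP program and of the ifp.txt file may differ. All names are hypothetical.

```python
# Interaction types in the order of the list above (labels are assumptions).
INTERACTION_TYPES = [
    "hydrophobic",
    "aromatic_face_to_face",
    "aromatic_edge_to_face",
    "hbond_donor_acceptor",
    "hbond_acceptor_donor",
    "ionic_cation_anion",
    "ionic_anion_cation",
    "metal_coordination",
]


def encode_ifp(site_residues, interactions):
    """Build a flat bit list: one block of 8 bits per residue.

    interactions -- set of (residue_id, interaction_type) pairs
    """
    bits = []
    for res in site_residues:
        for itype in INTERACTION_TYPES:
            bits.append(1 if (res, itype) in interactions else 0)
    return bits


def count_by_type(bits):
    """Count set bits per interaction type across all residues."""
    n = len(INTERACTION_TYPES)
    return {itype: sum(bits[i::n]) for i, itype in enumerate(INTERACTION_TYPES)}
```

Under this layout each count is a number of residues showing the interaction, so several contacts of the same type with one residue are counted once.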

STEP 10 - Removing redundancy: If several entries have identical:
  • protein name
  • source organism
  • ligand canonical SMILES

only the entry with the best resolution is kept.
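This redundancy rule can be sketched as a grouping on the three identical fields. The dictionary keys are hypothetical, and "best resolution" is taken to mean the lowest value in Å. Entries are assumed to carry a numeric resolution.

```python
def remove_redundancy(entries):
    """Keep, for each (protein name, organism, SMILES) group, the entry
    with the lowest resolution value.

    `entries` is a list of dicts with the keys 'protein_name', 'organism',
    'smiles' and 'resolution' (key names are assumptions).
    """
    best = {}
    for entry in entries:
        key = (entry["protein_name"], entry["organism"], entry["smiles"])
        if key not in best or entry["resolution"] < best[key]["resolution"]:
            best[key] = entry
    return list(best.values())
```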
Entries are removed if they belong to an unwanted functional class (toxin, immunoglobulin, antibody, heme proteins).