LINGO Method (Concept)

LINGO methods provide tools to:

  1. Quantify the similarity between molecules.
  2. Quantitatively predict a large number of physical properties of a given compound.
  3. Propose possible biological activities by relating new compounds to active molecules known for such activity.
  4. Find possible bioisosteric molecules in PubChem or in databases of commercially available compounds.

The only input for LINGO methods is the systematic NAME of each molecule, such as a SMILES string or an IUPAC name.

Why NAMES?

  • SYSTEMATIC chemical names have been designed to capture the structure of chemical compounds unambiguously in a ONE-DIMENSIONAL representation.
  • The SYSTEMATIC NAME of a molecule implicitly contains most of the information that is particular to its CHEMICAL STRUCTURE but in a notably compact form.
  • NAME-FUNCTION relationships are analogous to STRUCTURE-FUNCTION relationships and NAMES can be used instead of structures to compare molecules or predict properties.
  • NAMES are simply processed as STRINGS OF CHARACTERS using LINGO tools, thus allowing rapid processing and making them highly suitable for searching very large databases.

Does it work?

  • LINGO-based similarities successfully discriminate between bioisosteres and random pairs [1].
  • LINGO-based property prediction of LogP or LogS is as good as the best standard methods but much faster [1].
  • LINGO has been applied to predict 18 properties with cross-validated correlation coefficients Q² > 0.85 for more than 400,000 compounds [2].
  • The accuracy of LINGO similarities is comparable to path-based fingerprints; however, the computations are much more efficient [3].
  • LINGO has been introduced into the MPA search algorithm [4]. In an experimental validation, 34 ligands predicted for a low molecular weight phosphatase were selected from a 500,000-compound database, and 9 of them were confirmed as hits [5].
  • LINGO has been combined with the affinity propagation algorithm to characterize the structure of the PubChem database [6].
  • Clusters based exclusively on SMILES strings group compounds that bind to the same biological targets and discriminate against decoys in the DUD database [6].

What can I do in the ChemNprop website?

  1. Compare the LINGO-based similarity between two arbitrary text strings (LINGOsim Simple).
  2. Compare the LINGO similarities between the SMILES representations of two molecules drawn using standard tools (LINGOsim Simple).
  3. Compare the LINGO similarities between the SMILES representations of a group of molecules (LINGOsim Multiple).
  4. Predict the physical properties of a given compound (LINGOprop Simple).
  5. Predict the physical properties of a group of compounds (LINGOprop Multiple).
  6. Search for molecules similar to a given one in the PubChem database (LINGOsim DB).
  7. Search for molecules similar to a given one in databases of commercially available compounds (LINGOsim DB).
  8. Place the molecule of interest in an intrinsically ordered list of compounds related by mutual similarities (Virtual LINGO Chromatography).
  9. Search for the most similar compounds with known biological activities to predict the possible activities of a given compound, i.e. consult the “Yellow Pages” of chemical databases.

What are LINGOs and how are they produced from names?

  • LINGOs are fixed-length substrings extracted from the text representing the SYSTEMATIC name of a molecule. Standard LINGOs used in this website are 4 characters in length.
  • The LINGO profile is the ensemble of LINGOs generated from the complete name of the molecule.
  • Similarities are calculated by comparing LINGO profiles using a Tanimoto coefficient.
  • Properties are calculated from LINGO profiles by means of a linear model in which each specific LINGO is given a weight determined by calibration with a training set of molecules with known properties.
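
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this website: the function names (lingo_profile, tanimoto, predict_property), the example weights and the intercept are hypothetical, the similarity is the generic multiset form of the Tanimoto coefficient (the published LINGO formula may differ in detail), and any preprocessing of SMILES strings applied by the published method is omitted.

```python
from collections import Counter


def lingo_profile(name: str, q: int = 4) -> Counter:
    """Count every overlapping q-character substring (LINGO) of a name.

    Names shorter than q characters yield an empty profile.
    """
    return Counter(name[i:i + q] for i in range(len(name) - q + 1))


def tanimoto(a: Counter, b: Counter) -> float:
    """Generic multiset Tanimoto coefficient between two LINGO profiles.

    Assumption: shared counts divided by combined counts; this is a
    stand-in for the similarity used by the published method.
    """
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    combined = sum(max(a[k], b[k]) for k in keys)
    return shared / combined if combined else 0.0


def predict_property(profile: Counter, weights: dict, intercept: float = 0.0) -> float:
    """Linear model: intercept plus the sum of (LINGO count x LINGO weight).

    LINGOs absent from the calibrated weights contribute nothing.
    """
    return intercept + sum(n * weights.get(lingo, 0.0) for lingo, n in profile.items())


if __name__ == "__main__":
    butanol = lingo_profile("CCCCO")      # LINGOs: CCCC, CCCO
    butylamine = lingo_profile("CCCCN")   # LINGOs: CCCC, CCCN
    print(tanimoto(butanol, butylamine))  # one shared LINGO out of three

    # Hypothetical weights; real ones come from calibration on a training set.
    toy_weights = {"CCCC": 0.5, "CCCO": -0.2}
    print(predict_property(butanol, toy_weights, intercept=1.0))
```

Because both operations reduce to counting substrings and looking them up in a table, the cost grows linearly with the length of the name, which is consistent with the speed claims made above.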

[1] D. Vidal, M. Thormann, M. Pons. LINGO, an Efficient Holographic Text Based Method To Calculate Biophysical Properties and Intermolecular Similarities. J. Chem. Inf. Model. 2005, 45, 386-393.

[2] M. Thormann, D. Vidal, M. Almstetter, M. Pons. Nomen est omen: quantitative prediction of molecular properties directly from IUPAC names. The Open Applied Informatics Journal. 2007, 1, 28-32.

[3] G. A. Grant, J. A. Haigh, B. T. Pickup, A. Nicholls, R. A. Sayle. Lingos, finite state machines and fast similarity searching. J. Chem. Inf. Model. 2006, 46, 1912-1918.

[4] D. Vidal, M. Thormann, M. Pons. A novel search engine for virtual screening of very large databases. J. Chem. Inf. Model. 2006, 46, 836-843.

[5] D. Vidal, J. Blobel, Y. Pérez, M. Thormann, M. Pons. Structure-based discovery of new small molecule inhibitors of low molecular weight protein tyrosine phosphatase. Eur. J. Med. Chem. 2007, 42, 1102-1108.

[6] G. Cincilla, M. Thormann, M. Pons. Structuring chemical space: similarity-based characterization of the PubChem database. Mol. Inf. 2010, 29, 37-49.