Chemical Structures

KEGG COMPOUND is one of the four original databases, together with KEGG PATHWAY, KEGG GENES and KEGG ENZYME, introduced at the start of the KEGG project in 1995. It is a collection of small molecules, biopolymers, and other chemical substances that are relevant to biological systems. Each entry is identified by a C number, such as C00047 for L-lysine, and contains the chemical structure and associated information, as well as links to other KEGG databases and to outside databases. Some COMPOUND entries are also represented as GLYCAN and DRUG entries, connected through "Same as" links.
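
As a minimal sketch of how C numbers can be handled programmatically, the following Python snippet checks that a string looks like a C number and builds a retrieval URL for the public KEGG REST service. The five-digit pattern is inferred from the examples in this text (C00047, C11996, C00873), and the function names are hypothetical, chosen only for illustration.

```python
import re

# Assumption: a C number is the letter C followed by five digits,
# as in the examples C00047, C11996 and C00873.
C_NUMBER = re.compile(r"C\d{5}")

KEGG_REST_BASE = "https://rest.kegg.jp"  # public KEGG REST service


def is_c_number(identifier: str) -> bool:
    """Return True if the string has the shape of a KEGG COMPOUND C number."""
    return C_NUMBER.fullmatch(identifier) is not None


def compound_url(identifier: str) -> str:
    """Build the REST URL that returns one COMPOUND entry as flat text."""
    if not is_c_number(identifier):
        raise ValueError(f"not a C number: {identifier!r}")
    return f"{KEGG_REST_BASE}/get/cpd:{identifier}"


if __name__ == "__main__":
    print(compound_url("C00047"))  # L-lysine
```

The snippet only constructs the URL; fetching and parsing the entry is left out so that the example runs without network access.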

While GLYCAN entries are represented as tree structures with monosaccharide codes, COMPOUND entries for peptides and polyketides, such as C11996 for methymycin, are represented as sequences using the abbreviation codes for the monomeric units of amino acids and carboxylic acids. Naturally processed peptides from gene products, such as C00873 for human angiotensin I, are sometimes represented in the KEGG COMPOUND database. They are always associated with sequence information using the three-letter amino acid codes, but they may or may not contain the full atomic structure representation.
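
To illustrate what a three-letter sequence representation carries, the sketch below converts such a sequence into one-letter codes, using the well-known ten-residue sequence of angiotensin I as input. The separator handling (spaces or hyphens) is an assumption made for this example and is not a description of the exact KEGG field layout; only the 20 standard amino acids are covered.

```python
# Standard three-letter to one-letter amino acid codes (20 common residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(sequence: str) -> str:
    """Convert a three-letter peptide sequence to one-letter codes.

    Assumption: residues are separated by spaces or hyphens.
    """
    residues = sequence.replace("-", " ").split()
    unknown = [r for r in residues if r not in THREE_TO_ONE]
    if unknown:
        raise ValueError(f"unknown residue code(s): {unknown}")
    return "".join(THREE_TO_ONE[r] for r in residues)


if __name__ == "__main__":
    angiotensin_i = "Asp Arg Val Tyr Ile His Pro Phe His Leu"
    print(to_one_letter(angiotensin_i))  # DRVYIHPFHL
```

Non-standard or modified residues, which can occur in nonribosomal peptides, would need an extended table and are deliberately rejected here.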

Chemical Compound Categories

The classifications of representative categories of KEGG COMPOUND, as well as of biosynthetic genes, are given in BRITE hierarchy files.

Chemical Compounds in Pathways and Diseases

The role of KEGG COMPOUND has always been to enable links from molecular-level data to molecular network-level data. Chemical compound entries are constituents of KEGG pathway maps, KEGG modules, reaction modules and network variation maps. They are used, for example, to analyze metabolome data and uncover higher-level functional features with the KEGG Mapper tools.
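
The idea of linking compound-level data to pathway-level data can be sketched in a few lines of Python. The example assumes a two-column, tab-separated compound-to-pathway table in the style returned by the KEGG REST "link" operation; the three sample rows are an illustrative subset, not a complete extract, and the function names are hypothetical. This is not how KEGG Mapper itself is implemented; it only shows the counting step.

```python
from collections import Counter

# Illustrative rows only: compound ID, tab, pathway map ID.
SAMPLE_LINKS = (
    "cpd:C00047\tpath:map00300\n"
    "cpd:C00047\tpath:map00310\n"
    "cpd:C00049\tpath:map00300\n"
)


def parse_links(text: str) -> dict:
    """Parse 'cpd:Cxxxxx<TAB>path:mapxxxxx' lines into {compound: {pathways}}."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        compound, pathway = line.split("\t")
        # Drop the database prefixes ("cpd:", "path:") to keep bare IDs.
        compound = compound.split(":", 1)[1]
        pathway = pathway.split(":", 1)[1]
        mapping.setdefault(compound, set()).add(pathway)
    return mapping


def pathway_hits(measured, mapping) -> Counter:
    """Count how many distinct measured compounds fall on each pathway."""
    hits = Counter()
    for compound in set(measured):
        for pathway in mapping.get(compound, ()):
            hits[pathway] += 1
    return hits


if __name__ == "__main__":
    mapping = parse_links(SAMPLE_LINKS)
    # C99999 stands in for a measured compound with no pathway link.
    for pathway, count in pathway_hits(["C00047", "C00049", "C99999"], mapping).most_common():
        print(pathway, count)
```

A real analysis would normally go further, for instance by testing whether a pathway is hit more often than expected by chance.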

In addition, chemical compound entries are used to represent disease-associated perturbed networks, such as nt06014 for congenital disorders of sphingolipid metabolism in the KEGG NETWORK database.

Biosynthetic codes

The structures of DNA, RNA, and proteins are determined by template-based syntheses of replication, transcription, and translation with the genetic code. In contrast, the structures of glycans, lipids, polyketides, nonribosomal peptides, and various plant secondary metabolites are determined by biosynthetic pathways. Attempts are being made to develop overview pathway maps by defining reaction modules and to understand such biosynthetic codes.

Biodegradation codes

KEGG COMPOUND also contains xenobiotic compounds, many of which may be broken down by microbial degradation pathways. Here again, attempts are being made to develop overview pathway maps together with reaction modules.

Database Search Tools

SIMCOMP and SUBCOMP are database search programs for similar chemical structures. SIMCOMP uses graph matching to find maximal common subgraphs while allowing mismatches, whereas SUBCOMP is a bit-string-based search program that finds exactly matching substructures or superstructures.
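
The general idea behind a bit-string search can be sketched as follows. This is a generic fingerprint screen, not SUBCOMP's actual algorithm or encoding: each structure is reduced to a set of features, each feature sets one bit, and a query can only be a substructure of a target if every query bit is also set in the target. The feature vocabulary and function names are hypothetical.

```python
# Hypothetical feature vocabulary: each structural feature gets one bit position.
VOCABULARY = {"amine": 0, "carboxyl": 1, "hydroxyl": 2, "aromatic_ring": 3}


def fingerprint(features) -> int:
    """Encode a set of feature names as an integer bit string."""
    bits = 0
    for feature in features:
        bits |= 1 << VOCABULARY[feature]
    return bits


def may_contain(target_bits: int, query_bits: int) -> bool:
    """Return True if every bit set in the query is also set in the target.

    This is a necessary condition for the query to be a substructure of
    the target; swapping the arguments gives the superstructure test.
    """
    return (target_bits & query_bits) == query_bits


if __name__ == "__main__":
    amino_acid_like = fingerprint({"amine", "carboxyl"})
    acid_query = fingerprint({"carboxyl"})
    print(may_contain(amino_acid_like, acid_query))   # True
    print(may_contain(acid_query, amino_acid_like))   # False
```

In a simple screen like this one, a bit-level match only shows that the features are present, not that they are connected in the right way, so it is usually followed by an exact structure comparison.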

Last updated: September 8, 2023