===================================
README for the "genes" subdirectory
===================================

Genome annotation in KEGG is the process of linking genes in complete
genomes to molecules in the molecular networks. It is represented mainly by
the following three databases:

  KEGG ORTHOLOGY (KO) - consists of KEGG Orthology (KO) entries representing
                        functional orthologs that generally correspond to
                        KEGG pathway nodes and BRITE hierarchy nodes
  KEGG GENES          - a collection of gene catalogs for all complete
                        genomes (KEGG organisms) generated mostly from NCBI
                        RefSeq and GenBank, supplemented by the viruses and
                        addendum categories
  KEGG GENOME         - contains organism-level information about complete
                        genomes

This directory contains the following files:

  ko.tar.gz               - KO entries
  genome.tar.gz           - genome entries
  organisms/org/          - annotated gene catalogs of KEGG organisms
    org/Tnum.ent.gz       - entry file
    org/Tnum.pep.gz       - amino acid sequence file
    org/Tnum.nuc.gz       - nucleotide sequence file
    org/Tnum.genome.gz    - genome sequence file
    org/Tnum.kff.gz       - KEGG feature format file
    org/org_link.tar.gz   - link information files
  organisms_new/org/      - unannotated gene catalogs of KEGG organisms
  viruses/                - gene set of all viruses
  addendum/               - gene set of addendum category
  fasta/                  - fasta sequence files and auxiliary files
                            (see separate README file)
  oc/oc.gz                - Ortholog Clusters (OC) computationally generated
                            from KEGG SSDB
  misc/                   - miscellaneous files for taxonomy, etc.
  links/                  - ID mapping files of GENES to/from KO, PATHWAY,
                            MODULE, NCBI-GeneID, NCBI-ProteinID and UniProt

where "org" is the three- or four-letter organism code and "Tnum" is the
T number identifier. The T numbers and two-letter codes for "viruses" and
"addendum" are T40000 (vg) and T10000 (ag), respectively. The "viruses" and
"addendum" subdirectories additionally contain T40000.tax.gz and
T10000.tax.gz, respectively, for NCBI taxonomy IDs.

The contents of tarballs are as follows.

ko.tar.gz
---------
  ko                  - DBGET flat file containing all KO entries
  ko_dbname.list      - ID mapping files of KO to/from other KEGG DBs,
                        PubMed, COG, GO, CAZy and TC

genome.tar.gz
-------------
  genome              - DBGET flat file containing all GENOME entries
  genome_dbname.list  - ID mapping files of GENOME to/from other KEGG DBs,
                        PubMed, RefSeq, GenBank, Assembly and Taxonomy

org_link.tar.gz
---------------
  org_dbname.list     - ID mapping files of KEGG organisms to/from other
                        KEGG DBs, NCBI-GeneID, NCBI-ProteinID, OMIM, UniProt,
                        and many other databases

The .kff file contains the following information for each gene in the
genome:

  column 1  - KEGG GENES identifier (in the form of org:gene)
  column 2  - Feature key such as CDS, tRNA, rRNA, ncRNA, lncRNA, miRNA,
              and gene (usually meaning pseudogene)
  column 3  - Amino acid sequence length
  column 4  - Nucleotide sequence length
  column 5  - Genomic position
  column 6  - NCBI GeneID
  column 7  - NCBI ProteinID
  column 8  - Gene name
  column 9  - Definition in the original DB
  column 10 - KO identifier (K number)
  column 11 - KO definition

All the amino acid sequence files (Tnum.pep.gz) in KEGG GENES are
concatenated into a single file and stored in the fasta directory. Some of
the outside links (org_link.tar.gz) for all organisms in KEGG GENES are also
duplicated in the links directory.

KEGG SSDB (not distributed) is a huge database containing similarity scores
computed by SSEARCH and best-hit relations determined for all genome pairs
in KEGG GENES. This information is used in the KOALA (KEGG Orthology and
Links Annotation) tool to perform, both computationally and manually,
cross-species annotation (K number assignment) of KEGG GENES. KEGG OC is an
auxiliary dataset containing ortholog clusters computationally generated by
a program that searches for quasi-cliques in the SSDB graph.

---
Updated: June 14, 2017
(c) Kanehisa Laboratories
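
Appendix: reading a .kff file (illustrative sketch)
---------------------------------------------------
The eleven .kff columns listed above map naturally onto a record type. The
Python sketch below is not part of the KEGG distribution. It assumes that
each gene occupies one line and that columns are tab-separated, which this
README does not state explicitly. The names KffRecord, parse_kff_line and
read_kff are hypothetical labels chosen for illustration.

```python
import gzip
from typing import Iterator, NamedTuple, Optional


class KffRecord(NamedTuple):
    """One gene, with fields in the column order listed in this README."""
    gene_id: str              # column 1: org:gene
    feature: str              # column 2: CDS, tRNA, rRNA, ...
    aa_length: Optional[int]  # column 3: None if empty or non-numeric
    nt_length: Optional[int]  # column 4: None if empty or non-numeric
    position: str             # column 5
    ncbi_gene_id: str         # column 6
    ncbi_protein_id: str      # column 7
    gene_name: str            # column 8
    definition: str           # column 9
    ko_id: str                # column 10
    ko_definition: str        # column 11


def _to_int(text: str) -> Optional[int]:
    """Convert a length column to int, or None when it holds no digits."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def parse_kff_line(line: str) -> KffRecord:
    """Parse one line. Assumption: columns are separated by tabs."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (11 - len(fields))  # pad short lines with empty strings
    f = fields[:11]
    return KffRecord(f[0], f[1], _to_int(f[2]), _to_int(f[3]), *f[4:])


def read_kff(path: str) -> Iterator[KffRecord]:
    """Yield one record per non-blank line of a gzipped .kff file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_kff_line(line)
```

With these definitions, iterating over read_kff("Tnum.kff.gz") (with Tnum
replaced by an actual T number) yields one KffRecord per gene. Records with
a non-empty ko_id are the genes that have a K number assigned.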