KEGG in Keg

KEGG Syntax – Genome Alignment

Genome Alignment

Genome alignment is usually done by aligning the nucleotide sequences of two genomes. Here, instead, each genome is treated as a sequence of genes, and gene order alignment is performed using a standard sequence comparison method with an appropriate measure of gene similarity. In KEGG this similarity is represented by ortholog groupings: KO (KEGG Orthology) and now VOG (virus ortholog group) as well. Using the Goad-Kanehisa algorithm, we have developed a new tool that finds all instances of locally similar gene orders in two genomes above a given threshold, for example between the human and mouse genomes (hsa vs mmu).

Gene order alignment tool

The tool can be applied to both KO sequences and VOG sequences. The former is appropriate for comparing cellular organisms and the latter may be applied to comparing viruses and cellular organisms.

More details about this tool
  1. Gene orders are available for KEGG organisms with the NCBI assembly level of "Complete Genome" or "Chromosome" (see br08611) and all viruses (see br08621).
  2. Since genes are labeled with K numbers (and VOG numbers), the gene order is converted to a sequence of K numbers (or VOG numbers).
  3. Genes for CDS, tRNA and rRNA are considered for K number sequences, and CDS only for VOG sequences.
  4. The algorithm is applied to the comparison of two such gene order sequences with the scoring of
       match: 1, mismatch: -1, gap: -1, neutral: 0
    where neutral means the alignment of genes without K numbers (or VOG numbers).
  5. Locally similar gene orders with the score of 3 or more are reported.
  6. When genes with the same K numbers are repeated, they are combined into a single node with the number of repeats in parentheses in the output, enabling the alignment of varying numbers of repeats in two sequences.
  7. In addition, long stretches of neutrals (currently 5 or more) are combined into a single node and given a penalty to separate segments to be compared.
  8. The number associated with each genome in the output is the total number of nodes after processing of repeats and neutral stretches.
  9. When genes on the complementary strand are matched, they are marked with "<" in the output.
  10. Each pair of gene order sequences is compared twice: in the forward-forward and forward-reverse directions.
  11. The reverse direction is marked with "(r)" in the output.
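
The preprocessing and scoring rules listed above can be illustrated with a short Python sketch. It is not the tool's implementation: it uses a plain Smith-Waterman-style recurrence rather than the Goad-Kanehisa algorithm, and it reports only the best local score instead of all local similarities. The function names, the `BREAK` marker, and the `BREAK_PENALTY` value are assumptions made for illustration; the page does not state the penalty applied to long neutral stretches.

```python
from itertools import groupby

# Scoring from item 4 and thresholds from items 5 and 7.
MATCH, MISMATCH, GAP, NEUTRAL = 1, -1, -1, 0
MIN_SCORE = 3        # report local similarities scoring 3 or more
NEUTRAL_RUN = 5      # neutral stretches of 5 or more become one node
BREAK = "BREAK"      # hypothetical marker for a long neutral stretch
BREAK_PENALTY = -3   # assumed value; the real penalty is not documented here


def collapse(labels):
    """Convert gene labels (None = no K/VOG number) to (label, count) nodes."""
    nodes = []
    for label, run in groupby(labels):
        n = len(list(run))
        if label is None and n >= NEUTRAL_RUN:
            nodes.append((BREAK, n))                    # item 7
        elif label is None:
            nodes.extend((None, 1) for _ in range(n))   # short neutral run
        else:
            nodes.append((label, n))                    # item 6: repeats merged
    return nodes


def pair_score(x, y):
    """Score for aligning two node labels with each other."""
    if x == BREAK or y == BREAK:
        return BREAK_PENALTY
    if x is None or y is None:
        return NEUTRAL
    return MATCH if x == y else MISMATCH


def gap_cost(x):
    """Cost of aligning a node label with a gap."""
    return BREAK_PENALTY if x == BREAK else GAP


def best_local_score(a, b):
    """Best local alignment score of two node lists (scores clamped at 0)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x, _ in a:
        cur = [0]
        for j, (y, _) in enumerate(b, start=1):
            s = max(0,
                    prev[j - 1] + pair_score(x, y),
                    prev[j] + gap_cost(x),
                    cur[j - 1] + gap_cost(y))
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def compare(a_labels, b_labels):
    """Compare forward-forward ("ff") and forward-reverse ("fr"), as in item 10."""
    a = collapse(a_labels)
    results = {}
    for tag, seq in (("ff", b_labels), ("fr", b_labels[::-1])):
        score = best_local_score(a, collapse(seq))
        if score >= MIN_SCORE:
            results[tag] = score
    return results


genome_a = ["K1", "K2", "K2", "K2", "K3", None, "K4"]
genome_b = ["K1", "K2", "K3", "K9", "K4"]
print(compare(genome_a, genome_b))  # {'ff': 4}
```

In this toy input the three consecutive K2 genes collapse into one node, so they align with the single K2 in the other genome, and the unlabeled gene aligns with K9 at no cost. Strand marking ("<") is not modeled.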

About Goad-Kanehisa Algorithm

In the early 1980s, during the pre-GenBank project of the Los Alamos Sequence Library, Goad and Kanehisa developed an algorithm for finding locally similar regions of two sequences and reported it in Nucleic Acids Res 10:247-263 (1982). The essence of this algorithm is to prune paths by taking a logical product of forward and reverse path matrices, in addition to the pruning that results from a weighting scheme that does not allow negative score values. The latter is similar to the Smith-Waterman algorithm, as mentioned in their Note added in proof.

For protein and nucleic acid sequence alignments, the approach taken by Smith and Waterman for finding the best local similarity is sufficient. However, for the gene order alignment of two genomes, in which many gene positions are likely to be split and changed, the Goad-Kanehisa algorithm is better suited for finding a comprehensive set of local similarities.
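
The idea of combining forward and reverse matrices can be illustrated with a deliberately simplified Python sketch. This is not the published Goad-Kanehisa procedure, only one reading of the "logical product" described above: a cell is kept when it has a positive clamped score both in a matrix filled from the sequence starts and in a matrix filled from the sequence ends. The function names `local_matrix` and `pruned_cells` are hypothetical.

```python
def local_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score ending at each cell, never below 0."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
    return h


def pruned_cells(a, b):
    """Cells (0-based) that are positive in both forward and reverse matrices."""
    n, m = len(a), len(b)
    fwd = local_matrix(a, b)
    rev = local_matrix(a[::-1], b[::-1])  # filled from the sequence ends
    keep = set()
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # rev[n - i + 1][m - j + 1] refers to the same pair (a[i-1], b[j-1])
            if fwd[i][j] > 0 and rev[n - i + 1][m - j + 1] > 0:
                keep.add((i - 1, j - 1))
    return keep


a = ["K1", "K2", "K3"]
b = ["K9", "K1", "K2", "K3", "K8"]
print(sorted(pruned_cells(a, b)))  # [(0, 1), (1, 2), (2, 3)]
```

In this example the forward matrix alone also has positive cells that merely trail off from the matching diagonal through gaps or mismatches; requiring a positive reverse score as well removes them and leaves only the three matched gene pairs.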

Last updated: September 23, 2024