Precomputed sequence similarities
KEGG SSDB (Sequence Similarity DataBase) contains information about amino acid sequence similarities among all protein-coding genes in the complete genomes, as well as the addendum and virus categories, of the GENES database.
All possible pairwise genome (and category) comparisons are performed with the SSEARCH program, and gene pairs with a Smith-Waterman similarity score of 100 or more are entered in SSDB, together with information about best hits and bidirectional best hits (best-best hits).
SSDB is thus a huge weighted, directed graph, which can be used to search for orthologs and paralogs, as well as for conserved gene clusters when positional correlations on the chromosome are also considered.
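As a rough illustration of this graph view, the sketch below stores gene pairs as weighted directed edges and keeps only those that reach the score cutoff of 100. The gene identifiers, the scores, and the function name build_graph are hypothetical; this is not the actual storage format of SSDB.

```python
from collections import defaultdict

SCORE_THRESHOLD = 100  # Smith-Waterman score cutoff stated in the text


def build_graph(pairs, threshold=SCORE_THRESHOLD):
    """Build an adjacency map query -> {target: score} from (query, target, score) triples."""
    graph = defaultdict(dict)
    for query, target, score in pairs:
        if score >= threshold:  # pairs scoring below the cutoff are not stored
            graph[query][target] = score
    return dict(graph)


# Hypothetical identifiers and scores, for illustration only.
pairs = [
    ("A:x1", "B:y1", 250),
    ("A:x1", "B:y2", 90),  # below the cutoff, dropped
    ("B:y1", "A:x1", 250),
    ("B:y1", "A:x2", 120),
]
graph = build_graph(pairs)
```

Each edge points from the query gene to the gene it hit, so the two directions of a pair are stored separately.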
(Note) The option to search reverse best hits is discontinued; "forward best" is now simply called "best".
The relationship of gene x in genome A and gene y in genome B is defined as follows:
forward best: x is compared against all genes in genome B and y is found as top-scoring
reverse best: y is compared against all genes in genome A and x is found as top-scoring
best-best: both of these relationships hold
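The definitions above can be sketched in a few lines of Python. The nested dictionary layout (gene, then target genome, then hit scores), the gene names, and the helper names best_hit and is_best_best are assumptions made for this example only; ties are resolved arbitrarily by max.

```python
def best_hit(scores, gene, target_genome):
    """Return the top-scoring gene in target_genome for `gene`, or None if there are no hits."""
    hits = scores.get(gene, {}).get(target_genome, {})
    if not hits:
        return None
    return max(hits, key=hits.get)


def is_best_best(scores, x, genome_a, y, genome_b):
    """True if y is the best hit of x in genome B and x is the best hit of y in genome A."""
    return best_hit(scores, x, genome_b) == y and best_hit(scores, y, genome_a) == x


# Hypothetical scores: gene -> target genome -> {hit gene: score}
scores = {
    "x": {"B": {"y": 300, "y2": 150}},
    "x2": {"B": {"y": 280}},
    "y": {"A": {"x": 300, "x2": 280}},
}
```

In this toy data, y is the best hit of both x and x2, but only the pair (x, y) is best-best, because the best hit of y in genome A is x.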
Orthologs and paralogs
In order to speed up the search, SSDB is organized as a collection of "GFIT tables" containing selected information that is useful for identifying possible orthologs and paralogs.
This includes not only the score and the direction of best hits, but also the margin, which is the score difference between the best hit and the second best hit.
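The margin can be illustrated with a short sketch. The function name best_hit_with_margin and the input layout are hypothetical, and returning None as the margin when there is no second hit is an assumption of this example, not documented SSDB behavior.

```python
def best_hit_with_margin(hits):
    """Given {target gene: score}, return (best target, best score, margin).

    The margin is the best score minus the second-best score. It is None
    when there is no second hit (an assumption made for this sketch).
    """
    if not hits:
        return None
    ranked = sorted(hits.items(), key=lambda item: item[1], reverse=True)
    best_target, best_score = ranked[0]
    margin = best_score - ranked[1][1] if len(ranked) > 1 else None
    return best_target, best_score, margin
```

A large margin means the best hit stands clearly apart from the other candidates in the same genome; a small margin means a second gene scores almost as high.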


Conserved gene clusters
SSDB can be used to search efficiently for a conserved gene cluster containing the query gene.
First, the query gene and its best hits are considered as an initial cluster.
Second, neighboring genes on both sides of the chromosome are included in the cluster as long as they are also best hits.
Third, gapped genes are included if they are best hits in other locations on the chromosome.
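The first two steps can be sketched for a single pair of genomes as follows. This is a simplified illustration, not the SSDB implementation: the function name extend_cluster, the data layout, and the rule that a neighbor's best hit must lie next to the previous hit in genome B are all assumptions. The third step (gapped genes) is left out because the text does not specify it in enough detail.

```python
def extend_cluster(order_a, pos_b, best, query):
    """Grow a cluster around `query` in genome A.

    order_a: genes of genome A in chromosomal order
    pos_b:   gene of genome B -> chromosomal index
    best:    gene of genome A -> its best hit in genome B (absent if none)
    Returns a list of (gene in A, best hit in B) pairs in genome A order.
    """
    if query not in best:
        return []
    start = order_a.index(query)
    cluster = [(query, best[query])]  # step 1: the query and its best hit
    used = {best[query]}
    for step in (-1, 1):  # step 2: walk left, then right, from the query
        prev_hit = best[query]
        i = start + step
        while 0 <= i < len(order_a):
            gene = order_a[i]
            hit = best.get(gene)
            # Stop if the neighbor has no best hit, reuses a hit, or its hit
            # is not adjacent in genome B to the previous hit (assumed rule).
            if hit is None or hit in used or abs(pos_b[hit] - pos_b[prev_hit]) != 1:
                break
            if step == -1:
                cluster.insert(0, (gene, hit))
            else:
                cluster.append((gene, hit))
            used.add(hit)
            prev_hit = hit
            i += step
    return cluster


# Hypothetical genomes and best hits, for illustration only.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "b4", "b5"]
pos_b = {gene: index for index, gene in enumerate(order_b)}
best = {"a2": "b2", "a3": "b3", "a4": "b4", "a5": "b1"}
cluster = extend_cluster(order_a, pos_b, best, "a3")
```

Here a1 has no best hit and the best hit of a5 lies far from b4, so the cluster stops at a2 on the left and a4 on the right.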

Precomputed sequence motifs
SSDB also contains precomputed Pfam protein domains, here called motifs, for all protein-coding genes.



Last updated: October 1, 2017