Abstract: |
Revealing evolution of organisms is one of important biological research topics, and is also useful for understanding the origin of organisms. Hence, genomic sequences have been compared and aligned for finding conserved and functional regions. A protein can contain several domains, which are known as structural and functional units. In the previous work, a proteome, whole kinds of proteins in an organism, was regarded as a set of sequences of protein domains, and a grammar-based compression algorithm was developed for a proteome, where production rules in the grammar represented evolutionary processes, mutation and duplication. In this paper, we propose a similarity measure based on the grammar-based compression, and apply it to hierarchical clustering of seven organisms, Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana, and Escherichia coli. The results suggest that our similarity measure could classify the organisms very well. |