Laboratoire des systèmes solaires (L2S).
50 av. du Lac Léman,


Lespinats, S., Deschavanne, P., Giron, A. and Fertil, B. (2003) “DNA sequences share a common syntax.” European Conference on Computational Biology (ECCB 2003), September 2003, pp. 533-534.

Abstract:
The usage of short oligonucleotides in sequences (the so-called genomic signature) has been shown to be species-specific. Since the genomic signature can be observed in DNA segments as short as 1 kb, it appears to result from a “style” that characterizes the organization of DNA throughout each genome. As a consequence, given a short DNA segment, it is generally possible to identify its species of origin, provided that the signature of that species is already known.
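As an illustration of what a genomic signature can look like in practice, the sketch below computes the relative frequencies of all oligonucleotides of length k in a sequence. The function name `signature`, the default word length `k=4`, and the single-strand overlapping count are assumptions made for this example; the abstract does not specify them.

```python
from itertools import product

def signature(seq, k=4):
    """Relative frequency of every length-k oligonucleotide in seq.

    Assumptions for illustration: words are counted with overlap on a
    single strand, and words containing ambiguous bases (e.g. N) are skipped.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    # Frequencies are returned in lexicographic word order (AAAA, AAAC, ...).
    return [counts[w] / total if total else 0.0 for w in words]
```

The result is a vector with 4**k entries, so two segments can be compared entry by entry regardless of their lengths.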

Using a Euclidean metric to measure the distances between signatures, we have undertaken the systematic analysis of 43 genomes, screening each of them with a sliding window to obtain samples of local DNA signatures. With a simple nearest-neighbor classifier, the origin of DNA segments is identified with high efficiency. A genetic algorithm was used to find the subsets of oligonucleotides that provide the highest classification rates. It appears that oligonucleotides contribute unequally to the recognition process: some oligonucleotides are always found in the best subsets, whereas others are always discarded.
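The screening and classification steps can be sketched as follows, under stated assumptions: each species is represented by one reference signature, dinucleotides (`k=2`) are used to keep the example small, and the helper names `signature`, `windows` and `nearest_species` are hypothetical. The genetic-algorithm selection of oligonucleotide subsets is not covered here.

```python
import math
from itertools import product

def signature(seq, k=2):
    """Relative frequencies of all length-k words (overlapping, one strand)."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k].upper()
        if word in counts:
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]

def windows(seq, size, step):
    """Yield successive segments of length `size`, moving by `step` bases."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]

def nearest_species(segment, references, k=2):
    """Name of the reference whose signature is closest in Euclidean distance."""
    sig = signature(segment, k)
    return min(references, key=lambda name: math.dist(sig, references[name]))

# Toy reference signatures; real ones would come from whole genomes.
references = {
    "at_rich": signature("ATATATATATATAATTAT"),
    "gc_rich": signature("GCGCGGCCGCGCGGCC"),
}
labels = [nearest_species(w, references) for w in windows("ATTATATAATATTA", 10, 2)]
```

Here `labels` holds one predicted origin per window; the classification rate for a genome would be the fraction of its windows assigned to the correct species.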

We developed a method for quantifying the variability of oligonucleotide frequency along the genome, accounting for the mean frequency and the self-overlapping of oligonucleotides. Results show a consensus among species about the variation of oligonucleotide frequencies. In particular, the oligonucleotides with the most variable usage along genomes are common to most species. Others share the property of frequency invariance along and among genomes.
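The abstract does not give the variability measure itself, so the sketch below substitutes a plain coefficient of variation of per-window counts, which normalizes by the mean frequency but does not correct for self-overlap. The separate helper `self_overlaps` only reports whether a word can overlap itself (as “ATAT” or “AAAA” can), which is the property that makes such a correction necessary. All function names are hypothetical.

```python
from statistics import mean, pstdev

def window_counts(seq, word, size, step):
    """Overlapping occurrence counts of `word` in each sliding window."""
    k = len(word)
    counts = []
    for start in range(0, len(seq) - size + 1, step):
        win = seq[start:start + size]
        counts.append(sum(1 for i in range(size - k + 1) if win[i:i + k] == word))
    return counts

def coefficient_of_variation(counts):
    """Standard deviation divided by the mean (0.0 if the word never occurs)."""
    m = mean(counts)
    return pstdev(counts) / m if m else 0.0

def self_overlaps(word):
    """Shifts at which `word` overlaps itself (a suffix equals a prefix)."""
    n = len(word)
    return [s for s in range(1, n) if word[s:] == word[:n - s]]
```

A word with a non-empty `self_overlaps` list tends to occur in clumps, which inflates the variance of its counts independently of any biological signal; this is why a raw coefficient of variation is only a rough stand-in for the measure described above.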

Some elements of a DNA syntax may subsequently be proposed, based on the frequency properties of oligonucleotides. On the one hand, “function” oligonucleotides can be identified as having a syntactic role (like “the”, “of” or “or” in human sentences) common to most species. “Content” oligonucleotides, on the other hand, may be characterized by a more variable, and less species-specific, usage along the genome. In fact, “content” oligonucleotides are rarely found in the most efficient oligonucleotide subsets for the recognition of segment origin. The style of each species thus seems to result more from the characteristics of “function” oligonucleotides than from those of “content” oligonucleotides.
