CEA, LITEN,
Laboratoire des systèmes solaires (L2S).
50 av. du Lac Léman,
73375 LE BOURGET DU LAC - CEDEX

 

Ph.D. Thesis


Ph.D. thesis from the Université Pierre et Marie Curie (Paris VI, Paris, France)

Genome style explored by textual data analysis of DNA



            Defended on 10 April 2006 before:

Bernard FERTIL Research director, CNRS, Paris  Supervisor
Jean-Gabriel GANASCIA Professor, Université Paris VI, Paris President
Alain GUENOCHE Research director, CNRS, Marseille Reviewer
Jeanny HERAULT Professor, Université de Grenoble I, Grenoble Reviewer
Eric MARECHAL Research fellow, CNRS, Grenoble Examiner
Michel VERLEYSEN Professor, Université Catholique de Louvain, Louvain-la-Neuve, Belgium Examiner
Abstract: Genome style explored by textual data analysis of DNA
DNA sequences can be regarded as texts written in a four-letter alphabet. A technique inspired by textual data analysis characterizes these sequences by the frequencies of short oligonucleotides (or words). The complete set of word frequencies is called the “genomic signature” (the term “signature” is justified because this set is species-specific). Since the genomic signature can be observed in DNA segments as short as 1 kb, it appears to result from a “writing style” that characterizes the organization of DNA throughout each genome. Moreover, species that are close in terms of genomic signature are often close taxonomically as well. However, the analysis of genomic signatures quickly runs into limitations due to the curse of dimensionality. Indeed, high-dimensional data (the genomic signature generally has 256 dimensions) show unusual properties; a well-known example is the concentration of Euclidean distances.
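As an illustration of the word-frequency idea described above, here is a minimal sketch in Python. It assumes overlapping words read on a single strand and a word length of 4 (which gives the 256 dimensions mentioned above); the function name `genomic_signature` and the handling of non-ACGT letters are choices made for this example, not necessarily those of the thesis.

```python
from collections import Counter
from itertools import product


def genomic_signature(seq, k=4):
    """Return the frequency of each of the 4**k words of length k in seq.

    Assumptions of this sketch: overlapping windows, a single strand,
    and windows containing letters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Normalise only over valid words so the frequencies sum to 1.
    total = sum(counts[w] for w in words)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: counts[w] / total for w in words}
```

For example, `genomic_signature(sequence)` returns a dictionary with 256 entries, one per 4-letter word, which can be treated as a point in a 256-dimensional space.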
From these observations, we set up procedures to evaluate metrics in order to bring out the biological information that can be extracted from genomic signatures. An associated non-linear method for representing neighborhoods escapes the curse of dimensionality and makes it possible to visualize the space occupied by the data. Analyzing the relations between signatures raises the problem of the contribution of each variable (the words) to the distance between signatures. An original Z-score based on the variation of word frequencies along genomes makes it possible to quantify these contributions. Comparing “local signatures” makes it possible to extract original (atypical) regions. In addition, these regions are segmented precisely by a method based on signal analysis.
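The idea of comparing “local signatures” along a genome can be sketched as follows. This is only an illustration under stated assumptions: each local signature is compared with the signature of the whole sequence, and the plain Euclidean distance is used as a stand-in for whichever metric is actually retained in the thesis. The window size, step, and function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def signature(seq, k=4):
    """Word frequencies of seq as a list ordered over all 4**k words."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def local_distances(genome, window=1000, step=500, k=4):
    """Distance between each local signature and the whole-sequence one.

    Returns a list of (window_start, distance) pairs. The Euclidean
    distance is an assumption of this sketch, not the thesis metric.
    """
    genome = genome.upper()
    reference = signature(genome, k)
    result = []
    for start in range(0, len(genome) - window + 1, step):
        local = signature(genome[start:start + window], k)
        result.append((start, math.dist(reference, local)))
    return result
```

Windows whose distance stands out from the rest are candidates for atypical regions; locating their exact boundaries is a separate step that this sketch does not attempt.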
From this set of methods, we obtain a variety of biological results. In particular, we highlight an organization of the genomic signature space that is consistent with species taxonomy. Moreover, we note the presence of a “DNA syntax”: there are “syntactic words” and “semantic words”, and the signature relies mainly on the syntactic words. Lastly, the analysis of signatures along the genome allows the detection and precise segmentation of RNA and of probable horizontal transfers. The convergence of the style of horizontally transferred regions towards the host signature can also be observed.
Signature analysis thus yielded a variety of results. Its ease of use and speed make genomic signature analysis a powerful tool for extracting biological information from genomes.

 

Sorry, my thesis is not available in English. A French version can be downloaded here.

_____________________________________________________________________


Master's thesis 2002

_____________________________________________________________________