Model inference for RNA pattern matching



Research group

Bioinformatics research group SEQUOIA
LIFL, Université Lille 1 / INRIA Lille Nord-Europe
The team develops efficient algorithms and software tools for the analysis of biological sequences: DNA, RNA and proteins. Our fields of expertise are sequence alignment, noncoding RNA analysis, nonribosomal peptides, genome organization and comparative genomics.

Supervisor

Hélène Touzet
helene.touzet [@] lifl.fr, 03 59 57 79 16

Scientific context

Noncoding RNAs are small molecules that are essential for the cell. They take part in a wide range of molecular mechanisms, such as gene regulation. A distinctive feature of noncoding RNAs is that their function is largely determined by the spatial structure formed by base pairings. From a combinatorial point of view, they are complex objects that can be modelled by trees, graphs or grammars.

Subject

RNA pattern matching is an important aspect of noncoding RNA analysis: given a family of noncoding RNAs, the problem is to identify all potential occurrences of members of that family in a sequence, such as a newly sequenced genome. Existing methods mainly rely on stochastic context-free grammars, also called covariance models [1]. These models are highly expressive and specific. On the other hand, the associated search algorithms are time-consuming [2]. Several improvements have recently been proposed [3,4,5,6], but the time complexity is still too high to allow for large-scale scanning.

The goal of this project is to propose simple models for RNA families, inferred either from stochastic context-free grammars or directly from sequence data (available in the Rfam database, for example). These models could be thought of as a series of independent modules, which would make them suitable for lossless filtering search.
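As a purely illustrative sketch of what one such module might look like, the Python snippet below scans a sequence for a single hairpin: a stem of perfectly paired bases enclosing a loop of bounded length. The function name, its parameters and the choice to allow G-U wobble pairs are assumptions made for this example; they are not part of the proposal, and this is not the model the project aims to infer. A filter of this kind would be lossless only if every true occurrence of the family is guaranteed to satisfy the module.

```python
# Hypothetical example: names and parameters are illustrative assumptions.
# Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_hairpins(seq, stem_len, min_loop, max_loop):
    """Return (start, loop_len) for every window of seq in which a perfect
    stem of stem_len base pairs encloses a loop of min_loop..max_loop bases."""
    hits = []
    n = len(seq)
    for start in range(n):
        for loop_len in range(min_loop, max_loop + 1):
            end = start + 2 * stem_len + loop_len  # exclusive end of the window
            if end > n:
                break  # longer loops cannot fit either
            # Base k of the 5' strand must pair with base k (from the end) of the 3' strand.
            if all((seq[start + k], seq[end - 1 - k]) in PAIRS for k in range(stem_len)):
                hits.append((start, loop_len))
    return hits


# A 3-base-pair stem (GGG/CCC) around a 4-base loop (AAAA).
print(find_hairpins("GGGAAAACCC", stem_len=3, min_loop=3, max_loop=5))
```

Windows that pass such a cheap test could then be handed to a more expensive model such as a covariance model, which is the usual shape of a filtering search.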

This master proposal can be extended to a PhD project.

Prerequisite

Master's degree in computer science, bioinformatics or computational biology. Skills in algorithms and programming in C.

Bibliographical references

  1. Query-dependent banding (QDB) for faster RNA similarity searches. Nawrocki EP, Eddy SR. PLoS Computational Biology, 3(3):e56, 2007.
  2. Exploring genomic dark matter: A critical assessment of the performance of homology search methods on noncoding RNA. Gardner PP, Freyhult EK, Bollback JP. Genome Research, 17:117-125, 2007.
  3. Designing Secondary Structure Profiles for Fast ncRNA Identification. Sun Y, Buhler J. Computational Systems Bioinformatics, 2008.
  4. Searching genomes for noncoding RNA using FastR. Zhang S, Haas B, Eskin E, Bafna V. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4):366-379, 2005.
  5. RNA Search with Decision Trees and Partial Covariance Models. Smith JA. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(3):517-527, 2009.
  6. Faster genome annotation of non-coding RNA families without loss of accuracy. Weinberg Z, Ruzzo WL. Proc. Eighth Annual International Conference on Computational Molecular Biology (RECOMB), pp. 243-251, 2004.