School of Informatics and Computing

Haixu Tang, PhD: MGEScan: Identifying LTR and non-LTR Retroelements in Eukaryotic Genomes


Transposable elements (TEs), also called mobile genetic elements (MGEs), have been found in most eukaryotic genomes. TEs often constitute a significant portion of the eukaryotic genome (e.g., 80% of the maize, 45% of the human, and 5.3% of the fruit fly genome) and play important roles in shaping its structure. Because TEs can transpose from one location to another within a genome or across genomes, identifying them and analyzing their dynamics is important for a better understanding of the structure and evolution of both genomes and the TEs themselves. We therefore developed a computational method, MGEScan, to identify long terminal repeat (LTR) and non-LTR retroelements in eukaryotic genomic sequences.

MGEScan-LTR is a de novo method to identify LTR retroelements, an important class of TEs that transpose through reverse transcription of RNA intermediates. Intact LTR retroelements were identified using multiple empirical rules: similarity of the pair of LTRs at both ends, the structure of the internal regions (IRs), di(tri)-nucleotides at the flanking ends, and target site duplications (TSDs). The intact elements identified were clustered into families based on the sequence similarity of LTRs between elements (>80%). This framework was applied to identify a large number of novel elements, which were subsequently analyzed to estimate the evolutionary history and relationships of TEs.

MGEScan-nonLTR is a computational approach inspired by a generalized hidden Markov model (GHMM) to identify non-LTR retroelements in genomic sequences. To model common features of non-LTR retroelements in a large variety of genomes, we built a model consisting of twelve super-states, each corresponding to a different clade (for example, I, Jockey, and R1). Each super-state consists of one to three states, corresponding to the protein domains and linker regions encoded by the non-LTR retroelements. To score the protein-domain states and the inter-domain (linker) states, we adopted two probabilistic models: a profile HMM (for the protein domains) and a Gaussian Bayes classifier (for the linker regions). MGEScan-nonLTR was tested on the genome sequences of four eukaryotic organisms: Drosophila melanogaster, Daphnia pulex, Ciona intestinalis, and Strongylocentrotus purpuratus. Notably, for the D. pulex genome, MGEScan-nonLTR found a significantly larger number of elements than RepeatMasker did with the current version of the RepBase Update library.
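
As a rough illustration of the family-clustering step described for MGEScan-LTR above, the sketch below groups elements whose LTR sequences are more than 80% similar. It is not MGEScan's implementation: the abstract does not say how similarity is computed or how clusters are linked, so this sketch assumes single-linkage grouping and uses Python's `difflib.SequenceMatcher` ratio as a crude stand-in for alignment identity. The names `similarity`, `cluster_families`, and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Crude stand-in for alignment identity (an assumption, not MGEScan's measure)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_families(ltrs: dict[str, str], threshold: float = 0.80) -> list[set[str]]:
    """Single-linkage grouping: any pair above the threshold ends up in one family."""
    parent = {name: name for name in ltrs}

    def find(x: str) -> str:
        # Follow parent links to the family representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ltrs, 2):
        if similarity(ltrs[a], ltrs[b]) > threshold:
            parent[find(a)] = find(b)  # merge the two families

    families: dict[str, set[str]] = {}
    for name in ltrs:
        families.setdefault(find(name), set()).add(name)
    return list(families.values())


# Toy LTR sequences (made up for illustration): elem2 differs from elem1 by one base.
example = {
    "elem1": "TGTTAGGACCTAGGCATTCA",
    "elem2": "TGTTAGGACCTAGGCATTCG",
    "elem3": "CCCCGGGGAAAATTTTCCGG",
}
families = cluster_families(example)
```

With these toy inputs, `elem1` and `elem2` fall into one family and `elem3` forms its own. A real pipeline would presumably replace `similarity` with an identity score from a proper sequence alignment.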


Haixu Tang is currently an associate professor in the School of Informatics and Computing and the director of Bioinformatics in the Center for Genomics and Bioinformatics at Indiana University, Bloomington. He received his Ph.D. from the Shanghai Institute of Biochemistry, Chinese Academy of Sciences, in 1998. He received an NSF CAREER Award in 2007 and an outstanding junior faculty award from IU in 2009.