Improved Prediction of Regulatory Element Using Hybrid Abelian Complexity Features with DNA Sequences

Chengchao Wu; Jin Chen; Yunxia Liu; Xuehai Hu

doi:10.3390/ijms20071704

International Journal of Molecular Sciences ◽

10.3390/ijms20071704 ◽

2019 ◽

Vol 20 (7) ◽

pp. 1704 ◽

Cited By ~ 1

Author(s):

Chengchao Wu ◽

Jin Chen ◽

Yunxia Liu ◽

Xuehai Hu

Keyword(s):

Dna Sequences ◽

Regulatory Element ◽

Superior Performance ◽

Accurate Identification ◽

Complexity Function ◽

Enhancer Prediction ◽

Complex Extension ◽

Subword Complexity ◽

Core Issues ◽

Broad Interest

Abstract: Deciphering the code of cis-regulatory elements (CREs) is one of the core issues of current biology. As an important category of CREs, enhancers play crucial roles in regulating gene transcription from a distance. Further, the disruption of an enhancer can cause abnormal transcription and thus trigger human diseases, which means that accurate enhancer identification is currently of broad interest. Here, we introduce an innovative concept, the abelian complexity function (ACF), which is a more complex extension of the classic subword complexity function, for a new coding of DNA sequences. After feature selection by an upper bound estimation and integration with DNA composition features, we developed an enhancer prediction model with hybrid abelian complexity features (HACF). Compared with existing methods, HACF shows consistently superior performance on three sources of enhancer datasets. We tested the generalization ability of HACF by scanning human chromosome 22 to validate previously reported super-enhancers. Meanwhile, we identified novel candidate enhancers that are supported by enhancer-related ENCODE ChIP-seq signals. In summary, HACF improves current enhancer prediction and may be beneficial for further prioritization of functional noncoding variants.
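To make the two complexity notions in this abstract concrete, here is a minimal sketch that assumes the standard combinatorics-on-words definitions: subword complexity counts the distinct length-n substrings of a sequence, while abelian complexity counts the distinct letter-count vectors of those substrings. The function names are illustrative, and the paper's feature selection and hybrid model are not reproduced here.

```python
ALPHABET = "ACGT"  # assumed DNA alphabet; ambiguous bases are not handled


def subword_complexity(seq: str, n: int) -> int:
    """Number of distinct substrings of length n in seq."""
    return len({seq[i:i + n] for i in range(len(seq) - n + 1)})


def abelian_complexity(seq: str, n: int) -> int:
    """Number of distinct letter-count vectors among length-n substrings."""
    vectors = set()
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        # Substrings that are rearrangements of each other share one vector.
        vectors.add(tuple(window.count(base) for base in ALPHABET))
    return len(vectors)


if __name__ == "__main__":
    seq = "ACCA"
    print(subword_complexity(seq, 2), abelian_complexity(seq, 2))
```

In "ACCA", the length-2 substrings AC and CA are distinct words with the same letter counts, so the abelian count is smaller than the subword count for that window length.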

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Bioinformatics ◽

10.1093/bioinformatics/btab083 ◽

2021 ◽

Author(s):

Yanrong Ji ◽

Zhihan Zhou ◽

Han Liu ◽

Ramana V Davuluri

Keyword(s):

Dna Sequences ◽

Regulatory Elements ◽

Ease Of Use ◽

Fine Tuning ◽

Supplementary Information ◽

Sequence Motifs ◽

Semantic Relationship ◽

Accurate Identification ◽

Conserved Sequence ◽

Genome Wide

Abstract: Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned to many other sequence analysis tasks. Availability and implementation: The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.

Small gene family encoding an eggshell (chorion) protein of the human parasite Schistosoma mansoni.

Molecular and Cellular Biology ◽

10.1128/mcb.8.8.3008 ◽

1988 ◽

Vol 8 (8) ◽

pp. 3008-3016 ◽

Cited By ~ 52

Author(s):

L A Bobek ◽

D M Rekosh ◽

P T LoVerde

Keyword(s):

Schistosoma Mansoni ◽

Dna Sequences ◽

Genomic Library ◽

Regulatory Element ◽

Gene Promoter ◽

Transcription Unit ◽

Human Parasite ◽

Intergenic Dna ◽

Eggshell Proteins ◽

Silkmoth Chorion

We have isolated six independent genomic clones encoding schistosome chorion or eggshell proteins from a Schistosoma mansoni genomic library. A linkage map of five of the clones spanning 35 kilobase pairs (kbp) of the S. mansoni genome was constructed. The region contained two eggshell protein genes closely linked, separated by 7.5 kbp of intergenic DNA. The two genes of the cluster were arranged in the same orientation, that is, they were transcribed from the same strand. The sixth clone probably represents a third copy of the eggshell gene that is not contained within the 35-kbp region. The 5' end of the mRNA transcribed from these genes was defined by primer extension directly off the RNA. The ATCAT cap site sequence was homologous to a silkmoth chorion PuTCATT cap site sequence, where Pu indicates any purine. DNA sequence analysis showed that there were no introns in these genes. The DNA sequences of the three genes were very homologous to each other and to a cDNA clone, pSMf61-46, differing only in three or four nucleotides. A multiple TATA box was located at positions -23 to -31, and a CAAAT sequence was located at -52 upstream of the eggshell transcription unit. Comparison of sequences in regions further upstream with silkmoth and Drosophila sequences revealed several very short elements that were shared. One such element, TCACGT, recently shown to be an essential cis-regulatory element for silkmoth chorion gene promoter function, was found at a similar position in all three organisms.
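As an illustration of how the short motifs named in this abstract could be located in a sequence today, the sketch below scans a string for the PuTCATT cap site pattern (Pu written as the character class [AG]) and for the TCACGT element. The helper name and the example sequence are hypothetical and are not taken from the paper, which predates this kind of tooling.

```python
import re

# Motifs quoted in the abstract; Pu (any purine) is expressed as [AG].
CAP_SITE = "[AG]TCATT"
CIS_ELEMENT = "TCACGT"


def find_motif(seq: str, pattern: str) -> list[int]:
    """Return 0-based start positions of every (possibly overlapping) match."""
    # The lookahead lets the scan report overlapping occurrences as well.
    return [m.start() for m in re.finditer(f"(?={pattern})", seq.upper())]


if __name__ == "__main__":
    example = "GGATCATTCCTCACGT"  # made-up sequence for demonstration only
    print(find_motif(example, CAP_SITE))
    print(find_motif(example, CIS_ELEMENT))
```

Only the forward strand is searched here; a fuller scan would also check the reverse complement.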

Evolutionary Multi-Objective Optimization for DNA Sequence Design

Multi-Objective Optimization in Computational Intelligence ◽

10.4018/978-1-59904-498-9.ch009 ◽

2011 ◽

pp. 239-264

Author(s):

Soo-Yong Shin ◽

In-Hee Lee ◽

Byoung-Tak Zhang

Keyword(s):

Evolutionary Algorithms ◽

Dna Sequence ◽

Dna Sequences ◽

Dna Microarrays ◽

Superior Performance ◽

Final Decision ◽

Sequence Design ◽

Multi Objective ◽

Dna Sequence Design ◽

Np Problems

Finding reliable and efficient DNA sequences is one of the most important tasks for successful DNA-related experiments such as DNA computing, DNA nano-assembly, DNA microarrays and polymerase chain reaction. Sequence design involves a number of heterogeneous and conflicting design criteria. Also, it has been proven to belong to a class of NP problems. These suggest that multi-objective evolutionary algorithms (MOEAs) are actually good candidates for DNA sequence optimization. In addition, the characteristics of MOEAs including simple addition/deletion of objectives and easy incorporation of various existing tools and human knowledge into the final decision process could increase the reliability of the final DNA sequence set. In this chapter, we review multi-objective evolutionary approaches to DNA sequence design. In particular, we analyze the performance of ε-multi-objective evolutionary algorithms on three DNA sequence design problems and validate the results by showing superior performance to previous techniques.

Accurate Identification of Active Transcriptional Regulatory Elements from Global Run-On and Sequencing Data

10.1101/011353 ◽

2014 ◽

Cited By ~ 1

Author(s):

Charles G Danko ◽

Stephanie L Hyland ◽

Leighton J Core ◽

Andre L Martins ◽

Colin T Waters ◽

...

Keyword(s):

Multiple Scales ◽

Regulatory Element ◽

Cell Types ◽

Regulatory Elements ◽

Support Vector ◽

Sequencing Data ◽

Accurate Identification ◽

Single Experiment ◽

Transcriptional Regulatory Elements ◽

Transcriptional Regulatory

Identification of the genomic regions that regulate transcription remains an important open problem. We have recently shown that global run-on and sequencing (GRO-seq) with enrichment for 5′-capped RNAs reveals patterns of divergent transcription that accurately mark active transcriptional regulatory elements (TREs), including enhancers and promoters. Here, we demonstrate that active TREs can be identified with comparable accuracy by applying sensitive machine-learning methods to standard GRO-seq and PRO-seq data, allowing TREs to be assayed together with transcription levels, elongation rates, and other transcriptional features, in a single experiment. Our method, called discriminative Regulatory Element detection from GRO-seq (dREG), summarizes GRO-seq read counts at multiple scales and uses support vector regression to predict active TREs. The predicted TREs are strongly enriched for marks associated with functional elements, including H3K27ac, transcription factor binding sites, eQTLs, and GWAS-associated SNPs. Using dREG, we survey TREs in eight cell types and provide new insights into global patterns of TRE assembly and function.

Glucocorticoid receptor binding to a specific DNA sequence is required for hormone-dependent repression of pro-opiomelanocortin gene transcription.

Molecular and Cellular Biology ◽

10.1128/mcb.9.12.5305 ◽

1989 ◽

Vol 9 (12) ◽

pp. 5305-5314 ◽

Cited By ~ 161

Author(s):

J Drouin ◽

M A Trifiro ◽

R K Plante ◽

M Nemer ◽

P Eriksson ◽

...

Keyword(s):

Glucocorticoid Receptor ◽

Dna Sequence ◽

Gene Transcription ◽

Dna Sequences ◽

Regulatory Element ◽

Site Directed Mutagenesis ◽

Glucocorticoid Response Element ◽

Response Element ◽

Glucocorticoid Response ◽

Pomc Gene

Glucocorticoids rapidly and specifically inhibit transcription of the pro-opiomelanocortin (POMC) gene in the anterior pituitary, thus offering a model for studying negative control of transcription in mammals. We have defined an element within the rat POMC gene 5'-flanking region that is required for glucocorticoid inhibition of POMC gene transcription in POMC-expressing pituitary tumor cells (AtT-20). This element contains an in vitro binding site for purified glucocorticoid receptor. Site-directed mutagenesis revealed that binding of the receptor to this site located at position base pair -63 is essential for glucocorticoid repression of transcription. Although related to the well-defined glucocorticoid response element (GRE) found in glucocorticoid-inducible genes, the DNA sequence of the POMC negative glucocorticoid response element (nGRE) differs significantly from the GRE consensus; this sequence divergence may result in different receptor-DNA interactions and may account at least in part for the opposite transcriptional properties of these elements. Hormone-dependent repression of POMC gene transcription may be due to binding of the receptor over a positive regulatory element of the promoter. Thus, repression may result from mutually exclusive binding of two DNA-binding proteins to overlapping DNA sequences.

Genes for low-molecular-weight heat shock proteins of soybeans: sequence analysis of a multigene family.

Molecular and Cellular Biology ◽

10.1128/mcb.5.12.3417 ◽

1985 ◽

Vol 5 (12) ◽

pp. 3417-3428 ◽

Cited By ~ 65

Author(s):

R T Nagao ◽

E Czarnecka ◽

W B Gurley ◽

F Schöffl ◽

J L Key

Keyword(s):

Molecular Weight ◽

Amino Acid ◽

Heat Shock ◽

Dna Sequences ◽

Regulatory Element ◽

Amino Acid Sequences ◽

Striking Similarity ◽

Low Molecular Weight ◽

Genes Encoding ◽

Flanking Regions

Soybeans, Glycine max, synthesize a family of low-molecular-weight heat shock (HS) proteins in response to HS. The DNA sequences of two genes encoding 17.5- and 17.6-kilodalton HS proteins were determined. Nuclease S1 mapping of the corresponding mRNA indicated multiple start termini at the 5' end and multiple stop termini at the 3' end. These two genes were compared with two other soybean HS genes of similar size. A comparison among the 5' flanking regions encompassing the presumptive HS promoter of the soybean HS-protein genes demonstrated this region to be extremely homologous. Analysis of the DNA sequences in the 5' flanking regions of the soybean genes with the corresponding regions of Drosophila melanogaster HS-protein genes revealed striking similarity between plants and animals in the presumptive promoter structure of thermoinducible genes. Sequences related to the Drosophila HS consensus regulatory element were found 57 to 62 base pairs 5' to the start of transcription in addition to secondary HS consensus elements located further upstream. Comparative analysis of the deduced amino acid sequences of four soybean HS proteins illustrated that these proteins were greater than 90% homologous. Comparison of the amino acid sequence for soybean HS proteins with other organisms showed much lower homology (less than 20%). Hydropathy profiles for Drosophila, Xenopus, Caenorhabditis elegans, and G. max HS proteins showed a similarity of major hydrophilic and hydrophobic regions, which suggests conservation of functional domains for these proteins among widely dispersed organisms.

Ancestors graph and an upper bound for the subword complexity function

Theoretical Computer Science ◽

10.1016/j.tcs.2012.11.014 ◽

2013 ◽

Vol 468 ◽

pp. 69-82

Author(s):

Delalleau Guillaume

Keyword(s):

Upper Bound ◽

Complexity Function ◽

Subword Complexity

Association of genes with physiological functions by comparative analysis of pooled expression microarray data

Physiological Genomics ◽

10.1152/physiolgenomics.00116.2012 ◽

2013 ◽

Vol 45 (2) ◽

pp. 69-78 ◽

Cited By ~ 6

Author(s):

Iuan-bor D. Chen ◽

Vinay K. Rathi ◽

Diana S. DeAndrade ◽

Patrick Y. Jay

Keyword(s):

Transcription Factor ◽

Adipose Tissue ◽

Regulatory Element ◽

The Body ◽

Accurate Identification ◽

Physiological Functions ◽

Microarray Expression Data ◽

Multiple Organs ◽

Brown Adipose ◽

Sterol Regulatory Element

The physiological functions of a tissue in the body are carried out by its complement of expressed genes. Genes that execute a particular function should be more specifically expressed in tissues that perform the function. Given this premise, we mined public microarray expression data to build a database of genes ranked by their specificity of expression in multiple organs. The database permitted the accurate identification of genes and functions known to be specific to individual organs. Next, we used the database to predict transcriptional regulators of brown adipose tissue (BAT) and validated two candidate genes. Based upon hypotheses regarding pathways shared between combinations of BAT or white adipose tissue (WAT) and other organs, we identified genes that met threshold criteria for specific or counterspecific expression in each tissue. By contrasting WAT to the heart and BAT, the two most mitochondria-rich tissues in the body, we discovered a novel function for the transcription factor ESRRG in the induction of BAT genes in white adipocytes. Because the heart and other estrogen-related receptor gamma (ESRRG)-rich tissues do not express BAT markers, we hypothesized that an adipocyte co-regulator acts with ESRRG. By comparing WAT and BAT to the heart, brain, kidney and skeletal muscle, we discovered that an isoform of the transcription factor sterol regulatory element binding transcription factor 1 (SREBF1) induces BAT markers in C2C12 myocytes in the presence of ESRRG. The results demonstrate a straightforward bioinformatic strategy to associate genes with functions. The database upon which the strategy is based is provided so that investigators can perform their own screens.

Molecular epidemiology of cystic echinococcosis

Parasitology ◽

10.1017/s0031182003003524 ◽

2003 ◽

Vol 127 (S1) ◽

pp. S37-S51 ◽

Cited By ~ 144

Author(s):

D. P. McManus ◽

R. C. A. Thompson

Keyword(s):

Dna Sequences ◽

Cystic Echinococcosis ◽

Molecular Genetic ◽

Sequence Information ◽

Accurate Identification ◽

Mt Dna ◽

Geographic Ranges ◽

Strain Characterization ◽

Epidemiological Surveys ◽

Complete Sequences

Echinococcus granulosus exhibits substantial genetic diversity that has important implications for the design and development of vaccines, diagnostic reagents and drugs effective against this parasite. DNA approaches that have been used for accurate identification of these genetic variants are presented here as is a description of their application in molecular epidemiological surveys of cystic echinococcosis in different geographical settings and host assemblages. The recent publication of the complete sequences of the mitochondrial (mt) genomes of the horse and sheep strains of E. granulosus and of E. multilocularis, and the availability of mt DNA sequences for a number of other E. granulosus genotypes, has provided additional genetic information that can be used for more in-depth strain characterization and taxonomic studies of these parasites. This very rich sequence information has provided a solid molecular basis, along with a range of different biological, epidemiological, biochemical and other molecular-genetic criteria, for revising the taxonomy of the genus Echinococcus. This has been a controversial issue for some time. Furthermore, the accumulating genetic data may allow insight into several other unresolved questions such as confirming the occurrence and precise nature of the E. granulosus G9 genotype and its reservoir in Poland, whether it is present elsewhere, why the camel strain (G6 genotype) appears to affect humans in certain geographical areas but not others, more precise delineation of the host and geographic ranges of the genotypes characterised to date, and whether additional genotypes of E. granulosus remain to be identified.

DNA barcode data accurately identify higher taxa

10.7287/peerj.preprints.1633v1 ◽

2016 ◽

Author(s):

Jonathan A Coddington ◽

Ingi Agnarsson ◽

Ren-Chung Cheng ◽

Klemen Čandek ◽

Amy Driskell ◽

...

Keyword(s):

Dna Sequences ◽

Dna Barcode ◽

Accurate Method ◽

Reference Database ◽

Accurate Identification ◽

Sequencing Errors ◽

Sequence Identity ◽

Percent Sequence Identity ◽

Near Term ◽

Accuracy Of Results

The use of unique DNA sequences as a method for taxonomic identification is no longer fundamentally controversial, even though debate continues on the best markers, methods, and technology to use. Although both existing databanks such as GenBank and BOLD, as well as reference taxonomies, are imperfect, in best case scenarios “barcodes” (whether single or multiple, organelle or nuclear, loci) clearly are an increasingly fast and inexpensive method of identification, especially as compared to manual identification of unknowns by increasingly rare expert taxonomists. Because most species on Earth are undescribed, a complete reference database at the species level is impractical in the near term. The question therefore arises whether unidentified species can, using DNA barcodes, be accurately assigned to more inclusive groups such as genera and families—taxonomic ranks of putatively monophyletic groups for which the global inventory is more complete and stable. We used a carefully chosen test library of CO1 sequences from 49 families, 313 genera, and 816 species of spiders to assess the accuracy of genus and family-level identifications. We used BLAST queries of each sequence against the entire library and got the top ten hits resulting in 8160 hits. The percent sequence identity was reported from these hits (PIdent, range 75-100%). Accurate identification (PIdent above which errors totaled less than 5%) occurred for genera at PIdent values > 95 and families at PIdent values ≥ 91, suggesting these as heuristic thresholds for generic and familial identifications in spiders. Accuracy of identification increases with numbers of species/genus and genera/family in the library; above five genera per family and fifteen species per genus all identifications were correct. We propose that using percent sequence identity between conventional barcode sequences may be a feasible and reasonably accurate method to identify animals to family/genus. However, the quality of the underlying database impacts accuracy of results; many outliers in our dataset could be attributed to taxonomic and/or sequencing errors in BOLD and GenBank. It seems that an accurate and complete reference library of families and genera of life could provide accurate higher level taxonomic identifications cheaply and accessibly, within years rather than decades.
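The heuristic thresholds reported in this abstract (genus at PIdent > 95, family at PIdent ≥ 91) can be written as a small decision rule. The sketch below is only an illustration of that rule: the function name and the idea of returning the finest rank a hit supports are assumptions, and the thresholds were proposed for spider CO1 barcodes specifically.

```python
from typing import Optional

# Thresholds quoted in the abstract for spider CO1 barcodes.
GENUS_PIDENT = 95.0   # genus-level assignment requires PIdent strictly above 95
FAMILY_PIDENT = 91.0  # family-level assignment requires PIdent of at least 91


def finest_supported_rank(pident: float) -> Optional[str]:
    """Return the finest rank a BLAST hit supports under these thresholds."""
    if pident > GENUS_PIDENT:
        return "genus"
    if pident >= FAMILY_PIDENT:
        return "family"
    return None  # identity too low for either heuristic assignment


if __name__ == "__main__":
    for value in (98.3, 95.0, 91.0, 88.7):
        print(value, finest_supported_rank(value))
```

A hit at exactly 95 falls back to family level because the genus threshold is a strict inequality in the abstract, while a hit at exactly 91 still qualifies for family level.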
