Variation benchmark datasets: update, criteria, quality and applications

Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Anasua Sarkar ◽  
Yang Yang ◽  
Mauno Vihinen

Abstract Development of new computational methods and testing their performance has to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcomes are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time-consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, structure-mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performances to previously published methods.
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies, and show that such comparisons are feasible and useful; however, there may be limitations due to lack of provided details and shared data. Database URL: http://structure.bmc.lu.se/VariBench

2019 ◽  
Author(s):  
Anasua Sarkar ◽  
Yang Yang ◽  
Mauno Vihinen

ABSTRACT Development of new computational methods and testing their performance has to be done on experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcomes are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time-consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets. They have been used for training and benchmarking predictors for various types of variations and their effects. There are 419 new datasets from 109 papers containing altogether 329 003 373 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, structure-mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property predictions for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performance to previously published methods.
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies, and showed that such comparisons are feasible and useful; however, there may be limitations due to lack of provided details and shared data.
AUTHOR SUMMARY The performance of a prediction method can only be assessed in comparison to existing knowledge. For that purpose, benchmark datasets with known and verified outcomes are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time-consuming to generate. We collected variation datasets from literature, websites and databases. There are 419 separate new datasets, which, however, contain plenty of redundancy. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, structure-mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property predictions for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. The updated VariBench facilitates development and testing of new methods and comparison of obtained performance to previously published methods. We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies and showed that such comparisons are possible and useful when the details of studies and the datasets are shared.


Genetics ◽  
1999 ◽  
Vol 153 (2) ◽  
pp. 753-762
Author(s):  
Günther E Roth ◽  
Sigrid Wattler ◽  
Hartmut Bornschein ◽  
Michael Lehmann ◽  
Günter Korge

Abstract The Drosophila melanogaster gene Sgs-1 belongs to the secretion protein genes, which are coordinately expressed in salivary glands of third instar larvae. Earlier analysis had implied that Sgs-1 is located at the 25B2-3 puff. We cloned Sgs-1 from a YAC covering 25B2-3. Despite using a variety of vectors and Escherichia coli strains, subcloning from the YAC led to deletions within the Sgs-1 coding region. Analysis of clonable and unclonable sequences revealed that Sgs-1 mainly consists of 48-bp tandem repeats encoding a threonine-rich protein. The Sgs-1 inserts from single λ clones are heterogeneous in length, indicating that repeats are eliminated. By analyzing the expression of Sgs-1/lacZ fusions in transgenic flies, cis-regulatory elements of Sgs-1 were mapped to lie within 1 kb upstream of the transcriptional start site. Band shift assays revealed binding sites for the transcription factor fork head (FKH) and the factor secretion enhancer binding protein 3 (SEBP3) at positions that are functionally relevant. FKH and SEBP3 have been shown previously to be involved in the regulation of Sgs-3 and Sgs-4. Comparison of the levels of steady state RNA and of the transcription rates for Sgs-1 and Sgs-1/lacZ reporter genes indicates that Sgs-1 RNA is 100-fold more stable than Sgs-1/lacZ RNA. This has implications for the model of how Sgs transcripts accumulate in late third instar larvae.


Genetics ◽  
2020 ◽  
Vol 217 (1) ◽  
Author(s):  
Jaclyn M Noshay ◽  
Alexandre P Marand ◽  
Sarah N Anderson ◽  
Peng Zhou ◽  
Maria Katherine Mejia Guerra ◽  
...  

Abstract Transposable elements (TEs) have the potential to create regulatory variation both through the disruption of existing DNA regulatory elements and through the creation of novel DNA regulatory elements. In a species with a large genome, such as maize, many TEs interspersed with genes create opportunities for significant allelic variation due to TE presence/absence polymorphisms among individuals. We used information on putative regulatory elements in combination with knowledge about TE polymorphisms in maize to identify TE insertions that interrupt existing accessible chromatin regions (ACRs) in B73 as well as examples of polymorphic TEs that contain ACRs among four inbred lines of maize including B73, Mo17, W22, and PH207. The TE insertions in three other assembled maize genomes (Mo17, W22, or PH207) that interrupt ACRs that are present in the B73 genome can trigger changes to the chromatin, suggesting the potential for both genetic and epigenetic influences of these insertions. Nearly 20% of the ACRs located over 2 kb from the nearest gene are located within an annotated TE. These are regions of unmethylated DNA that show evidence for functional importance similar to ACRs that are not present within TEs. Using a large panel of maize genotypes, we tested if there is an association between the presence of TE insertions that interrupt, or carry, an ACR and the expression of nearby genes. While most TE polymorphisms are not associated with expression for nearby genes, the TEs that carry ACRs exhibit enrichment for being associated with higher expression of nearby genes, suggesting that these TEs may contribute novel regulatory elements. These analyses highlight the potential for a subset of TEs to rewire transcriptional responses in eukaryotic genomes.


eLife ◽  
2014 ◽  
Vol 3 ◽  
Author(s):  
Andrew R Bassett ◽  
Asifa Akhtar ◽  
Denise P Barlow ◽  
Adrian P Bird ◽  
Neil Brockdorff ◽  
...  

Although a small number of the vast array of animal long non-coding RNAs (lncRNAs) have known effects on cellular processes examined in vitro, the extent of their contributions to normal cell processes throughout development, differentiation and disease for the most part remains less clear. Phenotypes arising from deletion of an entire genomic locus cannot be unequivocally attributed either to the loss of the lncRNA per se or to the associated loss of other overlapping DNA regulatory elements. The distinction between cis- or trans-effects is also often problematic. We discuss the advantages and challenges associated with the current techniques for studying the in vivo function of lncRNAs in the light of different models of lncRNA molecular mechanism, and reflect on the design of experiments to mutate lncRNA loci. These considerations should assist in the further investigation of these transcriptional products of the genome.


2000 ◽  
Vol 20 (16) ◽  
pp. 6040-6050 ◽  
Author(s):  
Jorge A. Iñiguez-Lluhí ◽  
David Pearce

ABSTRACT DNA regulatory elements frequently harbor multiple recognition sites for several transcriptional activators. The response mounted from such compound response elements is often more pronounced than the simple sum of effects observed at single binding sites. The determinants of such transcriptional synergy and its control, however, are poorly understood. Through a genetic approach, we have uncovered a novel protein motif that limits the transcriptional synergy of multiple DNA-binding regulators. Disruption of these conserved synergy control motifs (SC motifs) selectively increases activity at compound, but not single, response elements. Although isolated SC motifs do not regulate transcription when tethered to DNA, their transfer to an activator lacking them is sufficient to impose limits on synergy. Mechanistic analysis of the two SC motifs found in the glucocorticoid receptor N-terminal region reveals that they function irrespective of the arrangement of the receptor binding sites or their distance from the transcription start site. Proper function, however, requires the receptor's ligand-binding domain and an engaged dimer interface. Notably, the motifs are not functional in yeast and do not alter the effect of p160 coactivators, suggesting that they require other nonconserved components to operate. Many activators across multiple classes harbor seemingly unrelated negative regulatory regions. The presence of SC motifs within them, however, suggests a common function and identifies SC motifs as critical elements of a general mechanism to modulate higher-order interactions among transcriptional regulators.


eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Sinisa Hrvatin ◽  
Christopher P Tzeng ◽  
M Aurel Nagy ◽  
Hume Stroud ◽  
Charalampia Koutsioumpa ◽  
...  

Enhancers are the primary DNA regulatory elements that confer cell type specificity of gene expression. Recent studies characterizing individual enhancers have revealed their potential to direct heterologous gene expression in a highly cell-type-specific manner. However, it has not yet been possible to systematically identify and test the function of enhancers for each of the many cell types in an organism. We have developed PESCA, a scalable and generalizable method that leverages ATAC- and single-cell RNA-sequencing protocols, to characterize cell-type-specific enhancers that should enable genetic access and perturbation of gene function across mammalian cell types. Focusing on the highly heterogeneous mammalian cerebral cortex, we apply PESCA to find enhancers and generate viral reagents capable of accessing and manipulating a subset of somatostatin-expressing cortical interneurons with high specificity. This study demonstrates the utility of this platform for developing new cell-type-specific viral reagents, with significant implications for both basic and translational research.


2004 ◽  
Vol 16 (1) ◽  
pp. 23-28 ◽  
Author(s):  
ANTONIETTA LA TERZA ◽  
CRISTINA MICELI ◽  
PIERANGELO LUPORINI

In the Antarctic ciliate, Euplotes focardii, the heat-shock protein 70 (Hsp70) gene does not show any appreciable activation by a thermal stress. Yet, it is activated to appreciable transcriptional levels by oxidative and chemical stresses, thus implying that it evolved a mechanism of selective, stress-specific response. A basic step in investigating this mechanism is the determination of the complete nucleotide sequence of the E. focardii Hsp70 gene. This gene contains a coding region specific for an Hsp70 protein that carries unique amino acid substitutions of potential significance for cold adaptation, and a 5' regulatory region that includes sequence motifs denoting two distinct types of stress-inducible promoters, known as “Heat Shock Elements” (HSE) and “Stress Response Elements” (StRE). From the study of the interactions of these regulatory elements with their specific transactivator factors we expect to shed light on the adaptive modifications that prevent the Hsp70 gene of E. focardii from responding to thermal stress while being responsive to other stresses.


2019 ◽  
Vol 20 (12) ◽  
pp. 2883 ◽  
Author(s):  
Simon J. Baumgart ◽  
Ekaterina Nevedomskaya ◽  
Bernard Haendler

Recent advances in whole-genome and transcriptome sequencing of prostate cancer at different stages indicate that a large number of mutations found in tumors are present in non-protein coding regions of the genome and lead to dysregulated gene expression. Single nucleotide variations and small mutations affecting the recruitment of transcription factor complexes to DNA regulatory elements are observed in an increasing number of cases. Genomic rearrangements may position coding regions under the novel control of regulatory elements, as exemplified by the TMPRSS2-ERG fusion and the amplified enhancer identified upstream of the androgen receptor (AR) gene. Super-enhancers are increasingly found to play important roles in aberrant oncogenic transcription. Several players involved in these processes are currently being evaluated as drug targets and may represent new vulnerabilities that can be exploited for prostate cancer treatment. They include factors involved in enhancer and super-enhancer function such as bromodomain proteins and cyclin-dependent kinases. In addition, non-coding RNAs with an important gene regulatory role are being explored. The rapid progress made in understanding the influence of the non-coding part of the genome and of transcription dysregulation in prostate cancer could pave the way for the identification of novel treatment paradigms for the benefit of patients.

