Interoperable genome annotation with GBOL, an extendable infrastructure for functional data mining

Comparative Genome Annotation Systems

Advanced Data Mining Technologies in Bioinformatics, 10.4018/978-1-59140-863-5.ch016, 2011, pp. 296-313

Author(s): Kwangmin Choi, Sun Kim

Keyword(s): Data Mining, Sequence Analysis, Genome Annotation, Genome Comparison, Comparative Genome, A Genome, Multiple Genomes

Understanding the genetic content of a genome is an important but challenging task. One of the most effective ways to annotate a genome is to compare it with genomes that have already been sequenced and annotated. This chapter surveys systems that annotate genomes by comparing multiple genomes and discusses important issues in designing genome comparison systems, such as extensibility, scalability, reconfigurability, flexibility, usability, and data mining functionality. We also briefly discuss further issues in developing genome comparison systems in which users can flexibly perform genome comparison at the sequence analysis level.

A Method for Improving the Accuracy and Efficiency of Bacteriophage Genome Annotation

International Journal of Molecular Sciences, 10.3390/ijms20143391, 2019, Vol 20 (14), pp. 3391, Cited By ~ 7

Author(s): Alicia Salisbury, Philippos K. Tsourkas

Keyword(s): Decision Making, Genome Annotation, Phage Genome, Labor Cost, Additional Benefit, Student Recruitment, Manual Curation, A Genome, Bacteriophage Genome, Genome Annotations

Bacteriophages are the most numerous entities on Earth. The number of sequenced phage genomes is approximately 8000 and increasing rapidly. Sequencing of a genome is followed by annotation, where genes, start codons, and functions are putatively identified. The mainstays of phage genome annotation are auto-annotation programs such as Glimmer and GeneMark. Because phage genomes are relatively small, many groups choose to manually curate auto-annotation results to increase accuracy. An additional benefit of manually curating auto-annotated phage genomes is that the process can be performed by students, and it has been shown to improve student recruitment to the sciences. However, despite its greater accuracy and pedagogical value, manual curation suffers from high labor cost, a lack of standardization, a degree of subjectivity in decision making, and susceptibility to mistakes. Here, we present a method developed in our lab that is designed to produce accurate annotations while reducing subjectivity and providing a degree of standardization in decision making. We show that our method produces genome annotations that are more accurate than those of auto-annotation programs while retaining the pedagogical benefits of manual genome curation.

Data mining parasite genomes

Parasitology, 10.1017/s0031182004006857, 2004, Vol 128 (S1), pp. S23-S31, Cited By ~ 3

Author(s): M. BERRIMAN

Keyword(s): Data Mining, Genome Annotation, Sequence Data, Genome Project, Coding Regions, A Genome, Term Data, Lines Of Evidence, Large Background

The term ‘data mining’ can be used to describe any process where useful information is extracted from data with a large background of ‘noise’. In the context of a genome project, several stages involve data mining. Amongst the sequence data, ‘signals’ need to be detected that indicate the presence of interesting features. Often this involves differentiating between transcribed and non-transcribed bases to predict coding regions. After detection, defining the roles of these sequences involves sifting through multiple lines of evidence. If these roles are accurately reflected in genome annotation, they can be used by researchers to frame queries and interrogate the data further.
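
As a rough illustration of the kind of "signal" detection this abstract refers to, the sketch below scans a DNA string for open reading frames (ORFs) on the forward strand. It is not taken from the paper: the function name `find_orfs`, the `min_codons` parameter, and the choice of a plain ATG-to-stop scan are assumptions made here for illustration, and real gene finders rely on far richer evidence than this.

```python
# Illustrative only: a naive forward-strand ORF scan, not the paper's method.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) half-open index pairs of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon.
    min_codons counts the start codon plus coding codons, excluding the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

Calling `find_orfs("CCATGAAATTTGGGTAACC")` reports one candidate region, which would then still need the "multiple lines of evidence" the abstract mentions before a role could be assigned.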

Comparative Genome Annotation Systems

Intelligent Information Technologies, 10.4018/978-1-59904-941-0.ch053, 2011, pp. 901-916

Author(s): Kwangmin Choi, Sun Kim

Keyword(s): Data Mining, Sequence Analysis, Genome Annotation, Genome Comparison, Comparative Genome, A Genome, Multiple Genomes

Understanding the genetic content of a genome is an important but challenging task. One of the most effective ways to annotate a genome is to compare it with genomes that have already been sequenced and annotated. This chapter surveys systems that annotate genomes by comparing multiple genomes and discusses important issues in designing genome comparison systems, such as extensibility, scalability, reconfigurability, flexibility, usability, and data mining functionality. We also briefly discuss further issues in developing genome comparison systems in which users can flexibly perform genome comparison at the sequence analysis level.

Comparative Genome Annotation Systems

Data Warehousing and Mining, 10.4018/978-1-59904-951-9.ch105, 2008, pp. 1784-1798

Author(s): Kwangmin Choi, Sun Kim

Keyword(s): Data Mining, Sequence Analysis, Genome Annotation, Genome Comparison, Comparative Genome, A Genome, Multiple Genomes

Understanding the genetic content of a genome is an important but challenging task. One of the most effective ways to annotate a genome is to compare it with genomes that have already been sequenced and annotated. This chapter surveys systems that annotate genomes by comparing multiple genomes and discusses important issues in designing genome comparison systems, such as extensibility, scalability, reconfigurability, flexibility, usability, and data mining functionality. We also briefly discuss further issues in developing genome comparison systems in which users can flexibly perform genome comparison at the sequence analysis level.

Accelerated Discovery of High-Refractive-Index Polyimides via First-Principles Molecular Modeling, Virtual High-Throughput Screening, and Data Mining

10.26434/chemrxiv.7670903.v1, 2019

Author(s): Mohammad Atif Faiz Afzal, Mojtaba Haghighatlari, Sai Prasad Ganesh, Chong Cheng, Johannes Hachmann

Keyword(s): Data Mining, Refractive Index, High Throughput, First Principles, High Throughput Screening, Large Scale, Computational Study, High Refractive Index, Structural Features, Learning Program

We present a high-throughput computational study to identify novel polyimides (PIs) with exceptional refractive index (RI) values for use as optic or optoelectronic materials. Our study utilizes an RI prediction protocol based on a combination of first-principles and data modeling developed in previous work, which we employ on a large-scale PI candidate library generated with the ChemLG code. We deploy the virtual screening software ChemHTPS to automate the assessment of this extensive pool of PI structures in order to determine the performance potential of each candidate. This rapid and efficient approach yields a number of highly promising lead compounds. Using the data mining and machine learning program package ChemML, we analyze the top candidates with respect to prevalent structural features and feature combinations that distinguish them from less promising ones. In particular, we explore the utility of various strategies that introduce highly polarizable moieties into the PI backbone to increase its RI yield. The derived insights provide a foundation for rational and targeted design that goes beyond traditional trial-and-error searches.

Multi-GPU approach to global induction of classification trees for large-scale data mining

Applied Intelligence, 10.1007/s10489-020-01952-5, 2021

Author(s): Krzysztof Jurczuk, Marcin Czajkowski, Marek Kretowski

Keyword(s): Data Mining, Large Scale, Real Life, Population Based, Tree Structure, Global Approach, Data Parallel, Large Scale Data, The Impact, Scale Data

This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and the tests simultaneously and thus, in many situations, improves the prediction and the size of the resulting classifiers. However, it is a population-based, iterative approach that can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems for large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient use of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. As the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.
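
The data-parallel fitness evaluation described in this abstract can be sketched roughly as follows. This is a CPU-only toy, not the authors' CUDA implementation: worker threads stand in for GPUs, and the tree encoding and the names `predict`, `chunk_errors`, `split`, and `fitness` are assumptions introduced here for illustration. The idea shown is only that the dataset is cut into chunks, each chunk reports its own error count for a candidate tree, and the partial counts are summed into one fitness value.

```python
from concurrent.futures import ThreadPoolExecutor


def predict(tree, x):
    """Classify x with a tree encoded as ('leaf', label) or
    ('node', feature, threshold, left, right). Hypothetical encoding."""
    while tree[0] == "node":
        _, feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree[1]


def chunk_errors(tree, rows, labels):
    """Count misclassified instances in one chunk (the per-device work)."""
    return sum(predict(tree, x) != y for x, y in zip(rows, labels))


def split(items, parts):
    """Split a list into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for k in range(parts):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def fitness(tree, rows, labels, workers=4):
    """Accuracy of `tree`, obtained by summing per-chunk error counts."""
    row_chunks = split(rows, workers)
    label_chunks = split(labels, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker scores one chunk; threads stand in for GPUs here.
        partial = pool.map(chunk_errors, [tree] * workers,
                           row_chunks, label_chunks)
        total_errors = sum(partial)
    return 1.0 - total_errors / len(rows)
```

Because each chunk is scored independently and only a single integer per chunk is combined at the end, the same decomposition applies to any number of workers, which is the property that makes this style of evaluation attractive for multi-device setups.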

Arabidopsis Genes Essential for Seedling Viability: Isolation of Insertional Mutants and Molecular Cloning

Genetics, 10.1093/genetics/159.4.1765, 2001, Vol 159 (4), pp. 1765-1778

Author(s): Gregory J Budziszewski, Sharon Potter Lewis, Lyn Wegrich Glover, Jennifer Reineke, Gary Jones, ...

Keyword(s): Large Scale, Protein Translocation, Gene Families, Mutant Phenotype, Lethal Mutant, A Genome, Genes Encoding, High Level, Mutant Lines, Genome Scale

We have undertaken a large-scale genetic screen to identify genes with a seedling-lethal mutant phenotype. From screening ~38,000 insertional mutant lines, we identified >500 seedling-lethal mutants, completed cosegregation analysis of the insertion and the lethal phenotype for >200 mutants, molecularly characterized 54 mutants, and provided a detailed description for 22 of them. Most of the seedling-lethal mutants seem to affect chloroplast function because they display altered pigmentation and affect genes encoding proteins predicted to have chloroplast localization. Although a high level of functional redundancy in Arabidopsis might be expected because 65% of genes are members of gene families, we found that 41% of the essential genes found in this study are members of Arabidopsis gene families. In addition, we isolated several interesting classes of mutants and genes. We found three mutants in the recently discovered nonmevalonate isoprenoid biosynthetic pathway and mutants disrupting genes similar to Tic40 and tatC, which are likely to be involved in chloroplast protein translocation. Finally, we directly compared T-DNA and Ac/Ds transposon mutagenesis methods in Arabidopsis on a genome scale. In each population, we found only about one-third of the insertion mutations cosegregated with a mutant phenotype.

Uncovering transcriptional dark matter via gene annotation independent single-cell RNA sequencing analysis

Nature Communications, 10.1038/s41467-021-22496-3, 2021, Vol 12 (1)

Author(s): Michael F. Z. Wang, Madhav Mantri, Shao-Pei Chou, Gaetano J. Scuderi, David W. McKellar, ...

Keyword(s): Single Cell, Genome Annotation, Gene Annotation, Active Regions, Sequencing Analysis, Biologically Relevant, Mole Rat, Genome Annotations, Cell Expression, High Quality Genome

Conventional scRNA-seq expression analyses rely on the availability of a high quality genome annotation. Yet, as we show here with scRNA-seq experiments and analyses spanning human, mouse, chicken, mole rat, lemur and sea urchin, genome annotations are often incomplete, in particular for organisms that are not routinely studied. To overcome this hurdle, we created a scRNA-seq analysis routine that recovers biologically relevant transcriptional activity beyond the scope of the best available genome annotation by performing scRNA-seq analysis on any region in the genome for which transcriptional products are detected. Our tool generates a single-cell expression matrix for all transcriptionally active regions (TARs), performs single-cell TAR expression analysis to identify biologically significant TARs, and then annotates TARs using gene homology analysis. This procedure uses single-cell expression analyses as a filter to direct annotation efforts to biologically significant transcripts and thereby uncovers biology to which scRNA-seq would otherwise be in the dark.

Genetic variation in recombination rate in the pig

Genetics Selection Evolution, 10.1186/s12711-021-00643-0, 2021, Vol 53 (1)

Author(s): Martin Johnsson, Andrew Whalen, Roger Ros-Freixedes, Gregor Gorjanc, Ching-Yi Chen, ...

Keyword(s): Genetic Variation, Recombination Rate, Large Scale, Genome Wide Association Study, Genetic Material, A Genome, Pig Genome, Genetic Length, Trait Locus, Genomic Regions

Background: Meiotic recombination results in the exchange of genetic material between homologous chromosomes. Recombination rate varies between different parts of the genome, between individuals, and is influenced by genetics. In this paper, we assessed the genetic variation in recombination rate along the genome and between individuals in the pig using multilocus iterative peeling on 150,000 individuals across nine genotyped pedigrees. We used these data to estimate the heritability of recombination and perform a genome-wide association study of recombination in the pig. Results: Our results confirmed known features of the recombination landscape of the pig genome, including differences in genetic length of chromosomes and marked sex differences. The recombination landscape was repeatable between lines, but at the same time, there were differences in average autosome-wide recombination rate between lines. The heritability of autosome-wide recombination rate was low but not zero (on average 0.07 for females and 0.05 for males). We found six genomic regions that are associated with recombination rate, among which five harbour known candidate genes involved in recombination: RNF212, SHOC1, SYCP2, MSH4 and HFM1. Conclusions: Our results on the variation in recombination rate in the pig genome agree with those reported for other vertebrates, with a low but nonzero heritability, and the identification of a major quantitative trait locus for recombination rate that is homologous to that detected in several other species. This work also highlights the utility of using large-scale livestock data to understand biological processes.
