Machine learning as an effective method for identifying true SNPs in polyploid plants

Mapping Intimacies ◽

10.1101/274407 ◽

2018 ◽

Cited By ~ 1

Author(s):

Walid Korani ◽

Josh P. Clevenger ◽

Ye Chu ◽

Peggy Ozias-Akins

Keyword(s):

Machine Learning ◽

Sequence Data ◽

SNP Array ◽

Real Data ◽

Large Set ◽

Nucleotide Polymorphisms ◽

RNA Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

Accuracy Rates

Abstract Single Nucleotide Polymorphisms (SNPs) have many advantages as molecular markers since they are ubiquitous and co-dominant. However, the discovery of true SNPs, especially in polyploid species, is difficult. Peanut is an allopolyploid, which has a very low rate of true SNP calling. A large set of true and false SNPs identified from the Arachis 58k Affymetrix array was leveraged to train machine learning models to select true SNPs straight from sequence data. These models achieved accuracy rates of above 80% using real peanut RNA-seq and whole genome shotgun (WGS) re-sequencing data, which is higher than previously reported for polyploids. A 48K SNP array, Axiom Arachis2, was designed using the approach, which revealed 75% accuracy of calling SNPs from different tetraploid peanut genotypes. Using the method to simulate SNP variation in peanut, cotton, wheat, and strawberry, we show that models built with our parameter sets achieve above 98% accuracy in selecting true SNPs. Additionally, models built with simulated genotypes were able to select true SNPs at above 80% accuracy using real peanut data, demonstrating that our model can be used even if real data are not available to train the models. This work demonstrates an effective approach for calling highly reliable SNPs from polyploids using machine learning. A novel tool was developed for predicting true SNPs from sequence data, designated as SNP-ML (SNP-Machine Learning, pronounced “snip mill”), using the described models. SNP-ML additionally provides functionality to train new models not included in this study for customized use, designated SNP-MLer (SNP-Machine Learner, pronounced “snip miller”). SNP-ML is freely available for public use.

GA-Based Data Mining Applied to Genetic Data for the Diagnosis of Complex Diseases

Soft Computing Methods for Practical Environment Solutions ◽

10.4018/978-1-61520-893-7.ch014 ◽

2010 ◽

pp. 219-239 ◽

Cited By ~ 5

Author(s):

Vanessa Aguiar ◽

Jose A. Seoane ◽

Ana Freire ◽

Ling Guo

Keyword(s):

Machine Learning ◽

Data Mining ◽

Genetic Algorithms ◽

Complex Disease ◽

Complex Diseases ◽

Real Data ◽

Genetic Data ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Short Period

A new algorithm is presented for finding genotype-phenotype association rules from data related to complex diseases. The algorithm is based on genetic algorithms, a technique of evolutionary computation. Compared with several traditional data mining techniques, it obtained better classification scores and found more rules on artificially generated data. It also obtained similar results on some UCI Machine Learning datasets. In this chapter it is assumed that several groups of Single Nucleotide Polymorphisms (SNPs) have an impact on the predisposition to develop a complex disease such as schizophrenia. This assumption is expected to be validated on real data within a short period of time.

Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism–calling pipelines

GigaScience ◽

10.1093/gigascience/giaa007 ◽

2020 ◽

Vol 9 (2) ◽

Cited By ~ 17

Author(s):

Stephen J Bush ◽

Dona Foster ◽

David W Eyre ◽

Emily L Clark ◽

Nicola De Maio ◽

...

Keyword(s):

Reference Genome ◽

Simulated Data ◽

Real Data ◽

Genomic Diversity ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Single Nucleotide ◽

SNP Calling ◽

Single Nucleotide Polymorphism Calling ◽

Nucleotide Divergence

Abstract Background: Accurately identifying single-nucleotide polymorphisms (SNPs) from bacterial sequencing data is an essential requirement for using genomics to track transmission and predict important phenotypes such as antimicrobial resistance. However, most previous performance evaluations of SNP calling have been restricted to eukaryotic (human) data. Additionally, bacterial SNP calling requires choosing an appropriate reference genome to align reads to, which, together with the bioinformatic pipeline, affects the accuracy and completeness of a set of SNP calls obtained. This study evaluates the performance of 209 SNP-calling pipelines using a combination of simulated data from 254 strains of 10 clinically common bacteria and real data from environmentally sourced and genomically diverse isolates within the genera Citrobacter, Enterobacter, Escherichia, and Klebsiella. Results: We evaluated the performance of 209 SNP-calling pipelines, aligning reads to genomes of the same or a divergent strain. Irrespective of pipeline, a principal determinant of reliable SNP calling was reference genome selection. Across multiple taxa, there was a strong inverse relationship between pipeline sensitivity and precision, and the Mash distance (a proxy for average nucleotide divergence) between reads and reference genome. The effect was especially pronounced for diverse, recombinogenic bacteria such as Escherichia coli but less dominant for clonal species such as Mycobacterium tuberculosis. Conclusions: The accuracy of SNP calling for a given species is compromised by increasing intra-species diversity. When reads were aligned to the same genome from which they were sequenced, among the highest-performing pipelines was Novoalign/GATK. By contrast, when reads were aligned to particularly divergent genomes, the highest-performing pipelines often used the aligners NextGenMap or SMALT, and/or the variant callers LoFreq, mpileup, or Strelka.

BAMixChecker: an automated checkup tool for matched sample pairs in NGS cohort

Bioinformatics ◽

10.1093/bioinformatics/btz479 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4806-4808 ◽

Cited By ~ 2

Author(s):

Hein Chun ◽

Sangwoo Kim

Keyword(s):

Genomic Analysis ◽

Supplementary Information ◽

Nucleotide Polymorphisms ◽

RNA Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

Frequent Problem ◽

Generation Sequencing ◽

User Intervention ◽

Genotype Concordance

Abstract Summary: Mislabeling in the process of next generation sequencing is a frequent problem that can cause an entire genomic analysis to fail, and a regular cohort-level checkup is needed to ensure that it has not occurred. We developed a new, automated tool (BAMixChecker) that accurately detects sample mismatches from a given BAM file cohort with minimal user intervention. BAMixChecker uses a flexible, data-specific set of single-nucleotide polymorphisms and detects orphan (unpaired) and swapped (mispaired) samples based on genotype-concordance score and entropy-based file name analysis. BAMixChecker shows ∼100% accuracy in real WES, RNA-Seq and targeted sequencing data cohorts, even for small panels (<50 genes). BAMixChecker provides an HTML-style report that graphically outlines the sample matching status in tables and heatmaps, with which users can quickly inspect any mismatch events. Availability and implementation: BAMixChecker is available at https://github.com/heinc1010/BAMixChecker Supplementary information: Supplementary data are available at Bioinformatics online.
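The abstract mentions a genotype-concordance score but does not define it. As a rough illustration only, and not BAMixChecker's actual scoring, the Python sketch below assumes the simplest possible definition: the fraction of SNP sites genotyped in both samples at which the two genotypes agree. The function name, the site keys and the genotype strings are all hypothetical.

```python
def genotype_concordance(sample_a, sample_b):
    """Fraction of jointly genotyped sites with identical genotypes.

    Hypothetical helper, not BAMixChecker's implementation. Each argument maps
    a site identifier to a genotype string, or to None when there is no call.
    """
    # Only sites with a call in both samples are informative
    shared = [
        site for site, genotype in sample_a.items()
        if genotype is not None and sample_b.get(site) is not None
    ]
    if not shared:
        return None  # concordance is undefined without shared calls
    matches = sum(1 for site in shared if sample_a[site] == sample_b[site])
    return matches / len(shared)


# Made-up genotypes for two BAM files that may come from the same individual
tumor = {"chr1:100": "0/1", "chr1:200": "1/1", "chr2:50": "0/0", "chr3:5": None}
normal = {"chr1:100": "0/1", "chr1:200": "0/1", "chr2:50": "0/0", "chr3:5": "0/1"}
score = genotype_concordance(tumor, normal)
print(score)
```

Under this assumed definition, a pair of files from the same individual should score close to 1, while a swapped or orphan sample should score noticeably lower against its expected partner. Any threshold for calling a mismatch would have to be chosen separately.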

Physiological RNA dynamics in RNA-Seq analysis

Briefings in Bioinformatics ◽

10.1093/bib/bby045 ◽

2018 ◽

Vol 20 (5) ◽

pp. 1725-1733 ◽

Cited By ~ 1

Author(s):

Zhongneng Xu ◽

Shuichi Asakawa

Keyword(s):

RNA Degradation ◽

Decay Rates ◽

Sequence Length ◽

Nucleotide Polymorphisms ◽

RNA Seq ◽

Sequencing Data ◽

Single Nucleotide ◽

RNA Dynamics ◽

RNA Quantification ◽

RNA Accumulation

Abstract Physiological RNA dynamics cause problems in transcriptome analysis. Physiological RNA accumulation affects the analysis of RNA quantification, and physiological RNA degradation affects the analysis of the RNA sequence length, feature site and quantification. In the present article, we review the effects of physiological degradation and accumulation of RNA on analysing RNA sequencing data. Physiological RNA accumulation and degradation have probably led to such phenomena as incorrect estimations of transcription quantification, differential expression, co-expression, RNA decay rates, alternative splicing, boundaries of transcription, novel genes, new single-nucleotide polymorphisms, small RNAs and gene fusion. Thus, the transcriptomic data obtained to date warrant further scrutiny. New and improved techniques and bioinformatics software are needed to produce accurate data in transcriptome research.

Risk prediction and marker selection in nonsynonymous single nucleotide polymorphisms using whole genome sequencing data

Animal Cells and Systems ◽

10.1080/19768354.2020.1860125 ◽

2020 ◽

Vol 24 (6) ◽

pp. 321-328

Author(s):

Young-Sup Lee ◽

KyeongHye Won ◽

Donghyun Shin ◽

Jae-Don Oh

Keyword(s):

Single Nucleotide Polymorphisms ◽

Whole Genome Sequencing ◽

Risk Prediction ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

Single Nucleotide ◽

Marker Selection

Investigation of allele specific expression in various tissues of broiler chickens using the detection tool VADT

Scientific Reports ◽

10.1038/s41598-021-83459-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

M. Joseph Tomlinson ◽

Shawn W. Polson ◽

Jing Qiu ◽

Juniper A. Lake ◽

William Lee ◽

...

Keyword(s):

Broiler Chickens ◽

Nucleotide Polymorphisms ◽

RNA Seq ◽

Specific Expression ◽

Single Nucleotide ◽

Allele Specific Expression ◽

Detection Tool ◽

Commercial Broiler ◽

Significant Phenomenon ◽

Allele Specific

Abstract Differential abundance of allelic transcripts in a diploid organism, commonly referred to as allele specific expression (ASE), is a biologically significant phenomenon and can be examined using single nucleotide polymorphisms (SNPs) from RNA-seq. Quantifying ASE aids in our ability to identify and understand cis-regulatory mechanisms that influence gene expression, and thereby assists in identifying causal mutations. This study examines ASE in breast muscle, abdominal fat, and liver of commercial broiler chickens using variants called from a large subset of the samples (n = 68). ASE analysis was performed using custom software called VCF ASE Detection Tool (VADT), which detects ASE of biallelic SNPs using a binomial test. On average, ~174,000 SNPs in each tissue passed our filtering criteria and were considered informative, of which ~24,000 (~14%) showed ASE. Of all ASE SNPs, only 3.7% exhibited ASE in all three tissues, with ~83% showing ASE specific to a single tissue. When ASE genes (genes containing ASE SNPs) were compared between tissues, the overlap among all three tissues increased to 20.1%. Our results indicate that ASE genes show tissue-specific enrichment patterns, but all three tissues showed enrichment for pathways involved in translation.
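The abstract states only that VADT applies a binomial test to biallelic SNPs. As a rough illustration of that idea, and not VADT's actual code, the Python sketch below computes an exact two-sided binomial p-value from reference and alternate read counts at a single SNP. The function name, the two-sided form of the test and the null proportion of 0.5 are assumptions. The sketch applies no multiple-testing correction, which a real analysis over many SNPs would need.

```python
from math import exp, lgamma, log


def binomial_two_sided_p(ref_reads, alt_reads, p_ref=0.5):
    """Exact two-sided binomial p-value for read counts at one biallelic SNP.

    Hypothetical helper, not part of VADT. The assumed null hypothesis is that
    each read carries the reference allele with probability p_ref, which must
    lie strictly between 0 and 1.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, so no evidence of allelic imbalance

    def pmf(k):
        # Binomial probability of k reference reads, computed in log space
        log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return exp(log_coeff + k * log(p_ref) + (n - k) * log(1 - p_ref))

    observed = pmf(ref_reads)
    # Add up every outcome that is at least as unlikely as the observed one;
    # the small relative tolerance absorbs floating-point ties
    tail = sum(q for q in map(pmf, range(n + 1)) if q <= observed * (1 + 1e-7))
    return min(1.0, tail)


balanced = binomial_two_sided_p(5, 5)   # equal support for both alleles
skewed = binomial_two_sided_p(10, 0)    # all reads carry the reference allele
print(balanced, skewed)
```

With 10 reference reads and no alternate reads, only the two most extreme outcomes (0 or 10 reference reads) are as unlikely as the observed one, so the p-value is 2/1024. With 5 reads of each allele the p-value is 1.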

Worldwide tracing of mutations and the evolutionary dynamics of SARS-CoV-2

10.1101/2020.08.07.242263 ◽

2020 ◽

Author(s):

Zhong-Yin Zhou ◽

Hang Liu ◽

Yue-Dong Zhang ◽

Yin-Qiao Wu ◽

Min-Sheng Peng ◽

...

Keyword(s):

Substitution Rate ◽

Evolutionary Dynamics ◽

Vaccine Development ◽

Sequence Data ◽

Immune Recognition ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Protein Coding ◽

Theoretical Support ◽

Recurrent Mutations

Abstract Understanding the mutational and evolutionary dynamics of SARS-CoV-2 is essential for treating COVID-19 and developing a vaccine. Here, we analyzed 15,818 publicly available assembled SARS-CoV-2 genome sequences, along with 2,350 raw sequence datasets sampled worldwide. We investigated the distribution of inter-host single nucleotide polymorphisms (inter-host SNPs) and intra-host single nucleotide variations (iSNVs). Mutations have been observed at 35.6% (10,649/29,903) of the bases in the genome. The substitution rate in some protein coding regions is higher than the average in SARS-CoV-2 viruses, and the high substitution rate in some regions might be driven by diversifying selection to escape immune recognition. Both recurrent mutations and human-to-human transmission are mechanisms that generate fitness advantageous mutations. Furthermore, the frequency of three mutations (S protein, F400L; ORF3a protein, T164I; and ORF1a protein, Q6383H) has gradually increased over time on lineages, which provides new clues for the early detection of fitness advantageous mutations. Our study provides theoretical support for vaccine development and the optimization of treatment for COVID-19. We call on researchers to submit raw sequence data to public databases.

An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes

10.1101/2021.04.09.439138 ◽

2021 ◽

Author(s):

Thomas K. F. Wong ◽

Teng Li ◽

Louis Ranjard ◽

Steven Wu ◽

Jeet Sukumaran ◽

...

Keyword(s):

DNA Sequences ◽

DNA Barcode ◽

Real Data ◽

Reference Sequence ◽

Nucleotide Polymorphisms ◽

Data Set ◽

Single Nucleotide ◽

Short Read ◽

Pooled Samples ◽

Haplotype Information

Abstract A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual are identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide polymorphism (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian model of inference to estimate the phylogeny of the haplotypes and their relative frequencies, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences.

TADA – a Machine Learning Tool for Functional Annotation based Prioritisation of Putative Pathogenic CNVs

10.1101/2020.06.30.180711 ◽

2020 ◽

Cited By ~ 1

Author(s):

J. Hertzberg ◽

S. Mundlos ◽

M. Vingron ◽

G. Gallone

Keyword(s):

Machine Learning ◽

Functional Annotation ◽

Copy Number Variants ◽

Enrichment Analysis ◽

Computational Prediction ◽

Nucleotide Polymorphisms ◽

Automated Classification ◽

Single Nucleotide ◽

Pathogenic CNVs ◽

Disease Impact

Abstract The computational prediction of disease-associated genetic variation is of fundamental importance for the genomics, genetics and clinical research communities. Whereas the mechanisms and disease impact underlying coding single nucleotide polymorphisms (SNPs) and small Insertions/Deletions (InDels) have been the focus of intense study, little is known about the corresponding impact of structural variants (SVs), which are challenging to detect, phase and interpret. Few methods have been developed to prioritise larger chromosomal alterations such as Copy Number Variants (CNVs) based on their pathogenicity. We address this issue with TADA, a method to prioritise pathogenic CNVs through manual filtering and automated classification, based on an extensive catalogue of functional annotation supported by rigorous enrichment analysis. We demonstrate that our machine-learning classifiers for deletions and duplications are able to accurately predict pathogenic CNVs (AUC: 0.8042 and 0.7869, respectively) and produce a well-calibrated pathogenicity score. The combination of enrichment analysis and classifications suggests that prioritisation of pathogenic CNVs based on functional annotation is a promising approach to support clinical diagnostics and to further the understanding of mechanisms that control the disease impact of larger genomic alterations.

Genome-wide profiling in colorectal cancer identifies PHF19 and TBC1D16 as oncogenic super enhancers

Nature Communications ◽

10.1038/s41467-021-26600-5 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Qing-Lan Li ◽

Xiang Lin ◽

Ya-Li Yu ◽

Lin Chen ◽

Qi-Xin Hu ◽

...

Keyword(s):

Colorectal Cancer ◽

Colorectal Cancer Patient ◽

Nucleotide Polymorphisms ◽

RNA Seq ◽

Motif Analysis ◽

Single Nucleotide ◽

Super Enhancer ◽

Genome Wide ◽

Functional Factors ◽

Cancer Tissues

Abstract Colorectal cancer is one of the most common cancers in the world. Although genomic mutations and single nucleotide polymorphisms have been extensively studied, the epigenomic status in colorectal cancer patient tissues remains elusive. Here, together with genomic and transcriptomic analysis, we use ChIP-Seq to profile active enhancers at the genome wide level in colorectal cancer paired patient tissues (tumor and adjacent tissues from the same patients). In total, we sequence 73 pairs of colorectal cancer tissues and generate 147 H3K27ac ChIP-Seq, 144 RNA-Seq, 147 whole genome sequencing and 86 H3K4me3 ChIP-Seq samples. Our analysis identifies 5590 gain and 1100 lost variant enhancer loci in colorectal cancer, and 334 gain and 121 lost variant super enhancer loci. Multiple key transcription factors in colorectal cancer are predicted with motif analysis and core regulatory circuitry analysis. Further experiments verify the function of the super enhancers governing PHF19 and TBC1D16 in regulating colorectal cancer tumorigenesis, and KLF3 is identified as an oncogenic transcription factor in colorectal cancer. Taken together, our work provides an important epigenomic resource and functional factors for epigenetic studies in colorectal cancer.
