VAPPER: High-throughput Variant Antigen Profiling in African trypanosomes


10.1101/492074 ◽

2018 ◽

Author(s):

Sara Silva Pereira ◽

John Heap ◽

Andrew R Jones ◽

Andrew P. Jackson

Keyword(s):

High Throughput ◽

Large Scale ◽

Variant Calling ◽

Gene Families ◽

Automated Analysis ◽

Sequencing Data ◽

High Throughput Analysis ◽

African Trypanosomes ◽

Difficult Challenge ◽

Variant Antigen

Background: Analysing variant antigen gene families on a population scale is a difficult challenge for conventional methods of read mapping and variant calling due to the great variability in sequence, copy number and genomic loci. In African trypanosomes, hemoparasites of humans and animals, this is complicated by variant antigen repertoires containing hundreds of genes subject to various degrees of sequence recombination. Findings: We introduce Variant Antigen Profiler (VAPPER), a tool that allows automated analysis of variant antigen repertoires of African trypanosomes. VAPPER produces variant antigen profiles for any isolate of the veterinary pathogens Trypanosoma congolense and Trypanosoma vivax from genomic and transcriptomic sequencing data and delivers publication-ready figures that show how the queried isolate compares with a database of existing strains. VAPPER is implemented in Python. It can be installed to a local Galaxy instance from the ToolShed (https://toolshed.g2.bx.psu.edu/) or locally on a Linux platform via the command line (https://github.com/PGB-LIV/VAPPER). The documentation, requirements, examples, and test data are provided in the GitHub repository. Conclusion: Our approach is the first to allow large-scale analysis of trypanosome variant antigens and establishes two different methodologies that may be applicable to other multi-copy gene families that are otherwise refractory to high-throughput analysis.


VAPPER: High-throughput variant antigen profiling in African trypanosomes of livestock

GigaScience ◽

10.1093/gigascience/giz091 ◽

2019 ◽

Vol 8 (9) ◽

Cited By ~ 1

Author(s):

Sara Silva Pereira ◽

John Heap ◽

Andrew R Jones ◽

Andrew P Jackson

Keyword(s):

High Throughput ◽

Large Scale ◽

Variant Calling ◽

Gene Families ◽

Automated Analysis ◽

Surface Glycoprotein ◽

Sequencing Data ◽

High Throughput Analysis ◽

African Trypanosomes ◽

Variant Antigen

Abstract Background: Analysing variant antigen gene families on a population scale is a difficult challenge for conventional methods of read mapping and variant calling due to the great variability in sequence, copy number, and genomic loci. In African trypanosomes, hemoparasites of humans and animals, this is complicated by variant antigen repertoires containing hundreds of genes subject to various degrees of sequence recombination. Findings: We introduce Variant Antigen Profiler (VAPPER), a tool that allows automated analysis of the variant surface glycoprotein repertoires of the most prevalent livestock African trypanosomes. VAPPER produces variant antigen profiles for any isolate of the veterinary pathogens Trypanosoma congolense and Trypanosoma vivax from genomic and transcriptomic sequencing data and delivers publication-ready figures that show how the queried isolate compares with a database of existing strains. VAPPER is implemented in Python. It can be installed to a local Galaxy instance from the ToolShed (https://toolshed.g2.bx.psu.edu/) or locally on a Linux platform via the command line (https://github.com/PGB-LIV/VAPPER). The documentation, requirements, examples, and test data are provided in the GitHub repository. Conclusion: By establishing two different, yet comparable methodologies, our approach is the first to allow large-scale analysis of African trypanosome variant antigens, large multi-copy gene families that are otherwise refractory to high-throughput analysis.


Accurate fetal variant calling in the presence of maternal cell contamination

10.1101/552414 ◽

2019 ◽

Cited By ~ 1

Author(s):

Elena Nabieva ◽

Satyarth Mishra Sharma ◽

Yermek Kapushev ◽

Sofya K. Garushyants ◽

Anna V. Fedotova ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Chorionic Villus ◽

Genetic Diagnosis ◽

Variant Calling ◽

Data Availability ◽

Training Data ◽

Sequencing Data ◽

Maternal Cell ◽

Fetal Dna

Abstract High-throughput sequencing of fetal DNA is a promising and increasingly common method for the discovery of all (or all coding) genetic variants in the fetus, either as part of prenatal screening or diagnosis, or for genetic diagnosis of spontaneous abortions. In many cases, the fetal DNA (from chorionic villi, amniotic fluid, or abortive tissue) can be contaminated with maternal cells, resulting in the mixture of fetal and maternal DNA. This maternal cell contamination (MCC) undermines the assumption, made by traditional variant callers, that each allele in a heterozygous site is covered, on average, by 50% of the reads, and therefore can lead to erroneous genotype calls. We present a panel of methods for reducing the genotyping error in the presence of MCC. All methods start with the output of GATK HaplotypeCaller on the sequencing data for the (contaminated) fetal sample and both of its parents, and additionally rely on information about the MCC fraction (which itself is readily estimated from the high-throughput sequencing data). The first of these methods uses a Bayesian probabilistic model to correct the fetal genotype calls produced by MCC-unaware HaplotypeCaller. The other two methods “learn” the genotype-correction model from examples. We use simulated contaminated fetal data to train and test the models. Using the test sets, we show that all three methods lead to substantially improved accuracy when compared with the original MCC-unaware HaplotypeCaller calls. We then apply the best-performing method to three chorionic villus samples from spontaneously terminated pregnancies. Code and training data availability: https://github.com/bazykinlab/ML-maternal-cell-contamination
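The way MCC shifts allele read fractions away from 50% can be illustrated with a minimal sketch. It is not the authors' Bayesian model: it assumes that a fraction `mcc` of the sampled DNA is maternal, that reads are drawn binomially, that there is a small symmetric per-read error rate, and that the prior over fetal genotypes is flat. All function and parameter names are hypothetical.

```python
from math import comb


def expected_alt_fraction(fetal_alt, maternal_alt, mcc, error=0.01):
    """Expected fraction of alternate-allele reads at one site.

    fetal_alt / maternal_alt: number of alternate alleles (0, 1 or 2).
    mcc: assumed fraction of the sampled DNA that is maternal.
    error: assumed symmetric per-read error rate (illustrative value).
    """
    pure = (1 - mcc) * fetal_alt / 2 + mcc * maternal_alt / 2
    # Fold in read errors so the probability is never exactly 0 or 1.
    return pure * (1 - error) + (1 - pure) * error


def fetal_genotype_posterior(alt_reads, total_reads, maternal_alt, mcc,
                             prior=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior over fetal genotypes 0/1/2 under a binomial read model."""
    weighted = []
    for genotype, prior_g in zip((0, 1, 2), prior):
        p = expected_alt_fraction(genotype, maternal_alt, mcc)
        likelihood = (comb(total_reads, alt_reads)
                      * p ** alt_reads
                      * (1 - p) ** (total_reads - alt_reads))
        weighted.append(likelihood * prior_g)
    total = sum(weighted)
    return [w / total for w in weighted]


if __name__ == "__main__":
    # Mother heterozygous, 30% contamination, 15 of 100 reads alternate.
    print(fetal_genotype_posterior(15, 100, maternal_alt=1, mcc=0.3))
```

With a heterozygous mother, a homozygous-reference fetus, and 30% contamination, the error-free expected alternate fraction is 0.15 rather than 0, which is the kind of signal a contamination-unaware caller can misread.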


Genome Assemblies of the Warthog and Kenyan Domestic Pig Provide Insights into Suidae Evolution and Candidate Genes for African Swine Fever Tolerance

10.1101/2021.12.17.473133 ◽

2021 ◽

Author(s):

Wen Feng ◽

Lei Zhou ◽

Pengju Zhao ◽

Heng Du ◽

Chenguang Diao ◽

...

Keyword(s):

Large Scale ◽

Genetic Resistance ◽

African Swine Fever ◽

Gene Families ◽

Specific Gene ◽

Chromosome 2 ◽

Sequencing Data ◽

Domestic Pig ◽

Phacochoerus Africanus ◽

Contraction And Expansion

As the warthog (Phacochoerus africanus) has innate immunity against African swine fever (ASF), understanding the evolutionary novelty of the warthog is critical to explaining its specific ASF resistance. Here, we present two new complete genomes, one from a warthog and one from a Kenyan domestic pig, as fundamental genomic references for decoding the genetic mechanism of ASF tolerance. Our results indicate that multiple genomic variations, including gene losses and the independent contraction and expansion of specific gene families, likely moulded the warthog genome to adapt to its environment. Importantly, analysis of the presence and absence of genomic sequences revealed that the warthog genome lacks a DNA sequence of the lactate dehydrogenase B (LDHB) gene on chromosome 2 compared with the reference genome. Overexpression and siRNA knockdown of LDHB indicated an inhibitory effect on the replication of ASFV. Combined with large-scale sequencing data from 123 pigs from around the world, the contraction and expansion of TRIM gene families revealed that TRIM family genes in the warthog genome are potentially responsible for its tolerance to ASF. Our results will help further improve the understanding of genetic resistance to ASF in pigs.


Variant calling and quality control of large-scale human genome sequencing data

Emerging Topics in Life Sciences ◽

10.1042/etls20190007 ◽

2019 ◽

Vol 3 (4) ◽

pp. 399-409 ◽

Cited By ~ 1

Author(s):

Brandon Jew ◽

Jae Hoon Sul

Keyword(s):

Quality Control ◽

Genome Sequencing ◽

Genetic Variants ◽

Large Scale ◽

Variant Calling ◽

Sequencing Data ◽

Computational Approaches ◽

Sequencing Errors ◽

Human Genome Sequencing ◽

Number Of Individuals

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.
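One common form of the QC step described above is filtering on call rate (the fraction of non-missing genotypes). The sketch below is illustrative only and is not taken from the review: the data layout, the function names, and the default 0.95 thresholds are all assumptions.

```python
def call_rate(calls):
    """Fraction of genotype calls that are not missing (None)."""
    return sum(c is not None for c in calls) / len(calls)


def qc_filter(genotypes, min_sample_rate=0.95, min_variant_rate=0.95):
    """Drop low-call-rate samples first, then low-call-rate variants.

    genotypes: dict mapping sample name -> list of calls per variant,
               each call being 0, 1, 2 (alternate-allele count) or None.
    Returns (kept sample names, kept variant indices).
    """
    kept_samples = [s for s, calls in genotypes.items()
                    if call_rate(calls) >= min_sample_rate]
    if not kept_samples:
        return [], []
    n_variants = len(genotypes[kept_samples[0]])
    kept_variants = [
        v for v in range(n_variants)
        if call_rate([genotypes[s][v] for s in kept_samples]) >= min_variant_rate
    ]
    return kept_samples, kept_variants


if __name__ == "__main__":
    data = {
        "s1": [0, 1, 2, 0],
        "s2": [0, None, 2, 1],
        "s3": [None, None, None, 1],
    }
    print(qc_filter(data, min_sample_rate=0.5, min_variant_rate=0.9))
```

Filtering samples before variants is a design choice in this sketch: a few poor-quality individuals would otherwise lower the call rate of every variant.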


ADFinder: accurate detection of programmed DNA elimination using NGS high-throughput sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa226 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3632-3636 ◽

Cited By ~ 2

Author(s):

Weibo Zheng ◽

Jing Chen ◽

Thomas G Doak ◽

Weibo Song ◽

Ying Yan

Keyword(s):

High Throughput ◽

Large Scale ◽

High Throughput Sequencing ◽

Supplementary Information ◽

Sequencing Data ◽

Source Codes ◽

High Throughput Sequencing Data ◽

Dna Elimination ◽

Multiple Alternative ◽

Dna Splicing

Abstract Motivation: Programmed DNA elimination (PDE) plays a crucial role in the transitions between germline and somatic genomes in diverse organisms ranging from unicellular ciliates to multicellular nematodes. However, software specific for the detection of DNA splicing events is scarce. In this paper, we describe Accurate Deletion Finder (ADFinder), an efficient detector of PDEs using high-throughput sequencing data. ADFinder can predict PDEs with relatively low sequencing coverage, detect multiple alternative splicing forms in the same genomic location and calculate the frequency for each splicing event. This software will facilitate research on PDEs and all downstream analyses. Results: By analyzing genome-wide DNA splicing events in two micronuclear genomes of Oxytricha trifallax and Tetrahymena thermophila, we show that ADFinder is effective in predicting large-scale PDEs. Availability and implementation: The source code and manual of ADFinder are available on GitHub: https://github.com/weibozheng/ADFinder. Supplementary information: Supplementary data are available at Bioinformatics online.


ZF-AutoML: An Easy Machine-Learning-Based Method to Detect Anomalies in Fluorescent-Labelled Zebrafish

Inventions ◽

10.3390/inventions4040072 ◽

2019 ◽

Vol 4 (4) ◽

pp. 72

Author(s):

Ryota Sawaki ◽

Daisuke Sato ◽

Hiroko Nakayama ◽

Yuki Nakagawa ◽

Yasuhito Shimada

Keyword(s):

Machine Learning ◽

Image Analysis ◽

High Throughput ◽

High Throughput Screening ◽

Large Scale ◽

Fluorescent Protein ◽

Learning Program ◽

High Throughput Analysis ◽

Easy Method ◽

High Fecundity

Background: Zebrafish are efficient animal models for conducting whole organism drug testing and toxicological evaluation of chemicals. They are frequently used for high-throughput screening owing to their high fecundity. Peripheral experimental equipment and analytical software are required for zebrafish screening, which need to be further developed. Machine learning has emerged as a powerful tool for large-scale image analysis and has been applied in zebrafish research as well. However, its use by individual researchers is restricted due to the cost and the procedure of machine learning for specific research purposes. Methods: We developed a simple and easy method for zebrafish image analysis, particularly fluorescent labelled ones, using the free machine learning program Google AutoML. We performed machine learning using vascular- and macrophage-Enhanced Green Fluorescent Protein (EGFP) fishes under normal and abnormal conditions (treated with anti-angiogenesis drugs or by wounding the caudal fin). Then, we tested the system using a new set of zebrafish images. Results: While machine learning can detect abnormalities in the fish in both strains with more than 95% accuracy, the learning procedure needs image pre-processing for the images of the macrophage-EGFP fishes. In addition, we developed a batch uploading software, ZF-ImageR, for Windows (.exe) and MacOS (.app) to enable high-throughput analysis using AutoML. Conclusions: We established a protocol to utilize conventional machine learning platforms for analyzing zebrafish phenotypes, which enables fluorescence-based, phenotype-driven zebrafish screening.


Determination of Clonality and Relatedness of Vibrio cholerae Isolates by Genomic Fingerprinting, Using Long-Range Repetitive Element Sequence-Based PCR

Applied and Environmental Microbiology ◽

10.1128/aem.00151-08 ◽

2008 ◽

Vol 74 (17) ◽

pp. 5392-5401 ◽

Cited By ~ 13

Author(s):

Nipa Chokesajjawatee ◽

Young-Gun Zo ◽

Rita R. Colwell

Keyword(s):

Vibrio Cholerae ◽

Long Range ◽

High Throughput ◽

Large Scale ◽

Rapid Screening ◽

Genomic Fingerprinting ◽

High Throughput Analysis ◽

Consensus Sequences ◽

Specialized Equipment ◽

Long Range Pcr

ABSTRACT A high-throughput method which is applicable for rapid screening, identification, and delineation of isolates of Vibrio cholerae, sensitive to genome variation, and capable of providing phylogenetic inferences enhances environmental monitoring of this bacterium. We have developed and optimized a method for genomic fingerprinting of V. cholerae based on long-range PCR. The method uses a primer set directed to enterobacterial repetitive intergenic consensus sequences, a high-fidelity DNA polymerase, and analysis via conventional agarose gel electrophoresis. Long (∼10 kb), highly reproducible amplicons were generated from V. cholerae isolates, including those from different geographical locations and historical strains isolated during the period 1931-2000. The amplicons yielded reduced variability in their densitometric band patterns to ≤10% and clonal distinction at <90% similarity. Rapid band-matching analysis was accomplished for fingerprints with ≥90% similarity, discriminating O serotypes and biotypes (classical versus El Tor) as well as pathogenic and nonpathogenic strains. Compared to genome similarity measured by DNA-DNA hybridization, the results showed good correlation (r = 0.7; P < 0.001), with five times less measurement error and without bias. The method permits both phylogenetic inference and clonal differentiation of individual V. cholerae strains, enables robust, high-throughput analysis, and does not require specialized equipment to perform. With access to a curated public database furnished with appropriate analytical software applications, the method should prove useful in large-scale multilaboratory surveys, especially those designed to detect specific pathogens in the natural environment.


Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1417683112 ◽

2015 ◽

Vol 112 (8) ◽

pp. E862-E870 ◽

Cited By ~ 101

Author(s):

Daniel Gadala-Maria ◽

Gur Yaari ◽

Mohamed Uduman ◽

Steven H. Kleinstein

Keyword(s):

B Cell ◽

Large Scale ◽

Gene Segment ◽

Automated Analysis ◽

Sequencing Data ◽

Cell Repertoire ◽

Accurate Analysis ◽

V Genes ◽

Repertoire Sequencing ◽

Novel Alleles

Individual variation in germline and expressed B-cell immunoglobulin (Ig) repertoires has been associated with aging, disease susceptibility, and differential response to infection and vaccination. Repertoire properties can now be studied at large scale through next-generation sequencing of rearranged Ig genes. Accurate analysis of these repertoire-sequencing (Rep-Seq) data requires identifying the germline variable (V), diversity (D), and joining (J) gene segments used by each Ig sequence. Current V(D)J assignment methods work by aligning sequences to a database of known germline V(D)J segment alleles. However, existing databases are likely to be incomplete and novel polymorphisms are hard to differentiate from the frequent occurrence of somatic hypermutations in Ig sequences. Here we develop a Tool for Ig Genotype Elucidation via Rep-Seq (TIgGER). TIgGER analyzes mutation patterns in Rep-Seq data to identify novel V segment alleles, and also constructs a personalized germline database containing the specific set of alleles carried by a subject. This information is then used to improve the initial V segment assignments from existing tools, like IMGT/HighV-QUEST. The application of TIgGER to Rep-Seq data from seven subjects identified 11 novel V segment alleles, including at least one in every subject examined. These novel alleles constituted 13% of the total number of unique alleles in these subjects, and impacted 3% of V(D)J segment assignments. These results reinforce the highly polymorphic nature of human Ig V genes, and suggest that many novel alleles remain to be discovered. The integration of TIgGER into Rep-Seq processing pipelines will increase the accuracy of V segment assignments, thus improving B-cell repertoire analyses.


Assessing Study Reproducibility through M2RI: A Novel Approach for Large-scale High-throughput Association Studies

10.1101/2020.08.18.253740 ◽

2020 ◽

Author(s):

Zeyu Jiao ◽

Yinglei Lai ◽

Jujiao Kang ◽

Weikang Gong ◽

Liang Ma ◽

...

Keyword(s):

Sample Size ◽

Rna Sequencing ◽

High Throughput ◽

Large Scale ◽

Association Studies ◽

Structural Mri ◽

Data Sets ◽

Sequencing Data ◽

Novel Approach ◽

Magnetic Resonance Imaging Mri

Abstract High-throughput technologies, such as magnetic resonance imaging (MRI) and DNA/RNA sequencing (DNA-seq/RNA-seq), have been increasingly used in large-scale association studies. With these technologies, important biomedical research findings have been generated. The reproducibility of these findings, especially from structural MRI (sMRI) and functional MRI (fMRI) association studies, has recently been questioned. There is an urgent demand for a reliable overall reproducibility assessment for large-scale high-throughput association studies. It is also desirable to understand the relationship between study reproducibility and sample size in an experimental design. In this study, we developed a novel approach: the mixture model reproducibility index (M2RI) for assessing study reproducibility of large-scale association studies. With M2RI, we performed study reproducibility analysis for several recent large sMRI/fMRI data sets. The advantages of our approach were clearly demonstrated, and the sample size requirements for different phenotypes were also clearly demonstrated, especially when compared to the Dice coefficient (DC). We applied M2RI to compare two MRI or RNA sequencing data sets. The reproducibility assessment results were consistent with our expectations. In summary, M2RI is a novel and useful approach for assessing study reproducibility, calculating sample sizes and evaluating the similarity between two closely related studies.
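The Dice coefficient used as the comparison baseline is the standard set-overlap measure 2|A∩B| / (|A| + |B|). The sketch below assumes each study is summarized as the set of features it reports as significant. The feature identifiers are hypothetical, and M2RI itself is not reproduced here.

```python
def dice_coefficient(findings_a, findings_b):
    """Dice coefficient between two sets of significant findings.

    Returns 2 * |A & B| / (|A| + |B|), a value between 0 and 1.
    Two empty sets are treated as identical (1.0); this is a convention
    chosen for the sketch, not something stated in the abstract.
    """
    a, b = set(findings_a), set(findings_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    # Hypothetical feature identifiers reported by two studies.
    study_1 = {"region_01", "region_07", "region_12"}
    study_2 = {"region_07", "region_12", "region_30"}
    print(dice_coefficient(study_1, study_2))
```

Because the coefficient depends only on which features pass a significance threshold, it changes with the threshold and with sample size, which is one reason an alternative index might be of interest.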


NGSphy: phylogenomic simulation of next-generation sequencing data

10.1101/197715 ◽

2017 ◽

Author(s):

Merly Escalona ◽

Sara Rocha ◽

David Posada

Keyword(s):

Next Generation Sequencing ◽

Variant Calling ◽

Gene Families ◽

Common Species ◽

Next Generation Sequencing Data ◽

Phylogenomic Analysis ◽

Next Generation ◽

Sequencing Data ◽

Sequencing Technologies ◽

Generation Sequencing

Abstract Motivation: Advances in sequencing technologies have made it feasible to obtain massive datasets for phylogenomic inference, often consisting of large numbers of loci from multiple species and individuals. The phylogenomic analysis of next-generation sequencing (NGS) data implies a complex computational pipeline where multiple technical and methodological decisions are necessary that can influence the final tree obtained, like those related to coverage, assembly, mapping, variant calling and/or phasing. Results: To assess the influence of these variables we introduce NGSphy, an open-source tool for the simulation of Illumina reads/read counts obtained from haploid/diploid individual genomes with thousands of independent gene families evolving under a common species tree. In order to resemble real NGS experiments, NGSphy includes multiple options to model sequencing coverage (depth) heterogeneity across species, individuals and loci, including off-target or uncaptured loci. For comprehensive simulations covering multiple evolutionary scenarios, parameter values for the different replicates can be sampled from user-defined statistical distributions. Availability: Source code, full documentation and tutorials including a quick start guide are available at http://github.com/merlyescalona/[email protected]. [email protected]
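The idea of coverage heterogeneity across individuals and loci, with some loci off-target, can be sketched as follows. This is not NGSphy's implementation or configuration syntax: the gamma-distributed multipliers, the parameter values, and all names are assumptions made for illustration.

```python
import random


def sample_coverage(n_individuals, n_loci, expected_depth=30.0,
                    shape=5.0, off_target_fraction=0.1, seed=0):
    """Sample a per-individual, per-locus sequencing depth matrix.

    Each individual and each locus gets a gamma-distributed multiplier
    with mean 1; off-target loci receive zero depth for every individual.
    Returns (depth matrix as list of rows, list of off-target flags).
    """
    rng = random.Random(seed)
    # gammavariate(alpha, beta) has mean alpha * beta, so beta = 1 / alpha
    # gives multipliers centred on 1.
    individual_mult = [rng.gammavariate(shape, 1.0 / shape)
                       for _ in range(n_individuals)]
    locus_mult = [rng.gammavariate(shape, 1.0 / shape)
                  for _ in range(n_loci)]
    off_target = [rng.random() < off_target_fraction for _ in range(n_loci)]

    depth = []
    for i in range(n_individuals):
        row = [0.0 if off_target[l]
               else expected_depth * individual_mult[i] * locus_mult[l]
               for l in range(n_loci)]
        depth.append(row)
    return depth, off_target


if __name__ == "__main__":
    depth, off_target = sample_coverage(n_individuals=3, n_loci=5, seed=42)
    for row in depth:
        print([round(d, 1) for d in row])
```

Fixing the seed makes each simulated replicate reproducible, and varying `shape` or `off_target_fraction` between replicates mimics drawing parameter values from user-defined distributions.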
