pdxBlacklist: Identifying artefactual variants in patient-derived xenograft samples

10.1101/180752 ◽

2017 ◽

Cited By ~ 1

Author(s):

Max Salm ◽

Sven-Eric Schelhorn ◽

Lee Lancashire ◽

Thomas Grombacher

Keyword(s):

Human Genome ◽

Human Tissue ◽

False Positive ◽

Tumor Xenograft ◽

Supplementary Information ◽

Supplementary Data ◽

Variant Call ◽

Patient Derived Xenograft ◽

Novel Approach ◽

Supplementary Material

Summary: Patient-derived tumor xenograft (PDX) samples typically represent a mixture of mouse and human tissue. Variant call sets derived from sequencing such samples are commonly contaminated with false-positive variants that arise when mouse-derived reads are mapped to the human genome. pdxBlacklist is a novel approach designed to rapidly identify these false-positive variants, and thus significantly improve variant call set quality. Availability: pdxBlacklist is freely available on GitHub: https://github.com/MaxSalm/pdxBlacklist Contact: [email protected] Supplementary information: Supplementary data are available.
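The abstract does not describe how pdxBlacklist works internally, so the following is only a minimal sketch of the general blacklist idea the name suggests: given a precompiled set of variants known to arise from mouse reads mapped to the human genome, flag any call that matches it. The `Variant` type and `filter_blacklisted` function are hypothetical names chosen for illustration and are not part of the pdxBlacklist tool.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not pdxBlacklist's data model)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def filter_blacklisted(
    calls: Iterable[Variant], blacklist: Iterable[Variant]
) -> Tuple[List[Variant], List[Variant]]:
    """Split calls into (kept, flagged) by exact match against a blacklist.

    Assumption: the blacklist was compiled beforehand and uses the same
    coordinates and allele representation as the call set.
    """
    blocked = set(blacklist)
    kept: List[Variant] = []
    flagged: List[Variant] = []
    for variant in calls:
        # Exact match on chromosome, position, reference and alternate allele.
        (flagged if variant in blocked else kept).append(variant)
    return kept, flagged


calls = [Variant("chr1", 100, "A", "G"), Variant("chr2", 200, "C", "T")]
blacklist = [Variant("chr2", 200, "C", "T")]
kept, flagged = filter_blacklisted(calls, blacklist)
```

Exact matching is the simplest possible design; a real filter would probably also need to normalise allele representation before comparing.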

PathScore: a web tool for identifying altered pathways in cancer data

10.1101/067090 ◽

2016 ◽

Cited By ~ 2

Author(s):

Stephen G. Gaffney ◽

Jeffrey P. Townsend

Keyword(s):

Web Application ◽

Somatic Mutations ◽

Supplementary Information ◽

Web Tool ◽

Cancer Data ◽

Link Type ◽

Novel Approach ◽

Supplementary Material ◽

User Friendly ◽

Pathway Effect

Summary: PathScore quantifies the level of enrichment of somatic mutations within curated pathways, applying a novel approach that identifies pathways enriched across patients. The application provides several user-friendly, interactive graphic interfaces for data exploration, including tools for comparing pathway effect sizes, significance, gene-set overlap and enrichment differences between projects. Availability and implementation: Web application available at pathscore.publichealth.yale.edu. Site implemented in Python and MySQL, with all major browsers supported. Source code available at github.com/sggaffney/pathscore with a GPLv3 [email protected] Supplementary information: Additional documentation can be found at http://pathscore.publichealth.yale.edu/faq.

Crosslink: A fast, scriptable genetic mapper for outcrossing species

10.1101/135277 ◽

2017 ◽

Cited By ~ 6

Author(s):

Robert J. Vickerstaff ◽

Richard J. Harrison

Keyword(s):

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Mapping Software ◽

Outcrossing Species ◽

Supplementary Material ◽

Novel Approaches ◽

Similar Accuracy ◽

General Public License

Summary: Crosslink is genetic mapping software for outcrossing species designed to run efficiently on large datasets by combining the best from existing tools with novel approaches. Tests show it runs much faster than several comparable programs whilst retaining a similar accuracy. Availability and implementation: Available under the GNU General Public License version 2 from https://github.com/eastmallingresearch/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online and from https://github.com/eastmallingresearch/crosslink/releases/tag/v0.5.

FQSqueezer: k-mer-based compression of sequencing data

10.1101/559807 ◽

2019 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz

Keyword(s):

Data Compression ◽

State Of The Art ◽

Genomic Data ◽

General Purpose ◽

Supplementary Information ◽

Supplementary Data ◽

Sequencing Data ◽

Partial Matching ◽

Supplementary Material ◽

Better Than

Motivation: The amount of genomic data that needs to be stored is huge, so it is not surprising that much work has gone into specialized compression of FASTQ files. The existing algorithms are, however, still imperfect, and even the best tools produce quite large archives. Results: We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the prediction by partial matching and dynamic Markov coder algorithms known from general-purpose compressors. The compression ratios are often tens of percent better than those offered by state-of-the-art tools. Availability and implementation: https://github.com/refresh-bio/[email protected] Supplementary information: Supplementary data are available at the publisher’s website.

GTShark: Genotype compression in large projects

10.1101/494104 ◽

2018 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Danek

Keyword(s):

Web Site ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Large Project ◽

Supplementary Material

Summary: Large sequencing projects now handle tens of thousands of individuals, and the huge files summarizing their findings require compression. We propose a tool able to compress large collections of genotypes, as well as single samples in such projects, to sizes not achievable to date. Availability and implementation: https://github.com/refresh-bio/[email protected] Supplementary information: Supplementary data are available at the publisher’s website.

Haplotype-aware graph indexes

10.1101/559583 ◽

2019 ◽

Cited By ~ 7

Author(s):

Jouni Sirén ◽

Erik Garrison ◽

Adam M. Novak ◽

Benedict Paten ◽

Richard Durbin

Keyword(s):

Genetic Variation ◽

Chromosome 17 ◽

Supplementary Information ◽

Whole Genome ◽

Supplementary Data ◽

1000 Genomes Project ◽

1000 Genomes ◽

Link Type ◽

Supplementary Material ◽

Haplotype Information

Motivation: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are nonbiological, unlikely recombinations of true haplotypes. Results: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows–Wheeler transform (GBWT). We demonstrate the scalability of the new implementation by building a whole-genome index of the 5,008 haplotypes of the 1000 Genomes Project, and an index of all 108,070 TOPMed Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. Availability: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt, and https://github.com/jltsiren/[email protected] Supplementary information: Supplementary data are available.

VCFShark: how to squeeze a VCF file

Bioinformatics ◽

10.1093/bioinformatics/btab211 ◽

2021 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Danek ◽

Marek Kokot

Keyword(s):

Large Datasets ◽

Main Memory ◽

Supplementary Information ◽

Genotype Data ◽

Supplementary Data ◽

Variant Call Format ◽

Variant Call ◽

Order Of Magnitude ◽

Better Than ◽

De Facto Standards

Summary: Variant Call Format (VCF) files holding the results of sequencing projects take a lot of space. We propose VCFShark, which is able to compress VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). The advantage over competitors is greatest when compressing VCF files containing large amounts of genotype data. Processing speeds of up to 100 MB/s and main memory requirements below 30 GB allow the tool to be used on typical workstations even for large datasets. Availability and implementation: https://github.com/refresh-bio/vcfshark. Supplementary information: Supplementary data are available at Bioinformatics online.

Compact and evenly distributed k-mer binning for genomic sequences

Bioinformatics ◽

10.1093/bioinformatics/btab156 ◽

2021 ◽

Cited By ~ 1

Author(s):

Johan Nyström-Persson ◽

Gabriel Keeble-Gagnère ◽

Niamat Zawad

Keyword(s):

New Combination ◽

Supplementary Information ◽

Supplementary Data ◽

Size Estimation ◽

Counting Method ◽

Large Dataset ◽

Online Supplementary Material ◽

Metagenomics Data ◽

Supplementary Material ◽

Processing Algorithms

Motivation: The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers (ordered m-mers where m < k) are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing, as well as generally increase memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored. Results: We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available. Availability and implementation: Discount is GPL licensed and available at https://github.com/jtnystrom/discount. The data underlying this article are available in the article and in its online supplementary material. Supplementary information: Supplementary data are available at Bioinformatics online.
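To make the binning step concrete, here is a minimal sketch of grouping k-mers by minimizer. It assumes plain lexicographic ordering of m-mers, not the universal frequency ordering introduced in the paper, and the function names are illustrative and not taken from Discount. An alternative ordering can be supplied through the hypothetical `order` key function.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterator, List, Optional


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def minimizer(kmer: str, m: int, order: Optional[Callable[[str], object]] = None) -> str:
    """Return the smallest m-mer inside kmer.

    With order=None the m-mers are compared lexicographically; any other
    ordering can be passed as a key function.
    """
    candidates = (kmer[i:i + m] for i in range(len(kmer) - m + 1))
    return min(candidates, key=order)


def bin_by_minimizer(
    seq: str, k: int, m: int, order: Optional[Callable[[str], object]] = None
) -> Dict[str, List[str]]:
    """Group the k-mers of seq into bins keyed by their minimizer."""
    bins: Dict[str, List[str]] = defaultdict(list)
    for km in kmers(seq, k):
        bins[minimizer(km, m, order)].append(km)
    return dict(bins)


bins = bin_by_minimizer("ACGTACGT", k=5, m=3)
sizes = {mz: len(members) for mz, members in bins.items()}
```

For `ACGTACGT` with k = 5 and m = 3, three of the four k-mers land in the `ACG` bin and one in `CGT`, a small instance of the uneven bin sizes the abstract describes.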

ATLAS: Analysis Tools for Low-depth and Ancient Samples

10.1101/105346 ◽

2017 ◽

Cited By ~ 22

Author(s):

Vivian Link ◽

Athanasios Kousathanas ◽

Krishna Veeramah ◽

Christian Sell ◽

Amelie Scheu ◽

...

Keyword(s):

Genetic Diversity ◽

Ancient Dna ◽

State Of The Art ◽

Supplementary Information ◽

Post Mortem ◽

Supplementary Data ◽

Supplementary Material ◽

C Program ◽

User Friendly ◽

Proper Analysis

Summary: Post-mortem damage (PMD) obstructs the proper analysis of ancient DNA samples and can currently only be addressed by removing or down-weighting potentially damaged data. Here we present ATLAS, a suite of methods to accurately genotype and estimate genetic diversity from ancient samples, while accounting for PMD. It works directly from raw BAM files and enables the building of complete and customized pipelines for the analysis of ancient and other low-depth samples in a very user-friendly way. Based on simulations we show that, in the presence of PMD, a dedicated pipeline of ATLAS calls genotypes more accurately than the state-of-the-art pipeline of GATK combined with mapDamage 2.0. Availability: ATLAS is an open-source C++ program freely available at https://bitbucket.org/phaentu/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

dms2dfe: Comprehensive Workflow for Analysis of Deep Mutational Scanning Data

10.1101/072645 ◽

2016 ◽

Cited By ~ 2

Author(s):

Rohan Dandage ◽

Kausik Chakraborty

Keyword(s):

Noise Reduction ◽

High Throughput ◽

Critical Issue ◽

Supplementary Information ◽

Supplementary Data ◽

Selection Pressures ◽

Link Type ◽

Supplementary Material ◽

End To End ◽

Python Package

Summary: High-throughput genotype-to-phenotype (G2P) data are increasingly being generated by the widely applicable Deep Mutational Scanning (DMS) method. dms2dfe is a comprehensive end-to-end workflow that addresses the critical issue of noise reduction and offers a variety of crucial downstream analyses. Noise reduction is carried out by normalizing mutant counts by sequencing depth, followed by dispersion shrinkage when preferential enrichments are calculated. In downstream analyses, the dms2dfe workflow identifies relative selection pressures and potential molecular constraints, and generates data-rich visualizations. Availability: dms2dfe is implemented as a Python package and is available at https://kc-lab.github.io/[email protected], [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
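The depth-normalization step can be illustrated with a short sketch. It covers only the first half of the noise reduction described above (dispersion shrinkage is not implemented), and the function names, the pseudocount and the log2 ratio are assumptions made for illustration, not the dms2dfe API.

```python
import math
from typing import Dict


def normalize_by_depth(counts: Dict[str, float]) -> Dict[str, float]:
    """Convert raw mutant counts into frequencies by dividing by total depth."""
    depth = sum(counts.values())
    if depth <= 0:
        raise ValueError("sequencing depth must be positive")
    return {mutant: count / depth for mutant, count in counts.items()}


def log2_enrichment(
    pre: Dict[str, int], post: Dict[str, int], pseudocount: float = 0.5
) -> Dict[str, float]:
    """Log2 ratio of depth-normalized frequencies after vs. before selection.

    The pseudocount (an assumption of this sketch) keeps the ratio finite
    for mutants missing from one sample; it must be positive whenever a
    count can be zero.
    """
    mutants = set(pre) | set(post)
    pre_freq = normalize_by_depth({m: pre.get(m, 0) + pseudocount for m in mutants})
    post_freq = normalize_by_depth({m: post.get(m, 0) + pseudocount for m in mutants})
    return {m: math.log2(post_freq[m] / pre_freq[m]) for m in mutants}


pre_counts = {"A": 10, "B": 30}
post_counts = {"A": 30, "B": 10}
enrichment = log2_enrichment(pre_counts, post_counts, pseudocount=0.0)
```

Dividing by depth first means that two samples sequenced to different depths yield comparable frequencies, so the ratio reflects selection and not library size.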

LIONS: Analysis Suite for Detecting and Quantifying Transposable Element Initiated Transcription from RNA-seq

10.1101/149864 ◽

2017 ◽

Cited By ~ 2

Author(s):

Artem Babaian ◽

Richard Thompson ◽

Jake Lever ◽

Liane Gagnier ◽

Mohammad M. Karimi ◽

...

Keyword(s):

Transposable Elements ◽

Transposable Element ◽

Test Data ◽

Source Code ◽

Supplementary Information ◽

Transcriptional Networks ◽

Supplementary Data ◽

Rna Seq ◽

Instruction Manual ◽

Supplementary Material

Summary: Transposable Elements (TEs) influence the evolution of novel transcriptional networks, yet the specific and meaningful interpretation of how TE-initiation events contribute to the transcriptome has been marred by computational and methodological deficiencies. We developed LIONS for the analysis of paired-end RNA-seq data to specifically detect and quantify TE-initiated transcripts. Availability: Source code, container, test data and instruction manual are freely available at www.github.com/ababaian/[email protected] or [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
