Best Practices for Benchmarking Germline Small Variant Calls in Human Genomes

2018 ◽  
Author(s):  
Peter Krusche ◽  
Len Trigg ◽  
Paul C. Boutros ◽  
Christopher E. Mason ◽  
Francisco M. De La Vega ◽  
...  

Abstract
Assessing the accuracy of NGS variant calling is immensely facilitated by a robust benchmarking strategy and tools to carry it out in a standard way. Benchmarking variant calls requires careful attention to definitions of performance metrics, sophisticated comparison approaches, and stratification by variant type and genome context. The Global Alliance for Genomics and Health (GA4GH) Benchmarking Team has developed standardized performance metrics and tools for benchmarking germline small variant calls. This team includes representatives from sequencing technology developers, government agencies, academic bioinformatics researchers, clinical laboratories, and commercial technology and bioinformatics developers for whom benchmarking variant calls is essential to their work. Benchmarking variant calls is a challenging problem for many reasons:
- Evaluating variant calls requires complex matching algorithms and standardized counting, because the same variant may be represented differently in truth and query callsets.
- Defining and interpreting the resulting metrics, such as precision (aka positive predictive value = TP/(TP+FP)) and recall (aka sensitivity = TP/(TP+FN)), requires standardization to draw robust conclusions about the comparative performance of different variant calling methods.
- Performance of NGS methods can vary depending on variant type and genome context; as a result, understanding performance requires meaningful stratification.
- High-confidence variant calls and regions that can be used as “truth” to accurately identify false positives and negatives are difficult to define, and reliable calls for the most challenging regions and variants remain out of reach.
We have made significant progress on standardizing comparison methods, metric definitions and reporting, as well as on developing and using truth sets.
Our methods are publicly available on GitHub (https://github.com/ga4gh/benchmarking-tools) and in a web-based app on precisionFDA, which allow users to compare their variant calls against truth sets and to obtain a standardized report on their variant calling performance. Our methods have been piloted in the precisionFDA variant calling challenges to identify the best-in-class variant calling methods within high-confidence regions. Finally, we recommend a set of best practices for using our tools and critically evaluating the results.
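The metric definitions quoted in the abstract can be sketched directly. A minimal illustration, with hypothetical TP/FP/FN counts rather than results from any real benchmark:

```python
# Precision and recall exactly as defined in the GA4GH abstract above.
# The counts are invented placeholders for illustration only.

def precision(tp: int, fp: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Sensitivity: TP / (TP + FN)."""
    return tp / (tp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, often reported alongside them."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

tp, fp, fn = 9_500, 250, 500  # hypothetical SNP comparison counts
print(f"precision={precision(tp, fp):.4f} recall={recall(tp, fn):.4f} f1={f1(tp, fp, fn):.4f}")
```

Note that these counts only become comparable across tools once the matching step the abstract describes has normalized equivalent variant representations.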

2018 ◽  
Author(s):  
Justin M. Zook ◽  
Jennifer McDaniel ◽  
Hemang Parikh ◽  
Haynes Heaton ◽  
Sean A. Irvine ◽  
...  

Abstract
Benchmark small variant calls from the Genome in a Bottle Consortium (GIAB) for the CEPH/HapMap genome NA12878 (HG001) have been used extensively for developing, optimizing, and demonstrating performance of sequencing and bioinformatics methods. Here, we develop a reproducible, cloud-based pipeline to integrate multiple sequencing datasets and form benchmark calls, enabling application to arbitrary human genomes. We use these reproducible methods to form high-confidence calls with respect to GRCh37 and GRCh38 for HG001 and 4 additional broadly-consented genomes from the Personal Genome Project that are available as NIST Reference Materials. These new genomes’ broad, open consent with few restrictions on availability of samples and data is enabling a uniquely diverse array of applications. Our new methods produce 17% more high-confidence SNPs, 176% more indels, and 12% larger regions than our previously published calls. To demonstrate that these calls can be used for accurate benchmarking, we compare other high-quality callsets to ours (e.g., Illumina Platinum Genomes), and we demonstrate that the majority of discordant calls are errors in the other callsets. We also highlight challenges in interpreting performance metrics when benchmarking against imperfect high-confidence calls. We show that benchmarking tools from the Global Alliance for Genomics and Health can be used with our calls to stratify performance metrics by variant type and genome context and to elucidate the strengths and weaknesses of a method.
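Stratifying performance metrics by variant type and genome context, as the abstract describes, amounts to partitioning per-variant comparison decisions before counting. A hedged sketch of the idea; the records and field names here are invented for illustration, and in practice the comparison decisions come from the GA4GH benchmarking tools rather than a hand-written list:

```python
# Toy stratified benchmarking: group per-variant decisions (TP/FP/FN) by
# variant type, then compute precision and recall within each stratum.
from collections import Counter

# (variant_type, genome_context, decision) — all values hypothetical
comparisons = [
    ("SNP", "easy", "TP"), ("SNP", "easy", "TP"), ("SNP", "homopolymer", "FN"),
    ("INDEL", "easy", "TP"), ("INDEL", "homopolymer", "FP"),
    ("INDEL", "homopolymer", "FN"),
]

counts = Counter((vtype, decision) for vtype, _ctx, decision in comparisons)
for vtype in ("SNP", "INDEL"):
    tp, fp, fn = (counts[(vtype, d)] for d in ("TP", "FP", "FN"))
    prec = tp / (tp + fp) if tp + fp else float("nan")
    rec = tp / (tp + fn) if tp + fn else float("nan")
    print(f"{vtype}: precision={prec:.2f} recall={rec:.2f}")
```

Stratifying the same way by genome context (the second field) is what surfaces the method-specific weaknesses, such as poor indel recall in homopolymers, that pooled metrics hide.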


Author(s):  
Geraldine A. Auwera ◽  
Mauricio O. Carneiro ◽  
Christopher Hartl ◽  
Ryan Poplin ◽  
Guillermo del Angel ◽  
...  

2021 ◽  
Author(s):  
Alexander L.R. Lubbock ◽  
Carlos F. Lopez

Abstract
Computational modeling has become an established technique to encode mathematical representations of cellular processes and gain mechanistic insights that drive testable predictions. These models are often constructed using graphical user interfaces or domain-specific languages, with SBML used for interchange. Models are typically simulated, calibrated, and analyzed either within a single application or using import and export across various tools. Here, we describe a programmatic modeling paradigm, in which modeling is augmented with best practices from software engineering. We focus on Python, a popular, user-friendly programming language with a large scientific package ecosystem. Models themselves can be encoded as programs, adding benefits such as modularity, testing, and automated documentation generation while still being exportable to SBML. Automated version control and testing ensure that models and their modules have the expected properties and behavior. Programmatic modeling is a key technology for enabling collaborative model development and enhancing dissemination, transparency, and reproducibility.
Highlights
- Programmatic modeling combines computational modeling with software engineering best practices.
- An executable model enables users to leverage all available resources from the language.
- Community benefits include improved collaboration, reusability, and reproducibility.
- Python has multiple modeling frameworks with a broad, active scientific ecosystem.
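The core idea, a model encoded as an ordinary program so it can be imported, unit-tested, and version-controlled, can be sketched in a few lines. This is a generic toy ODE in plain Python, not the API of any specific framework from the paper:

```python
# Sketch of "programmatic modeling": the model is a plain Python function,
# so standard software-engineering tooling (tests, version control, docs)
# applies to it directly. Real SBML-capable frameworks do far more.
import math

def decay_model(y, k):
    """dy/dt for first-order decay, -k*y: a minimal reusable 'model module'."""
    return -k * y

def simulate(y0=1.0, k=0.5, t_end=10.0, dt=1e-3):
    """Fixed-step Euler integration of the model (toy solver for illustration)."""
    y, t = y0, 0.0
    while t < t_end:
        y += dt * decay_model(y, k)
        t += dt
    return y

def test_matches_analytic_solution():
    """A unit test an automated suite could run on every model revision."""
    assert abs(simulate() - math.exp(-5.0)) < 1e-2

test_matches_analytic_solution()
print(f"y(10) = {simulate():.4f}")
```

Because the model is code, a continuous-integration service can run `test_matches_analytic_solution` on every commit, which is exactly the kind of automated guarantee the abstract attributes to programmatic modeling.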


Author(s):  
Nadja Yang Meng ◽  
Karthikeyan K

Performance benchmarking and performance measurement are fundamental principles of performance enhancement in the business sector. For businesses to enhance their performance in today's competitive world, it is essential to know how to measure performance levels, which also means being able to tell how the business will perform after a change has been made. Once a business improvement has been made, the affected processes have to be evaluated. Performance measurements are also fundamental when comparing performance levels between corporations. Best practices within the industry are identified from the businesses achieving desirable results on the performance measures being used. In that regard, it would be valuable if similar businesses applied the same collection of performance metrics. In this paper, the NETIAS performance measurement framework is applied to evaluate business performance by producing a generic collection of performance metrics that businesses can utilize to compare and measure their organizational activities.


2019 ◽  
Vol 20 (S22) ◽  
Author(s):  
Hang Zhang ◽  
Ke Wang ◽  
Juan Zhou ◽  
Jianhua Chen ◽  
Yizhou Xu ◽  
...  

Abstract
Background: Variant calling and refinement from whole genome/exome sequencing data is a fundamental task in genomics studies. Due to the limited accuracy of NGS sequencing and variant callers, IGV-based manual review is required for further false-positive variant filtering, which costs substantial labor and time and results in high inter- and intra-lab variability.
Results: To overcome the limitations of manual review, we developed a novel approach for Variant Filter by Automated Scoring based on Tagged-signature (VariFAST), and also provide a pipeline integrating GATK Best Practices with VariFAST, which can be easily used for high-quality variant detection from raw data. Using the bam and vcf files, VariFAST calculates a v-score for each variant as a sum of weighted metrics associated with false-positive variation, and marks tags in a manner that keeps high consistency with manual review. We validated the performance of VariFAST for germline variant filtering using benchmark sequencing data from GIAB, and for somatic variant filtering using sequencing data of both malignant carcinomas and benign adenomas. VariFAST also includes a predictive model trained with the XGBoost algorithm for germline variant refinement, which achieves better MCC and AUC than the state-of-the-art VQSR, especially for INDEL variant filtering.
Conclusion: VariFAST can help researchers efficiently and conveniently filter false-positive variants, both germline and somatic, in NGS data analysis. The VariFAST source code and the pipeline integrating with GATK Best Practices are available at https://github.com/bioxsjtu/VariFAST.
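The v-score described above, a sum of weighted metrics associated with false-positive variation, can be sketched as follows. The metric names, weights, and review threshold here are invented for illustration; the actual signatures and weights are defined in the VariFAST paper and repository:

```python
# Hedged sketch of a VariFAST-style weighted score. Everything numeric here
# is hypothetical; see github.com/bioxsjtu/VariFAST for the real definitions.

# Per-variant signatures that commonly indicate false positives, with weights
WEIGHTS = {"low_depth": 2.0, "strand_bias": 1.5, "low_mapq": 1.0, "near_homopolymer": 0.5}

def v_score(flags: dict) -> float:
    """Sum the weights of every triggered false-positive signature."""
    return sum(WEIGHTS[name] for name, fired in flags.items() if fired)

variant = {"low_depth": True, "strand_bias": False, "low_mapq": True, "near_homopolymer": True}
score = v_score(variant)
print(f"v-score={score} -> {'flag for review' if score >= 2.5 else 'pass'}")
```

Tagging each variant with the signatures that fired, rather than reporting only the aggregate score, is what keeps the automated decision interpretable alongside manual IGV review.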


2020 ◽  
pp. 472-479 ◽  
Author(s):  
Vincent J. Carey ◽  
Marcel Ramos ◽  
Benjamin J. Stubbs ◽  
Shweta Gopaulakrishnan ◽  
Sehyun Oh ◽  
...  

PURPOSE Institutional efforts toward the democratization of cloud-scale data and analysis methods for cancer genomics are proceeding rapidly. As part of this effort, we bridge two major bioinformatic initiatives: the Global Alliance for Genomics and Health (GA4GH) and Bioconductor. METHODS We describe in detail a use case in pancancer transcriptomics conducted by blending implementations of the GA4GH Workflow Execution Services and Tool Registry Service concepts with the Bioconductor curatedTCGAData and BiocOncoTK packages. RESULTS We carried out the analysis with a formally archived workflow and container at dockstore.org and a workspace and notebook at app.terra.bio. The analysis identified relationships between microsatellite instability and biomarkers of immune dysregulation at a finer level of granularity than previously reported. Our use of standard approaches to containerization and workflow programming allows this analysis to be replicated and extended. CONCLUSION Experimental use of dockstore.org and app.terra.bio in concert with Bioconductor enabled novel statistical analysis of large genomic projects without the need for local supercomputing resources but involved challenges related to container design, script archiving, and unit testing. Best practices and cost/benefit metrics for the management and analysis of globally federated genomic data and annotation are evolving. The creation and execution of use cases like the one reported here will be helpful in the development and comparison of approaches to federated data/analysis systems in cancer genomics.


2020 ◽  
Vol 12 (1) ◽  
Author(s):  
Daniel C. Koboldt

Abstract Next-generation sequencing technologies have enabled a dramatic expansion of clinical genetic testing both for inherited conditions and diseases such as cancer. Accurate variant calling in NGS data is a critical step upon which virtually all downstream analysis and interpretation processes rely. Just as NGS technologies have evolved considerably over the past 10 years, so too have the software tools and approaches for detecting sequence variants in clinical samples. In this review, I discuss the current best practices for variant calling in clinical sequencing studies, with a particular emphasis on trio sequencing for inherited disorders and somatic mutation detection in cancer patients. I describe the relative strengths and weaknesses of panel, exome, and whole-genome sequencing for variant detection. Recommended tools and strategies for calling variants of different classes are also provided, along with guidance on variant review, validation, and benchmarking to ensure optimal performance. Although NGS technologies are continually evolving, and new capabilities (such as long-read single-molecule sequencing) are emerging, the “best practice” principles in this review should be relevant to clinical variant calling in the long term.


2017 ◽  
Vol 18 (4) ◽  
pp. 29-30
Author(s):  
Michael P. Earley ◽  
Jessica Panza ◽  
Katherine Thrapp

Purpose To explain the SEC’s historical focus on the calculation of investment performance and to highlight important issues for fund sponsors in the future. Design/methodology/approach This article discusses the SEC’s recent subpoena of at least one fund sponsor for information related to the firm’s practices in calculating internal rates of return (IRRs), and then explains the history of SEC enforcement in this area. Findings The SEC continues to be focused on how fund sponsors calculate investment performance metrics, such as IRRs, and on the related disclosure. Originality/value This article contains valuable information for fund sponsors from experienced investment fund lawyers, such as best practices for valuation methods and related investment performance disclosures, including the calculation of IRRs.
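The IRR at the center of the enforcement focus above is simply the discount rate at which the net present value of a fund's cash flows is zero. An illustrative sketch with hypothetical annual cash flows; real fund IRRs involve dated, irregular flows and methodology choices (gross vs. net of fees) that are precisely what disclosure practices must address:

```python
# Compute an internal rate of return by bisection on the NPV function.
# Cash flows are hypothetical: an outflow at t=0, then yearly inflows.

def irr(cash_flows, lo=-0.99, hi=10.0, tol=1e-9):
    """Find the rate r where the NPV of the cash flows crosses zero."""
    def npv(r):
        return sum(cf / (1 + r) ** t for t, cf in enumerate(cash_flows))
    # NPV is decreasing in r for an initial outflow followed by inflows,
    # so bisection converges to the unique root in [lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

flows = [-100.0, 30.0, 40.0, 50.0]  # invest 100, then three yearly distributions
print(f"IRR = {irr(flows):.4%}")
```

Even in this toy form, small choices (timing conventions, treatment of interim flows) move the reported figure, which is why the SEC's interest centers on how the calculation is performed and disclosed.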


2003 ◽  
Vol 02 (04) ◽  
pp. 403-408
Author(s):  
A. Al-Zoubaidi

In this paper we propose a Regional Internet Exchange (RIX) scheme for intra-regional traffic among MENA countries and compare it with the existing arrangement for Internet service provision. The RIX architecture is proposed, implemented, and evaluated using simulation. A comparative performance evaluation of Internet service provision under the existing and proposed scenarios is presented, focusing on utilization, message delay, access time, and client-perceived latency metrics. The study shows that the proposed scheme requires less international bandwidth, significantly reduces access time, and, most importantly, is inherently cost-effective.
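The intuition behind the reported access-time reduction can be illustrated with a toy queueing model (not the paper's simulation): mean delay on a link grows sharply as utilization approaches capacity, so diverting regional traffic off the congested international link pays off disproportionately. All numbers below are invented for illustration:

```python
# Toy M/M/1 illustration of why offloading traffic from a busy link
# (as a regional exchange does) reduces access time nonlinearly.

def mm1_delay(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda)."""
    assert arrival_rate < service_rate, "queue is unstable at utilization >= 1"
    return 1.0 / (service_rate - arrival_rate)

mu = 100.0  # hypothetical link capacity, requests/s
for lam in (50.0, 80.0, 95.0):  # e.g. with vs. without regional offload
    print(f"utilization={lam / mu:.0%}  mean delay={mm1_delay(lam, mu) * 1000:.1f} ms")
```

Dropping utilization from 95% to 50% in this toy model cuts mean delay tenfold, which mirrors the qualitative finding that the RIX both frees international bandwidth and shortens access times.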

