Maftools: Efficient analysis, visualization and summarization of MAF files from large-scale cohort based cancer studies

2016 ◽  
Author(s):  
Anand Mayakonda ◽  
H Phillip Koeffler

Abstract: Mutation Annotation Format (MAF) has become a standard file format for storing somatic and germline variants derived from sequencing of large cohorts of cancer samples. A MAF file contains a list of all variants detected in a sample, along with various annotations associated with each putative variant; it forms the basis for many downstream analyses and provides a complete landscape of the cohort. Here we introduce maftools, an R package that provides a rich source of functions for performing various analyses, visualizations and summarizations of MAF files. Maftools uses the data.table library for faster processing and summarization and ggplot2 for generating rich, publication-quality visualizations. Maftools also takes advantage of the S4 class system for better data representation, with easy-to-use and flexible functions. Availability and implementation: maftools is implemented as an R package available at https://github.com/PoisonAlien/maftools. Contact: [email protected]
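A minimal sketch of the package's core workflow (function names follow the maftools documentation; the MAF path is a placeholder):

    library(maftools)

    # Read a cohort MAF file; read.maf summarizes variants per gene and sample on load
    laml <- read.maf(maf = "cohort_variants.maf")  # placeholder path

    # Cohort-level overview: variant classifications, types and per-sample burden
    plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = "median")

    # Waterfall plot (oncoplot) of the ten most frequently mutated genes
    oncoplot(maf = laml, top = 10)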

Author(s):  
Zachary B Abrams ◽  
Caitlin E Coombes ◽  
Suli Li ◽  
Kevin R Coombes

Abstract Summary: Unsupervised machine learning provides tools for researchers to uncover latent patterns in large-scale data, based on calculated distances between observations. Methods to visualize high-dimensional data based on these distances can elucidate subtypes and interactions within multi-dimensional and high-throughput data. However, researchers can select from a vast number of distance metrics and visualizations, each with its own strengths and weaknesses. The Mercator R package facilitates selection of a biologically meaningful distance from 10 metrics, together appropriate for binary, categorical and continuous data, and visualization with 5 standard and high-dimensional graphics tools. Mercator provides a user-friendly pipeline for informaticians or biologists to perform unsupervised analyses, from exploratory pattern recognition to production of publication-quality graphics. Availability and implementation: Mercator is freely available from the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html).
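As a rough base-R illustration of the workflow Mercator streamlines (this is not the package's own API; see its CRAN vignette for that), one can compute a binary Jaccard-style distance and inspect it with two of the standard views:

    set.seed(42)
    # Toy binary matrix: 60 samples x 40 binary features
    X <- matrix(rbinom(60 * 40, 1, 0.3), nrow = 60)

    # Jaccard-type distance between samples (one of several metrics Mercator offers)
    d <- dist(X, method = "binary")

    # Two complementary views of the same distance matrix
    plot(hclust(d), labels = FALSE, main = "Hierarchical clustering")
    mds <- cmdscale(d, k = 2)
    plot(mds, xlab = "MDS 1", ylab = "MDS 2", main = "Multidimensional scaling")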


2020 ◽  
Author(s):  
Liya Ming ◽  
Yang Zou ◽  
Yiming Zhao ◽  
Luna Zhang ◽  
Ningning He ◽  
...  

ABSTRACT: A large number of post-translational modifications (PTMs) in proteins are buried in the unassigned mass spectrometric (MS) spectra of shotgun proteomics datasets. Because modified peptide fragments are low in abundance relative to their non-modified counterparts, it is critical to develop tools that allow facile evaluation of PTM assignments based on the MS/MS spectra. Such tools should preferably allow comparison of fragment ion spectra and retention times between modified and unmodified peptide pairs or groups. Herein, we describe MMS2plot, an R package for visualizing peptide-spectrum matches (PSMs) for multiple peptides. MMS2plot features a batch mode and generates output images in vector graphics formats, which facilitates evaluation and publication of the PSM assignments. We expect MMS2plot to play an important role in PTM discovery from large-scale proteomics datasets generated by LC (liquid chromatography)-MS/MS. The MMS2plot package is freely available at https://github.com/lileir/MMS2plot under the GPL-3 license.
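MMS2plot's own interface is documented in the repository; purely to illustrate the kind of comparison it batches, here is a schematic base-R mirror (butterfly) plot of a modified/unmodified spectrum pair with made-up peaks:

    # Schematic mirror plot: modified peptide spectrum up, unmodified down
    # (m/z and intensity values below are illustrative only)
    mz_mod    <- c(175.1, 304.2, 433.2, 562.3, 691.3)
    int_mod   <- c(30, 80, 100, 55, 20)
    mz_unmod  <- c(175.1, 304.2, 353.2, 482.3, 611.3)
    int_unmod <- c(35, 75, 90, 60, 25)

    plot(NULL, xlim = range(mz_mod, mz_unmod), ylim = c(-100, 100),
         xlab = "m/z", ylab = "Relative intensity",
         main = "Modified (up) vs unmodified (down)")
    segments(mz_mod, 0, mz_mod, int_mod, col = "red")
    segments(mz_unmod, 0, mz_unmod, -int_unmod, col = "blue")
    abline(h = 0)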


Author(s):  
Virdiansyah Permana ◽  
Rahmat Shoureshi

This study presents a new approach to determining the controllability and observability of a large-scale nonlinear dynamic thermal system using graph theory. The novelty of this method lies in adapting graph theory to a nonlinear class of systems and establishing a graphical condition that gives necessary and sufficient terms for such a system to be controllable and observable, equivalent to the analytical Lie algebra rank condition (LARC). A directed graph (digraph) is used to model the system, and the rules for its adaptation to the nonlinear class are defined. Necessary and sufficient conditions for controllability and observability are then investigated through a structural property of the digraph called connectability. It is shown that the connectability conditions between input and states, as well as between output and states, of a nonlinear system are equivalent to the LARC. This approach is easier from a computational point of view and is thus useful when dealing with large systems.
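A small sketch of the digraph side of the argument (illustrative only; the paper's nonlinear adaptation rules go beyond plain reachability): input-to-state and state-to-output connectability can be checked by accumulating powers of the adjacency matrix:

    # Digraph as an adjacency matrix: A[i, j] = 1 if there is an edge i -> j
    # Nodes: u (input), x1..x3 (states), y (output)
    nodes <- c("u", "x1", "x2", "x3", "y")
    A <- matrix(0, 5, 5, dimnames = list(nodes, nodes))
    A["u",  "x1"] <- 1
    A["x1", "x2"] <- 1
    A["x2", "x3"] <- 1
    A["x3", "y"]  <- 1

    # Node j is reachable from i if (A + A^2 + ... + A^n)[i, j] > 0
    reach <- function(A) {
      R <- A
      P <- A
      for (k in 2:nrow(A)) {
        P <- (P %*% A > 0) * 1   # paths of length k
        R <- ((R + P) > 0) * 1
      }
      R
    }
    R <- reach(A)
    all(R["u", c("x1", "x2", "x3")] == 1)  # every state connectable from the input
    all(R[c("x1", "x2", "x3"), "y"] == 1)  # output connectable from every state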


2020 ◽  
Author(s):  
Jenna Marie Reps ◽  
Ross Williams ◽  
Seng Chan You ◽  
Thomas Falconer ◽  
Evan Minty ◽  
...  

Abstract Objective: To demonstrate how the Observational Health Data Sciences and Informatics (OHDSI) collaborative network and standardization can be utilized to scale up external validation of patient-level prediction models by enabling validation across a large number of heterogeneous observational healthcare datasets. Materials & Methods: Five previously published prognostic models (ATRIA, CHADS2, CHA2DS2-VASc, Q-Stroke and Framingham) that predict future risk of stroke in patients with atrial fibrillation were replicated using the OHDSI frameworks. A network study was run that enabled the five models to be externally validated across nine observational healthcare datasets spanning three countries and five independent sites. Results: The five existing models were integrated into the OHDSI framework for patient-level prediction, and they obtained mean c-statistics ranging from 0.57 to 0.63 across the six databases with sufficient data to predict stroke within 1 year of initial atrial fibrillation diagnosis in females with atrial fibrillation. This was comparable with existing validation studies. Once the models were replicated, the validation network study was run across the nine datasets within 60 days. An R package for the study was published at https://github.com/OHDSI/StudyProtocolSandbox/tree/master/ExistingStrokeRiskExternalValidation. Discussion: This study demonstrates the ability to scale up external validation of patient-level prediction models using a collaboration of researchers and a data standardization that enables models to be readily shared across data sites. External validation is necessary to understand the transportability and reproducibility of a prediction model, but without collaborative approaches it can take three or more years for a model to be validated by one independent researcher. Conclusion: In this paper we show it is possible to both scale up and speed up external validation by showing how validation can be done across multiple databases in less than 2 months. We recommend that researchers developing new prediction models use the OHDSI network to externally validate their models.
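The study's network code lives in the repository linked above; as a reminder of what the headline metric measures, the c-statistic is simply the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, computable from ranks (toy data below):

    set.seed(1)
    # Toy external-validation data: predicted 1-year stroke risk vs observed outcome
    risk    <- runif(1000)
    outcome <- rbinom(1000, 1, plogis(3 * risk - 2))

    # c-statistic via the Mann-Whitney rank formula
    cstat <- function(risk, outcome) {
      r  <- rank(risk)
      n1 <- sum(outcome == 1)
      n0 <- sum(outcome == 0)
      (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }
    cstat(risk, outcome)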


2021 ◽  
Author(s):  
Gastón Mauro Díaz

1) Hemispherical photography (HP) is a long-standing tool for forest canopy characterization. Low-cost fisheye lenses are now available to convert smartphones into highly portable HP equipment; however, they cannot be used at any time since HP is sensitive to illumination conditions. To obtain sound results outside diffuse-light conditions, a deep-learning-based system would need to be developed. A ready-to-use alternative is the multiscale color-based binarization algorithm, but it can provide moderate-quality results only for open forests. To overcome this limitation, I propose coupling it with the model-based local thresholding algorithm. I call this coupling the MBCB approach. 2) The methods presented here are part of the R package CAnopy IMage ANalysis (caiman), which I am developing. The accuracy of the new MBCB approach was assessed with data from a pine plantation and a broadleaf native forest. 3) The coefficient of determination (R^2) was greater than 0.7 and the root mean square error (RMSE) lower than 20%, both for plant area index calculation. 4) The results suggest that the new MBCB approach allows the calculation of unbiased canopy metrics from smartphone-based HP acquired in sunlight conditions, even for closed canopies. This facilitates large-scale and opportunistic sampling with hemispherical photography.
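caiman's own functions are documented with the package; purely to illustrate the difference between a single global threshold and a local one on a canopy photograph, here is a sketch using the Bioconductor EBImage package (the image path is a placeholder):

    library(EBImage)  # Bioconductor: BiocManager::install("EBImage")

    img  <- readImage("hemiphoto.jpg")  # placeholder path
    blue <- channel(img, "blue")        # blue channel gives strong sky/canopy contrast

    # Global Otsu threshold vs adaptive local threshold
    bin_global <- blue > otsu(blue)
    bin_local  <- thresh(blue, w = 15, h = 15, offset = 0.02)

    display(combine(bin_global, bin_local), all = TRUE)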


2019 ◽  
Author(s):  
Alvin Vista

Cheating detection is an important issue in standardized testing, especially in large-scale settings. Statistical approaches are often computationally intensive and require specialised software. We present a two-stage approach that quickly filters suspected groups using statistical testing on an IRT-based answer-copying index. We also present an approach to mitigate data contamination and improve the performance of the index. The computation of the index was implemented through a modified version of an open-source R package, thus enabling wider access to the method. Using data from PIRLS 2011 (N = 64,232), we conduct a simulation to demonstrate our approach. Type I error was well controlled and no control group was falsely flagged for cheating, while 16 (combined n = 12,569) of the 18 (combined n = 14,149) simulated groups were detected. Implications for system-level cheating detection and further improvements of the approach are discussed.
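The index itself is IRT-based and implemented in the modified package the authors describe; as a schematic of the two-stage idea only, stage one can be a cheap group-level screen, reserving the expensive index for the flagged groups:

    set.seed(7)
    # Toy data: 20 groups, each summarised by a count of suspicious answer matches
    n_pairs <- rep(200, 20)
    p_null  <- 0.05  # assumed match rate under no copying (illustrative)
    matches <- rbinom(20, n_pairs, c(rep(p_null, 18), 0.12, 0.15))  # last 2 groups cheat

    # Stage 1: quick one-sided binomial screen per group, Bonferroni-controlled
    pvals   <- pbinom(matches - 1, n_pairs, p_null, lower.tail = FALSE)
    flagged <- which(p.adjust(pvals, method = "bonferroni") < 0.05)
    flagged  # only these groups proceed to the computationally intensive IRT index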


2020 ◽  
Author(s):  
Atilio O. Rausch ◽  
Maria I. Freiberger ◽  
Cesar O. Leonetti ◽  
Diego M. Luna ◽  
Leandro G. Radusky ◽  
...  

Once folded, natural protein molecules have few energetic conflicts within their polypeptide chains. Many protein structures do, however, contain regions where energetic conflicts remain after folding, i.e. they have highly frustrated regions. These regions, kept in place over evolutionary and physiological timescales, are related to several functional aspects of natural proteins such as protein-protein interactions, small-ligand recognition, catalytic sites and allostery. Here we present FrustratometeR, an R package that easily computes local energetic frustration on a personal computer or a cluster. This package facilitates large-scale analysis of local frustration, point mutants and MD trajectories, allowing straightforward integration of local frustration analysis into pipelines for protein structural analysis. Availability and implementation: https://github.com/proteinphysiologylab/frustratometeR
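A minimal call, following the entry point shown in the package README (argument and helper names may differ across versions; the results directory is a placeholder):

    library(frustratometeR)

    # Compute configurational frustration for a PDB entry (README-style call)
    res <- calculate_frustration(PdbID = "1fhj", Mode = "configurational",
                                 ResultsDir = "~/frustration_results/")  # placeholder dir

    # Contact-map view of local frustration (helper name as in the README; may vary)
    plot_contact_map(res)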


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
B. V. Binoy ◽  
M. A. Naseer ◽  
P. P. Anil Kumar ◽  
Nina Lazar

Purpose: Real estate valuation studies gained popularity with the availability of large-scale property transaction data in the latter part of the twentieth century. Hedonic price modeling (HPM) was the most popular method in the initial years, until it was overtaken by advanced modeling methods in the twenty-first century. Even though a few literature reviews exist on this topic, no comprehensive bibliometric analysis has been conducted in this area. To gain a better understanding of the dynamics of property valuation studies, this paper conducts a bibliometric analysis. Design/methodology/approach: A comprehensive search in the Scopus database, followed by detailed screening, resulted in 1,400 articles. The identified research articles, spanning over five decades (1964–2019), are analyzed using the open-source R package "bibliometrix." Findings: The study found the USA to be the most productive country in various respects, such as the number of publications, number of authors and publication hotspots. The findings also include assessments of publication trends, journals, citations, keywords, co-citation and collaboration networks. An upsurge in the number of publications was observed after the year 2000, owing to improved data availability and better modeling techniques. Research limitations/implications: This study is significant for understanding the major research areas and modeling techniques used in property valuation. Future studies can incorporate multiple database sources and include more articles. Originality/value: The current study is one of the first bibliometric studies on property valuation. Previous studies have not explored the possibilities of geographic information systems in bibliometric research; spatial mapping and analysis of publications provide a geographical perspective on valuation research.
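A minimal bibliometrix session of the kind the study describes (the export filename is a placeholder):

    library(bibliometrix)

    # Convert a Scopus export into a bibliographic data frame
    M <- convert2df("scopus_export.bib", dbsource = "scopus", format = "bibtex")

    # Descriptive analysis: productive countries, authors, sources, citation counts
    results <- biblioAnalysis(M)
    summary(results, k = 10)
    plot(results, k = 10)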

