A comprehensive resource for retrieving, visualizing, and integrating functional genomics data

Matthias Blum; Pierre-Etienne Cholley; Valeriya Malysheva; Samuel Nicaise; Julien Moehlin; Hinrich Gronemeyer; Marco Antonio Mendoza-Parra

doi:10.26508/lsa.201900546

Life Science Alliance ◽

10.26508/lsa.201900546 ◽

2019 ◽

Vol 3 (1) ◽

pp. e201900546

Author(s):

Matthias Blum ◽

Pierre-Etienne Cholley ◽

Valeriya Malysheva ◽

Samuel Nicaise ◽

Julien Moehlin ◽

...

Keyword(s):

Functional Genomics ◽

High Performance ◽

Large Scale ◽

Quality Data ◽

Multidimensional Data ◽

Online Resource ◽

Public Data ◽

User Friendly ◽

Quality Assessments ◽

Multidimensional Data Integration

The enormous amount of freely accessible functional genomics data is an invaluable resource for interrogating the biological function of multiple DNA-interacting players and chromatin modifications by large-scale comparative analyses. However, in practice, interrogating large collections of public data requires major efforts for (i) reprocessing available raw reads, (ii) incorporating quality assessments to exclude artefactual and low-quality data, and (iii) processing data by using high-performance computation. Here, we present qcGenomics, a user-friendly online resource for ultrafast retrieval, visualization, and comparative analysis of tens of thousands of genomics datasets to gain new functional insight from global or focused multidimensional data integration.

BrainIAK tutorials: User-friendly learning materials for advanced fMRI analysis

10.31219/osf.io/j4sbc ◽

2019 ◽

Cited By ~ 2

Author(s):

Manoj Kumar ◽

Cameron Thomas Ellis ◽

Qihong Lu ◽

Hejia Zhang ◽

Mihai Capota ◽

...

Keyword(s):

Machine Learning ◽

Functional Connectivity ◽

Open Source ◽

Programming Languages ◽

High Performance ◽

Large Scale ◽

Markov Models ◽

Matrix Analysis ◽

Fmri Analysis ◽

User Friendly

Advanced brain imaging analysis methods, including multivariate pattern analysis (MVPA), functional connectivity, and functional alignment, have become powerful tools in cognitive neuroscience over the past decade. These tools are implemented in custom code and separate packages, often requiring different software and language proficiencies. Although usable by expert researchers, novice users face a steep learning curve. These difficulties stem from the use of new programming languages (e.g., Python), learning how to apply machine-learning methods to high-dimensional fMRI data, and minimal documentation and training materials. Furthermore, most standard fMRI analysis packages (e.g., AFNI, FSL, SPM) focus on preprocessing and univariate analyses, leaving a gap in how to integrate with advanced tools. To address these needs, we developed BrainIAK (brainiak.org), an open-source Python software package that seamlessly integrates several cutting-edge, computationally efficient techniques with other Python packages (e.g., Nilearn, Scikit-learn) for file handling, visualization, and machine learning. To disseminate these powerful tools, we developed user-friendly tutorials (in Jupyter format; https://brainiak.org/tutorials/) for learning BrainIAK and advanced fMRI analysis in Python more generally. These materials cover techniques including: MVPA (pattern classification and representational similarity analysis); parallelized searchlight analysis; background connectivity; full correlation matrix analysis; inter-subject correlation; inter-subject functional connectivity; shared response modeling; event segmentation using hidden Markov models; and real-time fMRI. For long-running jobs or large memory needs we provide detailed guidance on high-performance computing clusters. These notebooks were successfully tested at multiple sites, including as problem sets for courses at Yale and Princeton universities and at various workshops and hackathons. 
These materials are freely shared, with the hope that they become part of a pool of open-source software and educational materials for large-scale, reproducible fMRI analysis and accelerated discovery.

Anbar: Collection and analysis of a large scale Urdu language Twitter corpus

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-219266 ◽

2021 ◽

pp. 1-12

Author(s):

Bilal Tahir ◽

Muhammad Amir Mehmood

Keyword(s):

Computational Linguistics ◽

High Performance ◽

Large Scale ◽

Temporal Frequency ◽

Quality Data ◽

User Characteristics ◽

Vocabulary Size ◽

High Quality ◽

Social Analytics ◽

One Year

The confluence of high-performance computing algorithms and large-scale, high-quality data has led to the availability of cutting-edge tools in computational linguistics. However, these state-of-the-art tools are available only for the major languages of the world. Preparing large-scale, high-quality corpora for low-resource languages such as Urdu is challenging because it requires huge computational and human resources. In this paper, we build and analyze Anbar, a large-scale Urdu-language Twitter corpus. For this purpose, we collect 106.9 million Urdu tweets posted by 1.69 million users during one year (September 2018-August 2019). Our corpus consists of tweets with a rich vocabulary of 3.8 million unique tokens along with 58K hashtags and 62K URLs. Moreover, it contains 75.9 million (71.0%) retweets and 847K geotagged tweets. Furthermore, we examine Anbar using a variety of metrics such as temporal frequency of tweets, vocabulary size, geo-location, user characteristics, and entity distribution. To the best of our knowledge, this is the largest repository of Urdu-language tweets for the NLP research community, and it can be used for Natural Language Understanding (NLU), social analytics, and fake news detection.

NeuroPM toolbox: integrating Molecular, Neuroimaging and Clinical data for Characterizing Neuropathological Progression and Individual Therapeutic Needs

10.1101/2020.09.24.20200964 ◽

2020 ◽

Author(s):

Yasser Iturria-Medina ◽

Felix Carbonell ◽

Atoussa Assadi ◽

Quadri Adewale ◽

Ahmed F. Khan ◽

...

Keyword(s):

Open Access ◽

High Performance ◽

Large Scale ◽

Synergistic Interactions ◽

Academic Researchers ◽

Cross Platform ◽

Therapeutic Needs ◽

User Friendly ◽

Performance Computing

There is a critical need for a better multiscale and multifactorial understanding of neurological disorders, ranging from genes to neuroimaging to clinical factors and treatment effects. Here we present NeuroPM-box, a cross-platform, user-friendly and open-access software for characterizing multiscale and multifactorial brain pathological mechanisms and identifying individual therapeutic needs. The implemented methods have been extensively tested and validated in the neurodegenerative context, but there is no restriction on the kind of disorders that can be analyzed. By using advanced analytic modeling of molecular, neuroimaging and/or cognitive/behavioral data, this framework allows multiple applications, including characterization of: (i) the series of sequential states (e.g. transcriptomic, imaging or clinical alterations) covering decades of disease progression, (ii) intra-brain spreading of pathological factors (e.g. amyloid and tau misfolded proteins), (iii) synergistic interactions between multiple brain biological factors (e.g. direct tau effects on vascular and structural properties), and (iv) biologically defined patient stratification based on therapeutic needs (i.e. optimum treatments for each patient). All model outputs are biologically interpretable. A 4D-viewer allows visualization of spatiotemporal brain (dis)organization. Originally implemented in MATLAB, NeuroPM-box is compiled as a standalone application for Windows, Linux and Mac environments: neuropm-lab.com/software. On a regular workstation, it can analyze over 150 subjects per day, reducing the need for clusters or High-Performance Computing (HPC) for large-scale datasets. This open-access tool for academic researchers may significantly contribute to a better understanding of complex brain processes and to accelerating the implementation of Precision Medicine (PM) in neurology.

Formal semantics and high performance in declarative machine learning using Datalog

The VLDB Journal ◽

10.1007/s00778-021-00665-6 ◽

2021 ◽

Author(s):

Jin Wang ◽

Jiacheng Wu ◽

Mingda Li ◽

Jiaqi Gu ◽

Ariyam Das ◽

...

Keyword(s):

Machine Learning ◽

High Performance ◽

Large Scale ◽

Formal Semantics ◽

Distributed Data ◽

Recursive Programs ◽

Diverse Application ◽

User Friendly ◽

Performance Gains ◽

New Framework

With an escalating arms race to adopt machine learning (ML) in diverse application domains, there is an urgent need to support declarative machine learning over distributed data platforms. Toward this goal, a new framework is needed where users can specify ML tasks in a manner where programming is decoupled from the underlying algorithmic and system concerns. In this paper, we argue that declarative abstractions based on Datalog are natural fits for machine learning and propose a purely declarative ML framework with a Datalog query interface. We show that using aggregates in recursive Datalog programs entails a concise expression of ML applications, while providing a strictly declarative formal semantics. This is achieved by introducing simple conditions under which the semantics of recursive programs is guaranteed to be equivalent to that of aggregate-stratified ones. We further provide specialized compilation and planning techniques for semi-naive fixpoint computation in the presence of aggregates and optimization strategies that are effective on diverse recursive programs and distributed data platforms. To test and demonstrate these research advances, we have developed a powerful and user-friendly system on top of Apache Spark. Extensive evaluations on large-scale datasets illustrate that this approach will achieve promising performance gains while improving both programming flexibility and ease of development and deployment for ML applications.
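As a rough illustration of the semi-naive fixpoint computation mentioned in the abstract above, the following sketch evaluates a textbook recursive Datalog program (transitive closure) in plain Python. It is not the authors' system and does not use Apache Spark or aggregates; the function name `transitive_closure_seminaive` and the example relation are assumptions made purely for illustration.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def transitive_closure_seminaive(edges: Set[Edge]) -> Set[Edge]:
    """Evaluate the Datalog program

        path(X, Y) :- edge(X, Y).
        path(X, Z) :- path(X, Y), edge(Y, Z).

    with semi-naive evaluation: each round joins only the facts that
    were new in the previous round (the delta) against edge.
    """
    # Index edge by its first column so the join becomes a dictionary lookup.
    successors: Dict[str, Set[str]] = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    path: Set[Edge] = set(edges)   # every path fact derived so far
    delta: Set[Edge] = set(edges)  # path facts first derived in the last round

    while delta:
        # Apply the recursive rule to the delta only, not to all of path.
        derived = {(x, z) for (x, y) in delta for z in successors.get(y, ())}
        # Keep only genuinely new facts; an empty delta means a fixpoint.
        delta = derived - path
        path |= delta
    return path


if __name__ == "__main__":
    example_edges = {("a", "b"), ("b", "c"), ("c", "d")}
    for fact in sorted(transitive_closure_seminaive(example_edges)):
        print(fact)
```

Restricting each round to the delta avoids re-deriving facts already known, which is the main saving of semi-naive over naive evaluation; handling aggregates inside the recursion, as the abstract describes, requires additional conditions that this sketch does not attempt.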

MaGenDB: a functional genomics hub for Malvaceae plants

Nucleic Acids Research ◽

10.1093/nar/gkz953 ◽

2019 ◽

Cited By ~ 3

Author(s):

Dehe Wang ◽

Weiliang Fan ◽

Xiaolong Guo ◽

Kai Wu ◽

Siyu Zhou ◽

...

Keyword(s):

Functional Genomics ◽

Large Scale ◽

Expression Profiles ◽

Individual Species ◽

Omics Data ◽

Comparison System ◽

Dynamic Expression ◽

Biological Discovery ◽

Multiple Species ◽

User Friendly

Malvaceae is a family of flowering plants containing many economically important plant species including cotton, cacao and durian. Recently, the genomes of several Malvaceae species have been decoded, and many omics data were generated for individual species. However, no integrative database of multiple species, enabling users to jointly compare and analyse relevant data, is available for Malvaceae. Thus, we developed a user-friendly database named MaGenDB (http://magen.whu.edu.cn) as a functional genomics hub for the plant community. We collected the genomes of 13 Malvaceae species, and comprehensively annotated genes from different perspectives including functional RNA/protein element, gene ontology, KEGG orthology, and gene family. We processed 374 sets of diverse omics data with the ENCODE pipelines and integrated them into a customised genome browser, and designed multiple dynamic charts to present gene/RNA/protein-level knowledge such as dynamic expression profiles and functional elements. We also implemented a smart search system for efficiently mining genes. In addition, we constructed a functional comparison system to help comparative analysis between genes on multiple features in one species or across closely related species. This database and associated tools will allow users to quickly retrieve large-scale functional information for biological discovery.

FUSTr: a tool to find gene families under selection in transcriptomes

PeerJ ◽

10.7717/peerj.4234 ◽

2018 ◽

Vol 6 ◽

pp. e4234 ◽

Cited By ~ 6

Author(s):

T. Jeffrey Cole ◽

Michael S. Brewer

Keyword(s):

Molecular Evolution ◽

Positive Selection ◽

High Performance ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Strong Positive Selection ◽

Transcriptomic Data ◽

Downstream Analysis ◽

User Friendly

Background: The recent proliferation of large amounts of biodiversity transcriptomic data has resulted in an ever-expanding need for scalable and user-friendly tools capable of answering large-scale molecular evolution questions. FUSTr identifies gene families involved in the process of adaptation. The tool finds genes under strong positive selection in transcriptomic datasets and automatically detects isoform designation patterns in transcriptome assemblies to maximize phylogenetic independence in downstream analysis. Results: When applied to previously studied spider transcriptomic data as well as simulated data, FUSTr successfully grouped coding sequences into proper gene families and correctly identified those under strong positive selection in relatively little time. Conclusions: FUSTr provides a useful tool for novice bioinformaticians to characterize the molecular evolution of organisms throughout the tree of life using large transcriptomic biodiversity datasets, and it can utilize multi-processor high-performance computational facilities.

The Structure and Properties of MoSi2 Thin Film in MOS Process

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s1431927600001379 ◽

1980 ◽

Vol 38 ◽

pp. 326-327

Author(s):

C.K. Wu ◽

P. Chang ◽

N. Godinho

Keyword(s):

Thin Film ◽

Integrated Circuits ◽

High Performance ◽

Large Scale ◽

Process Development ◽

Structure And Properties ◽

Metal Silicides ◽

High Oxidation ◽

Important Approach ◽

High Oxidation Resistance

Recently, the use of refractory metal silicides as low resistivity, high temperature and high oxidation resistance gate materials in large scale integrated circuits (LSI) has become an important approach in advanced MOS process development (1). This research is a systematic study on the structure and properties of molybdenum silicide thin film and its applicability to high performance LSI fabrication.

What Can Digitization Do For Formulated Product Innovation and Development

10.26434/chemrxiv.11763864.v1 ◽

2020 ◽

Author(s):

James McDonagh ◽

William Swope ◽

Richard L. Anderson ◽

Michael Johnston ◽

David J. Bray

Keyword(s):

High Performance ◽

Quality Data ◽

High Quality Data ◽

Base Level ◽

Hybrid Approaches ◽

Digital Ecosystem ◽

Recent Developments ◽

Chemical Simulation ◽

High Level ◽

Physical And Chemical

Digitization offers significant opportunities for the formulated product industry to transform the way it works and develop new methods of business. R&D is one area of operation in which it is challenging to take advantage of these technologies, owing to its high level of domain specialisation and creativity, but the benefits could be significant. Recent developments in base-level technologies such as artificial intelligence (AI)/machine learning (ML), robotics and high-performance computing (HPC), to name a few, present disruptive and transformative technologies which could offer new insights, discovery methods and enhanced chemical control when combined in a digital ecosystem of connectivity, distributive services and decentralisation. At the fundamental level, research in these technologies has shown that new physical and chemical insights can be gained, which in turn can augment experimental R&D approaches through physics-based chemical simulation, data-driven models and hybrid approaches. In all of these cases, high-quality data is required to build and validate models, in addition to the skills and expertise to exploit such methods. In this article we give an overview of some of the digital technology demonstrators we have developed for formulated product R&D. We discuss the challenges in building and deploying these demonstrators.

RECOMMENDATIONS FOR THE CHOICE OF MILKING INSTALLATIONS IN LOOSE HOUSING SYSTEMS OF COWS

Molochnoe i miasnoe skotovodstvo ◽

10.33943/mms.2020.12.24.001 ◽

2020 ◽

Author(s):

В.В. ГОРДЕЕВ ◽

В.Е. ХАЗАНОВ

Keyword(s):

Dairy Cows ◽

High Performance ◽

Large Scale ◽

Dairy Farms ◽

Economic Indicators ◽

Technical Level ◽

Housing Systems ◽

Working Shift ◽

Technical And Economic Indicators

Choosing the proper type and size of a milking installation requires taking into account the maximum planned number of dairy cows, the size of a technological group, the number of milkings per day, the duration of one milking, and the length of the operators' working shift. An analysis of the technical and economic indicators of the currently most common types of milking installations of the same technical level shows that the Carousel installation has the best specific indicators, while the Herringbone installation requires higher labour and financial inputs. The Parallel installation occupies an intermediate position. Based on throughput and the required number of operators, Herringbone is recommended for farms with up to 600 dairy cows, Parallel for no more than 1200 dairy cows, and Carousel for more than 1200 dairy cows. Carousel is the most practical, high-performance, and easily automated option and is therefore a promising milking system for milking parlours, especially on large-scale dairy farms.

Faculty Opinions recommendation of A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718350670.793493808 ◽

2014 ◽

Author(s):

Zdeněk Valenta

Keyword(s):

Functional Genomics ◽

Covariance Matrix ◽

Large Scale ◽

Covariance Matrix Estimation ◽

Matrix Estimation ◽

Scale Covariance
