Factors influencing taxonomic unevenness in scientific research: a mixed-methods case study of non-human primate genomic sequence data generation

Margarita Hernandez; Mary K. Shenk; George H. Perry

doi:10.1098/rsos.201206

Royal Society Open Science ◽

10.1098/rsos.201206 ◽

2020 ◽

Vol 7 (9) ◽

pp. 201206

Author(s):

Margarita Hernandez ◽

Mary K. Shenk ◽

George H. Perry

Keyword(s):

Mixed Methods ◽

Genomic Sequence ◽

Sequence Data ◽

Scientific Research ◽

Genomic Data ◽

Primate Species ◽

Data Generation ◽

Human Primate ◽

Study Species

Scholars have noted major disparities in the extent of scientific research conducted among taxonomic groups. Such trends may cascade if future scientists gravitate towards study species with more data and resources already available. As new technologies emerge, do research studies employing these technologies continue these disparities? Here, using non-human primates as a case study, we identified disparities in massively parallel genomic sequencing data and conducted interviews with scientists who produced these data to learn their motivations when selecting study species. We tested whether variables including publication history and conservation status were significantly correlated with publicly available sequence data in the NCBI Sequence Read Archive (SRA). Of the 179.6 terabases (Tb) of sequence data in SRA for 519 non-human primate species, 135 Tb (approx. 75%) were from only five species: rhesus macaques, olive baboons, green monkeys, chimpanzees and crab-eating macaques. The strongest predictors of the amount of genomic data were the total number of non-medical publications (linear regression; r² = 0.37; p = 6.15 × 10⁻¹²) and number of medical publications (r² = 0.27; p = 9.27 × 10⁻⁹). In a generalized linear model, the number of non-medical publications (p = 0.00064) and closer phylogenetic distance to humans (p = 0.024) were the most predictive of the amount of genomic sequence data. We interviewed 33 authors of genomic data-producing publications and analysed their responses using grounded theory. Consistent with our quantitative results, authors mentioned their choice of species was motivated by sample accessibility, prior published work and relevance to human medicine. Our mixed-methods approach helped identify and contextualize some of the driving factors behind species-uneven patterns of scientific research, which can now be considered by funding agencies, scientific societies and research teams aiming to align their broader goals with future data generation efforts.
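The two kinds of figures quoted in this abstract (a share of total sequence data, and an r² from a simple linear regression) can be illustrated with a minimal Python sketch. The function names `share_percent` and `r_squared` are hypothetical, and the regression helper is a plain ordinary-least-squares fit written for illustration; it is not the authors' analysis code and uses none of their per-species data.

```python
def share_percent(part_tb: float, total_tb: float) -> float:
    """Return part_tb as a percentage of total_tb."""
    return 100.0 * part_tb / total_tb


def r_squared(x: list[float], y: list[float]) -> float:
    """Coefficient of determination for a one-predictor least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope and intercept of the ordinary least-squares fit.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Share of SRA data held by the five most-sequenced species (figures from the abstract).
print(round(share_percent(135.0, 179.6), 1))  # 75.2
```

To reproduce a value such as r² = 0.37, `r_squared` would need the per-species publication counts and sequence totals, which the abstract does not list.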

Factors influencing taxonomic unevenness in scientific research: A mixed-methods case study of non-human primate genomic sequence data generation

10.1101/2020.04.16.045450 ◽

2020 ◽

Author(s):

Margarita Hernandez ◽

Mary K. Shenk ◽

George H. Perry

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Scientific Research ◽

Phylogenetic Distance ◽

Data Generation ◽

Factors Influencing ◽

Funding Agencies ◽

Future Data ◽

Human Primate

ABSTRACT

Scholars have often noted major disparities in the extent of scientific research conducted among taxonomic groups. Such trends may cascade if future scientists gravitate towards study species with more data and resources already available. As new technologies emerge, do research studies employing these technologies continue these disparities? Here, using non-human primates as a case study, we first identified disparities in recently generated massively parallel genomic sequencing data and we then conducted interviews with the scientists who produced these data to learn their motivations when selecting species for study. Specifically, we tested whether variables including publication history and conservation status were significantly correlated with publicly available sequence data in the NCBI Sequence Read Archive. Of the 179.6 terabases (Tb) of sequence data in this database for 519 non-human primate species, 135 Tb (~75%) were from only five species: rhesus macaques, olive baboons, green monkeys, chimpanzees, and crab-eating macaques. The strongest individual predictors of the amount of genomic data were the total number of non-medical scholarly publications (linear regression; r² = 0.37; P = 6.15 × 10⁻¹²) and number of medical publications (r² = 0.27; P = 9.27 × 10⁻⁹). In a generalized linear model, the number of non-medical publications (P = 0.00064) and closer phylogenetic distance to humans (P = 0.024) were the most predictive of the amount of genomic sequence data. We interviewed 33 authors of genomic data-producing publications and analyzed their responses using a grounded theory approach. Consistent with our quantitative results, authors mentioned that their choices of species were motivated by sample accessibility, prior published work, and perceived relevance (especially health-related) to humans. Our mixed-methods approach helped us to identify and contextualize some of the driving factors behind species-uneven patterns of scientific research, which can now be considered by funding agencies, scientific societies, and research teams aiming to align their broader goals with future data generation efforts.

SIGNIFICANCE STATEMENT

Our study sheds light on the species-uneven distribution of genomic sequence data generation across the order Primates. We used a combination of quantitative data analyses and qualitative interviews with authors of data-producing studies to identify factors that have driven the observed pattern of unevenness; these included the extent of prior research conducted on each species, the relevance to human medicine, phylogenetic distance to humans, and sample accessibility. While our study focused on factors influencing non-human primate genomic sequence data, similar questions can be asked about how the scientific community engages with research projects more broadly. Our goal is to bring attention to the diversity of factors that influence scientists as they plan their projects, so that this process can be considered in the future by research groups and funding agencies aiming to align their broader goals with future data generation efforts.

Estimation of Cross-Species Introgression Rates using Genomic Data Despite Model Unidentifiability

10.1101/2021.08.14.456331 ◽

2021 ◽

Author(s):

Ziheng Yang ◽

Thomas Flouris

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Gene Tree ◽

Synthetic Data ◽

Genomic Data ◽

Sister Species ◽

Label Switching ◽

Cross Model ◽

Multispecies Coalescent ◽

Full Likelihood

The multispecies coalescent with introgression (MSci) model accommodates both the coalescent process and cross-species introgression/hybridization events, two major processes that create genealogical fluctuations across the genome and gene-tree-species-tree discordance. Full likelihood implementations of the MSci model take such fluctuations as a major source of information about the history of species divergence and gene flow, and provide a powerful tool for estimating the direction, timing and strength of cross-species introgression using multilocus sequence data. However, introgression models, in particular those that accommodate bidirectional introgression (BDI), are known to cause unidentifiability issues of the label-switching type, whereby different models or parameters make the same predictions about the genomic data and thus cannot be distinguished by the data. Nevertheless, there has been no systematic study of unidentifiability when full likelihood methods are applied. Here we characterize the unidentifiability of arbitrary BDI models and derive simple rules for its identification. In general, an MSci model with k BDI events has 2^k unidentifiable towers in the posterior, with each BDI event between sister species creating within-model unidentifiability and each BDI between non-sister species creating cross-model unidentifiability. We develop novel algorithms for processing Markov chain Monte Carlo (MCMC) samples to remove label switching and implement them in the BPP program. We analyze genomic sequence data from Heliconius butterflies as well as synthetic data to illustrate the utility of the BDI models and the new algorithms.
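The 2^k count stated in this abstract can be illustrated with a short Python sketch. It assumes, for illustration only, that each BDI event contributes one independent binary choice (original labelling versus its label-switched mirror). The function name `label_switching_towers` is hypothetical, and the sketch does not reproduce the MCMC post-processing algorithms implemented in BPP.

```python
from itertools import product


def label_switching_towers(k: int) -> list[tuple[bool, ...]]:
    """Enumerate every switched/unswitched combination across k BDI events.

    Each tuple has one flag per BDI event: False keeps the original
    labelling, True takes its mirror image. The list has 2**k entries.
    """
    return list(product((False, True), repeat=k))


# With k = 3 bidirectional introgression events there are 2**3 = 8 combinations.
towers = label_switching_towers(3)
print(len(towers))  # 8
```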

The genomic data deficit: On the need to inform research subjects of the informational content of their genomic sequence data in consent for genomic research

Computer Law & Security Review ◽

10.1016/j.clsr.2020.105427 ◽

2020 ◽

Vol 37 ◽

pp. 105427

Author(s):

Dara Hallinan

Keyword(s):

Genomic Sequence ◽

Sequence Data ◽

Genomic Data ◽

Genomic Research ◽

Informational Content ◽

Research Subjects

Genomic Sequence Data Compression using Lempel-Ziv-Welch Algorithm with Indexed Multiple Dictionary

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.b3278.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 541-547

Keyword(s):

Dna Sequence ◽

High Throughput ◽

Compression Ratio ◽

Genomic Sequence ◽

Sequence Data ◽

Genomic Data ◽

General Purpose ◽

Huge Amount ◽

Compression Time ◽

Average Size

With advances in technology and the development of High Throughput Systems (HTS), the amount of genomic data generated per day per laboratory across the globe is outpacing Moore’s law. The huge amount of data generated is a concern for biologists, both for storage and for transmission between locations for further analysis. Compressing the genomic data is a sensible way to overcome the problems arising from this data deluge. This paper discusses various existing algorithms for the compression of genomic data, as well as a few general-purpose algorithms, and proposes an LZW-based compression algorithm that uses indexed multiple dictionaries. The proposed method exhibits an average compression ratio of 0.41 bits per base and an average compression time of 6.45 s for DNA sequences with an average size of 105.9 KB.
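For readers unfamiliar with the underlying technique, the following is a minimal Python sketch of textbook Lempel-Ziv-Welch (LZW) coding with a single dictionary over the DNA alphabet. It is not the indexed multiple-dictionary variant proposed in the paper, whose details are not given in the abstract. The function names and the default alphabet `"ACGT"` are assumptions made for illustration, and input is assumed to contain only characters from that alphabet.

```python
def lzw_compress(seq: str, alphabet: str = "ACGT") -> list[int]:
    """Textbook LZW: emit one code per longest already-known substring."""
    table = {ch: i for i, ch in enumerate(alphabet)}
    codes: list[int] = []
    current = ""
    for ch in seq:
        candidate = current + ch
        if candidate in table:
            current = candidate  # keep extending the match
        else:
            codes.append(table[current])    # emit the longest known match
            table[candidate] = len(table)   # learn the new substring
            current = ch
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int], alphabet: str = "ACGT") -> str:
    """Invert lzw_compress by rebuilding the same dictionary on the fly."""
    if not codes:
        return ""
    table = {i: ch for i, ch in enumerate(alphabet)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the entry being defined right now.
            entry = previous + previous[0]
        out.append(entry)
        table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(out)


sequence = "ACGTACGTACGTAAAA"
assert lzw_decompress(lzw_compress(sequence)) == sequence
```

A bits-per-base figure like the one reported in the abstract also depends on how the integer codes are packed into bits, which this sketch leaves out.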

Speciation over the edge: gene flow among non-human primate species across a formidable biogeographic barrier

Royal Society Open Science ◽

10.1098/rsos.170351 ◽

2017 ◽

Vol 4 (10) ◽

pp. 170351 ◽

Cited By ~ 6

Author(s):

Ben J. Evans ◽

Anthony J. Tosi ◽

Kai Zeng ◽

Jonathan Dushoff ◽

André Corvelo ◽

...

Keyword(s):

Gene Flow ◽

Genomic Data ◽

Primate Species ◽

Macaque Monkeys ◽

Transition Zones ◽

Terrestrial Vertebrates ◽

The World ◽

Biogeographic Barrier ◽

Wallace’s Line ◽

Human Primate

Many genera of terrestrial vertebrates diversified exclusively on one or the other side of Wallace’s Line, which lies between Borneo and Sulawesi islands in Southeast Asia, and demarcates one of the sharpest biogeographic transition zones in the world. Macaque monkeys are unusual among vertebrate genera in that they are distributed on both sides of Wallace’s Line, raising the question of whether dispersal across this barrier was an evolutionary one-off or a more protracted exchange and, if the latter, what the genomic consequences were. To explore the nature of speciation over the edge of this biogeographic divide, we used genomic data to test for evidence of gene flow between macaque species across Wallace’s Line after macaques colonized Sulawesi. We recovered evidence of post-colonization gene flow, most prominently on the X chromosome. These results are consistent with the proposal that gene flow is a pervasive component of speciation, even when barriers to gene flow seem almost insurmountable.

Exploring the Contribution of Work and Non-Work Sources of Social Support to Employee Well-being: A Mixed Methods Case Study

PsycEXTRA Dataset ◽

10.1037/e604062012-234 ◽

2012 ◽

Author(s):

Tina Kowalski

Keyword(s):

Social Support ◽

Mixed Methods ◽

Well Being ◽

Mixed Methods Case Study

Intersecting Mixed Methods and Case Study Research: Design Possibilities and Challenges

International Journal of Multiple Research Approaches ◽

10.29034/ijmra.v10n1a1 ◽

2018 ◽

Vol 10 (1) ◽

pp. 14-29 ◽

Cited By ~ 2

Author(s):

Vicki L. Plano Clark ◽

Lori A. Foote ◽

Janet B. Walton ◽

...

Keyword(s):

Mixed Methods ◽

Research Design ◽

Case Study Research ◽

Study Research

Faculty Opinions recommendation of A likelihood ratio test of speciation with gene flow using genomic sequence data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.3540959.3240060 ◽

2010 ◽

Author(s):

Nicolas Galtier ◽

Julien Dutheil

Keyword(s):

Gene Flow ◽

Likelihood Ratio ◽

Likelihood Ratio Test ◽

Genomic Sequence ◽

Sequence Data ◽

Ratio Test

PoGB-Pred: Prediction of Antifreeze Proteins Sequences using Amino Acid Composition with Feature Selection followed by a Sequential based Ensemble Approach

Current Bioinformatics ◽

10.2174/1574893615999200707141926 ◽

2020 ◽

Vol 15 ◽

Author(s):

Affan Alim ◽

Abdul Rafay ◽

Imran Naseem

Keyword(s):

Amino Acid ◽

Dimension Reduction ◽

Protein Identification ◽

Cold Water ◽

Genomic Sequence ◽

Sequence Data ◽

Antifreeze Proteins ◽

Building Blocks ◽

Gradient Boosting ◽

Proposed Model

Background: Proteins contribute significantly to every task of cellular life. Their functions encompass the building and repairing of tissues in human bodies and other organisms; hence they are the building blocks of bones, muscles, cartilage, skin, and blood. Similarly, antifreeze proteins (AFPs) are of prime significance for organisms that live in very cold areas. With the help of these proteins, cold-water organisms can survive at sub-zero temperatures and resist the water crystallization process, which may rupture internal cells and tissues. AFPs have attracted attention and interest in the food industry and in cryopreservation. Objective: With the increasing availability of genomic sequence data for proteins, an automated and sophisticated tool for AFP recognition and identification is urgently needed. The sequences and structures of AFPs are highly distinct; therefore, most proposed methods fail to show promising results across different structures. A consolidated method is proposed to deliver competitive performance on highly distinct AFP structures. Methods: In this study, we propose to use the machine learning algorithms Principal Component Analysis (PCA) followed by Gradient Boosting (GB) for antifreeze protein identification. To analyze the performance and validity of the proposed model, various combinations of two-segment amino acid and dipeptide compositions are used. PCA is used for dimension reduction while retaining the high-variance components of the data, and is followed by an ensemble method, gradient boosting, for modelling and classification. Results: The proposed method obtained superior performance on the PDB, Pfam and UniProt datasets compared with the RAFP-Pred method. In experiment 3, using only 150 PCA components, an accuracy of 89.63 was achieved, which is superior to the 87.41 reported for the RAFP-Pred method using 300 significant features. Experiment 2 was conducted using two different datasets: non-AFPs from the PISCES server and AFPs from the Protein Data Bank. In this experiment, our proposed method attained a sensitivity of 79.16, which is 12.50 higher than that of the state-of-the-art RAFP-Pred method. Conclusion: AFPs have a common function with distinct structures. Therefore, a single model developed for different sequences often fails for AFPs. Our proposed model has shown robust results across diverse training and testing datasets, and it outperformed the previous AFP prediction method RAFP-Pred. Our model consists of PCA for dimension reduction followed by gradient boosting for classification. Owing to its simplicity, scalability and high performance, our model can be easily extended to the analysis of proteomic and genomic datasets.
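The amino acid and dipeptide composition features named in the Methods can be sketched in plain Python as follows. This is an illustrative assumption about how such features are commonly computed (relative frequencies over the 20 standard residues and the 400 ordered residue pairs). The names `AMINO_ACIDS`, `amino_acid_composition` and `dipeptide_composition` are hypothetical, and the two-segment splitting, PCA and gradient boosting steps described in the abstract are not reproduced here.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq: str) -> list[float]:
    """Return the 20 relative residue frequencies of a protein sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    # Non-standard residues count towards the length but get no feature slot.
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq: str) -> list[float]:
    """Return the 400 relative frequencies of overlapping residue pairs."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return [counts[a + b] / len(pairs) for a, b in product(AMINO_ACIDS, repeat=2)]


# A toy sequence, not taken from any dataset in the paper.
features = amino_acid_composition("MKTAYIAK") + dipeptide_composition("MKTAYIAK")
print(len(features))  # 420
```

A 420-dimensional vector of this kind is the sort of input that a dimension-reduction step such as PCA would then compress before classification.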

Change, Challenges, and Mixed Methods

10.1093/oso/9780199330010.003.0004 ◽

2017 ◽

Author(s):

Jeasik Cho

Keyword(s):

Qualitative Research ◽

Mixed Methods ◽

Mixed Methods Research ◽

Evaluation Criteria ◽

Scientific Research ◽

Evaluation Strategy ◽

Validity Criteria ◽

Research Questions ◽

The Way

This chapter discusses three ongoing issues related to the evaluation of qualitative research. First, the chapter considers whether a set of evaluation criteria is either determinative or changeable. Due to the evolving nature of qualitative research, it is likely that the way in which qualitative research is evaluated can change—not all at once, but gradually. Second, qualitative research has been criticized by newly resurrected positivists whose definitions of scientific research and evaluation criteria are narrow. “Politics of evidence” and a recent big-tent evaluation strategy are examined. Last, this chapter analyzes how validity criteria of qualitative research are incorporated into the evaluation of mixed methods research. The elements of qualitative research seem to be fairly represented but are largely treated as trivial. A criterion, the fit of research questions to design, is identified as distinctive in the review guide of the Journal of Mixed Methods Research.