data independence
Recently Published Documents

TOTAL DOCUMENTS: 68 (five years: 3)
H-INDEX: 11 (five years: 0)

2021 ◽  
Author(s):  
Niloofar Borhani ◽  
Jafar Ghaisari ◽  
Maryam Abedi ◽  
Marzieh Kamali ◽  
Yousof Gheisari

Abstract: Despite enormous achievements in the production of high-throughput datasets, constructing comprehensive maps of interactions remains a major challenge. The lack of sufficient experimental evidence on interactions is even more pronounced for heterogeneous molecular types. Hence, developing strategies to predict inter-omics connections is essential for constructing holistic maps of disease. Here, Data Integration with Deep Learning (DIDL), a novel nonlinear deep learning method, is proposed to predict inter-omics interactions. It consists of an encoder that automatically extracts features for biomolecules from existing interactions and a decoder that predicts novel interactions. The applicability of DIDL is assessed on different networks, namely drug-target protein, transcription factor-DNA element, and miRNA-mRNA, and the validity of novel predictions is assessed by literature surveys. DIDL outperformed state-of-the-art methods: the area under the curve and the area under the precision-recall curve were above 0.85 and 0.83, respectively, for all three networks. DIDL has several advantages, including automatic feature extraction from raw data, end-to-end training, and robustness to sparsity. In addition, its tensor decomposition structure, predictions based solely on existing interactions, and independence from biochemical data make DIDL applicable to a variety of biological networks. DIDL paves the way to understanding the underlying mechanisms of complex disorders through the construction of integrative networks.
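The encoder/decoder idea described in the abstract can be sketched as a toy logistic matrix factorization over a binary interaction matrix. This is only a minimal illustration of link prediction from existing interactions: the matrix, the hyperparameters, and all names below are invented, and DIDL itself is a deep nonlinear model that this linear-embedding sketch does not reproduce.

```python
import math, random

# Hypothetical toy interaction matrix (rows: drugs, cols: target proteins);
# 1 = known interaction, 0 = unobserved. Data are illustrative only.
R = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 0],
     [0, 1, 1]]

K = 2               # latent feature dimension (the "encoder" output size)
LR, EPOCHS = 0.1, 2000
random.seed(0)

n_rows, n_cols = len(R), len(R[0])
U = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(n_rows)]
V = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(n_cols)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(i, j):
    # "Decoder": score a candidate interaction from the learned embeddings.
    return sigmoid(sum(U[i][k] * V[j][k] for k in range(K)))

# Fit the embeddings by gradient descent on the logistic loss.
for _ in range(EPOCHS):
    for i in range(n_rows):
        for j in range(n_cols):
            err = predict(i, j) - R[i][j]
            for k in range(K):
                u, v = U[i][k], V[j][k]
                U[i][k] -= LR * err * v
                V[j][k] -= LR * err * u

# A known interaction should now score higher than an unobserved pair.
known = predict(0, 0)
unseen = predict(0, 1)
print(known, unseen)
```

After training, ranking unobserved pairs by their decoder score is the basic mechanism behind proposing novel interactions for literature validation.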


Author(s):  
Erta Kalanxhi ◽  
Gilbert Osena ◽  
Geetanjali Kapoor ◽  
Eili Klein

Abstract

Background: Antimicrobial resistance (AMR) is one of the greatest global health challenges today, but burden assessment is hindered by the uncertainty of AMR prevalence estimates. Geographical representations of AMR typically pool data collected from several laboratories; however, such aggregation may introduce bias by not accounting for the heterogeneity of the population each laboratory represents.

Methods: We used AMR data from up to 381 laboratories in the United States from The Surveillance Network to evaluate methods for estimating the uncertainty of AMR prevalence estimates. We constructed confidence intervals for the proportion of resistant isolates using (1) methods that account for the clustered structure of the data and (2) standard methods that assume data independence. Using samples of the full dataset with increasing facility coverage, we examined how likely the estimated confidence intervals were to include the population mean.

Results: Methods that construct 95% confidence intervals while accounting for possible within-cluster correlation (the survey method and standard methods adjusted to employ cluster-robust errors) were more likely to include the sample mean than standard methods (logit, Wilson score, and Jeffreys intervals) operating under the assumption of independence. While increased geographical coverage improved the probability of encompassing the mean for all methods, large samples still did not compensate for the bias introduced by violating the independence assumption.

Conclusion: General methods for estimating confidence intervals of AMR rates that assume independent data are likely to produce biased results. When feasible, the clustered structure of the data and any intra-cluster variation should be accounted for when calculating confidence intervals around AMR estimates, in order to better capture the uncertainty of prevalence estimates.
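The contrast the authors describe can be sketched numerically: a Wilson score interval computed as if all isolates were independent, versus a simple cluster-robust interval built from lab-level proportions. The lab counts, prevalences, and the particular robust formula below are invented for illustration and are only a rough stand-in for the survey methods used in the paper.

```python
import math, random

random.seed(1)

# Hypothetical clustered AMR data: each lab has its own underlying resistance
# prevalence, so isolates from the same lab are correlated. Numbers are invented.
labs = []
for _ in range(30):
    p_lab = random.betavariate(2, 6)               # lab-specific prevalence
    n = random.randint(50, 200)                    # isolates tested at this lab
    r = sum(random.random() < p_lab for _ in range(n))
    labs.append((r, n))

R = sum(r for r, n in labs)
N = sum(n for r, n in labs)
p_hat = R / N
z = 1.96

# (1) Wilson score interval, assuming all N isolates are independent.
denom = 1 + z * z / N
centre = (p_hat + z * z / (2 * N)) / denom
half = z * math.sqrt(p_hat * (1 - p_hat) / N + z * z / (4 * N * N)) / denom
wilson = (centre - half, centre + half)

# (2) Cluster-robust interval: treat lab-level proportions as the sampling
# units, weighting each lab by its isolate count (a crude survey-style sketch).
m = len(labs)
props = [r / n for r, n in labs]
weights = [n / N for r, n in labs]
mean_c = sum(w * p for w, p in zip(weights, props))
var_c = sum(w * (p - mean_c) ** 2 for w, p in zip(weights, props)) / (m - 1)
cluster = (mean_c - z * math.sqrt(var_c), mean_c + z * math.sqrt(var_c))

print("Wilson (independence):", wilson)
print("Cluster-robust       :", cluster)
```

With between-lab heterogeneity like this, the independence-based interval comes out far narrower than the cluster-robust one, which is exactly the kind of overconfidence the abstract warns about.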


2021 ◽  
Vol 7 ◽  
Author(s):  
Mark A. Stevenson

In the design of intervention and observational epidemiological studies, sample size calculations provide estimates of the minimum number of observations needed to ensure that the stated objectives of a study are met. Justification of the number of subjects enrolled in a study, together with details of the assumptions and methodologies used to derive sample size estimates, is now a mandatory component of grant applications to funding agencies. Studies with insufficient numbers of subjects risk failing to identify differences among treatment or exposure groups when differences do, in fact, exist, while enrolling more subjects than actually required wastes time and resources. In contrast to human epidemiological research, individual study subjects in a veterinary setting are almost always aggregated into hierarchical groups; for this reason, sample size estimates calculated using formulae that assume data independence are not appropriate. This paper provides an overview of the reasons researchers might need to calculate an appropriate sample size in veterinary epidemiology and a summary of sample size calculation methods. Two approaches are presented for dealing with lack of data independence: (1) inflation of crude sample size estimates using a design effect; and (2) simulation-based methods. The advantage of simulation is that appropriate sample sizes can be estimated for complex study designs for which formula-based methods are not available. The methodological approach to simulation is described and a worked example provided.
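The first approach, inflating a crude estimate by a design effect, can be illustrated with the standard formulae. The prevalence, precision, herd size, and intra-cluster correlation below are invented for illustration and are not taken from the paper.

```python
import math

def crude_n_for_proportion(p, margin, z=1.96):
    """Sample size to estimate a proportion p to within +/- margin,
    assuming independent observations."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

def design_effect(cluster_size, icc):
    """DEFF = 1 + (b - 1) * ICC for clusters of average size b."""
    return 1 + (cluster_size - 1) * icc

p, margin = 0.30, 0.05        # assumed prevalence, desired precision
b, icc = 20, 0.10             # animals per herd, intra-cluster correlation

n_crude = crude_n_for_proportion(p, margin)   # animals, ignoring clustering
deff = design_effect(b, icc)                  # inflation factor
n_adjusted = math.ceil(n_crude * deff)        # animals, accounting for clustering
n_herds = math.ceil(n_adjusted / b)           # herds to enrol

print(n_crude, deff, n_adjusted, n_herds)
```

With these assumed inputs the design effect is 2.9, so ignoring the herd structure would understate the required sample by almost a factor of three.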


2020 ◽  
Vol 14 (4) ◽  
pp. 498-506 ◽  
Author(s):  
Ingo Müller ◽  
Ghislain Fourny ◽  
Stefan Irimescu ◽  
Can Berker Cikis ◽  
Gustavo Alonso

This paper introduces Rumble, a query execution engine for large, heterogeneous, and nested collections of JSON objects, built on top of Apache Spark. While data sets of this type are increasingly widespread, most existing tools are built around a tabular data model, creating an impedance mismatch for both the engine and the query interface. In contrast, Rumble uses JSONiq, a standardized language specifically designed for querying JSON documents. The key challenge in the design and implementation of Rumble is mapping the recursive structure of JSON documents and JSONiq queries onto Spark's execution primitives, which are based on tabular data frames. Our solution is to translate a JSONiq expression into a tree of iterators that dynamically switch between local and distributed execution modes depending on the nesting level. By overcoming the impedance mismatch in the engine, Rumble frees users from solving the same problem for every single query, considerably increasing their productivity. As we show in extensive experiments, Rumble scales to large and complex data sets in the terabyte range with performance similar to or better than other engines. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does for highly structured tables.


2020 ◽  
Author(s):  
Miaoshan Lu ◽  
Shaowei An ◽  
Ruimin Wang ◽  
Jinyin Wang ◽  
Changbin Yu

Abstract: With the precision of mass spectrometers increasing and the emergence of data-independent acquisition (DIA), file sizes are growing rapidly. Beyond the widely used open format mzML (Deutsch 2008), near-lossless and lossless compression algorithms and formats have emerged. The data precision is often related to the instrument and the subsequent processing algorithms. Unlike storage-oriented formats, which focus on lossless compression and compression rate, computation-oriented formats weigh decoding speed and disk read strategy as heavily as compression rate. Here we describe "Aird", an open-source, computation-oriented format with controllable precision, flexible indexing strategies, and a high compression rate. Aird uses JavaScript Object Notation (JSON) for metadata storage, together with multiple indexing and reordered storage strategies for faster random reads. Aird also provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data compression. Compared with Zlib alone, m/z data size is about 65% lower in Aird, and decoding takes merely 33% of the time.

Availability: The Aird SDK is written in Java and allows scholars to access mass spectrometry data efficiently. It is available at https://github.com/Propro-Studio/Aird-SDK. AirdPro, which converts vendor files into Aird files, is available at https://github.com/Propro-Studio/AirdPro
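The controllable-precision and delta ("Diff") steps behind a ZDPD-style compressor can be sketched as follows: fix a precision, convert the ascending m/z floats to integers, store first-order deltas, and feed them to zlib. This is only an illustrative sketch in Python; the real Aird format is implemented in Java, uses PforDelta packing rather than plain zlib of deltas, and differs in many details. The m/z values and the 1e-4 precision are invented.

```python
import struct
import zlib

def encode(mz_values, precision=1e-4):
    """Quantize ascending m/z floats to ints (lossy, fixed precision),
    delta-encode them, then compress the deltas with zlib."""
    ints = [round(v / precision) for v in mz_values]
    deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
    raw = struct.pack("<%di" % len(deltas), *deltas)
    return zlib.compress(raw)

def decode(blob, precision=1e-4):
    """Invert the pipeline: decompress, undo the deltas, rescale."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack("<%di" % (len(raw) // 4), raw)
    ints, acc = [], 0
    for d in deltas:
        acc += d
        ints.append(acc)
    return [i * precision for i in ints]

mz = [152.0057, 152.0061, 153.0102, 410.2003, 410.2008]
blob = encode(mz)
restored = decode(blob)
print(restored)
```

Because neighbouring peaks are close, the deltas are small integers that compress far better than the raw floats, while the round trip stays within half of the chosen precision.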


2020 ◽  
Vol 34 (28) ◽  
pp. 2050311
Author(s):  
Satvik Vats ◽  
B. B. Sagar

In the big data domain, platform dependency can alter the behavior of a business because of the different kinds (structured, semi-structured, and unstructured) and characteristics of data. With traditional infrastructure, different kinds of data cannot be processed simultaneously, since each is tied to a particular platform for a given task; the responsibility for selecting suitable tools therefore lies with the user. The variety of data generated by different sources requires the selection of suitable tools without human intervention. Further, these tools face resource limitations when dealing with large volumes of data, which degrades their performance in terms of execution time. Therefore, in this work we propose a model in which different data analytics tools share a common infrastructure to provide data independence and a resource-sharing environment: the proposed model shares a common (hybrid) Hadoop Distributed File System (HDFS) among three Name-Nodes (master nodes), three Data-Nodes, and one client node, which operates within a demilitarized zone (DMZ). To realize this model, we implemented Mahout, R-Hadoop, and Splunk sharing a common HDFS. Using our model, we ran k-means clustering, Naïve Bayes, and recommender algorithms on three different datasets (movie ratings, newsgroups, and spam SMS), representing structured, semi-structured, and unstructured data, respectively. Our model selected the appropriate tool, e.g. Mahout for the newsgroup dataset, on which the other tools cannot run; this shows that our model provides data independence. The results of the proposed model are compared with the legacy (individual) model in terms of execution time and scalability. The improved performance of the proposed model supports the hypothesis that it overcomes the resource limitations of the legacy model.
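The idea of routing each data kind to a suitable tool without human intervention can be sketched as a small dispatcher. Everything here is hypothetical: the classification heuristic, the tool-to-kind mapping, and the sample records are invented for illustration and are not the authors' actual selection logic.

```python
# Hypothetical mapping from data kind to analytics tool, loosely mirroring
# the datasets mentioned in the abstract. Illustrative only.
TOOL_FOR_KIND = {
    "structured": "R-Hadoop",        # e.g. movie-rating tables
    "semi-structured": "Mahout",     # e.g. newsgroup text with headers
    "unstructured": "Splunk",        # e.g. raw SMS messages
}

def classify(record):
    """Crude data-kind heuristic: dicts count as structured records,
    text with markup or header tags as semi-structured, the rest as
    unstructured free text."""
    if isinstance(record, dict):
        return "structured"
    if isinstance(record, str) and record.lstrip().startswith(("Subject:", "<")):
        return "semi-structured"
    return "unstructured"

def select_tool(record):
    """Pick a tool for a record with no user intervention."""
    return TOOL_FOR_KIND[classify(record)]

print(select_tool({"user": 1, "movie": 42, "rating": 5.0}))
print(select_tool("Subject: rec.autos digest"))
print(select_tool("Free entry!! Text WIN to 80082"))
```

In the proposed model this selection would sit in front of the shared HDFS, so all three tools read from one store while each job lands on the engine that can actually process its data kind.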


Author(s):  
Safa Brahmia ◽  
Zouhaier Brahmia ◽  
Fabio Grandi ◽  
Rafik Bouaziz

The JSON Schema language lacks explicit support for defining time-varying schemas of JSON documents. Moreover, existing JSON NoSQL databases (e.g., MongoDB, CouchDB) do not provide any support for managing temporal data, so administrators of JSON NoSQL databases must resort to ad hoc techniques to specify JSON schemas for time-varying instances. In this chapter, the authors propose a disciplined approach, named Temporal JSON Schema (τJSchema), for the temporal management of JSON documents. τJSchema allows creating a temporal JSON schema from (1) a conventional JSON schema, (2) a set of temporal logical characteristics specifying which components of a JSON document can vary over time, and (3) a set of temporal physical characteristics specifying how the time-varying aspects are represented in the document. By using such characteristics to describe the temporal aspects of JSON data, τJSchema guarantees logical and physical data independence and provides a low-impact solution, since it requires neither updates to existing JSON documents nor extensions to related JSON technologies.

