Using Greedy Random Adaptive Procedure to Solve the User Selection Problem in Mobile Crowdsourcing

Sensors ◽  
2019 ◽  
Vol 19 (14) ◽  
pp. 3158
Author(s):  
Jian Yang ◽  
Xiaojuan Ban ◽  
Chunxiao Xing

With the rapid development of mobile networks and smart terminals, mobile crowdsourcing has aroused the interest of relevant scholars and industries. In this paper, we propose a new solution to the problem of user selection in mobile crowdsourcing systems. Existing user selection schemes mainly either: (1) find a subset of users that maximizes crowdsourcing quality under a given budget constraint; or (2) find a subset of users that minimizes cost while meeting a minimum crowdsourcing quality requirement. However, these solutions fall short of simultaneously maximizing the quality of service of the task and minimizing its cost. Inspired by the marginalism principle in economics, we select a new user only when the marginal gain of the newly joined user exceeds the payment and the marginal cost associated with integration. We model the scheme as a marginalism problem of mobile crowdsourcing user selection (MCUS-marginalism). We rigorously prove that the MCUS-marginalism problem is NP-hard, and propose a greedy random adaptive procedure with annealing randomness (GRASP-AR) to maximize the gain and minimize the cost of the task. The effectiveness and efficiency of our proposed approach are clearly verified by large-scale experimental evaluations on both real-world and synthetic data sets.
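The marginalism stopping rule can be sketched as a simple greedy loop. This is an illustrative toy, not the paper's GRASP-AR: the set-coverage quality function, the skill sets, and the per-user costs are all invented for the example.

```python
# Toy greedy selection driven by the marginalism rule: admit a user only
# while the best candidate's marginal quality gain exceeds their cost.
# quality() is an invented submodular stand-in (set coverage of "skills").

def quality(selected):
    covered = set()
    for user in selected:
        covered |= user["skills"]
    return len(covered)

def select_users(candidates):
    selected = []
    remaining = list(candidates)
    while remaining:
        base = quality(selected)
        # Candidate with the largest net marginal benefit (gain minus cost).
        best = max(remaining, key=lambda u: quality(selected + [u]) - base - u["cost"])
        gain = quality(selected + [best]) - base
        if gain <= best["cost"]:  # marginal gain no longer covers the cost: stop
            break
        selected.append(best)
        remaining.remove(best)
    return selected

users = [
    {"name": "a", "skills": {"photo", "gps"}, "cost": 1.0},
    {"name": "b", "skills": {"photo"}, "cost": 2.0},
    {"name": "c", "skills": {"audio"}, "cost": 0.5},
]
chosen = select_users(users)  # picks "a" (gain 2 > 1), then "c" (gain 1 > 0.5)
```

The paper's GRASP-AR additionally randomizes the greedy choice and anneals that randomness; the loop above only shows the stopping criterion.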

2021 ◽  
Author(s):  
Andrew J Kavran ◽  
Aaron Clauset

Abstract
Background: Large-scale biological data sets are often contaminated by noise, which can impede accurate inferences about underlying processes. Such measurement noise can arise from endogenous biological factors like cell cycle and life history variation, and from exogenous technical factors like sample preparation and instrument variation.
Results: We describe a general method for automatically reducing noise in large-scale biological data sets. This method uses an interaction network to identify groups of correlated or anti-correlated measurements that can be combined or “filtered” to better recover an underlying biological signal. Similar to the process of denoising an image, a single network filter may be applied to an entire system, or the system may first be decomposed into distinct modules and a different filter applied to each. Applied to synthetic data with known network structure and signal, network filters accurately reduce noise across a wide range of noise levels and structures. Applied to a machine learning task of predicting changes in human protein expression in healthy and cancerous tissues, network filtering prior to training increases accuracy by up to 43% compared to using unfiltered data.
Conclusions: Network filters are a general way to denoise biological data and can account for both correlation and anti-correlation between different measurements. Furthermore, we find that partitioning a network prior to filtering can significantly reduce errors in networks with heterogeneous data and correlation patterns, and this approach outperforms existing diffusion-based methods. Our results on proteomics data indicate the broad potential utility of network filters to applications in systems biology.
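The filtering step can be sketched with the simplest conceivable network filter, a signed neighborhood mean; the paper's filters and its module-wise decomposition are more elaborate, and the graph and sign conventions below are assumptions made for illustration.

```python
def network_filter(values, neighbors, signs=None):
    """Replace each node's noisy measurement with a mean over its network
    neighborhood. Anti-correlated neighbors (sign -1) contribute with a
    flipped sign, so they still pull the estimate toward the shared signal."""
    filtered = {}
    for node, v in values.items():
        acc, count = v, 1
        for nbr in neighbors.get(node, []):
            s = signs.get((node, nbr), 1) if signs else 1
            acc += s * values[nbr]
            count += 1
        filtered[node] = acc / count
    return filtered

# Three hypothetical protein measurements linked in a small interaction graph.
vals = {"p1": 1.2, "p2": 0.8, "p3": 1.0}
adj = {"p1": ["p2", "p3"], "p2": ["p1"], "p3": ["p1"]}
smoothed = network_filter(vals, adj)  # smoothed["p1"] ≈ mean(1.2, 0.8, 1.0)
```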


Author(s):  
Anteneh Ayanso ◽  
Paulo B. Goes ◽  
Kumar Mehta

Relational databases have increasingly become the basis for a wide range of applications that require efficient methods for exploratory search and retrieval. Top-k retrieval addresses this need and involves finding a limited number of records whose attribute values are the closest to those specified in a query. One of the approaches in the recent literature is query-mapping which deals with converting top-k queries into equivalent range queries that relational database management systems (RDBMSs) normally support. This approach combines the advantages of simplicity as well as practicality by avoiding the need for modifications to the query engine, or specialized data structures and indexing techniques to handle top-k queries separately. This paper reviews existing query-mapping techniques in the literature and presents a range query estimation method based on cost modeling. Experiments on real world and synthetic data sets show that the cost-based range estimation method performs at least as well as prior methods and avoids the need to calibrate workloads on specific database contents.
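The query-mapping idea can be sketched as follows. The table, column, and `conn_execute` callable are hypothetical placeholders; in the paper's method the initial range would come from a cost-model estimate rather than a caller-supplied radius, with a restart only when the range returns fewer than k records.

```python
def top_k_via_range(conn_execute, k, point, radius, grow=2.0):
    """Answer a top-k "closest value" query using only the range (BETWEEN)
    predicates an RDBMS evaluates natively, widening the range on a miss."""
    while True:
        rows = conn_execute(
            "SELECT id, val FROM items WHERE val BETWEEN ? AND ?",
            (point - radius, point + radius),
        )
        if len(rows) >= k:
            rows.sort(key=lambda r: abs(r[1] - point))  # rank by distance
            return rows[:k]
        radius *= grow  # too few matches: restart with a wider range

# Hypothetical executor standing in for a real RDBMS connection.
def demo_execute(sql, params):
    lo, hi = params
    data = [(i, float(i)) for i in range(10)]
    return [row for row in data if lo <= row[1] <= hi]

nearest = top_k_via_range(demo_execute, 3, 5.0, 0.4)  # ids 5, 4, 6
```

An underestimated range costs a restart, while an overestimated range retrieves too many records, which is why the range-size estimate matters.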


2018 ◽  
Author(s):  
Li Chen ◽  
Bai Zhang ◽  
Michael Schnaubelt ◽  
Punit Shah ◽  
Paul Aiyetan ◽  
...  

Abstract
Rapid development and wide adoption of mass spectrometry-based proteomics technologies have empowered scientists to study proteins and their modifications in complex samples on a large scale. This progress has also created unprecedented challenges for individual labs to store, manage, and analyze proteomics data, both in the cost of proprietary software and high-performance computing and in the long processing time that discourages on-the-fly changes of data processing settings required in explorative and discovery analysis. We developed an open-source, cloud computing-based pipeline, MS-PyCloud, with graphical user interface (GUI) support, for LC-MS/MS data analysis. The major components of this pipeline include data file integrity validation, MS/MS database search for spectral assignment, false discovery rate estimation, protein inference, determination of protein post-translational modifications, and quantitation of specific (modified) peptides and proteins. To ensure the transparency and reproducibility of data analysis, MS-PyCloud includes open-source software tools with comprehensive testing and versioning for spectrum assignments. Leveraging public cloud computing infrastructure via Amazon Web Services (AWS), MS-PyCloud scales seamlessly based on analysis demand to achieve fast and efficient performance. Application of the pipeline to the analysis of large-scale iTRAQ/TMT LC-MS/MS data sets demonstrated the effectiveness and high performance of MS-PyCloud. The software can be downloaded at: https://bitbucket.org/mschnau/ms-pycloud/downloads/


Author(s):  
Brent A. Jones

Many smaller pipeline operating companies see the benefits of implementing a Geographic Information System (GIS) to organize pipeline data and meet the requirements of 49 CFR 195, but cannot justify the cost of a large-scale AM/FM/GIS system. PPL Interstate Energy Company (PPL IE) is a pipeline company with 84 miles of main that implemented a GIS solution that leverages both existing technology and facility data investments. This paper discusses the process used to acquire landbase data, to organize existing pipeline data from a variety of paper-based and digital sources, and to integrate these data sets. It will also discuss the functionality and benefits of the resultant GIS.


2014 ◽  
Vol 1008-1009 ◽  
pp. 723-728 ◽  
Author(s):  
Xu Dong Song ◽  
Nan Hua Yu ◽  
Jun Jun Liang ◽  
Cheng Jun Xia

With the rapid development of distributed generation (DG), and especially the access of large-scale renewable energy, the traditional distribution network with a single source is turning into an active distribution network (ADN) with multiple sources, making the distribution network more complicated. In this paper, the power source and grid planning of traditional and intelligent distribution networks are discussed, and on this basis the problems ADN faces and the associated research difficulties are examined. The key technologies of ADN planning are analyzed, including the uncertainty of load forecasting, the ADN absorption capacity for DG, and the cost-effectiveness of ADN planning. Suggestions for future research directions are made at the end, providing a reference for ADN planning with large-scale renewable energy access.


Geophysics ◽  
2011 ◽  
Vol 76 (3) ◽  
pp. F203-F214 ◽  
Author(s):  
A. Abubakar ◽  
M. Li ◽  
G. Pan ◽  
J. Liu ◽  
T. M. Habashy

We have developed an inversion algorithm for jointly inverting controlled-source electromagnetic (CSEM) data and magnetotelluric (MT) data. It is well known that CSEM and MT data provide complementary information about the subsurface resistivity distribution; hence, it is useful to derive earth resistivity models that simultaneously and consistently fit both data sets. Because we are dealing with a large-scale computational problem, one usually uses an iterative technique in which a predefined cost function is optimized. One of the issues of this simultaneous joint inversion approach is how to assign the relative weights on the CSEM and MT data in constructing the cost function. We propose a multiplicative cost function instead of the traditional additive one. This function does not require an a priori choice of the relative weights between these two data sets. It will adaptively put CSEM and MT data on equal footing in the inversion process. The inversion is accomplished with a regularized Gauss-Newton minimization scheme where the model parameters are forced to lie within their upper and lower bounds by a nonlinear transformation procedure. We use a line search scheme to enforce a reduction of the cost function at each iteration. We tested our joint inversion approach on synthetic and field data.
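The contrast between the additive and multiplicative cost functions can be sketched as follows; the normalized least-squares misfit is a placeholder for the actual CSEM and MT data misfits, and the regularization term is omitted.

```python
import numpy as np

def misfit(predicted, observed):
    # Normalized data misfit: dimensionless, ~1 at 100% relative error.
    return np.sum((predicted - observed) ** 2) / np.sum(observed ** 2)

def additive_cost(pred_csem, obs_csem, pred_mt, obs_mt, w_csem, w_mt):
    # Traditional weighted sum: the relative weights must be chosen a priori.
    return w_csem * misfit(pred_csem, obs_csem) + w_mt * misfit(pred_mt, obs_mt)

def multiplicative_cost(pred_csem, obs_csem, pred_mt, obs_mt):
    # Product of the misfits: no explicit weights, and a relative reduction
    # in either term reduces the total by the same factor, which keeps the
    # two data sets on equal footing during the iterations.
    return misfit(pred_csem, obs_csem) * misfit(pred_mt, obs_mt)
```

With the additive form, a poorly chosen weight lets one data set dominate the Gauss-Newton updates; the multiplicative form sidesteps that choice entirely.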


2018 ◽  
Author(s):  
Berline Fopa Fomeju ◽  
Dominique Brunel ◽  
Aurélie Bérard ◽  
Jean-Baptiste Rivoal ◽  
Philippe Gallois ◽  
...  

Abstract
Next-Generation Sequencing (NGS) technologies, by reducing the cost and increasing the throughput of sequencing, have opened the door to generating genomic data for a range of previously poorly studied species. In this study, we propose a method for the rapid development of large-scale molecular resources for orphan species. As an example we studied Lavandula angustifolia, a perennial sub-shrub plant native to the Mediterranean region whose essential oil has numerous applications in cosmetics, pharmaceuticals, and alternative medicines. We first built a ‘Maillette’ reference Unigene, composed of coding sequences, by de novo RNA-seq assembly. Then, we reconstructed the complete gene sequences (with exons and introns) using a transcriptome-guided DNA-seq assembly approach in order to maximize the possibilities of finding polymorphism between genetically close individuals. Finally, we used these resources for SNP mining within a collection of 16 lavender clones and tested the SNPs within the scope of a phylogeny analysis. We obtained a cleaned reference of 8,030 functionally annotated ‘genes’ (in silico annotation). We found up to 400K polymorphic sites, depending on the genotype analyzed, and observed a high SNP frequency (a mean of 1 SNP per 90 bp) and a high level of heterozygosity (more than 60% of heterozygous SNPs per genotype). We found similar genetic distances between pairs of clones, related to the out-crossing nature of the species, the restricted area of cultivation, and the clonal propagation of the varieties. The proposed method is transferable to other orphan species, requires few bioinformatics resources, and can be completed within a year. This is the first reported large-scale SNP development for Lavandula angustifolia. These data provide a rich pool of molecular resources to explore and exploit biodiversity in breeding programs.


2020 ◽  
Vol 29 (01) ◽  
pp. 129-138 ◽  
Author(s):  
Anirudh Choudhary ◽  
Li Tong ◽  
Yuanda Zhu ◽  
May D. Wang

Introduction: There has been a rapid development of deep learning (DL) models for medical imaging. However, DL requires a large labeled dataset for training the models. Getting large-scale labeled data remains a challenge, and multi-center datasets suffer from heterogeneity due to patient diversity and varying imaging protocols. Domain adaptation (DA) has been developed to transfer the knowledge from a labeled data domain to a related but unlabeled domain in either image space or feature space. DA is a type of transfer learning (TL) that can improve the performance of models when applied to multiple different datasets.
Objective: In this survey, we review the state-of-the-art DL-based DA methods for medical imaging. We aim to summarize recent advances, highlighting the motivation, challenges, and opportunities, and to discuss promising directions for future work in DA for medical imaging.
Methods: We surveyed peer-reviewed publications from leading biomedical journals and conferences between 2017 and 2020 that reported the use of DA in medical imaging applications, grouping them by methodology, image modality, and learning scenario.
Results: We mainly focused on pathology and radiology as application areas. Among various DA approaches, we discussed domain transformation (DT) and latent feature-space transformation (LFST). We highlighted the role of unsupervised DA in image segmentation and described opportunities for future development.
Conclusion: DA has emerged as a promising solution to deal with the lack of annotated training data. Using adversarial techniques, unsupervised DA has achieved good performance, especially for segmentation tasks. Opportunities include domain transferability, multi-modal DA, and applications that benefit from synthetic data.


2008 ◽  
Vol 5 (2) ◽  
Author(s):  
Yinyin Yuan ◽  
Chang-Tsun Li

Summary
We present a Bayes-Random Fields framework which is capable of integrating unlimited data sources for discovering the relevant network architecture of large-scale networks. The random field potential function is designed to impose a cluster constraint, teamed with a full Bayesian approach for incorporating heterogeneous data sets. The probabilistic nature of our framework facilitates robust analysis in order to minimize the influence of noise inherent in the data on the inferred structure in a seamless and coherent manner. This is later demonstrated in its applications to both large-scale synthetic data sets and Saccharomyces cerevisiae data sets. The analytical and experimental results reveal the varied characteristics of different types of data and reflect their discriminative ability in terms of identifying direct gene interactions.


Algorithms ◽  
2020 ◽  
Vol 14 (1) ◽  
pp. 5
Author(s):  
Dominik Köppl ◽  
Tomohiro I ◽  
Isamu Furuya ◽  
Yoshimasa Takabatake ◽  
Kensuke Sakai ◽  
...  

Re-Pair is a grammar compression scheme with favorably good compression rates. The computation of Re-Pair comes with the cost of maintaining large frequency tables, which makes it hard to compute Re-Pair on large-scale data sets. As a solution for this problem, we present, given a text of length n whose characters are drawn from an integer alphabet of size σ = n^O(1), an O(min(n², n² lg log_τ n · lg lg lg n / log_τ n)) time algorithm computing Re-Pair with max((n/c) lg n, n lg τ) + O(lg n) bits of working space including the text space, where c ≥ 1 is a fixed user-defined constant and τ is the sum of σ and the number of non-terminals. We give variants of our solution working in parallel or in the external memory model. Unfortunately, the algorithm does not yet seem practical, since a preliminary version already needs roughly one hour to compute Re-Pair on one megabyte of text.
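For orientation, a naive textbook Re-Pair (not the authors' space-efficient algorithm) makes the troublesome frequency table explicit: the `Counter` over all adjacent pairs below is exactly the structure whose memory footprint the paper's algorithm works to avoid.

```python
from collections import Counter

def re_pair(text):
    """Naive Re-Pair: repeatedly replace the most frequent adjacent pair
    with a fresh non-terminal until no pair occurs twice."""
    seq = list(text)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))  # the large frequency table
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        nonterminal = ("N", next_id)
        next_id += 1
        rules[nonterminal] = pair
        out, i = [], 0
        while i < len(seq):  # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

seq, rules = re_pair("abababab")  # two rules: N0 -> ab, N1 -> N0 N0
```

Rebuilding the pair table from scratch on every round already makes this quadratic, which hints at why a space-frugal yet fast variant is nontrivial.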

