Arabic Dialect Identification

Omar F. Zaidan; Chris Callison-Burch

doi:10.1162/coli_a_00169

Arabic Dialect Identification

Computational Linguistics ◽

10.1162/coli_a_00169 ◽

2014 ◽

Vol 40 (1) ◽

pp. 171-202 ◽

Cited By ~ 61

Author(s):

Omar F. Zaidan ◽

Chris Callison-Burch

Keyword(s):

Arabic Language ◽

Data Sets ◽

Data Set ◽

Native Languages ◽

Arabic Speakers ◽

Regional Dialects ◽

Word Sequence ◽

Written Form ◽

On Line ◽

Almost All

The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a non-trivial manner from the various spoken regional dialects of Arabic, the true “native languages” of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA's prevalence in written form, almost all Arabic data sets have predominantly MSA content. In this article, we describe the creation of a novel Arabic resource with dialect annotations. We have created a large monolingual data set rich in dialectal Arabic content called the Arabic On-line Commentary Data set (Zaidan and Callison-Burch 2011). We describe our annotation effort to identify the dialect level (and dialect itself) in each of more than 100,000 sentences from the data set by crowdsourcing the annotation task, and delve into interesting annotator behaviors (like over-identification of one's own dialect). Using this new annotated data set, we consider the task of Arabic dialect identification: Given the word sequence forming an Arabic sentence, determine the variety of Arabic in which it is written. We use the data to train and evaluate automatic classifiers for dialect identification, and establish that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy. Finally, we apply our classifiers to discover dialectal data from a large Web crawl consisting of 3.5 million pages mined from on-line Arabic newspapers.


BLIND SEPARATION OF MIXED KURTOSIS SIGNED SIGNALS USING PARTIAL OBSERVATIONS AND LOW COMPLEXITY ACTIVATION FUNCTIONS

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026804001239 ◽

2004 ◽

Vol 04 (02) ◽

pp. 207-223 ◽

Cited By ~ 1

Author(s):

KRISANA CHINNASARN ◽

CHIDCHANOK LURSINSAP ◽

VASILE PALADE

Keyword(s):

Blind Source Separation ◽

Source Separation ◽

Low Complexity ◽

Activation Function ◽

Data Sets ◽

Blind Separation ◽

Data Set ◽

Partial Observations ◽

On Line ◽

High Computational Complexity

Although several highly accurate blind source separation algorithms have already been proposed in the literature, these algorithms must store and process the whole data set, which may be very large in some situations. This makes blind source separation infeasible to realise at the VLSI level, owing to the large memory requirement and costly computation. This paper concerns algorithms that address very large data sets and high computational complexity, so that they can be run on-line and implemented at the VLSI level with acceptable accuracy. Our approach is to partition the observed signals into several parts and to extract the partitioned observations with a simple activation function performing only the "shift-and-add" micro-operation. No division, multiplication or exponential operations are needed. Moreover, a method for obtaining an optimal initial de-mixing weight matrix to speed up separation is also presented. The proposed algorithm is tested on some benchmarks available online. The experimental results show that our solution provides efficiency comparable to other approaches, but with lower space and time complexity.


DISC: Disambiguating homonyms using graph structural clustering

Journal of Information Science ◽

10.1177/0165551518761011 ◽

2018 ◽

Vol 44 (6) ◽

pp. 830-847 ◽

Cited By ~ 4

Author(s):

Ijaz Hussain ◽

Sohail Asghar

Keyword(s):

Ground Truth ◽

Detection Algorithm ◽

Bibliographic Databases ◽

Data Sets ◽

Database Integration ◽

Data Set ◽

Entity Disambiguation ◽

Structural Clustering ◽

Community Detection Algorithm ◽

Almost All

Author name ambiguity degrades information retrieval, database integration, search results and, more importantly, correct attributions in bibliographic databases. Some unresolved issues include how to ascertain the actual number of authors, how to improve the performance and how to make the method more effective in terms of representative clustering metrics (average cluster purity, average author purity, K-metric, pairwise precision, pairwise recall, pairwise-F1, cluster precision, cluster recall and cluster-F1). It is a non-trivial task to disambiguate authors using only the implicit bibliographic information. An effective method, ‘DISC’, is proposed that uses a graph community detection algorithm, feature vectors and graph operations to disambiguate homonyms. The citation data set is pre-processed and ambiguous author blocks are formed. A co-author graph is constructed using authors and their co-author relationships. A graph structural clustering algorithm, ‘gSkeletonClu’, is applied to identify hubs, outliers and clusters of nodes in the co-author graph. Homonyms are resolved by splitting these clusters of nodes across the hub if their feature vector similarity is less than a predefined threshold. DISC utilises only co-authors and titles, which are available in almost all bibliographic databases. With minor modifications, DISC can also be used for entity disambiguation. To validate the DISC performance, experiments are performed on two Arnetminer data sets and compared with five previous unsupervised methods. Despite using limited bibliographic metadata, DISC achieves on average K-metric, pairwise-F1, and cluster-F1 of 92%, 84% and 74%, respectively, using Arnetminer-S and 86%, 80% and 57%, respectively, using Arnetminer-L. About 77.5% and 73.2% of clusters are within the range (ground truth clusters ± 3) in Arnetminer-S and Arnetminer-L, respectively.
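As an illustration of the pairwise metrics named in this abstract, the following minimal sketch computes pairwise precision, recall and F1 for a predicted clustering against ground-truth author labels. It uses the usual pair-counting definition, which is assumed here rather than taken from the paper, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs (i, j), i < j, whose records share a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall and F1 for two labelings of the same records."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy block: five papers by two real authors (A, B),
# grouped by some disambiguation method into clusters 0 and 1.
truth = ["A", "A", "A", "B", "B"]
predicted = [0, 0, 1, 1, 1]
precision, recall, f1 = pairwise_scores(predicted, truth)
```

In this toy block, two of the four predicted pairs and two of the four true pairs agree, so all three scores equal 0.5.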


Estimates of parental-dominance and full-sib permanent environment variances in laying hens

Animal Science ◽

10.1017/s1357729800055326 ◽

2000 ◽

Vol 71 (3) ◽

pp. 421-426 ◽

Cited By ~ 9

Author(s):

I. Misztal ◽

B. Besbes

Keyword(s):

Egg Production ◽

Laying Hens ◽

Dominance Effect ◽

Data Sets ◽

Production Traits ◽

Data Set ◽

Dominance Variance ◽

Feasible Alternative ◽

Almost All ◽

Method R

Estimates of variance components for five egg traits on 26265 laying hens were obtained by restricted maximum likelihood (REML) using several models. In the DOMFS model, the effects included hatch group, additive genetic, full-sib, parental dominance and inbreeding depression. In the DOM model, the full-sib effect was eliminated. In the FS model, the parental dominance effect was eliminated. In the ADD model, both the full-sib and the dominance effects were eliminated. In the DOMFS model, the estimates of the full-sib variance were generally higher for egg production traits and lower for egg characteristics than those of the parental dominance variance. However, this model has partially failed in separating these two components. When the full-sib effect was removed from the model, almost all of its estimated variance moved to the estimated parental dominance variance. When the parental dominance effect was removed from the model, almost all of its estimated variance moved to the estimated full-sib variance. The estimates obtained with REML and the DOM model were compared with those obtained by method R and tilde-hat methodologies. The d² (ratio of dominance variance to total variance) differed by up to 86% for method R and up to 225% for tilde-hat. The h² differed by up to 26 and 28%, respectively. For data sets that are too large to be analysed with REML, method R is a feasible alternative. A model for estimation of dominance variance should also include the full-sib or a similar effect, provided the data set is large. Similarly, to analyse egg production traits, the model should include at least the full-sib effect.


Perturbation of the human gastrointestinal tract microbial ecosystem by oral drugs to treat chronic disease results in a spectrum of individual specific patterns of extinction and persistence of dominant microbial strains

PLoS ONE ◽

10.1371/journal.pone.0242021 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0242021

Author(s):

Hyunmin Koo ◽

Casey D. Morrow

Keyword(s):

Rheumatoid Arthritis ◽

Chronic Disease ◽

Published Data ◽

Data Sets ◽

Oral Drugs ◽

Data Set ◽

The Status ◽

Microbial Strains ◽

Almost All ◽

Gut Microbial Community

Background: Oral drugs can have side effects such as diarrhea that indicate the perturbation of the gut microbial community. To further understand the dynamics of perturbation, we have assessed the strain relatedness of samples from previously published data sets from pre and post bowel evacuation, episodes of diarrhea, and administration of oral drugs to treat diabetes and rheumatoid arthritis. Methods: We analyzed a total of five published data sets using our strain-tracking tool called Window-based Single Nucleotide Variant (SNV) Similarity (WSS) to identify related strains from the same individual. Results: Strain-tracking analysis using the first data set from 8 individuals pre and 21–50 days post iso-osmotic bowel wash revealed almost all microbial strains were related in an individual between pre and post samples. Similarly, in a second study, strain-tracking analysis of 4 individuals pre and post sporadic diarrhea revealed the majority of strains were related over time (up to 44 weeks). In contrast, the analysis of a third data set from 22 individuals pre and post 3-day exposure of oral metformin revealed that no individuals had a related strain. In a fourth study, the data set taken at 2 and 4 months from 38 individuals on placebo or metformin revealed individual specific sharing of pre and post strains. Finally, the data set from 18 individuals with rheumatoid arthritis given disease-modifying antirheumatic drugs methotrexate or glycosides of the traditional Chinese medicinal component Tripterygium wilfordii showed individual specific sharing of pre and post strains up to 16 months. Conclusion: Oral drugs used to treat chronic disease can result in individual specific microbial strain change for the majority of species. Since the gut community provides essential functions for the host, our study supports personalized monitoring to assess the status of the dominant microbial strains after initiation of oral drugs to treat chronic disease.


A Day in the Life of the AMU– The Society for Acute Medicine’s Benchmarking Audit 2012 (SAMBA ‘12)

Acute Medicine Journal ◽

10.52964/amja.0289 ◽

2013 ◽

Vol 12 (2) ◽

pp. 69-73

Author(s):

Christian P Subbe ◽

David Ward ◽

Lenny Latip ◽

Ivan Le Jeune ◽

...

Keyword(s):

High Reliability ◽

Published Data ◽

National Audit ◽

Data Set ◽

Acute Medicine ◽

On Line ◽

The Mean ◽

Test Feasibility ◽

Almost All ◽

Summary Data

Background: The absence of published data for benchmarking serves as a disincentive for Acute Medical Units to improve care. Aim: To test the feasibility of a national audit in Acute Medicine for compliance with common standards. Methods: On-line questionnaire with summary data for patients admitted to participating Acute Medicine Units over a 24-hour period. Results: 30 units submitted summary data. The mean number of admissions was 36 (SD 14). Compliance with standards around timing of junior and senior review was highly variable. In almost all other standards only a small number of units achieved high reliability with compliance of more than 90%. Conclusion: SAMBA provides a data set that can be used for local and national benchmarking and quality improvement work. Annual audit might be beneficial to track improvements.


Data Compatibility Issues: How to Prevent Miscoding and Dropped Observations When Using U.S. Office of Personnel Management Data Sets

Review of Public Personnel Administration ◽

10.1177/0734371x20904998 ◽

2020 ◽

Vol 40 (4) ◽

pp. 743-753

Author(s):

Ashley M. Alteri

Keyword(s):

Personnel Management ◽

Data Sets ◽

Combine Data ◽

Data Set ◽

Federal Employee ◽

Critical Comparison ◽

Multiple Data ◽

Office Of Personnel Management ◽

Multiple Data Sets ◽

Almost All

A critical comparison of the agency identifier codes in the Federal Employee Viewpoint Survey (FEVS) and FedScope data sets reveals that three distinct types of issues occur when researchers attempt to merge the data sets: (a) a single agency is assigned different codes across data sets; (b) a single code is assigned to different agencies across data sets; and (c) a single code is assigned to two or more agencies in the FEVS data set and a separate agency in the FedScope data set. Between 2013 and 2016, these issues are present in almost all major federal departments. Compatibility issues between the agency identifiers could cause the user to drop observations unnecessarily or unknowingly combine two different agencies’ data improperly. If uncorrected, these issues will distort the analysis of studies that rely on this combination of data. However, researchers can correct for this issue and still use Office of Personnel Management (OPM) identifiers to combine data across multiple data sets.
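Issues (a) and (b) can be checked mechanically before a merge. The sketch below is only an illustration: it assumes each data set's identifiers have been reduced to a code-to-agency-name dictionary, and the codes and agency names shown are invented placeholders, not real FEVS or FedScope values. Issue (c) needs a one-to-many mapping and is not covered.

```python
def compare_code_tables(left, right):
    """Compare two {code: agency_name} tables from different data sets.

    Returns (conflicting_codes, recoded_agencies):
    - conflicting_codes: codes present in both tables but naming different
      agencies, which would silently combine two agencies on merge;
    - recoded_agencies: agencies present in both tables under different
      codes, which would be dropped by a merge on code.
    """
    conflicting_codes = sorted(
        code for code in left.keys() & right.keys() if left[code] != right[code]
    )
    left_by_name = {name: code for code, name in left.items()}
    right_by_name = {name: code for code, name in right.items()}
    recoded_agencies = sorted(
        name
        for name in left_by_name.keys() & right_by_name.keys()
        if left_by_name[name] != right_by_name[name]
    )
    return conflicting_codes, recoded_agencies


# Invented placeholder tables, not real identifiers.
survey_codes = {"AA": "Agency One", "BB": "Agency Two", "CC": "Agency Three"}
staffing_codes = {"AA": "Agency One", "BX": "Agency Two", "CC": "Agency Four"}
conflicting_codes, recoded_agencies = compare_code_tables(survey_codes, staffing_codes)
```

Here "CC" is flagged as one code naming two agencies, and "Agency Two" is flagged as one agency carrying two codes.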


Quality evaluation methods for wastewater treatment plant data

Water Science & Technology ◽

10.2166/wst.2008.151 ◽

2008 ◽

Vol 57 (10) ◽

pp. 1601-1609 ◽

Cited By ~ 11

Author(s):

M. Thomann

Keyword(s):

Treatment Plant ◽

Control Process ◽

Reference Value ◽

Systematic Errors ◽

Graphical Analysis ◽

Data Sets ◽

Data Set ◽

Significance Level ◽

On Line ◽

New Treatment

Unidentified systematic errors in data sets can cause severe problems, leading to wrong decisions in function control, process modelling or planning of new treatment infrastructure. In this paper statistical methods are shown to identify systematic errors in full-scale WWTP data sets. With a redundant mass balance approach analyzing five different mass balances, systematic errors of about 10%–20% compared to the input fluxes can be identified at a 5% significance level. A Shewhart control-chart approach to survey the data quality of on-line sensors allows a statistical as well as a fast graphical analysis of the measurement process. A 19-month data set indicates that NO3, PO4 and NH4 on-line analyzers in the filter effluent and MLSS sensors in the aeration tanks were not disturbed by any systematic error for 85–95% of the measuring time. The in-control interval (±3·standard deviation) has a width of ±12–17% (NO3-N), ±35–40% (PO4-P), ±83% (NH4-N) and ±12–15% (TS) of the measured reference value.
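The ±3·standard deviation in-control interval mentioned above follows the standard Shewhart construction. The sketch below shows that construction only; the variable names and the sample values (ratios of an on-line sensor reading to a laboratory reference value) are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def control_limits(reference, k=3.0):
    """Return (lower, centre, upper) limits as mean ± k sample standard deviations."""
    centre = mean(reference)
    spread = stdev(reference)
    return centre - k * spread, centre, centre + k * spread


def out_of_control(values, lower, upper):
    """Return the indices of values that fall outside the control limits."""
    return [i for i, value in enumerate(values) if value < lower or value > upper]


# Hypothetical in-control period: sensor reading divided by reference value.
reference = [0.98, 1.02, 1.00, 0.97, 1.03, 1.01, 0.99, 1.00]
lower, centre, upper = control_limits(reference)

# Hypothetical later checks; the third one drifts well above the upper limit.
new_values = [1.01, 0.96, 1.25, 1.00]
flagged = out_of_control(new_values, lower, upper)
```

With these numbers the limits are roughly 0.94 and 1.06, so only the third check is flagged.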


CorPhU: an algorithm based on phase closure for the correction of unwrapping errors in SAR interferometry

Geophysical Journal International ◽

10.1093/gji/ggaa120 ◽

2020 ◽

Vol 221 (3) ◽

pp. 1959-1970 ◽

Cited By ~ 2

Author(s):

Angélique Benoit ◽

Béatrice Pinel-Puysségur ◽

Romain Jolivet ◽

Cécile Lasserre

Keyword(s):

Time Series ◽

Large Data ◽

Large Data Sets ◽

Sar Interferometry ◽

Data Sets ◽

Data Set ◽

Surface Displacements ◽

Interferometric Phase ◽

Almost All ◽

Central Turkey

SUMMARY: Interferometric Synthetic Aperture Radar (InSAR) is commonly used in Earth Sciences to study surface displacements or construct high resolution topographic maps. Recent satellites such as those of the Sentinel-1 constellation allow the derivation of dense deformation maps with millimetric precision and high revisit frequency. However, InSAR is still limited by interferometric coherence. Interferometric phase noise resulting from a loss of coherence, due to changes in scattering properties between repeated SAR acquisitions, may lead to unwrapping errors, which in turn lead to centimetric errors in time-series reconstruction. We present an algorithm based on interferometric phase closure to automatically correct unwrapping errors. We describe the algorithm and highlight its performance with two case studies, in Lebanon with Envisat satellite data and in Central Turkey with Sentinel-1 data. The first data set is particularly affected by unwrapping errors because of long spatial (500 m) and temporal baseline interferograms (6 yr) and decorrelation due, in particular, to vegetation. The second data set contains unwrapping errors because of temporal changes in the scattering properties of the ground. For these two examples, the algorithm allows the correction of almost all detectable unwrapping errors, without requiring visual inspection or manual deletions. Our algorithm is especially efficient on large data sets, such as those from the Sentinel-1 constellation, where the interferometric phase is redundant, and ultimately improves the reconstruction of time-series.
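The phase-closure idea can be illustrated for a single pixel and three acquisitions. For unwrapped interferograms between acquisitions 1-2, 2-3 and 1-3, the sum φ12 + φ23 − φ13 should be close to zero, and an unwrapping error in one of them shifts it by an integer multiple of 2π. The sketch below shows only this detection step under that assumption; it is not the CorPhU correction procedure, whose details the abstract does not give, and the function name and phase values are hypothetical.

```python
import math


def closure_cycles(phi_12, phi_23, phi_13):
    """Return the phase-closure misfit of a triplet as a whole number of 2*pi cycles.

    Inputs are unwrapped phases in radians for one pixel. A result of 0 means
    the triplet is consistent; a non-zero result signals an unwrapping error
    in at least one of the three interferograms.
    """
    closure = phi_12 + phi_23 - phi_13
    return round(closure / (2 * math.pi))


# Hypothetical consistent triplet: phi_12 + phi_23 equals phi_13.
consistent = closure_cycles(1.0, 2.5, 3.5)

# Same triplet with one extra 2*pi cycle wrongly added to phi_23.
jumped = closure_cycles(1.0, 2.5 + 2 * math.pi, 3.5)
```

A single triplet shows that an error exists but not which interferogram carries it, which is why redundancy across many triplets matters for correction.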


Confidentiality of Statistical Records: A Threat-Monitoring Scheme for On Line Dialogue

Methods of Information in Medicine ◽

10.1055/s-0038-1635718 ◽

1976 ◽

Vol 15 (01) ◽

pp. 36-42 ◽

Cited By ~ 14

Author(s):

J. Schlörer

Keyword(s):

Statistical Data ◽

Cost Benefit ◽

Data Bank ◽

High Ratio ◽

Point Of View ◽

Data Sets ◽

Monitoring Scheme ◽

Access Controls ◽

On Line ◽

Bona Fide

From a statistical data bank containing only anonymous records, the records sometimes may be identified and then retrieved, as personal records, by on line dialogue. The risk mainly applies to statistical data sets representing populations, or samples with a high ratio n/N. On the other hand, access controls are unsatisfactory as a general means of protection for statistical data banks, which should be open to large user communities. A threat monitoring scheme is proposed, which will largely block the techniques for retrieval of complete records. If combined with additional measures (e.g., slight modifications of output), it may be expected to render, from a cost-benefit point of view, intrusion attempts by dialogue valueless, if not absolutely impossible. The bona fide user has to pay by some loss of information, but considerable flexibility in evaluation is retained. The proposal of controlled classification included in the scheme may also be useful for off line dialogue systems.


The social wasp Vespula germanica (Fabricius) (Hymenoptera: Vespidae) population dynamics in England over 39 years.

The Entomologist's Monthly Magazine ◽

10.31184/m00138908.1542.3906 ◽

2018 ◽

Vol 154 (2) ◽

pp. 149-155

Author(s):

Michael Archer

Keyword(s):

Population Dynamics ◽

Population Dynamic ◽

Ecological Factors ◽

Social Wasp ◽

Data Sets ◽

Data Set ◽

Vespula Germanica ◽

The Social ◽

Minimum Number ◽

Suction Traps

1. Yearly records of worker Vespula germanica (Fabricius) taken in suction traps at Silwood Park (28 years) and at Rothamsted Research (39 years) are examined. 2. Using the autocorrelation function (ACF), a significant negative 1-year lag followed by a lesser non-significant positive 2-year lag was found in all, or parts of, each data set, indicating an underlying population dynamic of a 2-year cycle with a damped waveform. 3. The minimum number of years needed before the 2-year cycle with damped waveform became apparent varied between 17 and 26; in some data sets the cycle was not found. 4. Ecological factors delaying or preventing the occurrence of the 2-year cycle are considered.
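The lag pattern described in point 2 can be reproduced with the standard sample autocorrelation function. The sketch below is illustrative only: the yearly counts are invented to show a damped 2-year alternation and are not the Silwood Park or Rothamsted records, and significance testing is left out.

```python
from statistics import mean


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    centre = mean(series)
    deviations = [value - centre for value in series]
    denominator = sum(d * d for d in deviations)
    numerator = sum(
        deviations[t] * deviations[t + lag] for t in range(len(series) - lag)
    )
    return numerator / denominator


# Invented yearly worker counts alternating high and low with shrinking swings.
counts = [120, 40, 110, 50, 100, 60, 95, 65, 90, 70]
r1 = autocorrelation(counts, 1)  # 1-year lag
r2 = autocorrelation(counts, 2)  # 2-year lag
```

For this invented series the 1-year lag is strongly negative and the 2-year lag is positive but smaller in magnitude, the signature the abstract associates with a damped 2-year cycle.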
