WinoGrande

2021 ◽  
Vol 64 (9) ◽  
pp. 99-106
Author(s):  
Keisuke Sakaguchi ◽  
Ronan Le Bras ◽  
Chandra Bhagavatula ◽  
Yejin Choi

Commonsense reasoning remains a major challenge in AI, and yet recent progress on benchmarks may seem to suggest otherwise. In particular, recent neural language models have reported above 90% accuracy on the Winograd Schema Challenge (WSC), a commonsense benchmark originally designed to be unsolvable for statistical models that rely simply on word associations. This raises an important question: have these models truly acquired robust commonsense capabilities, or do they rely on spurious biases in the dataset that lead to an overestimation of the true capabilities of machine commonsense? To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) large-scale crowdsourcing, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. Our experiments demonstrate that state-of-the-art models achieve considerably lower accuracy (59.4%-79.1%) on WinoGrande compared to humans (94%), confirming that the high performance on the original WSC was inflated by spurious biases in the dataset. Furthermore, we report new state-of-the-art results on five related benchmarks with emphasis on their dual implications. On the one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, the high performance on all these benchmarks suggests the extent to which spurious biases are prevalent in all such datasets, which motivates further research on algorithmic bias reduction.
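The abstract above does not spell out how AfLite operates. As a rough illustration only, the sketch below shows the general adversarial-filtering idea it describes: repeatedly train weak linear probes on precomputed instance embeddings and discard the instances whose labels the probes find most predictable. The function name, thresholds, and iteration counts are illustrative assumptions, not the published algorithm's settings.

```python
# A minimal sketch of AfLite-style filtering (not the authors' code): instances
# whose labels are predictable from precomputed embeddings by many weak linear
# probes are treated as biased and removed. Thresholds and counts are
# illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite_filter(embeddings, labels, n_models=64, train_frac=0.5,
                  predictability_cutoff=0.75, n_remove_per_iter=500, n_iters=4):
    keep = np.arange(len(labels))
    for _ in range(n_iters):
        correct = np.zeros(len(keep))
        counted = np.zeros(len(keep))
        for _ in range(n_models):
            # Train a weak linear probe on a random split of the surviving data.
            idx = np.random.permutation(len(keep))
            cut = int(train_frac * len(keep))
            train, test = idx[:cut], idx[cut:]
            clf = LogisticRegression(max_iter=200)
            clf.fit(embeddings[keep[train]], labels[keep[train]])
            preds = clf.predict(embeddings[keep[test]])
            correct[test] += (preds == labels[keep[test]])
            counted[test] += 1
        # Predictability = fraction of held-out probes that got the instance right.
        predictability = np.divide(correct, np.maximum(counted, 1))
        ranked = np.argsort(-predictability)
        to_drop = [i for i in ranked[:n_remove_per_iter]
                   if predictability[i] > predictability_cutoff]
        keep = np.delete(keep, to_drop)
    return keep  # indices of the debiased subset
```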

2020 ◽  
Vol 34 (05) ◽  
pp. 8732-8740 ◽  
Author(s):  
Keisuke Sakaguchi ◽  
Ronan Le Bras ◽  
Chandra Bhagavatula ◽  
Yejin Choi

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question of whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which is ∼15-35% (absolute) below human performance of 94.0%, depending on the amount of training data allowed (2%-100%, respectively). Furthermore, we establish new state-of-the-art results on five related benchmarks: WSC (→ 90.1%), DPR (→ 93.1%), COPA (→ 90.6%), KnowRef (→ 85.6%), and Winogender (→ 97.1%). These results have dual implications: on the one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
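As a concrete illustration of the task format behind these numbers (assuming the HuggingFace transformers library and a standard GPT-2 checkpoint), the following sketch scores a WinoGrande-style item by comparing the language-model likelihood of the sentence with each candidate filled into the blank. It is only a zero-shot baseline sketch, not the fine-tuning recipe evaluated in the paper.

```python
# Hedged sketch: pick the candidate whose filled-in sentence the language model
# finds more likely. The example item is illustrative, not drawn from the dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def score(sentence):
    # Average per-token log-likelihood under the language model.
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return -loss.item()

item = {"sentence": "The trophy doesn't fit in the suitcase because the _ is too big.",
        "option1": "trophy", "option2": "suitcase"}
candidates = [item["sentence"].replace("_", item[k]) for k in ("option1", "option2")]
prediction = max(candidates, key=score)
print(prediction)
```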


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Mehdi Srifi ◽  
Ahmed Oussous ◽  
Ayoub Ait Lahcen ◽  
Salma Mouline

Abstract: Various recommender systems (RSs) have been developed over recent years, and many of them have concentrated on English content. Thus, the majority of RSs from the literature have been compared on English content. However, research investigating RSs on content in other languages, such as Arabic, is minimal, and the field of Arabic RSs remains largely neglected. Therefore, through this study we aim to fill this research gap by leveraging recent advances in the English RSs field. Our main goal is to investigate recent RSs in an Arabic context. To that end, we first selected five state-of-the-art RSs originally devoted to English content, and then empirically evaluated their performance on Arabic content. As a result of this work, we first built four publicly available large-scale Arabic datasets for recommendation purposes. Second, we provide various text preprocessing techniques for preparing the constructed datasets. Third, our investigation derived well-argued conclusions about the usage of modern RSs in the Arabic context. The experimental results show that these systems achieve high performance when applied to Arabic content.
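The abstract does not enumerate the preprocessing pipeline used. The snippet below is a hedged sketch of normalization steps commonly applied to Arabic text (diacritic and tatweel removal, unifying alef/ya/ta-marbuta variants); it illustrates typical practice, not necessarily the authors' exact procedure.

```python
# Common Arabic normalization steps (assumed, for illustration only).
import re

DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u0652\u0670]")
TATWEEL = "\u0640"

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)        # strip harakat / diacritics
    text = text.replace(TATWEEL, "")       # strip elongation character
    text = re.sub("[إأآا]", "ا", text)      # unify alef variants
    text = re.sub("ى", "ي", text)           # unify alef maqsura with ya
    text = re.sub("ة", "ه", text)           # unify ta marbuta with ha
    return re.sub(r"\s+", " ", text).strip()
```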


2020 ◽  
Vol 34 (05) ◽  
pp. 7554-7561
Author(s):  
Pengxiang Cheng ◽  
Katrin Erk

Recent progress in NLP has witnessed the development of large-scale pre-trained language models (GPT, BERT, XLNet, etc.) based on the Transformer (Vaswani et al. 2017), and on a range of end tasks such models have achieved state-of-the-art results, approaching human performance. This clearly demonstrates the power of the stacked self-attention architecture when paired with a sufficient number of layers and a large amount of pre-training data. However, on tasks that require complex and long-distance reasoning, where surface-level cues are not enough, there is still a large gap between the pre-trained models and human performance. Strubell et al. (2018) recently showed that it is possible to inject knowledge of syntactic structure into a model through supervised self-attention. We conjecture that a similar injection of semantic knowledge, in particular coreference information, into an existing model would improve performance on such complex problems. On the LAMBADA (Paperno et al. 2016) task, we show that a model trained from scratch with coreference as auxiliary supervision for self-attention outperforms the largest GPT-2 model, setting a new state of the art, while containing only a tiny fraction of the parameters of GPT-2. We also conduct a thorough analysis of different variants of model architectures and supervision configurations, suggesting future directions for applying similar techniques to other problems.
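A minimal PyTorch sketch of the general idea of coreference as auxiliary supervision for self-attention follows: one attention head's distribution is pushed toward attending from each mention to its annotated antecedent via an extra loss term. Tensor names and the loss weighting are assumptions for illustration; this is not the paper's exact architecture.

```python
# Auxiliary loss pushing one attention head toward gold coreference links.
import torch

def coref_attention_loss(attn_probs, antecedent_index, mention_mask):
    """
    attn_probs:       (batch, seq, seq) attention distribution of the supervised head
    antecedent_index: (batch, seq) index of the gold antecedent for each token (long)
    mention_mask:     (batch, seq) 1.0 where a token has an annotated antecedent
    """
    log_probs = torch.log(attn_probs.clamp_min(1e-9))
    # Negative log-likelihood of attending to the gold antecedent position.
    nll = -log_probs.gather(-1, antecedent_index.unsqueeze(-1)).squeeze(-1)
    return (nll * mention_mask).sum() / mention_mask.sum().clamp_min(1.0)

# total_loss = lm_loss + aux_weight * coref_attention_loss(head_probs, antecedents, mask)
```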


Materials ◽  
2019 ◽  
Vol 12 (12) ◽  
pp. 1952 ◽  
Author(s):  
Santanu Mukherjee ◽  
Shakir Bin Mujib ◽  
Davi Soares ◽  
Gurpreet Singh

Sodium ion batteries (SIBs) are being billed as an economical and environmentally friendly alternative to lithium ion batteries (LIBs), especially for medium- and large-scale stationary and grid storage. However, SIBs suffer from lower capacity, energy density, and cycle life performance. Therefore, in order to be more efficient and feasible, novel high-performance electrodes for SIBs need to be developed and researched. This review aims to provide an exhaustive discussion of the state of the art in novel high-performance anodes and cathodes currently being analyzed, and the advantages they demonstrate in critically important parameters such as electronic conductivity, structural stability, cycle life, and reversibility.


2020 ◽  
Author(s):  
Vertika Srivastava ◽  
Sudeep Kumar Sahoo ◽  
Yeon Hyang Kim ◽  
Rohit R.r ◽  
Mayank Raj ◽  
...  

2020 ◽  
Author(s):  
Shoya Wada ◽  
Toshihiro Takeda ◽  
Shiro Manabe ◽  
Shozo Konishi ◽  
Jun Kamohara ◽  
...  

Abstract
Background: Pre-training large-scale neural language models on raw text has been shown to make a significant contribution to transfer learning strategies in natural language processing (NLP). With the introduction of transformer-based language models, such as Bidirectional Encoder Representations from Transformers (BERT), the performance of information extraction from free text by NLP has improved significantly in both the general domain and the medical domain; however, for languages in which few high-quality, large-scale public medical databases are available, it is difficult to train medical BERT models that perform well.
Method: We introduce a method to train a BERT model using a small medical corpus, in both English and Japanese. Our proposed method consists of two interventions: simultaneous pre-training, which is intended to encourage masked language modeling and next-sentence prediction on the small medical corpus, and amplified vocabulary, which helps the customized byte-pair-encoding vocabulary suit the small corpus. Moreover, using whole PubMed abstracts, we developed a high-performance English BERT model via our method: Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT). We then evaluated the performance of our BERT models against publicly available baselines.
Results: We confirmed that our Japanese medical BERT outperforms conventional baselines and the other BERT models on a medical-document classification task, and that our English BERT pre-trained on both the general and medical domain corpora performs sufficiently well for practical use on the Biomedical Language Understanding Evaluation (BLUE) benchmark. Moreover, the total BLUE benchmark score of ouBioBERT is 1.1 points above that of BioBERT and 0.3 points above that of the ablation model trained without our proposed method.
Conclusions: Our proposed method makes it feasible to construct a practical medical BERT model in both Japanese and English, and it has the potential to produce higher-performing models for biomedical shared tasks.
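As a rough sketch of the "simultaneous pre-training" ingredient described above (assuming the corpora are available as lists of sentences), the generator below interleaves a large general corpus with a small, up-sampled medical corpus so that medical text is seen throughout pre-training. The sampling ratio is an illustrative assumption.

```python
# Interleave a large general corpus with a small, up-sampled medical corpus.
import random

def interleave_corpora(general_sentences, medical_sentences, medical_ratio=0.5):
    """Yield an endless stream of sentences, drawing from the small medical
    corpus with probability `medical_ratio` so it is effectively up-sampled."""
    while True:
        if random.random() < medical_ratio:
            yield random.choice(medical_sentences)
        else:
            yield random.choice(general_sentences)

# Example: feed the mixed stream into any standard masked-language-model
# data collator / trainer.
# stream = interleave_corpora(general, medical, medical_ratio=0.5)
# batch = [next(stream) for _ in range(32)]
```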


2021 ◽  
Author(s):  
Vineet Thumuluri ◽  
Hannah-Marie Martiny ◽  
Jose J. Almagro Armenteros ◽  
Jesper Salomon ◽  
Henrik Nielsen ◽  
...  

Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequence. Our model, NetSolP, is based on deep-learning protein language models called transformers, and we show that it achieves state-of-the-art performance and improves extrapolation across datasets. Because we find that current methods are built on biased datasets, we curate the existing datasets using strict sequence-identity partitioning and ensure that there is minimal bias in the sequences.
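A hedged sketch of the strict sequence-identity partitioning mentioned above: sequences are greedily clustered at an identity threshold and whole clusters are assigned to train or test, so the test set shares no close homologs with the training set. The naive identity function here is a stand-in; real pipelines use dedicated alignment or clustering tools.

```python
# Homology-aware train/test split: cluster by identity, then split by cluster.
from difflib import SequenceMatcher
import random

def identity(a: str, b: str) -> float:
    # Crude similarity proxy; real pipelines use proper sequence alignment.
    return SequenceMatcher(None, a, b).ratio()

def cluster_sequences(seqs, threshold=0.25):
    clusters = []
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def split_by_cluster(seqs, test_frac=0.2, threshold=0.25, seed=0):
    clusters = cluster_sequences(seqs, threshold)
    random.Random(seed).shuffle(clusters)
    n_test = int(test_frac * len(seqs))
    test, train = [], []
    for c in clusters:
        # Fill the test split first with whole clusters, then the train split.
        (test if len(test) < n_test else train).extend(c)
    return train, test
```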


1995 ◽  
Vol 380 ◽  
Author(s):  
R. Fabian Pease

ABSTRACT
The drive to increasingly higher density ultra-large-scale integration (ULSI) of electronic circuits is fuelled primarily by cost; on-chip interconnects are far cheaper than the less dense off-chip interconnects. At the same time, the escalating cost of an IC factory ('fab') is making headlines as it goes through $1B, and a large part of this escalation is the cost of high performance lithography tools. The lithographic technology to go below 0.1μm will almost certainly be very different from an extension of today's optical projection, and the cost of replacing today's technology will be enormous. A second drawback to higher density is the resistance of narrow interconnects. As a result, some people have suggested that this situation is analogous to that of airliner speed, which increased over a period of thirty years from about 100 mph to close to 600 mph but has not increased in the last 35 years. Still faster speed was technically possible, and hence was pursued by the military, but is uneconomical for most commercial use. Current technology might take us to 0.1μm, which will probably be the state of the art 10 years hence, so technologies for replacing optical lithography, e.g. scanned arrays of proximal probes, should be researched now. Other challenges include how to achieve useful interconnect networks employing 50 nm features.


2014 ◽  
Vol 51 ◽  
pp. 133-164 ◽  
Author(s):  
K. Woodsend ◽  
M. Lapata

Large-scale annotated corpora are a prerequisite to developing high-performance NLP systems. Such corpora are expensive to produce, limited in size, and often demand linguistic expertise. In this paper we use text rewriting as a means of increasing the amount of labeled data available for model training. Our method uses rewrite rules automatically extracted from comparable corpora and bitexts to generate multiple versions of sentences annotated with gold-standard labels. We apply this idea to semantic role labeling and show that a model trained on rewritten data outperforms the state of the art on the CoNLL-2009 benchmark dataset.
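As an illustration of the general augmentation idea (not the authors' system), the sketch below applies a lexical rewrite rule to a labeled sentence and carries the gold labels over to the rewritten tokens, yielding an additional training instance; the example rule and label scheme are assumptions.

```python
# Apply a rewrite rule to a labeled sentence and project the labels.
def apply_rewrite(tokens, labels, rule):
    """rule = (source_phrase, target_phrase), both as token lists; the target
    phrase inherits the source phrase's label."""
    src, tgt = rule
    n = len(src)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == src:
            new_tokens = tokens[:i] + tgt + tokens[i + n:]
            # Propagate the first source label across the target span.
            new_labels = labels[:i] + [labels[i]] * len(tgt) + labels[i + n:]
            return new_tokens, new_labels
    return None  # rule did not fire

tokens = ["The", "company", "purchased", "the", "factory"]
labels = ["B-A0", "I-A0", "V", "B-A1", "I-A1"]
print(apply_rewrite(tokens, labels, (["purchased"], ["bought"])))
```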


Author(s):  
Oktie Hassanzadeh ◽  
Debarun Bhattacharjya ◽  
Mark Feblowitz ◽  
Kavitha Srinivas ◽  
Michael Perrone ◽  
...  

In this paper, we study the problem of answering questions of the type "Could X cause Y?", where X and Y are general phrases without any constraints. Answering such questions will assist with various decision analysis tasks, such as verifying and extending presumed causal associations used for decision making. Our goal is to analyze the ability of an AI agent built using state-of-the-art unsupervised methods to answer causal questions derived from collections of cause-effect pairs from human experts. We focus only on unsupervised and weakly supervised methods owing to the difficulty of creating a large enough training set with reasonable quality and coverage. The methods we examine rely on a large corpus of text derived from news articles, and range from large-scale application of classic NLP techniques and statistical analysis to the use of neural-network-based phrase embeddings and state-of-the-art neural language models.
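One hedged sketch of an unsupervised strategy in this spirit: judge a candidate (X, Y) pair by its embedding-space similarity to known cause-effect pairs. The embed function is a placeholder for any phrase-embedding model, and the decision threshold is an illustrative assumption.

```python
# Score "Could X cause Y?" by similarity to known cause-effect pairs.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def could_cause(x, y, known_pairs, embed, threshold=0.6):
    """known_pairs: list of (cause, effect) phrase strings from expert sources.
    embed: placeholder callable mapping a phrase to a vector (assumed)."""
    query = np.concatenate([embed(x), embed(y)])
    scores = [cosine(query, np.concatenate([embed(c), embed(e)]))
              for c, e in known_pairs]
    best = max(scores) if scores else 0.0
    return best >= threshold, best
```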

