Automatic Generation of the Draft Procuratorial Suggestions Based on an Extractive Summarization Method: BERTSLCA

2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Yufeng Sun ◽  
Fengbao Yang ◽  
Xiaoxia Wang ◽  
Hongsong Dong

The automatic generation of draft procuratorial suggestions involves extracting the description of illegal facts, administrative omissions, applicable laws and regulations, and other information from case documents. Existing deep learning methods have mainly relied on context-free word embeddings when addressing legal domain-specific extractive summarization tasks; such embeddings cannot capture the semantics of the text well, which in turn hurts summarization performance. To this end, we propose BERTSLCA, a novel method based on deep contextualized embeddings, for the extractive summarization task. The model builds on the BERT variant BERTSUM. First, the input document is fed into BERTSUM to obtain sentence-level embeddings. Then, we design an extraction architecture that captures long-range dependencies between sentences using a Bidirectional Long Short-Term Memory (Bi-LSTM) unit; at the end of the architecture, three cascaded convolution kernels with different sizes extract relationships between adjacent sentences. Finally, we introduce an attention mechanism to strengthen the model's ability to distinguish the importance of different sentences. To the best of our knowledge, this is the first work to use a pretrained language model for extractive summarization in the field of Chinese judicial litigation. Experimental results on public interest litigation data and the CAIL 2020 dataset demonstrate that the proposed method achieves competitive performance.
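The scoring head described above (sentence embeddings, convolutions over adjacent sentences, then attention to rank sentences) can be sketched in NumPy. This is a minimal illustration with random weights and toy embeddings, not the authors' BERTSLCA implementation; all dimensions and the residual wiring are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_same(x, k):
    """1-D convolution over the sentence axis with zero padding ("same" length),
    using a random kernel as a stand-in for learned weights."""
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    w = rng.normal(size=(k, x.shape[1])) / np.sqrt(k * x.shape[1])
    return np.array([(xp[i:i + k] * w).sum(axis=0) for i in range(x.shape[0])])

def attention_scores(h):
    """Additive-style attention: one importance score per sentence, softmax-normalised."""
    q = rng.normal(size=h.shape[1])
    logits = np.tanh(h) @ q
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy "sentence embeddings" for a 6-sentence document (stand-in for BERTSUM output).
sents = rng.normal(size=(6, 16))
h = sents
for k in (2, 3, 4):            # three cascaded kernels of different sizes
    h = h + conv_same(h, k)    # residual connection keeps shapes aligned (an assumption)
weights = attention_scores(h)
top2 = np.argsort(weights)[::-1][:2]   # extract the two highest-scoring sentences
```

In the real model the convolution weights are learned and the sentence representations come from a Bi-LSTM over BERTSUM embeddings; the sketch only shows how the pieces compose.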

2021 ◽  
Author(s):  
Yunjian Qiu ◽  
Yan Jin

Abstract In this study, extractive summarization using sentence embeddings generated by finetuned BERT (Bidirectional Encoder Representations from Transformers) models and the K-Means clustering method is investigated. To show how the BERT model can capture knowledge in specific domains like engineering design, and what it can produce after being finetuned on domain-specific datasets, several BERT models are trained, and the sentence embeddings extracted from the finetuned models are used to generate summaries of a set of papers. Different evaluation methods are then applied to measure the quality of the summarization results. Both an automatic evaluation method, Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and a statistical evaluation method are used for the comparison study. The results indicate that a BERT model finetuned with a larger dataset can generate summaries with more domain terminology than the pretrained BERT model. Moreover, the summaries generated by BERT models have more content overlapping with the original documents than those obtained through other popular non-BERT-based models. It can be concluded that the contextualized representations generated by BERT-based models can capture information in text and perform better in applications like text summarization after being trained on domain-specific datasets.
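The embedding-plus-clustering pipeline above is straightforward to sketch: cluster sentence embeddings with K-Means, then pick the sentence closest to each centroid as the summary. The sketch below uses random vectors in place of finetuned-BERT embeddings and a plain NumPy K-Means; it illustrates the selection logic, not the paper's exact setup.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means: returns centroids and a cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, labels

def summarize(embeddings, k):
    """For each cluster, pick the sentence whose embedding is closest to the centroid."""
    centroids, _ = kmeans(embeddings, k)
    picks = [int(np.argmin(((embeddings - c) ** 2).sum(-1))) for c in centroids]
    return sorted(set(picks))   # keep original document order

emb = np.random.default_rng(1).normal(size=(10, 8))  # stand-in for BERT sentence embeddings
summary_idx = summarize(emb, k=3)                    # indices of extracted sentences
```

With real embeddings, `summary_idx` indexes into the document's sentence list; the ROUGE comparison in the paper would then score those extracted sentences against reference summaries.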


Information ◽  
2021 ◽  
Vol 12 (8) ◽  
pp. 314
Author(s):  
Rashel Fam ◽  
Yves Lepage

In this paper, we inspect the theoretical problem of counting the number of analogies between sentences contained in a text. Based on this, we measure the analogical density of the text. We focus on analogy at the sentence level, based on form rather than semantics. Experiments are carried out on two different corpora in six European languages known to have various levels of morphological richness. The corpora are tokenised using several tokenisation schemes: character, sub-word, and word. For the sub-word tokenisation scheme, we employ two popular sub-word models: the unigram language model and byte-pair encoding. The results show that corpora with a higher Type-Token Ratio tend to have higher analogical density. We also observe that masking tokens based on their frequency helps to increase analogical density. As for the tokenisation scheme, the results show that analogical density decreases from the character level to the word level. However, this does not hold when tokens are masked based on their frequencies. We find that tokenising the sentences using sub-word models and masking the least frequent tokens increases analogical density.
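Two of the quantities above are simple to make concrete: the Type-Token Ratio (distinct token types divided by total tokens) and frequency-based masking (replacing rare tokens with a placeholder). A minimal sketch, with an invented toy sentence and an assumed `<unk>` placeholder:

```python
from collections import Counter

def type_token_ratio(tokens):
    """TTR = number of distinct token types / total number of tokens."""
    return len(set(tokens)) / len(tokens)

def mask_least_frequent(tokens, keep_top):
    """Replace every token outside the keep_top most frequent types with <unk>."""
    vocab = {t for t, _ in Counter(tokens).most_common(keep_top)}
    return [t if t in vocab else "<unk>" for t in tokens]

toks = "the cat sat on the mat and the dog sat too".split()
ttr_before = type_token_ratio(toks)          # 8 types / 11 tokens
masked = mask_least_frequent(toks, keep_top=3)
```

Masking collapses rare types into one symbol, which makes more token sequences formally identical; this is the mechanism by which masking can raise the count of form-level analogies.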


Author(s):  
Yufei Li ◽  
Xiaoyong Ma ◽  
Xiangyu Zhou ◽  
Pengzhen Cheng ◽  
Kai He ◽  
...  

Abstract Motivation Bio-entity coreference resolution focuses on identifying coreferential links in biomedical texts, which is crucial for completing bio-events' attributes and interconnecting events into bio-networks. Previously, deep neural network-based general-domain systems, among the most powerful tools available, have been applied to the biomedical domain with domain-specific information integrated. However, such methods may introduce considerable noise because they combine context and complex domain-specific information insufficiently. Results In this paper, we explore how to leverage an external knowledge base in a fine-grained way to better resolve coreference, by introducing a knowledge-enhanced Long Short-Term Memory network (LSTM) that is more flexible in encoding the knowledge information inside the LSTM. Moreover, we propose a knowledge attention module to extract informative knowledge effectively based on contexts. The experimental results on the BioNLP and CRAFT datasets achieve state-of-the-art performance, with a gain of 7.5 F1 on BioNLP and 10.6 F1 on CRAFT. Additional experiments also demonstrate superior performance on cross-sentence coreferences. Supplementary information Supplementary data are available at Bioinformatics online.
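The knowledge attention module described above scores candidate knowledge-base entries against the current context and pools them into one vector. A minimal dot-product-attention sketch in NumPy, with random vectors standing in for the LSTM hidden state and knowledge embeddings (the pooling choice is an assumption, not the paper's exact formulation):

```python
import numpy as np

def knowledge_attention(context, knowledge):
    """Score each knowledge embedding against the context vector,
    softmax-normalise the scores, and return the weighted summary."""
    logits = knowledge @ context              # dot-product relevance per entry
    e = np.exp(logits - logits.max())
    weights = e / e.sum()
    return weights, weights @ knowledge       # convex combination of entries

rng = np.random.default_rng(0)
context = rng.normal(size=8)                  # hypothetical LSTM hidden state for a mention
knowledge = rng.normal(size=(5, 8))           # 5 candidate knowledge-base entries
w, summary = knowledge_attention(context, knowledge)
```

The summary vector can then be fed back into the LSTM cell, which is what makes the knowledge integration context-dependent rather than static.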


Author(s):  
Kelvin Guu ◽  
Tatsunori B. Hashimoto ◽  
Yonatan Oren ◽  
Percy Liang

We propose a new generative language model for sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional language models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.
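The generative process above has two stages: sample a prototype sentence from the corpus, then transform it via an edit. In the model the edit is a sampled latent vector decoded by a neural editor; the sketch below replaces that with a hypothetical one-word substitution table purely to show the two-stage structure (corpus and table are invented).

```python
import random

corpus = [
    "the service was quick and friendly",
    "the food was cold and bland",
    "great atmosphere and helpful staff",
]
# Toy stand-in for the learned edit: a fixed word-substitution table.
substitutions = {"quick": "slow", "cold": "hot", "great": "cozy"}

def prototype_then_edit(seed=0):
    """Stage 1: sample a prototype from the training corpus.
    Stage 2: 'edit' it into a new sentence (here, a single substitution)."""
    rng = random.Random(seed)
    proto = rng.choice(corpus)
    edited = " ".join(substitutions.get(w, w) for w in proto.split())
    return proto, edited

proto, edited = prototype_then_edit()
```

Because generation starts from a fluent prototype, the model only has to learn small transformations, which is the intuition behind the perplexity and quality gains reported above.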


Author(s):  
Maja Radović ◽  
Nenad Petrović ◽  
Milorad Tošić

The requirements of state-of-the-art curricula and teaching processes in medical education have brought about both new assessment methods and improvements to existing ones. Recently, several promising methods have emerged, among them the Comprehensive Integrative Puzzle (CIP), which shows great potential. However, constructing such questions requires considerable effort from a team of experts and is time-consuming. Furthermore, despite the fact that English is accepted as an international language, for educational purposes there is also a need to represent data and knowledge in the native language. In this paper, we present an approach for the automatic generation of CIP assessment questions that uses ontologies for knowledge representation. In this way, multilingual support in the teaching and learning process becomes possible, because the same ontological concept can be mapped to corresponding language expressions in different languages. The proposed approach shows promising results, indicated by a dramatic speed-up in the construction of CIP questions compared to manual methods. These results strongly indicate that adopting ontologies for knowledge representation may enable scalability in multilingual domain-specific education regardless of the language used. The high level of automation in the assessment process, proven on the CIP method in medical education, one of the most challenging domains, promises high potential for new, innovative teaching methodologies in other educational domains as well.
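The key idea, language-independent concept identifiers with per-language labels, can be sketched with a toy in-memory "ontology". The concept IDs, labels, and the single grid row below are invented for illustration; a real system would query an actual ontology (e.g. via SPARQL) rather than a dict.

```python
# Toy ontology: language-independent concept IDs -> per-language labels.
ontology = {
    "c:myocardial_infarction": {"en": "myocardial infarction", "sr": "infarkt miokarda"},
    "c:chest_pain":            {"en": "chest pain",            "sr": "bol u grudima"},
    "c:troponin":              {"en": "elevated troponin",     "sr": "povišen troponin"},
}
# Diagnosis -> (symptom, lab finding): one row of a CIP-style matching grid.
grid = {"c:myocardial_infarction": ("c:chest_pain", "c:troponin")}

def render_cip(lang):
    """Render the same matching grid in the requested language."""
    return [
        {
            "diagnosis": ontology[diag][lang],
            "symptom": ontology[symptom][lang],
            "lab": ontology[lab][lang],
        }
        for diag, (symptom, lab) in grid.items()
    ]

en = render_cip("en")   # English rendering of the CIP row
sr = render_cip("sr")   # Serbian rendering of the same concepts
```

Because the grid is built over concept IDs, adding a language means adding labels, not re-authoring questions, which is where the claimed multilingual scalability comes from.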


2018 ◽  
Author(s):  
Andre Lamurias ◽  
Luka A. Clarke ◽  
Francisco M. Couto

Abstract Recent studies have proposed deep learning techniques, namely recurrent neural networks, to improve biomedical text mining tasks. However, these techniques rarely take advantage of existing domain-specific resources, such as ontologies. In the Life and Health Sciences there is a vast and valuable set of such resources publicly available, which are continuously being updated. Biomedical ontologies are nowadays a mainstream approach to formalizing existing knowledge about entities such as genes, chemicals, phenotypes, and disorders. These resources contain supplementary information that may not yet be encoded in training data, particularly in domains with limited labeled data. We propose a new model, BO-LSTM, that takes advantage of domain-specific ontologies by representing each entity as the sequence of its ancestors in the ontology. We implemented BO-LSTM as a recurrent neural network with long short-term memory units, using an open biomedical ontology, which in our case study was Chemical Entities of Biological Interest (ChEBI). We assessed the performance of BO-LSTM on detecting and classifying drug-drug interactions in a publicly available corpus from an international challenge, composed of 792 drug descriptions and 233 scientific abstracts. By using the domain-specific ontology in addition to word embeddings and WordNet, BO-LSTM improved the F1-score of both the detection and the classification of drug-drug interactions, particularly in a document set with a limited number of annotations. Our findings demonstrate that, alongside the high performance of current deep learning techniques, domain-specific ontologies can still be useful to mitigate the lack of labeled data.
Author summary A high quantity of biomedical information is only available in documents such as scientific articles and patents. Due to the rate at which new documents are produced, we need automatic methods to extract useful information from them. Text mining is a subfield of information retrieval which aims at extracting relevant information from text. Scientific literature is a challenge to text mining because of the complexity and specificity of the topics approached. In recent years, deep learning has obtained promising results in various text mining tasks by exploring large datasets. On the other hand, ontologies provide a detailed and sound representation of a domain and have been developed for diverse biomedical domains. We propose a model that combines deep learning algorithms with biomedical ontologies to identify relations between concepts in text. We demonstrate the potential of this model to extract drug-drug interactions from abstracts and drug descriptions. This model can be applied to other biomedical domains, using an annotated corpus of documents and an ontology related to that domain to train a new classifier.
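The core representation idea, an entity encoded as the sequence of its ancestors in the ontology, is easy to make concrete. The parent links below are a tiny invented fragment in the spirit of ChEBI, not actual ChEBI relations; in BO-LSTM the resulting sequence is fed to an LSTM rather than returned as a list.

```python
# Invented parent links (illustrative only; real ancestors come from ChEBI).
parents = {
    "aspirin": "benzoic acid",
    "benzoic acid": "aromatic compound",
    "aromatic compound": "chemical entity",
}

def ancestor_sequence(entity):
    """Return the entity followed by its ancestors, most specific first, root last."""
    chain = [entity]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

seq = ancestor_sequence("aspirin")
```

Two drugs that are rare in the training data may still share high-level ancestors, which is how the ontology mitigates limited labeled data: the model can generalize over the shared part of the ancestor sequences.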


2020 ◽  
Vol 19 (2) ◽  
pp. 1:1
Author(s):  
Manuel Leduc ◽  
Gwendal Jouneaux ◽  
Thomas Degueule ◽  
Gurvan Le Guernic ◽  
Olivier Barais ◽  
...  

2021 ◽  
Author(s):  
Conor Wild ◽  
Loretta Norton ◽  
David Menon ◽  
David Ripsman ◽  
Richard Swartz ◽  
...  

Abstract As COVID-19 cases exceed hundreds of millions globally, it is clear that many survivors face cognitive challenges and prolonged symptoms. However, important questions about the cognitive impacts of COVID-19 remain unresolved. In the present online study, 485 volunteers who reported having had a confirmed positive COVID-19 test completed a comprehensive cognitive battery and an extensive questionnaire. This group performed significantly worse than pre-pandemic controls on cognitive measures of reasoning, verbal ability, overall performance, and processing speed, but not short-term memory, suggesting domain-specific deficits. We identified two distinct factors underlying the health measures: one varying with physical symptoms and illness severity, and one with mental health. Crucially, cognitive deficits were correlated with physical symptoms, but not mental health, and were evident even in cases that did not require hospitalisation. These findings suggest that the subjective experience of "long COVID" or "brain fog" relates to a combination of physical symptoms and cognitive deficits.

