Finding scientific names in Biodiversity Heritage Library, or how to shrink Big Data

Author(s):  
Dmitry Mozzherin ◽  
Alexander Myltsev ◽  
David Patterson

The Biodiversity Heritage Library (BHL) contains 57 million pages of biological information. Most of this information is scanned, digitized, unstructured text. Without the addition of rich metadata, this "raw" text is hard for computers or humans to access. Recent improvements in natural language processing (NLP) and machine learning (ML) promise to facilitate the creation of such metadata. One obvious way to improve BHL usability is to extract and provide an index of scientific names, enabling biologists to find useful information more easily and quickly. The Global Names Architecture (GNA) detects, verifies, collects, and indexes scientific names from many sources. Six years ago GNA developers created an index of the scientific names in the BHL by parsing every page one by one, a job that took 45 days. Almost immediately BHL users began to find problems in the index and suggest improvements. However, the cost of repeating such a gigantic job was prohibitive, and as a result the index remained nearly unchanged for six years. Two problems were at the heart of dealing with the "Big Data" of the BHL: the time it took to transfer the raw data prior to processing, and the computational time it took to detect the names themselves.

To solve these problems we could either throw more hardware at the problem (expensive) or find ways to dramatically improve the performance of the tasks (cheaper). We decided to achieve our goal by utilizing hardware more effectively and by using fast, scalable programming languages. We wrote several Open Source applications in Go and Scala to detect candidate scientific names and then verify them as names by comparing them to the 27 million scientific name-strings aggregated by GNA. We were able to speed up data mobilization from 24 hours to 11 minutes, decrease the time for name detection from 35 days to 5 hours, and cut name-verification time from 10 days to 9 hours. Overall, our computing requirements shrank from 4 high-end servers to one modern laptop. As a result we achieved our goal, indexing the BHL in only 14 hours and unlocking the reality of iterative improvements to the scientific name index.

We also wanted to make it possible to study BHL data in its entirety, remotely and in real time. We created an HTTP/2 service that can stream gigantic amounts of BHL textual data, together with scientific names, to a researcher. Sending the text of 50 million pages with an associated 250 million name occurrences takes ~5 hours; for comparison, simply copying BHL text data from the Smithsonian Institution to the University of Illinois using more traditional methods took us 10 days.

What do we hope to achieve with these tools as next steps? To make it possible for everyone to make new discoveries by computing in real time across the complete BHL text. For example, 20% of all names in BHL are abbreviated and, as a result, poorly served by the existing full-text indexing; we plan to develop algorithms to expand abbreviated genera reliably. Digitized texts contain huge numbers of character-recognition mistakes, and the tools might help to detect badly digitized pages and mark them for re-digitization. The tools can also help to extract scientific names that are identical to "normal" words, such as "Atlanta" or "America", to find common names in texts, and to localize information on places, adding new search contexts. Finally, we are exploring tools that allow researchers to stream such results back to the source, thereby growing the "Big Data" and ultimately improving the BHL's end-user experience.
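To make the detect-then-verify idea concrete, here is a minimal Go sketch, not the GNA codebase itself: a deliberately loose binomial-shaped pattern proposes candidates, and a lookup standing in for the 27-million-string GNA index separates real names from false positives. The regex, the sample page, and the tiny verified set are illustrative assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// A deliberately loose heuristic: a capitalized word followed by a
// lowercase word, the shape of a Latin binomial. It over-matches on
// purpose; verification against known name-strings filters the noise.
var candidate = regexp.MustCompile(`\b([A-Z][a-z]{2,}) ([a-z]{3,})\b`)

// A tiny stand-in for the 27 million name-strings aggregated by GNA.
var verified = map[string]bool{"Atlanta peronii": true}

func main() {
	page := "Two specimens of Atlanta peronii were found near the coast."
	for _, m := range candidate.FindAllStringSubmatch(page, -1) {
		name := m[1] + " " + m[2]
		fmt.Printf("candidate %q verified=%v\n", name, verified[name])
	}
}
```

Note that "Two specimens" also matches the pattern; in a real pipeline it is the verification step against the aggregated name-strings that filters out such noise.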

Author(s):  
Nicolas Zhou ◽  
Erin M. Corsini ◽  
Shida Jin ◽  
Gregory R. Barbosa ◽  
Trey Kell ◽  
...  

The concept of Big Data is changing the way that clinical research can be performed. Cardiothoracic surgeons need to understand the dynamic digital transformation taking place in the healthcare industry. In the last decade, technological advances and Big Data analytics have become powerful tools for businesses. In healthcare, rapid expansion of Big Data infrastructure has occurred in parallel with attempts to reduce cost and improve outcomes. Many hospitals around the country are augmenting traditional relational databases with Big Data infrastructure. Advanced data capture and categorization tools such as natural language processing and optical character recognition are being developed for clinical and research use, while the Internet of Things, in the form of wearable technology, serves as an additional source of research data. As cardiothoracic surgeons seek ways to innovate, novel approaches to data acquisition and analysis enable a more rigorous level of investigation.


Various fields such as Text Mining, Linguistics, Decision Making, and Natural Language Processing together form the basis of Opinion Mining, or Sentiment Analysis. People share their feelings, observations, and thoughts on social media, which has emerged as a rapidly growing repository of real-time discussion. In this paper, we aim to decipher currently popular opinions and emotions from various sources, thereby contributing to the sentiment analysis domain. Text from social media, blogs, and product reviews is classified according to the sentiment it projects. We re-examine the traditional processes of sentiment extraction to accommodate the growing number and complexity of data sources and relevant topics, while refining the meaning of sentiment itself. Working across and within numerous streams of social media, the expression of sentiment and the classification of polarity are re-examined, thereby redefining and enhancing the realm of sentiment. Numerous social media streams are analyzed to build datasets that are topical for each stream and are then classified by the polarity of their sentiment expression. In sum, the motive is to define sentiment and to develop tools for analyzing it in real time as people exchange ideas.
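As a minimal illustration of polarity classification, the Go sketch below scores a post by summing word polarities from a hand-made toy lexicon; a study like this one would instead learn such weights from labelled data, so the lexicon and examples here are assumptions for demonstration only.

```go
package main

import (
	"fmt"
	"strings"
)

// A toy polarity lexicon; real systems learn weights from labelled
// data rather than hand-listing words.
var polarity = map[string]int{
	"love": 1, "great": 1, "excellent": 1,
	"hate": -1, "poor": -1, "terrible": -1,
}

// classify sums word polarities over a post and maps the total score
// to a sentiment label.
func classify(post string) string {
	score := 0
	for _, tok := range strings.Fields(strings.ToLower(post)) {
		score += polarity[strings.Trim(tok, ".,!?")]
	}
	switch {
	case score > 0:
		return "positive"
	case score < 0:
		return "negative"
	default:
		return "neutral"
	}
}

func main() {
	fmt.Println(classify("I love this phone, great battery!"))
	fmt.Println(classify("Terrible service and poor support."))
}
```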


2019 ◽  
Vol 37 (4) ◽  
pp. 433-450 ◽  
Author(s):  
Sultan Amed ◽  
Srabanti Mukherjee ◽  
Prasun Das ◽  
Biplab Datta

Purpose
The purpose of this paper is to determine the triggers of positive electronic word of mouth (eWOM) using real-time Big Data obtained from online retail sites and dedicated review sites.

Design/methodology/approach
In this study, real-time Big Data was analysed with a support vector machine (SVM) to segregate positive and negative eWOM. Thereafter, using natural language processing algorithms, the study classified the triggers of positive eWOM by their relative importance across six product categories.

Findings
The most important triggers of positive eWOM (such as product experience, product type, and product characteristics) were similar across product categories. Second-level antecedents of positive eWOM included the person(s) for whom the product is purchased, the price and source of the product, the packaging, and eagerness in patronising a brand.

Practical implications
The findings indicate that marketers active in digital forums should encourage and incentivise their satisfied consumers to disseminate positive eWOM. Consumers with a special interest in a product type (mothers or doctors for baby food, for example) may be incentivised to write positive eWOM about the product's ingredients and characteristics. Companies can launch sequels to existing television or online advertisements addressing "for whom the product is purchased".

Originality/value
This study identified the triggers of positive eWOM using real-time Big Data extracted from online purchase platforms. It also contributes to the literature by identifying which levels of triggers are most, more, and moderately important to customers when writing positive reviews online.
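At inference time, a trained linear SVM reduces to the decision function sign(w·x + b) over text features. The Go sketch below applies such a function with hypothetical unigram weights; the words, weights, and reviews are invented for illustration, not taken from the study.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical weights a linear SVM might learn for unigram features;
// positive weights pull toward positive eWOM, negative toward negative.
var weights = map[string]float64{
	"excellent": 1.2, "loved": 0.9, "fast": 0.4,
	"broken": -1.3, "refund": -0.8, "late": -0.5,
}

const bias = 0.0

// classify applies the linear decision function sign(w·x + b) to a
// bag-of-words representation of a review.
func classify(review string) string {
	score := bias
	for _, tok := range strings.Fields(strings.ToLower(review)) {
		score += weights[strings.Trim(tok, ".,!?")]
	}
	if score >= 0 {
		return "positive eWOM"
	}
	return "negative eWOM"
}

func main() {
	fmt.Println(classify("Excellent product, loved the fast delivery!"))
	fmt.Println(classify("Arrived late and broken, asking for a refund."))
}
```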


2021 ◽  
Author(s):  
Takemasa Miyoshi ◽  
Takumi Honda ◽  
Arata Amemiya ◽  
Shigenori Otsuka ◽  
Yasumitsu Maejima ◽  
...  

Japan's Big Data Assimilation (BDA) project started in October 2013 and ended its 5.5-year term in March 2019. In it, we developed a novel numerical weather prediction (NWP) system at 100-m resolution, updated every 30 seconds, for precise prediction of individual convective clouds. The system was designed to take full advantage of the phased array weather radar (PAWR), which observes reflectivity and Doppler velocity every 30 seconds for 100 elevation angles at 100-m range resolution. By the end of the 5.5-year project period, we achieved a computational time of under 30 seconds for past cases, with all input data such as boundary conditions and observations prepared in advance, using Japan's flagship K computer, whose 10-petaflops performance ranked #1 on the TOP500 list in 2011.

The direct follow-on project started in April 2019 under the Japan Science and Technology Agency (JST) AIP (Advanced Intelligence Project) Acceleration Research. We continued development toward real-time operation of this novel 30-second-update NWP system, to be demonstrated at the Tokyo 2020 Olympic and Paralympic Games. The Games were postponed, but the project achieved a real-time demonstration of the 30-second-update NWP system at 500-m resolution using the Oakforest-PACS supercomputer, operated jointly by the University of Tsukuba and the University of Tokyo. Additional developments included parameter tuning for more accurate prediction and a complete workflow to prepare all input data in real time, i.e., fast data transfer from the novel dual-polarization PAWR (MP-PAWR) at Saitama University, and real-time nested-domain forecasts at 18-km, 6-km, and 1.5-km resolution to provide lateral boundary conditions for the innermost 500-m-mesh domain.

A real-time test performed between July 31 and August 7, 2020 achieved an actual lead time of more than 27 minutes for 30-minute predictions, with very few instances of extended delay. Past-case experiments showed that the system could capture rapid intensification and decay of convective rain occurring on timescales of less than 10 minutes, changes that JMA nowcasting, by design, does not predict. This presentation will summarize the real-time demonstration between August 25 and September 7, when the Tokyo 2020 Paralympic Games were originally supposed to take place.
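The timing logic of such a 30-second-update cycle can be sketched as below. This is only a rough scheduling skeleton under stated assumptions: the real system runs full data assimilation and NWP on supercomputers, and the functions here are placeholders, not the BDA system's API.

```go
package main

import (
	"fmt"
	"time"
)

// Placeholder stages: in the real system, each 30-s cycle ingests the
// latest PAWR scan, assimilates it into the model state, and launches
// a 30-minute forecast.
func ingestScan() string             { return "pawr-scan" }
func assimilate(obs string)          {}
func forecast(horizon time.Duration) {}

func main() {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		start := time.Now()
		assimilate(ingestScan())
		forecast(30 * time.Minute)
		// Usable lead time is the forecast horizon minus the wall-clock
		// cost of the cycle; the project reports >27 min for 30-min runs.
		fmt.Printf("cycle finished in %s\n", time.Since(start))
	}
}
```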


Author(s):  
David J. Lobina

The study of cognitive phenomena is best approached in an orderly manner. It must begin with an analysis of the function in intension at the heart of any cognitive domain (its knowledge base), then proceed to the manner in which such knowledge is put into use in real-time processing, concluding with a domain’s neural underpinnings, its development in ontogeny, etc. Such an approach to the study of cognition involves the adoption of different levels of explanation/description, as prescribed by David Marr and many others, each level requiring its own methodology and supplying its own data to be accounted for. The study of recursion in cognition is badly in need of a systematic and well-ordered approach, and this chapter lays out the blueprint to be followed in the book by focusing on a strict separation between how this notion applies in linguistic knowledge and how it manifests itself in language processing.

