Supporting Book Search: A Comprehensive Comparison of Tags vs. Controlled Vocabulary Metadata

Toine Bogers; Vivien Petras

doi:10.1515/dim-2017-0004

Supporting Book Search: A Comprehensive Comparison of Tags vs. Controlled Vocabulary Metadata

Data and Information Management ◽

10.1515/dim-2017-0004 ◽

2017 ◽

Vol 1 (1) ◽

pp. 17-34 ◽

Cited By ~ 1

Author(s):

Toine Bogers ◽

Vivien Petras

Keyword(s):

Information Needs ◽

Large Scale ◽

Controlled Vocabulary ◽

Test Collection ◽

Search Performance ◽

Empirical Comparison ◽

Controlled Vocabularies ◽

Complex Information ◽

Depth Analysis ◽

Comprehensive Comparison

Abstract Book search is far from a solved problem. Complex information needs often go beyond bibliographic facts and cover a combination of different aspects, such as specific genres or plot elements, engagement or novelty. Conventional book metadata may not be sufficient to address these kinds of information needs. In this paper, we present a large-scale empirical comparison of the effectiveness of book metadata elements for searching complex information needs. Using a test collection of over 2 million book records and over 330 real-world book search requests, we perform a highly controlled and in-depth analysis of topical metadata, comparing controlled vocabularies with social tags. Tags perform better overall in this setting, but controlled vocabulary terms provide complementary information, which will improve a search. We analyze potential underlying factors that contribute to search performance, such as the relevance aspect(s) mentioned in a request or the type of book. In addition, we investigate the possible causes of search failure. We conclude that neither tags nor controlled vocabularies are wholly suited to handling the complex information needs in book search, which means that different approaches to describe topical information in books are needed.

Download Full-text

Internet of Samples: Progress report

Biodiversity Information Science and Standards ◽

10.3897/biss.5.75797 ◽

2021 ◽

Vol 5 ◽

Author(s):

Dave Vieglais ◽

Stephen Richard ◽

Hong Cui ◽

Neil Davies ◽

John Deck ◽

...

Keyword(s):

Efficient Solution ◽

The United States ◽

Controlled Vocabulary ◽

The Internet ◽

Controlled Vocabularies ◽

The Core ◽

Research Programs ◽

Depth Analysis ◽

Interdisciplinary Collaborations ◽

Material Sample

Material samples form an important portion of the data infrastructure for many disciplines. Here, a material sample is a physical object, representative of some physical thing, on which observations can be made. Material samples may be collected for one project initially, but can also be valuable resources for other studies in other disciplines. Collecting and curating material samples can be a costly process. Integrating institutionally managed sample collections, along with those sitting in individual offices or labs, is necessary to faciliate large-scale evidence-based scientific research. Many have recognized the problems and are working to make data related to material samples FAIR: findable, accessible, interoperable, and reusable. The Internet of Samples (i.e., iSamples) is one of these projects. iSamples was funded by the United States National Science Foundation in 2020 with the following aims: enable previously impossible connections between diverse and disparate sample-based observations; support existing research programs and facilities that collect and manage diverse sample types; facilitate new interdisciplinary collaborations; and provide an efficient solution for FAIR samples, avoiding duplicate efforts in different domains (Davies et al. 2021) enable previously impossible connections between diverse and disparate sample-based observations; support existing research programs and facilities that collect and manage diverse sample types; facilitate new interdisciplinary collaborations; and provide an efficient solution for FAIR samples, avoiding duplicate efforts in different domains (Davies et al. 2021) The initial sample collections that will make up the internet of samples include those from the System for Earth Sample Registration (SESAR), Open Context, the Genomic Observatories Meta-Database (GEOME), and Smithsonian Institution Museum of Natural History (NMNH), representing the disciplines of geoscience, archaeology/anthropology, and biology. To achieve these aims, the proposed iSamples infrastructure (Fig. 1) has two key components: iSamples in a Box (iSB) and iSamples Central (iSC). The iSC component will be a permanent Internet service that preserves, indexes, and provides access to sample metadata aggregated from iSBs. It will also ensure that persistent identifiers and sample descriptions assigned and used by individual iSBs are synchronized with the records in iSC and with identifier authorities like International Geo Sample Number (IGSN) or Archival Resource Key (ARK). The iSBs create and maintain identifiers and metadata for their respective collection of samples. While providing access to the samples held locally, an iSB also allows iSC to harvest its metadata records. The metadata modeling strategy adopted by the iSamples project is a metadata profile-based approach, where core metadata fields that are applicable to all samples, form the core metadata schema for iSamples. Each individual participating collectionis free to include additional metadata in their records, which will also be harvested by iSC and are discoverable through the iSC user interface or APIs (Application Programming Interfaces), just like the core. In-depth analysis of metadata profiles used by participating collections, including Darwin Core, has resulted in an iSamples core schema currently being tested and refined through use. See the current version of the iSamples core schema. A number of properties require a controlled vocabulary. Controlled vocabularies used by existing records are kept, while new vocabularies are also being developed to support high-level grouping with consistent semantics across collection types. Examples include vocabularies for Context Category, Material Category, and Specimen Type (Table 1). These vocabularies were also developed in a bottom-up manner, based on the terms used in the existing collections. For each vocabulary, a decision tree graph was created to illustrate relations among the terms, and a card sorting exercise was conducted within the project team to collect feedback. Domain experts are invited to take part in this exercise here, here, and here. These terms will be used as upper-level terms to the existing category terms used in the participating collections and hence create connections among individual participating collections. iSample project members are also active in the TDWG Material Sample Task Group and the global consultation on Digital Extended Specimens. Many members of the iSamples project also lead or participate in a sister research coordination network (RCN), Sampling Nature. The goal of this RCN is to develop and refine metadata standards and controlled vocabularies for the iSamples and other projects focusing on material samples. We cordially invite you to participate in the Sampling Nature RCN and help shape the future standards for material samples. Contact Sarah Ramdeen ([email protected]) to engage with the RCN.

Download Full-text

SchenQL: in-depth analysis of a query language for bibliographic metadata

International Journal on Digital Libraries ◽

10.1007/s00799-021-00317-8 ◽

2021 ◽

Author(s):

Christin Katharina Kreutz ◽

Michael Wolz ◽

Jascha Knack ◽

Benjamin Weyers ◽

Ralf Schenkel

Keyword(s):

Information Needs ◽

User Study ◽

Query Language ◽

Information Access ◽

Domain Experts ◽

Domain Specific ◽

Complex Information ◽

Depth Analysis ◽

Information Exploration ◽

High Level

AbstractInformation access to bibliographic metadata needs to be uncomplicated, as users may not benefit from complex and potentially richer data that may be difficult to obtain. Sophisticated research questions including complex aggregations could be answered with complex SQL queries. However, this comes with the cost of high complexity, which requires for a high level of expertise even for trained programmers. A domain-specific query language could provide a straightforward solution to this problem. Although less generic, it can support users not familiar with query construction in the formulation of complex information needs. In this paper, we present and evaluate SchenQL, a simple and applicable query language that is accompanied by a prototypical GUI. SchenQL focuses on querying bibliographic metadata using the vocabulary of domain experts. The easy-to-learn domain-specific query language is suitable for domain experts as well as casual users while still providing the possibility to answer complex information demands. Query construction and information exploration are supported by a prototypical GUI. We present an evaluation of the complete system: different variants for executing SchenQL queries are benchmarked; interviews with domain-experts and a bipartite quantitative user study demonstrate SchenQL’s suitability and high level of users’ acceptance.

Download Full-text

Vocabulary Control in Nautical Information Resources

European Journal of Education ◽

10.26417/847ovu20r ◽

2021 ◽

Vol 4 (2) ◽

pp. 126

Author(s):

Edgardo A. Stubbs

Keyword(s):

Natural Language ◽

Information Needs ◽

Access Point ◽

Essential Elements ◽

Controlled Vocabulary ◽

Controlled Vocabularies ◽

Essential Information ◽

Safe Navigation ◽

Maritime Navigation ◽

Source Of Information

Nautical charts are an essential information resource for safe navigation. However, they are not only a useful resource for navigators; According to the International Hydrological Organization (IHO), they essentially fulfill two functions: 1) Maritime navigation, since most hydrographic services have the obligation to provide coverage of the nautical charts of their national waters, in all coastal waters, including major ports and smaller marinas of purely local interest. and 2) as a source of information, since national nautical charts present a detailed configuration of the seabed. Information on the shape of the seabed is required by a diversity of users in addition to navigators; for example, engineers interested in onshore construction, dredging contractors, oceanographers, defense agencies, coastal zone managers, etc. Traditionally, there are three essential elements that play an important role in information retrieval: title, author, and subject access point. Among the latter, one can distinguish indexing by natural language and by controlled vocabularies. The thematic access points, makes it easier for the user to search and retrieve all types of resources that satisfy their information needs. Traditionally in the processing of nautical charts the natural language is used predominantly, motivated by a lack of availability of a controlled vocabulary specific to the área in Spanish. The objective of this work is aimed at establishing the criteria for the construction of a controlled vocabulary in Spanish in the field of nautical charts.

Download Full-text

Supporting Complex Information Needs via Large-Scale Collaborative Search

10.14236/ewic/fdia2017.1 ◽

2017 ◽

Author(s):

Felipe Tavares de Moraes

Keyword(s):

Information Needs ◽

Large Scale ◽

Complex Information ◽

Collaborative Search

Download Full-text

Comparison of Different Analytical Methods for the On-Site Analysis of Traces at Clandestine Drug Laboratories

Applied Sciences ◽

10.3390/app11093754 ◽

2021 ◽

Vol 11 (9) ◽

pp. 3754

Author(s):

René Reiss ◽

Frank Hauser ◽

Sven Ehlert ◽

Michael Pütz ◽

Ralf Zimmermann

Keyword(s):

Large Scale ◽

Ambient Pressure ◽

Relevant Information ◽

Information Value ◽

Site Analysis ◽

Complex Information ◽

Skilled Operator ◽

Validation Parameters ◽

Clandestine Laboratories ◽

To Receive

While fast and reliable analytical results are crucial for first responders to make adequate decisions, these can be difficult to establish, especially at large-scale clandestine laboratories. To overcome this issue, multiple techniques at different levels of complexity are available. In addition to the level of complexity their information value differs as well. Within this publication, a comparison between three techniques that can be applied for on-site analysis is performed. These techniques range from ones with a simple yes or no response to sophisticated ones that allows to receive complex information about a sample. The three evaluated techniques are immunoassay drug tests representing easy to handle and fast to explain systems, ion mobility spectrometry as state-of-the-art equipment that needs training and experience prior to use and ambient pressure laser desorption with the need for a highly skilled operator as possible future technique that is currently under development. In addition to the measurement of validation parameters, real case samples are investigated to obtain practically relevant information about the capabilities and limitations of these techniques for on-site operations. Results demonstrate that in general all techniques deliver valid results, but the bandwidth of information widely varies between the investigated techniques.

Download Full-text

Large-scale transcriptomics to dissect 2 years of the life of a fungal phytopathogen interacting with its host plant

BMC Biology ◽

10.1186/s12915-021-00989-3 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Elise J. Gay ◽

Jessica L. Soyer ◽

Nicolas Lapalu ◽

Juliette Linglin ◽

Isabelle Fudal ◽

...

Keyword(s):

Host Plant ◽

Large Scale ◽

Field Experiments ◽

Plant Residues ◽

Plant Infection ◽

Effector Genes ◽

Controlled Conditions ◽

Niche Adaptation ◽

Depth Analysis ◽

Fungal Genes

Abstract Background The fungus Leptosphaeria maculans has an exceptionally long and complex relationship with its host plant, Brassica napus, during which it switches between different lifestyles, including asymptomatic, biotrophic, necrotrophic, and saprotrophic stages. The fungus is also exemplary of “two-speed” genome organisms in the genome of which gene-rich and repeat-rich regions alternate. Except for a few stages of plant infection under controlled conditions, nothing is known about the genes mobilized by the fungus throughout its life cycle, which may last several years in the field. Results We performed RNA-seq on samples corresponding to all stages of the interaction of L. maculans with its host plant, either alive or dead (stem residues after harvest) in controlled conditions or in field experiments under natural inoculum pressure, over periods of time ranging from a few days to months or years. A total of 102 biological samples corresponding to 37 sets of conditions were analyzed. We show here that about 9% of the genes of this fungus are highly expressed during its interactions with its host plant. These genes are distributed into eight well-defined expression clusters, corresponding to specific infection lifestyles or to tissue-specific genes. All expression clusters are enriched in effector genes, and one cluster is specific to the saprophytic lifestyle on plant residues. One cluster, including genes known to be involved in the first phase of asymptomatic fungal growth in leaves, is re-used at each asymptomatic growth stage, regardless of the type of organ infected. The expression of the genes of this cluster is repeatedly turned on and off during infection. Whatever their expression profile, the genes of these clusters are enriched in heterochromatin regions associated with H3K9me3 or H3K27me3 repressive marks. These findings provide support for the hypothesis that part of the fungal genes involved in niche adaptation is located in heterochromatic regions of the genome, conferring an extreme plasticity of expression. Conclusion This work opens up new avenues for plant disease control, by identifying stage-specific effectors that could be used as targets for the identification of novel durable disease resistance genes, or for the in-depth analysis of chromatin remodeling during plant infection, which could be manipulated to interfere with the global expression of effector genes at crucial stages of plant infection.

Download Full-text

Knowledge Graphs for COVID-19: An Exploratory Review of the Current Landscape

Journal of Personalized Medicine ◽

10.3390/jpm11040300 ◽

2021 ◽

Vol 11 (4) ◽

pp. 300

Author(s):

Avishek Chatterjee ◽

Cosimo Nardi ◽

Cary Oberije ◽

Philippe Lambin

Keyword(s):

Information Needs ◽

Drug Repurposing ◽

Research Literature ◽

Google Scholar ◽

Knowledge Graph ◽

Search Term ◽

Complex Information ◽

Big Picture ◽

Improving Patient Care ◽

Keyword Searches

Background: Searching through the COVID-19 research literature to gain actionable clinical insight is a formidable task, even for experts. The usefulness of this corpus in terms of improving patient care is tied to the ability to see the big picture that emerges when the studies are seen in conjunction rather than in isolation. When the answer to a search query requires linking together multiple pieces of information across documents, simple keyword searches are insufficient. To answer such complex information needs, an innovative artificial intelligence (AI) technology named a knowledge graph (KG) could prove to be effective. Methods: We conducted an exploratory literature review of KG applications in the context of COVID-19. The search term used was “covid-19 knowledge graph”. In addition to PubMed, the first five pages of search results for Google Scholar and Google were considered for inclusion. Google Scholar was used to include non-peer-reviewed or non-indexed articles such as pre-prints and conference proceedings. Google was used to identify companies or consortiums active in this domain that have not published any literature, peer-reviewed or otherwise. Results: Our search yielded 34 results on PubMed and 50 results each on Google and Google Scholar. We found KGs being used for facilitating literature search, drug repurposing, clinical trial mapping, and risk factor analysis. Conclusions: Our synopses of these works make a compelling case for the utility of this nascent field of research.

Download Full-text

Synchronization in 5G networks: a hybrid Bayesian approach toward clock offset/skew estimation and its impact on localization

EURASIP Journal on Wireless Communications and Networking ◽

10.1186/s13638-021-01963-x ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Meysam Goodarzi ◽

Darko Cvetkovski ◽

Nebojsa Maletic ◽

Jesús Gutiérrez ◽

Eckhard Grass

Keyword(s):

Bayesian Approach ◽

Large Scale ◽

Hybrid Approach ◽

Position Estimation ◽

5G Networks ◽

Clock Skew ◽

Depth Analysis ◽

The Impact ◽

Mean Square Errors ◽

Clock Offset

AbstractClock synchronization has always been a major challenge when designing wireless networks. This work focuses on tackling the time synchronization problem in 5G networks by adopting a hybrid Bayesian approach for clock offset and skew estimation. Furthermore, we provide an in-depth analysis of the impact of the proposed approach on a synchronization-sensitive service, i.e., localization. Specifically, we expose the substantial benefit of belief propagation (BP) running on factor graphs (FGs) in achieving precise network-wide synchronization. Moreover, we take advantage of Bayesian recursive filtering (BRF) to mitigate the time-stamping error in pairwise synchronization. Finally, we reveal the merit of hybrid synchronization by dividing a large-scale network into local synchronization domains and applying the most suitable synchronization algorithm (BP- or BRF-based) on each domain. The performance of the hybrid approach is then evaluated in terms of the root mean square errors (RMSEs) of the clock offset, clock skew, and the position estimation. According to the simulations, in spite of the simplifications in the hybrid approach, RMSEs of clock offset, clock skew, and position estimation remain below 10 ns, 1 ppm, and 1.5 m, respectively.

Download Full-text

Large-Scale Landslide Displacement Rate Prediction Based on Multi-Factor Support Vector Regression Machine

Applied Sciences ◽

10.3390/app11041381 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1381

Author(s):

Xiuzhen Li ◽

Shengwei Li

Keyword(s):

Support Vector Regression ◽

Water Level ◽

Large Scale ◽

Displacement Rate ◽

Support Vector ◽

Single Factor ◽

Reservoir Water ◽

Reservoir Water Level ◽

Depth Analysis ◽

Three Factor

Forecasting the development of large-scale landslides is a contentious and complicated issue. In this study, we put forward the use of multi-factor support vector regression machines (SVRMs) for predicting the displacement rate of a large-scale landslide. The relative relationships between the main monitoring factors were analyzed based on the long-term monitoring data of the landslide and the grey correlation analysis theory. We found that the average correlation between landslide displacement and rainfall is 0.894, and the correlation between landslide displacement and reservoir water level is 0.338. Finally, based on an in-depth analysis of the basic characteristics, influencing factors, and development of landslides, three main factors (i.e., the displacement rate, reservoir water level, and rainfall) were selected to build single-factor, two-factor, and three-factor SVRM models. The key parameters of the models were determined using a grid-search method, and the models showed high accuracies. Moreover, the accuracy of the two-factor SVRM model (displacement rate and rainfall) is the highest with the smallest standard error (RMSE) of 0.00614; it is followed by the three-factor and single-factor SVRM models, the latter of which has the lowest prediction accuracy, with the largest RMSE of 0.01644.

Download Full-text

The use of thesauri in online retrieval

Journal of Information Science ◽

10.1177/016555158400800204 ◽

1984 ◽

Vol 8 (2) ◽

pp. 63-66 ◽

Cited By ~ 8

Author(s):

C.P.R. Dubois

Keyword(s):

Information Retrieval ◽

Data Base ◽

Case Studies ◽

Controlled Vocabulary ◽

Free Text ◽

Data Bases ◽

Online Data ◽

Controlled Vocabularies ◽

Semantic Maps ◽

Actual Use

The controlled vocabulary versus the free text approach to information retrieval is reviewed from the mid 1960s to the early 1980s. The dominance of the free text approach following the Cranfield tests is increasingly coming into question as a result of tests on existing online data bases and case studies. This is supported by two case studies on the Coffeeline data base. The differences and values of the two approaches are explored considering thesauri as semantic maps. It is suggested that the most appropriate evaluatory technique for indexing languages is to study the actual use made of various techniques in a wide variety of search environments. Such research is becoming more urgent. Economic and other reasons for the scarcity of online thesauri are reviewed and suggestions are made for methods to secure revenue from thesaurus display facilities. Finally, the promising outlook for renewed develop ment of controlled vocabularies with more effective online display techniques is mentioned, although such development must be based on firm research of user behaviour and needs.

Download Full-text