document structures
Recently Published Documents

TOTAL DOCUMENTS: 45 (FIVE YEARS: 1)
H-INDEX: 7 (FIVE YEARS: 0)

2020 ◽  
Author(s):  
Victor Ferraz ◽  
Gabriel Olivato ◽  
Igor Magollo ◽  
Murilo Naldi

Financial statement analysis is a fundamental part of the credit risk attribution process, producing documents that are valuable sources of information about companies’ economic and financial health. Large volumes of such documents demand automatic data extraction, and locators drive the tools for that task. However, due to the lack of regulation, there is no standard layout for these documents, which gives rise to a variety of document structures. This variety burdens feature-extraction tools and reduces their performance. Clustering analysis overcomes this burden by finding the best document clusters, allowing fine-tuned locators to be developed for each cluster based on its main characteristics, which is the main objective of this work. We applied state-of-the-art clustering techniques (RNG-HDBSCAN*, FOSC, and MustaCHE) to financial statement documents to assess their clusters and main structures, separate outliers, and analyze their main features. The results allow specialists to define proper locators for each cluster, increasing the performance of the data extraction tools.
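The clustering step described in the abstract can be sketched roughly as follows. This is a minimal illustration only: it uses scikit-learn's DBSCAN as a simpler stand-in for RNG-HDBSCAN*, and the two-dimensional "layout features" are entirely hypothetical, not the paper's actual document features.

```python
# Minimal sketch: cluster documents by layout features, flag outliers.
# DBSCAN stands in for RNG-HDBSCAN*; the toy features are hypothetical.
import numpy as np
from sklearn.cluster import DBSCAN

# toy per-document layout features: (number of tables, header depth)
features = np.array([
    [1.0, 2.0], [1.1, 2.1], [0.9, 1.9],   # one layout family
    [5.0, 7.0], [5.2, 6.8], [4.9, 7.1],   # another layout family
    [9.0, 0.5],                            # a structural outlier
])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
# documents labeled -1 are outliers; each non-negative label marks a
# cluster for which a dedicated, fine-tuned locator could be built
```

In this sketch, the first three documents and the next three land in two separate clusters, and the last document is flagged as noise (label -1), mirroring the paper's goal of separating outliers from the main document structures.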


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Hans Moen ◽  
Kai Hakala ◽  
Laura-Maria Peltonen ◽  
Hanna-Maria Matinolli ◽  
Henry Suhonen ◽  
...  

Author(s):  
Jeffrey Beck

Markup makes it easier to share. We share documents with our peers, our partners, and even our competitors. Communities of interest form; they define document structures, test them in practice, and affirm them by adoption. Joining a community has obvious advantages: reduced development costs, ease of interchange, tried and tested tools, and an available pool of authors, editors, and developers already familiar with the vocabulary. Over time, the pace of vocabulary evolution slows naturally. The major structures are developed, applied, tested, and accepted. New structures are added more slowly, and more reluctantly. The community has transitioned into maintenance mode, where large-scale refactorings and backwards-incompatible changes are known to have burdensome costs and “best practices” are known to make sharing easier. What can the “Markup Community in General” do to support these stricter best-practices communities?


Author(s):  
C. M. Sperberg-McQueen

The need for markup to handle multiple concurrent document structures has been clear at least since SGML introduced the CONCUR feature to support such markup. Few SGML users found the use of CONCUR necessary, few products ever supported it, and the designers of XML dropped it as an unnecessary complication. But those who need concurrent markup really need it. Fortunately, the functionality of CONCUR can be recreated more or less successfully in XML: one document structure can use conventional XML, while others use Trojan-Horse markup (DeRose 2004). Rabbit/duck grammars can be used to validate the document and to guide the creation of conventional schemas for use in editing tools.
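As a concrete illustration of the Trojan-Horse approach the abstract refers to, the fragment below keeps one hierarchy (verses) as conventional XML and expresses a concurrent page structure with empty milestone elements carrying sID/eID attributes (DeRose 2004). The element names and content are illustrative, not taken from the abstract.

```xml
<text>
  <!-- "verse" is the primary XML hierarchy; the concurrent "page"
       structure is expressed as empty start/end milestones that may
       cross verse boundaries without breaking well-formedness -->
  <verse n="1">In the beginning <pb sID="p2"/>was the Word,</verse>
  <verse n="2">and the Word was with God.<pb eID="p2"/></verse>
</text>
```

A rabbit/duck grammar can then validate the verse hierarchy and the page hierarchy reconstructed from the milestones as two separate views of the same document.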


Author(s):  
Mohammad Masudur Rahman ◽  
Chanchal Roy

During software maintenance, developers usually deal with a significant number of software change requests. As a part of this, they often formulate an initial query from the request texts and then attempt to map the concepts discussed in the request to relevant source code locations in the software system (a.k.a. concept location). Unfortunately, studies suggest that they often perform poorly in choosing the right search terms for a change task. In this paper, we propose a novel technique, ACER, that takes an initial query, identifies appropriate search terms from the source code using a novel term weight, CodeRank, and then suggests an effective reformulation of the initial query by exploiting the source document structures, query quality analysis, and machine learning. Experiments with 1,675 baseline queries from eight subject systems report that our technique can improve 71% of the baseline queries, which is highly promising. Comparison with five closely related existing techniques in query reformulation not only validates our empirical findings but also demonstrates the superiority of our technique.
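The graph-based term-weighting idea behind a weight like CodeRank can be sketched as a PageRank-style power iteration over a term graph. Everything below (the graph construction, the damping factor, the toy term pairs) is an assumption for illustration, not the paper's actual CodeRank formulation.

```python
# Hedged sketch of a PageRank-style term weight over a term graph.
# Graph construction, damping factor, and example pairs are
# illustrative assumptions, not the paper's exact CodeRank definition.
from collections import defaultdict

def term_rank(pairs, damping=0.85, iters=50):
    """Power iteration over an undirected term co-occurrence graph."""
    nodes = sorted({t for pair in pairs for t in pair})
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - damping) / len(nodes)
               + damping * sum(rank[m] / len(neighbors[m])
                               for m in neighbors[n])
            for n in nodes
        }
    return rank

# hypothetical terms that co-occur in the same source code entity
pairs = [("parser", "token"), ("parser", "ast"), ("token", "ast"),
         ("parser", "cache"), ("util", "cache")]
ranks = term_rank(pairs)
best = max(ranks, key=ranks.get)  # the best-connected term ranks highest
```

Here "parser" receives the highest weight because it is the best-connected term, which matches the intuition of preferring terms that are central in the source code's term graph as query candidates.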


2017 ◽  
Author(s):  
Mohammad Masudur Rahman ◽  
Chanchal Roy

During software maintenance, developers usually deal with a significant number of software change requests. As a part of this, they often formulate an initial query from the request texts and then attempt to map the concepts discussed in the request to relevant source code locations in the software system (a.k.a. concept location). Unfortunately, studies suggest that they often perform poorly in choosing the right search terms for a change task. In this paper, we propose a novel technique, ACER, that takes an initial query, identifies appropriate search terms from the source code using a novel term weight, CodeRank, and then suggests an effective reformulation of the initial query by exploiting the source document structures, query quality analysis, and machine learning. Experiments with 1,675 baseline queries from eight subject systems report that our technique can improve 71% of the baseline queries, which is highly promising. Comparison with five closely related existing techniques in query reformulation not only validates our empirical findings but also demonstrates the superiority of our technique.


Author(s):  
Ronald Haentjens Dekker ◽  
David J. Birnbaum

The XML tree paradigm has several well-known limitations for document modeling and processing. Some of these have received a lot of attention (especially overlap), and some have received less (e.g., discontinuity, simultaneity, transposition, white space as crypto-overlap). Many of these have work-arounds, also well known, but—as is implicit in the term “work-around”—these work-arounds have disadvantages. Because they get the job done, however, and because XML has a large user community with diverse levels of technological expertise, it is difficult to overcome inertia and move to a technology that might offer a more comprehensive fit with the full range of document structures with which researchers need to interact both intellectually and programmatically. A high-level analysis of why XML has the limitations it has can enable us to explore how an alternative model of Text as Graph (TAG) might address these types of structures and tasks in a more natural and idiomatic way than is available within an XML paradigm.
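One way to see why a graph-oriented model handles overlap naturally: if the text's tokens are shared leaf nodes, each hierarchy can point into them independently. The tiny sketch below illustrates that idea only; the layer names and spans are hypothetical and this is not TAG's actual data model.

```python
# Toy illustration: two concurrent structures ("verse" and "page")
# address the same shared token stream via (start, end) spans, so an
# overlap like a page break inside a verse needs no work-around.
tokens = ["In", "the", "beginning", "was", "the", "Word"]

structures = {
    "verse": {"v1": (0, 4), "v2": (4, 6)},  # verse boundary after "was"
    "page":  {"p1": (0, 3), "p2": (3, 6)},  # page break after "beginning"
}

def text_of(layer, node):
    """Recover the text of one node in one structural layer."""
    start, end = structures[layer][node]
    return " ".join(tokens[start:end])
# v1 and p2 overlap (both contain "was"), which a single XML tree
# cannot express without milestones or fragmentation
```

Because both layers are just annotations over the same token sequence, adding a third concurrent structure (e.g., transpositions or discontinuous spans) is a matter of adding more span sets, not restructuring a tree.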

