document structures
Recently Published Documents

TOTAL DOCUMENTS: 45 (FIVE YEARS: 1)
H-INDEX: 7 (FIVE YEARS: 0)

2020 ◽  
Author(s):  
Victor Ferraz ◽  
Gabriel Olivato ◽  
Igor Magollo ◽  
Murilo Naldi

Financial statement analysis is a fundamental part of the credit risk attribution process, producing documents that are valuable sources of information about companies’ economic and financial health. Large volumes of such documents demand automatic data extraction, and locators drive the tools for that task. However, due to the lack of regulation, there is no standard layout for these documents, which gives rise to a variety of document structures. This variety burdens feature-extraction tools and reduces their performance. Clustering analysis overcomes this burden by finding the best document clusters, allowing fine-tuned locators to be developed for each cluster based on its main characteristics, which is the main objective of this work. We applied state-of-the-art clustering techniques (RNG-HDBSCAN*, FOSC, and MustaCHE) to financial statement documents to assess their clusters and main structures, separate outliers, and analyze their main features. The results allow specialists to define proper locators for each cluster, increasing the performance of the data extraction tools.
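The clustering step described in the abstract can be sketched roughly as follows. This is a minimal illustration only: it uses scikit-learn's DBSCAN as a simpler stand-in for RNG-HDBSCAN*, and the two-dimensional "layout features" are entirely hypothetical, not the paper's actual document features.

```python
# Minimal sketch: cluster documents by layout features, flag outliers.
# DBSCAN stands in for RNG-HDBSCAN*; the toy features are hypothetical.
import numpy as np
from sklearn.cluster import DBSCAN

# toy per-document layout features: (number of tables, header depth)
features = np.array([
    [1.0, 2.0], [1.1, 2.1], [0.9, 1.9],   # one layout family
    [5.0, 7.0], [5.2, 6.8], [4.9, 7.1],   # another layout family
    [9.0, 0.5],                            # a structural outlier
])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
# documents labeled -1 are outliers; each non-negative label marks a
# cluster for which a dedicated, fine-tuned locator could be built
```

In this sketch, the first three documents and the next three land in two separate clusters, and the last document is flagged as noise (label -1), mirroring the paper's goal of separating outliers from the main document structures.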


2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Hans Moen ◽  
Kai Hakala ◽  
Laura-Maria Peltonen ◽  
Hanna-Maria Matinolli ◽  
Henry Suhonen ◽  
...  

Author(s):  
Jeffrey Beck

Markup makes it easier to share. We share documents with our peers, our partners, and even our competitors. Communities of interest form; they define document structures, test them in practice, and affirm them by adoption. Joining a community has obvious advantages: reduced development costs, ease of interchange, tried and tested tools, and an available pool of authors, editors, and developers already familiar with the vocabulary. Over time, the pace of vocabulary evolution slows naturally. The major structures are developed, applied, tested, and accepted. New structures are added more slowly, and more reluctantly. The community has transitioned into maintenance mode, where large-scale refactorings and backwards-incompatible changes are known to have burdensome costs and “best practices” are known to make sharing easier. What can the “Markup Community in General” do to support these stricter best-practices communities?


Author(s):  
C. M. Sperberg-McQueen

The need for markup to handle multiple concurrent document structures has been clear at least since SGML introduced the CONCUR feature to support such markup. Few SGML users found the use of CONCUR necessary, few products ever supported it, and the designers of XML dropped it as an unnecessary complication. But those who need concurrent markup really need it. Fortunately, the functionality of CONCUR can be recreated more or less successfully in XML: one document structure can use conventional XML, while others use Trojan-Horse markup (DeRose 2004). Rabbit/duck grammars can be used to validate the document and to guide the creation of conventional schemas for use in editing tools.
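As a concrete illustration of the Trojan-Horse approach the abstract refers to, the fragment below keeps one hierarchy (verses) as conventional XML and expresses a concurrent page structure with empty milestone elements carrying sID/eID attributes (DeRose 2004). The element names and content are illustrative, not taken from the abstract.

```xml
<text>
  <!-- "verse" is the primary XML hierarchy; the concurrent "page"
       structure is expressed as empty start/end milestones that may
       cross verse boundaries without breaking well-formedness -->
  <verse n="1">In the beginning <pb sID="p2"/>was the Word,</verse>
  <verse n="2">and the Word was with God.<pb eID="p2"/></verse>
</text>
```

A rabbit/duck grammar can then validate the verse hierarchy and the page hierarchy reconstructed from the milestones as two separate views of the same document.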


Author(s):  
Mohammad Masudur Rahman ◽  
Chanchal Roy

During software maintenance, developers usually deal with a significant number of software change requests. As a part of this, they often formulate an initial query from the request texts and then attempt to map the concepts discussed in the request to relevant source code locations in the software system (a.k.a. concept location). Unfortunately, studies suggest that they often perform poorly in choosing the right search terms for a change task. In this paper, we propose a novel technique, ACER, that takes an initial query, identifies appropriate search terms from the source code using a novel term weight, CodeRank, and then suggests an effective reformulation of the initial query by exploiting the source document structures, query quality analysis, and machine learning. Experiments with 1,675 baseline queries from eight subject systems report that our technique can improve 71% of the baseline queries, which is highly promising. Comparison with five closely related existing techniques in query reformulation not only validates our empirical findings but also demonstrates the superiority of our technique.
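The graph-based term-weighting idea behind a weight like CodeRank can be sketched as a PageRank-style power iteration over a term graph. Everything below (the graph construction, the damping factor, the toy term pairs) is an assumption for illustration, not the paper's actual CodeRank formulation.

```python
# Hedged sketch of a PageRank-style term weight over a term graph.
# Graph construction, damping factor, and example pairs are
# illustrative assumptions, not the paper's exact CodeRank definition.
from collections import defaultdict

def term_rank(pairs, damping=0.85, iters=50):
    """Power iteration over an undirected term co-occurrence graph."""
    nodes = sorted({t for pair in pairs for t in pair})
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - damping) / len(nodes)
               + damping * sum(rank[m] / len(neighbors[m])
                               for m in neighbors[n])
            for n in nodes
        }
    return rank

# hypothetical terms that co-occur in the same source code entity
pairs = [("parser", "token"), ("parser", "ast"), ("token", "ast"),
         ("parser", "cache"), ("util", "cache")]
ranks = term_rank(pairs)
best = max(ranks, key=ranks.get)  # the best-connected term ranks highest
```

Here "parser" receives the highest weight because it is the best-connected term, which matches the intuition of preferring terms that are central in the source code's term graph as query candidates.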


2017 ◽  
Author(s):  
Mohammad Masudur Rahman ◽  
Chanchal Roy

During software maintenance, developers usually deal with a significant number of software change requests. As a part of this, they often formulate an initial query from the request texts and then attempt to map the concepts discussed in the request to relevant source code locations in the software system (a.k.a. concept location). Unfortunately, studies suggest that they often perform poorly in choosing the right search terms for a change task. In this paper, we propose a novel technique, ACER, that takes an initial query, identifies appropriate search terms from the source code using a novel term weight, CodeRank, and then suggests an effective reformulation of the initial query by exploiting the source document structures, query quality analysis, and machine learning. Experiments with 1,675 baseline queries from eight subject systems report that our technique can improve 71% of the baseline queries, which is highly promising. Comparison with five closely related existing techniques in query reformulation not only validates our empirical findings but also demonstrates the superiority of our technique.


Author(s):  
Ronald Haentjens Dekker ◽  
David J. Birnbaum

The XML tree paradigm has several well-known limitations for document modeling and processing. Some of these have received a lot of attention (especially overlap), and some have received less (e.g., discontinuity, simultaneity, transposition, white space as crypto-overlap). Many of these have work-arounds, also well known, but—as is implicit in the term “work-around”—these work-arounds have disadvantages. Because they get the job done, however, and because XML has a large user community with diverse levels of technological expertise, it is difficult to overcome inertia and move to a technology that might offer a more comprehensive fit with the full range of document structures with which researchers need to interact both intellectually and programmatically. A high-level analysis of why XML has the limitations it has can enable us to explore how an alternative model of Text as Graph (TAG) might address these types of structures and tasks in a more natural and idiomatic way than is available within an XML paradigm.
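One way to see why a graph-oriented model handles overlap naturally: if the text's tokens are shared leaf nodes, each hierarchy can point into them independently. The tiny sketch below illustrates that idea only; the layer names and spans are hypothetical and this is not TAG's actual data model.

```python
# Toy illustration: two concurrent structures ("verse" and "page")
# address the same shared token stream via (start, end) spans, so an
# overlap like a page break inside a verse needs no work-around.
tokens = ["In", "the", "beginning", "was", "the", "Word"]

structures = {
    "verse": {"v1": (0, 4), "v2": (4, 6)},  # verse boundary after "was"
    "page":  {"p1": (0, 3), "p2": (3, 6)},  # page break after "beginning"
}

def text_of(layer, node):
    """Recover the text of one node in one structural layer."""
    start, end = structures[layer][node]
    return " ".join(tokens[start:end])
# v1 and p2 overlap (both contain "was"), which a single XML tree
# cannot express without milestones or fragmentation
```

Because both layers are just annotations over the same token sequence, adding a third concurrent structure (e.g., transpositions or discontinuous spans) is a matter of adding more span sets, not restructuring a tree.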

