Evolution of Characteristic Tree Structured Patterns from Semistructured Documents

Discovery of Maximally Frequent Tag Tree Patterns with Contractible Variables from Semistructured Documents

Advances in Knowledge Discovery and Data Mining - Lecture Notes in Computer Science

10.1007/978-3-540-24775-3_17

2004

pp. 133-144

Cited By ~ 7

Author(s):

Tetsuhiro Miyahara

Yusuke Suzuki

Takayoshi Shoudai

Tomoyuki Uchida

Kenichi Takahashi

...

Keyword(s):

Tree Patterns

Semistructured Documents


Automatic Genre-Specific Text Classification

Encyclopedia of Data Warehousing and Mining, Second Edition

10.4018/978-1-60566-010-3.ch020

2011

pp. 120-127

Author(s):

Xiaoyan Yu

Manas Tungare

Weiguo Fan

Manuel Pérez-Quiñones

Edward A. Fox

...

Keyword(s):

Text Mining

Text Classification

Information Needs

Question Answering

Class Schedule

Semistructured Documents

Linkage Information

Filter Noise

Topic Tracking

Course Syllabus

Starting with a vast number of unstructured or semistructured documents, text mining tools analyze and sift through them to present users with more valuable information specific to their information needs. The technologies in text mining include information extraction, topic tracking, summarization, categorization/classification, clustering, concept linkage, information visualization, and question answering [Fan, Wallace, Rich, & Zhang, 2006]. In this chapter, we share our hands-on experience with one specific text mining task: text classification [Sebastiani, 2002]. Information occurs in various formats, and some formats have a specific structure or contain specific information: we refer to these as ‘genres’. Examples of information genres include news items, reports, and academic articles. In this chapter, we deal with one specific genre, the course syllabus. A course syllabus has the following commonly occurring fields: title, description, instructor’s name, textbook details, class schedule, etc. In essence, a course syllabus is the skeleton of a course. Free and fast access to a collection of syllabi in a structured format could have a significant impact on education, especially for educators and life-long learners. Educators can borrow ideas from others’ syllabi to organize their own classes. Life-long learners could also easily find popular textbooks and even important chapters when they would like to study a subject on their own. Unfortunately, searching for a syllabus on the Web using Information Retrieval [Baeza-Yates & Ribeiro-Neto, 1999] techniques employed by a generic search engine often yields too many non-relevant search result pages (i.e., noise): some only provide guidelines on syllabus creation; some only provide a schedule for a course event; some have outgoing links to syllabi (e.g., a course list page of an academic department). Therefore, a well-designed classifier for the search results is needed, one that would help not only to filter out noise but also to identify more relevant and useful syllabi.


Documents and Topic Maps

Medical Informatics

10.4018/978-1-60566-050-9.ch182

2011

pp. 2423-2442

Author(s):

Frédérique Laforest

Keyword(s):

Medical Record

Medical Records

Software System

Topic Maps

Semistructured Documents

Long Time

Unified View

Structured Documents

Different Sources

Medical records have long been used in different forms and with different aims and usages. This heterogeneity is the result of different professions, ways of working, and needs. It is, however, detrimental to querying and sharing data and documents. Moreover, we consider that the system must be as close as possible to a more classical, noncomputerized way of working, such as the paper-based medical record, and should thus manage documents. Medical records are often loosely structured or semistructured documents, which impedes easy retrieval. In our approach, a medical record is considered as a set of documents and a set of data. In this article, we propose a software system for extracting data from loosely structured documents coming from different sources and for querying them in a hybrid way. Querying can be done in a navigation space that represents extracted data or entire documents. Two main parts are described: the extraction of data from loosely structured documents and the navigation in a unified view of documents and data.
