LINGUISTIC ANALYZER: AUTOMATIC TRANSFORMATION OF NATURAL LANGUAGE TEXTS INTO INFORMATION DATA STRUCTURE
Published by St. Petersburg State University
ISBN 9785288059278
Total documents: 10 · H-index: 0

Drawing on Charles Bally’s distinction between dictum and modus, the final chapter deals with the pragmatic analysis of the propositions identified at the previous step. Propositions can be expressed by an attribute linking together a number of objects or by a single word (an implicit proposition). Implicit propositions should be made explicit, i.e. the missing objects should be brought out. Each proposition is assigned a conceptual type. The authors posit four types of propositions, namely fact, opinion, evaluation, and expression of will. While the distinction between fact and opinion is common enough, evaluation emphasizes the emotional, or expressive, aspect of an utterance, and the expression of will has to do with persuasive speech acts. Next, relations between propositions, both local and distant, are established, which makes it possible to bring them together within a rhetorical text structure. The units of rhetorical structure are quite different from both the syntactic and the communicative ones and vary depending on the text type (statement, persuasion, directive, etc.).
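The four conceptual types and the explication of implicit propositions described above might be modelled roughly as follows. This is a hypothetical sketch: the class names, fields, and the example sentence are illustrative assumptions, not the book's own formalism.

```python
from dataclasses import dataclass
from enum import Enum

class PropositionType(Enum):
    # The four conceptual types posited by the authors.
    FACT = "fact"
    OPINION = "opinion"
    EVALUATION = "evaluation"        # emotional / expressive aspect
    EXPRESSION_OF_WILL = "will"      # persuasive speech acts

@dataclass
class Proposition:
    attribute: str                   # the linking attribute (or the single word)
    objects: list                    # objects linked by the attribute
    ptype: PropositionType
    implicit: bool = False           # True if expressed by a single word

def make_explicit(prop: Proposition, recovered_objects: list) -> Proposition:
    """Bring out the missing objects of an implicit proposition."""
    if not prop.implicit:
        return prop
    return Proposition(prop.attribute, prop.objects + recovered_objects,
                       prop.ptype, implicit=False)

# An implicit proposition ("Fire!") made explicit with a recovered object.
p = Proposition("fire", [], PropositionType.FACT, implicit=True)
q = make_explicit(p, ["the building"])
```

Relations between propositions (local and distant) would then be edges over such `Proposition` nodes, yielding the rhetorical structure.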


The goal of communicative analysis is to transform the Subject-Predicate-Objects-Adverbials structure of individual sentences into a metalinguistic information structure. In the latter, two basic data types are identified: objects and attributes. Attributes are linked to a single object or establish connections between a number of objects (in which case they can also be called predicates). An attribute together with its object(s) forms a proposition. In the information structure, objects can be related to one another either directly or via attributes. Attributes are also interrelated by temporal, causal, and other links. There is no direct correspondence between the constituents of the Subject-Predicate-Objects-Adverbials structure and the components of the information structure; e.g. the subject of a sentence does not necessarily become an object in the information structure. The analysis of successive sentences requires keeping track of objects and attributes, which brings in the problem of reference. Information on space and time should also be accounted for and represented in the overall structure. As no restrictions are imposed on the type of the analyzed texts, the chapter summarizes the distinctive features of various text types. However, these prove insufficient for type identification. The situation is further complicated by their intricate combinations (narration can alternate with description, persuasion can enter into narration, etc.). This chapter is a pilot study of quite a number of issues central to natural language processing that as yet lack attention and satisfactory solutions.
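The object/attribute distinction, the links between attributes, and the reference-tracking requirement can be sketched as a small data structure. The class and method names below are assumptions for illustration only, not the analyzer's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Obj:
    name: str

@dataclass
class Attribute:
    label: str
    objects: list                                # several objects -> acts as a predicate
    links: list = field(default_factory=list)    # temporal, causal, ... links to attributes

class InformationStructure:
    def __init__(self):
        self.objects = {}      # name -> Obj: tracks referents across sentences
        self.attributes = []

    def get_object(self, name):
        # Reuse the stored object when the same referent recurs (the reference problem).
        return self.objects.setdefault(name, Obj(name))

    def add_proposition(self, label, object_names):
        attr = Attribute(label, [self.get_object(n) for n in object_names])
        self.attributes.append(attr)
        return attr

# Two successive sentences sharing a referent, plus a temporal link.
info = InformationStructure()
a1 = info.add_proposition("bought", ["Ivan", "house"])
a2 = info.add_proposition("moved_in", ["Ivan"])
a2.links.append(("after", a1))
```

Note that nothing here is derived positionally from Subject-Predicate-Objects-Adverbials: which constituents become objects is decided by the analysis, mirroring the lack of direct correspondence noted above.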


Graphematical analysis marks the first stage of text processing. Prior to it, however, basic text structuring takes place, resulting in the identification of paragraphs and their types, e.g. title, subtitle, author name(s), chapter and section titles, footnotes, endnotes, figures, appendices, epigraphs, etc. After that, graphematical analysis proper begins. Its aim is to decompose the flow of letter and non-letter graphemes into character strings such as individual words, abbreviations, numbers, and hybrid strings (e.g. mathematical formulae). The procedure implies an iterative process of unit assembling, from individual characters to what are called atoms, then to tokens (roughly equivalent to word occurrences), sentence parts, and finally a whole sentence. At every stage, each unit is assigned its type. Assembling relies on rules based solely on a thorough structural analysis of the context. No formal models or statistical methods are applied; this is a central principle of the linguistic analyzer, inherent in all its algorithms. At this stage, complications arise primarily from the ambiguity of punctuation marks, which is discussed at length throughout the chapter.
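The staged, purely rule-based assembly (characters → typed atoms → sentences) might look like the following minimal sketch. The specific token type names and regular expressions are illustrative assumptions; the book's own rules are far more elaborate, in particular around the ambiguity of the period.

```python
import re

# Ordered, hand-written rules -- no statistical models, as the chapter stresses.
ATOM_RULES = [
    ("number", re.compile(r"\d+(?:[.,]\d+)*")),
    ("word",   re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")),
    ("punct",  re.compile(r'[.,;:!?()"«»—-]')),
]

def assemble_atoms(text):
    """Decompose the character flow into typed atoms (words, numbers, punctuation)."""
    atoms, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for ttype, rx in ATOM_RULES:
            m = rx.match(text, i)
            if m:
                atoms.append((ttype, m.group()))
                i = m.end()
                break
        else:
            atoms.append(("unknown", text[i]))   # hybrid strings need extra rules
            i += 1
    return atoms

def assemble_sentences(atoms):
    """Group atoms into sentences; a real rule set must first resolve whether
    '.' ends a sentence or belongs to an abbreviation."""
    sentences, current = [], []
    for ttype, s in atoms:
        current.append((ttype, s))
        if ttype == "punct" and s in ".!?":      # naive end-of-sentence rule
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

atoms = assemble_atoms("The price rose 3,5 %. Then it fell.")
sents = assemble_sentences(atoms)
```

The iterative shape of the code (local matches first, larger units assembled from them) is the point; the rules themselves would be replaced by the analyzer's contextual structural rules.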


At this stage, a semantic-syntactic dictionary is activated in order to supply the letter tokens with information concerning their contextual realization and meaning proper. The list of features assigned to a word form is thus augmented by semantic and syntactic ones. Contextual realization is specified by means of subcategorization frames. Meaning is described in terms of semantic features, which may or may not be hierarchically arranged. Among them, basic features are identified, e.g. ‘living being’, ‘action’, ‘space’, ‘time’, ‘quantity’, which have no direct bearing on the parts of speech. Thus, the feature ‘action’ can be assigned both to verbs and to nouns. Additional semantic features make it possible to account for manifold aspects of word meaning. Thus, horse is marked both as ‘domestic animal’ and as ‘vehicle’, which would be impossible in the case of a strict hierarchy of features. The algorithm of syntactic analysis draws on the same basic idea as that of graphematical analysis, viz. iterative assembling of individual tokens into syntactic constructions, from local groups to larger ones, until the syntactic structure of a whole sentence, consisting of Subject-Predicate-Objects-Adverbials, emerges. Syntactic rules are ranked accordingly. The chapter illustrates the work of the algorithm in a variety of cases and discusses the multiple difficulties resulting from syntactic ambiguity.
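A dictionary entry carrying non-hierarchical semantic feature sets and a subcategorization frame could be shaped as below. The entry format, feature labels, and the checking function are hypothetical; only the horse example ('domestic animal' and 'vehicle' at once) comes from the chapter.

```python
# Semantic features form a plain set, not a strict hierarchy, so 'horse'
# can carry both 'domestic animal' and 'vehicle' simultaneously.
LEXICON = {
    "horse": {
        "pos": "noun",
        "semantic_features": {"living being", "domestic animal", "vehicle"},
        "subcat": [],
    },
    "book": {
        "pos": "noun",
        "semantic_features": {"artefact"},
        "subcat": [],
    },
    "ride": {
        "pos": "verb",
        "semantic_features": {"action"},   # 'action' is not tied to part of speech
        # Subcategorization frame: the object slot must denote a vehicle.
        "subcat": [{"role": "object", "requires": {"vehicle"}}],
    },
}

def frame_satisfied(verb, arg):
    """Check a verb's subcategorization frame against a candidate argument."""
    for slot in LEXICON[verb]["subcat"]:
        if not slot["requires"] <= LEXICON[arg]["semantic_features"]:
            return False
    return True
```

During iterative syntactic assembling, such frame checks would license or block the attachment of a local group to a larger construction, which is one way semantic features help resolve syntactic ambiguity.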


The aim of this step is to check the previously assigned token types and to provide the letter tokens with morphological information, i.e. the values of the relevant grammatical categories. As far as word forms are concerned, the procedure is normally called morphological analysis. However, since a text may contain other token types (e.g. abbreviations, formulae, Internet hyperlinks, phone numbers), it is generally referred to as token attribution. In the chapter, a wide range of token types is considered from the processing viewpoint. In particular, the analysis of letter tokens presupposes a search in a number of dictionaries. Apart from a regular Russian morphological dictionary, the search is also performed in dictionaries of abbreviations, fixed phrases, and proper names (personal, geographical, etc.). Morphological analysis often yields more than a single attribution. In some cases, the ambiguity can be reduced by taking graphematical information into account, but most often it remains and poses further problems for the syntactic analysis. If the search for a letter token fails in all the dictionaries, the algorithm tries to identify its lemma and predict its grammatical meaning. The attribution of other token types is performed by mapping them onto a range of patterns. Typical problems of both operations are discussed.
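The lookup cascade (dictionaries first, then pattern mapping, then prediction) might be sketched as follows. The toy dictionaries, the patterns, and the function are assumptions for illustration; note how a word form can legitimately receive several analyses, which is exactly the ambiguity passed on to syntactic analysis.

```python
import re

# Pattern mapping for non-word token types (the patterns are illustrative guesses).
TOKEN_PATTERNS = [
    ("hyperlink", re.compile(r"https?://\S+$")),
    ("phone",     re.compile(r"\+?\d[\d\- ]{6,}\d$")),
]

# Toy stand-ins for the morphological and special dictionaries.
# 'коня' is both accusative and genitive singular of 'конь' (horse).
MORPH_DICT = {"коня": [("конь", "noun, masc, sg, acc"),
                       ("конь", "noun, masc, sg, gen")]}
ABBREV_DICT = {"т.е.": "that is"}

def attribute_token(token):
    """Attribute a token: dictionary search first, then patterns, then prediction."""
    if token in MORPH_DICT:
        return ("word", MORPH_DICT[token])     # may be ambiguous: several analyses
    if token in ABBREV_DICT:
        return ("abbreviation", ABBREV_DICT[token])
    for ttype, rx in TOKEN_PATTERNS:
        if rx.match(token):
            return (ttype, None)
    return ("unknown-word", None)              # lemma/grammar prediction would go here
```

The final branch stands in for the prediction step the chapter mentions: when all dictionary searches fail, the analyzer hypothesizes a lemma and grammatical meaning from the token's shape.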

