Spoken Natural Language Dialog Systems
Published by Oxford University Press
ISBN: 9780195091878, 9780197560686

Author(s): Ronnie W. Smith, D. Richard Hipp

Consider the following dialog situation. The computer is providing a human user with assistance in fixing an electronic circuit that causes a Light Emitting Diode (LED) to display in a certain way. The current focus of the task and dialog is to determine the status of a wire between labeled connectors 84 and 99, a wire that the circuit requires but that is currently absent. Figures 3.1 and 3.2 show two possible dialog interactions that could occur at this point. In figure 3.1, the computer has total dialog control, and a total of 29 utterances are needed to guide the user through the rest of the dialog. In figure 3.2, the human user has overall dialog control, but the computer is allowed to provide direct assistance as needed (i.e., in helping add the wire). Only 11 utterances are needed for the experienced user to complete the dialog. These samples are from interactions with a working spoken natural language dialog system. To engage in such dialog interactions, a system must exhibit the behaviors mentioned at the beginning of chapter 1: (1) problem solving for providing task assistance, (2) conducting subdialogs to achieve appropriate subgoals, (3) exploiting a user model to enable useful interactions, (4) exploiting context-dependent expectations when interpreting user inputs, and (5) engaging in variable initiative dialogs. Achieving these behaviors while facilitating the measurement of system performance via experimental interaction requires a theory of dialog processing that integrates the following subtheories.

• An abstract model of interactive task processing.
• A theory about the purpose of language within the interactive task processing environment.
• A theory of user model usage.
• A theory of contextual interpretation.
• A theory of variable initiative dialog.

This chapter presents such a theory of dialog processing. Frequent reference to the dialog examples in figures 3.1 and 3.2 will guide the discussion.
The first section discusses the overall system architecture that facilitates integrated dialog processing. The remainder of the chapter addresses each subtheory in turn, emphasizing how each fits into the overall architecture. The chapter concludes with a summary description of the integrated model.


Author(s): Ronnie W. Smith, D. Richard Hipp

Without development of an actual working system it is impossible to empirically validate the proposed computational model. Thus, the architecture introduced in section 3.1 has been implemented on a Sun 4 workstation and later ported to a Sparc II workstation. The majority of the code is written in Quintus Prolog, while the parser is written in C. The system software is available via anonymous FTP as described in appendix C. The overall hardware configuration is illustrated in figure 6.1. Speech recognition is performed by a Verbex 6000 user-dependent connected-speech recognizer running on an IBM PC. The vocabulary is currently restricted to the 125 words given in table 7.1. Users are required to begin each utterance with the word “verbie” and end with the word “over” (e.g., “verbie, the switch is up, over”). The Verbex speech recognizer acknowledges each input with a small beep. These sentinel interactions act as a synchronization mechanism for the user and the machine. Speech output is performed by a DECtalk DTC01 text-to-speech converter. This chapter discusses the following technical aspects of the implementation.

• The various knowledge representation formalisms.
• The implemented domain processor, an expert system for assisting in simple circuit repair.
• The implemented generation component.
• The basic physical resource utilization of the system.

The basis for the implementation has been the logic programming language Prolog. Clocksin and Mellish [CM87] provide an introduction to this language. Pereira and Shieber [PS87] and McCord [McC87] can be consulted for a discussion of the usage of Prolog for natural language analysis. Prolog allows the expression of rules and facts in a subset of first-order logic called Horn clauses.
Prolog is also supplemented with non-logical features that aid efficient computation, but as a representational formalism, its ability to express rules and facts in a declarative format provides the basis for representing knowledge and rules within the model. The Goal and Action Description Language, introduced in section 3.2.2 and described in detail in appendix A, is used as a standard formalism for representing goals that may be accomplished during a task.
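The Horn-clause style of representation described above can be illustrated with a minimal sketch. This is not GADL or the system's actual Prolog code: the predicates, facts, and rules below are invented for illustration, and Python stands in for Prolog.

```python
# Illustration only: a tiny backward-chaining prover over ground Horn
# clauses, mimicking how Prolog rules and facts can represent task
# knowledge. The predicates below are invented, not taken from GADL.

FACTS = {("observed", "led_off"), ("connected", "84", "99")}

RULES = [
    # (head, body): the head holds if every subgoal in the body holds
    (("wire_ok", "84", "99"), [("connected", "84", "99")]),
    (("circuit_ready",), [("wire_ok", "84", "99"), ("observed", "led_off")]),
]

def prove(goal):
    """Return True if `goal` follows from FACTS via RULES (ground case only)."""
    if goal in FACTS:
        return True
    return any(head == goal and all(prove(sub) for sub in body)
               for head, body in RULES)
```

Real Prolog adds unification over variables, which this ground-term sketch omits.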


Author(s): Ronnie W. Smith, D. Richard Hipp

This chapter describes the computational model that has evolved from the theory of integrated dialog processing presented in the previous chapter. The organization of this chapter follows.

1. A high-level description of the basic dialog processing algorithm.
2. A detailed discussion of the major steps of the algorithm.
3. A concluding critique that evaluates the model’s effectiveness at handling several fundamental problems in dialog processing.

The system software that implements this model is available via anonymous FTP. Details on obtaining the software are given in appendix C. Figure 4.1 describes the basic steps of the overall dialog processing algorithm that is executed by the dialog controller. By necessity, this description is at a very high level, but specifics will be given in subsequent sections. The motivation for these steps is presented below. Since the computer is providing task assistance, an important part of the algorithm must be the selection of a task step to accomplish (steps 1 and 2). Because the characterization of task steps is a function of the domain processor, the dialog controller must receive recommendations from the domain processor during the selection process (step 1). However, since a dialog may have arbitrary suspensions and resumptions of subdialogs, the dialog controller cannot blindly select the domain processor’s recommendation. The relationship of the recommended task step to the dialog as well as the dialog status must be considered before the selection can be made (step 2). Once a task step is selected, the dialog controller must use the general reasoning facility (i.e., the interruptible theorem prover, IPSIM) in step 3 to determine when the task step is accomplished.
Whenever the theorem prover cannot continue due to a missing axiom, the dialog controller uses available knowledge about linguistic realizations of utterances in order to communicate a contextually appropriate utterance as well as to compute expectations for the response. After the response is received and its relationship to the missing axiom determined, the dialog controller must decide how to continue the task step completion process.
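The missing-axiom control cycle described above can be sketched in miniature. This is an invented toy, not the IPSIM theorem prover: the goal names are hypothetical, and the `oracle` callback stands in for the spoken question-and-answer exchange with the user.

```python
# Toy missing-axiom loop: prove a goal; when a subgoal cannot be
# established, turn it into a question, and let the answer become a
# new axiom before resuming the proof. Names are invented.

def prove(goal, axioms, rules):
    """Return None if `goal` is proven, else the missing axiom blocking it."""
    if goal in axioms:
        return None
    blocker = goal  # with no applicable rule, the goal itself is missing
    for body in rules.get(goal, ()):
        for sub in body:
            missing = prove(sub, axioms, rules)
            if missing is not None:
                blocker = missing
                break
        else:
            return None  # every subgoal of this rule body was proven
    return blocker

def controller(goal, rules, oracle):
    """Prove `goal`; whenever an axiom is missing, ask and add the answer."""
    axioms, transcript = set(), []
    while (missing := prove(goal, axioms, rules)) is not None:
        transcript.append(f"computer: is it true that {missing}?")
        if oracle(missing):  # stand-in for the spoken yes/no exchange
            axioms.add(missing)
            transcript.append("user: yes")
        else:
            transcript.append("user: no")
            return transcript, False  # cannot complete this task step
    return transcript, True
```

Each loop iteration either finishes the proof or adds one axiom, so the cycle terminates.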


Author(s): Ronnie W. Smith, D. Richard Hipp

Building a working spoken natural language dialog system is a complex challenge. It requires the integration of solutions to many of the important subproblems of natural language processing. This chapter discusses the foundations for a theory of integrated dialog processing, highlighting previous research efforts. The traditional approach in AI to problem solving has been the planning of a complete solution. We claim that the interactive environment, especially one with variable initiative, renders such a strategy inadequate. A user with the initiative may not perform the task steps in the same order as those planned by the computer. The user may even perform a different set of steps. Furthermore, there is always the possibility of miscommunication. Regardless of the source of complexity, the previously developed solution plan may be rendered unusable and must be redeveloped. This is noted by Korf [Kor87]:

. . . Ideally, the term planning applies to problem solving in a real-world environment where the agent may not have complete information about the world or cannot completely predict the effects of its actions. In that case, the agent goes through several iterations of planning a solution, executing the plan, and then replanning based on the perceived result of the solution. Most of the literature on planning, however, deals with problem solving with perfect information and prediction. . . .

Wilkins [Wil84] also acknowledges this problem:

. . . In real-world domains, things do not always proceed as planned. Therefore, it is desirable to develop better execution-monitoring techniques and better capabilities to replan when things do not go as expected. This may involve planning for tests to verify that things are indeed going as expected. . . . The problem of replanning is also critical. In complex domains it becomes increasingly important to use as much as possible of the old plan, rather than to start all over when things go wrong. . . .
Consequently, Wilkins adopts the strategy of producing a complete plan and revising it rather than reasoning in an incremental fashion.
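The plan-execute-monitor-replan cycle that Korf describes can be sketched in miniature. The toy domain below (advancing a counter toward a target, with steps that occasionally fail) is invented; the point is only the control structure: replan from the observed state whenever a step does not have its expected effect.

```python
# Toy execute-monitor-replan loop; the counter domain is invented.

def plan(state, target):
    """Trivial 'planner': one +1 step for each unit of remaining distance."""
    return ["inc"] * (target - state)

def solve(start, target, slips):
    """`slips`: states at which the next step fails once (world misbehaves)."""
    state, replans = start, 0
    while state != target:
        replans += 1                      # replan from the observed state
        for _step in plan(state, target):
            new = state if state in slips else state + 1
            slips.discard(state)          # each failure is observed only once
            if new == state:
                break                     # step had no effect: stop and replan
            state = new
    return state, replans
```

With perfect information the plan executes in one pass; each unexpected outcome costs exactly one extra planning round.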


Author(s): Ronnie W. Smith, D. Richard Hipp

Every natural language parser will sometimes misunderstand its input. Misunderstandings can arise from speech recognition errors or inadequacies in the language grammar, or they may result from an input that is ungrammatical or ambiguous. Whatever their cause, misunderstandings can jeopardize the success of the larger system of which the parser is a component. For this reason, it is important to keep the number of misunderstandings to a minimum. In a dialog system, it is possible to reduce the number of misunderstandings by requiring the user to verify each utterance. Some speech dialog systems implement verification by requiring the user to speak every utterance twice, or to confirm a word-by-word readback of every utterance. Such verification is effective at reducing errors that result from word misrecognitions, but does nothing to abate misunderstandings that result from other causes. Furthermore, verification of all utterances can be needlessly wearisome to the user, especially if the system is working well. A superior approach is to have the spoken language system verify the deduced meaning of an input only under circumstances where the accuracy of the deduced meaning is seriously in doubt, or where correct understanding is essential to the success of the dialog. The verification is accomplished through the use of a verification subdialog: a short sequence of conversational exchanges intended to confirm or reject the hypothesized meaning. The following example of a verification subdialog illustrates the idea.

. . .
computer: What is the LED displaying?
user: The same thing.
computer: Did you mean to say that the LED is displaying the same thing?
user: Yes.
. . .

As will be further seen below, selective verification via a subdialog results in an unintrusive, human-like exchange between user and machine.
A recent enhancement to the Circuit Fix-it Shop dialog system is a subsystem that uses a verification subdialog to verify the meaning of the user’s utterance only when the meaning is in doubt or when accuracy is critical for the success of the dialog. Notable features of this new verification subsystem include the following.
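The selective policy described above can be sketched as a small decision procedure. The confidence threshold, parameter names, and question template below are invented for illustration; the book's verification subsystem is not specified in these exact terms.

```python
# Sketch of selective verification: verify a hypothesized meaning only
# when it is in doubt or when accuracy is critical. The threshold and
# names are invented, not taken from the Circuit Fix-it Shop system.

CONFIDENCE_THRESHOLD = 0.7  # assumed tuning parameter, not from the book

def needs_verification(confidence, critical):
    """Verify only when the meaning is in doubt or accuracy is critical."""
    return confidence < CONFIDENCE_THRESHOLD or critical

def verify(meaning, confidence, critical, confirm):
    """Return (accepted meaning or None, verification-subdialog transcript)."""
    if not needs_verification(confidence, critical):
        return meaning, []  # accept silently: no subdialog needed
    question = f"Did you mean to say that {meaning}?"
    if confirm(question):   # stand-in for the spoken yes/no exchange
        return meaning, [question, "yes"]
    return None, [question, "no"]
```

A high-confidence, non-critical utterance thus passes through with no extra turns, which is what keeps the policy unintrusive.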


Author(s): Ronnie W. Smith, D. Richard Hipp

The results of the experiments reported in the previous chapter were independently analyzed in order to measure the accuracy of the speech recognizer and parser. A summary of the key results of this analysis follows.

• On average, one out of every three words emitted by the speech recognizer was in error.
• On average, three out of every four sentences contained one or more speech recognition errors.
• In spite of the high speech recognition error rate, the meanings of the spoken utterances were correctly deduced 83% of the time.
• Finally, and perhaps most surprisingly, it was found that dialog expectation was helpful only as a tie-breaker in deducing the correct meaning of spoken utterances.

The remainder of this chapter describes the analysis in detail. The performance measurements of the speech recognizer and parser were computed from transcripts of 2804 individual utterances taken from the second and third sessions of the 8 experimental subjects. No information from the pilot subjects or from the first session with each subject was used in this analysis. Information about each utterance was collected and converted to a standardized, machine-readable format. The information that was collected follows.

• The sequence of words actually spoken by the user. These were manually entered by the experimenters based on the audio recordings of the experiment.
• The sequence of words recognized by the speech recognizer. This information was recorded automatically during the experiments.
• The set of up to K minimum matching strings between elements of the hypothesis set and dialog expectations, together with an utterance cost and an expectation cost for each. (See section 5.8.5.)
• The final output of the parser.
• The text spoken by the dialog controller immediately prior to the user’s utterance, and notes concerning the user’s utterance which were entered by the person who transcribed the utterance from the audio tapes.
This information was used to assist in manually judging the correctness of each parse. . . . After the above information was collected and carefully audited to remove errors, the following additional features of each utterance were computed through a combination of automatic and manual processing.
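Word-level and sentence-level recognition error rates of the kind summarized above are conventionally computed from such spoken/recognized transcript pairs via minimum edit distance. A hedged sketch follows; the sample transcripts used here are invented, and this is not the study's actual analysis code.

```python
# Illustration: word error rate via Levenshtein distance over word
# lists, plus the fraction of sentences with at least one error.

def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    turning word list `ref` into word list `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def error_rates(pairs):
    """pairs: (spoken_words, recognized_words) per utterance.
    Returns (word error rate, sentence error rate)."""
    errors = words = bad_sentences = 0
    for ref, hyp in pairs:
        d = edit_distance(ref, hyp)
        errors += d
        words += len(ref)
        bad_sentences += d > 0
    return errors / words, bad_sentences / len(pairs)
```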


Author(s): Ronnie W. Smith, D. Richard Hipp

Both the user of the system and the dialog controller communicate using language. The user’s language consists of strings of English words ordered according to the rules of English syntax. The dialog controller’s language is made from strings of ground logic terms, variables, and punctuation symbols, all connected according to the syntax of Prolog expressions. This chapter describes the design of a parser whose task is to translate strings from the user’s language into strings with approximately the same meaning in the dialog controller’s language. Some difficulties encountered by the parser follow.

1. Because of the less than flawless performance of speech recognizers, the parser will not know exactly what the user has said. Instead, the parser will receive as input one or more estimates of what was spoken, none of which may be absolutely correct.
2. Even what was spoken might not be what the user was thinking. The user may have mispronounced part of the utterance, or may make simple grammatical errors in the utterance.
3. The user may deliberately omit small structure words, such as “the” and “of”, from what is said. This is a natural and subconscious response of native speakers when speaking to a machine that has less than perfect language skills.
4. The parser’s grammar of the English language is probably not identical to the user’s. Thus, even without recognition errors, mispronunciations, or omitted words, the input to the parser may not be syntactically well-formed.
5. The mapping from the user’s language to the dialog controller’s language is not one-to-one. Many inputs will have multiple meanings, and will thus need to be translated into two or more outputs. Conversely, a particular output may result from several syntactically dissimilar inputs.
Together with [EM81], [HHCT86], [HM81], and [YHW+90], we argue that due to the above problems, the traditional natural language parsing techniques that accept written or typed input are not adequate for systems that accept speech input. New parsing architectures are required to deal with the high level of uncertainty that is inherent in speech input.
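One simple way such uncertainty can be handled, in the spirit of the utterance-cost and expectation-cost scoring mentioned in the error analysis chapter, is to rank competing recognizer hypotheses by a combined cost in which expected meanings pay no penalty. This is an invented sketch, not the parser's actual scoring function; the names, costs, and weight are illustrative only.

```python
# Invented sketch of expectation-driven hypothesis selection: each
# hypothesis carries an utterance (acoustic/parse) cost, and meanings
# outside the current dialog expectations pay an extra penalty.

def best_hypothesis(hypotheses, expectations, expectation_weight=1.0):
    """hypotheses: list of (meaning, utterance_cost) pairs.
    Returns the meaning with the lowest combined cost."""
    def total_cost(item):
        meaning, cost = item
        penalty = 0.0 if meaning in expectations else expectation_weight
        return cost + penalty
    return min(hypotheses, key=total_cost)[0]
```

With equal utterance costs the expectation acts purely as a tie-breaker, which matches the behavior reported in the error analysis.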


Author(s): Ronnie W. Smith, D. Richard Hipp

This book has presented a computational model for integrated dialog processing. The primary contributions of this research follow.

• A mechanism (the Missing Axiom Theory) for integrating subtheories that each address an independently studied subproblem of dialog processing (i.e., interactive task processing, the role of language, user modeling, and exploiting dialog expectation for contextual interpretation and plan recognition).
• A computational theory for variable initiative behavior that enables a system to vary its responses at any given moment according to its level of initiative.
• Detailed experimental results from the usage of a spoken natural language dialog system that illustrate the viability of the theory and identify behavioral differences of users as a function of their experience and initiative level.

This chapter provides a concluding critique, which identifies areas of ongoing work and offers some advice for readers interested in developing their own spoken natural language dialog systems. This section describes important issues we did not successfully address in this research because either (1) we studied the problem but do not as yet have a satisfactory answer, or (2) it was not necessary to investigate the problem for the current system. Regardless of the reason, incorporating solutions to these problems is needed to strengthen the overall model. In section 4.7.3 we have already discussed the difficulties in determining when and how to change the level of initiative during a dialog as well as the problems in maintaining coherence when such a change occurs. Ongoing work in this area is being conducted by Guinn [Gui93].
His model for setting the initiative is based on the idea of “evaluating which participant is better capable of directing the solution of a goal by an examination of the user models of the two participants.” He provides a formula for estimating the competence of a dialog participant based on a probabilistic model of the participant’s knowledge about the domain. Using this formula, Guinn has conducted extensive experimental simulations testing four different methods of selecting initiative.
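The competence-based selection idea might be sketched roughly as follows. The competence estimate used here (the mean of per-fact knowledge probabilities from a user model) is an invented stand-in for Guinn's actual formula, which is not reproduced in this summary.

```python
# Rough, invented sketch: give the initiative for the current goal to
# whichever participant the user models estimate as more competent.
# The averaging formula is a placeholder, not Guinn's.

def competence(knowledge_probs):
    """knowledge_probs: estimated P(participant knows fact) for each
    fact relevant to the current goal."""
    return sum(knowledge_probs) / len(knowledge_probs)

def select_initiative(computer_probs, user_probs):
    """Return which participant should direct the solution of the goal."""
    if competence(computer_probs) >= competence(user_probs):
        return "computer"
    return "user"
```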


Author(s): Ronnie W. Smith, D. Richard Hipp

One of the main goals of this research was to develop a computational model that could be implemented and tested. Testing could serve at least two purposes: (1) demonstrate the viability of the Missing Axiom Theory for dialog processing, and (2) determine the ways that varying levels of dialog control influence the interaction between user and computer. Consequently, an experiment involving use of the system was constructed to test the effects of different levels of dialog control. The format and results of this experiment are reported in this chapter. The following hypotheses describe expected differences in user performance as users gain experience and have the initiative.

• Task completion time will decrease.
• The number of utterances per dialog will decrease.
• The percentage of “non-trivial” utterances will increase (a non-trivial utterance is any utterance longer than one word).
• The average length of a non-trivial utterance will increase.
• The rate of speech (number of utterances per minute) will decrease.

These hypotheses are consistent with the intuition that as the user has more initiative, the user will put more thought into the process, reducing the rate of interaction. In addition, it is expected that when the user has more initiative, there would be an attempt to convey more detailed information in each non-trivial utterance. Finally, it is also believed that increased user initiative will be more helpful when the user gains experience and has more knowledge about performing the task independent of computer guidance. Two graduate students in computer science volunteered to use the system. Each subject received about 75 minutes of training on the speech recognizer with the 125-word vocabulary. The subjects then participated in three sessions on differing days. Each session consisted of four different problems where each problem consisted of a single missing wire. The results from these subjects tended to support our hypotheses.
However, the experimental control for this testing was not well-defined. The two subjects are involved in AI and NL research and consequently have strong preconceptions about NL systems and what constitutes “proper” behavior toward such systems.
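The per-dialog measures hypothesized above are straightforward to compute from a transcript. A sketch follows, with an invented sample dialog; the helper name and dictionary keys are illustrative and do not come from the experimental analysis code.

```python
# Illustration: compute the per-dialog measures named in the
# hypotheses (utterance count, share and average length of
# non-trivial utterances, utterances per minute). Names invented.

def dialog_measures(utterances, minutes):
    """utterances: list of utterance strings; minutes: dialog duration."""
    nontrivial = [u for u in utterances if len(u.split()) > 1]
    avg_len = (sum(len(u.split()) for u in nontrivial) / len(nontrivial)
               if nontrivial else 0.0)
    return {
        "utterances": len(utterances),
        "pct_nontrivial": len(nontrivial) / len(utterances),
        "avg_nontrivial_len": avg_len,
        "utterances_per_minute": len(utterances) / minutes,
    }
```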


Author(s): Ronnie W. Smith, D. Richard Hipp

The most sophisticated and efficient means of communication between humans is spoken natural language (NL). It is a rare circumstance when two people choose to communicate via another means when spoken natural language is possible. Ochsman and Chapanis [OC74] conducted a study involving two-person teams solving various problems using restricted means of communication such as typewriting and video, typewriting only, handwriting and video, voice and video, voice only, etc. Their conclusion included the following statement.

. . . The single most important decision in the design of a telecommunications link should center around the inclusion of a voice channel. In the solution of factual real-world problems, little else seems to make a demonstrable difference . . .

Thus, it would seem desirable to develop computer systems that can also communicate with humans via spoken natural language dialog. Furthermore, recent reports from the research community in speech recognition [Adv93] indicate that accuracy levels in speaker-independent continuous speech recognition have reached a threshold where practical applications of spoken natural language are viable. This book addresses the dialog issues that must be resolved in building effective spoken natural language dialog systems: systems where both the human and computer interact via spoken natural language. We present an architecture for dialog processing for which an implementation in the equipment repair domain has been constructed that exhibits a number of behaviors required for efficient human-machine dialog. These behaviors include the following.

• Problem solving to achieve a target goal.
• The ability to carry out subdialogs to achieve appropriate subgoals and to pass control arbitrarily from one subdialog to another.
• The use of a user model to enable useful verbal exchanges and to inhibit unnecessary ones.
• The ability to use context-dependent expectations to correct speech recognition and track user movement to new subdialogs.
• The ability to vary the task/dialog initiative from strongly computer controlled to strongly user controlled or somewhere in between.

