There has been much talk about synthetic human

Introduction

An artificially intelligent agent equipped with natural speech capabilities, such as “Hal,” the computer character from 2001: A Space Odyssey, does not seem far-fetched when we consider how the field of linguistics, with its wide spectrum of methods for studying interactive speech, provides the building blocks for spoken language systems that simulate human dialog. But how do we get from enticing visions of talking robots to the realistic production of such simulacra?

Before we can achieve fully interactive speech interfaces that can simulate human discourse, we must marshal the resources of a variety of different disciplines.

The need for sound linguistic methods has become ever greater as voice-enabled technology is used more and more in mobile devices, automotive applications, interactive toys and smart-home controls, among other uses. The increasingly sophisticated applications of voice-enabled products and services that allow users to speak naturally, rather than to follow a scripted text of menu choices, make it imperative for software engineers to draw from the rich field of linguistics to make these systems live up to their potential.

The Disciplinary Divide

Recently, I discussed two linguistic methods that are often at variance with one another in their approach to the design of spoken language systems: computational linguistics and conversation analysis. Whereas computational linguistics focuses on grammatical discourse structure, conversation analysis focuses on social action and its associated interactional features.

Some experts argue for the necessity of using conversation analytic principles in the design of spoken language systems because computational linguists tend to focus on aspects of conversational organization in the abstract, and their methods tend not to be empirical. These experts conclude that it is necessary to know how everyday conversational interaction is organized before a computer system can be designed that either simulates or reproduces the essence of human communication.

However, applying conversation analytic research findings in the design of speech interfaces is not without its difficulties. Rules operating in conversation are not givens, nor are they finite. They cannot be very easily codified or reduced to an algorithm. Instead, conversation rules are a resource that speakers discover very easily. This is why there are conversation analysts who are fundamentally opposed to deriving programming rules from conversation analytic findings. They contend that while a finite set of rules might be convenient for computational linguistics, in actual practice conversational speech is not constrained in that way.

Even so, computational linguists are already using bits and pieces of conversation analytic findings, such as the turn-taking model for the allocation of speakership rights. This illustrates the need for conversation analysis in building dialog systems in spite of its inherent difficulties. However, without the assistance of trained conversation analysts, this piecemeal use of conversation analytic research findings by computational linguists may not be adequate for the design of an artificially intelligent, speech-enabled computer with human-like speech recognition accuracy and communication abilities. Thus, it may be the disciplinary conflicts within the broad field of linguistics, rather than the inadequacy of the methods available to us, that impede progress in the design of truly intelligent devices that use natural language understanding.

Against this current of disciplinary conflicts, there have nevertheless been some clarion calls over the years for collaboration between computational linguists and conversation analysts. Some have argued that interactional and linguistic concerns have to be mutually addressed in computational linguistics. Interactional demands simply cannot be ignored in spoken language.

Even among the skeptics (conversation analysts who believe that natural language systems cannot simulate human dialog), there are those who strongly encourage speech interface designers to incorporate critical features of the turn-taking model (e.g., turn transition relevance). These are necessary first steps toward an open, collaborative relationship between computational linguists and conversation analysts, and such steps lead to more interactive conversational interfaces. But we will need more of this collaboration.

The New Frontier

As speech recognition software and speech synthesis improve over time, designing systems that simulate human dialog, and that might even one day be “human” enough to pass the Turing Test, becomes much less science fiction and more reality. By integrating the many disciplines and subdisciplines in linguistics, we can provide a rich corpus of knowledge to serve as the foundation for artificial agents that can perform human tasks.

Allow me to propose three new ways speech interfaces can more closely resemble human dialog via the application of conversation analytic research findings to the existing speech recognition systems built on computational linguistics.

1) Conversational Dialog: Use of natural speech rather than menu-driven voice applications that follow a scripted text. To take this technology to its next level, a system should someday be able to understand a speaker who does NOT use the appropriate key words, such as when a speaker attempts to make an airline reservation, becomes frustrated, but does NOT request transfer to a human “agent” or “operator.” At present, in the absence of those key words (and when there is no manual zero-out option), the system would not transfer a frustrated user to an agent for assistance. A system that would indeed understand the interactional signs of user frustration, based not on key words but on discrete patterns of sequentially organized conversational features that are consistent with frustration, would bring us closer to a real-life “Hal.”

2) Idiomatic Expressions: Non-literal words or phrases that are used for their symbolic meaning. Everyday language is punctuated by idioms. However, it is far too costly for a system to be equipped with enough intelligence about the world at large to grant accurate meaning to every idiom in the English language. Given that in everyday speech idioms are rarely taken literally but are rather granted their symbolic meaning, how do we get a speech interface to do likewise? The answer may lie, first and foremost, in the study of how and when idioms are used in interactive dialog. Then, algorithms may be formulated that capture common patterns of idiom usage so that a system may be able to spot an idiom by virtue of “how” and “where” it appears in interactive dialog.

3) Empathy: A display of understanding of what the speaker is trying to convey. True “Hal”-like features would someday include a display of proper empathy by an intelligent agent. When callers seek assistance from human-operated help-lines, they often show signs of needing empathy from the human agent. It is at such junctures in a help-line call that a human agent can be most useful by placating an irate caller through showing support for the legitimacy of the caller’s grievance. A natural dialog system that can algorithmically map out the common sequentially organized conversational features of a caller’s attempts to elicit empathy can more responsively handle a caller’s complaints. A system that is more responsive to the caller’s emotions is a system that can better simulate human dialog.
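
The first proposal above can be made a little more concrete with a minimal sketch in Python. It is only an illustration under stated assumptions: the Turn record, the barge_in flag, the word-overlap test for self-repetition, and the thresholds are all hypothetical choices made for this example, not findings from conversation analysis and not part of any existing speech system. The sketch flags a call for transfer when consecutive caller turns show signs of trouble, without consulting key words such as "agent" or "operator."

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    speaker: str            # "user" or "system"
    text: str
    barge_in: bool = False  # hypothetical flag: user spoke over the system prompt


def is_repeat(current: str, previous: str, threshold: float = 0.6) -> bool:
    """Rough self-repetition check: Jaccard word overlap between two user turns.

    The 0.6 threshold is an arbitrary illustrative value.
    """
    a = set(current.lower().split())
    b = set(previous.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def should_transfer(turns: List[Turn], min_run: int = 2) -> bool:
    """Return True once `min_run` consecutive user turns show a trouble sign.

    A trouble sign here is a barge-in or a near-repeat of the user's own
    previous turn. The decision depends on the sequence of turns, never on
    the presence of particular key words.
    """
    run = 0
    previous_user_text: Optional[str] = None
    for turn in turns:
        if turn.speaker != "user":
            continue
        repeated = previous_user_text is not None and is_repeat(
            turn.text, previous_user_text
        )
        run = run + 1 if (turn.barge_in or repeated) else 0
        if run >= min_run:
            return True
        previous_user_text = turn.text
    return False


# Illustrative call: the user never says "agent" or "operator".
call = [
    Turn("system", "Where would you like to fly?"),
    Turn("user", "a flight to Boston on Friday"),
    Turn("system", "Sorry, I did not catch that. Where would you like to fly?"),
    Turn("user", "a flight to Boston on Friday please"),
    Turn("system", "Sorry, I did not catch that."),
    Turn("user", "I said a flight to Boston on Friday", barge_in=True),
]
print(should_transfer(call))
```

A real system would need features grounded in the empirical study of recorded calls. The point of the sketch is only that the decision rests on how and where turns occur in sequence rather than on a vocabulary list.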

Amy Neustein, Ph.D.,

President and Founder

Linguistic Technology Systems

lingtec@banet.net

November 5, 2001 Copyright © 1998 - 2001 ejTalk
