An artificially intelligent agent equipped with natural speech
capabilities such as “Hal” – the computer character from 2001: A Space
Odyssey – does not seem far-fetched when we consider how the field of
linguistics, with its wide spectrum of methods to study interactive speech,
provides the building blocks for spoken language systems that simulate human
dialog.
But how do we get from enticing visions of talking robots to the
realistic production of such simulacra?
Before we can achieve fully interactive speech interfaces that can simulate human
discourse, we must marshal the resources of a variety of different disciplines.
The need for sound linguistic methods has become ever greater as voice-enabled
technology is used more and more in mobile devices, automotive applications,
interactive toys and smart-home controls, among other uses. The increasingly
sophisticated applications of voice-enabled products and services that allow
users to speak naturally, rather than to follow a scripted text of menu choices,
make it imperative for software engineers to draw from the rich field of
linguistics to make these systems live up to their potential.
Recently, I discussed two linguistic methods that are often at variance with one another in their approach to the design of spoken language systems: computational linguistics and conversation analysis. Whereas computational linguistics focuses on grammatical discourse structure, conversation analysis focuses on social action and its associated interactional features.
Some experts argue that conversation analytic principles are necessary in the design of spoken language systems because computational linguists tend to focus on aspects of conversational organization in the abstract, and their methods tend not to be empirical. These experts conclude that it is necessary to know how everyday (conversational) interaction is organized before a computer system can be designed that either simulates or reproduces the essence of human communication.
However, applying conversation analytic research findings in the design of speech
interfaces is not without its difficulties. Rules operating in
conversation are not givens, nor are they finite. They cannot be very easily
codified or reduced to an algorithm. Instead, conversation rules are a resource
that speakers discover very easily. This is why there are
conversation analysts who are fundamentally opposed to deriving programming
rules from conversation analytic findings. They contend that while a
finite set of rules might be nice for computational linguistics, in actual
practice, conversational speech is not constrained in that way.
Even so, computational linguists are already using bits and pieces of
conversation analytic findings, such as the turn-taking model for the allocation
of speakership rights. This illustrates the need for conversation
analysis in building dialog systems in spite of its inherent difficulties.
However, without the assistance of trained conversation analysts, this piecemeal
use of conversation analytic research findings by computational linguists may
not be adequate for the design of an artificially intelligent, speech-enabled
computer with human-like speech recognition accuracy and communication
abilities. Thus, it may be the disciplinary conflicts within the broad field of
linguistics, rather than the inadequacy of the methods available to us, that
impede progress in the design of truly intelligent devices that use natural
speech.
Despite this current of disciplinary conflicts, there have nevertheless been some
clarion calls over the years for collaboration between computational linguists
and conversation analysts. Some have argued that interactional and linguistic
concerns have to be mutually addressed in computational linguistics.
Interactional demands simply cannot be ignored in spoken language.
Even among the skeptics – conversation analysts who believe that natural language
systems cannot simulate human dialog – there are those who strongly encourage
speech interface designers to incorporate critical features of the turn-taking
model (e.g., turn transition relevance). These are necessary first steps toward
an open collaborative relationship between computational linguists and
conversation analysts, a relationship that can lead to more interactive
conversational interfaces. But we will need more of this collaboration.
As speech recognition software and speech synthesis become perfected over time,
designing systems that simulate human dialog – that might even one day be
“human” enough to pass the Turing Test – becomes much less science fiction
and more reality. By integrating the many disciplines and subdisciplines
in linguistics we can provide a rich corpus of knowledge to serve as the
foundation for artificial agents that can perform human tasks.
Allow me to propose three new ways speech interfaces can more closely resemble human
dialog via the application of conversation analytic research findings to the
existing speech recognition systems built on computational linguistics.
Conversational Dialog: Use of natural speech rather than
menu-driven voice applications that follow a scripted text. To take this
technology to its next level, a system should someday be able to understand a
speaker who does NOT use the appropriate key words, such as when a speaker
attempts to make an airline reservation, becomes frustrated, but does NOT
request transfer to a human “agent” or “operator.” At present, in the
absence of the use of those key words (and when there is no manual zero-out
option), the system would not transfer a frustrated user to an agent for
assistance. A system that would indeed understand the interactional signs
of user frustration not based on key words, but rather based on discrete
patterns of sequentially organized conversational features that are consistent
with frustration, would bring us closer to a real-life “Hal.”
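To make the idea concrete, here is a minimal sketch in Python of how a system might count sequential cues of frustration without consulting key words such as “agent” or “operator.” Everything in it is an assumption made for illustration: the Turn record, the function names, the three cues (a user repeating the same utterance, the system repeating the same prompt, and a user talking over a prompt) and the threshold are hypothetical, and they are not drawn from any particular conversation analytic finding or existing product.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One turn in a call (hypothetical record, for illustration only)."""
    speaker: str            # "user" or "system"
    text: str
    barge_in: bool = False  # True if the user began talking over the prompt


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences are ignored.
    return " ".join(text.lower().split())


def frustration_signals(turns: list[Turn]) -> int:
    """Count sequential cues; no key word is ever inspected."""
    signals = 0
    last_user = None
    last_system = None
    for turn in turns:
        current = _normalize(turn.text)
        if turn.speaker == "system":
            # Cue (assumed): the system re-issues the identical prompt.
            if current == last_system:
                signals += 1
            last_system = current
        else:
            # Cue (assumed): the user talks over the system prompt.
            if turn.barge_in:
                signals += 1
            # Cue (assumed): the user repeats the previous utterance verbatim.
            if current == last_user:
                signals += 1
            last_user = current
    return signals


def should_transfer(turns: list[Turn], threshold: int = 3) -> bool:
    """Transfer to a human agent once enough cues accumulate (threshold assumed)."""
    return frustration_signals(turns) >= threshold
```

A caller who repeats the same request several times and finally talks over the prompt would cross the threshold and be transferred, even though the words “agent” and “operator” were never spoken. A real system would need cues grounded in empirical study of actual calls rather than the three guessed at here.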
Idiomatic Expressions: Non-literal words or phrases
that are used for their symbolic meaning. Everyday language is punctuated by
idioms. However, it is far too costly for a system to be equipped with
enough intelligence about the world at large to grant accurate meaning to every
idiom in the English language. Given that in everyday speech
idioms are rarely taken literally but are rather granted their symbolic meaning,
how do we get a speech interface to do likewise? The answer may lie, first
and foremost, in the study of how and when idioms are used in interactive
dialog. Then, algorithms may be formulated that depict common patterns of
usage of idioms so that a system may be able to spot an idiom by virtue of
“how” and “where” it appears in interactive dialog.
Empathy: A display of understanding of what the speaker
is trying to convey. True “Hal”-like features someday would include a
display of proper empathy by an intelligent agent. When callers seek
assistance from human-operated help-lines, they often show signs of needing
empathy from the human agent. It is at such junctures in a help-line call
that a human agent can be most useful by placating an irate caller through
showing support for the legitimacy of the caller’s grievance. A natural
dialog system that can algorithmically map out the common sequentially organized
conversational features of a caller’s attempts to elicit empathy can more
responsively handle a caller’s complaints. A system that is more
responsive to the caller’s emotions is a system that can better simulate human
dialog.
Amy Neustein, Ph.D.,
President and Founder
Linguistic Technology Systems