“My God, it speaks!” he cried in Portuguese. The arresting part of this quote is the word “it.” Machines are not supposed to talk. While we have always been comfortable with machines producing text (at least since the invention of the printing press), speech seems in contrast to be the natural province of humans.
A century of experience with talking machines since Bell’s demonstration has not cured our ambivalence about machines that talk. The telephone itself at least has a human at the other end. Nonetheless, it has taken the better part of this last century for us to learn how to use it easily and efficiently. In 1877 an advertisement for the telephone proclaimed the following:
The proprietors of the Telephone are now prepared to furnish Telephones for the transmission of articulate speech through instruments not more than twenty miles apart. Conversation can easily be carried on after slight practice and with occasional repetition of a word or sentence. On first listening to the Telephone, though the sound is perfectly audible, the articulation seems to be indistinct; but after a few trials the ear becomes accustomed to the peculiar sound.
Some hint of the difficulty that early users were experiencing is contained in the following instructions from another telephone advertisement:
After speaking transfer the Telephone from the mouth to the ear very promptly. When replying to communication from another, do not speak too promptly. Much trouble is caused from both parties speaking at the same time. When you are not speaking, you should be listening.
It is reported that users of the telephone suffered from a stage fright that made them tongue-tied. Even the simplest conversation was a major undertaking fraught with physical and psychological difficulties. This seems humorous now, but I have noticed that people from the older generation do not always seem to know how to conduct telephone conversations. I myself can certainly remember when the very words “long distance calling” were an ominous harbinger of fortune and tragedy, pressing upon the recipient the necessity for a heavy formality worthy of the occasion.
When a machine originates speech on its own, instead of merely transmitting the speech of another human talker, the implications are more disturbing. Most of us have only begun to encounter computer-synthesized speech in the last few years. Hopefully it will not take us as long to become accustomed to dealing with speaking machines as it did for us to gain our everyday ease and efficiency in the use of the telephone. This recent emergence of synthesized speech is attributable to the discovery of efficient mathematical algorithms that are suitable for this purpose, and to the new availability of integrated circuits designed specifically to implement these algorithms. Simply stated, now is the time that the machines of the world are finding their voices. We wait expectantly to hear what they will have to say.
The history of synthesized speech is entertaining and even mildly illuminating. In 1779 the Imperial Academy of St. Petersburg offered a prize to whoever could build a machine that could speak the vowels a, e, i, o, u. This prize was claimed by a Christian Gottlieb Kratzenstein, who constructed a pipe-organ-like set of five variously shaped resonators that could sound the vowels when activated by air blowing through a vibrating reed (as in the clarinet). Other inventors in subsequent years built similar wind-instrument approximations of the human vocal tract. Alexander Graham Bell himself, prior to his invention of the telephone, constructed a facsimile of the human skull with a working tongue and a controllable vocal tract. Bell claimed that his model could produce a “few simple utterances,” but who knows? Perhaps if he had continued this work his fame today might have equaled that of Christian Kratzenstein.
The drawings of mechanical speech inventions appear somewhat ludicrous. The idea of a human as a walking pipe organ brings to mind the hilarious old movies of aircraft that tried to fly by flapping wings. Modeling nature is not always a good way to design machines intended to reproduce natural behavior. Nonetheless, speech synthesis has been the exception, and today’s best synthesis systems have been motivated by study of the human vocal system. Instead of bellows and pipes they have their electronic equivalents in noise and tone generators and variable electrical filters.
The public was introduced to speech synthesis at the 1939 World’s Fair in New York, where the Bell System exhibit featured the giant speaking machine, called the Voder, shown in Figure 11. (“Voder” sounds like something from Star Wars, but it merely stands for “Voice Operation Demonstrator.”) The Voder was the ingenious invention of Homer Dudley, who conceived of the idea of using electrical networks with resonances similar to those of speech. Dudley’s Voder used ten electrical filters, each of which could adjust the amplitude of sound within a given narrow frequency band, just as today’s high fidelity audio systems use graphic equalizers for customization of audio response. An operator played the Voder through a keyboard in such a way as to produce speech-like sounds. In a sense, the Voder was an electronic organ, but instead of producing pure tones the Voder produced tone-like noises that could be strung together to emulate speech.
The Voder was quite difficult to play, and it took skilled operators more than a year to learn to produce speech. Nevertheless, the Voder was much more than a stunt. It has influenced speech synthesis to this day, and people at Bell Labs still refer to it with the kind of reverence given to legendary sagas. For those in the public who attended the World’s Fair the Voder must have been a memorable experience. Imagine being talked to (or at) by that awesome machine when you had never heard artificial speech before! Today we might even fear that this ominous-looking machine would breathe loudly between sentences. It would be scary.
My own first experience with synthesized speech was in 1967 when I received a copy of the journal Bell Laboratories Record, which contained a plastic phonograph record with segments of computer-generated speech. I rushed home to play the record, and I was particularly taken with a segment in which the computer sang “Daisy, Daisy, give me your answer, do,” using a too-perfect computerized pitch. I played the record for many of my friends, but I doubt their enthusiasm for the performance came up to my expectations. Did they realize the implications of a speaking computer? Did I? In any event, it was only a few years later that this same segment became a centerpiece in a celebrated movie by Stanley Kubrick. As you read this dialogue, think of the impact and nuance the words carry when the computer gives voice to them, compared with what they would have carried had they merely been displayed as text on a terminal.