Net.Speech -- Desktop Audio comes to the Net

Press, L., Communications of the ACM, Vol. 38, No. 10, pp 25-31, October, 1995.

In the 1940s, the first research computers worked on numeric data. A few years later, the UNIVAC{footnote 1} was designed for both numeric and alphanumeric data. Over time, technology advances have made other data types economically feasible -- text, images, sound, and video.

Evolution of data types is also occurring on networks. A few years ago, most Internet users had terminals or small personal computers emulating terminals and 9.6 Kb/s modems, so they were limited to working with text, numbers, and binary files. Today, many have 486 or Pentium PCs (see Table 1), or 68040 or Power PC Macs with 14.4 or 28.8 Kb/s modems, so images have become a common data type, and the time is right for speech to bloom on the Internet.

Applications

When one thinks of speech processing, many applications come to mind -- speech recognition, speaker identification, speech synthesis, language identification, and so forth. Since we will focus on the Net, this article covers recording, compression, and playback, saving other applications for another time. We will begin with a look at current telephone, radio, and utility software, followed by a discussion of future possibilities.

Internet Telephony

There are several Internet telephone programs on the market or being developed, and I have experimented with two of them: Internet Phone (Iphone) [10] from Vocaltec, which runs on Windows-based PCs, and NetPhone from Electric Magic, which runs on the Macintosh. (For pointers to others, see [16].)

Iphone and NetPhone are similar. Both compress speech data so it can be transmitted and received by a 14.4 Kb/s modem, allowing people with dial-up IP accounts to use them. Conversations are sent as unchecked User Datagram Protocol packets, with no re-transmission of lost or late data. (In telephony, a late packet is as good as lost.) Conversation using either program is intelligible, but there is more variance in quality than with a telephone, and there are occasional bouts of static and choppy reception. Transmission delays are also noticeable, and echo is a problem if speakers are used instead of a headset. (Perceived echo becomes worse as delay rises.)

Figure 1 shows statistics for two Iphone conversations. The first was between a PC with a 14.4 Kb/s modem in New York and a PC on an Ethernet with a T1 connection in California. In spite of the lost packets, the conversation was intelligible, though there was quality variance and noise. The average round-trip delay (RTD) of 743 milliseconds was typical of such conversations. This RTD, plus encoding and decoding time, resulted in a conversation with noticeably greater delay than a conventional phone call.{footnote 2}

The second conversation was between two 486-100 PCs on the same Ethernet segment. The reduced packet loss improved voice quality, but the most noticeable improvement was due to lower RTD. This conversation seemed as natural as a telephone call. There were still delays (which could be heard by speaking into the microphone on one machine and listening on the other) due to encode and decode time, but they did not cause distraction during a conversation.

These were half-duplex conversations, with only one person able to speak at a time. During silence, the first person to speak gains control, and you cannot interrupt a person who is talking. This restriction is due to limitations of some PC sound cards and drivers, and of Macintosh computers. Although full-duplex hardware will be common in the future, it may not be useful until delay times improve. (Speaker phones use half-duplex to cut echo.) But again, the half-duplex LAN-based conversation felt natural.

Making a network connection is more complicated than making a telephone call. NetPhone has two means of making a connection, and Iphone only one. With NetPhone, if you wish to speak with someone who has their own IP address{footnote 3}, you enter it and, if they are also running NetPhone, they will hear a ring and answer your call. With NetPhone or Iphone, if you wish to speak with someone without a fixed IP address, or just to chat ham-radio style with strangers, you must register with a server that lets you find a communication partner. After logging onto a server, you select one or more topic areas (see Figure 2), which brings up a list of people registered under those topics, and you initiate a conversation by clicking on someone's name. You can also establish private topics which only members can see, allowing exclusive connection between coworkers and other closed groups.

Global Audio on Demand

Let us shift from telephone conversations to prerecorded or broadcast material. Until recently, there have been two primary alternatives: multicasting on the MBONE [5, 11] or storing prerecorded files which users could download and play.

The most prominent speech multicaster is Carl Malamud, who broadcasts regular "Geek of the Week" interviews of Internet celebrities and other talks [14]. To hear a multicast program, the user must have access to properly configured equipment, and be present at the time of the broadcast or record it. There are also many speech and music archives on the Internet (including Malamud's), from which users download material and play it locally. The drawback to this approach is that files can be quite large, and must be downloaded before playing.

A new alternative, audio on demand, has become feasible with the growth of the installed base of fast PCs and modems. Progressive Networks has pioneered this approach with their Real Audio (RA) software. With RA, the user selects prerecorded material on a server, but instead of the entire file being downloaded first, it is sent in a continuous stream. After a few seconds of buffering, the program begins to play, and the user can pause and resume or jump forward and back at any time (see Figure 3).

There are several RA servers on the Internet (see [14, 17]), and they are typically accessed using a WWW browser. Progressive claims that sound quality is as good as AM radio, but one hears occasional glitches due to server or network overload. To my ear, even a smooth transmission sounds worse than a radio. Still, when the server is not overloaded, the material is easily intelligible, the user interface is simple, and I have been testing pre-release software (both the player and server).

At times, RA suffers from network and server delay. For example, if I am listening to an RA program, and load a program from a LAN server, there is a break in the sound. This problem might be overcome with larger buffers or an improved operating system, but server delay may still be a problem. At times I have experienced very poor reception. For example, the session shown in Figure 4 was barely intelligible. Since pinging the server showed an average transmission time of only 258 milliseconds, I assume server overload was the problem. As servers become overloaded, capacity will have to be increased, and it is difficult to predict the cost/revenue of various applications without more real experience.

Progressive offers three programs: Player, Studio, and Server. Beta copies of the player are available at no charge, and it will be bundled with most WWW clients in the future. Studio is used to prepare material for playing on an RA server. It accepts input files in various sound formats, and compresses them into the RA format, which plays at roughly 1 kilobyte/second. Studio will be released in standard and professional versions, and will be available first for Windows. RA Servers will be available for both UNIX and Windows NT. While the server will have an industrial price tag, one hopes audio servers will eventually end up on everyone's desktop.
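The playback rate of roughly 1 kilobyte/second explains why RA is usable over a dial-up line. The following is a back-of-the-envelope sketch, not a measurement: it assumes 1 kilobyte = 1,024 bytes, takes modem rates at their nominal values, and ignores packet and protocol overhead. The function names are illustrative only.

```python
# Rough check that a ~1 kilobyte/second audio stream fits a dial-up modem.
# Assumptions (not from the article): 1 kilobyte = 1,024 bytes, modem rates
# are nominal, and packet/protocol overhead is ignored.

STREAM_BYTES_PER_SEC = 1024
MODEM_RATES_KBPS = (9.6, 14.4, 28.8)


def stream_kbps(bytes_per_sec=STREAM_BYTES_PER_SEC):
    """Convert a byte rate to kilobits per second (1 Kb = 1,000 bits)."""
    return bytes_per_sec * 8 / 1000


def headroom(modem_kbps, bytes_per_sec=STREAM_BYTES_PER_SEC):
    """Fraction of the nominal modem rate left over after the audio stream."""
    return 1 - stream_kbps(bytes_per_sec) / modem_kbps


for rate in MODEM_RATES_KBPS:
    print(f"{rate:4.1f} Kb/s modem: stream uses {stream_kbps():.1f} Kb/s, "
          f"{headroom(rate):.0%} headroom")
```

Under these assumptions the stream takes a little over half of a 14.4 Kb/s line. Real overhead and line conditions would reduce that margin.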

Utility Software

There are also speech compression and transformation utilities from companies like Echo Speech and Voxware. These companies use knowledge of the characteristics of speech data and of human speech perception and discrimination to write compression programs. They work poorly on music or many people speaking at once, but are highly effective for speech or singing by one person at a time.

I tested compression programs from Voxware, Echo, and the beta version of RA Studio on a 90 MHz Pentium PC. The three test files were 1, 8, and 60 minutes long, recorded as 16-bit monaural samples at 11.025 kHz. As Table 2 shows, compressed file sizes ranged from 4.5 to 7.5% of the originals, and compression time ranged from 53 to 734% of playing time. For all programs, compression times and file sizes varied nearly linearly with playing time.
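The source file sizes in Table 2b can be approximated from the recording parameters alone. The sketch below assumes 1 Kbyte = 1,024 bytes and ignores any file header. The 60-minute figure it produces is slightly below the table's 77,534 Kbytes, presumably because that recording ran a few seconds over an hour.

```python
# Uncompressed size of 16-bit monaural audio sampled at 11.025 kHz.
# Assumptions (not from the article): 1 Kbyte = 1,024 bytes, no file header.

SAMPLE_RATE_HZ = 11_025
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 1           # monaural


def pcm_kbytes(minutes):
    """Size in Kbytes of an uncompressed recording of the given length."""
    total_bytes = minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return total_bytes / 1024


def percent_of_source(target_kb, source_kb):
    """Compressed size as a percentage of the original size."""
    return 100 * target_kb / source_kb


for minutes in (1, 8, 60):
    print(f"{minutes:2d} min -> about {pcm_kbytes(minutes):,.0f} Kbytes")

# Two entries taken from Table 2b as a spot check of the percentages.
print(f"EH, 1-minute file: {percent_of_source(97, 1292):.1f}% of source")
print(f"RA, 8-minute file: {percent_of_source(470, 10335):.1f}% of source")
```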

These figures are encouraging since the sound of the compressed files was almost as good as the originals, and the times and file sizes are reasonable for many applications. {footnote 4} Variance among the programs reflects design tradeoffs. For example, Echo has two algorithms, one tuned for concise compression, the other for high quality sound. (For these tests, using a typical PC sound system, I could not hear a quality difference, though Echo assures me that developers with better equipment can). With either, compression is relatively fast, and playback efficient, taking only about 50% of a 386DX/25 (without a floating-point processor) or 6% of a Pentium 90. (The others require 486 speed). Voxware compresses files more slowly, but it is more flexible, because, in addition to playing the files back, a number of play-time transformations are possible. Voxware files can be shifted in frequency by an octave, speeded or slowed by a factor of five, or played back in a transformed voice.
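The storage estimate in footnote 4 follows from the 60-minute row of Table 2b. This is a minimal sketch that assumes 1 megabyte = 1,024 Kbytes. Voxware is omitted because no 60-minute figure was recorded for it.

```python
# Cost of storing one hour of compressed speech at about $0.30 per megabyte.
# Sizes are the 60-minute entries from Table 2b; the price is from footnote 4.
# Assumption (not from the article): 1 megabyte = 1,024 Kbytes.

PRICE_PER_MB = 0.30
HOUR_SIZES_KB = {"EH": 5802, "EL": 4220, "RA": 3524}


def storage_cost(kbytes, price_per_mb=PRICE_PER_MB):
    """Dollar cost of storing a file of the given size."""
    return kbytes / 1024 * price_per_mb


for name, size_kb in HOUR_SIZES_KB.items():
    print(f"{name}: ${storage_cost(size_kb):.2f} per hour of speech")
```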

Echo's compression can be run as a stand-alone DOS program or from within programs that conform to Microsoft's Audio Compression Manager specification, thereby allowing plug-in compression modules. I tested it in Sound Forge, a very flexible, professional quality sound editor. Sound Forge offers many transformations which would be useful in the preparation of speech (or music) for serving on the Net.

Why The Fuss? Looking to the Future

After reading about radio that sounds worse than AM broadcasts and telephony with a complex user interface and erratic quality, you might be wondering why I am interested in Net speech. For perspective, we should recall that early telephones produced sound at the speaker with about 1/10,000 the power of the original speech, even when the speaker and mike were directly connected [6]. Calls were placed by operators, and competing phone companies could not interoperate. Early radios had few programs and were difficult to tune and faint. By these standards, Net speech is a marvel, and there are reasons to expect improvement.

Integration

Most important is the opportunity for the integration of speech with other data types and applications. We already have an example of the integration of speech and slow scan video in CuSeeMe [12, 13],{footnote 5} which allows Internet video conferences between several people. Shared-whiteboard applications integrated with Internet telephony would allow users to take notes together, run other programs, make sketches, and so forth. With Web browser integration, one could make calls with a single mouse click. Server-based answering machines can also be integrated with Internet telephony, as can fax and email. Hardware integration is also possible. I have two information lines coming to my office, a telephone and an Ethernet connection. I would be happy to get rid of one.

The ultimate integration would be with the commercial telephone network. If Internet telephony becomes popular, telephone companies can either fight or join. Telecom Finland has entered the Internet connectivity market, and plans to make Iphone available to their IP customers. Others may see it as a threat, and attempt to block it legally and via business practice. The same issue arises for the broadcast radio industry. Since several networks are experimenting with RA, there may be a trend toward joining in. (I regularly listen to National Public Radio programs on RA now, selecting program segments I wish to hear, at convenient times, and without RF interference from my computer.) Nationwide broadcasts require networks today; tomorrow a single server may suffice.

Flexibility

Recorded digital audio is flexible. For example, I have experimented with recording classroom lectures using an Audio-Technica wireless mike connected to a portable computer with a Media Vision PCMCIA sound card. For speech recording, the quality seems comparable to that achieved with a full-size sound card and desktop mike, and the automatic gain control on the sound card makes for trouble-free operation. The recordings can be compressed and placed on a server for student access. Transformations from Voxware or Sound Forge would also let the student play the lectures at faster than normal speed. With a bit more effort, still images keyed to the lecture can be added. Try all that with an analog tape recorder!

Delay Reduction

When you ask Net telephone users what they like about it, they invariably mention the price. With many Internet billing systems, the marginal cost of calls is zero, and even if you pay by the hour, long distance calls are cheap. Packet-switched voice uses bandwidth more efficiently than circuit-switched, and therefore should be cheaper. But the vagaries of packet delay and loss cause quality problems. While these are noticeable today, there are good prospects for improvement.

Delays are caused by encoding time and the network. Improved hardware will enable reduced encoding time. Today, many Macintosh and PC sound systems use digital signal processors (DSPs), programmable devices optimized for fast execution of multiply-accumulate, looping, and other operations needed in speech processing. With the advent of fast CPUs, engineers realized DSPs could be eliminated for some applications [9]. Intel has concluded that much signal processing can be handled by a Pentium CPU (see Table 3), and, in conjunction with real-time processing partner Spectron Microsystems, has prepared a Native Signal Processing (NSP) reference platform and design kit [15]. NSP would cut low-end multimedia system cost and complexity by replacing add-in boards with software, and giving developers a more uniform platform. While NSP is feasible, it still costs CPU cycles and operating system overhead{footnote 6}, so high performance applications in multitasking environments would still use DSPs (which are also improving, see, for example, [2]). NSP will succeed if it performs well, and is not bogged down in business-driven conflicts with Microsoft.

Some hardware improvement will be budgeted to cut delays; others will go to more complex encoding algorithms. One can trade encoding complexity for bandwidth. A complex compression algorithm may require more CPU cycles than a simple one, but produce smaller bit-streams with the same perceived sound quality. Or the extra cycles may be used to increase quality in the same bandwidth. Another form of redundancy is explored in [3], which describes experiments in which a second, quick-and-dirty encoding of each packet is appended to the following packet. If a packet is lost, the low-quality encoding in the following packet is played in its place rather than substituting silence, white noise or interpolating.
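The redundancy scheme described in [3] can be illustrated with a toy model. This is a sketch of the general idea only, not the implementation from that paper: the `Packet` class, the function names, and the string "encodings" are all hypothetical stand-ins for real audio coders.

```python
# Toy model of piggybacked redundancy: each packet carries a low-quality
# copy of the PREVIOUS frame, so a single lost packet can be patched from
# the packet that follows it. All names here are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

SILENCE = "<silence>"


@dataclass
class Packet:
    seq: int
    primary: str                 # full-quality encoding of frame `seq`
    backup_prev: Optional[str]   # low-quality encoding of frame `seq - 1`


def packetize(frames: List[str],
              encode_hi: Callable[[str], str],
              encode_lo: Callable[[str], str]) -> List[Packet]:
    """Build packets, appending a cheap encoding of the previous frame."""
    packets = []
    for seq, frame in enumerate(frames):
        backup = encode_lo(frames[seq - 1]) if seq > 0 else None
        packets.append(Packet(seq, encode_hi(frame), backup))
    return packets


def reconstruct(received: Dict[int, Packet], n_frames: int) -> List[str]:
    """Play primaries where available, else the backup from the next packet."""
    output = []
    for seq in range(n_frames):
        nxt = received.get(seq + 1)
        if seq in received:
            output.append(received[seq].primary)
        elif nxt is not None and nxt.backup_prev is not None:
            output.append(nxt.backup_prev)
        else:
            output.append(SILENCE)   # nothing to patch with
    return output


frames = ["a", "b", "c", "d", "e"]
packets = packetize(frames, lambda f: "hi:" + f, lambda f: "lo:" + f)
# Simulate losing packet 1 (isolated loss) and packets 3 and 4 (a burst).
received = {p.seq: p for p in packets if p.seq not in (1, 3, 4)}
played = reconstruct(received, len(frames))
print(played)
```

In this model an isolated loss is covered by the low-quality copy, while a burst of consecutive losses still leaves gaps. The receiver must also wait for the following packet before it can patch a hole, so the scheme trades a little extra delay and bandwidth for robustness.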

Even if encoding time were reduced to zero, there would be network delay. RTD in packet-switched networks will always be variable. We can expect excellent speech over LANs in the near future, but conversations between Iceland and Antarctica may always falter. This problem will one day be solved by ATM networks with guaranteed speed and latency for virtual circuits, but that will take time, and when that infrastructure is in place, I would not be surprised if the marginal cost of voice traffic were so small as to be included in fixed service charges with no time or distance-based billing.

The Last Mile

There is also the modem bottleneck. Today's systems are designed for a last-mile connection of 14.4 Kb/s, but prices of 28.8 Kb/s modems are falling fast.{footnote 7} Even this is only an interim step, because last-mile speed to desktops will rise significantly as ISDN and new telephone and CATV infrastructure become common.

Perhaps more significant is the potential for wireless connectivity at rates sufficient for speech. Much of today's telephony and radio listening takes place in autos and other tetherless locations. Your car radio may one day bring you programs from servers around the world.

Terminal Intelligence

Computers are not restricted to fixed, limited capabilities as are radio receivers and telephones. In addition to the encoding and decoding discussed above, a computer can execute a sophisticated communication protocol. For example, compression algorithms, buffer sizes, and packet sizes can be adjusted dynamically as a function of RTD and the speeds and other characteristics of the two communicating computers.{footnote 8} The protocol can allow adjustment of the other person's computer to enhance intelligibility at your end. Programmed encryption is also possible. (As far as I know, there has been no discussion of common protocols, but they will come eventually.)
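As one hypothetical illustration of such dynamic adjustment, a receiver could size its playout buffer from recent delay measurements. The sketch below is an assumption for illustration, not a description of any product mentioned here. It borrows the exponentially weighted smoothing used by TCP's round-trip estimators, and the class name, function name, and constants are invented.

```python
# Hypothetical adaptive playout buffer: track smoothed round-trip delay and
# its variation, then buffer enough audio to ride out the variation.
# The smoothing follows the exponentially weighted style of TCP's RTT
# estimators; applying it to buffer sizing is an assumption of this sketch.

class DelayEstimator:
    def __init__(self, alpha=0.125, beta=0.25):
        self.alpha = alpha      # weight of a new sample in the smoothed delay
        self.beta = beta        # weight of a new sample in the variation
        self.smoothed = None    # smoothed RTD in milliseconds
        self.variation = 0.0    # smoothed absolute deviation in milliseconds

    def update(self, sample_ms):
        """Fold one RTD measurement into the running estimates."""
        if self.smoothed is None:
            self.smoothed = float(sample_ms)
            self.variation = sample_ms / 2
        else:
            deviation = abs(self.smoothed - sample_ms)
            self.variation = (1 - self.beta) * self.variation + self.beta * deviation
            self.smoothed = (1 - self.alpha) * self.smoothed + self.alpha * sample_ms
        return self.smoothed


def playout_buffer_ms(estimator, floor_ms=40, ceiling_ms=1000, k=4):
    """Buffer k times the delay variation, clamped to a sensible range."""
    return max(floor_ms, min(ceiling_ms, k * estimator.variation))


steady = DelayEstimator()
for _ in range(20):
    steady.update(100)          # LAN-like: constant delay

jittery = DelayEstimator()
for sample in [100, 700] * 10:
    jittery.update(sample)      # congested path: delay swings widely

print("steady path buffer :", playout_buffer_ms(steady), "ms")
print("jittery path buffer:", playout_buffer_ms(jittery), "ms")
```

A steady path settles at the minimum buffer, while a path with widely varying delay is given a much larger one, at the price of added latency.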

Computers can also have better, more flexible user interfaces. For example, RA uses Web browsers for program selection, and a simple, intuitive interface (see Figure 3) for player control. In addition to establishing and managing sessions, the user interface could take user preferences into account. Sliders can be used to vary volume, mike sensitivity, bass and treble levels, and so forth, or they could be adjusted automatically.

Transducers

The microphones, amplifiers, and speakers on typical multimedia PCs are not designed for telephony or hi-fidelity radio, but we can expect improvement [1]. Affordable signal processing will allow the computer to make adjustments based on the room configuration, headset, and mike. For example, if your PC has speakers and a mike rather than a telephone handset or headset, active detection and suppression of echo and ambient noise and voice enhancement will be possible. A mixer would also allow adjusting mike input.

While we wait for these improvements, we should also notice differences among current products. I have tried several microphones, and, for speech recording applications, cannot discern a significant difference between the low-cost mikes bundled with sound cards, and a more expensive Audio-Technica model. (For speech recognition and other applications, this may not be the case). Neither could I hear a difference between the Sound Blaster 16 and the more expensive AWE-32.

I noticed more variance among speakers. I have tried low-cost, no-amplifier speakers, and low and high-cost speaker-amplifier combinations. The differences are noticeable even for speech. For example, I have used a moderately priced SC&T PS-5000 equalizer with 60-watt speaker-amplifiers. Sound quality is noticeably better than with low-cost systems that come with sound cards, and the ability to adjust power in various frequency ranges enhances intelligibility. (A simple "treble" knob would probably do as well for speech.) Moving up to a Bose MediaMate system is even better. The bass is excellent, making me sound like a pro. It seems you do get what you pay for in speakers.

Conclusion

As we see, there are reasons to expect progress. Today Net speech is crude, but it is just the start. Time will tell if it is the "ham radio" of the Internet, or the beginning of the end of analog, circuit-switched telephone service and network radio. Net telephony will surely play an important role on LANs, and globally-accessible radio on demand is now possible. In 1984 I published a sketch of my cluttered desk [7], and predicted that by the year 2000 the telephone would have disappeared.{footnote 9} It may.


Pointers

Audio-Technica, 1221 Commerce Drive, Stow, OH 44224, (216) 686-2000.

Bose, The Mountain, Framingham, MA 01701-9168, (508) 879-7330, (800) 444-2673.

Echo Speech Corporation, 6460 Via Real, Carpinteria, CA 93013, 805.684.4593, info@echospeech.com.

Echo speech compression software may be downloaded from ftp://oak.oakland.edu/SimTel/win3/sound/espch10.zip, ftp://wuarchive.wustl.edu/pub/msdos_uploads/speech/espch10.zip, or ftp://ftp.cica.indiana.edu/pub/pc/win3/sounds/espch10.zip.

Electric Magic, makers of NetPhone, 209 Downey Street, San Francisco, CA 94117-4421, (415)759-4100. You can get a demo version restricted to 60-second calls at http://www.emagic.com.

For information on NSP, contact Intel at nsp@intel.com or (800)628-8686 or see http://www.intel.com/ or contact Spectron Microsystems at iasopx@spectron.com or see http://www.dialogic.com/spectron.

Media Vision, 47300 Bayside Parkway, Fremont, CA 94538, (510) 770-8600, http://www.mediavis.com or mediavision on AOL.

Progressive Networks, makers of Real Audio, 616 First Avenue, Suite 701, Seattle, WA 98104, http://www.realaudio.com.

SC&T International, 3837 E. La Salle St., Phoenix, AZ 85040, (602)470-1334.

Sonic Foundry, makers of Sound Forge, 100 South Baldwin, Suite 204, Madison, WI 53791-8062, (608) 256-5555.

Vocaltec, makers of Iphone, 157 Veterans Drive, Northvale, NJ 07647, (201) 768-9400, info@vocaltec.com. You can get a demo version restricted to 60-second calls at http://www.vocaltec.com.

Voxware, 172 Tamarack Circle, Skillman, NJ 08558, (609)497-1212, info@voxware.com.

To join an RA listserver, send the message "subscribe raplay" to listserv@prognet.com.

To join an Iphone listserver, send a message saying "subscribe iphone" to majordomo@pulver.com.

For discussion of speech research and products, read the comp.speech newsgroup, with a FAQ at http://fortis.speech.su.oz.au/comp.speech/index.html.

For a subscription to PC-Telephony, an electronic journal devoted to the integration of the telephone and computer, send email to listserv@netcom.com with the message subscribe pc-telephony.

If you are interested in following the progress of NSP, and generally staying a year or so ahead of the semiconductor market, consider a subscription to The Microprocessor Report, Micro Design Resources, 874 Gravenstein Highway South, Sebastopol, CA 95472, (707)824-4001, cs@mdr.zd.com. Also look at Computer Letter, 120 Wooster Street, New York, NY 10012, (212)343-1900, which puts more emphasis on business and financial matters, and is also excellent.

For an overview of speech processing, applications, and research at the MIT Media Lab, see Schmandt, C., "Voice Communication with Computers," Van Nostrand Reinhold, New York, 1994.


Footnotes

1. The UNIVersal Automatic Calculator was the first commercial computer. First sold in 1951, it was designed for both scientific (numeric) and business (alphanumeric) data, hence the name "universal." IBM, which dominated computing by the late 1950s, had separate scientific and commercial product lines until they were unified as the IBM 360 family (360 degrees) in 1964.

2. According to [9], human factors research has established a maximum RTD of around 300 milliseconds for phone calls. Another paper [3] suggests that delay should be kept below 600 milliseconds in the absence of echoes, and refers to another paper which calls for a 400 millisecond maximum. See [8] for discussion of the issues of evaluation of speech intelligibility and subjective quality.

3. Today, many dial-up (SLIP/PPP) users receive dynamically assigned IP addresses. The next generation of IP will extend the address space, enabling each user to have a fixed address at a given location; however, many will still dial in to servers and also access the Net from mobile locations.

4. At today's disk prices (about $0.30/megabyte), storage for an hour of compressed speech from any of these programs costs less than $2.00.

5. One might expect video conferencing to make Internet telephony unnecessary, but the bandwidth requirements are much greater, and audio conveys meaning very well. For example, I regularly listen to the soundtrack of Public Television's MacNeil/Lehrer report, and do not feel I miss anything by not seeing the video portion. (This would not be the case for all material, of course.)

6. For example, using Windows 3.1, if you copy a file while listening to a local RA file, the audio stalls. Improved hardware and operating systems may help with such problems.

7. With poor lines, that rate may not be sustainable, and the modem will back off to 14.4 Kb/s.

8. NetPhone allows a wide choice of compression algorithms, but the users make the choice, not the protocol.

9. I must confess I failed to predict my radio's demise.


Figure Captions

1. Error rates and delay varied on these two calls. One was across the country, the other between two computers on the same LAN. Net delays (and encoding delays) cause unnatural conversation patterns and exacerbate echo problems.

2a. To place a call, you first pick a topic.

2b. Then you choose from a list of people registered for that topic.

3. With Real Audio, the user begins hearing the material as soon as a buffer is full, and can move forward and backward at will.

4. Real Audio server delay can cause problems, making a program like this one nearly unintelligible.


References

1. Baumhauer, J. C., Early, S. H., Gikus, J. H., Gay, S. L., and Zuniga, M. A., "Audio Technology used in AT&T's Terminal Equipment," AT&T Technical Journal, March/April, 1995, pp 57-69.

2. Gwennap, L., "Nvidia Launches Multimedia Accelerator," Microprocessor Report, July 10, 1995, pp 13-15.

3. Hardman, V., Sasse, M. A., Handley, M., and Watson, A., "Reliable Audio for Use over the Internet," Proceedings of the Internet Society International Networking Conference, INET '95, Honolulu, HI, pp 171-178, June, 1995. http://www.isoc.org.

4. Lapsley, P., "NSP Shows Promise on Pentium, Power PC," Microprocessor Report, May 8, 1995, pp 11-14.

5. Macedonia, M. R. and Brutzman, D. P., "MBONE Provides Audio and Video across the Internet," IEEE Computer, April, 1994, pp. 30-36. ftp://taurus.cs.nps.navy.mil/pub/mbmg/mbone.html.

6. Pierce, J. R., and Noll, M. A., "Signals, The Science of Telecommunications," Scientific American Library, New York, 1990.

7. Press, L., "The IBM PC and its Applications," John Wiley and Sons, New York, 1984.

8. Rabiner, L. R., "Toward Vision 2001: Voice and Audio Processing Considerations," AT&T Technical Journal, March/April, 1995, pp 4-13.

9. Stewart, L. G., Payne, A. C., and Levergood, T. M., "Are DSP Chips Obsolete?" Proceedings of the International Conference on Signal Processing Applications and Technology, pp 178-187, November, 1992, Boston, MA, or send email saying "send postscript 92/10" to tech-report@crl.dec.com.

10. ftp://ftp.pulver.com//pub/iphone/iphone.faq.

11. ftp://venera.isi.edu/mbone/faq.txt.

12. http://cuseeme.cornell.edu.

13. http://pogo.wright.edu/cuseeme/cuseeme.html.

14. http://town.hall.org/.

15. http://www.intel.com/pc-supp/nsp.html.

16. http://www.northcoast.com/~savetz/voice-faq.html.

17. http://www.realaudio.com/.


Tables

CPU/Memory                         1993 1994 1995 1996

486 or Pentium with 8+ MB RAM       10%  18%  39%  61%
386, 486, Pentium with 4-7 MB RAM   19%  24%  22%  15%
386, 486, Pentium with < 4 MB RAM   20%  22%  14%   9%
286 and below                       51%  37%  25%  15%
Table 1: Worldwide Installed Base of x86 PCs. Source: Computer Intelligence Infocorp, 1995
-----

Table 2 a:  File Compression Times

  Play       Compression Time        Compression Rate
  Time           (seconds)          (compress/play time)
(minutes) 

            EH    EL    RA    VW      EH   EL   RA   VW
    1       33    32    41   441    0.55 0.53 0.68 7.34
    8      274   263   313 3,433    0.57 0.55 0.65 7.15
   60    2,061 1,959 2,350   n/a    0.57 0.54 0.65  n/a



Table 2 b:  Degrees of Compression

Source          Target Size               Percent of
 Size             (Kbytes)                  Source
(Kbytes)

             EH    EL    RA    VW      EH   EL   RA   VW
1,292        97    70    59    75     7.5% 5.4% 4.5% 5.8%
10,335      773   563   470   597     7.5% 5.4% 4.5% 5.8%
77,534    5,802 4,220 3,524   n/a     7.5% 5.4% 4.5% n/a

Table 2. Comparison of Echo Speech (EH and EL, high and low quality sound), Real Audio (RA), and Voxware (VW) for compression of three speech files. To my ear, sound quality was equal using a typical PC sound system.

-----

CPU 
Load      Application                                  

7%        GSM speech:  encoding/decoding of 
          monophonic speech at 8k samples/sec.

7%        MIDI synthesis:  wavetable synthesis of 
          8 voices at 22k samples/sec.

9%        Audio mixing: adding 8 channels of 
          stereo audio at 22k samples/sec.

17%       DigiTalk speech: encoding/decoding of 
          monophonic speech at 8k samples/sec.

22%       ProShare video conferencing:  
          encoding/decoding and display of a 160 x 
          120 window at 10 frames/sec.
Table 3. Native Signal Processing. These are Intel estimates of the percent of 100-MHz Pentium cycles necessary for various signal processing applications.
