Sunday, April 29. 2007
Did you notice how speech things are always cat content? With LiONS for speech output (TTS, text-to-speech), and sphinx and SpeechLion for speech input (speech recognition)? Couldn't forgo that, could I. So let's see how to use speech I/O with linux, with all free components: how to avoid some common pitfalls that are not in the manual, how to get desktop control, and where emacs, KDE, beryl, and others come in. The works. It's not for the squeamish though, so here's the linux speech how-to (aka, the parts that aren't in the manual)!
- Speech Output (text-to-speech)
- FreeTTS is a text-to-speech solution in Java. It comes with some "native" voices, some imported FestVox (festival, see below) voices, and can use English Mbrola voices. (I won't be explaining the above terms; you don't need to know them to run FreeTTS; if you'd like to know, kindly see the linked documents.) If you run OpenSuse, Packman has the packages; otherwise, download and install as per the instructions given. There are many examples, but the most immediate is running it like so: java -jar /usr/share/java/freetts/lib/freetts.jar -voice kevin16 -text "Hallo kitty\!" That aside, the major benefit of FreeTTS might be that Java apps can use it, Firefox Accessibar or SpeechBot to name but two. Maybe more importantly, the speech recognition with SpeechLion we'll be discussing below can also use FreeTTS to confirm your commands to you. But first things first!
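That one-liner begs to be wrapped. Here's a minimal sketch of a shell helper; the function name is mine, and the jar path is the OpenSuse one from above, so adjust both to your setup:

```shell
# say_freetts TEXT... -- speak the arguments through FreeTTS.
# Jar path and default voice match the example above; override
# FREETTS_JAR / FREETTS_VOICE for other layouts.
say_freetts() {
    jar="${FREETTS_JAR:-/usr/share/java/freetts/lib/freetts.jar}"
    voice="${FREETTS_VOICE:-kevin16}"
    java -jar "$jar" -voice "$voice" -text "$*"
}
```

Then `say_freetts "Hallo kitty!"` does the honors.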
- festival and Mbrola may be the "classic TTS systems" on linux, so just a few extra pointers:
- If festival locks your sound device (as in, you can't use it if music's playing, or if you're using it, you can't use the music player, that sort of thing), you may wish to use a program for output that plays nicer, by putting the following into festival's siteinit.scm (/usr/share/festival/siteinit.scm on OpenSuse):
(Parameter.set 'Audio_Command "aplay -q -c 1 -t raw -f s16 -r $SR $FILE")
(Parameter.set 'Audio_Method 'Audio_Command)
(aplay is a program for ALSA, the advanced linux sound architecture — if you don't have ALSA, you don't have aplay, or if you do, it likely won't do you any good.)
- The YagiTalk Firefox extension uses festival voices (it can bring its own, or use a festival you might already have) to read webpages to you in the Firefox browser.
- espeakf integrates festival with emacspeak, the emacs TTS integration.
- Alternatively, Cepstral swift is an affordable commercial solution; the Diane voice seems quite workable to me. It comes with its own command-line tool, but you can also easily configure festival to use it, so if your existing setup relies on festival, integrating the Cepstral voices is a breeze.
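For completeness, a sketch of calling swift directly; the -n voice flag is from memory, so consult swift's help output if it balks:

```shell
# speak_cepstral TEXT... -- speak via Cepstral's swift command-line tool.
# -n selects the voice (flag name from memory; verify against swift's help).
speak_cepstral() {
    swift -n "${CEPSTRAL_VOICE:-Diane}" "$*"
}
```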
Acapela's system also seems to sound rather nice, but while it exists for Windows and Mac (InfoVox Desktop etc.), there seems to be no linux variant, and the demo at least won't install via Wine.
(Apropos, an overview of systems that speak at least German, via flawed.)
- For speech in KDE, use kttsd, part of the kdeaccessibility3 package. You can set it up to use festival specifically, or "just any old program." Both worked for me, but I found the latter preferable, especially with the cepstral voices, despite their good integration with festival. Your mileage may vary.
If something goes wrong, my kttsd_shutup script will remove all kttsd speech jobs, for near-instant silence. (Like if you end up having a MUSH-banner read to you, see below.)
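For scripting kttsd from outside KDE apps, DCOP is the hook. A minimal speak helper might look like the sketch below; sayText and its talker argument are from kttsd's KSpeech interface as I remember it, so verify with `dcop kttsd KSpeech` before relying on this:

```shell
# speak TEXT... -- hand the text to kttsd over DCOP.
# "sayText <text> <talker>" is assumed from the KSpeech interface;
# an empty talker string should mean "use the default voice".
speak() {
    dcop kttsd KSpeech sayText "$*" ""
}
```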
- If you use TinyFugue for your online-roleplaying and you're feeling adventurous, you may wish to put something like this tf-snippet into your ~/.tfrc. It uses our speak command to read your RP to you (once you have turned it on by entering /vox).
- Speech Input (speech recognition)
- Our speech recognition engine will be sphinx4 (license). It was written in Java, but contrary to common expectations, it's fast, it works, and it's not too much of a pain to use. If you absolutely insist on non-Java software, you can always use sphinx3. My copy went to /opt/sphinx4. From the sphinx dir, run the demo like so: java -jar bin/Dialog.jar. Be suitably impressed (even though the example just recognises your commands; it doesn't actually execute them — we'll fix that later).
By default, Java will pick the first "mixer"; this may not be what you want. In my case, the first is that of the built-in soundcard I use for music and for the telephone (sipphone, skype, ...) ringing; the second is that of my USB headset. The manual has instructions on how to use other mixers, but those only work if the program actually sees those mixers, which in my case it doesn't, for some reason (all non-Java apps do, however). A quick work-around is skype-DSP-hijacker, which sneakily redirects a given program's request to open one sound device to another. So in our case, sphinx4 thinks it opens the first device when it actually opens the second! For this to work, we need to fix the script to not just blindly call skype, but whatever application we pass to it on the command line. Painful enough for ya? For your amusement, the archive features skype_dsp_hijacker-0.7 as source and binary, plus an extra script, dsp_hijacker, that does just what we need. make install, go to /opt/sphinx4, then call like so: dsp_hijacker --2nd java -Xmx200m -jar bin/Dialog.jar
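The generalized wrapper might look something like this sketch; the preload library path and the environment variable it reads are assumptions modeled on skype_dsp_hijacker, so check that package's README for the real knobs:

```shell
# dsp_hijacker [--2nd] COMMAND ARGS... -- run COMMAND with its sound
# device request redirected via an LD_PRELOAD shim. The library path
# and the HIJACK_DSP variable are assumptions; adapt them to the
# hijacker you actually built.
dsp_hijacker() {
    lib="${HIJACK_LIB:-/usr/local/lib/dsp_hijacker.so}"
    dev=/dev/dsp
    if [ "$1" = "--2nd" ]; then
        dev=/dev/dsp1        # second soundcard, e.g. a USB headset
        shift
    fi
    LD_PRELOAD="$lib" HIJACK_DSP="$dev" "$@"
}
```

Called as above: dsp_hijacker --2nd java -Xmx200m -jar bin/Dialog.jar.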
- For it to actually do something on recognizing a command, we'll use SpeechLion. SpeechLion has the distinction of being written in jython, which, simply speaking, runs python scripts on a Java engine and can thus use python modules as well as Java ones. The drawback is that jython development lags behind python development; cpython implements (at the time of this writing) python 2.5, while jython implements python 2.2, so if jython sees your 2.5 modules, it will just get confused and die. It really only uses UserDict$py.class for the time being though, so it will suffice to make a python2.2 directory in your SpeechLion directory (/opt/speechlion in my case) and drop the 2.2 version of that file there. Enough pain? There you go, calling speechlion may work for you now! It didn't for me, of course, so I had to do some extra parameter magic; plus, of course, I wanted to use the second soundcard via dsp_hijacker as per above, so I ended up writing a little wrapper, speechlioness. SpeechLioness assumes the directory structure (in /opt) as given above. But yes, it might be a worthwhile exercise to port SpeechLion from jython to Java.
- Now SpeechLion is fun, but you will probably wish to configure it to your needs. This is done by adjusting the grammars that live in the gram folder within the speechlion directory (in my case, /opt/speechlion/gram). These grammars are pretty much EBNF, so they're not rocket science to read, understand, and change. You may execute shell commands, press keys, or operate the mouse from within those grammars. For further reference, I've made available the gram I currently use, which extends the original examples. Read them for reference, or replace your original gram folder with mine (the individual grammars may include others, as common.gram or mode_*.gram do, so if you only want to replace some of the files, it's up to you to get the dependencies right).
Editing the gram-files takes a moment or two to get right, but is rather rewarding.
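To make "pretty much EBNF" concrete, here's a tiny JSGF-style rule file in the spirit of the sphinx4 grammars; the grammar and rule names are made up for illustration, and the way SpeechLion attaches actions to rules isn't shown, so copy the action syntax from the original gram files rather than from here:

```
#JSGF V1.0;
grammar desktop;

// One public rule: the phrases the recognizer will accept.
public <command> = <raise> | <music>;

<raise> = raise ( emacs | firefox | amarok );
<music> = listen to music;
```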
For instance, next window in the original SpeechLion grammar is a bit lacking — much better to be able to raise specific windows. There's DCOP for that, or DBUS magic, but the generic solution at this point would likely be wmctrl. My QND raise script will raise and focus the window whose name it was given as a parameter; it will prefer a match in the current workspace, barring that, on the current desktop, or any match otherwise. (This works correctly on beryl's/compiz' funny desktop cube.) If passed other parameters (like -F for exact match), the whole thing will be passed through to wmctrl without further ado; see there for possible parameters. If like me you have several emacsen on different workspaces, say one private, one work, this will more likely raise the right one (the "nearest" one).
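Stripped of the nearest-desktop logic, the core of such a raise helper is small; this is a rough sketch of the idea, not the script linked above:

```shell
# raise NAME     -- activate the first window whose title contains NAME
# raise OPTS...  -- with more than one argument, pass everything
#                   straight through to wmctrl (e.g. -F -a for exact match)
raise() {
    if [ $# -gt 1 ]; then
        wmctrl "$@"
    else
        wmctrl -a "$1"       # -a: activate by title substring
    fi
}
```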
Another handy script, play_by_artist, will, when given an artist name, enqueue all songs by that artist rated higher than five in the amarok playlist. Playback starts as soon as the first song is enqueued, and your existing playlist is appended to, not thrown away. This implements a take on the sphinx4 examples' listen to music.
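A sketch of the idea behind such a script, driving amarok 1.4 over DCOP: the collection SQL below is a guess at amarok's schema (table and column names, and the rating scale, vary by version, and the rating filter is omitted here), while "collection query" and "playlist addMedia" are the DCOP calls amarok scripts commonly use.

```shell
# play_by_artist ARTIST -- enqueue ARTIST's tracks in the amarok playlist.
# The SELECT is illustrative only; check your amarok version's collection
# schema (and add the rating filter) before using this for real.
play_by_artist() {
    dcop amarok collection query \
        "SELECT url FROM tags WHERE artist = '$1'" |
    while IFS= read -r url; do
        [ -n "$url" ] && dcop amarok playlist addMedia "file://$url"
    done
}
```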
And there you go. If in addition to all that (World of pain, Daggio!) you also have the sexiness that is the Beryl window manager, nothing should stop you now from telling your desktop to listen to music — and be understood! So, are talking cats all the rage, or what? : )