Me2Me: building the foundations of contextual voice

Every now and then we encounter a really interesting product someone's come up with. Last time it was Fonolo, the search engine for call centres that digs through all the IVRs in the world, tries all the options, and maps their structure on its Web site. Then it lets you click-to-call straight to the organisational function you want to talk to, and log the results of your call so that others can learn and do better.

Now it's Me2Me. A Swisscom Ventures-funded startup, Me2Me uses voice-recognition technology to provide a sort of telephony-based personal organiser service. You call it, leave a note, and later retrieve it, either in a pull mode (you go looking for it) or in a push mode (it comes looking for you, for example when a pre-set reminder comes up). It's not the first time this has been done, but the crucial thing here is that the message you phone in is itself converted to text and stored in a relational database along with the actual recorded sound.
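A minimal sketch of what that storage might look like - the real Me2Me schema isn't public, so the table and column names here are invented - with the transcript and the call metadata sitting alongside a pointer to the original recording:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("me2me_notes.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS notes (
        id          INTEGER PRIMARY KEY,
        caller      TEXT,     -- caller ID from the telephony layer
        recorded_at TEXT,     -- when the note was phoned in
        audio_path  TEXT,     -- pointer to the original sound file
        transcript  TEXT,     -- the speech-to-text output
        remind_at   TEXT      -- optional pre-set reminder time
    )
""")

# Store a freshly transcribed voice note alongside its recording.
conn.execute(
    "INSERT INTO notes (caller, recorded_at, audio_path, transcript, remind_at) "
    "VALUES (?, ?, ?, ?, ?)",
    ("+41791234567", datetime.now(timezone.utc).isoformat(),
     "/var/spool/notes/0001.wav",
     "remind me to book the Geneva train on Friday", None),
)
conn.commit()
```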

[Image: me2me2.png]

This is, of course, a crucial step. Remember that classical telephony preserves everything except the context and the semantic content of the call: every crackle, passing helicopter, and cough, but nothing of where, when, how, or why, and nothing of what was said. In the past, sound and video technologies have concentrated exclusively on the waves in the air. This is a problem. We can apply incredibly rich methodologies to information that arrives in textual or numerical form: we can search, merge, filter, match, mapreduce, select, count, join, template, and compare, and we can chain these operations together to carry out complex processing.
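To make that concrete, here are a couple of those operations run against the hypothetical notes table sketched above - a search and a count that are one query each once the transcripts are sitting in a database as text:

```python
import sqlite3

conn = sqlite3.connect("me2me_notes.db")

# Search: every note that mentions "Geneva", newest first.
hits = conn.execute(
    "SELECT recorded_at, transcript, audio_path FROM notes "
    "WHERE transcript LIKE ? ORDER BY recorded_at DESC",
    ("%Geneva%",),
).fetchall()

# Filter and count: how many notes from this caller set a reminder?
(reminders,) = conn.execute(
    "SELECT COUNT(*) FROM notes WHERE caller = ? AND remind_at IS NOT NULL",
    ("+41791234567",),
).fetchone()

for recorded_at, transcript, audio_path in hits:
    # The original recording stays one playback away.
    print(recorded_at, transcript, "->", audio_path)
```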

We can do comparatively little of that kind of processing with images, and next to nothing with sound or video. When we want to, we usually cheat: we attach text metadata and process that instead. Me2Me is doing the opposite; it's adding the complete text, plus the call metadata, to the sound file.

Think of the difference between a tape recorder, or a collection of MP3 files without metadata, and an e-mail inbox: the stuff in the inbox can be searched, threaded, and filtered on a whole wealth of criteria, and a lot of e-mail clients let those filters invoke other programs. You can do none of that with the sound recordings. But once you've extracted text from the sound, all those options are back on: you can search the stored information, filter it, and generally do intelligent things with it. The next clever bit is that the database full of text is connected to the natural habitat of text, the World Wide Web. The system looks like this:

[Image: me2me3.png]

This means that you can interact with it visually as well - which is useful if you need to do complicated things or work with messages in bulk. That diagram also shows the really clever bit: the backend talks to all kinds of third-party Web service APIs, so your messages can make things happen. Because the voice-recognition system identifies certain words as system commands, you can tell it what to do with the information you're about to give it.
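The command-word idea might look roughly like the sketch below; the command names, handlers and endpoints are all invented for illustration, but the pattern - scan the transcript for a registered keyword and hand the rest of the utterance to a Web service - is the interesting bit:

```python
import requests

# Hypothetical handlers mapping spoken command words to Web service calls.
def timetable(args: str) -> str:
    origin, _, dest = args.partition(" to ")
    r = requests.get("https://example.org/timetable",
                     params={"from": origin, "to": dest}, timeout=10)
    return r.text

def stock(args: str) -> str:
    r = requests.get("https://example.org/quote",
                     params={"symbol": args}, timeout=10)
    return r.text

COMMANDS = {"timetable": timetable, "stock": stock}

def dispatch(transcript: str) -> str:
    """If the transcript starts with a known command word, call its handler;
    otherwise treat it as an ordinary note to be stored."""
    word, _, rest = transcript.strip().partition(" ")
    handler = COMMANDS.get(word.lower())
    return handler(rest) if handler else f"stored note: {transcript}"

print(dispatch("timetable zurich to geneva"))
```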

The current prototype, for example, lets you query the Swiss Federal Railways' timetables, stock exchange prices, ski reports and various other things. But as the system improves there will be more, and there will probably be user-defined commands and filters, so that (for example) if you ask for a reminder of an event, the "reminder" filter triggers a call to Google Calendar (or your favourite groupware) to put the event in your visual calendar as well. You probably won't be surprised to know that we suggested they ought to integrate Fonolo as soon as it gets to Europe or they get to the States.
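That "reminder" filter could plausibly be wired up as below, using Google's standard Calendar API v3 Python client; obtaining the OAuth credentials and parsing a date out of the transcript are glossed over here:

```python
from googleapiclient.discovery import build

def mirror_reminder(creds, summary: str, start_iso: str, end_iso: str) -> None:
    """Insert a transcribed reminder into the user's primary Google Calendar.
    `creds` is an already-obtained google-auth Credentials object."""
    service = build("calendar", "v3", credentials=creds)
    event = {
        "summary": summary,                   # e.g. the transcript text
        "start": {"dateTime": start_iso},     # e.g. "2009-02-13T09:00:00+01:00"
        "end": {"dateTime": end_iso},
    }
    service.events().insert(calendarId="primary", body=event).execute()
```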

And a primary target for Me2Me is to funnel the contents of existing voicemail systems into the big bucket of data, thus taking on the world's most maladaptive communications system itself.

All very cool, especially as it develops further. Me2Me's CMO, Christian Giroux, describes it as a "voice command line" (hey, he may be the CMO but he's actually an engineer), analogous to Mozilla Ubiquity, the project which lets you control Firefox and the Web services you browse with it through short user-defined keyboard commands (for example, "trainuk london leeds" gives you a list of train times). Both aim to find more humane ways of interacting with computer and telecoms systems; both want to make human language and programming overlap.

But the really interesting opportunities are two-sided, in more ways than one. Not only is Me2Me a way of interacting with the Web by speech, as well as a clever to-do list; it's also potentially a way of interacting with call centres by both speech and the Web. You could issue a voice command that results in a Web service call to some company - a bank, say - which passes back an HTML form you can fill in to capture your data while you wait for the call to be answered, and which accepts a whole variety of context data in return. The call could then be routed on that context.
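As a sketch of that exchange - every endpoint and field name here is invented for illustration - the voice command triggers a Web service call to the bank, the bank hands back a form to fill in while you queue, and the context you send drives where the eventual voice call lands:

```python
import requests

# Context gathered from the voice command and the call itself.
context = {
    "caller": "+41791234567",
    "command": "block my debit card",
    "language": "en",
}

# Hypothetical bank endpoint: returns a form URL and a routing hint.
resp = requests.post("https://bank.example.com/precall",
                     json=context, timeout=10).json()

form_url = resp["form_url"]   # HTML form to fill in while waiting
queue = resp["route_to"]      # e.g. "card-services" rather than the generic IVR

# The telephony side can now route the call on that context
# instead of making the caller recite it all again to an agent.
print(f"Send the caller the form at {form_url}; route the call to '{queue}'")
```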

[Image: me2me1.png]

And Me2Me is marketed as a managed service to telcos. So as well as charging your customers a subscription, perhaps you could profit by having upstream partners link their CRM systems with it, creating new voice commands that benefit from all this context, and sharing some of the value they gain with you. Me2Me is building the foundations of contextual voice.

Another thought: when we said that telcos need to cherish people with good Unix/Linux/open-source skills, we weren't joking. Under the bonnet, and leaving the voice-recognition stuff aside for a moment, it's all done with Asterisk, the open-source PBX and telephony toolkit.
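One plausible way to wire that up (a sketch, not Me2Me's actual implementation): have the Asterisk dialplan Record() the note to a file, then hand the filename and caller ID to a small script that transcribes it and writes a row into the kind of notes table sketched earlier. The transcribe() function here is a stand-in for whatever speech-recognition engine sits behind the service:

```python
#!/usr/bin/env python3
"""Invoked by Asterisk (e.g. via System() after Record()) with the WAV path
and the caller ID as arguments."""
import sqlite3
import sys
from datetime import datetime, timezone

def transcribe(wav_path: str) -> str:
    # Stand-in: plug in the real speech-recognition engine here.
    raise NotImplementedError("no ASR engine configured")

def store_note(wav_path: str, caller: str) -> None:
    text = transcribe(wav_path)
    conn = sqlite3.connect("me2me_notes.db")
    conn.execute(
        "INSERT INTO notes (caller, recorded_at, audio_path, transcript) "
        "VALUES (?, ?, ?, ?)",
        (caller, datetime.now(timezone.utc).isoformat(), wav_path, text),
    )
    conn.commit()

if __name__ == "__main__":
    store_note(sys.argv[1], sys.argv[2])
```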