Creating a new busuu experience for the Google Assistant on Google Home

We built a conversation action for language learning using the Google Assistant’s voice recognition and speech synthesis, a Node.js app, and the API.AI platform.

Thomas Didierjean
Busuu Tech

--

The phone call

When Google got in touch with us to join the early access program for Google Assistant on Google Home, we had already dabbled in conversational user experiences by working on a Facebook bot for busuu. While we didn’t end up publishing the chat bot, it still got us very excited about the possibilities of a more interactive learning experience.

Integrating with Google’s platform

We started developing our agent (“agent” is the name given to apps built with API.AI) when the Assistant platform was still evolving quickly. While it was sometimes a bit frustrating to develop against a moving target, it was also pretty interesting to see their process in action for iterating and adding features to the platform.

Integrating a third-party agent works by providing Google with a manifest of actions or “intents” that describes the interactions between the Assistant itself, which provides voice recognition and text-to-speech capabilities, and a REST API hosted by the third party (i.e. busuu).
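To give a feel for the shape of such a manifest, here is a simplified, hypothetical sketch as a JavaScript object. The field names, intent name, phrases, and URL are illustrative only, not the exact schema Google uses:

```javascript
// Hypothetical, simplified action manifest. The real schema is defined by
// Google's Actions platform; every field name here is illustrative only.
const actionPackage = {
  actions: [
    {
      // The intent the Assistant matches from the user's utterance
      intent: 'LEARN_SPANISH',
      // Example phrases that should trigger this intent
      queryPatterns: ['teach me Spanish', 'start a Spanish lesson'],
      // The third-party REST endpoint the Assistant calls for fulfilment
      fulfillmentUrl: 'https://example.com/assistant/webhook'
    }
  ]
};

console.log(actionPackage.actions[0].intent); // → LEARN_SPANISH
```

The key idea is the split of responsibilities: the Assistant handles speech in and out, while the third-party endpoint only deals with text.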

Interactions between the Assistant (represented by the colored circles) and the agent (blue rectangle). Original image from Google Developers documentation.

At the beginning of the project Google provided its own protocol and API for registering actions and inputs. They also encouraged developers to use a JavaScript SDK acting as a wrapper for the API. At busuu we’re mainly a PHP shop, but in that case we had a perfectly good excuse to do something a bit different, so we picked Node.js for the job. As it turned out, the SDK stopped being a requirement soon after, which meant our excuse for using Node.js was gone… But by that point we were well underway with the development and kept going.

Things get complicated

We felt that we had built a decent prototype of the agent when Google decided to put their recent acquisition of API.AI to good use and asked us to consider rebuilding our agent using that tool as middleware between the Google Assistant and busuu. We were approaching the submission deadline and were not exactly thrilled by the move, but API.AI turned out to be a much nicer interface and worth the hassle. It provides developers with a graphical interface for describing, in natural language, the inputs we expect from the user, and expands on those using natural language processing and machine learning.
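Once API.AI has matched an intent, it forwards a JSON payload to the agent’s webhook, which replies with the text for the Assistant to speak. Here is a minimal fulfilment sketch, assuming API.AI’s v1 webhook format (a matched `result.action` in the request, `speech`/`displayText` in the response); the intent names and replies are illustrative, not busuu’s actual content:

```javascript
// Minimal webhook fulfilment sketch, assuming API.AI's v1 request/response
// format. Intent names and lesson text are illustrative placeholders.
function fulfil(apiAiRequest) {
  const action = apiAiRequest.result.action; // intent matched by API.AI

  let reply;
  switch (action) {
    case 'lesson.start':
      reply = "Welcome back! Let's continue your Spanish course.";
      break;
    case 'lesson.answer':
      reply = "Good. Let's try the next phrase.";
      break;
    default:
      reply = "Sorry, I didn't catch that.";
  }

  // API.AI speaks `speech` aloud and shows `displayText` on screens
  return { speech: reply, displayText: reply };
}

console.log(fulfil({ result: { action: 'lesson.start' } }).speech);
```

In a real deployment this function would sit behind an HTTP endpoint; the agent itself never touches audio, only text in and text out.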

Interface for specifying user inputs with API.AI

Creating a conversational experience

busuu’s language learning course has been created and perfected over many years, and we didn’t want to create new, voice-specific course content for the first version of our agent. To keep things simple we also decided to focus on the Spanish course.

While working on our Facebook chat bot, we had realised that the bot, acting as a language teacher, needed a personality and a tone. We wanted it to appear serious, to give it credibility as a teacher, but also cheerful and lightly humorous. We aimed for the same with our Google Assistant agent.

We made some changes to the course content to give it a more conversational feel by shortening some activities and adding clear interaction cues.

Example interaction using the simulator

There were a few challenges to overcome: the Google Assistant does have a lovely voice when speaking English, but its Spanish pronunciation is not quite as good… We had to make sure all Spanish text in the course content was replaced on the fly by an audio file spoken by a native speaker. Luckily this was easy to achieve using Google’s SSML implementation.
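The replacement itself amounts to swapping each Spanish phrase for an SSML `<audio>` tag pointing at a native-speaker recording. A sketch, assuming a simple lookup table from phrase to audio URL (the map structure and URL below are placeholders, not busuu’s actual pipeline):

```javascript
// Replace written Spanish phrases with SSML <audio> tags pointing to
// native-speaker recordings. The phrase-to-URL map is an assumed structure
// for illustration; the URL is a placeholder.
const recordings = {
  'buenas noches': 'https://example.com/audio/buenasnoches.wav'
};

function toSsml(text) {
  let out = text;
  for (const [phrase, url] of Object.entries(recordings)) {
    // Swap every occurrence of the written phrase for its recorded audio
    out = out.split(phrase).join(`<audio src="${url}"></audio>`);
  }
  // Wrap the result in a <speak> root so the Assistant treats it as SSML
  return `<speak>${out}</speak>`;
}

console.log(toSsml('"Good night" is buenas noches'));
```

This keeps the course content itself untouched: the substitution happens on the fly, just before the response is handed to the Assistant for synthesis.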

The next difficulty was matching the user’s spoken input with an expected answer: it’s easy enough in written form to differentiate between “blond” and “blonde” or “how’s” and “how is”, but the very similar sounds cause confusion in the voice recognition and forced us to add a well-calibrated Levenshtein distance check.
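A check of this kind can be sketched as follows; the threshold of 2 edits is an arbitrary illustration, not busuu’s calibrated value:

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                    // deletion
        dp[i][j - 1] + 1,                                    // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Accept the user's transcript if it is within a small edit distance of the
// expected answer (the threshold of 2 is illustrative, not the tuned value).
function matches(transcript, expected, threshold = 2) {
  return levenshtein(transcript.toLowerCase(), expected.toLowerCase()) <= threshold;
}

console.log(matches('blond', 'blonde'));   // → true
console.log(matches('good morning', 'good night')); // → false
```

With a threshold like this, “blond” and “blonde” (one edit apart) are treated as the same answer, while genuinely different answers stay rejected.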

Example of output ready for text-to-speech synthesis by the Assistant:

<speak>Good.<break time="2s"/>"Good afternoon / Good evening" is <audio src="https://s3-eu-west-1.amazonaws.com/busuu-assistant/copy_1_1_1_e_es_4.wav"></audio><break time="2s"/>"Good night" is <audio src="https://s3-eu-west-1.amazonaws.com/busuu-assistant/buenasnoches_1462534875.wav"></audio><break time="2s"/>What's the English for <audio src="https://s3-eu-west-1.amazonaws.com/busuu-assistant/copy_1_1_1_e_es_4.wav"></audio></speak>

Publishing the agent

At the time of writing, Google has set guidelines for agent invocation and discovery within the Assistant, but has not revealed what the directory or “app store for agents” will look like. Google asks developers to provide agent descriptions and media assets as part of the publishing process, so we expect that they will come up with something soon.

Part of the agent publishing form

Going live

Actions on Google was just released and delivers conversation actions via the Google Assistant on Google Home. The busuu agent will be available in the coming days as part of the first batch of third-party apps on this platform. Owners of a Google Home device (Pixel phone and Google Allo support is due to come later) will be able to interact with busuu just by saying: “Ok Google, let me talk to busuu”. We can’t wait to see how people will explore this new way of learning a language! The possibilities opened by intelligent conversational interfaces are tremendous; this is just the beginning…
