Inside Facebook's speech recognition factory

Published Fri, Jul 7 2017, 10:00 AM EDT. Updated Fri, Jul 7 2017, 1:22 PM EDT

Key Points

Speech, video and audio technologies are coming out of Facebook's Applied Machine Learning group.
Last year, Facebook implemented a speech recognition system that responds to the words "Hey Oculus."

Joaquin Quinonero Candela, director of Facebook's Applied Machine Learning group, speaks at Facebook's 2017 F8 developer conference.

Apple has Siri and Amazon's got Alexa. Microsoft created Cortana and Alphabet launched the Google Assistant.

The technology giants are racing to bring speech recognition to consumers through a host of mass-market devices and apps. But one company has been curiously absent: Facebook.

While Mark Zuckerberg has pushed his app across the globe — last week it crossed 2 billion active users to go along with 1.2 billion people on chat service WhatsApp — Facebook has lagged behind its rivals when it comes to voice control.

There's too much at stake for the company to stand still. Research firm Markets and Markets predicted last year that the speech recognition market would reach $10 billion by 2022. Beyond the money, internet companies need consumers using their speech tools so they accumulate even more data in order to improve accuracy.

Device makers have a big advantage in pushing out voice technology because they have direct access to consumers. Unlike Apple, Amazon and Alphabet, Facebook doesn't have a piece of hardware or a mobile operating system that's in many millions of people's pockets or homes.

The closest thing Facebook has in terms of hardware is Oculus, the virtual reality headset maker Zuckerberg snapped up for $2 billion in 2014. As Facebook seeks to make headway into speech recognition, Oculus is one of its testing grounds.

A participant demos Oculus VR glasses at CES 2016 in Las Vegas.

Justin Solomon | CNBC

Here's the idea: While wearing the headset, you can say "Hey Oculus" and get a response to a request. For example, you can ask to recenter the view, open a particular game or search the app store. The technology works on the Oculus Rift and the Samsung Gear VR, which is powered by Oculus.

"To explore any interesting hands-free interfaces, you will definitely need speech," Joaquin Quinonero Candela, the head of Facebook's Applied Machine Learning group, said in an interview last week at the company's Silicon Valley headquarters.

Facebook's use of speech recognition goes well beyond Oculus. The company also deployed a system for automatically generating captions for certain videos. And more speech-enabled products are on the way.

Facebook made a clear jump into artificial intelligence when it hired Yann LeCun from New York University in 2013. LeCun is a longtime academic in machine learning who was tapped to lead the new Facebook Artificial Intelligence Research group.

Push and pull

Within a few months, Facebook engineers were taking products developed in LeCun's research group and getting them ready for wider use. That process was formalized with the establishment of the Applied Machine Learning group in September 2015, under the leadership of Candela, a veteran of Microsoft Research who had arrived at Facebook three years earlier.

Facebook has kept its progress in speech recognition largely under wraps, even as Alphabet, Apple and Microsoft have been touting improvements in the accuracy of their systems in recent years.

Candela said his group began working on speech enhancements about 2½ years ago, with help from Jibbigo, a start-up Facebook acquired in 2013.

Facebook's research and development, Candela said, falls into two categories: push and pull. Push involves betting that some capability will be useful in any number of ways in the future and then setting out to create it, while pull is when engineers are asking for a new feature to be built internally.

Speech was squarely in the push category.

One use case the researchers came up with was to automatically generate captions of videos, something Google started doing for YouTube videos in 2009.

'Looked for a problem'

Facebook initially focused on ads. The rationale was that at the time people were typically scrolling through their feeds with the sound off, so for advertisers to get their message across they needed text to run inside their video ads.

"We looked for a problem space in the speech recognition area through which we could deliver value to users," said Reena Philip, an engineering manager for Facebook's speech infrastructure group. Joining forces with the ads team, "we collaborated closely on a prototype," she said.

The feature launched in the second quarter of 2016. Facebook then went deeper with the technology, automatically generating subtitles for videos that organizations posted to their dedicated pages in U.S. English.

"We did experiments — we got a double-digit increase in engagement with videos if we captioned them," Candela said.

Unlike the video captioning system, the Oculus speech recognition feature was a case of pull: an internal request that set Candela's group to work.

On the Gear VR, apps and games became difficult to find as more were made available. Saying a name out loud became a viable alternative for hunting down something specific. The bigger challenge was related to titles that eschewed normal English words, such as Vrideo.

The Oculus Voice user interface you see when using Facebook's Oculus Rift virtual reality headset.

"'Lucky's Tale' is probably something we could do fine on," said Merlyn Deng, a Facebook product manager, referring to a game that comes bundled with the Oculus Rift.

But Philip, who worked on Amazon Alexa before joining Facebook in 2015, said non-English words in titles are "very typical."

Facebook also had to make sure that Samsung smartphones hooked into Gear VR headsets would respond to the words "Hey Oculus." Unlike the Oculus Rift headset, a Gear VR is just powered by a mobile phone and doesn't have a beefy computing system behind it.

"The footprint can only handle so much space, but it's getting better and better," Philip said. Apple and Google have found ways to squish voice activation into phones, and now Facebook has followed.

When speech recognition launched in Oculus in the fourth quarter of 2016, it only worked for U.S. English, but the team has since added support for more English dialects, Philip said.

Over time, Facebook could make the Oculus speech recognition technology work offline, Philip said. The company also may eventually support languages other than English.

Beyond that, Facebook employees weren't specific about exactly where the company is headed.

At a high level, said Deng, "we want to build a deep semantic understanding of people's interests, and also of content."

It's possible to guess about what could come next if you think about where Facebook excels. The company has data about your interests, your friends' interests and their friends' interests. It has users' pictures, videos and text posts, along with articles and other content that people have shared on the social network over the years.

"Other voice assistants may be geared toward what they have data for," Deng said.

Today, Facebook is all about community. That's another good guideline to consider when you imagine what sorts of voice-activated experiences Facebook might decide are worth pursuing.

"The stuff we would attempt to do has to be lined up with the mission and also the data that we have here," Deng said.