The use of voice assistants is steadily increasing. However, the undisputed benefits and convenience they have brought to our lives are overshadowed by growing concerns about the privacy and security of captured user data and the risks of cloud storage, which have sparked a discussion on a global scale. But just imagine a voice assistant that could meet all your data security requirements and integrate easily into your systems. Such features would enormously increase its chances of customer acceptance. And intive has the winning solution!
Everyone is familiar with voice assistants: in many households, Alexa is practically a member of the family, and whenever kids nowadays want to hear their favorite tunes, all they need to do is ask, “Alexa, play …”. Every iPhone user knows Siri, and every Android phone user likely knows the Google Assistant. And, of course, the list goes on: there is Samsung’s Bixby, Microsoft’s Cortana, Huawei’s HiVoice, etc. However, to be fair, 97% of all voice assistant users choose one of the big three: Alexa, Siri, or Google Assistant.
Voice assistants are integrated into smart devices, like Alexa in the Amazon Echo and Echo Show, Apple’s Siri in the HomePod, or Google Assistant in Google Home. They are also usually compatible with other digital equipment, such as smart TVs, mobile phones, etc. Moreover, Alexa’s functionality, for instance, can be extended not only by Amazon itself (through functions called “Alexa skills”) but also through skills provided by third parties (e.g., radio station x provides the skill “Alexa, play [radio station x]” in Amazon’s skill store).
And, last but certainly not least, there is the automotive industry. Nearly every car company either provides its own voice assistant, offers an interface for Google Assistant and/or Siri, or integrates Alexa directly. Every driver of a modern Audi, Mercedes or BMW knows the “Hey [Audi | Mercedes | BMW]” activation command.
Currently, 4.2 billion digital assistants are in use worldwide, and this number is predicted to double by 2024.
What has made it all possible? The answer is short and simple: AI (Artificial Intelligence). The rise of the voice assistant coincided with the rise of AI in the 2010s and received a massive boost in 2011 with the release of Apple’s Siri, a huge commercial and marketing success. AI – or, to be more precise, its sub-field Machine Learning, which was gaining a lot of traction at the same time – is the main reason the popularity of voice assistants has soared. Machine Learning has not only significantly improved the recognition of single words; it has also driven progress in Natural Language Processing (NLP), which basically means that you no longer need to use specific words or build your sentences in a specific way – you can interact with the system in a natural, spontaneous manner.
However, despite their great success, Alexa, Siri, Google Assistant and other voice assistants face the same controversy today: a cloud environment offers potentially unlimited computing power at your fingertips, and all the most popular voice assistants run in their providers’ clouds – which means that all the input data is sent to those clouds.
Many users voice data privacy concerns, claiming that providers can analyze and use customer data for their own purposes, since that kind of information is of great value to the digital giants.
The challenge for car manufacturers, on the other hand, is that they have to choose between relying on a cloud-based service from one of the big players or building one of their own – either of which may cause issues when no mobile network is available.
intive’s own voice assistant can run locally on a PC or on embedded devices while maintaining great quality, since it is based on state-of-the-art Deep Learning algorithms. It can even run in the cloud, e.g., on a private cloud or with a trusted cloud partner – but with the big difference that the input data sent to the cloud can be encrypted and is visible only to our client (the client who uses our voice assistant as part of their service or app). It is also possible to run our assistant in a hybrid mode: in the cloud when a mobile network is available, and locally otherwise.
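The hybrid mode described above boils down to a simple backend-selection policy. The sketch below illustrates the idea only – the function names are hypothetical placeholders, not intive’s actual API:

```python
# Illustrative sketch of a hybrid deployment: prefer the (private/trusted)
# cloud backend when a network is available, fall back to local inference
# otherwise. All names here are hypothetical.

def transcribe_cloud(audio: bytes) -> str:
    # Placeholder: a real deployment would send *encrypted* audio
    # to the client's private cloud and return the transcript.
    return "cloud transcript"

def transcribe_local(audio: bytes) -> str:
    # Placeholder for the on-device Deep Learning model.
    return "local transcript"

def transcribe(audio: bytes, network_available: bool) -> str:
    # The policy itself: pick the backend based on connectivity.
    backend = transcribe_cloud if network_available else transcribe_local
    return backend(audio)

print(transcribe(b"...", network_available=False))   # local transcript
```

In a standalone (offline-only) deployment, the same code simply always takes the local branch.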
Voice assistants make use of several components. Usually, the interaction starts with a wake word. The voice assistant does not try to understand what the user is saying until the wake word has been recognized – “Alexa”, “Hey Audi”, and so on. The next step is Speech Recognition, where the system – to put it simply – converts an incoming audio stream into written text. Finally, the system tries to understand your sentence with Natural Language Processing (NLP). All these steps are based on Machine Learning / Deep Learning.
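The three-stage pipeline just described – wake word gate, speech recognition, NLP – can be sketched roughly as follows. The components are stubbed out and all names are hypothetical; real systems run Deep Learning models at each stage:

```python
# Minimal sketch of a voice assistant pipeline (illustrative only).

WAKE_WORD = "hey assistant"

def detect_wake_word(transcript: str) -> bool:
    # Stage 1: a lightweight detector gates the rest of the pipeline.
    return transcript.lower().startswith(WAKE_WORD)

def speech_to_text(audio_chunk: bytes) -> str:
    # Stage 2: Speech Recognition converts audio into written text.
    # Stubbed here; a real system would run an acoustic model.
    return audio_chunk.decode("utf-8")

def extract_intent(text: str) -> str:
    # Stage 3: NLP maps the free-form sentence onto a known intent.
    lowered = text.lower()
    if "gas station" in lowered:
        return "navigate_to_gas_station"
    if "play" in lowered:
        return "play_media"
    return "unknown"

def handle_utterance(audio_chunk: bytes) -> str:
    text = speech_to_text(audio_chunk)
    if not detect_wake_word(text):
        return "ignored"   # nothing happens until the wake word is heard
    command = text[len(WAKE_WORD):].strip(" ,")
    return extract_intent(command)

print(handle_utterance(b"Hey assistant, drive me to the next gas station"))
# -> navigate_to_gas_station
```

Note that the keyword matching in `extract_intent` is a deliberate oversimplification – the whole point of ML-based NLP is to replace such brittle rules with a learned model.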
On a side note, Deep Learning is a subset of Machine Learning comprising state-of-the-art algorithms based on Deep Neural Networks. Machine Learning, in turn, is a subset of AI. AI contains algorithms that can seem intelligent, while Machine Learning is the subset of algorithms that can actually learn from the provided data and adapt their behavior without the help of a human, based on what they already “know” (not all AI algorithms work that way, though).
When the system has recognized your command, it can perform an action that has been predefined in the system, e.g., if you tell the system “Drive me to the nearest gas station”, the car will drive you there – assuming autonomous driving is a feature available in your car.
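This mapping from a recognized intent to a predefined action is often just a dispatch table. A minimal sketch, with made-up intent names and stubbed actions standing in for real vehicle or media commands:

```python
# Illustrative intent-to-action dispatch table (hypothetical names).

ACTIONS = {
    "navigate_to_gas_station": lambda: "route set to nearest gas station",
    "play_media": lambda: "starting playback",
}

def perform(intent: str) -> str:
    # Look up the predefined action for this intent; fall back gracefully
    # if the intent is unknown to the system.
    action = ACTIONS.get(intent)
    return action() if action else "sorry, I did not understand that"

print(perform("navigate_to_gas_station"))   # route set to nearest gas station
```

Because the table is plain data, a customer can extend the assistant with new actions without touching the recognition pipeline itself.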
By the way, the wake word detection itself technically already makes use of Machine Learning, Speech Recognition and Natural Language Processing – but on a much smaller scale, designed to detect the wake word only.
Machine Learning applications have a big advantage over others: they are able to generalize from the input data they learn from. If a Machine Learning system learns to tell the difference between cats and dogs on a specific dataset, it is later able to distinguish between any cats and dogs, even ones it has never seen before (i.e., that were not part of the original dataset used for training).
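The cats-and-dogs generalization idea can be demonstrated with a deliberately tiny classifier. The sketch below uses a nearest-centroid rule on made-up two-dimensional feature values – nothing like a real Deep Learning model, but it shows how a model fitted on a handful of examples still classifies points it has never seen:

```python
# Toy illustration of generalization (made-up feature values).

def centroid(points):
    # Mean point of a list of (x, y) feature pairs.
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def train(dataset):
    # dataset: {label: [(feature1, feature2), ...]}
    return {label: centroid(pts) for label, pts in dataset.items()}

def predict(model, point):
    # Assign the label whose centroid is closest to the new point.
    def dist2(label):
        cx, cy = model[label]
        return (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return min(model, key=dist2)

# Training data: e.g. (ear length, snout length) in made-up units.
training = {
    "cat": [(3.0, 2.0), (3.2, 2.2), (2.8, 1.9)],
    "dog": [(6.0, 5.0), (6.5, 5.5), (5.8, 4.8)],
}
model = train(training)

# A point that was never in the training set is still classified correctly:
print(predict(model, (3.1, 2.1)))   # cat
```

The same principle, scaled up to deep networks and high-dimensional audio features, is what lets a voice assistant understand speakers and phrasings it was never explicitly trained on.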
The voice assistant operates on the same principle. Its speech recognition is of such outstanding quality because it learns the meaning of commands (whether it does in fact “learn” is a philosophical question, but that’s a discussion for another day) instead of just a fixed set of expressions. It is trained on general language data to understand the language itself, and then learns more specifically from use-case-relevant data. We can even include typical background noises in the learning process, allowing the system to take them into account as well and enhancing the overall quality of our system even further. Imagine a voice assistant that can identify and ignore interfering vehicle noises, massively reducing recognition errors!
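Including background noise in training is commonly done via data augmentation: mixing recorded noise into clean utterances so the model sees realistic, noisy inputs. A minimal sketch of the idea, operating on plain lists of sample values rather than real waveforms, with hypothetical function names:

```python
# Sketch of noise augmentation for training data (illustrative only;
# real systems operate on sampled waveforms or spectrograms).

import random

def mix_with_noise(clean, noise, noise_level=0.2):
    # Overlay a random slice of the noise track onto the clean signal.
    start = random.randrange(0, len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]
    return [c + noise_level * n for c, n in zip(clean, segment)]

def augment(dataset, noise, copies=3):
    # Each clean utterance yields several noisy variants for training.
    augmented = []
    for clean in dataset:
        augmented.append(clean)
        augmented.extend(mix_with_noise(clean, noise) for _ in range(copies))
    return augmented

clean_utterances = [[0.0, 0.5, -0.5, 0.3], [0.1, -0.2, 0.4, 0.0]]
engine_noise = [0.05 * ((-1) ** i) for i in range(16)]   # stand-in for cabin noise
training_set = augment(clean_utterances, engine_noise)
print(len(training_set))   # 2 originals + 2 * 3 noisy variants = 8
```

A model trained on such augmented data learns to treat the noise as irrelevant – which is exactly the “ignore the engine” behavior described above.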
What is seen last is remembered best: here is a short summary of the qualities that make the intive voice assistant unique and superior to other solutions on the market:
It is a standalone solution, running locally and offline on a device – no cloud or internet connection needed. It can also run in the cloud or in a hybrid mode if required.
Our voice assistant is built as an easily and highly adaptable framework – not a “one size fits all” approach, but a combination of reusable modules that can be fine-tuned to our customers’ needs, e.g., supporting different languages, a custom wake word, robustness to usage-dependent background noises, and training on data typical for the required use cases.
Interaction feels very natural – the system does not expect predefined commands but understands diverse phrasings and expressions.
The big overall advantage for our customers is that they get a voice assistant that respects data privacy – which increases user acceptance – and one that can be easily integrated into their services, adapting perfectly to their use case to guarantee high-quality performance.