A simple request, a short sentence: for the human brain, interpreting what is meant, making the connection and initiating an appropriate reaction is child's play. For a machine, this is much more complicated. Controlling technical devices with speech requires many individual steps.

Detecting and interpreting speech

"Give me a pen!" may be a very simple command, but it makes the computer work hard in the background. First, the spoken sentence is turned into text. To identify the words by their frequency patterns, the speech recognition software must overcome many challenges: unclear pronunciation, similar-sounding words with different meanings, and different intonations or dialects. By comparing the incoming patterns with extensive databases, in which countless examples of words and their frequency patterns are stored, the software works out what the words are.
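
The comparison against stored examples can be illustrated with a deliberately tiny sketch. It assumes that each word has already been reduced to a short list of numbers standing in for its frequency pattern. The `TEMPLATES` database, its values and the `recognise` function are invented for illustration. Real recognisers work on far richer acoustic features and statistical models, so this shows only the nearest-example idea.

```python
import math

# Hypothetical reference database: word -> stored example patterns.
# The three-number vectors are made-up stand-ins for frequency patterns.
TEMPLATES = {
    "pen": [[0.9, 0.2, 0.1], [0.8, 0.3, 0.1]],
    "pan": [[0.7, 0.6, 0.1]],
    "give": [[0.1, 0.4, 0.9]],
}


def recognise(pattern):
    """Return the word whose stored example lies closest to the pattern."""
    best_word, best_dist = None, math.inf
    for word, examples in TEMPLATES.items():
        for example in examples:
            # Euclidean distance between the heard and the stored pattern
            dist = math.dist(pattern, example)
            if dist < best_dist:
                best_word, best_dist = word, dist
    return best_word


print(recognise([0.85, 0.25, 0.1]))
```

Storing several examples per word, as for "pen" here, is one simple way to tolerate variation in pronunciation.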

The next step is working out the meaning of the sentence. To do this, the software sends the text to a language interface that checks it for certain keywords. Beforehand, the programmer must define all the necessary terms and commands (called intents) as well as their synonyms, and specify which action lies behind each of them. For example, ‘give’ is identified as a request to transport an object to a particular place, whilst the word ‘me’ is understood to be a person or a target of the action.
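
A minimal sketch of such keyword checking might look like the following. The intent names, synonym sets and the `interpret` function are assumptions made for illustration, not the interface of any particular product.

```python
# Hypothetical tables a programmer would define beforehand.
INTENTS = {
    "transport_object": {"give", "hand", "pass", "bring"},  # synonyms
    "stop_motion": {"stop", "halt", "freeze"},
}
TARGETS = {"me": "speaker", "him": "third_person", "her": "third_person"}
OBJECTS = {"pen", "pencil", "cup"}


def interpret(text: str) -> dict:
    """Scan the text for known keywords and collect what was found."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    result = {"intent": None, "target": None, "object": None}
    for word in words:
        for intent, keywords in INTENTS.items():
            # Keep only the first intent keyword that appears
            if word in keywords and result["intent"] is None:
                result["intent"] = intent
        if word in TARGETS and result["target"] is None:
            result["target"] = TARGETS[word]
        if word in OBJECTS and result["object"] is None:
            result["object"] = word
    return result


print(interpret("Give me a pen!"))
```

For the example sentence, this yields the intent `transport_object`, the target `speaker` and the object `pen`. Anything the tables do not cover stays `None`, which is why the terms and synonyms have to be defined in advance.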

Artificial intelligence finds the optimal solution

Once the interface has identified the meaning of the sentence, it supplies a context object: a piece of software code that the device control system can work with. To give the machine a clear instruction, the artificial intelligence now gets to work using further software. This software evaluates the content of the context object and, at the same time, gathers information from various sensors about the position of the device and its surroundings. It contains modules for different solutions, each of which is assigned to certain actions.

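The program combines all this information into a command (for example, how and in which direction a robot arm has to move) and sends it to the device controller. Based on the command, the sensor technology determines exactly where the pen is on the desk and which route the machine needs to take to pick it up and hand it to a person. The software gradually learns which solutions suit which actions best and applies them the next time the action is performed.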
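
The step from context object to command could be sketched roughly as follows. It assumes that the context object is a plain dictionary and that the sensors report positions as (x, y, z) coordinates. The `Command` class, the `MODULES` table and all values are hypothetical, and the learning of better solutions over time is left out.

```python
from dataclasses import dataclass


@dataclass
class Command:
    """Hypothetical instruction passed on to the device controller."""
    action: str
    pick_at: tuple
    place_at: tuple


def plan_transport(context: dict, sensors: dict) -> Command:
    """Solution module for the 'transport_object' intent."""
    pick_at = sensors["objects"][context["object"]]   # where the object is
    place_at = sensors["people"][context["target"]]   # where it should go
    return Command("pick_and_place", pick_at, place_at)


# Each solution module is assigned to the action it can handle.
MODULES = {"transport_object": plan_transport}


def build_command(context: dict, sensors: dict) -> Command:
    module = MODULES.get(context["intent"])
    if module is None:
        raise ValueError(f"no module for intent {context['intent']!r}")
    return module(context, sensors)


# Made-up example input: context object plus current sensor readings
context = {"intent": "transport_object", "target": "speaker", "object": "pen"}
sensors = {
    "objects": {"pen": (0.42, 0.10, 0.02)},
    "people": {"speaker": (0.10, 0.55, 0.30)},
}
command = build_command(context, sensors)
print(command)
```

Keeping the modules in a lookup table means a new action can be supported by registering one more function, without changing `build_command`.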

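All of these complex processes have to be completed in a split second, because people expect devices to respond quickly and correctly. After thirty years of practical use, speech recognition is now relatively reliable. However, a great deal of research and development is still needed in machine voice control before people can talk to machines as naturally as they do to their neighbours.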