As always, we are grateful to everyone who participated in yesterday’s expert connect!
There were a couple of items that we did not get to during the event, so, like last time, we wanted to provide answers to those questions here.
Thank you again for your time and participation yesterday!
We had a couple of questions asking about the process of internationalization and Jibo's conversational technology, so I wanted to make sure we provided answers that cover those questions:
We understand that expanding Jibo's presence and capabilities outside the US and Canada is extremely important to both our consumers and developers around the globe, particularly when it comes to Jibo's conversational technologies and adding new languages.
Conversational technologies are only one aspect of the complex task of internationalization. With that in mind, we are focused on delivering Jibo to our Indiegogo backers in the US/Canada and will have more details about our internationalization plans after those Jibos are delivered.
Q: How accurate is Jibo’s NLU? Does Jibo have any advantages in this area, e.g. does turning toward sound improve accuracy, could Jibo’s cameras be used in the future to provide more accurate NLU by watching the speaker’s lips, etc.?
A: As Roberto mentioned during his talk, we are using the most advanced technologies currently available to develop Jibo's conversational capabilities. As with any device using these technologies, there are outside factors that may impact Jibo's ability to understand the speaker, such as the acoustics of the room, background noise, and how loudly the user is speaking.
Q: Since many dozens or hundreds of skills might be installed on Jibo, can you please describe how the system would avoid collisions between listeners? Which skill would respond first if multiple skills are listening for "pizza"?
A: This is a common challenge in speech recognition, and there are several options for addressing it when it comes to handling the third-party skills that will be available via Jibo’s skill store. Our conversational technologies and research teams are in the process of evaluating those options.
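Since the specific approach is still being evaluated, here is a minimal TypeScript sketch, purely as an illustration, of one common way platforms handle this kind of collision: each skill's listener reports a confidence score for the utterance, the highest score wins, and a tie is resolved by asking the user which skill they meant. All of the names here (SkillListener, ListenerRegistry, and so on) are hypothetical and are not part of the Jibo SDK.

```typescript
// Hypothetical sketch only -- Jibo has not announced how listener collisions
// will be resolved. This just illustrates one common approach: score each
// candidate skill and ask the user to disambiguate when there is a tie.

interface SkillListener {
  skillName: string;                 // e.g. "PizzaOrdering" or "PizzaRecipes"
  keywords: string[];                // phrases the skill listens for
  score(utterance: string): number;  // 0..1 confidence that this skill should handle it
}

class ListenerRegistry {
  private listeners: SkillListener[] = [];

  register(listener: SkillListener): void {
    this.listeners.push(listener);
  }

  // Return the single best-scoring skill, or all tied candidates so the caller
  // can prompt the user ("Did you want to order a pizza, or find a recipe?").
  dispatch(utterance: string): { winner?: SkillListener; candidates: SkillListener[] } {
    const scored = this.listeners
      .map((l) => ({ l, s: l.score(utterance) }))
      .filter(({ s }) => s > 0)
      .sort((a, b) => b.s - a.s);

    if (scored.length === 0) return { candidates: [] };

    const top = scored[0].s;
    const candidates = scored.filter(({ s }) => s === top).map(({ l }) => l);
    return candidates.length === 1
      ? { winner: candidates[0], candidates }
      : { candidates };
  }
}
```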
Q: What will Jibo have that can help users be mindful of keywords? For example, in "Hey Jibo, turn on the kitchen lights," the device keyword is "kitchen lights." If the user doesn't say both words, what will Jibo do to help?
A: This is something we have been working on, and it will be addressed in the next SDK release using Multimodal Interaction Modules (MIMs). You can learn more about MIMs in our developer blog post on the subject. MIMs allow you to develop error handling within your skill, so Jibo can prompt users toward correct utterances if he is not hearing a match. If a user continues to have trouble with the correct utterance, Jibo can also automatically generate Graphical User Interface (GUI) controls that can be used as an alternative to voice interaction.
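To make that flow concrete, here is a minimal sketch of the reprompt-then-GUI fallback pattern described above. It is an illustration only, assuming placeholder callbacks for listening, speaking, and drawing buttons; the function and type names (matchWithFallback, ShowButtonsFn, etc.) are hypothetical and are not the actual MIM API, which is documented in the developer blog post mentioned above.

```typescript
// Hypothetical illustration only -- not the MIM API. It sketches the behavior
// described above: reprompt when no keyword match is heard, then fall back to
// on-screen buttons after repeated misses.

type ListenFn = () => Promise<string>;                        // returns the user's utterance
type PromptFn = (text: string) => Promise<void>;              // Jibo speaks a prompt
type ShowButtonsFn = (options: string[]) => Promise<string>;  // GUI fallback, returns the tapped option

async function matchWithFallback(
  keywords: string[],          // e.g. ["kitchen lights", "bedroom lights"]
  listen: ListenFn,
  prompt: PromptFn,
  showButtons: ShowButtonsFn,
  maxRetries = 2
): Promise<string> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const heard = (await listen()).toLowerCase();
    const match = keywords.find((k) => heard.includes(k));
    if (match) return match;
    // No match: steer the user toward a correct utterance.
    await prompt(`I didn't catch that. You can say things like "${keywords[0]}".`);
  }
  // Still no match: offer GUI controls as an alternative to voice.
  return showButtons(keywords);
}
```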
Lastly, as promised, here are the audio files showing the progress of Jibo's TTS, which Roberto demonstrated at 11:37 in the connect stream.
Very first Jibo voice:
Human Voice Used for Modeling:
Original TTS voice derived from modeling the above human voice:
TTS after further progress and work:
TTS with adjusted algorithm and pitch changed to make the voice more youthful:
TTS with added adjustment to allow for emotion in the voice: