February Expert Connects with Roberto Pieraccini

This month’s Jibo Expert Connects will be with Roberto Pieraccini, who heads our Advanced Conversational Technologies team. Roberto is a scientist, technologist, and author of The Voice in the Machine. He has actively contributed to speech research and technology since 1981, and his experience ranges from corporate research (AT&T Labs, IBM Research) to executive-level positions at both startups and industry leaders such as SpeechWorks, SpeechCycle, and (of course) Jibo!

Roberto and his team are responsible for Jibo’s speech, language and dialog capabilities, as well as the integration with other perceptual input/output channels for the platform. We’re very happy to have him on the Jibo team, and look forward to introducing him to our community of developers.

Roberto will provide a history and overview of speech technologies, designed to give you a high-level understanding of the technology as a whole, as well as insights into designing and building speech capabilities for Jibo.

This session is scheduled for Monday, February 13 at 1pm ET. We will publish the YouTube live stream link here in the forum and send it via email on the day of the event.

We invite you to reply to this thread to post your questions for Roberto in advance, and we will work to address them as part of the session.

Because we want to get to all of your excellent questions on the topic of speech, we ask that you send any general questions about Jibo or your Jibo account to support@jibo.com. This will allow Roberto and our moderators to focus in on the topic at hand during the live stream.

See you Monday!


Thinking about today’s Expert Connect, I have 3 questions for input.

  1. Is Jibo Inc. working on collaborations with existing speech systems (Google Voice, for example, speaks and understands Dutch), or is everything developed in house?

  2. Can we as a developer community contribute in any way to make him speak other languages sooner?

  3. Thinking of the other public places where I will introduce Jibo here, there are some places where there is not much interaction with Jibo speech-wise. It would be easier to deliver in those places and tackle the speech difficulty we’re facing right now.

I look forward to the Expert Connect today! CU there…

Sander Huisman @ Minute Bar



When developing content to be spoken or speech rules, how can we determine what words or sounds are supported? Following that, how can we teach or inform Jibo how to recognize or speak these unsupported words or sounds?

Example: speaking, or creating speech rules to recognize, solfeggio, “fee fi foo fum”, letters of the alphabet, Latin phrases, and possibly domain vocabularies, e.g. medical terms supporting health or education related skills.

FYI: The EU may require robots to have an off switch. In anticipation, how can I build a speech rule to power off Jibo with “Klaatu barada nikto” (or was that necktie? ;)

I wonder about how to recognize foreign words. For example, if anyone is working on a music related skill, then the challenge is speaking and recognizing song titles in other languages. I have a German-speaking “intelligent speaker”, and searching for music with foreign titles is challenging. I often try to speak English song titles in a strongly accented German. One Mexican friend speaking German cannot even request a Spanish song. My guess is that one day a user must inform the system of his/her native language…

See you all at 1pm ET today. Here is the YouTube live stream link.


Since many dozens or hundreds of skills might be installed into Jibo, can you please describe how the system would avoid collision of listeners? Which would respond first if multiple skills are listening for “pizza”?

Hi guys

What will Jibo have that can help users to be mindful of keywords, e.g. “Hey Jibo turn on kitchen lights”?

The device keyword is “kitchen lights”. If the user doesn’t say both words, what will Jibo do to help?

Also, how will Jibo help the user to remember all keywords?

John Marshall

…BTW great live discussion and very informative! Well done!

John Marshall

I had a meeting conflict :frowning:
Can you post a link to the video?

@john I would think one should be able to ask something like:

“Hey Jibo do you know about kitchen?”

and it would say something like “Yes I have 10 skills about kitchen” or “I don’t know about kitchen”

Conceivably, you could also make a direct inquiry such as:

“Hey Jibo, what do you know about pizza?”

and it would say something like “I know how to order a pizza, make a pizza, and 3 other skills about pizza”

The VUI has to be discoverable by voice. Dragging your fingers through menus across his “face” would definitely be somewhat uncool and anti Jibo-ish.

This would be considerably more difficult when Jibo has hundreds of skills it knows about. Consider this reply:

I know about kittens, story telling, playing peek-a-boo and four hundred twenty three other skills ranging from affairs of the heart through Zanzibar vacation rentals. Would you like a complete listing?


Fantastic idea. That could be built using tag words in our skills and a required universal description of each skill for outside inquiries.

There is a huge issue in the industry with activating skills from memory, and this sounds like a great start to filling a gap that all other systems have left unsolved.
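To make the tag-word idea concrete, here is a minimal sketch in Python. It is not based on the Jibo SDK: the `Skill` record, its `tags` and `description` fields, and the `describe` helper are all hypothetical names made up for illustration. It only shows how tag words could drive a spoken "what do you know about X?" reply that names a few skills and counts the rest.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical skill record; not part of any real Jibo API."""
    name: str            # short spoken name, e.g. "order a pizza"
    description: str     # the "universal description" for outside inquiries
    tags: tuple = ()     # tag words used for discovery


def skills_about(skills, topic):
    """Return the skills whose tag words include the topic (case-insensitive)."""
    topic = topic.lower()
    return [s for s in skills if topic in {t.lower() for t in s.tags}]


def describe(skills, topic, max_named=2):
    """Build a spoken reply that names a few matching skills and counts the rest."""
    matches = skills_about(skills, topic)
    if not matches:
        return f"I don't know about {topic}."
    named = [s.name for s in matches[:max_named]]
    rest = len(matches) - len(named)
    if rest > 0:
        plural = "skill" if rest == 1 else "skills"
        return f"I know how to {', '.join(named)}, and {rest} other {plural} about {topic}."
    return f"I know how to {' and '.join(named)}."


installed = [
    Skill("order a pizza", "Orders pizza for delivery.", ("pizza", "food")),
    Skill("make a pizza", "Walks through a pizza recipe.", ("pizza", "cooking")),
    Skill("find pizza coupons", "Looks up pizza deals.", ("pizza",)),
    Skill("tell a story", "Reads a bedtime story.", ("stories",)),
]
print(describe(installed, "pizza"))
print(describe(installed, "kitchen"))
```

Capping the number of named skills with `max_named` is one way to keep the reply short once dozens of skills share a tag.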

@alfarmer There has to be a way to ask the system what kinds of skills it knows. This is not a problem with 5 skills, but as the number climbs up into the dozens and more, people will forget what is available - myself included.

More than likely, as is the case for people having Amazon Echos, after the initial thrill is gone, the appliance just sits there and is only used for things like time, temperature, weather, setting timers and the like. My coworker has an Echo, and when I asked him, he said he does not really use it that much anymore. I guess the novelty wore off.

N-Joy Roberto’s speech information @ https://www.youtube.com/watch?v=SU4WEID77Ic

You can access the recording of this session (and our other recordings) at the same link provided earlier in this thread; that link does not expire.

Alternatively, you can also review the recording of Roberto’s session here.

Hello everyone,

As always, we are grateful to everyone that participated in yesterday’s expert connect!

There were a couple of items that we did not get to during the event so, like last time, we wanted to provide answers to those questions here.

Thank you again for your time and participation yesterday!

We had a couple of questions asking about the process of internationalization and Jibo’s conversational technology so I wanted to make sure that we provided feedback that covers those questions:

We understand expanding Jibo’s presence and capabilities outside the US and Canada is extremely important to both our consumers and developers around the globe, particularly when it comes to Jibo’s conversational technologies and adding new languages.

Conversational technologies are only one aspect of the complex task of internationalization. With that in mind, we are focused on delivering Jibo to our Indiegogo backers in the US/Canada and will have more details about our internationalization after those Jibos are delivered.

Q: How accurate is Jibo’s NLU? Does Jibo have any advantages in this area, e.g. does turning toward sound improve accuracy, could Jibo’s cameras be used in the future to provide more accurate NLU by watching the speaker’s lips, etc.?

A: As Roberto mentioned during his talk, we are using the most advanced technologies currently available to develop Jibo’s conversational capabilities. As with any device using these technologies, there are outside factors that may impact Jibo’s ability to understand the speaker, like the acoustics of a room, background noise, and how loudly the user is speaking, for example.

Q: Since many dozens or hundreds of skills might be installed into Jibo, can you please describe how the system would avoid collision of listeners? Which would respond first if multiple skills are listening for “pizza”?

A: This is a common challenge in speech recognition, and there are several options for addressing it when it comes to the handling of third party skills that will be available via Jibo’s skill store. Our conversational technologies and research teams are in the process of evaluating those options.
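For readers wondering what one such option might look like, here is a small Python sketch. It is purely an illustration and not how Jibo resolves collisions (the answer above says the options are still being evaluated). The `resolve` function, the listener table, and the rule "the active skill wins, otherwise ask the user" are all assumptions.

```python
def resolve(utterance, listeners, active_skill=None):
    """Decide which skill should handle an utterance.

    listeners maps a (hypothetical) skill name to its set of trigger words.
    Returns (action, skill_names), where action is one of
    "none", "dispatch", or "ask_user".
    """
    words = set(utterance.lower().split())
    # Every skill with at least one trigger word in the utterance is a candidate.
    matches = sorted(name for name, triggers in listeners.items() if words & triggers)
    if not matches:
        return ("none", [])
    # Assumed rule 1: a skill that is already in the foreground keeps the turn.
    if active_skill in matches:
        return ("dispatch", [active_skill])
    if len(matches) == 1:
        return ("dispatch", matches)
    # Assumed rule 2: with several candidates, let the user pick.
    return ("ask_user", matches)


listeners = {
    "pizza_order": {"pizza", "order"},
    "pizza_recipes": {"pizza", "recipe"},
    "lights": {"lights"},
}
print(resolve("i want pizza", listeners))
print(resolve("i want pizza", listeners, active_skill="pizza_recipes"))
```

Other schemes are possible, such as ranking candidates by match confidence or by how recently the skill was used; this sketch just shows the shape of the problem.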

Q: What will Jibo have that can help users to be mindful of keywords, e.g. “Hey Jibo turn on kitchen lights”? The device keyword is “kitchen lights”. If the user doesn’t say both words, what will Jibo do to help?

A: This is something we have been working on, and that will be addressed in the next SDK release using Multimodal Interaction Modules (MIMs). You can learn more about MIMs in our developer blog post on the subject. MIMs allow you to develop error handling within your skill, so Jibo can prompt users towards correct utterances if he is not hearing a match and, if a user continues to have trouble with the correct utterance, Jibo can also automatically generate Graphical User Interface (GUI) controls that can be used as an alternative to voice interaction.
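The flow described above (prompt, re-prompt toward a correct utterance, then fall back to on-screen controls) can be sketched roughly as follows. This is a generic Python illustration and not the MIM API: `ask_with_fallback`, the `hear` callback, and the returned dictionaries are hypothetical stand-ins.

```python
def ask_with_fallback(prompt, expected, hear, max_attempts=2):
    """Ask by voice; after repeated misses, fall back to GUI choices.

    prompt:   the first question to speak.
    expected: list of utterances the skill can accept.
    hear:     callable that takes the spoken prompt and returns what the
              user said (a stand-in for real speech recognition).
    """
    for attempt in range(max_attempts):
        if attempt == 0:
            text = prompt
        else:
            # Re-prompt that steers the user toward utterances that will match.
            text = f"Sorry, you can say: {', '.join(expected)}."
        heard = hear(text).strip().lower()
        if heard in expected:
            return {"mode": "voice", "value": heard}
    # Voice did not work: offer the same options as on-screen controls.
    return {"mode": "gui", "choices": list(expected)}


devices = ["kitchen lights", "porch lights"]

# Simulated user who says only "kitchen" first, then the full keyword.
replies = iter(["kitchen", "Kitchen Lights"])
print(ask_with_fallback("Which device?", devices, lambda text: next(replies)))

# Simulated user who never produces a matching utterance.
print(ask_with_fallback("Which device?", devices, lambda text: "mumble"))
```

Passing `hear` in as a callback keeps the retry logic testable without a microphone.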

Lastly, as promised, here are the audio files that show the progress of Jibo’s TTS that Roberto demonstrated at 11:37 in the connect stream.

Very first Jibo voice:

Human Voice Used for Modeling:

Original TTS voice derived from Modeling the above human voice:

TTS after further progress and work:

TTS with adjusted algorithm and pitch changed to make the voice more youthful:

TTS with added adjustment to allow for emotion in the voice:


Further… will Jibo have the ability to provide push notifications, perhaps with a permission setting for those who don’t want the feature? For instance: “John, I have an SMS (email, Messenger, etc.). Would you like me to read it to you?” Of course, Jibo would need to have first recognized me as John.

As an Echo user this feature hasn’t been allowed.

I always wanted to hear the voice actor’s original voice and I liked that Roberto presented his voice at the Expert Connect - a very interesting voice. I was delighted to hear how Jibo’s voice evolved (is evolving). I was stunned by the higher pitched “younger” voice because I did feel differently - I certainly felt more attentive and engaged … I suppose hitting on my parental/nurturing responses :wink:
Excellent Expert Connect, many thanks to Roberto and the Jibo Team.


We will have more information in the future but Jibo will be able to deliver notifications to users.