Local STT processing of short responses to reduce latency?


I expect that skills will listen for a common set of words/phrases, and one idea is to process these short responses locally rather than in the cloud. We already have “hey jibo” processed locally.

The most obvious candidates are the factory “word rules”, such as the affirmation factory yes_no and the variations we’ve seen discussed in the forum.
Likewise, there are many other candidate words, such as command and navigation words (north, south, east, west, left, right, up, down, next, back, go, stop and all that…). Overall, these “words/short phrases” could be defined in special factories so that the usage is clear in the rule files.
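
To make the idea concrete, here is a rough sketch in Python, purely for illustration. Everything in it is an assumption on my part: `LOCAL_PHRASES`, `match_locally`, `recognize` and the "navigation" factory name are made up and are not part of the Jibo SDK, and it assumes some on-device recognizer has already produced a text guess for the utterance. It checks a short utterance against a small local vocabulary and only falls back to the cloud when nothing matches:

```python
from typing import Callable, Optional, Tuple

# Hypothetical local vocabulary: phrase -> (factory name, normalized value).
# "yes_no" is the factory mentioned above; "navigation" is an invented name.
LOCAL_PHRASES = {
    "yes": ("yes_no", "yes"),
    "no": ("yes_no", "no"),
    "north": ("navigation", "north"),
    "south": ("navigation", "south"),
    "east": ("navigation", "east"),
    "west": ("navigation", "west"),
    "left": ("navigation", "left"),
    "right": ("navigation", "right"),
    "go": ("navigation", "go"),
    "stop": ("navigation", "stop"),
}


def match_locally(utterance: str) -> Optional[Tuple[str, str]]:
    """Return (factory, value) if the utterance is a known short phrase."""
    # Lowercase, collapse whitespace, and drop trailing punctuation.
    key = " ".join(utterance.lower().split()).strip(".,!?")
    return LOCAL_PHRASES.get(key)


def recognize(utterance: str,
              cloud_stt: Callable[[str], Tuple[str, str]]) -> Tuple[str, str]:
    """Try the local vocabulary first; call the cloud only on a miss."""
    local = match_locally(utterance)
    if local is not None:
        return local
    return cloud_stt(utterance)
```

The point is just that the fast path never leaves the device for the handful of words where I expect the quickest response.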

As I design skills that expect the user to say navigation commands, affirmations (yes, no, …), etc., I realize that I have a higher expectation of responsiveness when using these particular words.

Anyways, I assume it’s not so simple, but just suggesting…



I had read a couple of years ago that Jibo was using the TrulyNatural speech-recognition engine from Sensory in this article. Is this still the case? If so, it seems like having offline speech recognition might be more readily available, especially for the typical use cases Bob mentioned above.



We don’t have plans at this time to support local speech recognition for phrases beyond Hey Jibo. Our team is aware this is something you would like to see expanded in the future to include factory rules, such as yes_no. I will also let the team know you would like to see additional factory rules for commands and navigation.



Is it possible to open access to the SSML (Speech Synthesis Markup Language) elements in the new Jibo SDK, like ‘prosody’, ‘emphasis’ and ‘break’?

I am interested in testing if I can get Jibo to sing.
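
For reference, this is the kind of standard W3C SSML I have in mind, built as a string in Python. It is only a sketch: I don't know whether Jibo's TTS accepts any of these elements, `sung_line` is just a name I made up, and the `<speak>` wrapper is simplified (no version or namespace attributes).

```python
from xml.sax.saxutils import escape


def sung_line(text: str, pitch: str = "+20%", rate: str = "slow",
              pause_ms: int = 300) -> str:
    """Wrap one line of text in prosody/emphasis and follow it with a break."""
    body = (
        f'<prosody pitch="{pitch}" rate="{rate}">'
        f'<emphasis level="strong">{escape(text)}</emphasis>'
        f'</prosody>'
        f'<break time="{pause_ms}ms"/>'
    )
    return f"<speak>{body}</speak>"


print(sung_line("la la la"))
```

Varying the pitch and rate per line is how I would try to approximate a melody.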

Also, can you answer the question of whether Jibo uses the TrulyNatural speech-recognition engine from Sensory?



Because the information is proprietary, we are not able to discuss what speech recognition engine Jibo uses.

In answer to your question about adjusting things like prosody in Jibo’s voice with the new SDK: there will be features that let you adjust variables like that, as mentioned in the final section (“Embodied Speech Markup”) of this blog post of ours from October.



Thanks. I will try out the Embodied Speech Markup when the new SDK comes out.

