When/if multiple speakers fire a rule...?

I’m working on behaviours that explicitly and implicitly determine the participants in a group, for storytelling or a game. Looking again at the language rules, it seems that a rule can be triggered by multiple persons (speakerIds)? In that case, even a simple rule might be triggered by several people when it is invoked once.

(listener) => {
  listener.on('cloud', function(asrResult, speakerIds) {
    // asrResult: the recognized speech; speakerIds: who was heard
  });
}

In response to Jibo’s skill question, “who wants ice cream?”, the rule is triggered by “I do | me | yes”. Two kids say “I do”, one kid says “me”. Is it correct that, potentially, Jibo could hear three children?

Could one person say “I”, another person say “do”, and the rule still fire?

Of course, to collect the members of a group, we might loop the “listen” behaviour for a few seconds so that Jibo can detect multiple confirmations from different speakers.
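
A minimal sketch of that loop, assuming the listener.on('cloud', …) callback above, an EventEmitter-style removeListener, and that speakerIds is an array of IDs for the heard utterance (all assumptions on my part):

// Sketch only, not official API: collect confirmations for a fixed window.
function collectGroup(listener, windowMs, onDone) {
  const members = new Set();
  const handler = (asrResult, speakerIds) => {
    // Assumes speakerIds is an array of IDs for this utterance.
    (speakerIds || []).forEach((id) => members.add(id));
  };
  listener.on('cloud', handler);
  setTimeout(() => {
    listener.removeListener('cloud', handler); // assumes EventEmitter semantics
    onDone(Array.from(members)); // everyone who confirmed during the window
  }, windowMs);
}

Something like collectGroup(listener, 5000, (ids) => { /* … */ }) would then hand back everyone heard in a five-second window.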

Still, it is helpful to understand how “multiple speakers” might help our work when the rules are simple, and, conversely, what problems arise if a rule is fired by multiple persons when we expect only one speaker!

Thanks for any insights (if this will be explained in future documentation, I can wait).
Best, Bob


This is a great question, one that spawns a whole slew of answers and discussions that could probably be turned into an Engineering Ph.D. thesis. :wink:

There are two areas of speech technology you are touching on: sound processing and natural language understanding (NLU). The first focuses on how to process audio waves into cleaner signals; the second focuses on how to interpret strings of text.

Let’s start with the first one. On the robot, Jibo uses a sound segmentation algorithm called beam-forming. Jibo has 6 microphones and can project any number of beams. A beam is a cone that projects outwards from Jibo’s head. Jibo is able to segment the sound inside any one beam; that is, he can filter out sound outside any particular beam. Jibo has, at any time, multiple beams he’s processing. He’ll then use special algorithms to determine which beam is the “most interesting”, and by “most interesting” I mean which beam’s harmonics most closely match human speech. If two people say two different commands to Jibo at once (especially if they’re spatially well separated), Jibo will only choose one beam to send to the Automatic Speech Recognizer (ASR) for conversion into text. In this case, he’ll send the beam that is the loudest and also contains the harmonics of voice. In essence, Jibo will have selective hearing and ignore everyone but the loudest. Now, of course, if people get smart and take turns saying each word, Jibo will feed each of those words into the ASR as if they’re coming from one person.
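
To make that selection step concrete, here is illustrative pseudologic only, nothing like the actual signal-processing code; the beam fields energy and voiceLikelihood are invented stand-ins for loudness and match-to-speech-harmonics:

// Illustrative only: pick the "most interesting" beam by combining
// loudness with how speech-like the beam's harmonics are.
function selectBeam(beams) {
  let best = null;
  for (const beam of beams) {
    // `energy` and `voiceLikelihood` are hypothetical, normalized 0..1.
    const score = beam.energy * beam.voiceLikelihood;
    if (best === null || score > best.score) {
      best = { beam, score };
    }
  }
  return best && best.beam; // only this beam's audio reaches the ASR
}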

Now on to NLU. NLU, by its nature, is fraught with ambiguity. Let’s take a simple rule:

TopRule = $* (yes{what='yes'} | no{what='no'}) $*;

Now, what would happen if I ran the following string through this rule?

"hey jibo, no, I think the answer is yes."

Immediately, there’s an ambiguity. Our rule states that it’s listening either for “yes” or for “no”, so the Robust Parser would actually return two results: one with what='yes' and one with what='no'. But which result should we take more seriously? This is where the confidence score comes into play. Currently, the parser only returns the result with the highest confidence, using a set of heuristics to determine which results are more important. In the future, we are going to change this to return the n-best results and let the developer decide which one matters most. If you get two results back, for example, you might want to have Jibo repeat the question, or have him ask the person to clarify the response.
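
To sketch what consuming those n-best results might look like once that change lands (the { what, confidence } result shape here is an assumption for illustration, not our documented API):

// Hypothetical n-best handling: take the top result unless the runner-up
// is close enough in confidence that the utterance is genuinely ambiguous.
function interpret(results) {
  if (!results || results.length === 0) return null;
  const sorted = results.slice().sort((a, b) => b.confidence - a.confidence);
  const [top, runnerUp] = sorted;
  if (runnerUp && top.confidence - runnerUp.confidence < 0.1) {
    // Too close to call: better to have Jibo ask a follow-up question.
    return { ambiguous: true, candidates: [top.what, runnerUp.what] };
  }
  return { ambiguous: false, what: top.what };
}

With the “hey jibo, no, I think the answer is yes” string above, the two parses would likely score similarly, and the skill could re-prompt rather than guess.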

This may make it seem like, with just a little bit of effort, a person could seriously confuse Jibo by responding ambiguously on purpose. Luckily, though, humans are bound by the same laws of logic. I’ve had this conversation at least a few times:

Me, "So you don’t want to go out?"
Friend, "Yes, no."
Me, “Um is that a no?”

Because Jibo can interact socially, he doesn’t need to be able to unambiguously resolve all language. He can ask follow-up questions (like in the conversation above), or better yet, guide people in how to interact with him. People very quickly get a feel for how to interact with Jibo as long as their interactions are consistent. Unsurprisingly, the things that make group conversations easier for people to follow (turn taking, being clear and concise, not speaking over one another) are exactly the same things that make it easier for Jibo to interact with people.

Hope this answered your question.

-Jonathan


Thank you, Jonathan, for a great answer! It certainly helps to understand what’s happening so that we can devise our skill interaction patterns.
I certainly agree with your statements about human group interactions. However, to support particular games or storytelling, the leader (Jibo) must expect or even provoke “overspeaking”. It’s quite fun when everyone is shouting; humans playing charades are one example, with people yelling out proper nouns… In a multiplayer skill, I listen for the group feedback (yes, no, north, south, …) and follow up, etc.
In most skills, I want to track participation/involvement, as there are motivation strategies that can be applied, so listening for simple responses from group members is one aspect.
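For what it’s worth, the participation tracking I have in mind is simple bookkeeping on my side, something like this sketch (the speakerIds shape is the same assumption as in my first post, not confirmed SDK behaviour):

// Count simple responses per speaker so quieter members can be noticed
// and prompted; purely my own bookkeeping, not SDK code.
const participation = {};

function recordResponse(asrResult, speakerIds) {
  (speakerIds || []).forEach((id) => {
    participation[id] = (participation[id] || 0) + 1;
  });
}

function quietMembers(groupIds, threshold) {
  // Members heard fewer than `threshold` times are candidates for a
  // direct prompt ("What do you think, Sam?").
  return groupIds.filter((id) => (participation[id] || 0) < threshold);
}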
Given that, the future implementation involving “n-best results” will be very helpful.
My only concern is how “best” is determined, since it might filter out valid responses.
For example, if “best” is determined by the “loudest” and “closest” voice, then kids will figure this out and scream and shout to affect the multiplayer game/storytelling skill, while the quieter kids are filtered out or not detected for some reason… there might be configuration options to get more hits. Of course, this is not so different from human strategies (be the loudest), except that right now Jibo can’t simply tell the group to behave and speak more softly.

Anyway, I understand enough now to apply this. I look forward to learning more when the future changes are implemented. Thanks again!
Best, Bob