Entity recognition - locating persons (speakers, persons from an NLU rule, etc.)

(You are likely supporting all of this - I just can’t tell from the API and documentation what you’ll be addressing.)

I’m interested in understanding how to work with a group of persons: giving attention to and interacting with the members, locating them, perhaps looking at or following them. In my skill designs, there are a lot of nagging uncertainties around design and prototyping - how my skills can engage with humans, apply social behaviors, etc.

I’ll give you my basic use scenarios. Though you likely will support them, you’ll get my views and perhaps my misconceptions; better strategies might be addressed in the documentation or an upcoming features guide.

a. identify the speaker (person name) that triggered an NLU rule, look at that person, and perhaps follow them (continuous lookup).
b. from the NLU rule, captured person names can be looked up to get more info about the person. I’m unsure what (c) requires.
c. locate a person in a frame from a set of visual entities. The goal is to match the person from (a) or (b), or simply to identify persons participating in an activity but not mentioned in (a) or (b). By identifying persons, Jibo can interact with them using their names, even displaying their photos, etc. (see the data sketch after this list).
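To make the matching problem concrete, here is roughly the data I imagine flowing between the three scenarios. None of these type or field names come from the SDK - they’re just my mental model:

```typescript
// Hypothetical types - these names are my own, not the Jibo SDK's.
interface Position { x: number; y: number; z: number; }

interface AudibleEntity {
  position: Position;      // where the sound came from
}

interface VisualEntity {
  position: Position;      // where the person appears in the frame
  personName?: string;     // identity, if recognition matched someone
  confidence?: number;     // how sure the recognizer is
}

interface SpeakerContext {
  nameFromNLU?: string;    // name captured by the NLU rule - scenario (b)
  audible?: AudibleEntity; // where the speech came from - scenario (a)
  visual?: VisualEntity;   // the matched visual entity - scenario (c)
}
```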

Overall, it’s important to know both the person’s identity and position to support lookAt functionality.

Supporting (a) -

A first strategy is now possible: I can do (a) only by combining two Listen behaviors - one for speech, the other for the closest audible entity. From the audible entity, we get the position for lookAt. I can now test that in the simulator (thanks!). To follow the person, I can then get the nearest visual entity, and so forth.
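In rough TypeScript, this is the control flow I mean. Every SDK call here (listenForSpeech, getClosestAudibleEntity, getNearestVisualEntity, lookAt, trackEntity) is a placeholder of my own invention, standing in for whatever the real API exposes:

```typescript
// Types as in the earlier sketch; these declares are placeholders,
// not the real SDK surface.
declare function listenForSpeech(): Promise<string>;
declare function getClosestAudibleEntity(): Promise<AudibleEntity>;
declare function getNearestVisualEntity(p: Position): Promise<VisualEntity | null>;
declare function lookAt(p: Position): Promise<void>;
declare function trackEntity(e: VisualEntity): Promise<void>;

async function attendToSpeaker(): Promise<void> {
  // Two "Listen" steps together: one for the utterance (which feeds
  // the NLU rule downstream), one for where the sound came from.
  const [utterance, audible] = await Promise.all([
    listenForSpeech(),
    getClosestAudibleEntity(),
  ]);

  // Orient toward the sound source.
  await lookAt(audible.position);

  // To follow, hand off to vision: assume the nearest visual entity
  // to that position is the speaker - the weak link in this strategy.
  const visual = await getNearestVisualEntity(audible.position);
  if (visual) {
    await trackEntity(visual); // continuous lookAt / follow
  }
}
```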

Problems: relying on “closest” is not desirable in my group scenarios, and I can’t confirm that the speaker is really the audible or visual entity. I need a means to match the speaker to a visual entity - at least to confirm it.

Proposed second strategy: I want to also check a set of visual entities to locate the person. First check visually; if there is no match, then turn while checking the visual entities (issue: turning slowly enough to capture visual entities). In conjunction, the direction of the turn could be determined from the position of the closest audible entity. Alternatively, look at the audible entity, confirm the person is correct, and if not, then search visually.
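Sketched out, again with all the entity/SDK calls (getVisualEntities, turnToward, and friends) being hypothetical names for whatever the platform actually provides:

```typescript
// Placeholder declarations, as in the earlier sketch.
declare function getVisualEntities(): Promise<VisualEntity[]>;
declare function getClosestAudibleEntity(): Promise<AudibleEntity>;
declare function turnToward(p: Position, stepDegrees: number): Promise<void>;

// Search visually first; if no match, turn toward the closest audible
// entity in small steps, re-checking the frame as we go.
async function locateSpeaker(name: string): Promise<VisualEntity | null> {
  // 1. Check the current frame first (cheapest).
  let match = findPersonInFrame(await getVisualEntities(), name);
  if (match) return match;

  // 2. Use the closest audible entity to pick a turn direction,
  //    turning slowly enough that vision can keep up.
  const audible = await getClosestAudibleEntity();
  for (let step = 0; step < 6; step++) {
    await turnToward(audible.position, 15 /* degrees per step */);
    match = findPersonInFrame(await getVisualEntities(), name);
    if (match) return match;
  }
  return null; // caller can fall back to another strategy
}

function findPersonInFrame(
  entities: VisualEntity[],
  name: string
): VisualEntity | null {
  return entities.find(e => e.personName === name) ?? null;
}
```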

In a group of kids plus a parent or caretaker, it could get confusing - it does not make sense to rely only on the audible or the nearest visual entity. However, in a BT more than one strategy can be employed, starting with the most performant strategy and falling back on another.
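That fallback pattern is basically a BT selector. A minimal sketch, assuming each strategy returns null on failure (nothing here is an SDK construct - it’s just the shape of the idea):

```typescript
// A BT-style selector: try each child strategy in order, return the
// first success, fail only if every child fails.
type Strategy = () => Promise<VisualEntity | null>;

async function selector(strategies: Strategy[]): Promise<VisualEntity | null> {
  for (const strategy of strategies) {
    const result = await strategy();
    if (result) return result; // first success wins
  }
  return null; // every child failed
}

// Usage, with the hypothetical helpers from the sketches above:
// const speaker = await selector([
//   () => confirmClosestAudible("Maya"), // fast path: trust "closest", verify
//   () => locateSpeaker("Maya"),         // fallback: visual search with turn
// ]);
```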

best, Bob
