SSML: inserting developer events to support content-driven features (behaviours, animations, multimedia)

Dear Jibo team,
The speech-related documentation indicates that a subset of SSML will be implemented in the future.

Could you make it possible for developers to insert their own tags into the spoken content, so that events fire from within the text and can be detected by a listener?

I want to fire events that will trigger listeners in the behaviour tree: for example, events that convey gestures or emotions, or that tell Jibo how to engage the audience. These events could fire animations or child behaviours, and a developer could listen for them from scripts or decorators to control animations or other behaviours. Some events might even be started and then stopped or modified later by referencing a reference identifier from an earlier tag.

The goal is to use the text content to drive Jibo's interactions - just as SSML markup in the text will modify how Jibo speaks the words. The text author can then control where the text is animated, where particular social behaviours are invoked, where multimedia events are triggered, and so on. Ultimately the idea is to read the marked-up text from a database and let Jibo's BTs be guided by the embedded events.

In addition to the event name, a message body can include whatever the developer specifies - an object to refine the event, a callback, etc.

I don't know the fine details of SSML; perhaps this will be possible given the subset that you will provide. A simple approach is best, meaning the event element does not need to wrap other SSML elements (no spoken text inside it).

Each event would have an event ID, a reference ID, and a message body. Just a rough example…

<dev:event event="FEAR" ref="1234"><![CDATA[]]></dev:event>

Tag attributes:
event="FEAR" - a very general event. Triggers: Jibo trembles, suspenseful music plays, an image of a wizard displays, etc. Jibo, acting as the Wizard of Oz Scarecrow, says "but I'm not afraid of wizards!"
ref="1234" - an identifier that other events or listeners could reference. For example, a STOP event could reference a prior event that was started (see the rough sketch after this list). By default, event duration might be controlled by the listeners.
Tag body: anything that supports fine-tuning of the corresponding action to take.
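
A rough sketch of how the ref mechanism might read in practice (purely illustrative - the element name, attributes, and body format are just my proposal, not anything in the SDK today):

<dev:event event="FEAR" ref="1234"><![CDATA[{"music":"suspense"}]]></dev:event> but I'm not afraid of wizards! <dev:event event="STOP" ref="1234"><![CDATA[]]></dev:event>

Here the FEAR event starts (trembling, music) just before the line is spoken, and the STOP event, carrying the same ref, ends it right after.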

Event examples: ANIMATE, DANCE, SAD, HEART. The tag body provides the additional details.

For example, a HEART event might trigger the display of a heart image, or a heart animation in which Jibo shows affection. In another context, a more complex BT might have Jibo lean in toward the person speaking…
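
To make the listener side concrete, here is a rough, purely hypothetical sketch of what handling such events from a script or decorator might look like. None of these names (onDevEvent, the event fields, the blackboard object) exist in the SDK today; they only illustrate the shape I have in mind:

// Purely hypothetical sketch - none of these names exist in the SDK today.
const activeEvents = new Map();  // ref id -> event body, for events a later STOP can cancel
const blackboard = {};           // stand-in for whatever state the BT decorators watch

// Imagined callback the TTS could invoke when it reaches a <dev:event> tag.
function onDevEvent(event) {
  if (event.name === "STOP") {
    activeEvents.delete(event.ref);        // a STOP references an earlier event by its ref id
    return;
  }
  activeEvents.set(event.ref, event.body); // remember it so later tags or listeners can modify/stop it
  if (event.name === "HEART") {
    blackboard.showHeart = true;           // a decorator could watch this to start a heart animation
  }
}

// e.g. what the TTS might emit for <dev:event event="HEART" ref="42"><![CDATA[{"image":"heart.png"}]]></dev:event>
onDevEvent({ name: "HEART", ref: "42", body: { image: "heart.png" } });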

Other benefits:

  1. We tie up a lot of content in the behaviour tree; this approach would enable better modularity. Today the BTs need to tightly couple the speech content in order to drive BT-related actions. We could store our content separately, use events to drive the BTs, and reduce the coupling of the spoken text to the BT.

  2. Developers can create general or specific events and even reuse content across multiple skills.

  3. Support stories that drive BT animations and trigger images, sounds, etc.

  4. Use content to control how Jibo interacts with people (gestures, emotions…) and how Jibo engages. In my prior theatre life, scripts carried notes about how to interact with the other characters on stage: while speaking a line, LookAt someone, lean in, etc.

Conclusion:
Right now, I prepare an array of events; some of the events are my segmented texts for the TTS. I can evolve this myself; however, I thought it would be good to have these features built into the developer SDK.
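
For what it's worth, a simplified illustration of the kind of array I mean (the field names here are just mine):

// Simplified illustration of the current workaround: an ordered array where some
// entries are segmented text for the TTS and others are events for the BT to act on.
const content = [
  { type: "EVENT", name: "FEAR", ref: "1234", body: { music: "suspense" } },
  { type: "TTS",   text: "but I'm not afraid of wizards!" },
  { type: "EVENT", name: "STOP", ref: "1234" }
];

// The BT walks the array in order, speaking the TTS entries and dispatching the rest.
content.forEach((entry) => {
  if (entry.type === "TTS") {
    // hand entry.text to a TextToSpeech behavior
  } else {
    // notify listeners/decorators about entry.name, entry.ref, entry.body
  }
});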


Hi Bob,

As always with posts in this category, we will make sure the team is aware of this feature idea so that they can take it into account going forward.

That being said, there may already be a less robust way in the SDK to do something similar to what you are looking for.

Our TextToSpeech behavior includes a Behavior Argument called onWord that can be used to execute code each time Jibo says a word. The actual value of the word can be referenced using word.token in that argument.

As an example, you could have "Hello my name is Jibo" as the content of the words argument for a TextToSpeech behavior. You could then have the following code in the onWord argument:

if (word.token === "name") { console.log("NAME WAS SAID"); }

In that instance the log would appear only once (for "name"), but that script would run for each word. The argument could ultimately be used to set variables and trigger code based on when a specific word was said in TTS.
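
As a rough sketch of that idea (only word.token is part of the SDK here; the bookkeeping is plain JavaScript you would adapt to your skill):

// Your own bookkeeping - in practice it needs to live somewhere the rest of the
// behavior tree can read it, e.g. set up before the TextToSpeech behavior runs.
const markersHeard = [];

// The body of the onWord argument: it runs once per spoken word, with word.token
// holding the word that was just said (word.token comes from the SDK; the rest is yours).
function handleWord(word) {
  if (word.token === "name") {
    console.log("NAME WAS SAID");
    markersHeard.push("name");   // a later behavior or decorator could check this
  }
}

// The TTS effectively does this for each word of "Hello my name is Jibo":
handleWord({ token: "name" });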

The relevant API docs for that can be seen here.

I should note that we are currently working on a bug that keeps that onWord argument from working reliably on some occasions, so you may need to wait a little longer to use it, but it may allow you to do a little of what you described at this time.

I hope that is helpful and thank you!

-John

Thanks John for the example. Another tool to use :wink:
best, Bob

OK, I've read the speech style guide again and am trying to understand the future SSA elements… they should help greatly. If we can listen for the SSA elements the way we can for each "word", that would be fantastic. However, I'd still like to be able to listen for additional developer event tags (plus a body) inserted into the content that Jibo speaks.


I think this is a useful feature. Since animations provide an eventing mechanism that facilitates synchronization, it would be nice to have speech do the same. I do think that synchronizing speech and animation is at the heart of this feature, so it might be nice to have a way to combine the two in the animation editor. Also, it would be nice to know how long it is going to take Jibo to say something. I didn't see a way to do this now, but possibly I missed something.


I hear it is on the way! :slight_smile:

Hi Servo,

I will definitely let the team know about your desire to have a tighter integration between TTS and the animation tool. In the meantime you might find the getWordTimings API useful for getting timing info for TTS.
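
As a very rough sketch - and do check the API docs for the exact return shape, since I am assuming here that it gives back one timing entry per word - you could estimate how long an utterance will take like this:

// Rough sketch only: the return shape of getWordTimings is assumed here to be
// one entry per word with a start time and a duration, both in seconds.
function estimateUtteranceLength(wordTimings) {
  const last = wordTimings[wordTimings.length - 1];
  return last.start + last.duration;   // end time of the final word
}

// With an assumed shape like this:
const timings = [
  { word: "Hello", start: 0.0,  duration: 0.40 },
  { word: "Jibo",  start: 0.45, duration: 0.35 }
];
console.log(estimateUtteranceLength(timings) + " seconds");  // 0.8 seconds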

Thanks, I did miss getWordTimings(). I’ll check it out.