Dear Jibo team,
The speech-related documentation indicates that a subset of SSML will be supported in the future.
Could you make it possible for developers to insert their own tags into the content, so that events fire from within the text and can be detected by a listener?
I want to fire events that trigger listeners in the behaviour tree: for example, events that convey gestures or emotions, or that tell Jibo how to engage the audience. These events could start animations or child behaviours. A developer could listen for them from scripts or decorators to control animations or other behaviours. Some events might even be started, then stopped or modified later, by referencing the reference identifier of an earlier tag.
The goal is to use the text content to drive Jibo’s interactions, just as SSML markup will modify how Jibo speaks the text. The text author could then control how the text is animated, where particular social behaviours are invoked, where multimedia events are triggered, and so on. Ultimately, the idea is to read the marked-up text from a database and let Jibo’s BTs be guided by the embedded events.
In addition to the event name, the message body can include whatever the developer specifies: an object that refines the event, a callback, etc.
I don’t know the fine details of SSML; perhaps this will be possible given the subset you will provide. A simple approach is best, meaning the event element would not need to wrap other SSML elements (no spoken text inside it).
Each event would have an event ID, a reference ID, and a message body. Just a rough example…
<dev:event event="FEAR" ref="123"><![CDATA[ ... ]]></dev:event>
event="FEAR" - a very general event. Triggers: Jibo trembles, suspenseful music plays, an image of a wizard displays, etc. Jibo, acting as the Wizard of Oz Scarecrow, says "but I’m not afraid of wizards!"
ref="123" - an identifier that other events or listeners can reference. For example, a STOP event can reference a prior event that was started. By default, event duration might be controlled by the listeners.
Tag body: anything that supports fine-tuning of the corresponding action to take.
Event examples: ANIMATE, DANCE, SAD, HEART. The tag body provides additional details.
For example, a HEART event might display a heart image, or play a heart animation with Jibo showing affection. In another context, a more complex BT might have Jibo lean in toward the person speaking…
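To make the idea concrete, here is a rough sketch of how marked-up text like the example above might be split into speech segments and embedded events before the BT consumes it. The <dev:event> tag format and the Segment type are my own invention, not part of any announced SSML subset:

```typescript
// Hypothetical sketch: split marked-up text into TTS speech segments and
// embedded developer events. Nothing here is real Jibo SDK API.
type Segment =
  | { kind: "speech"; text: string }
  | { kind: "event"; event: string; ref?: string; body?: string };

// Matches <dev:event event="..." ref="...">body</dev:event>; ref is optional.
const TAG =
  /<dev:event\s+event="([^"]+)"(?:\s+ref="([^"]+)")?\s*>(.*?)<\/dev:event>/gs;

function parse(markedUp: string): Segment[] {
  const segments: Segment[] = [];
  let last = 0;
  for (const m of markedUp.matchAll(TAG)) {
    const before = markedUp.slice(last, m.index).trim();
    if (before) segments.push({ kind: "speech", text: before });
    segments.push({
      kind: "event",
      event: m[1],
      ref: m[2],
      body: m[3] || undefined,
    });
    last = m.index! + m[0].length;
  }
  const tail = markedUp.slice(last).trim();
  if (tail) segments.push({ kind: "speech", text: tail });
  return segments;
}
```

The speech segments would go to the TTS as usual, while the event segments would be handed to whatever listener mechanism the SDK exposes.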
We tie up a lot of content in the behaviour tree; this approach would enable better modularity. Today, BTs need to tightly couple the speech content in order to drive BT-related actions. With events we could store our content separately, use the events to drive the BTs, and reduce the coupling of the spoken text to the BT.
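On the listener side, the decoupling might look something like the sketch below: a small event bus that dispatches events to registered listeners and remembers started events by their ref, so a later STOP event can reference them as described above. The names (ContentEvent, ContentEventBus) are mine, not SDK API:

```typescript
// Hypothetical sketch: dispatch content events to listeners, tracking
// started events by ref so a later STOP can end them. Not real SDK API.
interface ContentEvent {
  event: string;
  ref?: string;
  body?: unknown;
}

type Listener = (e: ContentEvent) => void;

class ContentEventBus {
  private listeners = new Map<string, Listener[]>();
  private active = new Map<string, ContentEvent>(); // ref -> started event

  on(event: string, fn: Listener): void {
    const fns = this.listeners.get(event) ?? [];
    fns.push(fn);
    this.listeners.set(event, fns);
  }

  emit(e: ContentEvent): void {
    if (e.ref) this.active.set(e.ref, e); // remember for a later STOP
    (this.listeners.get(e.event) ?? []).forEach((fn) => fn(e));
  }

  // A STOP event carries the ref of the event it ends; return the
  // started event so the listener can tear down whatever it began.
  stop(ref: string): ContentEvent | undefined {
    const started = this.active.get(ref);
    this.active.delete(ref);
    return started;
  }
}
```

A BT decorator could then register `bus.on("FEAR", ...)` to start a trembling animation, and end it when `bus.stop("123")` returns that event.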
Developers can create general or specific events and even reuse content across multiple skills.
This would support stories that drive BT animations and trigger images, sounds, etc.
It would let content control how Jibo interacts with people (gestures, emotions…) and how Jibo engages. In my prior theatre life, scripts carried notes about how to interact with other characters on stage: while speaking a line, LookAt someone, lean in, etc.
Right now, I prepare an array of events; some of those events are my segmented texts for the TTS. I can evolve this approach myself, but to support developers generally, I thought it would be good to have these features in the SDK.
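For reference, my current workaround looks roughly like this (simplified; speak() and fire() stand in for whatever the skill actually calls, and the Step type is my own):

```typescript
// Sketch of the current workaround: a prepared array mixing TTS text
// segments and events, consumed in order. Not real SDK API.
type Step =
  | { kind: "tts"; text: string }
  | { kind: "event"; name: string; payload?: unknown };

function run(
  steps: Step[],
  speak: (text: string) => void,
  fire: (name: string, payload?: unknown) => void
): void {
  for (const step of steps) {
    if (step.kind === "tts") speak(step.text);
    else fire(step.name, step.payload);
  }
}
```

With SDK-level support, this interleaving would instead come straight out of the marked-up text, rather than being hand-assembled per skill.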