Video (above): Take a look at our promo!

Team Sweet Talk from Carnegie Mellon University’s Entertainment Technology Center explored the use of voice interaction in virtual reality. Given VR’s currently limited interaction options, we explored what novel experiences we could create that would afford deeper character relationships as a result of this unconventional combination of interfaces. Using the natural language power behind both Amazon Echo and Google Home, we tested the limits of character-driven AI by exploring a variety of teaching/learning relationships between our guests and the AI.

Design document (below): The underlying structure of our experience

Scroll down to see a sampling of some of my writings on the team’s design process and discoveries.

For a more comprehensive collection of information and updates, please check out our website.

Video (below): Our current gameplay

Top Five Lessons:

1. Contextual awareness is the key to conversation.

  • Two entities cannot have a conversation without contextual awareness.  Without it, a spoken interaction becomes a series of unrelated one-offs.  Currently, this is one of the defining differences between the Echo and Home.  Google Assistant is able to remember context between transactions, enabling a true conversation, while this is not something Echo can currently do.
  • In our case, contextual awareness extends beyond the verbal into the visual.  If our robot doesn’t “see” what a guest is holding, immersion breaks.
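
To make the idea of contextual awareness concrete, here is a minimal sketch of how state might be carried between utterances. It is an illustration only, not the team's implementation: `ConversationContext`, `resolve_reference`, and the two-word vocabulary are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversationContext:
    """Hypothetical state remembered between utterances."""
    last_object: Optional[str] = None  # most recent object named in speech
    held_object: Optional[str] = None  # what the guest is holding in VR


# Illustrative vocabulary; an assumption for this sketch.
KNOWN_OBJECTS = {"apple", "banana"}


def resolve_reference(utterance: str, ctx: ConversationContext) -> Optional[str]:
    """Return the object an utterance refers to, falling back on context."""
    words = [w.strip(".,!?") for w in utterance.lower().split()]
    for word in words:
        if word in KNOWN_OBJECTS:
            ctx.last_object = word  # remember it for later pronouns
            return word
    if "this" in words:
        return ctx.held_object      # visual context: the held object
    if "it" in words or "that" in words:
        return ctx.last_object      # verbal context: the last object named
    return None
```

Without the `ctx` argument, "What color is it?" has nothing to refer to, which is the "series of unrelated one-offs" described above.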

2. Non-verbal cues in a face-to-face interaction are interpreted as feedback whether intended or not.

  • Body language has the potential to communicate much more than words in a person-to-person interaction, and the same goes for this type of interaction in VR.  For instance, if a character is looking elsewhere, guests assume the character is not paying attention.
  • Some examples:
    1. In our first digital demo, when our little girl stared up at the player, it implied a) a power dynamic in which the guest is in charge and b) that the girl was awaiting instruction.
    2. In our latest version, when our robot, Babs, starts beeping and/or looking away, guests understand that she is processing and will wait for her to speak.
    3. When Babs looks at the fruit in a player’s hand, the player understands she is learning based upon that object.  Of course, from the back end, it doesn’t matter at all where she’s looking or whether she even has eyes, but that implied contextual awareness is powerful.

3. Expectations from a voice interaction are affected by character species and environment.

  • The little girl model we used at Quarters had a profound impact upon guest expectations from an interaction: they expected her to be able to have a full human conversation.  This is why we switched to a robot, a type of character expected to have a more limited understanding of human language.
  • The environment has a strong effect upon the domain of a conversation.  For instance, a kitchen has an implied set of associated nouns and verbs.  In a kitchen, guests would likely not ask a robot to bring them a lawn mower, or at least they would excuse the robot’s refusal as reasonable.  In a broader setting like a warehouse, guests expect a much wider range of objects and actions.
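
One way to picture how an environment bounds a conversation is a per-environment vocabulary check. This is a rough sketch under stated assumptions: the `DOMAINS` table, its word list, and `respond_to_fetch` are hypothetical and are not taken from the project.

```python
# Hypothetical domain vocabularies; the project's real word lists are not given.
DOMAINS = {
    "kitchen": {"apple", "banana", "tomato", "bowl", "spoon"},
}


def respond_to_fetch(item: str, environment: str) -> str:
    """Accept requests inside the environment's domain and refuse the rest."""
    item = item.lower()
    vocabulary = DOMAINS.get(environment, set())
    if item in vocabulary:
        return f"Fetching the {item}."
    # A refusal that points back at the setting reads as reasonable to a guest.
    return f"I can't find a {item} in the {environment}."
```

A small domain keeps refusals believable: a robot that cannot find a lawn mower in a kitchen does not seem broken.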

4. Today’s voice assistant devices complicate their underlying services.  What we really wanted was a chatbot.

  • Both Google Home and Amazon Echo devices put crippling constraints on interactions, including limited attention spans and set invocation words.  Having to say “Alexa ask Babs to get me a red fruit” every 16 seconds breaks immersion.  This is why we built our own patchwork of services that essentially does the same thing as these devices but without the restrictions.

5. In a voice interaction, guests follow social norms until those norms are violated.

  • Based on our observations, people are generally polite in a voice interaction—even with an AI agent—until they know they can be transgressive.
  • Social norms can be broken by a joke, a delightful discovery of transgressive ability, or even an error thrown by an assistant.  
  • In the case of an error, as soon as it’s clear that this is not a normal, well-functioning conversation, suspension of disbelief ends and norms are dropped.  This is when abusive behaviors begin.



Design Process Excerpts:

Week 5: A quarter through the project

Week 5 was an exciting and eventful week for Team SweetTalk: two prototypes, Quarters presentations, a decision on our direction, bananas, apples, and more!

During the first half of the week, we wrapped up our first formal sprint (for those unfamiliar with Agile, this just means a chunk of time—in this case a week—dedicated to reaching a certain milestone) with the testing of two prototypes that would both help us learn more about guest behaviors in our experience and showcase our progress for Quarters.  As a reminder from last week, our experience breaks down into three beats:

  1. Meet a character you don’t know and with whom you can’t fully communicate.
  2. Teach the character how to communicate with you.
  3. Use that relationship to achieve something together that you could not do on your own.

Digital Prototype: Object Description

Our digital prototype focused primarily on beats 1 & 2 but did have a simple goal at the end.  Using Windows Voice Assistant as a stand-in for Alexa, we created a demo that enabled guests to teach our AI character object descriptors and then ask her to go get that object.  Then we added a surprise moment at the end when the girl was in danger: we wanted to see if guests would go with their gut reactions and yell “stop!” or if they’d be confused.  For this first, very basic demo, we had about a 50/50 response either way, which was great given the fact that it was totally unclear what to do!
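
The teach-then-fetch loop can be sketched as a keyword lookup. This is a toy model under assumptions, not the demo's actual code: `TeachableCharacter`, its methods, and the object id `apple_01` are hypothetical.

```python
from typing import Dict, Optional


class TeachableCharacter:
    """Toy model of a character who learns the guest's words for objects."""

    def __init__(self) -> None:
        # Maps the guest's chosen word to a scene object id.
        self.vocabulary: Dict[str, str] = {}

    def teach(self, word: str, object_id: str) -> None:
        """Associate any word the guest chooses with an object."""
        self.vocabulary[word.lower()] = object_id

    def fetch(self, request: str) -> Optional[str]:
        """Return the first taught object mentioned in the request, if any."""
        for word in request.lower().split():
            word = word.strip(".,!?")
            if word in self.vocabulary:
                return self.vocabulary[word]
        return None  # nothing the character has been taught was mentioned
```

Usage: after `girl = TeachableCharacter()` and `girl.teach("artichoke", "apple_01")`, calling `girl.fetch("Go get the artichoke!")` returns `"apple_01"`. The character trusts whatever label the guest supplies, even a wrong one.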

Take a look at our professor’s playthrough of the demo:

Paper Prototype: Collaboration

Our paper prototype used navigation through a maze as a proxy for asymmetrical cooperation: our guest had a map of the maze our character was in and needed to get information from the character to understand where the character was and in which direction the character should move.  Here’s the map:

Our first round featured me (Andrew) as the character, pretending to not understand English.  In the first step, the guest was asked to teach everything they thought I’d need to know in order to get through the maze successfully, which involved both commands and descriptors. Our playtester taught me the following words:

  • Andrew
  • Dave (his name)
  • Blue
  • Green
  • Red
  • Yellow
  • Left
  • Attempted to teach “show me” but couldn’t figure it out
  • Turn
  • Right
  • Circle
  • Good (as in praise)
  • Stay
  • Walk
  • Get

Unfortunately, I died a few times because of poor playtest design (it was hard to get me to stop in front of a door), but we still left with a lot of valuable information:

  1. We thought this would take 15 minutes. It took about 50.  Turns out teaching all this basic stuff took quite a while!  This was mostly because, even though we stipulated that I had a perfect memory and was a fully verbal adult who just didn’t speak English, Dave’s instinct was to reinforce his lessons and test me at various points along the way.
  2. He spent a while discussing with us what words he should use to get through different situations and struggled with some harder concepts like “avoid” or “go around.”
  3. His approach also took a caveman/pet-like direction with both generally single-word interactions and a desire to provide consistent positive reinforcement.

Knowing the flaws in our design, our second iteration introduced stops halfway through corridors and eliminated the use of English!  Charlie had the inspired idea that perhaps we should try teaching me in a different language, so I then learned how to navigate the maze in Japanese from our next tester.


  1. Her approach followed a similar pattern of reinforcement and testing, but she chose to speak to me in full sentences. I’d never pick up on most of the words, but I actually did figure some out!  When asked why she did that, she said it was “because I wanted to be nice and be natural.”
  2. There was a big moment for her when she realized she could use the room around her to teach, something that bodes well for our use of VR in teaching. Initially, the tester tried to teach me “forward” by moving a triangle towards me repeatedly on the table.  But once she realized she could walk, she started acting things out and picking up objects to make the experience easier to understand and more dynamic.
  3. She taught me the following:
    • Walk
    • Left
    • Right
    • Stop
    • Door
    • Open
    • Blue
    • Red
    • Green
    • Yes
    • No


We showed off all our work at Quarters, but the star of the show was definitely our digital demo.  Guests were really impressed overall, saying that we already had what seemed like a complete yet unpolished experience, and felt that we had tapped into something universal yet novel.  Of course, because it was a demo, we got a lot of valuable feedback about what directions our players would like to see our work go in, namely:

  1. Establish a backstory or relationship so that guests know what to do. This could be done through environmental clues.
  2. Have the character speak first.
  3. Give players the ability to ask the character more questions.
    • Because she’s human, people expected to talk to her more and ask questions about what she can do and who she is.
    • Players wanted to see more feedback from her body.

Idea Moving Forward:

Given what we’ve explored to date, we’ve given ourselves a few constraints to design around:

  1. We don’t want to do a branched narrative. Too many dissatisfying moments.
  2. We don’t want to have the expectation for full conversation, aka our character shouldn’t be a fully verbal human, at least in our language. This is because we can’t deliver upon this promise given our scope, technology, and team structure.
  3. Interacting with a little girl provoked strong emotions that we really liked.
  4. It looks likely that we’ll have to design around Alexa’s relatively robotic voice, or at least something similar, in order to provide the freedom in teaching that we want. Voice over may be impossible given that we don’t want to restrict our dictionary.  We want our guest to be able to call an apple an artichoke, call red periwinkle and say 2+2=5.  #alternativefacts

We went through a quick round of brainstorming and have decided to go in the direction of creating a malfunctioning, robotic little girl that we may have created and need to help.  Not too many details at this point, but we’ll be exploring this space over the next week.


Week 10: Halfway through the project

Week 10 was an intense week of change, reflection, and reprioritization, but one with a strong ending.

As we started to realize around Week 8, because of our subject matter, our project was becoming—by necessity—a research & development project, as there was a lot of testing and iteration required to really understand why many voice interactions are dissatisfying and how we could build something that avoided those pitfalls with limited technology. It was a tall order!

However, we had stumbled upon something really great back in Week 5 with only part of the knowledge we have now. Why, then, was our Week 5 demo so much stronger than our Week 9 demo? Technology was certainly a key reason: our little girl only recognized certain keywords, and in order to give our character a natural language understanding (NLU) that would give the illusion of intelligence, we had to build a new framework from the ground up. That framework is very complex and, as of Week 9, we learned we needed to restart from the ground up a third time.

Perhaps more importantly, we misunderstood what was successful in our demo. We thought the joy was in teaching, but really, it seems the delight was in discovering how to teach and then using that common knowledge.

Further Feedback & Iteration

On Monday, Anthony Daniels came to visit, which is always exciting. He took a look at both of our digital demos, and we acted out part of our new soup demo, which was not yet implemented. He enjoyed the empathy he experienced with the little girl but was frustrated because she seemed pre-programmed. As for the soup demo, Anthony didn’t accept the fact that a robot who could speak perfect English didn’t know what a tomato was; the story felt too forced. Carl, our faculty advisor, agreed and added a) that empathy might be tangential to our experience goals and b) that the story was too complex and confusing.

With this feedback and the Halves feedback in mind, we set out to return to the basics of what we knew would work. The team had a fruitful debate on Tuesday as to whether or not we could get cooking to be interesting, as teaching ingredients left little room for orthogonal descriptors. For example, it would be very strange to describe a tomato as a red sphere… it’s a tomato. We also debated the merits and downsides of a fetching-centric game: it would certainly be simpler and easier, but what would our new domain be? Kitchens have a nice way of restricting the domain of what a guest would expect to be present that other environments might not.

A New Combined Cooking/Fetching Game

And, at last, a solution emerged somewhere in the middle! We decided upon a two-way fetching/serving game set in a kitchen that would provide a framework for cooking if we had the time in the future but wouldn’t necessitate it. The experience breaks into three beats:

  1. Greeting/introductions: We are in an alien restaurant for its first day of operation, and the player is working with an illiterate robot to serve up orders. She’s at the order window, and the player is at the pick-up window.
  2. Getting the order: The robot hands the player an order form written in an alien language, and using a translation chart on the wall, players will decipher what the code means and then teach the robot the names of alien fruits as well as the color of the required fruit. Once the robot understands, it fetches the ingredient.
  3. Serving: The robot, since she got the order, knows who it goes to. She tells the player to give it to the alien as identified by race (Gloffglorp, etc). The player needs to learn from the robot how to identify the correct alien and serve it appropriately.

We immediately playtested to see a) if the overall concept worked and b) if players would make up words or stick to the English domain, as our system is incapable of recognizing made-up words. You can take a look at the videos below:

Turns out that the experience was pretty fun and that, at least in these tests, there were no made-up words from the player end. With those encouraging results in mind, we decided to build out a playable prototype for our client/faculty meeting the next Tuesday.

A Technology Update

As if those changes weren’t enough, we’ve got a fairly major technology change to share. As you’ll remember, we officially put the Amazon Echo aside due to the crippling technology constraints, and we had high hopes that Amazon Lex, a new chatbot service in developer preview, might help us get around those constraints. While the system seems great, it is indeed in its very early stages, and we learned from the Lex team that the SDK required to integrate with Unity would not be released until late in Week 10 or early in Week 11. Sadly, we did not have the time to wait, so our new plan involves a bit of a patchwork:

  • Speech-to-text: Windows Dictation, as we had used previously
  • NLU: (now owned by Google)
  • Text-to-speech: TBD, but we’re looking at IBM’s Watson because of the high-quality voice and high level of emotional control
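
The patchwork above amounts to a pipeline of swappable stages. Below is a minimal sketch of that shape, assuming each service can be wrapped as a plain function. `VoicePipeline`, `Intent`, and the `fake_*` stand-ins are hypothetical names for illustration; none of them are real vendor APIs, and the sketch does not call any actual speech or NLU service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Intent:
    """Hypothetical NLU result: what the guest wants, plus extracted details."""
    name: str
    slots: Dict[str, str] = field(default_factory=dict)


@dataclass
class VoicePipeline:
    """Chains the stages so any one service can be swapped independently."""
    speech_to_text: Callable[[bytes], str]
    understand: Callable[[str], Intent]
    respond: Callable[[Intent], str]
    text_to_speech: Callable[[str], bytes]

    def handle(self, audio: bytes) -> bytes:
        text = self.speech_to_text(audio)
        intent = self.understand(text)
        reply = self.respond(intent)
        return self.text_to_speech(reply)


# Stand-in stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def fake_nlu(text: str) -> Intent:
    words = text.lower().split()
    if "get" in words and words[-1] != "get":
        return Intent("fetch", {"item": words[-1]})
    return Intent("unknown")


def fake_respond(intent: Intent) -> str:
    if intent.name == "fetch":
        return f"Fetching the {intent.slots['item']}."
    return "Sorry, I did not understand."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the bytes are synthesized audio
```

Because each stage is just a function, replacing the text-to-speech service later would mean passing a different callable, with no change to the other stages.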

We’re off to the races with our new plan, and we’re excited since it seems to incorporate what we’ve learned and has the technology to bring it to life. Look for more updates next week!