Conversational UI

Research focused on understanding how users interact with conversational speech-based text editing

Project domain

Voice interfaces, speech interaction

Stakeholder

NUS HCI Lab (Singapore)

My role

UX researcher

Team

UX researcher, PhD candidate, Post-doc

Tools

In-house built prototype for speech recognition, Camtasia, Quirkos, Excel

Duration

5 months

The challenge of eyes-free text editing

Conversational interfaces allow us to interact with computers as we do with other people to accomplish tasks like booking a flight.

Searching flights

But to edit text with our voice, we need to learn a predefined syntax and repeat exact phrases such as "Delete cat. Insert bat". We can't yet talk to a dictation system the way we would to a human secretary, for example.

This means, among other things, that to edit text by voice we need visual contact with our device. That is not possible, for instance, while we are walking.

Understanding speech behavior

Facilitating a more fluid interaction with text would be beneficial for users in situations where they can't use their hands and eyes, and also appropriate for visually impaired users.

But before we can design user-centered systems allowing for eyes-free text editing, it is necessary to understand how users would interact with those systems. The purpose of this research is to understand how users compose and revise text using a voice interface when not constrained by the use of a fixed syntax (e.g. delete, insert...).

Simulating a voice interface

Our goal was to simulate a conversational UI for a text editing system and to observe participants composing and revising pieces of text using the system.

Pilot studies

Prior to the main study, we conducted a series of pilot studies to define crucial system operations and responses. We observed and analyzed the speech behavior of 22 participants to anticipate how the simulated system should operate and respond.

 

Participant 15 to System: I make use of a varie- eh of different technologies (pause). Actually, maybe add moreover in the beginning of the sentence.

Wizard of Oz study

After the system operations and responses had been defined, we conducted a Wizard of Oz study in which one of us played the role of the system.

We asked 7 participants to perform text composition and revision tasks using the simulated system. They could speak to the system in whatever way they preferred and were not aware that they were interacting with a person. After the study, we interviewed participants about their experience and behavior.

Qualitative analysis of the data

We recorded the experiments and interviews and analyzed the audio data qualitatively.

Experiment setup

Experiment setup: participants did not know they were interacting with a human.


Results

We collected, transcribed and analyzed more than 7 hours of audio data. Five themes emerged from the thematic analysis.

Observed strategies

One of the themes was "Editing strategies". Most participants used commands to request edits from the system. That is, they relied on certain keywords, although they did not stick to a specific grammar (e.g. they used synonyms).

 

Participant 1 to System: Change continuous to frequent

Participant 3 to System: Could you replace student foreign with stolen phone

Observed editing strategies


This observation is interesting: even when users are allowed to speak to the system conversationally, they still rely heavily on keywords.

This may be because we are used to interacting with computers through commands. It may also be that users do not trust the system to perform the desired operations and therefore want to be more precise.
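One way a system could accommodate this keyword-with-synonyms behavior is to normalise synonyms to canonical edit operations. The following sketch is purely illustrative: the synonym sets and the matching rule are assumptions for this example, not the vocabulary model of our simulated system.

```python
import re

# Illustrative synonym sets (assumptions, not the study's actual lexicon):
# each canonical operation maps to keywords participants might use for it.
OPERATION_SYNONYMS = {
    "replace": {"change", "replace", "substitute", "swap"},
    "delete": {"delete", "remove", "erase"},
    "insert": {"insert", "add", "put"},
    "read": {"read", "repeat", "play"},
}

def canonical_operation(utterance):
    """Return the canonical edit operation named in an utterance, or None."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    for operation, synonyms in OPERATION_SYNONYMS.items():
        if synonyms & words:  # any synonym of this operation present?
            return operation
    return None

print(canonical_operation("Change continuous to frequent"))            # replace
print(canonical_operation("I don't remember what I was going to say"))  # None
```

Both "change" and "replace" resolve to the same operation, so utterances like those of Participants 1 and 3 above would trigger identical system behavior despite their different wording.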

Level of specificity


Users' utterances range from very specific to very vague. An utterance is specific when both the operation and its target are explicitly stated.

 

Participant 4 to System: Read the last sentence

By contrast, utterances are vague when users fumble or think out loud. Users who think out loud do not intend their thoughts to be typed by the system.

 

Participant 2 to System: I don't remember what I was going to say
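A crude heuristic could capture this distinction: label an utterance specific only when it names an edit operation followed by a target. The keyword list below is an illustrative assumption, not the coding scheme we used in the thematic analysis.

```python
import re

# Illustrative operation keywords (an assumption for this sketch).
OPERATIONS = {"delete", "insert", "replace", "change", "read"}

def classify(utterance):
    """Label an utterance 'specific' if it names an operation followed
    by at least one target word, otherwise 'vague'."""
    words = re.findall(r"[a-z']+", utterance.lower())
    for i, word in enumerate(words):
        if word in OPERATIONS and i + 1 < len(words):
            return "specific"  # operation plus a target after it
    return "vague"

print(classify("Read the last sentence"))                    # specific
print(classify("I don't remember what I was going to say"))  # vague
```

A real system would need far more than keyword spotting, of course; the point is that the specific/vague distinction we observed is operational enough to be testable.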

Design guidelines

From the results of the study, we derived a set of recommendations to inform the design of a future system supporting conversational interaction.

When users are not limited by a specific grammar, their utterances are not always precise, even when they use commands.

It would be valuable for the system to understand a user's intention; only then could it discard the noise from the user's utterance.
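As a rough illustration of what "discarding the noise" could mean, a system might split an utterance into segments and keep only those containing a recognised edit keyword, dropping fumbles and think-aloud fragments. The keyword list and sentence-splitting rule here are purely illustrative assumptions, not the behavior of the system we simulated.

```python
import re

# Illustrative edit keywords (an assumption for this sketch).
EDIT_KEYWORDS = {"add", "delete", "insert", "replace", "change", "move"}

def command_segments(utterance):
    """Split an utterance on sentence boundaries and keep only the
    segments that contain a recognised edit keyword."""
    segments = re.split(r"[.?!]", utterance)
    return [s.strip() for s in segments
            if EDIT_KEYWORDS & set(re.findall(r"[a-z]+", s.lower()))]

# Participant 15's utterance from the pilot studies:
utterance = ("I make use of a varie- eh of different technologies. "
             "Actually, maybe add moreover in the beginning of the sentence.")
print(command_segments(utterance))
# ['Actually, maybe add moreover in the beginning of the sentence']
```

Here the fumbled first sentence is discarded and only the segment carrying an actual edit intention survives, which is the behavior our participants implicitly expected.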