Conversational interfaces allow us to interact with computers as we do with other people to accomplish tasks like booking a flight.
Voice-based text editing, however, still requires learning a predefined syntax and repeating exact phrases such as "Delete cat. Insert bat". We cannot yet talk to a dictation system as we would to a human secretary, for example.
Among other things, this means that editing text by voice currently requires visual contact with the device. That is not possible, for instance, while walking.
Facilitating a more fluid interaction with text would benefit users in situations where their hands and eyes are occupied, and would also benefit visually impaired users.
But before we can design user-centered systems that allow eyes-free text editing, we need to understand how users would interact with such systems. The purpose of this research is to understand how users compose and revise text through a voice interface when they are not constrained by a fixed syntax (e.g. delete, insert...).
Our goal was to simulate a conversational UI for a text editing system and to observe participants composing and revising pieces of text using the system.
Prior to the study, we conducted a series of pilot studies to define crucial system operations and responses. We observed and analyzed the speech behavior of 22 participants in order to anticipate how the simulated system should operate and respond.
Participant 15 to System: I make use of a varie- eh of different technologies (pause). Actually, maybe add moreover in the beginning of the sentence.
After the system operations and responses had been defined, we conducted a Wizard of Oz study in which one of us played the role of the system.
We asked 7 participants to perform text composition and revision tasks using the simulated system. They could speak to the system in whatever way they preferred and were not aware that they were interacting with a person. After the study, we interviewed participants about their experience and behavior.
We recorded the experiments and interviews and analyzed the audio data qualitatively.
We collected, transcribed and analyzed more than 7 hours of audio data. Five themes emerged from the thematic analysis.
One of the themes was "Editing strategies". Most participants in the study used commands to request edits from the system. That is, they used certain keywords, although they did not adhere to a specific grammar (e.g. they used synonyms).
Participant 1 to System: Change continuous to frequent
Participant 3 to System: Could you replace student foreign with stolen phone
This observation is interesting: even when users are allowed to speak to the system more conversationally, they still rely heavily on keywords.
This may be because we are accustomed to interacting with computers through commands. But it may also be that users do not trust the system to perform the desired operations and therefore want to be more precise.
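To make this pattern concrete, the behavior we observed, command keywords used interchangeably rather than a fixed grammar, could be handled by a synonym-aware matcher. The study system itself was operated by a human wizard, so the following sketch is purely illustrative; the operation names and synonym sets are our assumptions, not part of the study:

```python
# Hypothetical sketch: mapping free-form edit utterances to operations
# via synonym sets instead of a fixed grammar. The synonym lists are
# illustrative assumptions; the study used a human wizard, not code.
import re

OPERATION_SYNONYMS = {
    "delete": {"delete", "remove", "erase", "cut"},
    "insert": {"insert", "add", "put"},
    "replace": {"replace", "change", "substitute", "swap"},
    "read": {"read", "repeat", "play"},
}

def match_operation(utterance):
    """Return the first operation whose synonym appears in the utterance,
    or None if no command keyword is recognized (a vague utterance)."""
    words = re.findall(r"[a-z']+", utterance.lower())
    for operation, synonyms in OPERATION_SYNONYMS.items():
        if any(word in synonyms for word in words):
            return operation
    return None
```

Under this sketch, the two participant utterances above ("Change continuous to frequent" and "Could you replace student foreign with stolen phone") would map to the same replace operation despite using different keywords, while a think-aloud utterance such as "I don't remember what I was going to say" would match no operation at all.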
Users' utterances ranged from very specific to very vague. Utterances are very specific when both the operation and its target are explicitly stated.
Participant 4 to System: Read the last sentence
In contrast, utterances are vague when users fumble or think aloud. A user who thinks aloud does not intend those thoughts to be transcribed by the system.
Participant 2 to System: I don't remember what I was going to say
From the results of the study, we derived a set of recommendations to inform the design of future systems for conversational text editing.
When users are not limited to a specific grammar, their utterances are not always precise, even when they use commands.
A future system should therefore be able to infer a user's intention; only then could it discard the noise in the user's utterance.