Design Thinking for Data Scientists — Prototype Evaluation Planning

Sep 30, 2020

Author(s): Tara Su

Abstract. In this report, I plan three evaluations for my redesign of the iPhone Mail app: a qualitative evaluation (post-event protocol) for the Wizard of Oz prototype of my voice interface, an empirical evaluation for the paper prototype of my incremental improvement to the app, and a cognitive walkthrough for the textual prototype of my virtual reality (VR) interface. I decided to carry out the post-event protocol and the cognitive walkthrough as my next step.

Qualitative method: post-event protocols

Figure 1: Target user profile

The first evaluation I want to employ is a qualitative method, specifically post-event protocols. This evaluation method is well suited to a Wizard of Oz prototype. The interface I am trying to improve is the Mail app on the iPhone.

The requirements I settled on after need-finding are 1) sorting emails into customized folders, 2) making notes while reading emails, 3) adding calendar events easily, 4) summarizing email and attachment content, and 5) extracting important links such as unsubscribe and account activation links. I proposed a voice interface to support these features. While a voice interface can work well for regular use when the user is alone, it can deliver an outstanding experience when other activities occupy the user's hands, such as driving or holding a baby.

Evaluation plan

Who are the participants: The participants I plan to recruit for this evaluation will fit the original user profile I had in mind (Figure 1).

How to recruit participants: Most of my colleagues fit this profile. I also have friends and family members who fit it. I plan to recruit 4–6 participants from among them.

Place of evaluation: Because of the nature of the voice interface, I plan to conduct most of the interviews in a quiet and private place, which is similar to the context I would expect users to be in in real life. For interviews with friends and family, I plan to hold the interview in a car (parked rather than driving, for safety at this stage) and while they are holding their babies.

Recording format: I will first interact with the participants using my Wizard of Oz prototype. I will conduct post-event protocols and record the results using notes. This method is non-intrusive and easy to analyze later.

Content of evaluation

Directions to the participants: I will first explain the basic background of the evaluation, including the app I am trying to improve, the format of this prototype (Wizard of Oz), and how the post-event protocol works. I will explain the basic features (checking email, making notes, creating calendar events, providing important links, etc.) to the users. After going through the prototype, I will ask for their feedback.

Data gathering: Before going through the prototype with the user, I plan to gather data according to the data inventory: user information, location during the evaluation, and the context of trying out the prototype. After going through the prototype, I would like to collect information such as the user’s goal when using the interface, what they needed (what is addressed in the prototype and what is not), and what tasks and subtasks they think they went through. In addition, I want to make notes about the users' reactions while trying out the prototype, the tasks and subtasks I think they went through, and which functions they actually engaged with.

Questions after experiencing the prototype: The initial list of questions I want to ask is as below:

  1. What was your goal when using the prototype?
  2. What did you need to accomplish your goal? What part of your need was addressed by the interface? What part was not addressed or not good in the interface?
  3. What tasks and subtasks do you have in your mind to accomplish your goal?
  4. What do you like? What do you not like? What improvements do you want?
  5. In what context do you find the prototype useful? In what context would it be difficult to use?
  6. What other feedback do you have for the prototype?

The major requirements are baked into the prototype, which will be explained to the users briefly in a neutral tone before they try it out. All the major components of the data inventory will be collected as mentioned above. This prototype will help me gauge the usability of the voice interface for the mail app in terms of the idea itself, the design, and the suitable context. I will also be able to observe the features they use, so I can refine my requirements.

Empirical evaluation

I would like to test my simple improvement of the current interface with an empirical evaluation. This prototype is a paper prototype that uses different tabs for different folders. It presorts emails into folders based on user behavior and email content, and allows users to move emails among folders. Users are also able to add email content to the calendar and make notes in the app while reading.

Empirical evaluation is normally conducted at a later stage of the design. Therefore, it is very difficult to simply apply empirical evaluation to this early-stage paper prototype. To really test it out, I would turn this paper prototype into a real app prototype that brings the paper design to life. This way, I could do much more quantitative analysis, such as analyzing the log data to get the time span of each step of the tasks, the path length of the interaction, and the efficiency and accuracy of note-taking or meeting scheduling when going through emails. Unfortunately, for now, I just have to stick with the current prototype.
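If the paper prototype were later developed into a real app, the log analysis mentioned above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the log format (a timestamp plus an action name), the action names, and the helper functions step_durations and path_length are all hypothetical, since no such app or log exists yet.

```python
from datetime import datetime

# Hypothetical interaction log: (ISO timestamp, action name).
# Both the format and the action names are invented for illustration.
log = [
    ("2020-09-30T10:00:00", "open_inbox"),
    ("2020-09-30T10:00:04", "open_email"),
    ("2020-09-30T10:00:19", "open_note"),
    ("2020-09-30T10:00:41", "save_note"),
]


def step_durations(events):
    """Return (action, seconds until the next logged action) for each step."""
    times = [datetime.fromisoformat(ts) for ts, _ in events]
    return [
        (events[i][1], (times[i + 1] - times[i]).total_seconds())
        for i in range(len(events) - 1)
    ]


def path_length(events):
    """Number of transitions the user made between the first and last action."""
    return max(len(events) - 1, 0)


durations = step_durations(log)
for action, seconds in durations:
    print(f"{action}: {seconds:.0f} s")
print("path length:", path_length(log))
```

Comparing these per-step times and path lengths between the old and new note-taking flows would give the kind of quantitative measure that a paper prototype cannot provide.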

Experimental conditions:

What are you testing: Empirical evaluation excels when comparing narrow scopes. To be more focused, I want to test a specific function of my interface: note-taking while going through emails.

What are you using as a point of comparison: To make it a fair comparison, the control condition will be a paper prototype without the note-taking function baked into the app. The user can use other apps to take notes. The test condition will be the new interface with the note-taking function in the same app.


The null hypothesis is that the new user interface has a similar or lower user satisfaction score than the old one for note-taking while going through emails. The alternative hypothesis is that the new user interface has a higher satisfaction score for note-taking.

Experimental methods

Between or within-subject: I would like to conduct the experiment as a within-subject comparison.

Assigning subjects to groups: Each user will be randomly assigned to see either the interface without the note-taking function first or the interface with the note-taking function first.
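The random ordering described above can be sketched in a few lines of Python. This is a minimal sketch under my own assumptions: the participant labels are placeholders, the function name assign_orders is hypothetical, and splitting the two orders as evenly as possible (rather than flipping a coin per person) is a choice I am adding here to keep the order effect balanced in a small sample.

```python
import random


def assign_orders(participants, seed=None):
    """Randomly assign each participant a presentation order.

    Half see the control prototype first, the rest see the test prototype
    first. With an odd number of participants, the extra person sees the
    test prototype first.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for i, person in enumerate(shuffled):
        if i < half:
            orders[person] = ("control", "test")
        else:
            orders[person] = ("test", "control")
    return orders


# Placeholder participant labels, not real recruits.
participants = ["P1", "P2", "P3", "P4", "P5", "P6"]
orders = assign_orders(participants, seed=42)
for person in participants:
    print(person, "->", " then ".join(orders[person]))
```

Fixing the seed only makes the assignment reproducible for record keeping; leaving it as None gives a fresh random assignment each run.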

What will they complete: I will present the control and test paper prototypes in the order given by their random assignment. I will let them know that they will need to rate their satisfaction with both interfaces.

Data generation: After I walk them through both interfaces, I will ask them to rate satisfaction (highly dissatisfied, dissatisfied, neutral, satisfied, highly satisfied) for both interfaces.

Analysis to be used on data: I will perform the Kolmogorov-Smirnov test on the counts of people’s ratings for the control and test conditions. This test is particularly useful when the categories of the dependent variable have an ordinal relationship among them.
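To make the comparison concrete, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic on the five-point scale, using only the Python standard library. The ratings below are made up purely for illustration and are not results. The sketch computes only the statistic D, the largest gap between the two cumulative rating distributions. A p-value would need a statistics library (for example SciPy's scipy.stats.ks_2samp, after coding the levels as 1 to 5). The classical test also assumes continuous data and independent samples, so a p-value on tied five-point ratings from the same participants should be read as approximate.

```python
from collections import Counter

# Ordered satisfaction levels, lowest to highest.
LEVELS = [
    "highly dissatisfied",
    "dissatisfied",
    "neutral",
    "satisfied",
    "highly satisfied",
]


def cumulative_proportions(ratings):
    """Cumulative share of ratings at or below each level, in LEVELS order."""
    counts = Counter(ratings)
    total = len(ratings)
    running = 0
    cdf = []
    for level in LEVELS:
        running += counts[level]
        cdf.append(running / total)
    return cdf


def ks_statistic(control, test):
    """Largest absolute gap between the two cumulative distributions."""
    cdf_control = cumulative_proportions(control)
    cdf_test = cumulative_proportions(test)
    return max(abs(a - b) for a, b in zip(cdf_control, cdf_test))


# Made-up ratings for illustration only; no data has been collected.
control = ["dissatisfied", "neutral", "neutral",
           "satisfied", "dissatisfied", "neutral"]
test = ["satisfied", "highly satisfied", "neutral",
        "satisfied", "satisfied", "highly satisfied"]

d = ks_statistic(control, test)
print(f"D = {d:.3f}")
```

A larger D means the two rating distributions differ more; whether the difference is in the direction of the alternative hypothesis still has to be checked by looking at which condition has more high ratings.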

Confounding factors: The confounding variables that may compromise my analysis include but are not limited to:

1) People are more familiar with taking notes the old-fashioned way.

2) The users I recruit are mostly highly educated. They may be naturally attracted to new solutions, ones they have never used in real life.

3) The drawing quality, rather than the functionality, is better in one paper prototype than in the other.

Predictive evaluation

For predictive evaluation, I would use the cognitive walkthrough method to evaluate my VR interface for improving the Mail app on the iPhone. In this prototype, I proposed creating a VR experience with virtual assistants to help users go through and organize their mail.

This interface is very costly to build. Most parts of the design are just in my mind. I used a textual prototype to represent this design. It may be difficult for users to vividly imagine the real product, whereas it is relatively easier for me. In this case, a method with no user involvement such as a cognitive walkthrough can be particularly useful.

In this cognitive walkthrough, the user’s goal would be to go through all the emails received in the past day, make notes, and add calendar events.

The emails will be related to work, important personal matters, social activities, and promotions. The emails will be sorted and waiting in different rooms according to what kind of emails they are. The user can also check everything in the “lobby” in unsorted form. A butler (virtual agent) will appear any time the user calls him. The butler can help the user execute any task or answer any question.

In this walkthrough, I will specifically focus on evaluating the gulf of execution and the gulf of evaluation. The interface is completely novel. Although it is very similar to direct manipulation in the real world, it is meant to be better than the real world, which means it will have functions that most people have never experienced (such as having a butler) and functionality that cannot happen in real life (such as teleportation). Discoverability and feedback become extra important in this scenario, as do other design principles such as flexibility, ease of use, and simplicity.

I will start by imagining myself entering the lobby and ask myself questions related to the gulfs of execution and evaluation at every step of the journey. What would I perceive? What would I think? What kind of goal would I have in mind? Would I know what to do? How would I discover the functionalities? How would I interact with the butler? How would I “walk” into a room? Would my interactions with the “mails” feel natural? How would I take notes while going through them? How would I add calendar events? How would I even know that these tasks are possible? How would I move from one room to another? Would I find it tedious or a good change of mood? During the whole process, would I get good feedback on my actions? Would I be confident that all my actions led to the results I wanted?

Overall, I will put myself into the user’s shoes and imagine every detail of the experience of interacting with my newly designed interface.

The benefits and caveats of the cognitive walkthrough are obvious. The benefit is that I can start building a good user experience without a lot of investment in actually building this interface. I have the vision of the design, so it is very easy for me to imagine every detail of the virtual world, which could be hard for most users. On the other hand, the caveats are clear. I am not my user, no matter how hard I try to put myself in their shoes. It would be hard for me to know what they really feel and many of the difficulties they would encounter.

Overall, this is a helpful step of evaluation for this specific prototype.

Final conclusion

After carefully considering the three evaluation methods above, I decided that I will carry out the qualitative method (post-event protocol) for my voice interface and the cognitive walkthrough for my virtual reality interface.

The post-event protocol is especially useful for the Wizard of Oz prototype I have for the voice interface. By carrying it out, I will get detailed first-hand user feedback about my design and can refine the design accordingly.

For an interface as novel and challenging as virtual reality, a cognitive walkthrough is particularly suitable. I can do it as many times as I want and think through every detail that needs to be considered before I invest more money and users' time. Nevertheless, the caveats of the cognitive walkthrough are clear. After refining my prototype enough, I would use other methods, such as qualitative methods (think-aloud, post-event protocol, etc.), to evaluate it further.

Last, the reason I didn’t choose the empirical evaluation is that what I can quantify with my paper prototype is very limited. I would prefer to develop my paper prototype into a real interface at a later stage of the design process, then do a more thorough empirical evaluation based not only on user feedback (satisfaction) but also on computer logs, which offer much higher consistency, validity, and precision.