Have you ever filled out an intake form at the doctor's office and had to answer the same question multiple times? How about a survey question where the answer should have been obvious based on your previous answers? A questionnaire that was so long you just gave up halfway through?
Let's develop an intelligent survey that adapts to your previous answers. It asks the broadest questions first and then only asks more specific questions when relevant.
As for the "predicting everything" the blog post title promises, I'll get to that in just a sec!
Decision trees
Decision trees, built with the ID3 and C4.5 algorithms, use the math of information theory to select attributes from a dataset that are the most informative. We can use this same approach to select questions from a survey that are the least redundant.
Figure 1: Decision tree — Medical triage is like a tree where each question narrows down the possible diagnoses until we can predict the urgency of treatment.
> Disclaimer: This is not medical advice. I'm not a doctor and have no real idea how medical triage works.
Entropy
The first concept we need to define is entropy, which measures how uncertain we are about how a survey participant would respond to a question. We'd like to prompt the participant with questions that decrease entropy. The faster we get entropy down to zero, the sooner we can stop asking questions.

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)

where:
- H is the entropy function
- X is a random variable representing survey responses
- \mathcal{X} is the set of possible values for X
- p(x) is the probability that X = x
In this formula, -\log_2 p(x) is called the "information content" of outcome x. The less likely an outcome is, the more information we gain when we observe it. When we weight by p(x) and sum over all possible outcomes, we get the expected information content, which is also known as the entropy.
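As a concrete sketch, here's what entropy over categorical responses might look like in Python (the function name and list-of-answers representation are my own choices for illustration):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical answers."""
    counts = Counter(values)
    total = len(values)
    # sum p(x) * -log2(p(x)) over each distinct answer x
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

A 50/50 yes/no split gives 1 bit of entropy, while a unanimous answer gives 0 bits, meaning the question would teach us nothing.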
For simplicity, these code snippets focus on categorical data and don't cover entropy over continuous attributes. Fortunately, this will work totally fine for multiple choice or yes/no survey questions.
Information gain
Next, we'll look at information gain, which measures how much we expect to learn if the participant answers a particular question. Finding the question with the highest information gain is the key to building an adaptive survey.

IG(D, A) = H(D) - \sum_{a \in \mathrm{values}(A)} p(a) \, H(D_a)

where:
- D is the full dataset -- all previous survey responses
- A is the attribute we’re splitting on -- the question we’re asking
- a is a possible value of attribute A -- an answer to a question
- D_a is the subset of D where A = a
- p(a) = \frac{|D_a|}{|D|} is the proportion of samples with A = a
We don't know which of the possible answers to attribute A the participant will give, so we take a weighted average over the possible answers. The information gain is the expected difference in entropy before and after asking the question.
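In code, that weighted average might look like the following sketch (the dict-per-response shape and function names are assumptions for illustration, with a small entropy helper included so the snippet stands alone):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical answers."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    """Expected drop in entropy of `target` after asking `attribute`.

    `rows` is a list of dicts mapping question names to answers.
    """
    base = entropy([row[target] for row in rows])
    # group the target answers by the participant's answer to `attribute`
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    # weight each branch's entropy by p(a) = |D_a| / |D|
    weighted = sum((len(g) / len(rows)) * entropy(g) for g in groups.values())
    return base - weighted
```

On a toy triage dataset where "fever" perfectly predicts urgency and "cough" tells us nothing, the fever question scores a full bit of gain and the cough question scores zero.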
Selecting a question
The only thing left to do is use entropy and information gain to select questions. It's a fairly simple algorithm.
1. Loop over all attributes.
2. Compute the information gain for each.
3. Select the attribute with the highest information gain.
4. Repeat steps 1-3 until either:
1. Information gain for all attributes drops below some threshold (configurable).
2. All attributes in the dataset have already been selected.
And that's all of it! There's a bit of math to it, but at the end of the day, not so complicated. Now we have an adaptive survey engine that avoids asking questions that don't teach us anything.
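Putting the pieces together, the selection loop might be sketched like this. It's a batch version for clarity; the names and dataset shape are my own assumptions, and a live adaptive survey would additionally filter the dataset down to rows matching each answer the participant gives before picking the next question:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    base = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    return base - sum((len(g) / len(rows)) * entropy(g) for g in groups.values())

def select_questions(rows, target, threshold=0.01):
    """Greedily pick the highest-gain question until gain drops below
    `threshold` or every question has been selected (steps 1-4 above)."""
    remaining = [a for a in rows[0] if a != target]
    selected = []
    while remaining:
        best = max(remaining, key=lambda a: information_gain(rows, a, target))
        if information_gain(rows, best, target) < threshold:
            break  # nothing left worth asking
        selected.append(best)
        remaining.remove(best)
    return selected
```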
Predicting everything
Ok, now that we have the basics ironed out, let's think about ambitious applications. What could we do if survey taker effort didn't scale with the number of questions?
The paradox of the adaptive survey is that while each participant spends less time taking it, you can include more total questions, including questions that don't apply to a large portion of the population but are highly informative for some individuals.
With enough questions and participants you could create a survey that selects the right questions to predict a huge range of target variables.
The vast majority of our questions don't even have to be good, but now and then we may uncover an unlikely gem of a question that is surprisingly predictive of some outcome. Maybe that outcome is something whimsical like "Which Friends character are you?" Or maybe it's something more scientific, like "What is your OCEAN personality type?"
A similar idea is the New York Times Dialect Quiz that predicts where in the US you are from based on how you speak. I find it surprising and fascinating that a survey can pinpoint where you're from based on simple questions like whether you say "frosting" or "icing", "soda" or "pop", "water fountain" or "drinking fountain", "highway" or "freeway".
You could use this as part of the matching algorithm for a serious dating app. Or help high school and college students think about potential careers based on their interests and personality traits.
I should caution that testing multiple hypotheses with the same dataset is always fraught. With enough data, you're sure to find spurious correlations. To avoid p-hacking, any interesting patterns uncovered would need to be validated with a separate study. Still, even if we don't consider our immediate findings conclusive, the adaptive survey could be a really powerful tool for hypothesis generation.
Wrapping up
I built this survey engine because often a random question pops into my head that feels like it would split the population in an interesting way. I don't have a target variable in mind to predict. I just have a feeling that the responses to my question might be correlated to something interesting. Of course, the vast majority of these ideas will correlate with nothing very special at all. And yet, there's hope that one needle in the haystack will lead to a grand discovery.
Acknowledgments
Thanks to Yuki Zaninovich for his valuable feedback on a draft of this post.
: The C4.5 algorithm extends the ID3 algorithm to handle continuous features. Entropy for a continuous feature is calculated by treating each observed numeric value as a candidate threshold, computing the entropy of the target over the samples above and below that threshold, and keeping the best-scoring split.
: I like to think of this as similar to a genome-wide association study, but instead of measuring correlations in genetics for large populations, we're finding correlations in psychology.