Episode 4: Statistical Sampling Bias – Show Notes
Bias sneaks into algorithms and data science from multiple sources. Primarily, it comes from statistical or cognitive biases that then lead to biased conclusions or results. In today’s episode, we look at four types of statistical sampling bias to understand how biased samples skew algorithms.
[2:54] Selection Bias – a statistical sampling bias from selecting groups that are not representative of the population of interest. Example: Breast cancer and gender
[3:12] Self-Selection Bias – a statistical sampling bias from participants opting to respond who are not as diverse as the underlying population. Example: Product reviews
[3:43] Non-Response Bias – a statistical sampling bias from differences in characteristics between those responding and those not responding. Example: Surveys and age
[4:09] Survivorship Bias – a statistical sampling bias from looking only at a remaining group rather than the full population. Example: Mortgage default and current homeowners
[4:28] Falling Cats – a further example of survivorship bias
Additional Links on Statistical Sampling Bias
Sampling Bias – Wikipedia lists several statistical sampling biases beyond those covered on today’s episode
On Landing Like a Cat – New York Times article detailing the falling cats study from 1989
Episode Transcript
Welcome to the Data Science Ethics Podcast. My name is Lexy and I’m your host. This is episode 4: Statistical Sampling Bias.
Many of the ethical issues around data science focus on the concept of bias. It’s discussed in the context of data, machine learning, AI – bias is seemingly everywhere. When most people say that something is biased, they intend it to mean that it is systematically prejudiced for or against a specific group. There is a grand hope that algorithms, as cold and unfeeling as they are, can help to eliminate these biases. Yet it’s been proven in many contexts that algorithms absolutely are biased and reinforce existing prejudices.
How can this happen? How do algorithms become biased? And how capable are data scientists of preventing this?
The answer is that biases creep in from humans. We imperfect humans are making all manner of decisions throughout the data science process that bias algorithms. Humans identify what problems to solve with an algorithm. Humans tell algorithms what data to look at – namely, what the humans think will be important. Humans tell algorithms what is a right or wrong answer to a problem. Humans, often a small number of them at a time, are the heart of an algorithm. And humans are fallible.
We make mistakes. We fail to consider every perspective. We include or exclude things based on our own experience and ideas. We use what is fast or convenient, rather than what is most effective. We are often unconscious of the many influences that shape our perspectives or our data. These influences are the sources of bias.
There are two major classes of bias – statistical and cognitive bias. Statistical bias happens in data collection and estimation. Cognitive bias happens in flawed decision making and thought patterns. Both can be brought into algorithms and neither the data scientist nor the algorithm will know that the bias is there.
In today’s episode, we will talk about some of the statistical biases that stem from the way data was initially gathered. The biases in this class, called statistical sampling biases, are different ways that collected data may be skewed away from the population it is meant to represent.
As an example, suppose that you are a medical researcher studying people who have had breast cancer. Around 99% of breast cancers occur in women and only 1% occur in men. An unbiased sample would therefore need to include 99 women for every 1 man, because that is the occurrence in the population. Note that this is markedly different than the roughly 50/50 split of female and male in the general population because the population of interest is only those people with breast cancer. If you instead received a set of data with 50% female and 50% male, it would be biased.
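To make the comparison concrete, here is a minimal Python sketch that measures how far a sample's makeup sits from the population of interest. The 99% / 1% split is the one quoted above; the sample counts and the function names are hypothetical, chosen only for illustration.

```python
# Population shares among people with breast cancer, as quoted in the episode.
population_share = {"female": 0.99, "male": 0.01}

# Hypothetical sample: a data set that arrived split 50/50 by gender.
sample_counts = {"female": 500, "male": 500}


def sample_shares(counts):
    """Convert raw counts per group into shares of the whole sample."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def representation_gap(counts, population):
    """Sample share minus population share for each group.

    Positive means the group is over-represented in the sample,
    negative means it is under-represented.
    """
    shares = sample_shares(counts)
    return {group: shares[group] - population[group] for group in population}


gaps = representation_gap(sample_counts, population_share)
for group, gap in gaps.items():
    print(f"{group}: sample share differs from population share by {gap:+.2f}")
```

In this made-up sample, men are over-represented by 49 percentage points relative to the population of interest, even though a 50/50 split would look balanced against the general population.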
That scenario we just walked through is an example of selection bias. This is when the people gathering the data, usually researchers, choose a sample that is not representative of the group they are looking to estimate. The operative word there is “choose.” Other statistical sampling biases offer less opportunity for choice.
For instance, self-selection bias happens when people choose whether or not they want to participate. A good example of this is in product reviews. Most people will not bother to review something unless they have a strong opinion about it. Either they think the product is great or they think it is terrible and therefore elect to share their review. It is rare that someone will self-select to rate a product that they feel is middle-of-the-road. Check it out on Amazon sometime. Twos and threes are usually the least common review values out of five.
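A small worked example may help show how this plays out. The sketch below is an assumption-laden illustration, not real review data: both the buyers' true opinions and the chance that a buyer posts a review are invented numbers, set so that strong opinions are far more likely to be posted.

```python
# Hypothetical: how 1,000 buyers truly feel about a product, by star rating.
true_opinions = {1: 100, 2: 200, 3: 400, 4: 200, 5: 100}

# Assumed probability that a buyer with each opinion bothers to post a review.
post_probability = {1: 0.30, 2: 0.05, 3: 0.02, 4: 0.05, 5: 0.30}

# Expected number of posted reviews at each star rating.
expected_reviews = {
    stars: true_opinions[stars] * post_probability[stars]
    for stars in true_opinions
}


def share(counts, ratings):
    """Share of all counted items that fall in the given ratings."""
    return sum(counts[r] for r in ratings) / sum(counts.values())


extreme_true = share(true_opinions, [1, 5])
extreme_posted = share(expected_reviews, [1, 5])
middle_true = share(true_opinions, [3])
middle_posted = share(expected_reviews, [3])

print(f"1- and 5-star share: {extreme_true:.0%} of buyers, {extreme_posted:.0%} of reviews")
print(f"3-star share: {middle_true:.0%} of buyers, {middle_posted:.0%} of reviews")
```

Under these assumptions the average star rating happens to stay the same, but the shape of the distribution changes sharply: the posted reviews are dominated by ones and fives while the middle nearly disappears.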
The inverse, or complement in many ways, to self-selection bias is non-response bias. This is when the characteristics of people who do participate differ somehow from those who opt not to participate. Non-response bias often happens in surveys. Older people are generally more likely to complete a survey than younger people. Researchers relying on survey data may get a biased viewpoint if the proportion of older respondents is higher than the proportion of older people in the population.
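The arithmetic behind that skew can be sketched in a few lines of Python. Every figure here is hypothetical: the group sizes, the response rates, and the share answering "yes" are assumptions made up to illustrate the effect.

```python
# Hypothetical figures throughout; none of these numbers come from the episode.
invited = {"younger": 600, "older": 400}          # people asked to take the survey
response_rate = {"younger": 0.10, "older": 0.30}  # assumed: older people respond more often
says_yes = {"younger": 0.70, "older": 0.40}       # assumed share answering "yes" per group

# Expected number of completed surveys in each age group.
respondents = {group: invited[group] * response_rate[group] for group in invited}

total_invited = sum(invited.values())
total_respondents = sum(respondents.values())

# Older people's share of everyone invited versus of those who responded.
older_share_invited = invited["older"] / total_invited
older_share_respondents = respondents["older"] / total_respondents

# The "yes" rate across everyone invited versus across respondents only.
true_yes = sum(invited[g] * says_yes[g] for g in invited) / total_invited
survey_yes = sum(respondents[g] * says_yes[g] for g in respondents) / total_respondents

print(f"Older share: {older_share_invited:.0%} invited, {older_share_respondents:.0%} responding")
print(f"'Yes' rate: {true_yes:.0%} among all invited, {survey_yes:.0%} in the survey")
```

With these invented numbers, older people make up 40% of those invited but two thirds of respondents, and the survey's "yes" rate lands at 50% when the rate across everyone invited is 58%.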
The final bias we will cover today is survivorship bias. Survivorship bias is when you only look at some remaining group and therefore omit the group that did not remain. An example of this might be looking at mortgage default rates among current homeowners. If prior homeowners, like those who had defaulted on their mortgage, are excluded, then the default rate would appear artificially low.
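Here is a minimal sketch of that mortgage example in Python. The counts are hypothetical, including the assumption that a small number of borrowers defaulted but kept their homes and so still show up among current homeowners.

```python
# Hypothetical counts, invented for illustration only.
originated = 1000          # all mortgages ever issued in the group being studied
defaulted_total = 80       # borrowers who defaulted at some point
defaulted_and_left = 70    # defaulters who lost the home and are no longer homeowners

# Current homeowners are everyone except those who defaulted and left.
current_homeowners = originated - defaulted_and_left

# Defaults still visible when looking only at current homeowners.
defaults_among_current = defaulted_total - defaulted_and_left

full_population_rate = defaulted_total / originated
survivor_only_rate = defaults_among_current / current_homeowners

print(f"Default rate, all borrowers: {full_population_rate:.1%}")
print(f"Default rate, current homeowners only: {survivor_only_rate:.1%}")
```

Under these assumptions the default rate is 8% across all borrowers but only about 1% among current homeowners, because most of the people who defaulted have dropped out of the group being measured.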
Another, perhaps stranger, example of survivorship bias was a 1980s study that looked at the mortality rate of cats falling from various heights. It stated that cats who fell from greater heights were somehow more likely to survive. The way they studied this was to look at veterinary records of cats treated for injuries sustained from falling. What they failed to take into consideration was that cats who died from higher falls may not have been taken to the vet at all.
Even worse, multiple biases can apply to a study. Suppose that you are taking a poll to get a sense of the sentiment of the town on a topic in local politics. You decide to visit a polling station at 10 a.m. on election day to survey voters. Which of these biases would influence the results?
By choosing one specific polling location, you limit the number of communities represented from the town. This introduces selection bias.
By using election day as the time to collect data, you may find only people who are most passionate about a local issue turn out. This is self-selection bias.
By visiting at 10 a.m., there is a lower probability that working voters will be in attendance during your data collection. That’s non-response bias.
That’s three biases all applying to one study.
These biases throw many wrenches into the works. They often occur well before any data scientist is involved. In fact, data scientists may receive data long after it was originally collected and may not have insight into how the studies were conducted. Once in the data, it is difficult to spot statistical sampling bias and even more difficult to either remove or adjust for it. And as insidious as these are, they’re just the tip of the iceberg.
Different types of statistical and cognitive biases abound in prejudicing the results of models. Data scientists must be vigilant to take into consideration the sources that they are working with and how those sources came to be. We may not be able to eliminate all bias, but we can shine a spotlight on where it is lurking and attempt to detail the impact it has on our analyses.
In our next episode, we have a quick take on data science trying to eliminate bias in hiring practices.
I hope you’ve enjoyed listening to this episode of the Data Science Ethics Podcast. If you have, please like and subscribe via your favorite podcast app. Join in the conversation at datascienceethics.com, or on Facebook and Twitter at @DSEthics where we’re discussing model behavior. See you next time.