Data Science Ethics Podcast – Episode 1 Show Notes
As a starting point, we’re laying some groundwork. In this first informational episode, we talk about algorithms – what they are, what they do, and why they’re important to data science ethics.
Algorithms perform a set of steps on inputs to get to an output. In data science, we commonly use algorithms to predict what is likely to occur – for instance, how likely someone is to default on a loan or what the weather will be at 5 p.m. today. Other algorithms rate or rank things, like in a search engine ranking or a product recommendation engine.
Part of the reason to develop algorithms in the first place is to be able to make decisions consistently many times over. They form a crucial part of systems we all rely on to be accurate, fair, and fit for purpose. If the algorithm is flawed or biased, it instills that flaw or bias in every decision that it makes, every time. That’s why data scientists must be cautious in creating algorithms that are powerful yet ethical.
Additional Links on Algorithms
Most algorithm articles and resources out there speak about computer science more than data science. These links provide some good lists of common data science algorithms and their use. Incidentally, the websites that these come from are some of my favorites for general data science information. They often have articles with code snippets to help solve common business problems with analytics.
Episode Transcript
Today we’re going to talk a bit about algorithms. Specifically, we’ll define what algorithms are; we’ll talk about what types of algorithms there are; we’ll look at some opportunities for where you may have seen algorithms in your travels; and then we’ll dig a little bit more into what algorithms have to do with ethics in the larger setting of data science ethics.
An algorithm is a function or formula – generally a very complex one – that's programmed for a computer to perform. Algorithms look kind of like a formula you may have seen in school: an equation where you have inputs on one side and an output on the other. It doesn't always work this way; sometimes an algorithm looks like a very large case statement. In general, an algorithm is a series of functions a computer performs to create an outcome based on the data put into a system. The algorithm is all those steps that go on in between the inputs and the output.
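To ground the "inputs on one side, output on the other" picture, here's a tiny algorithm in Python. It uses the standard amortized loan-payment formula; the function name and sample values are illustrative, not anything from the episode:

```python
def loan_payment(principal, annual_rate, years):
    """A small 'algorithm': a fixed set of steps turning inputs into an output.
    Standard amortized-payment formula; names and values are illustrative."""
    monthly_rate = annual_rate / 12
    n_payments = years * 12
    if monthly_rate == 0:
        # No interest: just split the principal evenly across payments.
        return principal / n_payments
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_payments)

# Inputs go in one side, a single output comes out the other.
print(round(loan_payment(10000, 0.05, 3), 2))  # monthly payment in dollars
```

Everything between the inputs and the printed number – the rate conversion, the branch, the formula – is the algorithm.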
Data scientists often classify the types of algorithms based on how statistics divides them. So you may hear terms like "regression," "classification," "clustering," "anomaly detection," or "association pattern" – something like that. The more modern techniques are things like deep learning using neural networks, reinforcement learning, or natural language processing. These are all kinds of algorithms that represent different functions applied to input data. But that's not how you would see them in your normal life. So let's talk a little bit about the kinds of algorithms that you use every day.
One algorithm that many of us are very familiar with is a credit score. A credit score is based on a classification algorithm. In this case, we're predicting a "yes" or a "no" as to whether someone is likely to default on their next loan or credit card. The algorithm produces a probability – how likely it is that this person will default. The more likely the person is to default, the lower their credit score. That probability then turns into the score that you see.
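To make the probability-to-score idea concrete, here's a minimal sketch in Python. Actual credit-scoring formulas are proprietary; the linear mapping and the 300–850 range below are illustrative assumptions, not any bureau's real method:

```python
def default_probability_to_score(p_default, low=300, high=850):
    """Map a model's probability of default onto a familiar score range.
    Illustrative linear mapping only: real scoring formulas are proprietary.
    Higher probability of default -> lower score."""
    return round(high - p_default * (high - low))

print(default_probability_to_score(0.02))  # low risk of default -> high score
print(default_probability_to_score(0.60))  # high risk of default -> low score
```

The key relationship from the episode survives even in this toy version: as the predicted probability of default rises, the score falls.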
Another very common type of algorithm is a search engine ranking. If you use Google, Bing, or any other search engine, you’ve used this, even if it was behind the scenes. A search engine ranking algorithm identifies the most relevant pages to your contextual search. I say contextual because there are a number of things that come into play. Every time you enter a new phrase into the search engine, it has to interpret the context.
So as an example, if I were to search for “husky pictures”, I would be expecting to see images of really adorable dogs. However, there are other types of huskies that I might see. I might see the UConn Huskies. I might see Husky brand. Search engines have to know that the context in which I’m searching is for a dog as opposed to one of these other options. So it’s not just that the pages have the keywords that I’ve used but that the context is properly interpreted by the algorithm. That tends to involve a number of different actual functions that comprise the total process – the total algorithm – that is a search engine ranking.
Another very common algorithm is a weather forecast. If you look at your phone every morning and check to see if you need to bring an umbrella, you can thank an algorithm for that. A weather forecast is a time series analysis. They’re also called forecasts, so that helps. The time series analysis looks to see what’s likely to happen at any given point in the future. The weather forecast uses information that we’ve seen from prior weather patterns, as well as current conditions, and blends them to identify what is likely to occur over the next several days. Beyond that, the forecast gets very uncertain and so we don’t necessarily project far into the future.
So why bother worrying about the ethics of an algorithm? Well, an algorithm, as I mentioned before, is a computer program. Most often a computer program does the same type of thing over and over again. So it's important that whatever that thing is, the program does it fairly, does it accurately, and does it in a way that considers the context in which it's going to be used.
As an example, let’s think back about the weather forecast. What if a meteorologist were developing a completely new forecasting algorithm – one that would at least ninety percent of the time be accurate? But the way that they figured that out was that they measured only how many days were sunny. And they said, “well, ninety percent of the time it’s sunny. So if I always predict that it will be sunny, ninety percent of the time I’ll get it right.”
Ninety percent accuracy can be seen as very high in some situations, but with weather, we want more precision. We want a more accurate forecast. If ten percent of the time you went outside, you didn’t have an umbrella because you thought it was going to be sunny and it started raining, you’d probably be pretty upset… as well as wet.
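The "always predict sunny" trap can be shown in a few lines of Python. The labels below are hypothetical, constructed so that it rains on 10% of days:

```python
# Hypothetical labels: 1 = rain, 0 = sunny, with rain on 10 of 100 days.
actual = [1] * 10 + [0] * 90
predicted = [0] * 100  # the lazy forecaster always says "sunny"

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
rainy_days_caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(accuracy)           # 0.9 -- looks impressive on paper
print(rainy_days_caught)  # 0   -- yet every single rainy day is missed
```

Ninety percent accuracy, zero usefulness: the metric looks great precisely because the events we most need to catch are rare.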
But what if it were even worse? Anomalies like hurricanes, for instance, are crucial to know about in advance. Those are exactly the events a forecast should capture. But if, for whatever reason, the meteorologist never bothers to predict them, then the one percent or less of the time that we do have a hurricane would be disastrous. No one would be prepared. So it's important to understand the context in which your algorithm is going to operate – in this case, helping millions of people prepare for the weather ahead.
Equally, if we think back to the credit score, if we just say “no one can have access to credit”, we’re taking away an important safety net from a lot of people. Or if we say that everyone gets credit, then the banks are no longer going to be profitable because there are going to be too many people who are defaulting on their credit.
Algorithms require data scientists to balance precision and accuracy with fairness, context, and the ethical consideration of how an algorithm will impact the world. That's why it's so important to understand algorithms as part of data science ethics.
Thank you so much for joining us today. I hope you've enjoyed this episode. If you have, please like and subscribe. You can find the Data Science Ethics Podcast on iTunes, Stitcher, PocketCasts, or wherever you get your favorite podcasts. You can also find us on datascienceethics.com and join in the conversation there, or on Facebook and Twitter at @DSEthics. See you next time.