Collect Carefully

Image Copyright: grandeduc / 123rf

Episode 28: Collect Carefully – Show Notes

The era of Big Data has brought the ability to gather and process vast stores of information about almost anything. It enables data scientists to bring enormous swaths of data to bear on a given problem. Further, it expands the ability to collect data from research techniques that were previously too cumbersome to work with.

But gathering all the data isn’t always the right answer to that problem. Sometimes collating information can increase the chance for biased results and spurious conclusions.

In today’s episode, we discuss why researchers and data scientists must collect carefully to avoid skewing their analyses and causing unfair impacts.

Additional Links on Collect Carefully

The Ethics of Big Data EMC article from 2014 describing the rise of Big Data and the ethics of using more data instead of the right data.

A crummy drop-down menu appeared to kill dozens of mothers in Texas ArsTechnica article about the dropdown that caused data entry errors resulting in increased maternal mortality rates seen in Texas

Data Protection Laws of the World DLA Piper’s interactive map of data privacy regulations around the world

Collect Carefully Episode Transcript


Welcome to the Data Science Ethics Podcast. My name is Lexy and I’m your host. This podcast is free and independent thanks to member contributions. You can help by signing up to support us at For just $5 per month, you’ll get access to the members only podcast, Data Science Ethics in Pop Culture. At the $10 per month level, you will also be able to attend live chats and debates with Marie and I. Plus you’ll be helping us to deliver more and better content. Now on with the show.

Marie: Hello everybody and welcome to the Data Science Ethics Podcast. This is Marie Weber

Lexy: And Lexy Kassan

Marie: And today we’re going to talk about one of the principles of data science ethics, which is collect carefully. This goes back to the data science process and it really ties into how you acquire data and your understanding of the data. So Lexy, when you are looking at developing an algorithm and putting something together, how do you incorporate collect carefully?

Lexy: Collect carefully is really all about understanding where your data comes from, how it was gathered, what was asked, what populations were sampled, all of the underlying process that came into play to get the data to where you’re seeing it. Very often as data scientists, we come in much later in the process. The data was already collected by some other person or some other system that we have not had input into. We are simply taking the results of it. However, in taking those results, we need to be very cognizant of what was done to get that data and how that may impact the results that we’re seeing. As an example, we’d want to make sure that any information that we’re using for a prediction was captured pre-hoc, essentially meaning prior to the event we’re trying to predict, so that we’re not unduly influencing the results with something that we wouldn’t have known before the event.

Lexy: We would want to think about the representation within the data so that we understand any biases or skews that we may have. We’d want to understand how we should impute data. So we’ve talked in the data science process about how we transform data and create features and so forth. But really all of those transformations are informed by what data you have and how it was collected so that you really understand what it represents and you can treat it appropriately to create an algorithm that’s going to work properly and that is based on information that you would be able to replicate.
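The "pre-hoc" point above can be illustrated with a minimal sketch. It assumes each record carries a timestamp for when it was captured; the field names (`feature`, `captured_at`) and the sample data are hypothetical and not from the episode.

```python
from datetime import datetime


def prehoc_features(records, event_time):
    """Keep only records captured strictly before the event being predicted."""
    return [r for r in records if r["captured_at"] < event_time]


# Hypothetical example data; field names are illustrative only.
records = [
    {"feature": "site_visits", "value": 3, "captured_at": datetime(2019, 1, 5)},
    {"feature": "support_call", "value": 1, "captured_at": datetime(2019, 1, 20)},
    {"feature": "refund_issued", "value": 1, "captured_at": datetime(2019, 2, 2)},
]
event_time = datetime(2019, 2, 1)

usable = prehoc_features(records, event_time)
usable_names = [r["feature"] for r in usable]
print(usable_names)
```

Anything recorded on or after the event time is dropped, since it would not have been known when the prediction had to be made.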

Marie: And when you think about different data sets, there are going to be data sets that you can potentially have an impact on moving forward. So as you’re developing your model, you can say it would be really great if we could have this type of information moving forward and then you can give that back to your developer team or other internal teams so they can try to collect that data in the future

Lexy: Potentially. If it’s an owned channel, meaning somewhere you control, where your organization has some influence over the process by which that data’s collected, yes. If you’re gathering data from other sources and blending it together, not as much.

Marie: Right. That was going to be the other area that I talked about.

Lexy: Yeah. That said, it’s also part of understanding how to collect carefully to not overreach. If you’re not going to use a certain set of data, don’t just bring it in for the sake of having it there. Make sure that you’re… Whatever you’re collecting, whatever you’re gathering, you plan to use, that you have a reason for. This is partly for privacy. It’s partly to ensure that just in case something happens, there’s nothing that you shouldn’t have. There are certain laws, for example, around what types of data can be brought together. In a number of industries it’s illegal to look at certain factors, so for example, in education you’re not allowed to look at ethnicity. You’re not allowed to look at certain things that can be proxies for ethnicity. You have to be very cautious in what you gather to ensure that you’re meeting the compliance standards of the industry you’re working in. As well as making sure that the privacy of the people about whom your data is is maintained.

Marie: For sure and you’ve talked with me before about interesting ways that data can be extrapolated and so when it comes to education, even something like looking at zip codes could potentially be tied back to race. If you know what types of people live in certain ZIP codes and if you are trying to develop models based off of that, that could be problematic.

Lexy: And if you think about it that way, our prior episode regarding the census: census data collects a lot of information about ethnicity and it groups that geographically. So if you’re gathering someone’s ZIP code or address or block group and you’re then able to link it to census data which has ethnicity, you can look at over and under indexes. Ethnicity may not be specific to that person, but you can look at these over and under indexes and start having a dataset that, while it has more information potentially for you, you are not supposed to be using. So be very careful when you’re collecting these types of information to make sure that it meets your compliance for your industry. The other thing that you think about is bias in the dataset. We’ve talked about this a number of times with regard to statistical sampling biases, different ways that those can be introduced.

Lexy: Make sure you fully understand where your data’s coming from and if it is truly representative of the totality of the population you’re trying to model. As an example, and we’ve talked about opportunities where especially with visual learning, meaning AIs that are doing image processing, often it’s underrepresented minorities and that has become problematic in a lot of different instances including our Google Gorillas episode. Similarly, you can run into other problems in other areas that are not necessarily quite as obvious but can still have a very big impact. An example I can give you from a recent client was that we were looking at the prevalence of email addresses being present on customer accounts and capture rates around that. We wanted to give incentives to retail associates to collect email addresses, which is always a little bit of a dangerous game because I think about anticipating adversaries.

Lexy: People want to put some email address even if it’s not valid. One of the things that we ran into was that we were seeing that there was this disproportionate amount of email addresses associated with customers who had a return. Well, come to find out after a few weeks of looking at this data, their system required an email address to be present on a return for us to have any information about the return. There had to be something there. Similarly we started seeing capture rates being really low in certain geographic areas. Again, come to find out there was a law precluding customer service associates from asking for any customer identifying information in those geographies.
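One way to surface the kind of pattern described in this example is to compare email capture rates across groups. The following is a minimal sketch on made-up rows; the field names (`had_return`, `region`, `email`) are assumptions for illustration, not the client's actual schema.

```python
from collections import defaultdict


def capture_rate_by(rows, key):
    """Return {group: share of rows with a non-empty email} for the given key."""
    totals = defaultdict(int)
    captured = defaultdict(int)
    for row in rows:
        group = row[key]
        totals[group] += 1
        if row.get("email"):  # None and "" both count as not captured
            captured[group] += 1
    return {group: captured[group] / totals[group] for group in totals}


# Hypothetical customer records.
rows = [
    {"had_return": True, "region": "A", "email": "x@example.com"},
    {"had_return": True, "region": "B", "email": "y@example.com"},
    {"had_return": False, "region": "A", "email": "z@example.com"},
    {"had_return": False, "region": "A", "email": ""},
    {"had_return": False, "region": "B", "email": None},
    {"had_return": False, "region": "B", "email": ""},
]

by_return = capture_rate_by(rows, "had_return")
by_region = capture_rate_by(rows, "region")
print(by_return)
print(by_region)
```

A group whose capture rate sits at or near 100 percent, or far below the others, is a prompt to ask how the data was collected (a required field, a local law) before treating the difference as customer behavior.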

Marie: Very interesting. Going back to what you were saying about the process that was set up requiring an email address and really in that case forcing that field to be completed with what it sounds like higher than average irregular or non actionable information means that you then are pushing work to another group that needs to clean it up. Lexy is pointing at herself. I’m going through my mind because I know I’ve had multiple projects where I’ve had to clean up email addresses and databases. If you are collecting data, you want to do as much as you can to make sure that the process collects data so it’s accurate and then is also data that you can take action on and that it’s clean for sure. Yeah, that’s, that’s ultimately what we’re going for, but instead of just using the word clean, I like to use the accurate and actionable because that paints a clearer picture for people. Yeah.

Lexy: There was actually a great example out of Texas, I believe it was, where there appeared to be a disproportionate mortality rate among recent mothers. This was part of an enormous nationwide study of mortality and pregnancy and birth rates and so forth. What they actually came to find out was that it was a poorly designed user interface for the physicians or nurses to select in the field of whether someone had died when they were pregnant, recently after having given birth, and so forth. It was a dropdown selector. And in the dropdown selector, there was an option that said had not been pregnant in the last 12 months and directly below that was gave birth in the last four weeks or something like that. It was a misclick and there were so many misclicks that it caused what appeared to be a spike in maternal mortality in the state.

Marie: And on the flip side, there might be times where gathering the data through an electronic interface is preferred because you can control that experience more than if maybe you’re doing it via survey or you’re doing it by a panel. So, Lexy, I know you had some examples on this as well.

Lexy: Sure. The survey design and really any sort of first party research, market research, is a very big field of study. It’s a fascinating field. If you’re interested in this, I highly encourage you to look into it. How you phrase questions, how you frame questions, where you pose questions to a respondent can make a world of difference. So we’ve talked about some of the sampling biases early on. We had an episode on different statistical sampling biases that can be introduced when you ask for people to respond or when you require responses and so forth. In survey design and in survey processing, you often do things like remove straight line answers. So if somebody just answers 10 out of 10 all the way down, take them out. It’s probably not real. But you can also have things that are more nuanced in the way that you phrase a question or frame a question or what’s the sequence of questions, because you can lead a respondent, sort of in the way you would think of like leading a witness in a trial, where you ask a question and there’s a follow-on and another follow-on. The same thing can happen.
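The straight-line filter mentioned above can be sketched in a few lines. This is a simplified illustration, assuming each respondent's answers are a list of numeric ratings; the `min_items` threshold and the sample responses are hypothetical choices, and real survey cleaning usually combines this with other checks.

```python
def is_straight_line(answers, min_items=5):
    """True when a respondent gave the identical answer to every item.

    Very short answer lists are not flagged, since identical answers
    across only a few items are plausible for a genuine respondent.
    """
    return len(answers) >= min_items and len(set(answers)) == 1


# Hypothetical survey responses on a 1-10 scale.
responses = {
    "r1": [10, 10, 10, 10, 10, 10],  # straight line: removed
    "r2": [7, 9, 4, 8, 6, 10],
    "r3": [3, 3, 3, 3, 3, 2],        # nearly uniform, but not identical: kept
}

kept = {rid: a for rid, a in responses.items() if not is_straight_line(a)}
print(sorted(kept))
```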

Marie: Or even the way that you phrase the question or even the options that you give to somebody for their answers. So I know from some of my previous positions, as well as some of the clients that I’ve worked with who have developed either surveys or focus groups and then have come back to the marketing team and said, we asked people these questions. Here are the answers that we got. Let’s put together a report. And when we actually broke down all the information and we really said, can we say that this question tells us that our method is 60% more effective? It was very hard to do because of how the question was phrased. And so it’s also really helpful when you’re doing survey design and when you’re talking about the types of data that you want to collect to think about how you are going to be using it or talk with the data scientists that you are going to be collaborating with and ask them how they would want to incorporate this into their models so you can make sure that you’re asking a question that they’ll actually be able to utilize.

Lexy: There’s a ton that goes into that. It’s the phrasing of the question based on who the respondents are. So for example, some populations might view a given question differently than other populations, and so understanding how to phrase the question so that everyone responds to the same intended question, if that makes sense, is very tricky at times. Also making sure that you ask succinct questions, that you don’t have multi-part questions like do you believe in this or this? If you believe in one, does that count as yes or no? Like does that count as yes, but if I don’t believe in the other one, does it count as no? I don’t know. Those types of things. You want to be very clear in the question and the answer set aligning.

Marie: Exactly.

Lexy: We used to talk a lot about limiting open response questions. These days with the advent of natural language processing algorithms, it’s a little more feasible to use open responses and gather that text data, process it and be able to actually glean information from that. That used to be a very big problem and so often we would look at structuring data differently prior, but even then framing the question, phrasing the question specifically and being very accurate in how you’re determining what words to use and how that’s going to play into an analysis is a big deal when you think about how you’re collecting your data. As data scientists, we don’t often have the opportunity to align with that. Again, we’re often downstream, but it really is fundamental to understanding how your data came to be what it is. What you’re looking at came from some source, even sensor data: what it’s measuring, where it’s measuring, all of that. Even if you’re looking at a mechanical dataset, there’s still going to be understanding of what is the process by which it’s coming in. Is it being filtered? What happens if there are drops in connectivity? What happens if the sensor goes offline? Those types of things, again, can bias results, so understand where the data’s coming from. Be cautious in bringing more data than you need to bear on a problem and try to make sure that you’re as involved as you can be. Be as informed as you can be about how you’re getting your information.
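For the sensor questions raised above (drops in connectivity, a sensor going offline), one simple check is to look for gaps between consecutive readings. The sketch below assumes readings arrive at a known, regular interval; the function name `find_gaps`, the `tolerance` multiplier, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive readings are further
    apart than tolerance * expected_interval."""
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical readings, one per minute, with a dropout between minute 3 and 9.
start = datetime(2019, 1, 1, 0, 0)
minutes = [0, 1, 2, 3, 9, 10, 11]
readings = [start + timedelta(minutes=m) for m in minutes]

gaps = find_gaps(readings, timedelta(minutes=1))
for gap_start, gap_end in gaps:
    print(f"No readings between {gap_start} and {gap_end}")
```

Knowing where the gaps are lets you decide whether the missing periods differ systematically from the observed ones before modeling on what remains.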

Marie: Absolutely. That was our coverage of collect carefully. We appreciate everybody joining us for this episode of the Data Science Ethics Podcast. This has been Marie Weber and Lexy Kassan. Thanks so much.

Both: Catch you next time.

Lexy: Nice. Go us. We synced up on that.

We hope you’ve enjoyed listening to this episode of the Data Science Ethics Podcast. If you have, please like and subscribe via your favorite podcast app. Also, please consider supporting us for just $5 per month. You can help us deliver more and better content.

Join in the conversation at, or on Facebook and Twitter at @DSEthics where we’re discussing model behavior. See you next time.

This podcast is copyright Alexis Kassan. All rights reserved. Music for this podcast is by DJ Shahmoney. Find him on Soundcloud or YouTube as DJShahMoneyBeatz.