Episode 22: Protect Privacy – Show Notes
IT is not the only group responsible for protecting the privacy of data. Data scientists share this burden as they search for, collect, store, use, and share vast amounts of information.
In this episode, we explore what data scientists and non-practitioners should do to help protect privacy.
Additional Links on Protecting Privacy
https://www.youtube.com/watch?v=xx1AUupLn2w
https://www.engadget.com/2018/11/08/gdpr-data-brokers-complaints/
Episode Transcript
Caveat: This transcript is largely unedited and may not match the audio as well as usual.
Lexy: Welcome to the Data Science Ethics Podcast. My name is Lexy and I’m your host. This podcast is free and independent thanks to member contributions. You can help by signing up to support us at datascienceethics.com. For just $5 per month, you’ll get access to the members-only podcast, Data Science Ethics in Pop Culture. At the $10 per month level, you will also be able to attend live chats and debates with Marie and me. Plus, you’ll be helping us to deliver more and better content. Now on with the show.
Marie: Hello everybody and welcome to the Data Science Ethics Podcast. This is Marie Weber and Lexy Kassan, and today we are going to talk about protecting privacy. This is a big one. Most people are probably aware of it; there are privacy policies on nearly every website that you go to or app that you download for your phone. But this is going to be more in the context of, as you are working with data, how do you make sure that you are protecting the data that you’re working with, so that you’re protecting the privacy of your customers? So Lexy, as a data scientist, how does that work into your method?
Lexy: All over the place. There are a lot of places where data scientists have an opportunity to gather data, use it, and potentially share their findings and the data sets that they’re using, or they have access to additional data sets, and there’s really a burden of responsibility on data scientists to help ensure that they’re protecting the privacy of the people and subjects of that data. Often when we think about privacy policies and privacy in general, it falls squarely in the IT field. You think, oh, well, the people who are server administrators or the people who are making sure that my network is secure are the ones responsible for protecting privacy; as long as I don’t take the data out of the building and I don’t share it, I’m protecting privacy. That’s not necessarily the case. As a data scientist, or really any data practitioner, if you’re working with a dataset, there are some basic things you need to do to try to protect privacy. The first one is obviously: don’t put data on the Internet.
Marie: Exactly. Yeah. Once it’s on the Internet, it’s always on the Internet.
Lexy: The next is: don’t email data sets if at all possible, because emails can be intercepted, they can be accidentally sent to the wrong recipient, all kinds of things can happen. That person can then forward the information on to someone else, and then it’s not really controlled. So don’t email data, especially personally identifying data. Only share the raw data itself with people who have a need to know it and see it, such as colleagues working on the same problem or people within your organization. Don’t send it out to everybody, all and sundry, and say, look at the data set I’m working with. The other part that I think is important is to work with your IT team to flag privacy issues when you see them. If you come to them and say, hey, I see an issue here,
Lexy: we really ought to address this, they have more trust in you as a data scientist, because they know that you have the best interests of the client and the company at heart, and you know that IT is someone you can rely on to help you ensure that you’re protecting the privacy of the people that you’re studying. I think another thing that you can do is not only work with them but also have regular check-ins where you talk about what your best practices are and how you’re evolving them, because this is an evolving field. Absolutely. In data science, we often gather additional information from external sources. If you gather additional information, make sure that you talk with IT about how you need to store that data, where you should be putting it, and how it needs to be secured, because they likely have policies that you will have to follow as to how to handle that information.
Lexy: Even if you try to mask that data, or try to ensure that you’re not showing any personally identifying information (PII) or other fields that would directly identify someone, it doesn’t necessarily mean that you’re protecting the privacy of that person or the data itself. There have been examples of this where data sets were published and used for a larger project. For example, Kaggle publishes data sets for people to participate in its competitions, as does GitHub. There are tons of data sets out there. People are publishing data to portals like data.world, where there’s a lot of information accessible and people can go and download it. So if you’re a data scientist looking for a new dataset, you might grab another dataset with information pertinent to the problem you’re working on from a source like these. This is important to consider because it goes back to getting the data that might be needed to train a model. It becomes part of the data set you’re working with, and the more that you describe a given observation, the more likely it is that you’ve gathered enough information to be personally identifying, even if there’s no name or email address attached. This actually happened with a Kaggle study, and I think we’ll probably end up doing a quick episode on it, where somebody had gathered enough information, and because it was a public data set on Kaggle, when they appended additional information, it caused a really big problem.
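To make that concrete: one rough way to gauge re-identification risk is to check the size of the smallest group of records sharing the same combination of quasi-identifiers (the “k” in k-anonymity). The sketch below assumes a pandas DataFrame with hypothetical columns (zip_code, birth_year, gender); it illustrates the idea rather than reproducing any code from the episode.

```python
# Minimal sketch: how re-identifiability grows as columns accumulate.
# The columns (zip_code, birth_year, gender) are hypothetical examples
# of quasi-identifiers; substitute your own dataset's fields.
import pandas as pd

def smallest_group_size(df, quasi_identifiers):
    """Size of the smallest group of records sharing one combination
    of quasi-identifier values (the 'k' in k-anonymity)."""
    return int(df.groupby(quasi_identifiers).size().min())

df = pd.DataFrame({
    "zip_code":   ["80202", "80202", "80203", "80203"],
    "birth_year": [1980, 1980, 1975, 1990],
    "gender":     ["F", "F", "M", "F"],
})

# Each added column shrinks the groups; any group of size 1 is a
# uniquely identifiable person, even with no name or email attached.
for cols in (["zip_code"],
             ["zip_code", "birth_year"],
             ["zip_code", "birth_year", "gender"]):
    print(cols, "-> smallest group size:", smallest_group_size(df, cols))
```

With zip code alone, every record in this toy example shares its group with someone else; adding birth year already makes some records unique, which is exactly the appending problem described above.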
Marie: And this subject also relates to a couple of other things. In terms of who you give access to your data, that also plays into anticipating adversaries: you want to make sure that you’re not giving access to the data to people who don’t need it, and you also want to make sure, again, as you said, that you’re working with your IT team to keep it secure, so that nobody who isn’t supposed to can get to it. We have another episode about that. And then even the type of data that you do collect is another thing to consider.
Lexy: Yeah, absolutely. We’ll cover that in another episode, hopefully soon, on carefully collecting the data that you’re going to need
Marie: and going to use. And then even that can extend to how you archive information that you’re no longer using.
Lexy: Absolutely. Several years ago now, I would say back in the day, but it’s not all that long ago, we used to archive things to tape, and those tapes were held in a secure facility, essentially. All of that data would be put onto physical media and then put into something like a vault. Now, more often than not, you’re putting it in a cold-storage kind of cloud environment or another server if you’re no longer using it. If there are older data sets, they get pushed out into some other area that is not as accessible. You may only be able to access it, for example, from an internal system, with no VPNs or private network access, or it has to be specifically remounted into an area where you can see it if you need it, which means a request. It really depends on the industry you’re working in, but it makes it more difficult for older data to be breached.
Marie: So Lexy, can you think of any examples that you’ve seen in terms of how protecting privacy has impacted a project, or has impacted considerations on some data science work that you were doing?
Lexy: Yeah, there are a lot of them, and in my field many of them have to do with linking data together. The biggest one is trying to find a unique person. When you’re trying to identify behaviors from a given person, you need to be able to look at records from potentially disparate areas and link them all together. In order to protect privacy, sometimes we remove some of that identifying information, and then we still need to be able to link those pieces of data, and sometimes that’s very difficult. That’s referred to as anonymizing data. Correct. So we would take out the name, or the address information, or something like that which would be more immediately identifiable as “this is this individual,” as opposed to “this is record number 45678.” However, when you try to link from different systems, the ID numbers aren’t the same.
Lexy: So we need that information to be able to link to the same person. In companies where there’s not a good system for that, or where they don’t have a common identifier that’s already been assigned, it makes data science very difficult. That’s one impact of protecting privacy, but it’s one worth having, even if it becomes a struggle down the road, because it’s better not to open yourself up to a breach than to have a couple more pieces of information. I can tell you a couple of stories from my own career where I’ve seen issues with privacy. One of them was when I was working on a bank dataset. The dataset itself was a list of accounts that included a field for the taxpayer ID. As an analyst, I don’t need that taxpayer ID, so long as I have something that uniquely ties together multiple accounts in the same way the tax ID would.
Lexy: So, for example, instead of having a social security number, you could create a hash value, and I could look at that. I don’t actually need to see their tax ID in the clear. Well, at one point that tax ID was shown in the clear, and I immediately raised a flag to my IT department and said, you need to make sure that gets hashed or obfuscated in some way so that I and my team don’t see it. While it wasn’t a problem yet, and we obviously hadn’t shared it anywhere, we didn’t want any of that risk, especially with a bank wanting to make sure that the privacy of the people represented in that dataset was protected. Another one came up more recently when I was working with my IT team. Again, this is a very important union between analytics and IT, of tremendous importance. They had identified that certain user accounts for a client that we were working on were breached, meaning that somebody had found a password.
Lexy: There were some very insecure passwords. Someone had gotten access to the account, and this hacker had essentially tried every IP address in a very wide range to see what they could ping from that account. One of the things that they pinged was an environment that contained a very large dataset; in fact, the entire Experian dataset, which has names, addresses, employment statuses, household income, all the census data, a tremendous amount of data. In fact, it was the same kind of set that the Experian breach from, I think it was 2017, actually came from; it was a database set up in much the same way. Now, they didn’t get access to that data, because there were additional layers of protection in place, but it was a very scary moment to think that someone may have gotten access to that environment and potentially to that set of data. We did a lot of due diligence to ensure that the data was still private and still protected, and thankfully it was. As a data science practitioner, you have to be cognizant of where your data is, what’s represented in it, and know whom to contact just in case.
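As a rough sketch of the hashing approach Lexy describes in the bank story (an illustration, not the bank’s actual system): a keyed hash (HMAC) is preferable here to a plain hash, because nine-digit tax IDs are small enough to brute-force, and the key can stay with IT rather than with the analysts.

```python
# Minimal sketch: replacing a taxpayer ID with a keyed hash so records
# can still be linked across systems without the ID appearing in the
# clear. A plain SHA-256 of a 9-digit number is brute-forceable, so an
# HMAC with a secret key (held by IT, not the analysts) is used instead.
import hashlib
import hmac

SECRET_KEY = b"example-key-held-by-it"  # hypothetical; load from a vault in practice

def pseudonymize(tax_id: str) -> str:
    """Deterministic keyed hash: the same input always yields the same
    token, so the token can join accounts, but the raw ID stays hidden."""
    return hmac.new(SECRET_KEY, tax_id.encode(), hashlib.sha256).hexdigest()

# Two accounts with the same taxpayer ID produce the same token and can
# be linked, without any analyst ever seeing the underlying number.
print(pseudonymize("123-45-6789") == pseudonymize("123-45-6789"))  # True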
Marie: Absolutely. And again, that goes back to anticipating adversaries as well, making sure that you are doing things like protecting passwords. And I know one of the things that you also wanted to mention in this episode was what people can do on their own to protect their privacy.
Lexy: Definitely. Even if you’re not a data science practitioner and you’re concerned about your data privacy, there are some basic things you can do. I know this sounds like a trope already, but use longer, stronger passwords; that is the top one. Make sure that they’re unique across websites. Make sure that any information you provide, you’re doing so knowingly; don’t just send your information out without knowing where it’s going to go. Read those privacy policies and GDPR statements, and understand where your data is going and how it’s being used, because that ends up being a much bigger part of the frustration when something does happen. People don’t necessarily realize that their data was included because they had taken a survey on Facebook, or had visited a website and logged in using a shared authentication method, something along those lines, and then their data is out there. So be aware, be cognizant, and if you are concerned about your privacy and your data’s privacy, search for your own data, sign up for some of the services that search the dark web, and use a password generator that can provide you with unique passwords. All of those types of things can help you protect your own privacy.
Marie: Or the services that help you manage passwords across different sites.
Lexy: Absolutely, and stronger passwords. Correct. Most of those come with a password generator, so things like Dashlane, KeePass, or LastPass. Yeah, exactly.
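For the generator piece, here is a minimal sketch using only Python’s standard library; it illustrates the idea rather than any particular password manager’s implementation.

```python
# Minimal sketch: a cryptographically secure password generator using
# only Python's standard library (the 'secrets' module).
import secrets
import string

def generate_password(length: int = 20) -> str:
    """Return a random password drawn from letters, digits, and symbols."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(generate_password())  # a different strong password on every call
```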
Marie: And when you are looking at those privacy policies, another thing you can look for is that they should mention any third-party vendors that site works with, so that you can be more informed about how they might be using your data and which third-party providers they might be working with. A lot of websites will have a CRM provider, which is going to be where they store some of this personal information, and then if you hear something about a particular company potentially having a breach, you’re more informed about where your data might be. Yup. That actually reminds me: if there’s a checkbox for third-party marketing materials or something like that, uncheck the box. Make sure that they’re not going to send your information to whoever is paying top dollar for it at that moment.
Marie: Yup. And on the flip side of that, there are going to be more companies asking you to opt into their own marketing material, so if there are companies that you are interested in working with, give them permission to email you so you can stay on top of their communications. Speaking as somebody who is a marketer, we are implementing that a lot for our programs. A lot of people already use double opt-in methods, so if you’re going to receive, for example, a newsletter, you’ll sign up for the newsletter, then you’ll get an email asking you to confirm that you actually want to receive it. Moving forward, you’re going to see that type of user experience for even more things, as GDPR has slowly been implemented here by a lot of companies, and California is also talking about implementing that type of regulation, so you’ll see it more and more. There has been a very large uptick in interest in GDPR-like regulations, even in the United States and elsewhere, to try to mirror some of what the European Union has done. I would anticipate that more of those types of regulations will start coming in. Absolutely. So that is a little bit of information about protecting privacy. Thank you so much for joining us for this episode of the Data Science Ethics Podcast. We’ll see you next time. Bye.
We hope you’ve enjoyed listening to this episode of the Data Science Ethics podcast. If you have, please like and subscribe via your favorite podcast app. Also, please consider supporting us for just $5 per month. You can help us deliver more and better content.
Join in the conversation at datascienceethics.com, or on Facebook and Twitter at @DSEthics where we’re discussing model behavior. See you next time.
This podcast is copyright Alexis Kassan. All rights reserved. Music for this podcast is by DJ Shahmoney. Find him on Soundcloud or YouTube as DJShahMoneyBeatz.