Citizenship, Privacy, and the 2020 Census

Image Copyright: iqoncept / 123rf

Episode 26 – Citizenship, Privacy, and the 2020 Census Show Notes

The US Census happens every ten years and provides the basis for democratic representation, federal fund distribution, and swaths of research. But this time, the census is poised to potentially degrade its own results. The proposed inclusion of a single question around citizenship on the census could lead to biased responses.

Furthermore, the Census Bureau has released an update to their privacy policy indicating their intention to use differential privacy to prevent cross-referencing census and other data sets to identify unique respondents. This could lead to less reproducible census data sets and more masked information for the thousands of researchers, companies, and data scientists who use this data.

In today’s episode, we discuss these two, linked issues around citizenship, privacy, and the 2020 census.

Additional Links on Citizenship, Privacy, and the 2020 Census

Census Bureau Quietly Seeking Immigrants’ Legal Status Article from The Hill detailing how the Department of Homeland Security would provide a cross-reference database for the Census Bureau to individually identify respondents

The 2020 census is in serious trouble Vox coverage on YouTube with a succinct take on why the citizenship question is so problematic

Supreme Court to Decide whether 2020 Census Will Include Citizenship Question NPR coverage of the lawsuit fighting the inclusion of the citizenship question on the census

Protecting the Confidentiality of America’s Statistics Announcement from the Census Bureau about the reasoning for and implementation of differential privacy

US Census Bureau Adopts Differential Privacy Announcement/Paper from John Abowd, one of the researchers into differential privacy and consultant to the Census Bureau

Can a Set of Equations Keep US Census Data Private ScienceMag article around the use of differential privacy including discussion of the database reconstruction theorem

Differential Privacy: What Is It? American Statistical Association article defining types of differential privacy and the benefits and drawbacks of using it

Episode Transcript


Welcome to the Data Science Ethics Podcast. My name is Lexy and I’m your host. This podcast is free and independent thanks to member contributions. You can help by signing up to support us at datascienceethics.com. For just $5 per month, you’ll get access to the members only podcast, Data Science Ethics in Pop Culture. At the $10 per month level, you will also be able to attend live chats and debates with Marie and I. Plus you’ll be helping us to deliver more and better content. Now on with the show.

Marie: Hello everyone and welcome to the Data Science Ethics Podcast. This is Marie Weber and Lexy Kassan, and today we are going to talk about the upcoming census in 2020 and some of the data science ethics issues surrounding this topic right now.

Lexy: Yes, very current, and actually in the next month or so we'll hopefully be getting some updates on some of these.

Marie: Yes, because this is an issue that's going to the Supreme Court in April, so that's why we decided it would be a good topic to bring up on the podcast. So Lexy, let's start with the bigger question: why is the census going to the Supreme Court, and what is causing it to be brought up in this case?

Lexy: The reason that there is an issue about the census and it's going to the Supreme Court is that there is a proposal to add a question to the 2020 census asking whether the person the census response is about is a citizen of the United States. This question has not been asked on a census in many years. They used to ask these types of questions, around whether someone was born in the United States, whether they were naturalized as a citizen, and so forth. However, it was removed from the census quite a while ago.

Marie: Yeah, it looks like, based on one of the articles that we're going to cite in the resources, that the last time that type of question was asked was almost 70 years ago, in the 1950 census.

Lexy: Right. At that time, immigration was still much more common, much more prevalent. There was a greater proportion of citizens of the United States who had immigrated. You figure that was not terribly long after World War II and so many people had recently come over, especially from central and eastern Europe as a result of that and so it was very timely at that point.

Lexy: Since then, it had been removed from the census. One of the reasons that it is being brought to the Supreme Court is that the inclusion of this question was challenged due to the possible repercussions of asking it. More often than not, what people are really concerned about here is that if, for example, ICE were to come after people because they did or did not answer the citizenship question, and were able through other data sets to uniquely identify those people, they could potentially persecute individuals. So that's certainly a concern.

Lexy: Even beyond that, there are obviously concerns more broadly about other organizations that might take action based on being able to uniquely identify a person with all of the information that's collected. It is sensitive information that's on the census. There's a concern that people would self-select to not take the census, that those who are not US citizens might elect to not respond because they don't want to be singled out. They don't want to be identified as not being a citizen. That might cause even further impacts on the availability of federal funding and on the way lines are drawn for representation and for Electoral College votes, because all of those things are based on the population that's in a given area. And so the inclusion of this question has become a really big concern. That's what's going to the Supreme Court in April.

Marie: The thing to keep in mind is that the census is only done once every 10 years. So when the 2020 census happens, if there's depressed census participation because of this question, that has impacts on how things are allocated over the next 10 years. And just as you were saying, it impacts how funding is going to be allocated. For example, when they're looking at the distribution of an estimated $880 billion in federal tax dollars to state and local communities, for things like Medicare, schools, and other public services, that's really important information for deciding how that funding actually gets distributed.

Lexy: That’s $880 billion per year. Multiply that by 10 yeah, or the fact that you’re going to be using this data for about a decade. Beyond that though, census data is used across many disciplines in research and in our understanding of populations and movement and so forth, so anything that impacts the availability of accurate census data has much broader implications. Even then the very substantial implications it has for public policy and funding and so forth across a host of other areas.

Marie: When it comes to the census, there's probably no other set of data that has more context related to it.

Lexy: Census data has been largely publicly available. In order to access this data, you need only go to the internet and you can download swathes of data about the census. That however may start to shift.

Marie: Lexy, can you go into a little bit more about how that could be shifting and some of the things that the people at the census are looking at implementing to protect privacy more going into 2020?

Lexy: Yeah, there's been an update to the privacy policy for the US census starting in 2020, where they want to implement something called differential privacy. Differential privacy tries to limit the amount of information that could be cross-referenced to uniquely identify someone, in order to provide more privacy. But there's a trade-off to that. When you do anything that perturbs the data, that essentially masks the data in some way, there's a decrease in the usability of that data somewhere. What differential privacy attempts to do is quantify that tradeoff and provide an essentially scalable means of masking, of perturbing, the data, so that you can more finely gauge what data is being masked, how much, and so forth. In the census, one of the concerns is that you could get down to what's called a block group level, which is usually several hundred households, and that's how most data in the census is reported.

Lexy: However, if you're able to cross-reference the data from a block group that says three households meet this particular statistic, you have a head of household between the ages of 40 and 44 or something like that, and then you can also cross-reference it with credit bureau data, and you can cross-reference it with user data, and so forth, you could get to the point where you could pinpoint a specific census respondent based on other pieces of information. From a personally identifiable information standpoint, we used to think about triangulation of data: usually it would take three points of information to uniquely identify someone. That is starting to get a little more nebulous because of the amount of data that's available and the amount of computational power that's available. Because you can deal with such large data sets at scale and bring more data to bear, it's much more feasible to actually uniquely identify someone, even if the data has been anonymized as best you could in a single data set.
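The cross-referencing Lexy describes can be sketched as a toy record-linkage attack. The data and attribute names below are entirely hypothetical; the point is only that each additional quasi-identifier an attacker can join against shrinks the pool of matching respondents until a single record remains.

```python
# Hypothetical published records: block group, age band of the head of
# household, and household size act as quasi-identifiers.
records = [
    {"block_group": "A", "age_band": "40-44", "household_size": 3},
    {"block_group": "A", "age_band": "40-44", "household_size": 5},
    {"block_group": "A", "age_band": "25-29", "household_size": 3},
]

def matching(quasi_ids):
    """Return the records consistent with everything an attacker knows."""
    return [r for r in records if all(r[k] == v for k, v in quasi_ids.items())]

# Each cross-referenced attribute narrows the candidate pool:
# 3 candidates -> 2 candidates -> a unique, re-identified respondent.
pool = matching({"block_group": "A"})
pool = matching({"block_group": "A", "age_band": "40-44"})
pool = matching({"block_group": "A", "age_band": "40-44",
                 "household_size": 3})
```

With real census tables the joins come from outside data sets (credit bureau files, commercial marketing data), but the narrowing logic is the same.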

Marie: One of the articles that we'll link to is from ScienceMag.org, and they talk about how this problem has moved from something that was a concern to something that is an issue, which is part of why they have decided to take action to implement more stringent processes to protect data. That's something we're really glad to see the Census Bureau thinking about and taking action toward.

Lexy: Yeah, I love the distinction, for what it's worth, of something being an issue versus a concern, that there's some sort of hierarchy: there's a concern, and then an issue, and then maybe a problem, and maybe a crisis. I don't know. I want to know what the whole list is. Yes, I want the full taxonomy of problematic things.

Marie: But right now it's at an issue, so we know that level and we're addressing it. While addressing this issue by taking action toward protecting privacy, on the flip side there is a trade-off, as you mentioned, and that would be having data that's not as transparent. That goes back to something we've talked about before, which is training transparently. Especially with something like census data, which so many things rely on, and which so many different groups use, whether for deciding how funding happens, or, as we were talking about, deciding what type of marketing to deploy to a certain area, or even research that's happening at a graduate level, there is concern that this new approach to privacy with the census data could have implications for how the data can be used.

Lexy: Yes. The other thing that we've seen in some of these articles, and it's something that is of concern, especially around transparency, is that the Census Bureau has specifically stated that they don't want to release the algorithms that they're using to create differential privacy on their data sets, because they don't want the possibility of someone reverse-engineering their process and being able to get back to the raw data. We talk about training transparently and providing an understanding of all the steps. Without that, researchers may not trust the data that's coming from the Census Bureau. If that's the case, it may have implications for other researchers that rely on that data to carry out additional studies. If they don't trust that data, how much more do they have to go after in their own research to establish some of those fundamental pieces of information that they would otherwise have been able to rely on the census to give them? There are countless ways that data could potentially be manipulated if you don't know what the algorithms are. If you feel like you can't trust the data, and you don't know what the algorithms were that influenced the data, then how do you know that there's anything you can rely on?

Marie: Or even when we were discussing how differential privacy could be implemented, it sounds like differential privacy is almost like when you use certain types of authenticators and they give you a unique code each time. With differential privacy, to make sure that it is protecting privacy, each time you access the data set, it's slightly different. And so that means that if I go to the census and request the data so I can do my analysis, it's going to be slightly different from yours, Lexy, when you take your turn to access the data and do your analysis. And then if we're trying to follow good scientific method and share data with each other to see if we can duplicate each other's results, if we're working from slightly different data sets, that could really hinder that process.

Lexy: Differential privacy injects noise into a dataset, and that noise can potentially skew the data in different ways. One other article that I was looking at, from the American Statistical Association, on differential privacy was saying that it can introduce biases, and that because the Census Bureau is not releasing the algorithms that are being used, you don't know what needs to be done to eliminate those biases. It's going to be very difficult to use the data, at least in a statistically valid way.
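The noise injection Lexy mentions can be illustrated with the classic Laplace mechanism, one standard way of achieving differential privacy (the Census Bureau's actual algorithms are not public, so this is only a generic sketch, not their method). For a counting query, one person joining or leaving the data changes the count by at most 1, so adding Laplace noise with scale 1/epsilon gives epsilon-differential privacy:

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, rng=random):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (one person changes the count
    by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Smaller epsilon means more noise: stronger privacy, less accuracy.
noisy_strict = private_count(100, epsilon=0.1)   # typically far from 100
noisy_loose = private_count(100, epsilon=10.0)   # typically close to 100
```

The parameter epsilon is exactly the quantified trade-off discussed above: every released statistic spends some of a finite privacy budget, and two analysts querying the same underlying data will see different noisy answers, which is what raises the reproducibility worry Marie describes.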

Marie: There could also potentially be an argument made that because census data is something that we use across these historic periods of time, this change to how the privacy of the census data is managed could also impact how you can use it when comparing back to older historical census data sets. So there's an implication for how people would be able to use census data moving forward. And one of the articles that we were looking at talked about how there's going to need to be a lot of communication to anybody that's going to use the data sets moving forward, so they understand these new privacy procedures and how they impact the data.

Lexy: A lot of education would be needed, not only around what was done, but in how to use the data and what types of conclusions could be drawn from it. The Census Bureau doesn't do a lot of educational outreach at this point. They sort of put the data out, for the most part, and it just sort of sits on its own. You can always go back and look at the questions that were asked and so forth.

Marie: But

Lexy: generally speaking, if you’re a researcher or an analyst or data scientist, you’re not looking at all of that. You’re just looking at the responses. In this case, it would be a burden to anyone using the data to have to go through some sort of an educational process and a validation that they understand the data, the collection method, the [inaudible] method, all of it, so that you can then come to some reasonable conclusions based on what you find in that data.

Marie: Absolutely. So thank you, everybody, for joining us for this episode of the Data Science Ethics Podcast, where we talked about the upcoming 2020 census and some of the ethical issues surrounding it right now.

Lexy: I look forward to reporting on what happens from the Supreme Court in a few weeks.

Marie: Yes, exactly. So this is Lexy and Marie. Catch you next time. Bye.

We hope you’ve enjoyed listening to this episode of the Data Science Ethics podcast. If you have, please like and subscribe via your favorite podcast App. Also, please consider supporting us for just $5 per month. You can help us deliver more and better content.

Join in the conversation at datascienceethics.com, or on Facebook and Twitter at @DSEthics where we’re discussing model behavior. See you next time.

This podcast is copyright Alexis Kassan. All rights reserved. Music for this podcast is by DJ Shahmoney. Find him on Soundcloud or YouTube as DJShahMoneyBeatz.