Episode 18: Train Transparently – Show Notes
As algorithms are created and unleashed upon the world, it is crucial to understand not only what they are but how they came to be. The best way to accomplish this before chaos is wreaked is to train transparently – meaning to let people know what is going on while it is happening.
Training transparently has multiple aspects. It means indicating what data was used or not used. It means specifying which algorithms were evaluated and dismissed or selected. It also means providing detailed justification for each decision so that others can critique the work.
This process, while onerous, allows for an environment of collaboration – one that can foster the sharing of ideas and the broadening of horizons. It can help identify areas of potential bias or inappropriate impact before any harm comes from a skewed analysis. It can bring new techniques from outside resources that can enhance results.
Above all, training transparently allows visibility into an otherwise-esoteric field, granting some measure of comfort to those using the model or on whom the model is used.
Train Transparently Episode Transcript
Lexy: Welcome to the Data Science Ethics Podcast. My name is Lexy and I’m your host. This podcast is free and independent thanks to member contributions. You can help by signing up to support us at datascienceethics.com. For just $5 per month, you’ll get access to the members-only podcast, Data Science Ethics in Pop Culture. At the $10 per month level, you will also be able to attend live chats and debates with Marie and me. Plus, you’ll be helping us to deliver more and better content. Now on with the show.
Marie: Hello everybody and welcome to the Data Science Ethics Podcast. This is Marie Weber and I am joined by Lexy Kassan and today we are going to be talking about the concept of making sure that you train transparently whenever you’re setting up your data science models and doing your data science process.
Marie: Lexy, when you’re working on putting together a model – and we’ve talked a little bit about the data science process – one of the things I know you’ve talked about before is thinking about where you get your data. What data you use. So how does that relate to the concept of train transparently?
Lexy: There are two aspects to that. I would think one is how you frame the problem and therefore what you gather for the data. The other is whose data are you using and do they know.
Lexy: One is, generally speaking as data scientists, making sure that the companies we’re advising know what we’re doing. And then, secondly, having some means of identifying to the people who are affected by an algorithm that there is an algorithm at play and that their data is being used in those algorithms. One of the concepts that I think about when it comes to training transparently is really documentation, and it’s almost a dirty word in the industry because, as you’re trying to iterate, innovate, and be agile and fast to get to the great algorithm that’s going to solve your problem, documentation is just a time suck. It takes forever. It’s a lot of effort. It’s not the fun part of data science, but it’s absolutely crucial, partly because it gives you something you can look back on so that you don’t have to try to remember everything you’ve done, but also because it can be the foundation of a peer review, where either another data scientist or a community of data scientists could look at what you’ve done, and potentially even at the data, and determine whether your process was reasonable.
Marie: And I think that documentation is also helpful from the standpoint of doing not just good work yourself, but also making sure that anybody else who gets onboarded onto the project can understand what was done. Or say you’re working with a client in more of an agency-client relationship, and they end up moving to another agency in the future. They could still use that documentation to explain to another vendor how the model was set up, what it’s doing, and how it’s operating. That way, depending on the agreement and how they can use that algorithm going forward, it’s something they can continue to get business value out of.
Lexy: That, and I also think about it from the perspective of peer review. In most traditional sciences, if you submit a study to a journal for review or publication, it is sent out, the data is made available, the methodology is made available, and it is scrutinized by others in the field to ensure that what you’ve done is reasonable and that the conclusions you’ve reached are reasonable based on what you had.
Marie: And when you talk about the idea of things being peer reviewed in science, a lot of the time the gold standard is: can somebody replicate the same results you got in your study in their study? From a data science perspective, you could apply that same rigor: if somebody uses the same data set and starts with the same set of inputs, do they get the same type of outputs?
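The replication standard Marie describes – same data, same inputs, same outputs – often comes down to controlling randomness in training. Here is a minimal, hedged sketch of the idea using a toy model; the "training" routine, seed value, and data are invented for illustration, not taken from the episode:

```python
import random

def train_model(data, seed=42):
    """Toy 'training': fit a decision threshold by random search.

    Documenting and fixing the seed means a reviewer who runs the
    same code on the same data gets the exact same model.
    """
    rng = random.Random(seed)  # seeded RNG -> reproducible runs
    best_threshold, best_score = None, -1
    for _ in range(100):
        t = rng.uniform(0.0, 1.0)
        # Score: how many labels does threshold t classify correctly?
        score = sum(1 for x, label in data if (x > t) == label)
        if score > best_score:
            best_threshold, best_score = t, score
    return best_threshold

# Two independent runs with identical data and seed must agree.
data = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
run_a = train_model(data, seed=42)
run_b = train_model(data, seed=42)
assert run_a == run_b  # identical inputs -> identical outputs
```

The point is not the toy model itself but the discipline: recording seeds, data versions, and parameters in the documentation is what makes the "replicate my results" check possible at all.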
Lexy: With data science, often what you share is not just the methodology. There’s a lot of sharing of code; that’s how transfer learning starts to take place. Usually there’s some sort of API you can use, but you can also share the code directly: here’s the code I used. At that point, you can have someone scrutinize all the steps you’ve gone through, everything you did to that data set, at least from the starting point of that data set. Now, there’s a whole other set of processing that may have gone on prior to that data set.
Lexy: In data science, we talk about the fact that 70 or 80 percent of your time is spent preparing data: cleaning it up, processing it, bringing data sources together. Only 20 percent of your time is spent modeling the data, creating the algorithms, and actually interpreting the results. That 80 percent needs to be documented too, and it needs to be transparent as to what you included and what you did not include. If there were sources that were potentially available, or that you looked for and couldn’t find, or that you thought about bringing in but excluded for whatever reason, what were they? Those decisions are just as important, and they have to be transparent as well.
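One lightweight way to make that 80 percent transparent is to record every include/exclude decision in a structured log that lives alongside the pipeline. This is a sketch of the idea only – the class names, source names, and reasons below are invented for illustration, not a tool mentioned in the episode:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceDecision:
    name: str       # data source that was considered
    included: bool  # was it actually used?
    reason: str     # justification a peer reviewer can critique

@dataclass
class DataPrepLog:
    decisions: List[SourceDecision] = field(default_factory=list)

    def record(self, name: str, included: bool, reason: str) -> None:
        self.decisions.append(SourceDecision(name, included, reason))

    def report(self) -> str:
        """Render the decisions as a human-readable audit trail."""
        lines = []
        for d in self.decisions:
            status = "INCLUDED" if d.included else "EXCLUDED"
            lines.append(f"{status}: {d.name} -- {d.reason}")
        return "\n".join(lines)

log = DataPrepLog()
log.record("2023 customer transactions", True,
           "primary modeling data, complete for the target period")
log.record("third-party credit scores", False,
           "licensing unresolved and potential bias concerns")
print(log.report())
```

A log like this captures exactly what Lexy calls for: not only what went in, but what was considered and rejected, and why, so a reviewer (or a compliance team) can challenge each decision later.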
Marie: And that type of documentation could also help people understand the limitations you were up against and why you made those decisions, or it could help people in the peer review process point out potential biases that maybe you didn’t recognize as you were going through the process.
Lexy: Yeah, that’s very valid, and there’s a tremendous opportunity for everyone to bring in their own biases. So if you say, "I thought about bringing in this data set, but I opted not to because I didn’t think it was credible or pertinent or what have you," someone else may say, "No, I think that is pertinent, and you really ought to consider it." Those are absolutely things that can happen. Depending on the context of what you’re building, you may or may not bring this to a public community of data scientists. But at least have someone peer review what you’ve done, even if it’s within your own organization, even if it’s the business stakeholder – the person who asked you for the analysis you’ve done or the algorithm you’ve developed. Ensuring that they understand what has gone on with that data is a crucial step, because they’re the ones who are going to use that information. They’re going to use your algorithm in some way, shape, or form. They need to know what happened in it so that they can say, "Yes, I can use this," or, "No, I can’t use this. Let’s revisit."
Marie: Or, in that type of situation, the legal team or the compliance team being able to review it and understand what steps were taken.
Lexy: We’ve talked a little bit about some industries that are more heavily regulated like health care or finance, where compliance teams are a very real and very strict group. Where you have to document absolutely everything and all of your decisions have to be justified and in compliance with whatever set of regulations you are under.
Lexy: The specter of a compliance officer is often helpful, I will say. To think that someone is going to look at this, someone is going to review this: I had better be able to justify every single decision along the way.
Lexy: The other part that I think is really important to point out, beyond the built-in biases we just discussed that people can flag, is the usage of the model. It’s one thing to say the data you used was or wasn’t skewed in some way, or you did or didn’t include a certain source, or you’ve over- or underrepresented segments of the population.
Lexy: It’s another to say you represented them properly, but the impacts based on the intended usage of this model are somehow unfair. They’re going to abnormally penalize one group over another; they’re going to have some sort of disproportionate effect on one segment of the population. The model itself may not be doing that, but the usage of that model might. This could be a case where you’ve used some piece of information that seemed on its face to be reasonable, that seemed like a fair way of estimating something, but for one reason or another it wound up disproportionately affecting one group or another based on how that model then got used. We’ll talk through some more of these examples in upcoming episodes.
Marie: Absolutely. And when you think about training transparently, how does that relate back to one of the steps we talked about in the data science process in terms of the care and feeding of your model?
Lexy: In the care and feeding process – let’s say you did this algorithm development a year ago and you’re trying to revisit it, or, as you mentioned, you came in new to a group and you’re now in charge of retraining a model or reevaluating the performance of an algorithm. How do you know how it came to be? How do you know the constraints under which the creator of that algorithm was operating? Like you mentioned, there’s a need for that type of information to be available so that it’s transferable rather than just remembered. We forget things. We go through a lot of projects in a year. Even if you’re just one person in a larger organization, and even if that’s your entire role, chances are you’re not going to remember exactly how you came to that conclusion. If you went through multiple iterations, which is very common – we talked about this as part of the data science process – maybe you tested something and it didn’t work, but now you don’t even remember that you tested it. Documentation becomes a crucial part of revisiting your algorithms over time to ensure that you’re not repeating yourself, that you can still justify the decisions that were made, and that you can move forward enhancing a model in ways that are meaningful.
Marie: There can be things that change over a year. There could be new compliance rules that come out. There could be new types of data sets that you get access to. There could be a different scale in terms of what the business is taking on, or new lines of business that affect your model. All of those things are considerations as you think about not just the model you’re building today, but how that model could exist in the future. That documentation can help you say, okay, this is how the model was built, which lets you move forward with that model much more easily than trying to reverse engineer what happened to figure out whether you can still use it.
Marie: And I think that brings up another point about training transparently: you want to be able to describe how a model is coming to its conclusions. Sometimes, with the documentation, that’s pretty straightforward. But with some of the more advanced techniques you can use in data science, sometimes that’s a little less clear. So being able to explain what your model is doing and why can be really important in this transparency conversation, and you need to do that as well as possible depending on what methods you’re using.
Lexy: Yeah. We talked in a prior episode about AIs that are trying to explain themselves – the fact that there are deep learning techniques out there which human beings often can’t fully explain. We can sort of see the outcomes, and we can see differences among the outcomes and so forth, but we don’t necessarily know all the ways the model reached the conclusion it reached. There are some models that try to explain their methodology along the way. Those types of considerations, though, often lead to choosing simpler, more explainable algorithms over more complex ones. In compliance, they want to understand exactly what went in, exactly what came out, and what happened in between. They want to know all the steps. However, as we get further into deep learning and people try to use these more advanced, more esoteric techniques in their modeling to get more accurate results, there’s a tradeoff. That tradeoff has to be evaluated very carefully, and even the tradeoff itself has to be documented. You have to be transparent about the fact that you are using a more complex, more black-box algorithm because the benefits of doing so outweigh the risks. That has to be part of the conversation when you’re looking at which algorithms you’re going to use. If I’m trying to build an algorithm that’s looking for a more complex outcome, most likely I need to use a more complex algorithm.
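The explainability tradeoff Lexy describes can be made concrete with a simple linear model: every term of its prediction can be listed and audited, which is exactly what a compliance reviewer asks for and exactly what a deep black-box model can’t easily provide. This is an illustrative sketch only; the feature names, weights, and applicant values are invented, and the weights are assumed to have been learned elsewhere:

```python
def predict(weights, bias, features):
    """Linear model: the score is just a sum of inspectable terms."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical, normalized features for a lending-style model.
feature_names = ["income", "tenure_years", "late_payments"]
weights = [0.4, 0.2, -0.9]   # assumed learned weights, shown for audit
bias = 0.1

applicant = [0.5, 0.3, 1.0]
score = predict(weights, bias, applicant)

# Transparency in action: each feature's exact contribution is visible,
# so a reviewer can see *why* the score is what it is.
for name, w, x in zip(feature_names, weights, applicant):
    print(f"{name}: weight {w:+.2f} * value {x:.2f} = {w * x:+.2f}")
print(f"bias {bias:+.2f} -> score {score:+.2f}")
```

With a deep network, no such line-by-line accounting exists, which is the documented tradeoff Lexy mentions: you give up this kind of auditability in exchange for (potentially) better accuracy, and that choice itself belongs in the documentation.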
Marie: Perfect. Well, thanks so much, Lexy, for going over more details on training transparently. Again, this is Marie with the Data Science Ethics Podcast.
Lexy: And Lexy.
Marie: Thanks so much for joining us.
Lexy: See you next time.
We hope you’ve enjoyed listening to this episode of the Data Science Ethics Podcast. If you have, please like and subscribe via your favorite podcast app. Also, please consider supporting us for just $5 per month. You can help us deliver more and better content.