
[Diagram: Microsoft's Team Data Science Process]

Episode 2: The Data Science Process – Show Notes

Decisions made at every stage of the data science process can impact the ethics of the outcome. From data selection to hypotheses tested to interpretation, data scientists must carefully evaluate the implications of their models and outputs. In today’s episode, we delve into the data science process to illustrate the areas where ethical considerations must come into play.

Data Science Process Overview

The diagram above, from Microsoft’s Team Data Science Process, illustrates the various steps taken throughout a data science project. There is a lot of back and forth, as depicted with all of the arrows. Data scientists iterate through hypothesis establishment and testing, data preparation, model development, and conferring with subject matter experts to arrive at a reasonable model for use. Below is a quick view of the main steps and how ethics are involved every step of the way.

Business Question Understanding

The data science process starts with a problem to be solved – a business question. Subject matter experts and data scientists meet to review the problem and to discuss the path towards analytics. Both groups may have some conception of what data should be used or what hypotheses should be tested. These first conversations immediately set a tone of how models are approached that carries through the entire analysis.

Selecting a question has ethical implications. Which questions get priority? Which ones are left on the back burner for later? Are these the ones of most value to the people asking? Have they considered the ways the answers will be used and whether these are ethical?

Beyond that, the hypotheses brought up by both the subject matter experts and the data scientists are subject to bias. There are also gaps: hypotheses that do not get brought up, or that cannot be tested and therefore are never considered.

Data Gathering

Next, data scientists assess what data is available and start gathering it for use. This can lead to further decisions based on whether the data can be matched up appropriately, data cleanliness, applicability, and more.

Some data may be included or excluded simply based on whether it can be linked to other information. This doesn’t necessarily make the analysis more or less ethical – just more convenient. However, the fact that data is or is not available can create additional biases that go unnoticed or undocumented.
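As a sketch of how linkability drives inclusion – using hypothetical customer and survey tables – pandas' `merge` with `indicator=True` makes visible which records would silently drop out of an inner join:

```python
import pandas as pd

# Hypothetical tables: customers and survey responses keyed by customer_id
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4], "region": ["N", "S", "S", "W"]})
surveys = pd.DataFrame({"customer_id": [1, 2, 2, 5], "score": [3, 4, 5, 2]})

# An outer join with indicator=True shows which rows fail to link,
# so exclusions can be documented rather than introduced unnoticed
linked = customers.merge(surveys, on="customer_id", how="outer", indicator=True)
unmatched = linked[linked["_merge"] != "both"]
print(unmatched[["customer_id", "_merge"]])
```

Documenting the unmatched rows is one way to keep availability-driven exclusions from going unnoticed.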

Data Exploration

Once data is gathered and linked together, data scientists explore what the data looks like. We run descriptive statistics to evaluate whether data has variation, is populated, is distributed in the ways we need, and so forth. Based on this exploration, data scientists make decisions about how to handle certain situations in raw data – for example, how to treat anomalies or whether to exclude data.
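A minimal sketch of this kind of exploration, using a hypothetical raw extract – checking distributions, missingness, and variation before any modeling decisions are made:

```python
import pandas as pd

# Hypothetical raw extract
df = pd.DataFrame({
    "age": [34, 29, 41, 29, 300],          # 300 is an obvious anomaly
    "income": [52000, None, 61000, 48000, 75000],
    "segment": ["A", "A", "A", "A", "A"],  # no variation at all
})

print(df.describe(include="all"))  # distributions at a glance
print(df.isna().mean())            # share of missing values per column
print(df.nunique())                # columns with no variation stand out
```

What to do with the anomalous age or the constant segment column is exactly the kind of judgment call the exploration step surfaces.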

New hypotheses come from exploring what is present and seeing patterns in the data. Again, this presumes that the data present is representative of the full scope of possible influencing factors and outcomes. The hypotheses developed are also subject to preconceived notions of the observers. If the data scientist does not consider enough possibilities, the rest of the process may be biased.

Feature Engineering

Feature engineering means crafting the data fields – the features – to be used in model development. Data scientists process and clean the data to create meaningful inputs.

Every change of the data leads to features that are helpful for modeling, yet may cause the meaning of the data to shift a bit. Most changes are benign – they simply shape the data for use. But each change should be evaluated for its impacts.
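For instance, a common benign-looking change – imputing missing values with the median – quietly asserts that missing cases resemble typical ones. A sketch with a hypothetical income column:

```python
import pandas as pd

# Hypothetical income field with gaps
incomes = pd.Series([48000, 52000, None, 61000, None, 75000])

# Median imputation fills gaps but shifts the meaning of the field:
# "unknown income" silently becomes "typical income"
median = incomes.median()
income_filled = incomes.fillna(median)

# Keeping an explicit missingness flag preserves the original information
income_was_missing = incomes.isna()
print(income_filled.tolist(), income_was_missing.tolist())
```

Carrying the missingness flag alongside the imputed value is one way to keep the change evaluable later.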

Model Development & Iteration

Data scientists test multiple models across different sets of features to find those that answer the business question best. “Best” is a tough thing to quantify. We have to make compromises between accuracy and general use. Getting a model that includes enough features to be predictive, yet can still be used across new data sets, is an art rather than a science. Some features may be thrown out in this process that would have implications for outcomes and use downstream.
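One common way to balance accuracy against general use is to compare candidate models by cross-validated score rather than by training fit. A sketch with scikit-learn on a built-in toy dataset (scikit-learn assumed to be available):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic": LogisticRegression(max_iter=5000),
    "deep_tree": DecisionTreeClassifier(),  # prone to overfitting
}

# Mean held-out accuracy approximates how each model generalizes to new data
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 3))
```

The highest-scoring model is not automatically the right choice – interpretability and downstream use matter too, as the next section discusses.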

Model Interpretation

Subject matter experts are not just interested in knowing that a question can be answered. They also want to know what the answer is. Data scientists can easily interpret the results from classic predictive analytics techniques and many machine learning algorithms. However, more complex algorithms and AI functions are “black box” in nature – meaning that we cannot necessarily see what happens to the inputs to achieve the result.

Being able to readily see what is predictive in a model is a helpful capability. Data scientists interpret the model and can then make assessments of ethical constructs. There is nothing to say that black box techniques will create biased or unethical models. It is simply more difficult to evaluate the results of these techniques.
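With a transparent technique such as a linear model, what is predictive can be read directly off the fitted coefficients – a sketch on a built-in dataset (scikit-learn assumed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

# Standardized coefficients: magnitude indicates influence, sign indicates direction
coefs = model.named_steps["logisticregression"].coef_[0]
ranked = sorted(zip(data.feature_names, coefs), key=lambda t: -abs(t[1]))
for name, weight in ranked[:5]:
    print(f"{name}: {weight:+.2f}")
```

This kind of readout is exactly what a black-box technique does not offer directly, which is why evaluating such models for bias takes more work.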

Model Deployment

Creating a model doesn’t, by itself, serve the initial purpose. The model has to be used to have true value. A key part of the data science process that often gets missed is deploying the final model to production systems. Business systems use models in different ways.

Often, the use of the model is more important for its ethics than the model itself. Where and how the model is implemented is where the impacts are felt. Data scientists and the model stakeholders must carefully consider what the model is doing and therefore how it can be deployed.

Model Retraining

Most models are not “set-and-forget”. The information used to initially train the model ages out and becomes less relevant over time. It’s therefore important to revisit models and ensure that they do not degrade. That is, that the model remains accurate for its purpose. Some machine learning techniques can retrain themselves and improve over time by taking more recent data into account.
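Monitoring can be as simple as tracking a scoring metric over time and flagging when it drops below a tolerance – a sketch with hypothetical weekly accuracy figures and a made-up threshold:

```python
# Hypothetical weekly accuracy measurements for a deployed model
weekly_accuracy = [0.91, 0.90, 0.89, 0.87, 0.84, 0.79]

BASELINE = 0.90   # accuracy at deployment (assumed)
TOLERANCE = 0.05  # acceptable drop before retraining (assumed)

def needs_retraining(history, baseline=BASELINE, tolerance=TOLERANCE):
    """Flag the model for retraining once accuracy degrades past tolerance."""
    return history[-1] < baseline - tolerance

print(needs_retraining(weekly_accuracy))  # the latest week has drifted too far
```

A fairness metric could be tracked the same way, so that oversight covers both accuracy and equity.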

All models should have some data scientist oversight to ensure that they remain both accurate and fair. And so the data science process repeats.

Additional Links on The Data Science Process

These links are to different methodologies commonly used for the data science process. They share many of the same stages as we describe above and in the episode.

CRISP-DM Methodology – the cross-industry standard process for data mining has been the go-to data science process for about twenty years.

Microsoft’s Team Data Science Process – this methodology emphasizes data science as a team effort as well as the need to constantly revisit models after initial development.

SEMMA Methodology – an alternative methodology encoded within the SAS Enterprise Miner software, SEMMA begins with the data preparation steps rather than the business question.

Episode Transcript

Welcome to the Data Science Ethics Podcast. My name is Lexy and I’m your host.

This is episode 2: The Data Science Process. Today, I’m joined by Marie Weber, who will be the true host. She’s going to lead the conversation regarding the data science process and how it has implications for ethics.

Marie: So welcome everybody to The Data Science Ethics Podcast. This is Marie Weber. I am a Digital Marketing Strategist. I am here with Alexis Kassan who is a data scientist and today we’re going to talk about the data science process.

So Lexy, in terms of data science, what is the process that you typically go through? And probably the best question is, when the process starts off, how do you work with somebody? And how do you get the right question so you make sure that you’re doing the right process when it comes to your data science?

Lexy: Alright, so there’s a lot bundled into that. And as you alluded to, the first thing that we do is understand the business question or understand the use case that we’re going after. That typically involves chatting with subject matter experts, chatting with the owner of the question, the process. It’s a little bit different for data science than it is for, let’s say, traditional sciences – physical sciences – where the people who are conducting that research are very likely to be the ones coming up with the next question and doing the testing. We may, but it’s usually separated a little bit. So data scientists often start with talking with the subject matter expert on what they’re looking for – what they’re trying to solve.

One of the things that at least I, as a data scientist, typically ask is “do you have any hypotheses as to why this is or what is going on” or “do you have any conceptions currently of how you would approach this in terms of what data you would look at, what you think is important here.” That starts to set the framework for what we then have as a next step, which is identifying what data we need to bring together.

Marie: That’s where you get into your data inclusion versus your data exclusion and how you go about getting that data.

Lexy: We think about it from the standpoint of data gathering and data blending. That’s kind of the first step. Then we start looking at the shape of the data and understanding how that data is distributed and what are the classifications that we have in all of these different elements of what does the data say on its own. Not predictive, not looking for any specific answer yet, just what does it look like?

Then we start getting into understanding, along with the context that we got from a subject matter expert and whatever context we bring of our own, we’ll start to get a better understanding of how to treat that data.

You mentioned data inclusion and exclusion. That happens often when you have to deal with outliers or anomalies in the data. It has to deal with – are there classes of information that you do or do not want to include? Are there classes of information that you have to include? Are there specific types of information that are more or less meaningful? This is often the case where you get a bunch of data elements together and then you look at them and you say “there’s no variation.”

If everything’s the same – let’s say you get a survey answer and everything is a three on a scale from one to five – oftentimes we’ll say “well, they didn’t bother. This doesn’t really tell me anything. I’m throwing it out.” Those decisions are part of the process. Understanding that anything where they’ve just straight-line answered – “this is three” all the way down or “this is five” all the way down – may or may not be meaningful. And so if I have enough data that I can exclude some of those, I might. But those are all decisions that play into the results you get later on. So understanding not only what you brought together, but what decisions are then made, is really important.
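The straight-line exclusion Lexy describes can be sketched in a few lines, using a hypothetical survey frame (pandas assumed):

```python
import pandas as pd

# Hypothetical survey responses, one column per question (1-5 scale)
responses = pd.DataFrame({
    "q1": [3, 3, 3, 3],  # everyone answered three: no variation
    "q2": [1, 4, 5, 2],
    "q3": [2, 2, 3, 5],
})

# Columns with a single unique value carry no signal for modeling;
# note that dropping them is itself a decision worth documenting
kept = responses.loc[:, responses.nunique() > 1]
print(list(kept.columns))
```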

Marie: I think it’s also interesting to talk about the decisions that you’re making at the beginning of the process because those decisions could then also affect what happens further down the process. If you exclude something where everybody answered three all the way down, but then in the future that was changing and it wasn’t actually part of your model – that’s an area where somebody made a decision and it could have an impact later on. So it’s very important to think about what the impacts are, not just in the analysis that you’re doing at the beginning, when you’re getting set up, but also how that could potentially affect things in the future, especially if that variable changes or you expect it to potentially change.

Lexy: Absolutely! This is something that we’ll see in a lot of the case studies that we get in to. What data actually was used in the training of a model is a tremendous influence on the results at the end. There are times when it makes all the sense in the world to exclude certain information. And there are times when that exclusion may or may not have been purposeful but it caused outcomes that were unintended and potentially damaging. So we’ll definitely have some case studies there.

Marie: Or intended but with unintended consequences.

Lexy: Absolutely. And understanding the context of the data – how that data was gathered – is a whole other subject. We will have a separate informational session on that soon.

Marie: So it’s important to consider any bias that might come from the sources of your data.

Lexy: Absolutely. There’s bias that happens from the sources of your data and how the data was put together before you got access to it. There’s bias that can be introduced based on whether or not you’re actually bringing in a specific source of data. There’s also bias that happens in kind of the next step, which is the data preparation step, or what we often call feature engineering.

So with those you’re making decisions as to how you’re going to manipulate the data, how you’re going to aggregate or bin data. Let’s say, for example, that I want to look at income ranges. I might say, “well, I don’t really need to know the specific value of someone’s income, nor do I likely have that value, but the range that it falls in may make sense to look at.” How do I define those ranges? Where do I cut it off? That’s a decision to be made.

What about if I were to look at education? How important is that? How do I include that? Do I include it as a value where I say less than high school, high school, some college, college graduate, post grad, etc? Or do I make it an ordinal value – zero through n? Those types of decisions. Some of them have to do with the actual techniques you’re going to use later on. For instance, some techniques cannot use a text value. Some of them need to have numeric values. But understanding what was assigned to what makes a difference. These are all things that are happening as part of the process of just preparing to do a model.
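The binning and ordinal-encoding choices described above can be sketched like this, with hypothetical cutoffs and category orderings standing in for the real decisions:

```python
import pandas as pd

income = pd.Series([23000, 48000, 51000, 87000, 120000])
education = pd.Series(["high school", "some college", "college graduate",
                       "high school", "post grad"])

# Where the cutoffs fall is a judgment call: different edges, different model
income_range = pd.cut(income, bins=[0, 30000, 60000, 100000, float("inf")],
                      labels=["<30k", "30-60k", "60-100k", "100k+"])

# Ordinal encoding assumes the categories are evenly spaced steps
edu_order = {"less than high school": 0, "high school": 1, "some college": 2,
             "college graduate": 3, "post grad": 4}
education_ord = education.map(edu_order)

print(income_range.tolist())
print(education_ord.tolist())
```

Both the bin edges and the assumption that education levels are evenly spaced are exactly the kinds of encoding decisions that should be recorded.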

As you get into the actual modeling process you start to look at all of those features and say “are these or are these not predictive? Are they or are they not meaningful?” Essentially, in the model, are we seeing any difference based on having this particular variable in the model? In the set of things to be evaluated? And then we start doing things like taking them out, putting in different ones, rearranging the features, coming up with new features, trying again… This starts this iterative process.

Some of the time, when you’re going through each of these steps, you start going back to the subject matter experts and saying “Hey, you had this hypothesis. I tried it in the way that you said and I’m not seeing a lot of impact. Or I don’t see this as being particularly important in the model. Is there something I’m not thinking about here?” Back and forth – this iterative process of first figuring out which features to use and then also identifying which algorithms to use.

The selection process of which algorithm is usually one that’s done based on the accuracy of the model. So you look at a bunch of different statistics about how it’s performing and you see is it giving me a reasonable outcome? “A reasonable outcome” is a very subjective term. So you might try to go for ultimate accuracy, but in the ultimate accuracy category you end up dealing with what we call “overfitting.”

Overfitting means that you’ve set up your model specifically for that set, and that set only, of data. That’s when we use things like cross-validation, which is when you use that same model and you say “does it work on this other set of data? And how about this other set of data?” You do this many times to see “am I still getting the same results? Or did I fit to the training data only and now it doesn’t work when I go outside of that data.”
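The overfitting signature Lexy describes – great on the training data, worse everywhere else – shows up as a gap between training accuracy and cross-validated accuracy. A sketch on a built-in dataset (scikit-learn assumed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# An unconstrained tree can memorize the training set
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)
train_acc = tree.score(X, y)                       # fit to the data it saw
cv_acc = cross_val_score(tree, X, y, cv=5).mean()  # held-out performance

# A large gap between the two is the signature of overfitting
print(f"train={train_acc:.3f} cv={cv_acc:.3f}")
```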

We deal with all of these things as part of the process. And then, at the end of it, you most commonly go back to your subject matter experts, back to the owner of the question, and you say “Hey, how about this?” You present your model.

Here’s where it gets a little tricky these days. I came from a background in statistics and predictive analytics where you could more or less explain the model that you’ve created. You know, no matter what you did in the interim process, you could explain all the steps. Now we’re getting into an area where sometimes you can – when you’re using more traditional methods and some of the machine learning processes and so forth, there are some things you can do to explain it.

But when you get into things like neural nets and deep learning (and we’ll get into these, I promise, on a later episode because there’s a lot in there) you’re essentially letting the machine do its own thing. You give it all the data and it processes and it figures out the best way to do things and it may or may not tell you what it’s done. Trying to reverse engineer it is essentially looking at a bunch of descriptive statistics at the end and saying “well where is it more or less balanced.” And depending on how deep you go, you may not as a human be able to get there with describing that model. You just know that it works and that it seems to work across multiple sets of data.

So at the end of all of this, hopefully, if you’ve done your work well and you’ve thought through as much as you can, you come up with an algorithm that allows you to get to the answer that the subject matter expert was needing. It does this in a consistent way, using data that is valid and ethical in its use, that allows for new data to come in and be reasonably modeled, and that really provides a benefit to the people who are asking for it.

Marie: Nice. And when you think about the data science process – once you get the model set up then what do you do? What are the steps that happen at that point?

Lexy: It’s one thing to create a model – it’s another thing to keep it running. And often, this is interesting – there was a recent statistic that came out saying that something like 83% of all models that are created never go into use – never go into production.

Marie: That’s really high.

Lexy: It’s really high. There are a lot of reasons for that. Part of it is that – what you may have heard me not speak about is actually “productionalizing” that model – so putting it into an application, putting it into use somewhere. The data science process often ends at “here’s my model, what do you think?”

It doesn’t end at that in the real world though. Because if that model is good, what you want to do is say “here’s my model. It’s awesome. Let’s go use it!” That “let’s go use it” part often requires a lot of extra steps that data scientists can’t do. Now that’s changing quite a bit, and very rapidly, because there are a lot of tools now that allow for much easier integration with other applications. So if I build something using some of these tools, at the end of it can be called as a web service.

For those of you who are not programmers, a web service is essentially a type of service where you can send information, it will run a process, and it will return you back some information. These happen all the time – you’ll see them with anything where you’re signing in via Facebook, or you’re connecting one site to another, or it’s pulling data, for example a stock ticker. All of these things are essentially ways of sending a request for information and receiving back information.
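A model-as-web-service can be sketched in a few lines with Python’s standard WSGI interface – the field names and the scoring formula here are stand-ins, not a real trained model:

```python
import json

def model_service(environ, start_response):
    """Minimal WSGI app: accept JSON features, return a JSON score."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    features = json.loads(environ["wsgi.input"].read(size))

    # Stand-in for a trained model's predict() call
    score = 0.4 * features["income_bin"] + 0.6 * features["education_ord"]

    body = json.dumps({"score": score}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]
```

In production this callable would sit behind a WSGI server (for example `wsgiref.simple_server` or gunicorn); the subject matter expert’s application posts features and gets a score back.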

Those types of applications are now available more commonly for data scientists to make their models available for use. They could be private or public, depending on what the application is. It’s very common for it to be a private application that would simply be served up to the subject matter expert and their group, essentially, to then be able to use that model. So it’s getting easier.

And then, from the point where it’s actually starting to be used, as a data scientist there’s care and feeding that I need to do. I need to make sure that my baby, my model, out there in the real world, is doing alright – is doing what it’s supposed to do and is performing the way that I had intended it to perform and is doing the function that it was supposed to do. I, as a data scientist, might have some telemetry, some information, that I’m getting back from the model or from the business saying “here’s how well it’s doing in whatever it was meant to be doing.”

At some point I might say “OK, I’m seeing this model degrade” – meaning it’s not doing as well over time. Or “I’m seeing it get better,” depending on how I’ve built it. Some of the models are auto-learning. They will continue to improve over time rather than degrade over time. It depends on how it’s built. So I may at some point say “OK, I either have new data that I want to put in or the model’s not doing as well.” Maybe I need to go back and change something. And you start this process again.

So once the model’s out there, that’s not the end, right? You really have to make sure that you’re continuing to stay on top of it. You’re continuing to understand how it’s doing while it is out there doing its job.

Marie: Awesome. Well, thank you, Lexy, for taking us through the data science process and some of the different stages that are involved in the process. And this is the Data Science Ethics Podcast. Thanks so much.

Lexy: Thanks.

I hope you’ve enjoyed listening to this episode of the Data Science Ethics Podcast. If you have, please like and subscribe via your favorite podcast app. Join in the conversation at datascienceethics.com, or on Facebook and Twitter at @DSEthics. See you next time.

This podcast is copyright Alexis Kassan. All rights reserved. Music for this podcast is by DJ Shahmoney. Find him on Soundcloud or YouTube as DJShahMoneyBeatz.