Preventing Fake News

Episode 24: Preventing Fake News – Show Notes

Researchers at OpenAI have made amazing breakthroughs in natural language processing in the creation and interpretation of content. So amazing, in fact, that they have elected to withhold the full version from release so that it does not fall into malicious hands. They stated preventing fake news as one of several reasons to limit the use of this technology.

Today, we discuss what they have been able to accomplish, how they are limiting the released version, and what the implications of these sorts of advancements could be.

Additional Links on Preventing Fake News

Researchers, Scared by Their Own Work, Hold Back “Deepfakes for Text” AI – Ars Technica article detailing OpenAI’s GPT-2

Better Language Models and Their Implications – OpenAI blog post including a writing sample from GPT-2 and insight from the data scientists who created it

OpenAI’s New, Multi-Talented AI Writes, Translates, and Slanders – The Verge article which includes discussion with David Luan of OpenAI on when they realized the implications of their new creation

Preventing Fake News – Episode Transcript

Welcome to the Data Science Ethics Podcast. My name is Lexy and I’m your host. This podcast is free and independent thanks to member contributions. You can help by signing up to support us at datascienceethics.com. For just $5 per month, you’ll get access to the members only podcast, Data Science Ethics in Pop Culture. At the $10 per month level, you will also be able to attend live chats and debates with Marie and I. Plus you’ll be helping us to deliver more and better content. Now on with the show.

Marie: Hello and welcome to the Data Science Ethics Podcast. This is Marie Weber.

Lexy: And Lexy Kassan.

Marie: And today we are going to talk about new language learning systems developed by OpenAI.

Lexy: This is a very new technology from OpenAI that is trying to interpret, read, and write content, and doing a surprisingly, alarmingly good job at it.

Marie: The other interesting thing about it is how it was trained. So Lexy, do you want to start off with how this is a little bit different than other programs that have been developed before this?

Lexy: This isn’t necessarily tremendously different from prior natural language processing engines, but it has been given a lot more data to work with. OpenAI has developed an unsupervised technique, which means they have not told the algorithm what is correct or incorrect. They haven’t labeled the data as, you know, what topic it’s about, whether it’s positive, negative, neutral, etc. They’ve just allowed it to read everything and come to its own conclusions.

Lexy: Initially, what they wanted to do was predict the next word that was likely to show up. Over time, what they did was redevelop it to take a prompt and, based on the prompt it was given, develop a piece of content that it predicted would follow that prompt.

Marie: Right, so this was an example of a system that, based on the prompt, could maybe develop a new story, or could write a review on Amazon, or could write fan fiction, or other applications.

Lexy: Correct. The other thing it can do is parse through an article or a piece of content and identify key information. So think of it as an AI that could answer a critical reading test or a writing test on something like the SATs, where it would probably receive pretty high marks for what it discerned from the article or what it was able to craft as a response to a question.

Marie: Exactly. And so this is also a good way to think about unsupervised learning because, as you were saying, it wasn’t necessarily told “this is a right answer and this is a wrong answer” based off of a data set that it was being trained on. It was given a bunch of text to basically review. And then, instead of saying you’re going to take a test where you have to choose the right answer, you’re going to be given a prompt, like you are given in the SATs to write an essay, and can you write, you know, a compelling essay about that topic?

Lexy: Right. It says, “Unlike other text generation ‘bot’ models, such as those based on Markov chains, the GPT-2 ‘bot’ did not lose track of what it was writing about as it generated output, keeping everything in context.”
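
To make the contrast in that quote concrete, here is a minimal sketch of a Markov-chain text generator of the kind it mentions. This is not how GPT-2 works. It is a toy in which each next word is chosen by looking only at the single previous word, which is why such generators tend to drift off topic. The corpus and the names `build_bigram_model` and `generate` are made up for illustration.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Map each word to the list of words observed immediately after it."""
    words = text.split()
    model = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word].append(next_word)
    return model

def generate(model, prompt_word, length=10, seed=0):
    """Continue from prompt_word, choosing each next word from the previous word only."""
    rng = random.Random(seed)
    output = [prompt_word]
    for _ in range(length):
        followers = model.get(output[-1])
        if not followers:
            break  # the last word was never followed by anything in the corpus
        output.append(rng.choice(followers))
    return " ".join(output)

# Tiny illustrative corpus (an assumption, not real training data).
corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_bigram_model(corpus)
print(generate(model, "the"))
```

Because the generator remembers nothing beyond the last word, it cannot keep a subject in view across a sentence, let alone a whole article, which is the limitation the quote says GPT-2 avoided.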

Marie: Yeah. Yeah. Yeah.

Lexy: OpenAI developed this natural language engine, and when they discovered how well it was doing, that it was able to produce a reasonably convincing argument against something that’s commonly held to be generally good knowledge, they really started to consider the ramifications of what would happen if this were let loose in the wild. What’s particularly promising to me about this is that OpenAI elected to not release this algorithm. They instead released a limited version of it that did not have access to as much data, as much of the content that it had learned on, so that it could not do as well. They were very concerned that someone with malicious intent would get their hands on it and use it for an ill purpose.

Marie: This relates back to another quick take that we did about deepfakes and the algorithm that was produced, so this could essentially be deepfakes for text. Exactly. Web content. So the fact that the team at OpenAI took precautions is a really good sign, especially when we’re talking about data science ethics.

Lexy: Absolutely. They really are anticipating adversaries, in that if someone were given the opportunity to create essentially fake news stories or fake reviews or something, there are actors who would want to do that. There are reasons they would want to do that, to further their own platforms or what have you, and OpenAI specifically wanted to prevent that from happening as best they could, at least with their algorithm and their technology. Some of the things that they were hoping to achieve with this were to enhance AI-based writing assistants. So if you use, for example, Gmail, you might notice that Google has now started showing prompts of what you might write next, so that it makes it easier for you to write an email without having to type everything out. You can just hit tab and it will populate whatever Google has put in gray.

Lexy: Similarly, there are the prompts that you get in a text messaging application, where it might be giving you options of what you might want to say in reply. Those are AI writing assistants. They also wanted to develop more capable dialogue agents, so chatbots and so forth. They wanted to enhance unsupervised translation between languages, and actually what they used as the baseline for their platform was specifically meant for translation. And they also wanted better speech recognition systems, because it would know more common patterns for what people might say versus alternatives. If you’ve ever gotten frustrated with speech to text, that’s something that they’re trying to enhance. They specifically said, in their reasons for not wanting to release the full version to the public, that they could anticipate and wanted to prevent generating misleading news articles, impersonating other people online, automating the production of abusive or faked content to post on social media, or automating the production of spam or phishing content. Because of those risks, those very real risks, they did not feel comfortable releasing all of that. And so they specifically chose to self-regulate and instead released this limited version that did not use all 40 gigs of data for the training.

Marie: Correct. Do you feel like that is going to keep this problem at bay or do you feel like based off of just technology and if somebody had enough time that somebody would potentially still be able to solve this? Maybe not in the exact same way but solve it in a similar way.

Lexy: It’s entirely possible that someone could solve it even in the same way, potentially, if the only difference is that they didn’t put all of the data behind the algorithm in the way that it was released. The data that it was trained on was, they say, 40 gigabytes of text, which is actually quite a lot of text, that came from any outbound links from Reddit that had a karma score of three or greater.

Marie: The upvoted posts. Yup.
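
The selection rule described here, keeping outbound links from posts with a karma score of three or greater, can be sketched in a few lines. This is only an illustration of the filter as described in the conversation, not OpenAI’s actual pipeline. The post records, the field names `url` and `karma`, and the de-duplication step are all assumptions.

```python
KARMA_THRESHOLD = 3  # "a karma score of three or greater", per the discussion

def select_outbound_links(posts, min_karma=KARMA_THRESHOLD):
    """Return unique outbound URLs from posts whose karma meets the threshold."""
    seen = set()
    links = []
    for post in posts:
        if post["karma"] >= min_karma and post["url"] not in seen:
            seen.add(post["url"])  # de-duplication is an assumption of this sketch
            links.append(post["url"])
    return links

# Hypothetical post records; real data would come from a scrape or a data dump.
posts = [
    {"url": "https://example.com/a", "karma": 5},
    {"url": "https://example.com/b", "karma": 2},
    {"url": "https://example.com/a", "karma": 7},
    {"url": "https://example.com/c", "karma": 3},
]
selected = select_outbound_links(posts)
print(selected)
```

The idea is that the upvotes act as a cheap, human-supplied quality signal, so no one has to label the pages by hand.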

Lexy: They were looking for quality content, things that people felt were valid, valuable pieces of information. So if the only difference is that someone needed to feed it more content, then there absolutely is the possibility that someone could use transfer learning. What that does is take the algorithm as it stands and add a little bit more training or another layer of processing, and so forth. Essentially, you build on that work to then make it your own.

Lexy: I think, because there is so much information here about how OpenAI built their model, that it’s very possible it could be replicated so long as someone had enough processing power. There are plenty of ways of getting that processing power. They would need the expertise, they would need the data, which they could probably scrape, and they would need to be able to process it, and that’s about it. That’s assuming that the rest of the model, the rest of the actual code to run the model, is available. If that was what was released, then it would be… I mean, I think in general the fact that these researchers were able to do it means that chances are other researchers are going to be able to do it too, and they may not have the ethical construct that says we shouldn’t release this or we shouldn’t use this. They may be the people who want to produce false content, and with the combination of a fake piece of text and a fake image that goes with it, you could have something that looks very compelling and is completely made up.

Marie: True. If you think about the fact that the transformer that we referred to earlier was only invented 18 months ago by researchers at Google Brain, that means the timeline from when this basically became feasible is really short. What’s going to be available in the next 18 months could also be fairly transformative.

Lexy: Absolutely. There are plenty of people in the natural language processing space that are studying these types of problems, that are trying to solve for these types of use cases. This field has been evolving very quickly, especially over the last probably three to four years. Obviously the last 18 months of that, since the transformer came out and so forth, have produced different types of results, but I think that between the compute power, the algorithm development, the increased use, the increased visibility into these types of things, and the increased interest, it all leads to this escalating capability that we have to be a little cautious about. We are really at the precipice of finding ourselves in an untrustworthy Internet, beyond just “are there bad actors?” It is truly “did what’s out there even come from a person?”

Marie: Yeah. There are things that this type of technology could solve for that could make things more accessible. Being able to come up with this type of technology to make translation faster and more accurate could be really powerful, but there are these other things to consider, and that’s where anticipating adversaries comes into the data science ethics process. Absolutely.

Lexy: One thing that we should mention is OpenAI is specifically… Their intention, stated on their website, is “discovering and enacting the path to safe artificial general intelligence.” This is just one step on a much broader journey. Trying to find a safe way to do it is a really tricky thing to pin down. So the fact that OpenAI did self-select, essentially, to remove this from the public view, that they did regulate themselves and keep their system from going out into the world, I think is really a great start, but they’re not the only ones looking at this. And so this is really why we’re here talking about these topics. There are so many other people who have these types of considerations to think about.

Marie: And should be including the component of thinking about the ethics aspect of this work in their process. Absolutely. There’s a lot more on this topic, but that’s our quick take on OpenAI’s new language learning system.

Marie: Thank you so much for joining us for this episode of the Data Science Ethics Podcast. This is Marie.

Lexy: And Lexy.

Marie: Talk to you next time. Thanks so much.

We hope you’ve enjoyed listening to this episode of the Data Science Ethics Podcast. If you have, please like and subscribe via your favorite podcast app. Also, please consider supporting us for just $5 per month. You can help us deliver more and better content.

Join in the conversation at datascienceethics.com, or on Facebook and Twitter at @DSEthics where we’re discussing model behavior. See you next time.

This podcast is copyright Alexis Kassan. All rights reserved. Music for this podcast is by DJ Shahmoney. Find him on Soundcloud or YouTube as DJShahMoneyBeatz.