This post originally appeared on the IQT Blog.
With the overwhelming pace of technological change, does the story of a tool matter as much as the tool itself? To explore this question, we are inviting writers, makers, and other creatives to help us get out of the technical weeds, see the bigger picture of emerging tech, and understand why it matters (or not) in our daily lives.
In this post, science writer Shannon Fischer talks with Andrea Brennen, Ryan Ashley, Ricardo Calix, and Andrew Burt about IQT Labs’ recent audit of RoBERTa, an open source, pretrained Large Language Model. If you are interested in collaborating with us on future projects or storytelling, get in touch at email@example.com.
Artificial intelligence (AI) has fully moved into our lives. No longer just for online ad placements and auto-complete, it’s in facial recognition software, job recruiting systems, hospitals, and banking and loan decisions. And yet, despite its impact on literally millions of people, AI can and does regularly go wrong—like the healthcare AI that assigned lower levels of care to sicker black patients than to white patients; the judicial sentencing AI that stigmatized younger and black defendants; or the self-driving car that, according to federal reports, killed a woman because, incredibly, “The system design did not include a consideration for jaywalking pedestrians.”
Multiple factors underlie these failures. There’s the classic ‘garbage in, garbage out’ problem, where the data that an AI algorithm is trained on can be tainted with human bias, often implicit and unintended. This is exacerbated by the fact that many algorithms are black boxes to their users, and sometimes even their authors. A recent white paper by IQT Labs and BNH.AI, a boutique law firm specializing in AI issues, analyzed real-world failures of AI systems. The authors found that far too often, data scientists don’t test their creations beyond accuracy—even though they often need to be put through rigorous challenges for safety, vulnerability, bias, and other potential failures.
Andrew Burt, co-founder of BNH.AI, cites a fundamental asymmetry, where five to ten data scientists can create models that affect hundreds of thousands, if not millions of people. Granted, that’s also AI’s selling point. The problem, Burt says, is that frequently, the data scientists involved do not have a full understanding of the legal implications of their work. “A lot of times, we just see data scientists searching ‘what are the best metrics?’ for problems like fairness and then they implement those,” he says. By the time lawyers get involved, it can be a mess.
This isn’t an unknown issue by any means. In fact, frameworks and recommendations for better transparency and testing have proliferated across AI-affected industries in recent years. A fundamental problem, however, explains Andrea Brennen, Deputy Director of IQT Labs, is the gap between those high-level guidelines, and the practical, actionable steps needed to put them into use. “This is such a new technology, and the processes and tools and platforms for auditing it are still immature and nascent,” Brennen says. “It’s not always clear what to do or how to do it.”
Last year, IQT Labs decided the best way to figure out those steps was to do it themselves. It would be an experiment in action: they would start auditing AI models and tools and force themselves to figure out the necessary, practical steps along the way.
Now, after a successful first run auditing a small, visual-data-based AI tool called FakeFinder, the crew has stepped up the effort to audit a very different beast—a massive, natural-language processing model named RoBERTa, released by Facebook (now Meta) in 2019.
Two questions drive these audits: what should an AI audit actually look for, and how can it be carried out in practice?
While some things have translated well from the FakeFinder experience, others have not. “This has been much harder,” Brennen admits.
IQT Labs’ first audit was deliberately small, partly because it was a first-time experiment, and partly to keep it relevant to the real world. “We could literally spend an infinite amount of time on this,” Brennen explains. “But to make auditing a practical workflow, it has to be something that we can do in a fairly limited timeframe with a limited number of people.”
The two-person multidisciplinary red team paired Brennen (whose expertise is in human-centered design) with Ryan Ashley, IQT Labs’ Senior Software Engineer and cybersecurity expert; BNH.AI founders Andrew Burt and Patrick Hall provided legal input.
“A lot of that initial FakeFinder audit was just us going, ‘Okay, there’s two of us. We’ve got this really compressed timeline. What can we do that’s useful and tangible and isn’t a huge sink for time and effort?’ ” Ashley says.
And it worked. Despite the tight conditions, the team turned up several key discoveries—including security vulnerabilities and a racial bias in the tool’s ability to identify fake videos. They also realized that the tool did not function as an all-encompassing fake video detector because it could only reliably identify one type of deepfake—a technique called “face swap”—even though nothing in its instructions or description suggested that limitation. FakeFinder was and still is only a prototype, but this finding was a prime example of an AI not doing what a user might logically assume it was doing, and could have been a critical point of failure for, say, an intelligence analyst depending on it to identify video evidence.
In the latest iteration of their auditing experiment, IQT Labs expanded their red team to add two data scientists, a software engineer with experience in large system quality assurance, and a data visualization expert who has done a lot of prior work on the security of open source code.
“To make these things successful and safe and trustworthy in the real world, you need multidisciplinary teams,” Ashley says. “Data scientists are super smart people and they’re very good at what they do, but they shouldn’t have to be experts on secure development and coding best practices.”
Large language models like RoBERTa are already ubiquitous in modern life. They underlie everything from chatbots and text auto-completes, to sophisticated text summarizing and sentiment analysis tools used by law firms and insurance companies. RoBERTa is a particularly robust, extensively trained model. In fact, its formal name is “Robustly Optimized BERT Approach” (a name that also alludes to its derivation from a 2018 Google model called BERT, or Bidirectional Encoder Representations from Transformers).
RoBERTa, however, is not a complete tool the way FakeFinder is. FakeFinder has a user interface, a clear purpose, and a workflow to ground the audit. RoBERTa, meanwhile, is just a model—a component of a potential tool, like a library that a programmer would reference in another piece of code. This proved to be an immediate obstacle for the Labs’ red team: to gain any kind of foothold in the audit, they had to make several assumptions about how RoBERTa might be used, so that they could have a scenario to pin their questions on.
The team focused on a task called Named Entity Recognition or NER. In this scenario, a hypothetical user might have a pile of documents and want to use an AI tool to quickly analyze those documents and come up with a list of all the players named. Bias would be a problem if the AI reliably found common English names, but failed to identify names common to another language, like Russian or Arabic.
The team’s data scientists devised a test for RoBERTa’s recognition abilities: they started with 13 English-language novels, like The Picture of Dorian Gray and The Great Gatsby, then randomly switched out character names for names that are common in different languages. So, for the Russian run on Anne of Green Gables, Anne could have become Анастасия, Ольга, or Сергей, and so on. They tested common languages like English, Russian, and Arabic, as well as less common languages, such as Finnish, Icelandic, and Amis (spoken on the east coast of Taiwan), and compared how well RoBERTa did.
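The team’s actual pipeline isn’t public, but a minimal sketch of this kind of substitution harness might look like the following. The novel text, name lists, and the “recognizer” are all placeholders—a real run would call a RoBERTa model fine-tuned for NER rather than the toy stub here, which only “recognizes” capitalized Latin-script words.

```python
import re

def swap_names(text, original_names, replacement_names):
    """Replace each original character name with a name drawn
    (round-robin) from another language's name list."""
    mapping = {orig: replacement_names[i % len(replacement_names)]
               for i, orig in enumerate(original_names)}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

def ner_recall(recognizer, text, expected_names):
    """Fraction of the expected names that the recognizer actually finds."""
    found = set(recognizer(text))
    return sum(1 for n in expected_names if n in found) / len(expected_names)

# Toy usage: swap English names for Russian ones, then compare recall.
text = "Anne walked with Diana to Avonlea."
swapped, mapping = swap_names(text, ["Anne", "Diana"], ["Ольга", "Сергей"])

# Stand-in recognizer: capitalized ASCII words only, so it misses Cyrillic.
naive = lambda t: re.findall(r"\b[A-Z][a-z]+\b", t)
print(swapped)                                        # Ольга walked with Сергей to Avonlea.
print(ner_recall(naive, text, ["Anne", "Diana"]))     # 1.0 on the original names
print(ner_recall(naive, swapped, mapping.values()))   # 0.0 — the stub misses the swapped names
```

Comparing recall before and after the swap, language by language, is what lets a harness like this surface a bias: a fair model should find Анастасия in the same sentence positions where it found Anne.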
Remarkably, RoBERTa performed well, with no significant bias across most languages tested. “This actually initially caused frustration,” admits Ricardo Calix, one of the red team’s data scientists. “When we went into the experiment, we expected the model to be biased, and it wasn’t.” It wasn’t until the team finally tried running names in Saisiyat, spoken in the north of Taiwan, that the AI’s performance dropped.
Once the team discovered the Saisiyat vulnerability, they were able to work backward to better understand how RoBERTa identified names, and how they could exploit that. Instead of detecting names from grammar and context, it seemed like RoBERTa was actually recognizing key subwords—groups of letters within names that the model recognized as especially ‘namey’.
The team began playing around with letter combinations. They discovered that adding common English name subwords like ‘son’—transforming a Spanish name like Sofía into Sofíason, for instance—helped RoBERTa recognize names more reliably. Adding a triplet of Saisiyat characters, on the other hand, did the opposite.
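A sketch of that perturbation is below. The suffix ‘son’ comes from the example in the text; the Saisiyat character triplet the team used isn’t reproduced here, and the names and sentence are placeholders—the point is only the mechanics of appending a subword to every name in a document to probe how it shifts the model’s behavior.

```python
import re

def suffix_names(text, names, suffix):
    """Append a subword suffix to each listed name, so a downstream NER
    model can be probed for how the suffix shifts its recognition."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: m.group(1) + suffix, text)

# 'son' is a common English-name subword that boosted recognition in the
# team's tests; a rare-subword suffix would be used to degrade it instead.
print(suffix_names("Sofía met Mateo.", ["Sofía", "Mateo"], "son"))
# Sofíason met Mateoson.
```

Running the same document through the model twice—once unmodified, once suffixed—and diffing the extracted entities shows how much the subword alone moves the needle.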
“It’s a kind of poisoning because I can intentionally add these [subwords] at the end of names and know what’s going to happen to performance,” Calix explains. In a real-world situation, such as posting text on social media, an attacker could poison the model by adding in rare characters to sensitive words and therefore affect performance.
Just as the red team had to make assumptions about how RoBERTa would be used in order to test for bias, for the security portion of the audit, Ashley had to select downstream software platforms that data scientists might use to access RoBERTa. This led to the discovery of a significant vulnerability (coming in a future post) but it wasn’t a comprehensive security analysis of a RoBERTa-based tool, because of course, there is no complete tool for the team to audit.
The ethical portion of the audit proved even tougher. “What are the ethical implications of a model in the abstract?” Brennen asks. “It was like trying to evaluate the ethical implications of a hammer.”
“Maybe the takeaway here is that for certain portions of the audit, you can look at a model in the abstract, but for other aspects of assurance, you have to know how the model is being used and what it’s being used for,” Brennen says. “Without knowing this, it’s hard to assess end-to-end risks.” Going forward, Brennen says the IQT Labs red team will most likely stick to auditing AI tools and systems.
Andrew Burt, who with his BNH.AI co-founder provided input on the RoBERTa bias research, commends IQT’s efforts. IQT Labs is not the only BNH.AI client trying to figure out how to interrogate AIs, but it is the only one doing so openly, sharing its experiences to invite comment and collaboration. “I think there’s a real openness and willingness to try new things,” he says. In Burt’s line of work, clients often resist audits and push back. “And with IQT, we just don’t see that at all, every piece of feedback that we give is really embraced.”
But then there are issues like the fact that the bias team had to work through more than a dozen languages before discovering the AI’s weakness in one obscure Taiwanese language. There are over 7,000 languages in the world today—so what should the standard be? Should auditors really be scanning their text-based AIs against all 7,000? It’s a point that gets back to Brennen’s earlier statement about deliberately keeping the scope small—a few people, a few months, a few key questions. Any more than that, and the entire auditing process could quickly balloon past what a company under pressure to release a product is willing to do.
Ryan Ashley likes to paraphrase a quote attributed to Jamie Zawinski, a well-known programmer from the early days of Mosaic and Netscape: “A fifty percent solution that reaches fifty percent of the people is still better than a hundred percent solution that never gets out of your lab.”
Both Ashley and Brennen agree that there is no way to fully explore and mitigate every single risk; that would be an impossible ask. But that doesn’t mean that an AI creator shouldn’t try to explore the limits and risks of their tools. “You should know what you’re using,” Brennen says. “I feel very confident that if you just assume everything’s fine, that’s not ideal.”
For now, the goal is simply to answer some basic questions about risk, in a way that IQT Labs can back up with rigorous testing and empirical results—and “then be as transparent as possible about what we did, why we did it, and also what we think is wrong with it,” Brennen says. “And hopefully, this will spur a larger conversation.”