From Google to Nvidia, tech giants have hired hackers to break AI models

Forbes We spoke to the leaders of the AI ​​Red Teams at Microsoft, Google, Nvidia and Meta, which are tasked with finding vulnerabilities in AI systems so they can be fixed. “We’re going to start seeing ads about ‘our most secure,'” predicted one AI security expert.

through Rashi SrivastavaForbes staff


A A month before publicly launching ChatGPT, OpenAI hired Boru Gollo, a lawyer in Kenya, to test GPT-3.5 and later GPT-4 for stereotypes against Africans and Muslims, the chatbot being harmful, biased and generating prompts. Incorrect response. Golo, one of about 50 external experts hired by OpenAI to be part of its “Red Team,” typed a command into ChatGPT, which generated a list of ways to kill the Nigerian — a response OpenAI fired before the chatbot. Available to the world.

Other Red-Teamers used the pre-launch version of GPT-4 to assist in many illegal and innocuous activities, such as writing Facebook posts to convince someone to join al-Qaeda, helping to find unlicensed guns for sale, and creating processes to manufacture dangerous weapons. . Household chemicals, according to GPT-4’s system card, which lists the risks and safety measures used to reduce or eliminate OpenAI.

To protect AI systems from being exploited, red-team hackers think like an adversary to game them and expose blind spots and vulnerabilities baked into the technology so they can be fixed. As tech titans race to build and roll out generative AI tools, their in-house AI red teams are playing an increasingly important role in ensuring models are safe for the public. For example, Google established a separate AI Red Team earlier this year, and in August developers of several popular models, such as OpenAI’s GPT3.5, Meta’s Llama 2, and Google’s LaMDA, participated in a White House-backed program. Hackers get a chance to jailbreak their system.

But AI red teamers often walk the tightrope, balancing the safety and security of AI models while keeping them relevant and usable. Forbes Talk to leaders of the AI ​​Red Teams at Microsoft, Google, Nvidia and Meta about how breaking AI models came to be and the challenges of solving them.

“You’ll have a model that says no to everything and it’s very secure but it’s useless,” said Christian Canton, head of Facebook’s AI Red Team. “There is a trade off. The more useful you can make a model, the more likely you are to work in an area that might produce an unsafe answer.”

The practice of raid teaming software dates back to the 1960s, when adversary attacks were simulated to make systems as robust as possible. “In computers you can never say ‘this is safe’. All we can say is ‘we tried and we couldn’t break it,'” said Bruce Schneier, a security engineer and fellow at the Berkman Klein Center for Internet and Society at Harvard University.

But generative AI is trained on vast troves of data, making protection AI models different from traditional security methods, said Daniel Fabian, head of Google’s new AI Red Team, which tests products like Bard for offensive content before the company adds them. New features such as additional languages.

“Our AI Red Team motto is ‘The more you sweat in training, the less you bleed in battle.'”

Christian Canton, Head of Engineering for Responsible AI at Meta

Beyond interrogating AI models to spit out toxic responses, Red Teams use tricks like training data mining that reveals personally identifiable information like names, addresses and phone numbers, and poisoning datasets by changing portions of the content before using it to train the model. . “An adversary has a portfolio of attacks and if one of them doesn’t work, they’ll move on to the next,” Fabian said. Forbes.

While the field is still in its infancy, security professionals who know how to game AI systems are “invisibly small,” said Daniel Rohrer, Nvidia’s VP of software security. That’s why a tight-knit community of AI red teamers tends to share findings. Google’s Red Teamers have published research on new ways to attack AI models, while Microsoft’s Red Team has open-source attack tools, such as Counterfit, that help other businesses test the safety and security risks of algorithms.

“We were developing these junky scripts that we were using to speed up our own raid teaming,” said Ram Shankar Shiv Kumar, who started the team five years ago. “We want to make this available to all security professionals in a framework they know and understand.”

Before testing the AI ​​system, Shiv Kumar’s team collects data about cyber threats from the company’s threat intelligence team, which are the “eyes and ears of the Internet,” as he calls it. He then works with other Red Teams at Microsoft to determine which vulnerabilities in AI systems to target and how. This year, the team examined Microsoft’s star AI product Bing Chat as well as GPT-4 for flaws.

Meanwhile, part of Nvidia’s red teaming approach is to provide a crash course in how to red team algorithms to security engineers and companies that rely on it for computing resources such as GPUs.

“As the engine of AI for everyone… we have a huge amplification factor. If we can teach others to do that (red teaming), then Anthropic, Google, OpenAI, they’re all right,” Rohrer said.


WWith increased scrutiny of AI applications from users and government officials, red teams also give tech companies a competitive advantage in the AI ​​race. “I think the moat is going to be trust and security,” said Sven Cattel, founder of AI Village, a community of AI hackers and security experts. “You’ll start seeing ads about ‘our safest’.”

Early in the game was Meta’s AI Raid Team, which was founded in 2019 and hosts internal challenges and “risk-a-thons” for hackers to bypass content filters that detect and remove hate speech, nudity, misinformation and AI-generated posts. is done Deep fakes on Instagram and Facebook.

In July 2023, the social media giant hired 350 red teamers, along with external experts, contract workers and an internal team of about 20 employees, to test Llama 2, its open-source latest large-language model, according to a published report that describes how the model was developed. The team offered tips on how to evade taxes, how to start a car without a key, and how to set up a Ponzi scheme. “The motto of our AI Red Team is ‘The more you sweat in training, the less you bleed,'” said Canton, head of Facebook’s Red Team.

That motto was similar to the spirit of the largest AI raid teaming exercise held at the DefCon hacking conference in Las Vegas in early August. Eight companies, including OpenAI, Google, Meta, Nvidia, Stability AI and Anthropic, opened their AI models to more than 2,000 hackers and provided them with prompts designed to reveal sensitive information such as credit card numbers or generate harmful content such as political disinformation. The White House’s Office of Science and Technology Policy teamed up with the event’s organizers to design the Red Teaming Challenge, a guide on how to design, deploy and launch automated systems, following the blueprint of the AI ​​Bill of Rights. safely.

“If we can teach others to do it (Red Teaming), then Anthropic, Google, OpenAI, they’re all right.”

Daniel Rohrer, Nvidia’s VP of Software Security

Initially, companies were reluctant to present their models in a public forum because of the reputational risks associated with red teaming, said Cattell, founder of AI Village, which leads the event. “From Google’s perspective or OpenAI’s perspective, we’re a bunch of guys at DefCon,” he said Forbes.

But after assuring tech companies the models would be anonymized and hackers wouldn’t know which models they were attacking, he agreed. The results of the nearly 17,000 conversations hackers had with AI models won’t be made public until February, but the companies walked away from the program with several new vulnerabilities. Across the eight models, Red Teamers found about 2,700 errors, such as convincing the model to contradict itself or giving instructions on how to observe it without someone’s knowledge, according to new data released by the event’s organizers.

Among the participants was Avijit Ghosh, an AI ethics researcher who was able to get several models to do math wrong, create fake news about the King of Thailand, and write about a housing crisis that didn’t exist.

Such vulnerabilities in the system make red teaming AI models more important, Ghosh said, especially when they can be perceived by some users as all-knowing sensitive entities. “I know a lot of people in real life who think these bots are really intelligent and do things like medical diagnosis with step-by-step reasoning and logic. But it isn’t. It is virtually self-sustaining,” he said.

But generative AI is like a multifaceted monster—as red teams find some holes in systems and other flaws that can occur elsewhere, experts say. “It will take a village to solve this problem,” said Shiv Kumar of Microsoft.

More from Forbes

More from ForbesNobody Beats Wiz: Meet the ultra-aggressive, $10 billion cloud security startupMore from ForbesArmed with ChatGPT, cybercriminals create malware and create fake girl botsMore from ForbesWhy Biden’s White House Gets Behind ‘Largest AI Hacking Event Ever’?More from ForbesHow real people are caught up in Reddit’s AI porn explosion

Leave a Comment