Some of the ways big tech companies feed your personal data to AI feel like a breach of privacy — or even theft
Your email is just the beginning. Meta, the owner of Facebook, took a billion Instagram posts from public accounts to train AI and didn’t ask for permission. Microsoft uses your chats with Bing to train an AI bot to answer questions better, and you can’t stop it.
Increasingly, tech companies are using your conversations, photos, and documents to teach their AI how to write, paint, and pretend to be human. You may be used to them using your data to target you with advertising. But now they’re using it to create profitable new technologies that drive the economy — and can make Big Tech even bigger.
We don’t yet understand what risks this behavior poses to your privacy, reputation, or work. And there’s not much you can do about it.
Sometimes companies treat your data with care. Yet often, their behavior is out of sync with common expectations of what happens with your information, including what you consider private.
If you’ve been using any of Big Tech’s shiny new AI products, chances are you’ve had to agree to help make their AI smarter through a “data donation.” (That’s Google’s actual term for it.)
Lost in the data grab: Most people have no way to make truly informed decisions about how their data is being used. That can feel like a violation of privacy — or even theft.
“AI represents a once-in-a-generation leap,” says Nicolas Piachaud, director of the open source nonprofit Mozilla Foundation. “It’s an opportune moment to step back and think: What’s at stake here? Are we ready to give away our right to privacy, our personal data to these big companies? Or should privacy be the default?”
Tech companies using your data to train AI products isn’t new. Netflix uses what you watch and rate to generate recommendations. Facebook uses what you like and comment on to teach its AI how to order your news feed and which ads to show you.
Yet generative AI is different. Today’s AI arms race requires lots and lots of data. Twitter owner and Tesla chief executive Elon Musk recently bragged to his biographer that he had access to 160 billion video frames shot by cameras built into people’s Teslas to fuel his AI ambitions.
“Everyone is acting as if this is the obvious fate of technological tools built with people’s data,” says Ben Winters, senior adviser at the Electronic Privacy Information Center, which studies the dangers of generative AI. With the increasing use of AI tools, there’s even more incentive to collect as much data as you can.
All of this brings some unique privacy risks. An AI trained to learn everything about the world can also end up learning intimate things about individuals.
Some tech companies even admit this in their fine print. When you use Google’s new AI writing coach for Docs, it warns: “Do not include personal, confidential or sensitive information.”
The actual process of training an AI can itself be intrusive. Sometimes it involves humans looking at other people’s data: people are reviewing our back-and-forth with Google’s new AI-powered search and its Bard chatbot, to name just two.
Even worse for your privacy, generative AI sometimes spits data back out. These systems, which are extremely difficult to control, can regurgitate personal information in response to new, sometimes unexpected prompts.
This even happened to a tech company. Samsung employees using ChatGPT leaked company secrets to the chatbot on three separate occasions. The company then banned the use of AI chatbots in the workplace. Apple, Spotify, Verizon and many banks have done the same.
Big tech companies told me they take pains to prevent leaks. Microsoft says it de-identifies user data entered into Bing Chat. Google says it automatically removes personally identifiable information from training data. Meta said it will train generative AI not to reveal private information — so it can share a celebrity’s birthday, but not a regular person’s.
OK, but how effective are these solutions? This is one of those questions companies won’t give straight answers to. “While our filters are state-of-the-art in the industry, we’re constantly improving them,” Google says. And how often do they leak? “We believe it is very limited,” it says.
So it’s reassuring to know that Google’s AI only occasionally leaks our information. “It’s really hard for them to say with a straight face that we don’t have any sensitive data,” says Winters.
Perhaps privacy isn’t the right word for this mess. It’s also about control. Who would have imagined that a vacation photo they posted in 2009 would be used by a megacorporation in 2023 to teach an AI to make art, put a photographer out of work, or identify someone’s face to the police?
There’s a thin line between “making products better” and theft, and right now tech companies are drawing it themselves.
Which of our data is and isn’t off limits? Much of the answer is wrapped up in lawsuits, investigations, and hopefully some new laws. But meanwhile, Big Tech is making its own rules.
I asked Google, Meta and Microsoft to tell me when they take user data from products that are a core part of modern life to make their new generative AI products smarter. Getting answers was like chasing a squirrel through a funhouse.
They told me they never use non-public user information to train their largest AI models without permission. But those very carefully chosen words leave out many instances in which they are, in fact, building their lucrative AI businesses with our digital lives.
Not all AI uses of data are equal, or even problematic. But as users, we shouldn’t need a degree in computer science to understand what’s going on.
Google is a good example. It tells me its “foundational” AI models — the software behind things like Bard, its answer-anything chatbot — are trained primarily on “publicly available data from the Internet.” Our private Gmail didn’t contribute to that, the company says.
However, Google still uses Gmail to train other AI products, such as the Gmail writing assistant Smart Compose (which completes sentences for you) and the new writing coach Duet AI. That’s fundamentally different, Google argues, because it’s using data from a product to improve that same product.
There’s probably no way to create something like Smart Compose without looking at your email. But that doesn’t mean Google should just turn it on by default. In Europe, where there are better data laws, Smart Compose is off by default. And for products Google still calls “experiments,” like Bard and Duet AI, contributing your data may not be necessary to use its latest and greatest features.
Facebook owner Meta told me it didn’t train its biggest AI model, called Llama 2, on user data. But it has trained other AIs, like an image-recognition system called SEER, on people’s public Instagram posts.
And Meta won’t tell me how its generative AI products are using our personal data for training. After I pushed back, the company said it would “not train our generative AI models on people’s messages to their friends and family.” At least it agreed to draw some kind of red line.
Microsoft updated its service agreement this summer with broader language about user data and made no promises to me about limiting the use of our data to train AI products in consumer-facing programs like Outlook and Word. Mozilla has even launched a campaign for the software giant to come clean. “If nine privacy experts can’t figure out what Microsoft does with your data, what chance does the average person have?” Mozilla says.
It doesn’t have to be this way. Microsoft makes a number of promises to lucrative corporate customers, including those chatting with the enterprise version of Bing, about keeping their data private. “The data always stays within the customer’s tenant and is never used for other purposes,” a spokesperson says.
Why should companies have more of a right to privacy than the rest of us?