Microsoft’s AI research division accidentally exposed 38TB of sensitive data



Pierluigi Paganini
18 September 2023

Microsoft AI researchers accidentally exposed 38TB of sensitive data through a public GitHub repository, with the exposure dating back to July 2020.

Cybersecurity firm Wiz discovered that the Microsoft AI research division accidentally leaked 38TB of sensitive data while publishing a bucket of open-source training data on GitHub.

The exposed data revealed disk backups of two employee workstations containing secrets, private keys, passwords and 30,000 internal Microsoft Teams messages.

“The researchers shared their files using an Azure feature called SAS tokens, which allows you to share data from Azure Storage accounts,” reads a report published by Wiz. “The access level can be limited to specific files only; however, in this case, the link was configured to share the entire storage account — including another 38TB of private files.”

The Wiz research team discovered the repository while scanning the internet for misconfigured storage containers exposing cloud-hosted data. Experts found a repository on GitHub under the Microsoft organization name robust-models-transfer.

The repository belongs to Microsoft’s AI research division, which used it to provide open-source code and AI models for image recognition. The Microsoft AI research team began publishing data in July 2020.

Microsoft used Azure SAS tokens to share data stored in Azure storage accounts used by its research team.

The Azure Storage SAS URL used to access the repository was misconfigured to grant permissions to the entire storage account, exposing private data.

“However, this URL allowed more access than just the open-source models. It was configured to grant permissions to the entire storage account, accidentally exposing additional private data,” the company continues. “The simple step of sharing an AI dataset led to a massive data leak containing over 38TB of private data. The root cause was the use of Account SAS tokens as the sharing mechanism. Due to a lack of monitoring and governance, SAS tokens pose a security risk, and their use should be limited as much as possible.”
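The safer alternative Wiz alludes to is a token scoped to a single resource with read-only permissions and a short expiry, rather than an account-wide token valid for decades. The sketch below illustrates that idea in plain Python: an HMAC-signed token that covers only one blob path, one permission set, and a bounded lifetime. The signing format, parameter names (`sp`, `se`, `sr`, `sig`), and `make_scoped_token` helper are illustrative assumptions modeled loosely on SAS query parameters, not Azure's exact string-to-sign algorithm.

```python
import base64
import hashlib
import hmac
from datetime import datetime, timedelta, timezone


def make_scoped_token(account_key_b64: str, resource_path: str,
                      permissions: str, lifetime_hours: int) -> str:
    """Illustrative HMAC-signed, narrowly scoped share token.

    Mirrors the *idea* behind a service-level SAS (sign only a specific
    blob path, with limited permissions and a short expiry) rather than
    Azure's actual SAS string-to-sign format.
    """
    expiry = (datetime.now(timezone.utc)
              + timedelta(hours=lifetime_hours)).strftime("%Y-%m-%dT%H:%MZ")
    # The signature binds permissions, expiry, and the exact resource
    # path together, so the token cannot be reused for other resources.
    string_to_sign = "\n".join([permissions, expiry, resource_path])
    key = base64.b64decode(account_key_b64)
    sig = base64.b64encode(
        hmac.new(key, string_to_sign.encode(), hashlib.sha256).digest()
    ).decode()
    return f"sp={permissions}&se={expiry}&sr={resource_path}&sig={sig}"


# Read-only token for one blob, valid 24 hours -- not the whole account:
token = make_scoped_token(base64.b64encode(b"demo-key").decode(),
                          "/container/models/model.pt", "r", 24)
```

Had the researchers' link been scoped this way, a leaked token would have granted read access to the published models only, not to the backups and credentials stored elsewhere in the account.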

Wiz pointed out that SAS tokens cannot be easily tracked because Microsoft does not provide a centralized way to manage them in the Azure portal.

Microsoft said the data leak did not expose customer data.

“No customer data was exposed, and no other internal services were put at risk because of this issue. No customer action is required in response to this issue,” reads a post published by Microsoft.

Below is the timeline of this security incident:

  • July 20, 2020 – SAS token first committed to GitHub; expiry set to October 5, 2021
  • October 6, 2021 – SAS token expiry updated to October 6, 2051
  • June 22, 2023 – Wiz Research discovered the issue and reported it to MSRC
  • June 24, 2023 – SAS token invalidated by Microsoft
  • July 7, 2023 – SAS token replaced on GitHub
  • August 16, 2023 – Microsoft completed its internal investigation of the potential impact
  • September 18, 2023 – Public disclosure

Follow me on Twitter: @securityaffairs and Facebook and Mastodon

Pierluigi Paganini

(Security Affairs Hacking, Microsoft AI)
