Microsoft AI Researchers Accidentally Expose Terabytes of Sensitive Data

Microsoft AI researchers accidentally exposed tens of terabytes of sensitive data, including private keys and passwords. The incident occurred when the researchers published a link to a storage bucket of open-source training data on GitHub.

Key Takeaway

Microsoft AI researchers accidentally exposed 38 terabytes of sensitive data while sharing a link to a storage bucket of open-source training data on GitHub. The exposed data included personal backups, passwords, secret keys, and internal employee messages.

The discovery was made by cloud security startup Wiz, which came across a GitHub repository belonging to Microsoft’s AI research division during its ongoing investigation into the accidental exposure of cloud-hosted data. The repository provided open-source code and AI models for image recognition, and instructed users to download the models from an Azure Storage URL.

However, Wiz found that the URL was misconfigured, granting permissions to the entire storage account instead of just the intended models. This error exposed an additional 38 terabytes of sensitive data, including backups of two Microsoft employees’ personal computers. The exposed data also included passwords to Microsoft services, secret keys, and over 30,000 internal Microsoft Teams messages from hundreds of employees.

The misconfigured URL, which had been exposing this data since 2020, also granted “full control” rather than read-only permissions. This meant that anyone who knew where to look could potentially delete, replace, or inject malicious content into the data.

Wiz clarified that the storage account itself was not directly exposed. Instead, the Microsoft AI developers included an overly permissive shared access signature (SAS) token in the URL. SAS tokens are used by Azure to create shareable links that grant access to an Azure Storage account’s data.
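To make the risk concrete, a SAS URL carries its grants in plain query parameters, so it can be checked before it is shared. The sketch below is an illustration only, not Microsoft's or Wiz's tooling. It assumes the standard Azure SAS fields `sp` (permissions), `se` (expiry), and `srt` (resource types, which to our understanding appears only on account-level tokens). The function name `audit_sas_url`, the seven-day threshold, and the set of "write-like" permission letters are hypothetical choices made for this example.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

# Permission letters treated here as "more than read" (assumed policy for
# this sketch): write, delete, add, create, update.
WRITE_LIKE = set("wdacu")


def audit_sas_url(url, max_lifetime_days=7, now=None):
    """Return a list of warnings for a SAS URL; an empty list means none found."""
    now = now or datetime.now(timezone.utc)
    # Keep the first value of each query parameter.
    params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
    warnings = []

    # "sp" lists the granted permissions as single letters, e.g. "r" or "rwdl".
    risky = sorted(set(params.get("sp", "")) & WRITE_LIKE)
    if risky:
        warnings.append("grants more than read access: " + "".join(risky))

    # "srt" is assumed to mark an account-level token, which is not scoped
    # to a single container or blob.
    if "srt" in params:
        warnings.append("account-level token, not scoped to one container or blob")

    # "se" is the expiry time in ISO 8601 form.
    expiry = params.get("se")
    if expiry is None:
        warnings.append("no expiry time found")
    else:
        expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if expires_at.tzinfo is None:
            expires_at = expires_at.replace(tzinfo=timezone.utc)
        if expires_at - now > timedelta(days=max_lifetime_days):
            warnings.append("expires more than %d days from now" % max_lifetime_days)

    return warnings
```

Calling `audit_sas_url` on a link before it goes into a README returns a list of plain-language warnings. A read-only, short-lived link scoped to a single blob should return an empty list under these assumptions.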

The incident raises concerns about the security of data handled by data scientists and engineers in their pursuit of AI solutions. As the amount of data handled increases, additional security checks and safeguards become necessary. Development teams working with large amounts of data, sharing it with peers, or collaborating on open-source projects need to be vigilant to prevent such mishaps.

Wiz reported its findings to Microsoft on June 22, 2023, and Microsoft revoked the SAS token two days later, on June 24. Microsoft completed its investigation into the potential impact on the organization on August 16.

In response to the incident, Microsoft’s Security Response Center stated that no customer data was exposed and that no other internal services were put at risk. The company has also expanded GitHub’s secret scanning service, which monitors public open-source code changes for plaintext exposure of credentials and other secrets, to cover SAS tokens with overly permissive expirations or privileges.
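The general idea behind this kind of scanning can be sketched in a few lines. The example below is not how GitHub's secret scanning is implemented; it is a minimal illustration that looks for URLs whose query string carries both a signature (`sig`) and a version (`sv`) field, which are standard parts of an Azure SAS token. The function name `find_sas_urls` and the URL pattern are assumptions made for this sketch.

```python
import re
from urllib.parse import urlparse, parse_qs

# Rough pattern for an https URL; stops at whitespace, quotes, and brackets.
URL_PATTERN = re.compile(r"https://[^\s\"'<>)]+")


def find_sas_urls(text):
    """Return (line_number, url) pairs for URLs that look like they embed a SAS token."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in URL_PATTERN.finditer(line):
            query = parse_qs(urlparse(match.group()).query)
            # A SAS token carries a signature ("sig") and a version ("sv").
            if "sig" in query and "sv" in query:
                hits.append((line_no, match.group()))
    return hits
```

Run over a README or a commit diff before publishing, a check like this would flag any link that embeds a token so a reviewer can confirm its scope and expiry.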

This incident serves as a reminder of the need for robust security measures when handling sensitive data, especially in the rapidly evolving field of AI research.