CGI Weekly

March 5, 2025

Large Number of API Keys and Passwords Found in AI Training Dataset

Cyber Guardian Intelligence: Valid API key and password combinations found in openly available training dataset.

Date: March 4, 2025

Threat Level: High

Summary: Researchers at Truffle Security have uncovered approximately 12,000 valid API keys and passwords within the Common Crawl dataset, a resource widely used for training large language models (LLMs) by organizations such as OpenAI, Google, and Meta. This discovery underscores the potential security risks associated with using publicly available data for AI training.

Details:

  • Data Source: The Common Crawl dataset, an open-source repository containing petabytes of web data collected since 2008, serves as a foundational resource for many AI projects.
  • Findings: Analysis of 400 terabytes from 2.67 billion web pages in the December 2024 archive revealed 11,908 secrets that authenticated successfully. These included sensitive credentials such as AWS root keys and MailChimp API keys.
  • Implications: The presence of valid credentials in training data poses significant security risks, including unauthorized access to services and potential data breaches.
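
To illustrate how credentials of the kinds named above can be spotted in crawled text, here is a minimal Python sketch. It is not the researchers' tooling. The two regular expressions are assumptions based on commonly cited key formats (an AWS access key ID beginning with "AKIA", and a MailChimp key made of 32 hex characters followed by a data-center suffix such as "-us12"), and the names PATTERNS, Finding, redact, and scan_text are hypothetical. The sketch only flags candidates; it does not test whether a key still authenticates, which is the step that separates live secrets from stale ones.

```python
import re
from typing import Iterator, NamedTuple

# Illustrative patterns only (assumptions, not taken from the report).
# Real scanners ship many more detectors and verify each hit.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
}


class Finding(NamedTuple):
    kind: str       # which pattern matched
    line_no: int    # 1-based line number in the scanned text
    redacted: str   # match with all but the first few characters masked


def redact(secret: str, keep: int = 4) -> str:
    """Mask a secret so the finding can be logged without leaking it."""
    return secret[:keep] + "*" * (len(secret) - keep)


def scan_text(text: str) -> Iterator[Finding]:
    """Yield candidate secrets found in text, line by line."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                yield Finding(kind, line_no, redact(match.group(0)))
```

Calling list(scan_text(page_source)) on a page's text returns the flagged candidates in redacted form, so the scan results themselves do not become a second copy of the exposed credentials.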

Recommendations:

  1. Data Sanitization: Organizations utilizing public datasets for AI training should implement rigorous data cleaning processes to identify and remove sensitive information.
  2. Credential Management: Developers are advised against hardcoding credentials in codebases. Instead, use secure methods such as environment variables or secret management tools.
  3. Regular Audits: Conduct periodic security audits of datasets and code repositories to detect and mitigate exposure of sensitive information.
  4. Access Controls: Ensure that API keys and credentials have appropriate permissions and are rotated regularly to minimize potential misuse.
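
As an illustration of the credential management recommendation, the following Python sketch reads a key from an environment variable instead of embedding it in source code or a web page. The variable name EXAMPLE_API_KEY, the MissingSecretError class, and the get_secret helper are hypothetical and chosen only for this example. A dedicated secret management tool would replace the environment lookup in a larger deployment.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def get_secret(name: str) -> str:
    """Return the secret stored in environment variable `name`.

    Failing loudly on a missing or empty value avoids silently
    falling back to a hardcoded default.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise MissingSecretError(f"environment variable {name} is not set")
    return value


# Hypothetical usage: the key never appears in the codebase.
# api_key = get_secret("EXAMPLE_API_KEY")
```

Because the value lives outside the repository and any published pages, rotating the key requires only a change to the deployment environment, not a code change.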

Conclusion: The discovery of thousands of valid credentials in a widely used AI training dataset highlights the critical need for enhanced security measures in data handling and model training processes. Organizations must adopt comprehensive strategies to protect sensitive information and maintain the integrity of their AI systems.


Cyber Guardian Intelligence – Intel Driven Defense, Always One Step Ahead.
