Crates.io Postmortem: Broken Crate Downloads

2023/07/20
This article was written by an AI 🤖. The original article can be found here. If you want to learn more about how this works, check out our repo.

On July 20, 2023, between 12:17 and 12:30 UTC, crate downloads from crates.io were broken. The cause was a bug in download URL generation that went live with a deployment.

During the incident, crates.io received an average of 4.71K requests per second, which amounted to approximately 3.7 million failed requests, including retry attempts from cargo. The issue was first noticed by the developer who triggered the production deployment and saw elevated request-per-second numbers on the monitoring dashboard. A community member then notified the developer via Zulip, a chat platform used by the Rust community.
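The failed-request figure is consistent with the reported request rate and the 13-minute window from 12:17 to 12:30 UTC. As a quick sanity check, here is a minimal sketch of the arithmetic; the function name is hypothetical, and the rate is rounded to 4,710 requests per second.

```rust
/// Estimates the total number of requests received over a time window,
/// given an average rate in requests per second.
/// (Hypothetical helper, used only to check the article's arithmetic.)
fn estimated_requests(avg_requests_per_second: u64, window_minutes: u64) -> u64 {
    avg_requests_per_second * window_minutes * 60
}
```

Calling `estimated_requests(4_710, 13)` gives 3,673,800, which rounds to the 3.7 million failed requests cited above.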

Upon receiving the notification, the developer immediately rolled back the broken deployment to the previous version, which restored crate downloads.

The lead-up to the incident can be traced back to a pull request that was merged on July 19, 2023, at 17:41 UTC. This pull request migrated the crates.io codebase to the object_store crate for AWS S3 access. As part of this migration, the crate and readme download endpoints were refactored to generate redirect URLs.

Unfortunately, the tests introduced by the pull request used values that differed from those in the production environment, so the production code path was not adequately tested. That code path contained a bug: the generated URL was missing a slash separator, so crate downloads were redirected to invalid URLs.
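To illustrate this class of bug, here is a minimal sketch of how a missing separator can slip past tests. It is not the actual crates.io code: the function names, the `static.example.com` host, and the `crates/{name}/{name}-{version}.crate` path layout are all assumptions made for the example, as is the idea that the test value ended with a trailing slash.

```rust
/// Builds the object path for a crate file.
/// (Assumed layout, for illustration only.)
fn crate_path(name: &str, version: &str) -> String {
    format!("crates/{}/{}-{}.crate", name, name, version)
}

/// Buggy variant: concatenates host and path with no separator.
/// It only yields a valid URL if `cdn_host` happens to end with '/'.
fn redirect_url_buggy(cdn_host: &str, path: &str) -> String {
    format!("https://{}{}", cdn_host, path)
}

/// Fixed variant: strips any slashes at the boundary and inserts
/// exactly one, so the result does not depend on how the host is configured.
fn redirect_url_fixed(cdn_host: &str, path: &str) -> String {
    format!(
        "https://{}/{}",
        cdn_host.trim_end_matches('/'),
        path.trim_start_matches('/')
    )
}
```

With a test value such as `"static.example.com/"`, the buggy variant produces a correct URL and the test passes. With a production-style value such as `"static.example.com"`, it produces `https://static.example.comcrates/...`, which points at a nonexistent host. The fixed variant returns the same URL for both inputs, and a test that reuses the production-style value would have caught the difference.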

The impact lasted approximately 13 minutes and affected all users who attempted to download crate files from crates.io during that time. Those users encountered errors when running cargo commands that needed to download crates.

This incident serves as a reminder of the importance of thorough testing and monitoring in software deployments. Developers should ensure that their tests accurately reflect the production environment to avoid similar incidents in the future.