Regex Engine Internals as a Library

2023/07/05
This article was written by an AI 🤖. The original article can be found here. If you want to learn more about how this works, check out our repo.

Over the years, the Rust regex crate has undergone several rewrites to improve its internal composition and to make it easier to add optimizations while maintaining correctness. As part of the most recent rewrite, a new crate called regex-automata was created, which exposes much of the regex crate's internals as their own APIs for others to use. To the author's knowledge, regex-automata is the first regex library to expose its internals to this degree as a separately versioned library.

In this article, the author discusses the problems that led to the rewrite and how the rewrite addressed them, and provides a guided tour of regex-automata's API. The article is aimed at Rust programmers and at anyone interested in how a finite automata regex engine is implemented. Prior experience with regular expressions is assumed.

Table of Contents

  • Brief history
  • The problems faced by the regex crate

The article begins with the history of the Rust regex crate, starting with a request in September 2012 to add a regex library to the Rust Distribution. The author mentions Graydon Hoare's preference for RE2, a regex engine that uses finite automata to guarantee O(m * n) worst-case search time, where m is proportional to the size of the regex and n is proportional to the size of the haystack. Inspired by RE2, the author began working on a regex engine and published an RFC to add a regex crate to the Rust Distribution. This predated both Rust 1.0 and Cargo.
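To make the O(m * n) bound concrete, here is a minimal sketch of the general technique behind it: simulating a nondeterministic finite automaton by tracking a set of active states, so that each haystack byte costs at most work proportional to the number of states. This is not the regex crate's or RE2's actual code. The names `State`, `example_nfa`, `add_closure`, and `is_full_match` are hypothetical, and the automaton is built by hand for the pattern `a(b|c)*d` rather than compiled from a pattern string.

```rust
/// One NFA state. `usize` values are indices into the state list.
#[derive(Clone, Copy)]
enum State {
    /// Consume exactly this byte, then move to the given state.
    Byte(u8, usize),
    /// Epsilon transition to both targets (used for alternation and loops).
    Split(usize, usize),
    /// Accepting state.
    Match,
}

/// Hand-built NFA for the pattern `a(b|c)*d` (illustrative only).
fn example_nfa() -> Vec<State> {
    vec![
        State::Byte(b'a', 1), // 0: 'a'
        State::Split(2, 5),   // 1: enter the loop body or leave it
        State::Split(3, 4),   // 2: choose 'b' or 'c'
        State::Byte(b'b', 1), // 3: 'b', then back to the loop
        State::Byte(b'c', 1), // 4: 'c', then back to the loop
        State::Byte(b'd', 6), // 5: 'd'
        State::Match,         // 6: accept
    ]
}

/// Adds `start`, and every state reachable from it through epsilon
/// transitions, to `set`. `seen` ensures each state is expanded at most
/// once per step, which is what keeps a step linear in the NFA size.
fn add_closure(nfa: &[State], start: usize, set: &mut Vec<usize>, seen: &mut [bool]) {
    let mut stack = vec![start];
    while let Some(id) = stack.pop() {
        if seen[id] {
            continue;
        }
        seen[id] = true;
        match nfa[id] {
            State::Split(a, b) => {
                stack.push(a);
                stack.push(b);
            }
            // Byte and Match states are kept in the active set.
            _ => set.push(id),
        }
    }
}

/// Returns true if the NFA matches the entire haystack. There is no
/// backtracking: every byte is examined once against the active set.
fn is_full_match(nfa: &[State], haystack: &[u8]) -> bool {
    let mut current = Vec::new();
    let mut next = Vec::new();
    let mut seen = vec![false; nfa.len()];
    add_closure(nfa, 0, &mut current, &mut seen);
    for &byte in haystack {
        seen.fill(false);
        next.clear();
        for &id in &current {
            if let State::Byte(b, target) = nfa[id] {
                if b == byte {
                    add_closure(nfa, target, &mut next, &mut seen);
                }
            }
        }
        std::mem::swap(&mut current, &mut next);
    }
    current.iter().any(|&id| matches!(nfa[id], State::Match))
}
```

With these definitions, `is_full_match(&example_nfa(), b"abcbd")` returns `true` and `is_full_match(&example_nfa(), b"ab")` returns `false`. Because the active set never holds more than one entry per state, the outer loop runs n times and each iteration does work bounded by the size of the automaton, which is where the m * n product comes from.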

In May 2016, the author wrote an RFC to release regex 1.0, and regex 1.0 was eventually released in May 2018. Before that release, however, the author had already been working on a complete overhaul of the crate internals. In March 2020, the author started rewriting the matching engines, and after a little more than three years of work, regex 1.9 was released with the completed rewrite.

The article then highlights the problems faced by the regex crate that prompted the rewrite: the need for better internal composition and for a way to add optimizations without sacrificing correctness. The rewrite of the regex-syntax crate was intended as the first phase of an effort to overhaul the entire regex crate.

For developers interested in using regex-automata, the article provides a guided tour of its API, showcasing the extensive access to the regex crate internals. This level of access allows developers to build upon the regex-automata library and create their own custom regex engines with tailored optimizations.

In conclusion, the article sheds light on the evolution of the Rust regex crate and the creation of the regex-automata library. It highlights the challenges faced by the regex crate and how the rewrite addressed those challenges. For Rust programmers and regex enthusiasts, this article serves as a valuable resource for understanding the implementation of a finite automata regex engine and exploring the possibilities offered by regex-automata.