This post does not introduce regex or cover its basics. Instead, it discusses one possible strategy for constructing and understanding lengthy regular expressions, so it assumes the reader is already somewhat familiar with regex.

Introduction

As a software engineer, you stumble upon regular expressions sooner rather than later. At first glance, they might look a little intimidating or even nonsensical. One could say that the entangled sequence of characters looks like Egyptian hieroglyphs. However, if one understands and masters regex, one might become the Lord of the Rings in the string-searching universe. 🧙‍♂️

Let’s get into it

Why the ‘Regex that matches the entire universe’ title, you may wonder? Well, this post is about tackling very lengthy regular expressions, such as the one below.

([\\s]{3,10})(\\w)+(\\s)+(([0-9]{1,2}((\\–|\\—|-)([0-9]{1,2}(:[0- 9]{1,2}[A-Z]{1,2}|[A-Z]{1,2}))|:[0-9]{1,2}([A-Z]{1,2}(\\–|\\—| -)[0-9]{1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2})|(\\–|\\—|-)[0-9 ]{1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2}))|[A-Z]{1,2}(\\–|\\—|- )[0-9]{1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2})))|Collaboratio n|Silver)(\\s\\s|(([0-9]{1,2}((\\–|\\—|-)([0-9]{1,2}(:[0-9]{1,2} [A-Z]{1,2}|[A-Z]{1,2}))|:[0-9]{1,2}([A-Z]{1,2}(\\–|\\—|-)[0-9] {1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2})|(\\–|\\—|-)[0-9]{1,2}( :[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2}))|[A-Z]{1,2}(\\–|\\—|-)[0-9]{ 1,2}(:[0-9]{1,2}[A-Z]{1,2}|[A-Z]{1,2})))|Collaboration|Silve r)\\s\\s)(\\w)+(\\s{1})(\\w)+(\\s{1})(\\w)+(\\s{1})(\\w)+(\\s{1})(\\w)+

The expression above might seem a little tricky, right? If you had to construct or understand something similar, how would you go about it?

Problems can be tackled in various ways, so the solution proposed here is just one way to approach constructing or understanding lengthy regular expressions.

Solution

If you have to build a lengthy regex that matches a specific pattern, you may want to construct a tree-like structure that lets you quickly identify all possible cases. Then it is just a matter of carefully translating every branch into regex. Let’s take an easy example that illustrates this.

Let’s say you want to identify time intervals in a text. You know that these intervals have the following format: hours-hours. For example:

12-5PM
2-5PM
5PM-1AM etc.

Construct the tree-like structure

This will help you identify all possible cases.

number(s)
├── - → number(s) → xM
└── xM → - → number(s) → xM

After constructing the tree-like structure above, you can easily check whether you’ve taken all cases into account. For example, 2-5PM is obtained by following number(s) → - → number(s) → xM, and 5PM-1AM by following number(s) → xM → - → number(s) → xM.
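
The check described above can also be done mechanically. The following is a minimal Python sketch, not part of the original approach: it encodes the tree as a nested dictionary, lists every root-to-leaf path, and confirms that each example interval follows one of them. The names TREE, paths and tokenize are hypothetical, and the sketch assumes that xM stands for AM or PM and that the separator is a plain hyphen.

```python
import re

# Hypothetical encoding of the tree: each key is a token, each value holds its
# children. An empty dict marks a leaf, i.e. the end of a valid interval.
TREE = {
    "number(s)": {
        "-": {"number(s)": {"xM": {}}},
        "xM": {"-": {"number(s)": {"xM": {}}}},
    }
}


def paths(tree, prefix=()):
    """Yield every root-to-leaf path as a tuple of token names."""
    for token, children in tree.items():
        current = prefix + (token,)
        if children:
            yield from paths(children, current)
        else:
            yield current


# Assumption: xM means AM or PM, and the separator is a plain hyphen.
TOKEN_PATTERN = re.compile(r"(?P<num>[0-9]+)|(?P<dash>-)|(?P<xm>[AP]M)")
TOKEN_NAMES = {"num": "number(s)", "dash": "-", "xm": "xM"}


def tokenize(text):
    """Split an example such as '5PM-1AM' into the token names used in the tree."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}: {text!r}")
        tokens.append(TOKEN_NAMES[match.lastgroup])
        pos = match.end()
    return tuple(tokens)


ALL_PATHS = set(paths(TREE))

for example in ["12-5PM", "2-5PM", "5PM-1AM"]:
    covered = tokenize(example) in ALL_PATHS
    print(example, "is covered" if covered else "is NOT covered")
```

If an example is reported as not covered, the tree is missing a branch, and that gap is much easier to spot at this stage than after everything has been translated into regex.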

Construct the regular expression

For this particular example, one possible regex is: