Pony Foo

Ramblings of a degenerate coder

Learn Regular Expressions

Regular expressions are a fundamental tool that every programmer should understand, at least at a basic level. I might not make a regex expert out of you, but by the end you should be able to read what they do and write simple ones yourself.

TL;WR cheatsheet

I'll try my best not to drown you in a sea of technical stuff, but regular expressions are a complicated matter, and learning them is no easy feat.

When not to use a regex

Regex is a frighteningly powerful tool when applied properly. That doesn't mean you should apply it to everything that is made of strings. Parsing HTML with regex is just wrong, and a considerable waste of time.

Regular expressions are a trap. They're hard to read, and interpreting their intent can be a nightmare. They're hard to debug, and the developer next to you probably has no idea what they're doing.

They are convenient, though, when parsing text looking for certain patterns.

The Basics

A regular expression denotes a pattern you want to match in a particular string. Let's look at an example:

var test = /my ca[rt] ate? (the|[2-4]) [sp]lums?/;

// the above regex matches all strings below
var strings = [
    'my cat ate the plum',
    'my cat ate 3 plums',
    'my car at the slums'
];

Think of regex as regular strings. If a character isn't a special marker, it will match that character. Therefore, the /my ca/ regex will match the 'my ca' sub-string.

The special / character marks the beginning and the end of a regular expression in JavaScript. You can also create a regex with the constructor, new RegExp('my ca'), but I recommend the literal /my ca/ form.
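Here's a minimal sketch comparing the two forms. The variable names are just for illustration. The second half shows one likely reason to prefer the literal form (my assumption, not something stated above): the constructor takes a plain string, so every backslash has to be doubled.

```javascript
// Two ways to build the same pattern; both match the 'my ca' sub-string.
var literal = /my ca/;
var constructed = new RegExp('my ca');

var literalMatches = literal.test('my cat ate the plum');         // true
var constructedMatches = constructed.test('my cat ate the plum'); // true

// The constructor receives a string, so a backslash must be written twice
// to survive string escaping. Both of these describe the same pattern.
var dotLiteral = /\./;
var dotConstructed = new RegExp('\\.');
var sameSource = dotLiteral.source === dotConstructed.source;     // true
```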

Next up we have [ and ]. The brackets define a character class, which matches any single one of the characters listed inside. [rt] means we'll match either an 'r', or a 't'. For example, /ba[rz]/ matches both 'bar' and 'baz'.

The second special expression we have is ?. This is a quantifier. Quantifiers determine how the previous expression is matched. ? means the previous expression is optional. /ate?/ will match both 'ate' and 'at'.

Usually, we want to do this with longer expressions than just a single character. In this case, we will use groups. These are expressions enclosed in parentheses. /foo (bar )?baz/, for instance, will match both 'foo baz', and 'foo bar baz'.
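The three patterns above can be checked quickly in a console. This is only a sketch; the results object and its keys are names I made up for the example.

```javascript
var classRegex = /ba[rz]/;           // 'ba' followed by an 'r' or a 'z'
var optionalChar = /ate?/;           // the trailing 'e' is optional
var optionalGroup = /foo (bar )?baz/; // the whole 'bar ' group is optional

var results = {
  bar: classRegex.test('bar'),                // true
  baz: classRegex.test('baz'),                // true
  bat: classRegex.test('bat'),                // false, 't' isn't in [rz]
  ate: optionalChar.test('ate'),              // true
  at: optionalChar.test('at'),                // true
  fooBaz: optionalGroup.test('foo baz'),      // true
  fooBarBaz: optionalGroup.test('foo bar baz') // true
};
```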

That brings us to the next portion of our regex, (the|[2-4]). Here we used a group, but we have several other special characters. The [2-4] expression is a range, and it means either 2, 3, or 4. Anything in the inclusive 2-4 range.

The other special character in this portion is |. It's effectively a logical OR, and it means we should match either side of the expression. In the end, this group will match any of the following: 'the', '2', '3', or '4'.
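Putting the pieces together, here's a small sketch that runs the regex from the first example against the original strings, plus a few strings I made up that should not match.

```javascript
var test = /my ca[rt] ate? (the|[2-4]) [sp]lums?/;

var matching = [
  'my cat ate the plum',
  'my cat ate 3 plums',
  'my car at the slums'
];

// hypothetical counter-examples, not from the original list
var nonMatching = [
  'my cat ate 5 plums',  // 5 is outside the [2-4] range
  'my cab ate the plum', // 'b' isn't in [rt]
  'my cat ate plums'     // the (the|[2-4]) group isn't optional
];

var allMatch = matching.every(function (s) { return test.test(s); });      // true
var noneMatch = nonMatching.every(function (s) { return !test.test(s); }); // true
```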

Modifiers

Are you keeping up? Good! I'm glad I'm not as cryptic as I thought I would be. We'll look at a few more regex examples, but first, let's talk about modifiers.

In JavaScript, modifiers can be provided with the /regex/modifiers form, such as /foo/i, or using the constructor form, new RegExp('foo', 'i').

These are some of the most common modifiers you can use.

  • i: Case insensitivity. Allows /foo/i to match 'foo', 'FOO', etc.
  • g: Global. Matching doesn't stop after the first match.
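A quick sketch of both modifiers in action. I'm using String.prototype.match here to show the effect of g, and the variable names are only illustrative.

```javascript
var caseSensitive = /foo/.test('FOO');                   // false
var caseInsensitive = /foo/i.test('FOO');                // true
var viaConstructor = new RegExp('foo', 'i').test('Foo'); // true

// Without g, match stops at the first match.
var firstOnly = 'foo foo foo'.match(/foo/);
var firstCount = firstOnly.length;                       // 1

// With g, match returns every match in the string.
var all = 'foo foo foo'.match(/foo/g);
var allCount = all.length;                               // 3
```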

Anchors

^ represents the start of a string. Similarly, $ represents the end.

For example, in the string 'who let the dogs out? never let them out!', the regex /out[!?]$/ will match 'out!' at the end of the string, but it won't match 'out?'.

A commonly used modifier I purposely left out of the previous section is m, the multi-line modifier. With this modifier, anchors work on a line-by-line basis, rather than on the whole string.
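To see the anchors at work, here's a short sketch. The two-line string is one I made up to show what m changes.

```javascript
var lyric = 'who let the dogs out? never let them out!';

var startsWithWho = /^who/.test(lyric);    // true
var startsWithLet = /^let/.test(lyric);    // false, 'let' isn't at the start
var endMatch = lyric.match(/out[!?]$/)[0]; // 'out!'

// hypothetical two-line input
var lines = 'dogs out?\nstay out!';

// Without m, $ only matches at the very end of the string.
var withoutM = lines.match(/out[!?]$/g);   // ['out!']

// With m, $ also matches at the end of each line.
var withM = lines.match(/out[!?]$/gm);     // ['out?', 'out!']
```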

Quantifiers

Quantifiers let you repeat patterns while staying DRY.

  • ?. The one we've covered, optionally matches the preceding expression.
  • +. The preceding expression must occur at least once.
  • *. The preceding expression can occur zero, one, or more times.
  • {n}. The preceding expression has to occur n times.
  • {n,}. The preceding expression has to occur at least n times.
  • {n,m}. The preceding expression has to occur n to m times.
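Here's a rough sketch that exercises each quantifier from the list. I've anchored every pattern with ^ and $ so that the whole string has to fit, which makes the counts easier to see. The example strings are made up.

```javascript
var plus = /^lo+l$/;           // one or more 'o'
var star = /^lo*l$/;           // zero or more 'o'
var exact = /^[0-9]{3}$/;      // exactly three digits
var atLeast = /^[0-9]{2,}$/;   // two or more digits
var between = /^[0-9]{2,4}$/;  // two to four digits

var results = {
  plusOne: plus.test('lol'),          // true
  plusMany: plus.test('loool'),       // true
  plusNone: plus.test('ll'),          // false, + needs at least one
  starNone: star.test('ll'),          // true, * accepts zero
  exactThree: exact.test('123'),      // true
  exactFour: exact.test('1234'),      // false
  atLeastOne: atLeast.test('1'),      // false
  atLeastFive: atLeast.test('12345'), // true
  betweenFour: between.test('1234'),  // true
  betweenFive: between.test('12345')  // false
};
```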

Built-in patterns

Some very simple patterns are built into regular expressions. Here are the most useful ones.

  • . means any character, except the new-line
  • \ will escape any character. If you need to match an actual dot, you can use /\./
  • \s matches whitespace. \S is any non-whitespace
  • \d matches digits, effectively the same as [0-9]. \D negates it
  • \w matches word characters, the same as [A-Za-z0-9_]. \W is the opposite
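A few quick checks for the built-in patterns, as a sketch. The sample strings are invented for the example.

```javascript
var dotAny = /a.c/.test('abc');          // true, . matches 'b'
var dotNewline = /a.c/.test('a\nc');     // false, . doesn't match a new-line
var escapedDot = /a\.c/.test('abc');     // false, \. only matches a real dot
var literalDot = /a\.c/.test('a.c');     // true

// \s matches spaces and tabs alike
var words = 'a b\tc'.split(/\s/);        // ['a', 'b', 'c']

// \d+ grabs each run of digits
var numbers = 'call 555-0199'.match(/\d+/g); // ['555', '0199']

// \w covers letters, digits, and the underscore, but not the dash
var snake = /^\w+$/.test('snake_case_42');   // true
var kebab = /^\w+$/.test('kebab-case');      // false
```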

Groups

Groups are useful for replacing patterns. I'll cover that in a minute.

There are two kinds of groups: capturing, and non-capturing. Capturing groups are the groups we've been talking about so far. They're enclosed in parentheses, as in /S(\d{2})E(\d{2})/, which will match strings such as 'S11E18'. It will capture the values 11 and 18.

Capturing is important to perform replacements, one of the fundamental uses of regex. But sometimes, we want groups for other reasons, for example, when we wrote (the|[2-4]), we did so to keep the OR contained in just that portion of the regex.

In these cases, we'll want to use the non-capturing group syntax. This means adding ?: to our pattern, like this: (?:the|[2-4]).
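Here's a small sketch of the difference, using String.prototype.match to inspect what was captured. The 'ate the plum' input is just an illustration.

```javascript
var episode = /S(\d{2})E(\d{2})/;
var match = 'S11E18'.match(episode);

var whole = match[0];      // 'S11E18', the full match
var season = match[1];     // '11', first capturing group
var episodeNo = match[2];  // '18', second capturing group

// A capturing group adds an entry to the match result...
var capturing = 'ate the plum'.match(/ate (the|[2-4]) plum/);
var capturedCount = capturing.length;       // 2: full match plus one group

// ...while a non-capturing group only groups.
var nonCapturing = 'ate the plum'.match(/ate (?:the|[2-4]) plum/);
var nonCapturedCount = nonCapturing.length; // 1: just the full match
```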

Replacements

To replace a string using a regex in JS, we can use String.prototype.replace, passing a regex as the first parameter. Let's do an example.

// a pretty non-sensical example
'the cow is a cow, but the cat is not a cow.'.replace(/cow/, 'dog');

// the result is
'the dog is a cow, but the cat is not a cow.'

// not quite what we wanted, we forgot to add the g modifier.
'the cow is a cow, but the cat is not a cow.'.replace(/cow/g, 'dog');

// the result now is
'the dog is a dog, but the cat is not a dog.'

You could also use $1, $2, and so on, in the replacement string. These will get replaced with the group captured when matching your regex. Another example!

// more non-sense please
var rimportant = /(\d+|boss)/g,
    remphasize = '<em>$1</em>';

'build 102 errored... tell the boss it failed!'.replace(rimportant, remphasize);

// results in
'build <em>102</em> errored... tell the <em>boss</em> it failed!'

Alternatively, you could provide a function callback as a replacement parameter, and it will be invoked once for each match. You can find more info on that on MDN.
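As a rough sketch of the callback form: the function receives the full match first, followed by each captured group, and whatever it returns is used as the replacement. The input string and the doubling logic are made up for the example.

```javascript
var rnumber = /(\d+)/g;

// Called once per match; `digits` is the first captured group.
var doubled = 'build 102 took 7 minutes'.replace(rnumber, function (match, digits) {
  return String(parseInt(digits, 10) * 2);
});
// doubled is 'build 204 took 14 minutes'

// For comparison, the same groups through $1 and $2 in a replacement string.
var described = 'S11E18'.replace(/S(\d{2})E(\d{2})/, 'season $1, episode $2');
// described is 'season 11, episode 18'
```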

You could also use RegExp.prototype.test to test whether a pattern matches the provided string.
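A minimal sketch of test, with one caveat worth knowing: a regex with the g modifier remembers where its last match ended (in its lastIndex property), so calling test repeatedly on the same regex object can give surprising results. The strings are made up.

```javascript
var rimportant = /(\d+|boss)/;

var hasImportant = rimportant.test('tell the boss it failed!'); // true
var nothing = rimportant.test('all good here');                 // false

// With g, the regex is stateful between calls.
var stateful = /boss/g;
var first = stateful.test('boss');  // true, lastIndex moves to 4
var second = stateful.test('boss'); // false, search resumed at index 4
var third = stateful.test('boss');  // true, lastIndex was reset to 0
```

If you only need a yes or no answer, leaving g off avoids that state entirely.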

Conclusion

In this post we've looked at some of the most common regex patterns. Most importantly, we've looked at how to build simple regular expressions. I intentionally left out a complicated subset of regular expressions, assertions, which I'll definitely cover at some point in the future.

I test out my regular expressions using this online REGex Tester tool, or directly in my browser if they are simple enough.

TL;DR cheatsheet

Comments(5)

Eric Herrmann

It's my first time on your blog. Before reading it I have to say: Your site looks just fantastic!

Now carry on.

Nicolas Bevacqua

Neat, thanks!

What did you think of the article?

Eric Herrmann

I'm just a tech editor, so I don't unterstand a word. And I actually came here for your jQuery exit strategy guide.

:-D

But keep in touch

Paul Dariye

Great read! Please help me explain how to implement a regex in a hash or array to return matching strings and substrings, like in a dictionary.

James Rolan

Great tutorial. And you'll learn regular expression easier with a online regular expression tester & explain tool likes http://liveregex.com. The tool is useful. Give it a try. :D
