How I learned to stop worrying and
love Regular Expressions


Jordi Boggiano
@seldaek

Building the internet for over 10 years
  → seld.be

Symfony core team, Composer lead and more OSS
  → github.com/Seldaek

Symfony, app architecture & performance consulting

Dev at Teamup.com

History



1956 Stephen Cole Kleene - regular languages and regular sets

1968 Ken Thompson - QED uses it to match patterns in text files

1971 Unix and ed

1974 g/re/p in ed is "Global search for Regular Expression and Print matching lines"

History

1987 Perl ships with regular expressions and gains many new features over the years

1997 PCRE (Perl Compatible Regular Expressions), C lib, used by PHP

2003 Oxford English Dictionary adds "grep"

At some point along the line, regular expressions became regexes

Regex Concepts

Regex Components

Pattern

Subject string

Matches

Pattern Components

Characters

You



Metacharacters

Y.u\s

You can't fight in here. This is the War Room!

Pattern Components

Escaping metachars

Y\.u\\s

\QY.u\s\E

Y.u\s
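The escaping rules above can be checked from PHP. This is a minimal sketch with made-up sample strings; preg_quote() is PHP's built-in helper for escaping arbitrary input and is not part of the slide itself.

```php
<?php
// Subject containing the literal text Y.u\s (single quotes keep the backslash).
$literal = 'Y.u\s';

// Unescaped: . matches any char and \s matches whitespace, so "You " matches.
$loose = preg_match('{Y.u\s}', 'You ');            // 1

// Escaped by hand: \. is a literal dot, \\\\ in PHP source is \\ in the regex.
$byHand = preg_match('{Y\.u\\\\s}', $literal);     // 1

// \Q...\E quotes everything in between.
$quoted = preg_match('{\QY.u\s\E}', $literal);     // 1
$quotedLoose = preg_match('{\QY.u\s\E}', 'You ');  // 0

// preg_quote() builds the escaped form for you.
$auto = preg_match('{' . preg_quote($literal) . '}', $literal); // 1
```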

Character Classes

[abcd]
Is it that bad, sir?

With ranges

[a-d]
Is it that bad, sir?

[A-z]+
=>?@ABCDEFGHIJKLMNOPQRSTUVWX
YZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}
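Why [A-z] is a trap: the range covers every ASCII code between A and z, including the six punctuation characters that sit between Z and a. A small sketch with an invented subject string:

```php
<?php
// [A-z] spans ASCII 65..122, which includes [ \ ] ^ _ ` between Z and a.
$sloppy = preg_match('{^[A-z]+$}', 'foo_bar^');    // 1
$strict = preg_match('{^[A-Za-z]+$}', 'foo_bar^'); // 0
$letters = preg_match('{^[A-Za-z]+$}', 'fooBar');  // 1
```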


Negated classes

[^a-z ]
Is it that bad, sir?

Escaping

[\\^\]-]
\^]-

Character Classes

Match ASCII control characters

[\x00-\x1f]


Matching unicode characters

[\x{25E6}-\x{25EF}]+
◦◧◨◩◪◫◬◭◮◯

[◦-◯]+
◦◧◨◩◪◫◬◭◮◯

Character Class Shortcuts

Word chars

\w = [A-Za-z0-9_]

\W = [^A-Za-z0-9_]


Digits

\d = [0-9]

\D = [^0-9]

Whitespace chars

\s = [ \t\r\n\v\f]

\S = [^ \t\r\n\v\f]


Unicode char classes

See PHP PCRE docs

Dot Metacharacter

. = [^\n]

. = [\s\S]   with s (dotall) modifier
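A two-line check of the dot behaviour, assuming PCRE's default newline handling and a made-up subject:

```php
<?php
$subject = "a\nb";
$default = preg_match('{a.b}', $subject);  // 0: . does not match \n
$dotall  = preg_match('{a.b}s', $subject); // 1: with s, . matches any char
```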

Subpatterns and Alternations

bo(mb|ys)      Stay on the bomb run, boys!



(?P<first>\w)\w+(?P<last>\w)

Just one second, operator.

preg_match_all('{(?P<first>\w)\w+(?P<last>\w)}', 'Just one second, operator.', $matches);
$matches['first']; // ['J', 'o', 's', 'o']
$matches['last'];  // ['t', 'e', 'd', 'r']
                

Subpatterns and Alternations

bob|bobby
bob & bobby

bobby|bob
bob & bobby

bob(?:by)?
bob & bobby
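The three patterns above, run with preg_match_all() so the effect of alternation order is visible. Variable names are only for illustration.

```php
<?php
$subject = 'bob & bobby';

// Alternatives are tried left to right; the first one that works wins.
preg_match_all('{bob|bobby}', $subject, $shortFirst);
preg_match_all('{bobby|bob}', $subject, $longFirst);
preg_match_all('{bob(?:by)?}', $subject, $optional);

// $shortFirst[0] is ['bob', 'bob']    (bobby is never reached)
// $longFirst[0]  is ['bob', 'bobby']
// $optional[0]   is ['bob', 'bobby']
```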

Quantifiers

Cope with uncertainty in the subject

? 0-1 times

* 0-∞ times

+ 1-∞ times

{n,m} n-m times

Quantifiers


Jamaicans?
Try one of these Jamaican cigars, Ambassador.

sto*
Thank you, no. I do not support the work of imperialist stooges.

o+
Oh, only commie stooges, huh?

(aa){1,2}
aaaaaaaaaaa

Lazy Quantifiers

Match as few times as possible


(aa){1,2}
aaaaaaaaaaa

(aa){1,2}?
aaaaaaaaaaa

<em>(.+)</em>
<em>lala</em> and <em>lulu</em>

<em>(.+?)</em>
<em>lala</em> and <em>lulu</em>
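Greedy versus lazy, using the subjects from the slide. A minimal sketch; the run of a's is built with str_repeat() so its length (11) is explicit.

```php
<?php
$html = '<em>lala</em> and <em>lulu</em>';
preg_match_all('{<em>(.+)</em>}', $html, $greedy);
preg_match_all('{<em>(.+?)</em>}', $html, $lazy);
// $greedy[1] is ['lala</em> and <em>lulu']
// $lazy[1]   is ['lala', 'lulu']

$as = str_repeat('a', 11);
preg_match_all('{(aa){1,2}}', $as, $greedyPairs);
preg_match_all('{(aa){1,2}?}', $as, $lazyPairs);
// $greedyPairs[0] is ['aaaa', 'aaaa', 'aa']
// $lazyPairs[0]   is ['aa', 'aa', 'aa', 'aa', 'aa']
```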

Possessive Quantifiers

Match as much as possible and never give characters back (no backtracking into the quantifier)


trig+ger

How is it possible for this thing to be triggered automatically and at the same time impossible to untriggger?



trig++ger

How is it possible for this thing to be triggered automatically and at the same time impossible to untriggger?
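What the two patterns do to that subject: g+ can hand a g back so that "ger" still fits, g++ cannot, so trig++ger can never match anything. The third pattern is an extra illustration, not from the slide.

```php
<?php
$subject = 'How is it possible for this thing to be triggered automatically '
    . 'and at the same time impossible to untriggger?';

$greedyCount = preg_match_all('{trig+ger}', $subject, $greedy);
$possessiveCount = preg_match_all('{trig++ger}', $subject, $possessive);
// $greedy[0] is ['trigger', 'triggger']; $possessiveCount is 0

// Possessive is fine when nothing after it needs one of the swallowed g's.
preg_match_all('{trig++er}', $subject, $fixed);
// $fixed[0] is ['trigger', 'triggger']
```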

Anchors

^I|t$      I don't give a hoot in Hell how you do it



With m modifier:


^[^@]+@[a-z]+(\.[a-z]+)+$

\nfoo@bar.com\nbaz@qux.co.uk\n

Anchors for Validation

^[^@]+@[a-z]+(\.[a-z]+)+$
foo@bar.com\n



\A[^@]+@[a-z]+(\.[a-z]+)+\z
foo@bar.com\n



\A[^@]+@[a-z]+(\.[a-z]+)+\z
foo@bar.com

\A = absolute beginning, \z = absolute end
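The same comparison as runnable PHP. A sketch only: the deliberately simple email pattern is the one from the slide, not a recommended validator, and the D modifier line is an added illustration.

```php
<?php
$email = '[^@]+@[a-z]+(\.[a-z]+)+';
$withNewline = "foo@bar.com\n";

// $ also matches just before a trailing newline, so this passes.
$dollar = preg_match('{^' . $email . '$}', $withNewline);      // 1

// \A and \z only match at the very start and very end.
$absolute = preg_match('{\A' . $email . '\z}', $withNewline);  // 0

// The D modifier makes $ behave like \z.
$dollarOnly = preg_match('{^' . $email . '$}D', $withNewline); // 0

$clean = preg_match('{\A' . $email . '\z}', 'foo@bar.com');    // 1
```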

Back-references to Subpatterns

['"]\w+['"]
'single' "double" 'mixed"



(['"])\w+\1
'single' "double" 'mixed"

(?P<quote>['"])\w+(?P=quote)
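The three quote patterns in PHP. Escaping the single quotes is needed only because the patterns live in single-quoted PHP strings.

```php
<?php
$subject = '\'single\' "double" \'mixed"';

// Any quote on either side: the mismatched pair slips through.
preg_match_all('{[\'"]\w+[\'"]}', $subject, $any);

// \1 must repeat exactly what group 1 captured.
preg_match_all('{([\'"])\w+\1}', $subject, $same);

// Same idea with a named group.
preg_match_all('{(?P<quote>[\'"])\w+(?P=quote)}', $subject, $named);

// $any[0]   is ["'single'", '"double"', "'mixed\""]
// $same[0]  is ["'single'", '"double"']
// $named[0] is ["'single'", '"double"']
```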

Lookahead & Lookbehind

Let's match words surrounded by *'s

\*\w+\*
Of course, the whole point of a Doomsday Machine is lost, if you *keep* it a *secret*!


(?<=\*)\w+(?=\*)
is lost, if you *keep* it a *secret*!


(?<!\*)\w+(?!\*)
is lost, if you *keep* it a *secret*!

Word boundary metacharacter

\b = (?<!\w)(?=\w)|(?<=\w)(?!\w)



\b(?<!\*)\w+(?!\*)\b
is lost, if you *keep* it a *secret*!
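All four lookaround patterns against the shortened subject. The third result shows why the \b anchors are needed: without them the engine backtracks into the middle of the starred words.

```php
<?php
$subject = 'is lost, if you *keep* it a *secret*!';

preg_match_all('{\*\w+\*}', $subject, $withStars);
preg_match_all('{(?<=\*)\w+(?=\*)}', $subject, $inside);
preg_match_all('{(?<!\*)\w+(?!\*)}', $subject, $partial);
preg_match_all('{\b(?<!\*)\w+(?!\*)\b}', $subject, $outside);

// $withStars[0] is ['*keep*', '*secret*']
// $inside[0]    is ['keep', 'secret']
// $partial[0]   is ['is', 'lost', 'if', 'you', 'ee', 'it', 'a', 'ecre']
// $outside[0]   is ['is', 'lost', 'if', 'you', 'it', 'a']
```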

Conditionals

(?(back-reference)yes-pattern)



(?(back-reference)yes-pattern|no-pattern)
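A possible use of both forms, with invented patterns: require a closing parenthesis only if an opening one was captured, and pick between two terminators depending on whether group 1 matched.

```php
<?php
// yes-pattern only: ")" is required if and only if group 1 matched "(".
$parens = '{^(\()?\d+(?(1)\))$}';
$a = preg_match($parens, '(123)'); // 1
$b = preg_match($parens, '123');   // 1
$c = preg_match($parens, '(123');  // 0
$d = preg_match($parens, '123)');  // 0

// yes|no: ">" after an opening "<", otherwise ";".
$either = '{^(<)?\w+(?(1)>|;)$}';
$e = preg_match($either, '<tag>'); // 1
$f = preg_match($either, 'tag;');  // 1
$g = preg_match($either, 'tag>');  // 0
$h = preg_match($either, '<tag;'); // 0
```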

Pattern Delimiters

/foo/

/https?:\/\/([^\/]+)\//

{https?://([^/]+)/}



In PHP, use single quotes to avoid over-escaping

// find literal backslashes at end of string
preg_match('{\\\\$}', $str); // => {\\$}
                    

Modifiers

{...}i Case insensitive ([a-z] = [A-Za-z])

{...}m Multiline (^/$ per line)

{...}s Single line / Dotall (. = [\S\s])

{...}u Unicode

{...}D Dollar end only ($ = \z)

{...}x Free-spacing mode + Comments

Readability

(?#Comments can help readability)

{[◦-◯]+ \x{1F4A9}}u

{
    [◦-◯]+     # Any googly-eye-range char(s)
    \          # and a single space
    \x{1F4A9}  # Followed by pile of poo character
}ux

◭◮ 💩

Readability

{
(?(DEFINE)
    # idstring: 1*( ALPHA / DIGIT / - / . )
    (?<idstring>[\pL\pN\-\.]{1,})

    # license-id: taken from list
    (?<licenseid>${licenses})

    # license-exception-id: taken from list
    (?<licenseexceptionid>${exceptions})

    # license-ref: [DocumentRef-1*(idstring):]LicenseRef-1*(idstring)
    (?<licenseref>(?:DocumentRef-(?&idstring):)?LicenseRef-(?&idstring))

    # simple-expression: license-id / license-id+ / license-ref
    (?<simple_expression>(?&licenseid)\+? | (?&licenseid) | (?&licenseref))
    # ...
                

Readability

{
    # ...
    # compound expression: 1*(
    #   simple-expression /
    #   simple-expression WITH license-exception-id /
    #   compound-expression AND compound-expression /
    #   compound-expression OR compound-expression
    # ) / ( compound-expression ) )
    (?<compound_head>
        (?&simple_expression) ( \s+ (?:WITH) \s+ (?&licenseexceptionid))?
            | \( \s* (?&compound_expression) \s* \)
    )
    (?<compound_expression>
        (?&compound_head) (?: \s+ (?:AND|OR) \s+ (?&compound_expression))?
    )

    # license-expression: 1*1(simple-expression / compound-expression)
    (?<license_expression>(?&compound_expression) | (?&simple_expression))
) # end of define

^(NONE | NOASSERTION | (?&license_expression))$
}x
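The (?(DEFINE)...) technique on a much smaller problem. This is an illustrative sketch, not part of the license parser above: the group name "octet" and the IPv4 check are invented for the example.

```php
<?php
$ipv4 = '{
    (?(DEFINE)
        (?<octet> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d )
    )
    \A (?&octet) (?: \. (?&octet) ){3} \z
}x';

$valid    = preg_match($ipv4, '192.168.0.1'); // 1
$tooBig   = preg_match($ipv4, '256.1.1.1');   // 0
$tooShort = preg_match($ipv4, '1.2.3');       // 0
```

Groups inside DEFINE never match on their own; they only act as named subroutines that (?&name) calls later.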
                

Regex Engines

Engine Properties

First match wins

(set|setFoo)      setFoo

(setFoo|set)      setFoo

Engine Properties

An overall match is always preferred to an overall non-match

(set|setFoo)Bar      setFooBar
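Both properties in a few lines of PHP, using the patterns from the slides:

```php
<?php
// First alternative that works wins, even if a longer one exists.
preg_match('{(set|setFoo)}', 'setFoo', $first);
preg_match('{(setFoo|set)}', 'setFoo', $reordered);
// $first[0] is 'set'; $reordered[0] is 'setFoo'

// "set" leaves "FooBar", so "Bar" fails there; the engine goes back
// and tries "setFoo" to get an overall match.
preg_match('{(set|setFoo)Bar}', 'setFooBar', $whole);
// $whole[0] is 'setFooBar'; $whole[1] is 'setFoo'
```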

Engine Properties

When a path fails, the engine backtracks to the most recent choice point and tries the next option

^[^@]+@[a-z]+(\.[a-z]+)+$

foo@bar.co.uk

foo@bar.co.uk.foo.

PCRE JIT

Available in PHP 7!

pcre.jit=1


Greedy, baseline 6x, JIT crashes & returns null

"[^"\\\\]*(\\\\.[^"\\\\]*)*"


Possessive, baseline 2.2x, JIT 1x

"[^"\\\\]*+(\\\\.[^"\\\\]*+)*+"

Sample Use Cases

Code Search

Calls to fooBar() with arg #3 being "test"

->fooBar\(.*?\btest\b

$x->fooBar(LalaInterface::EXAMPLE, 3, 'test');


Object Foo modified using any setter

\bfoo->set[A-Z]

$foo->setBar(5);

$bar->foo->setFoo(5);

$nofoo->setFoo(5);
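The same searches from PHP, assuming the code lines are already in an array. preg_grep() returns the matching entries and keeps their original keys.

```php
<?php
$lines = [
    '$foo->setBar(5);',
    '$bar->foo->setFoo(5);',
    '$nofoo->setFoo(5);',
    '$x->fooBar(LalaInterface::EXAMPLE, 3, \'test\');',
];

// \b keeps "nofoo" from matching.
$setterCalls = preg_grep('{\bfoo->set[A-Z]}', $lines);
$fooBarCalls = preg_grep('{->fooBar\(.*?\btest\b}', $lines);

// array_keys($setterCalls) is [0, 1]; array_keys($fooBarCalls) is [3]
```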

Stripping

Remove a prefix if it's present

preg_replace('{^dev-}', '', $version);
                    


Trim trailing whitespace of every line

preg_replace('{\s+$}m', '', $str);
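Both replacements on sample input. One thing to watch: \s also matches newlines, so the slide's pattern swallows blank lines and the final newline. The [ \t] variant is an alternative sketch for when only spaces and tabs should go; it assumes \n line endings.

```php
<?php
$version = preg_replace('{^dev-}', '', 'dev-master');    // 'master'
$untouched = preg_replace('{^dev-}', '', 'feature-dev'); // 'feature-dev'

$text = "a  \n\nb\t\n";
$slide = preg_replace('{\s+$}m', '', $text);          // "a\nb"
$horizontal = preg_replace('{[ \t]+$}m', '', $text);  // "a\n\nb\n"
```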
                    

Stripping PHP files to bits

Finding classes, interfaces, .. without token_get_all()

// strip heredocs/nowdocs
$contents = preg_replace('{<<<\s*(\'?)(\w+)\\1(?:\r\n|\n|\r)(?:.*?)(?:\r\n|\n|\r)\\2(?=\r\n|\n|\r|;)}s', 'null', $contents);
// strip strings
$contents = preg_replace('{"[^"\\\\]*+(\\\\.[^"\\\\]*+)*+"|\'[^\'\\\\]*+(\\\\.[^\'\\\\]*+)*+\'}s', 'null', $contents);
// strip leading non-php code if needed
if (substr($contents, 0, 2) !== '<?') {
    $contents = preg_replace('{^.+?<\?}s', '<?', $contents, 1, $replacements);
    if ($replacements === 0) {
        return array();
    }
}
// strip non-php blocks in the file
$contents = preg_replace('{\?>.+<\?}s', '?><?', $contents);
// strip trailing non-php code if needed
$pos = strrpos($contents, '?>');
if (false !== $pos && false === strpos(substr($contents, $pos), '<?')) {
    $contents = substr($contents, 0, $pos);
}

preg_match_all('{
    (?:
         \b(?<![\$:>])(?P<type>class|interface|trait) \s++ (?P<name>[a-zA-Z_\x7f-\xff:][a-zA-Z0-9_\x7f-\xff:\-]*+)
       | \b(?<![\$:>])(?P<ns>namespace) (?P<nsname>\s++[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*+(?:\s*+\\\\\s*+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*+)*+)? \s*+ [\{;]
    )
}ix', $contents, $matches);
                    

Splitting

Split tags separated by spaces or commas

preg_split('/[ ,]+/', $tags);
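With sample input (the tag strings are made up). PREG_SPLIT_NO_EMPTY is worth adding when the input may start or end with a separator.

```php
<?php
$tags = preg_split('/[ ,]+/', 'php, regex  pcre,composer');
// ['php', 'regex', 'pcre', 'composer']

$raw = preg_split('/[ ,]+/', ' php, regex, ');
// ['', 'php', 'regex', '']

$trimmed = preg_split('/[ ,]+/', ' php, regex, ', -1, PREG_SPLIT_NO_EMPTY);
// ['php', 'regex']
```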
                    

grep

Use egrep/grep -E for sanity. Unescaped ( ) ? + | only act as metacharacters in extended mode; basic mode treats them as literals



Best alias it:

alias grep='grep -r --extended-regexp --color=auto --exclude-dir=.svn --exclude-dir=.git'
                    

Does it match?

q(?=u)it      quit

Nope, lookahead is zero-width and does not advance the cursor.

q(?=u)             quit
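The same answer in PHP. The last pattern is an added illustration of how to make the original idea work: the u has to be consumed after it has been peeked at.

```php
<?php
// After q, the lookahead sees u but does not consume it, so "i" meets "u".
$stuck = preg_match('{q(?=u)it}', 'quit');          // 0

// On its own the lookahead only constrains what follows q.
$peek = preg_match('{q(?=u)}', 'quit', $m);         // 1, $m[0] is 'q'

// Consume the u explicitly and the rest lines up.
$consumed = preg_match('{q(?=u)uit}', 'quit', $m2); // 1, $m2[0] is 'quit'
```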

Guidelines for Sanity

Limitations

Use regexes unless you can't

Know your domain variance and restrictions

Domain

Start very strict, avoid . and expand as needed

Matching loosely leads to unexpected results

Boundaries

Remember to use \b...\b for matching

and \A...\z or ^...$ for validation

Document

Document or split up complex regexes

Use named capture groups where possible

Now stop worrying,
and love regexes!

Thank you.

Resources

regular-expressions.info

jex.im/regulex

regex101.com

Questions?

@seldaek

slides.seld.be


Feedback:

joind.in/talk/0c752