How I learned to stop worrying and
love Regular Expressions
Jordi Boggiano
@seldaek
Building the internet for over 10 years
→ seld.be
Symfony core team, Composer lead and more OSS
→ github.com/Seldaek
Symfony, app architecture & performance consulting
Dev at Teamup.com
History
1956 Stephen Cole Kleene - regular languages and regular sets
1968 Ken Thompson - QED uses it to match patterns in text files
1971 Unix and ed
1974 g/re/p in ed is "Global search for Regular Expression and Print matching lines"
History
1987 Perl gets regular expressions and many new features over the years
1997 PCRE (Perl Compatible Regular Expressions), C lib, used by PHP
2003 Oxford English Dictionary takes grep in
At some point along the line, regular expressions became regexes
Regex Components
Pattern
Subject string
Matches
Pattern Components
Characters
You
Metacharacters
Y.u\s
You can't fight in here. This is the War Room!
Pattern Components
Escaping metachars
Y\.u\\s
\QY.u\s\E
Y.u\s
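A minimal sketch of both escaping styles in PHP. The `$literal` and `$subject` values are made up for illustration; `preg_quote()` is PHP's built-in helper for escaping text before it is embedded in a pattern.

```php
<?php
// Hypothetical text that should be matched literally, metacharacters and all.
$literal = 'Y.u\s';
$subject = 'You said Y.u\s';

// preg_quote() escapes metacharacters; the second argument also escapes the delimiter.
$quoted = preg_quote($literal, '/');

// Unescaped: . and \s act as metacharacters, so "You " matches first.
preg_match('/Y.u\s/', $subject, $loose);

// Escaped via preg_quote() or wrapped in \Q...\E: only the literal text matches.
preg_match('/' . $quoted . '/', $subject, $strict);
preg_match('/\QY.u\s\E/', $subject, $strictQ);
```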
Character Classes
[abcd]
Is it that bad, sir?
With ranges
[a-d]
Is it that bad, sir?
[A-z]+
=>?@ABCDEFGHIJKLMNOPQRSTUVWX
YZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}
Negated classes
[^a-z ]
Is it that bad, sir?
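A short sketch of the range pitfall and of a negated class in PHP. The `foo_bar^baz` subject is an invented example; the quote is the one from the slides.

```php
<?php
// [A-z] spans ASCII 0x41-0x7A, which includes [ \ ] ^ _ ` between Z and a.
$loose = preg_match('/^[A-z]+$/', 'foo_bar^baz');

// Spelling out both ranges accepts letters only.
$strict = preg_match('/^[A-Za-z]+$/', 'foo_bar^baz');

// A negated class matches everything NOT listed: here, anything but a-z and space.
$stripped = preg_replace('/[^a-z ]/', '', 'Is it that bad, sir?');
```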
Character Classes
Match ASCII control characters
[\x00-\x1f]
Matching unicode characters
[\x{25E6}-\x{25EF}]+
◥◦◧◨◩◪◫◬◭◮◯
Character Class Shortcuts
Word chars
\w = [A-Za-z0-9_]
\W = [^A-Za-z0-9_]
Digits
\d = [0-9]
\D = [^0-9]
Whitespace chars
\s = [ \t\r\n\v\f]
\S = [^ \t\r\n\v\f]
Dot Metacharacter
. = [^\n]
. = [\s\S] with s (dotall) modifier
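To see the difference, a small sketch with an invented two-line subject: the dot refuses to cross the newline unless the s modifier is set.

```php
<?php
$subject = "first\nsecond";

// Without s, the dot cannot match the newline between "t" and "s".
$withoutS = preg_match('/t.s/', $subject);

// With s (dotall), the dot matches any character, newline included.
$withS = preg_match('/t.s/s', $subject);
```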
Subpatterns and Alternations
bo(mb|ys) Stay on the bomb run, boys!
(?P<first>\w)\w+(?P<last>\w) Just one second, operator.
preg_match_all('{(?P<first>\w)\w+(?P<last>\w)}', 'Just one second, operator.', $matches);
$matches['first'] // ['J', 'o', 's', 'o']
$matches['last'] // ['t', 'e', 'd', 'r']
Subpatterns and Alternations
bob|bobby
bob & bobby
Quantifiers
Cope with uncertainty in the subject
? 0-1 time
* 0-∞ times
+ 1-∞ times
{n,m} n-m times
Quantifiers
Jamaicans?
Try one of these Jamaican cigars, Ambassador.
sto*
Thank you, no. I do not support the work of imperialist stooges.
o+
Oh, only commie stooges, huh?
(aa){1,2}
aaaaaaaaaaa
Lazy Quantifiers
Match as few times as possible
(aa){1,2}
aaaaaaaaaaa
(aa){1,2}?
aaaaaaaaaaa
<em>(.+)</em>
<em>lala</em> and <em>lulu</em>
<em>(.+?)</em>
<em>lala</em> and <em>lulu</em>
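The greedy and lazy versions of the `<em>` pattern, run side by side as a quick sketch. Only the variable names are invented.

```php
<?php
$html = '<em>lala</em> and <em>lulu</em>';

// Greedy: .+ runs to the end, then backtracks only as far as the LAST </em>.
preg_match_all('{<em>(.+)</em>}', $html, $greedy);

// Lazy: .+? stops at the FIRST </em> it can reach, so each tag pair matches separately.
preg_match_all('{<em>(.+?)</em>}', $html, $lazy);
```

Here `$greedy[1]` holds a single capture spanning both tags, while `$lazy[1]` holds `lala` and `lulu`.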
Possessive Quantifiers
Match and do not give up matches
trig+ger
How is it possible for this thing to be triggered automatically and
at the same time impossible to untriggger?
trig++ger
How is it possible for this thing to be triggered automatically and
at the same time impossible to untriggger?
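A sketch of why the possessive version fails, using the single word "triggered" as subject. The third pattern is an added illustration, not from the slides.

```php
<?php
// Greedy g+ first takes "gg", then gives one "g" back so the literal "g" can match.
$greedy = preg_match('/trig+ger/', 'triggered');

// Possessive g++ takes "gg" and never gives anything back, so the next "g" cannot match.
$possessive = preg_match('/trig++ger/', 'triggered');

// Possessive is harmless when the following token never needs a character back.
$possessiveOk = preg_match('/trig++er/', 'triggered');
```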
Anchors
^I|t$ I don't give a hoot in Hell how you do it
With m modifier:
^[^@]+@[a-z]+(\.[a-z]+)+$
\nfoo@bar.com\nbaz@qux.co.uk\n
Anchors for Validation
^[^@]+@[a-z]+(\.[a-z]+)+$
foo@bar.com\n
\A[^@]+@[a-z]+(\.[a-z]+)+\z
foo@bar.com\n
\A[^@]+@[a-z]+(\.[a-z]+)+\z
foo@bar.com
\A = absolute beginning, \z = absolute end
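The three cases above, checked in PHP as a rough sketch. The variable names are invented; the D modifier line shows an alternative way to make `$` strict.

```php
<?php
$dollar   = '/^[^@]+@[a-z]+(\.[a-z]+)+$/';
$absolute = '/\A[^@]+@[a-z]+(\.[a-z]+)+\z/';

// $ also matches just before a trailing newline, so this input passes validation.
$dollarTrailing = preg_match($dollar, "foo@bar.com\n");

// \z only matches at the very end of the subject, so the newline is rejected.
$absoluteTrailing = preg_match($absolute, "foo@bar.com\n");
$absoluteClean    = preg_match($absolute, "foo@bar.com");

// The D modifier makes $ behave like \z.
$dollarEndOnly = preg_match('/^[^@]+@[a-z]+(\.[a-z]+)+$/D', "foo@bar.com\n");
```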
Back-references to Subpatterns
['"]\w+['"]
'single' "double" 'mixed"
(['"])\w+\1
'single' "double" 'mixed"
(?P<quote>['"])\w+(?P=quote)
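A sketch of numbered and named back-references on the same subject. Both forms should reject the mismatched `'mixed"`.

```php
<?php
$subject = "'single' \"double\" 'mixed\"";

// \1 must repeat exactly what group 1 captured, so the quotes have to match.
preg_match_all('/([\'"])\w+\1/', $subject, $numbered);

// Same idea with a named group and a named back-reference.
preg_match_all('/(?P<quote>[\'"])\w+(?P=quote)/', $subject, $named);
```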
Lookahead & Lookbehind
Let's match words surrounded by *'s
\*\w+\*
Of course, the whole point of a Doomsday Machine
is lost, if you *keep* it a *secret*!
(?<=\*)\w+(?=\*)
is lost, if you *keep* it a *secret*!
(?<!\*)\w+(?!\*)
is lost, if you *keep* it a *secret*!
Word boundary metacharacter
\b = (?<!\w)(?=\w)|(?<=\w)(?!\w)
\b(?<!\*)\w+(?!\*)\b
is lost, if you *keep* it a *secret*!
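Both lookaround patterns run against the quote, as a minimal sketch. The lookarounds assert what surrounds a word without making the asterisks part of the match.

```php
<?php
$subject = 'is lost, if you *keep* it a *secret*!';

// Words directly surrounded by asterisks; the asterisks themselves are not captured.
preg_match_all('/(?<=\*)\w+(?=\*)/', $subject, $starred);

// Whole words NOT touching an asterisk; \b stops the match from starting mid-word.
preg_match_all('/\b(?<!\*)\w+(?!\*)\b/', $subject, $plain);
```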
Conditionals
(?(back-reference)yes-pattern)
(?(back-reference)yes-pattern|no-pattern)
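An invented example of a conditional: a number that may be wrapped in parentheses, where the closing parenthesis is required only if the opening one was captured.

```php
<?php
// Group 1 optionally captures "("; (?(1)\)) demands ")" only when group 1 took part.
$pattern = '/^(\()?\d+(?(1)\))$/';

$wrapped   = preg_match($pattern, '(42)');
$bare      = preg_match($pattern, '42');
$openOnly  = preg_match($pattern, '(42');
$closeOnly = preg_match($pattern, '42)');
```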
Pattern Delimiters
/foo/
/https?:\/\/([^\/]+)\//
{https?://([^/]+)/}
In PHP, use single quotes to avoid over-escaping
// find literal backslashes at end of string
preg_match('{\\\\$}', $str); // => {\\$}
Modifiers
{...}i Case insensitive ([a-z] = [A-Za-z])
{...}m Multiline (^/$ per line)
{...}s Single line / Dotall (. = [\S\s])
{...}u Unicode
{...}D Dollar end only ($ = \z)
{...}x Free-spacing mode + Comments
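A quick sketch of the i, m and x modifiers in action. The subjects are invented for illustration.

```php
<?php
// i: case-insensitive matching.
$caseInsensitive = preg_match('/dr\. strangelove/i', 'Dr. Strangelove');

// m: ^ and $ match at every line boundary instead of only at the subject's ends.
$lines = "one\ntwo\nthree";
$withoutM = preg_match_all('/^\w+$/', $lines, $ignored);
$withM    = preg_match_all('/^\w+$/m', $lines, $perLine);

// x: whitespace and # comments inside the pattern are ignored.
$withX = preg_match('/
    \d{4}  # year
    -
    \d{2}  # month
/x', 'released 2015-12');
```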
Readability
(?#Comments can help readability)
{[◦-◯]+ \x{1F4A9}}u
{
[◦-◯]+ # Any googly-eye-range char(s)
\ # and a single space
\x{1F4A9} # Followed by pile of poo character
}ux
◭◮ 💩
Readability
{
(?(DEFINE)
# idstring: 1*( ALPHA / DIGIT / - / . )
(?<idstring>[\pL\pN\-\.]{1,})
# license-id: taken from list
(?<licenseid>${licenses})
# license-exception-id: taken from list
(?<licenseexceptionid>${exceptions})
# license-ref: [DocumentRef-1*(idstring):]LicenseRef-1*(idstring)
(?<licenseref>(?:DocumentRef-(?&idstring):)?LicenseRef-(?&idstring))
# simple-expression: license-id / license-id+ / license-ref
(?<simple_expression>(?&licenseid)\+? | (?&licenseid) | (?&licenseref))
# ...
Readability
{
# ...
# compound expression: 1*(
# simple-expression /
# simple-expression WITH license-exception-id /
# compound-expression AND compound-expression /
# compound-expression OR compound-expression
# ) / ( compound-expression ) )
(?<compound_head>
(?&simple_expression) ( \s+ (?:WITH) \s+ (?&licenseexceptionid))?
| \( \s* (?&compound_expression) \s* \)
)
(?<compound_expression>
(?&compound_head) (?: \s+ (?:AND|OR) \s+ (?&compound_expression))?
)
# license-expression: 1*1(simple-expression / compound-expression)
(?<license_expression>(?&compound_expression) | (?&simple_expression))
) # end of define
^(NONE | NOASSERTION | (?&license_expression))$
}x
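The same (?(DEFINE)...) technique on a much smaller, made-up grammar: a dotted-quad IPv4 address built from one reusable `octet` subpattern. This is only a sketch of the mechanism, not a replacement for a real IP validator.

```php
<?php
$pattern = '{
    (?(DEFINE)
        # octet: a number from 0 to 255
        (?<octet> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d )
    )
    # nothing in DEFINE matches by itself; it is only called via (?&name)
    \A (?&octet) (?: \. (?&octet) ){3} \z
}x';

$valid   = preg_match($pattern, '192.168.0.1');
$invalid = preg_match($pattern, '256.1.1.1');
```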
Engine Properties
First match wins
(set|setFoo) setFoo
(setFoo|set) setFoo
Engine Properties
An overall match is always preferred to an overall non-match
(set|setFoo)Bar setFooBar
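Both properties checked in PHP, as a small sketch. Alternation order decides the match, unless the first choice would make the overall pattern fail.

```php
<?php
// First alternative that works wins, even if a longer one would also match.
preg_match('/(set|setFoo)/', 'setFoo', $short);
preg_match('/(setFoo|set)/', 'setFoo', $long);

// "set" is tried first, but "Bar" cannot follow it, so the engine falls back to "setFoo".
preg_match('/(set|setFoo)Bar/', 'setFooBar', $overall);
```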
Engine Properties
When the rest of the pattern fails, the engine backtracks to the most recent choice point and tries the next option
^[^@]+@[a-z]+(\.[a-z]+)+$
foo@bar.co.uk
foo@bar.co.uk.foo.
PCRE JIT
Available in PHP 7!
pcre.jit=1
Greedy, baseline 6x, JIT crashes & returns null
"[^"\\\\]*(\\\\.[^"\\\\]*)*"
Possessive, baseline 2.2x, JIT 1x
"[^"\\\\]*+(\\\\.[^"\\\\]*+)*+"
Code Search
Calls to fooBar() with arg #3 being "test"
->fooBar\(.*?\btest\b
$x->fooBar(LalaInterface::EXAMPLE, 3, 'test');
Object Foo modified using any setter
\bfoo->set[A-Z]
$foo->setBar(5);
$bar->foo->setFoo(5);
$nofoo->setFoo(5);
Stripping
Remove a prefix if it's present
preg_replace('{^dev-}', '', $version);
Trim trailing whitespace of every line
preg_replace('{[ \t]+$}m', '', $str);
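Both stripping recipes with sample input, as a minimal sketch. The trailing-whitespace pattern here uses `[ \t]+` so that newlines, and with them blank lines, are left alone. The sample strings are invented.

```php
<?php
// The prefix is removed only when it is actually at the start of the string.
$version   = preg_replace('{^dev-}', '', 'dev-master');
$untouched = preg_replace('{^dev-}', '', '1.0.0');

// With m, $ matches at the end of every line; spaces and tabs before it are dropped.
$trimmed = preg_replace('{[ \t]+$}m', '', "a  \n\nb\t\n");
```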
Stripping PHP files to bits
Finding classes, interfaces, .. without token_get_all()
// strip heredocs/nowdocs
$contents = preg_replace('{<<<\s*(\'?)(\w+)\\1(?:\r\n|\n|\r)(?:.*?)(?:\r\n|\n|\r)\\2(?=\r\n|\n|\r|;)}s', 'null', $contents);
// strip strings
$contents = preg_replace('{"[^"\\\\]*+(\\\\.[^"\\\\]*+)*+"|\'[^\'\\\\]*+(\\\\.[^\'\\\\]*+)*+\'}s', 'null', $contents);
// strip leading non-php code if needed
if (substr($contents, 0, 2) !== '<?') {
$contents = preg_replace('{^.+?<\?}s', '<?', $contents, 1, $replacements);
if ($replacements === 0) {
return array();
}
}
// strip non-php blocks in the file
$contents = preg_replace('{\?>.+?<\?}s', '?><?', $contents);
// strip trailing non-php code if needed
$pos = strrpos($contents, '?>');
if (false !== $pos && false === strpos(substr($contents, $pos), '<?')) {
$contents = substr($contents, 0, $pos);
}
preg_match_all('{
(?:
\b(?<![\$:>])(?P<type>class|interface|trait) \s++ (?P<name>[a-zA-Z_\x7f-\xff:][a-zA-Z0-9_\x7f-\xff:\-]*+)
| \b(?<![\$:>])(?P<ns>namespace) (?P<nsname>\s++[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*+(?:\s*+\\\\\s*+[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*+)*+)? \s*+ [\{;]
)
}ix', $contents, $matches);
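Every `preg_replace()` call above returns null when the engine gives up (backtrack limit, JIT stack limit, bad UTF-8), and assigning that straight back to `$contents` silently loses the file. A hedged sketch of a guard: `replaceOrFail()` is a hypothetical helper, not part of PHP or of the code above.

```php
<?php
// Hypothetical helper: turn a silent null from preg_replace() into an exception.
function replaceOrFail(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $result;
}

// Possessive double-quoted-string pattern, applied to a made-up line of code.
$pattern = '{"[^"\\\\]*+(\\\\.[^"\\\\]*+)*+"}';
$clean = replaceOrFail($pattern, 'null', 'echo "a \"quoted\" string";');
```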
Splitting
Split tags separated by spaces or commas
preg_split('/[ ,]+/', $tags)
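A sketch of the split with sample input. The tag strings are invented; `PREG_SPLIT_NO_EMPTY` is a standard `preg_split()` flag that drops the empty pieces produced by leading or trailing separators.

```php
<?php
// Any run of spaces and commas counts as one separator.
$tags = preg_split('/[ ,]+/', 'php, regex  composer,pcre');

// Leading or trailing separators would yield empty strings without PREG_SPLIT_NO_EMPTY.
$messy = preg_split('/[ ,]+/', ' php, regex, ', -1, PREG_SPLIT_NO_EMPTY);
```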
grep
Use egrep/grep -E for sanity. ( ) ? + | are only metacharacters in extended mode; basic mode treats them as literals unless backslash-escaped
Best alias it:
alias grep='grep -r --extended-regexp --color=auto --exclude-dir=.svn --exclude-dir=.git'
Does it match?
q(?=u)it quit
Nope, lookahead is zero-width and does not advance the cursor.
q(?=u) quit
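The quiz answer checked in PHP, as a short sketch. The third pattern is an added illustration of how to consume the "u" after asserting it.

```php
<?php
// The lookahead checks for "u" but consumes nothing, so "i" is then compared with "u".
$noMatch = preg_match('/q(?=u)it/', 'quit');

// On its own, the pattern matches just the "q" that is followed by "u".
preg_match('/q(?=u)/', 'quit', $onlyQ);

// To match the whole word, the "u" still has to be consumed after the assertion.
preg_match('/q(?=u)uit/', 'quit', $whole);
```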
Limitations
Use regexes unless you can't
Know your domain variance and restrictions
Domain
Start very strict, avoid . and expand as needed
Matching loosely leads to unexpected results
Boundaries
Remember to use \b...\b for matching
and \A...\z or ^...$ for validation
Document
Document or split up complex regexes
Use named capture groups where possible
Now stop worrying,
and love regexes!
Thank you.