nanohub

PEG or Regex?



Most programmers know (or should know) regular expressions (regexes), perhaps the most widely used tool for parsing and searching strings. Several implementations and extensions exist, and many operating systems and programming languages ship one by default. Basically, a regex is a small language for describing other small languages that have a regular grammar, so that a program can search a string that follows that grammar for patterns and/or split it into pieces.

Some basic examples:

  1. /ok/ matches any string with ok inside, e.g. aok, not ok, okay;
  2. /^ok/ matches any string starting with ok, e.g. ok2, okay;
  3. /^ok$/ matches ok exactly;
  4. /^ok = (true|false)$/ matches ok = true or ok = false exactly, returning the capture group in parentheses;
  5. /^version = (\d\.\d+)$/ matches version = followed by one digit, a point, and one or more digits (\d is a digit, \. is a literal point, and + means one or more repetitions of the preceding expression), so we can capture a decimal number such as 1.25 written just after version = .
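To try these out, here is a minimal sketch using Python's re module. The choice of Python and the sample strings are my assumptions; the patterns are the ones from the list, written without the surrounding slashes because Python takes them as plain strings.

```python
import re

# (pattern, sample text) pairs taken from the list above; the samples are illustrative.
examples = [
    (r"ok", "not ok"),
    (r"^ok", "okay"),
    (r"^ok", "not ok"),
    (r"^ok$", "okay"),
    (r"^ok = (true|false)$", "ok = true"),
    (r"^version = (\d\.\d+)$", "version = 1.25"),
]

# re.search returns a match object, or None when the pattern is not found.
results = [re.search(pattern, text) is not None for pattern, text in examples]
for (pattern, text), found in zip(examples, results):
    print(f"{pattern!r} on {text!r}: {'match' if found else 'no match'}")

# Capture groups are read back with group(1).
flag = re.search(r"^ok = (true|false)$", "ok = false").group(1)
version = re.search(r"^version = (\d\.\d+)$", "version = 1.25").group(1)
print(flag, version)  # false 1.25
```

The third and fourth pairs fail on purpose: ^ok needs ok at the start, and ^ok$ rejects anything besides ok itself.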

So Regular Expressions offer a pretty concise way of parsing text, but problems start when we need longer and more complex patterns. For example, what about parsing all the components (year, month, etc.) of an ISO datetime string, e.g., 2016-08-15T17:51:52.291Z? Then we need something like:

/(\d{4})-(\d{2})-(\d{2})T(\d{2})\:(\d{2})\:(\d{2})\.(\d+)Z/
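As a quick check, the regex above can be run through Python's re module. This is only a sketch: the helper name parse_iso is hypothetical, and using fullmatch (so the whole string has to be a datetime) is my assumption, since the original pattern has no anchors. The backslashes before the colons are dropped because a colon needs no escaping.

```python
import re

# Same pattern as above, minus the unnecessary escapes before ':'.
ISO_DATETIME = re.compile(r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})\.(\d+)Z")

def parse_iso(text):
    """Return the seven captured components as strings, or None if there is no match."""
    match = ISO_DATETIME.fullmatch(text)
    return match.groups() if match else None

parts = parse_iso("2016-08-15T17:51:52.291Z")
print(parts)  # ('2016', '08', '15', '17', '51', '52', '291')
```

The groups come back by position only, so the reader has to count parentheses to know that the fifth one is the minute.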

This is perfectly fine, but it takes care both to write and to read. One day I thought that a more descriptive notation would be interesting, if not desirable. Then I found the module LPeg.re, implemented with Parsing Expression Grammars (PEG), whose notation is much closer to Backus-Naur Form and provides a more structured syntax. For example, the pattern for the previous regex could be written as:

datetime <- date 'T' time 'Z'
date <- {%d^4} '-' {%d^2} '-' {%d^2}
time <- {%d^2} ':' {%d^2} ':' {%d^2} ('.' {%d+})?

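Here %d is a digit, {} denotes a capture group, ^n denotes exactly n repetitions of the preceding pattern, and ( ... )? marks an optional part. Note that, unlike the regex above, this grammar makes the fractional seconds optional.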
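To give a feeling for how such a grammar works underneath, here is a rough sketch in Python of the same three rules built from tiny hand-written parser functions. This is not LPeg and not how LPeg.re is implemented. Every name in it (lit, digits, seq, opt, cap, match) is hypothetical, and capturing the seconds and the optional fraction as separate values is an assumption of the sketch.

```python
# A parser takes (text, pos) and returns (new_pos, captures), or None on failure.

def lit(s):
    """Match the literal string s (like 'T' in the grammar)."""
    def parse(text, pos):
        return (pos + len(s), []) if text.startswith(s, pos) else None
    return parse

def digits(n=None):
    """Exactly n ASCII digits (like %d^n), or one or more when n is None (like %d+)."""
    def parse(text, pos):
        end = pos
        while end < len(text) and text[end] in "0123456789" and (n is None or end - pos < n):
            end += 1
        count = end - pos
        if count == 0 or (n is not None and count != n):
            return None
        return end, []
    return parse

def seq(*parsers):
    """Match each parser in order, collecting their captures."""
    def parse(text, pos):
        captures = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            pos, caps = result
            captures += caps
        return pos, captures
    return parse

def opt(p):
    """Match p if possible, otherwise succeed without consuming anything (like p?)."""
    def parse(text, pos):
        return p(text, pos) or (pos, [])
    return parse

def cap(p):
    """Capture the text matched by p (like {p})."""
    def parse(text, pos):
        result = p(text, pos)
        if result is None:
            return None
        end, caps = result
        return end, [text[pos:end]] + caps
    return parse

# The three rules, named after the grammar above.
date_rule = seq(cap(digits(4)), lit("-"), cap(digits(2)), lit("-"), cap(digits(2)))
time_rule = seq(cap(digits(2)), lit(":"), cap(digits(2)), lit(":"), cap(digits(2)),
                opt(seq(lit("."), cap(digits()))))
datetime_rule = seq(date_rule, lit("T"), time_rule, lit("Z"))

def match(parser, text):
    """Return the captures if the parser consumes the whole text, else None."""
    result = parser(text, 0)
    if result is None or result[0] != len(text):
        return None
    return result[1]

print(match(datetime_rule, "2016-08-15T17:51:52.291Z"))
# ['2016', '08', '15', '17', '51', '52', '291']
```

Each rule is an ordinary value, so date_rule and time_rule can be tested or reused on their own, which is the point of naming sub-patterns.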

Even though it is longer, it makes it possible to separate and name sub-patterns, which gives the source code a lot more structure, much like how early macro assemblers evolved into structured programming languages. PEG also offers text transformation and returning objects from captures, sort of an advanced awk or sed integrated in the library, but that is for another day. =]

Daniel Lima