
LPeg Parsing in Lua: PEG Grammars, Captures, and the re Module

Lua’s built-in string patterns handle most search-and-replace jobs, but they struggle with structured input like arithmetic expressions, config files, or CSV data. Writing a parser by hand is error-prone and tedious. LPeg parsing solves this by giving you a formal grammar system backed by Parsing Expression Grammars (PEGs).

LPeg was created by Roberto Ierusalimschy, Lua’s lead designer. It is a separate C library, not part of Lua’s standard library, so you install it alongside whichever interpreter you use, whether that is PUC-Rio Lua or LuaJIT. The companion re module layers a regex-like syntax on top of LPeg’s programmatic API, which is what most developers reach for first.

Getting Started

Once LPeg is installed, you load it like any other module:

local lpeg = require("lpeg")

LPeg is not bundled with the reference Lua interpreter or with LuaJIT environments like OpenResty, so it has to be installed before the require call succeeds. The usual installation method is LuaRocks, Lua’s package manager, which compiles the C source against your Lua or LuaJIT installation so that the binary matches your runtime.

luarocks install lpeg

After requiring the module, the central function for all LPeg operations is lpeg.match(). It takes a pattern as its first argument and the subject string as the second. Unlike string.match(), which returns captured substrings directly, lpeg.match() returns the byte position after the match when no captures are defined. This position-based result lets you chain successive matches across a string by passing the returned index as the optional third argument, the starting position, of the next call. The following example demonstrates the simplest case, matching a literal string:

local lpeg = require("lpeg")

local pattern = lpeg.P("hello")
local result = lpeg.match(pattern, "hello world")

print(result)  --> 6  (index of first character after the match)

lpeg.match() is anchored: it tries the pattern only at the starting position (position 1 by default) and does not search the rest of the subject. When the pattern fails there, it returns nil with no error message. On success with no captures defined, it returns the byte index immediately after the matched portion: 6 in this case because ‘hello’ occupies bytes 1 through 5 in the string. If your pattern includes capture operators like lpeg.C(), lpeg.Ct(), or lpeg.Cg(), the function returns the captured contents instead of a position. Always check for nil before unpacking match results, since a failed parse produces no diagnostic beyond the falsy return value.
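As a small sketch of chaining, the snippet below walks across a subject by feeding each returned position into the next call as the starting index. The pattern names and the subject string are illustrative only; it also shows that a pattern which occurs later in the subject still fails when it does not match at the start position.

```lua
local lpeg = require("lpeg")

local word    = lpeg.R("az")^1
local space   = lpeg.P(" ")
local subject = "foo bar"

-- Each call tries the pattern only at its start position (1 by default).
local after_first  = lpeg.match(word, subject)                 -- end of "foo"
local after_space  = lpeg.match(space, subject, after_first)   -- skip the blank
local after_second = lpeg.match(word, subject, after_space)    -- end of "bar"
print(after_first, after_space, after_second)  --> 4  5  8

-- "bar" occurs in the subject, but not at position 1, so this fails.
local miss = lpeg.match(lpeg.P("bar"), subject)
print(miss)  --> nil
```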

Building basic patterns

lpeg.P() converts values into patterns, the fundamental building block of every LPeg grammar. Unlike Lua string patterns where "a." matches an ‘a’ followed by any character, LPeg strings match literally character-by-character. You combine primitive patterns with operators rather than embedding metacharacters inside strings. The R() and S() helpers round out the primitive set for character ranges and sets:

local P, R, S = lpeg.P, lpeg.R, lpeg.S

-- Match literal text
local hello = P("hello")

-- R matches character ranges
local lowercase = R("az")
local uppercase = R("AZ")
local letter = lowercase + uppercase

-- S matches any single character in a set
local op = S("+-*/%^")

R("az") matches exactly one character in the ASCII range a through z—think of it as a character-class range that is more precise than Lua’s %l pattern class. S("+-*/") is a character set that matches any single character listed inside the string argument. These primitives become powerful once you connect them with LPeg’s two main composition operators. The sequence operator * chains patterns left to right, requiring each to succeed in order before the overall pattern matches. The ordered choice operator + tries the left alternative first and only falls through to the right if the left fails completely:

local P, R = lpeg.P, lpeg.R

-- Sequence: match both in order
local greeting = P("hello") * P(",") * P(" ")

-- Ordered choice: try left side first, then right
local opt_hello = P("hello") + P("hi") + P("yo")

The + operator has PEG semantics, not regex semantics. It tries the left side first and only moves to the right if the left side fails completely. This matters more than it sounds.
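A short sketch of why the ordering matters: once an alternative succeeds, the choice is committed, and LPeg does not come back to try the other alternative when a later part of the sequence fails. The pattern names here are illustrative.

```lua
local lpeg = require("lpeg")
local P = lpeg.P

-- Same alternatives, different order, followed by a required "c".
local short_first = (P("a") + P("ab")) * P("c")
local long_first  = (P("ab") + P("a")) * P("c")

-- "a" succeeds first, then "c" is expected where "b" is: no retry with "ab".
print(lpeg.match(short_first, "abc"))  --> nil
-- Putting the longer alternative first lets both inputs match.
print(lpeg.match(long_first, "abc"))   --> 4
print(lpeg.match(long_first, "ac"))    --> 3
```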

Repetition Operators

Patterns support a set of repetition operators that behave differently from their regex counterparts:

Operator    Meaning
p^0         Zero or more repetitions (greedy)
p^1         One or more repetitions (greedy)
p^n         At least n repetitions (n >= 0)
p^-n        At most n repetitions; p^-1 means zero or one (optional)

local R = lpeg.R

-- One or more lowercase letters
local word = R("az")^1
print(lpeg.match(word, "hello"))  --> 6

-- Zero or more digits (greedy)
local decimals = R("09")^0
print(lpeg.match(decimals, "123abc"))  --> 4  (matched "123")

-- At most one digit (optional); there is no lazy repetition in LPeg
print(lpeg.match(R("09")^-1, "123abc"))  --> 2  (matched "1")

LPeg repetition is always greedy: it matches as many repetitions as it can, and it never gives characters back so that a later pattern can succeed. There is no lazy modifier; p^-1 simply makes p optional. When you need minimal matching, for instance to find the shortest quoted string or the first closing bracket, exclude the terminator from the repeated pattern, as in (P(1) - P('"'))^0. This matters most inside grammars, where a rule’s repetition can consume input that a subsequent rule needs.
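One way to stop a repetition at the first delimiter is to subtract the delimiter from the repeated pattern. The sketch below contrasts that with a blind "any character" loop; the pattern names are illustrative.

```lua
local lpeg = require("lpeg")
local P, C = lpeg.P, lpeg.C

local quote = P('"')

-- P(1)^0 consumes everything up to the end of the subject, including the
-- closing quote, and never gives it back, so the final quote cannot match.
local too_greedy = quote * P(1)^0 * quote

-- Repeating "any character that is not a quote" stops at the first quote.
local quoted = quote * C((P(1) - quote)^0) * quote

print(lpeg.match(too_greedy, '"hi" and "bye"'))  --> nil
print(lpeg.match(quoted, '"hi" and "bye"'))      --> hi
```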

Lookahead

Lookahead operators let you inspect what comes next without consuming input. They are zero-width assertions that succeed or fail based on whether a sub-pattern matches ahead, and they never advance the parse position. This makes them ideal for boundary conditions, reserved-word checks, and end-of-input detection:

local P = lpeg.P

-- Positive lookahead: asserts that "bar" follows "foo" but does not consume it
local p = P("foo") * #P("bar")

print(lpeg.match(p, "foobar"))  --> 4   (only "foo" was consumed)
print(lpeg.match(p, "foobaz"))  --> nil (lookahead failed)

The positive lookahead #p succeeds if p can match at the current position but consumes no input, so the parse cursor stays where it was before the assertion. In the example above, "foobar" matches because foo is followed by bar, but "foobaz" fails the lookahead check. The negative form -p inverts the logic; it succeeds only when p cannot match at the current position. Negative lookahead is useful for defining boundaries, excluding reserved words from an identifier rule, or ensuring a specific suffix does not appear:

-- Negative lookahead: succeeds only if "bar" does not follow "foo"
local p = P("foo") * -P("bar")

print(lpeg.match(p, "foobaz"))   --> 4
print(lpeg.match(p, "foobar"))   --> nil (negative lookahead failed)
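The reserved-word case mentioned above can be sketched as follows. The keyword list and the identifier rules are assumptions chosen for illustration, not part of LPeg: a keyword counts only when no further identifier character follows it, and an identifier is anything word-shaped that is not a keyword.

```lua
local lpeg = require("lpeg")
local P, R, C = lpeg.P, lpeg.R, lpeg.C

local alpha = R("az", "AZ") + P("_")
local alnum = alpha + R("09")

-- A keyword must not be followed by another identifier character,
-- so "if" is a keyword but "iffy" is not.
local keyword    = (P("if") + P("end") + P("while")) * -alnum
local identifier = -keyword * C(alpha * alnum^0)

print(lpeg.match(identifier, "iffy"))    --> iffy
print(lpeg.match(identifier, "ending"))  --> ending
print(lpeg.match(identifier, "if"))      --> nil
print(lpeg.match(identifier, "while"))   --> nil
```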

Captures

Captures are the mechanism that transforms LPeg from a simple pattern matcher into a parser that can extract and reshape data. Without captures, lpeg.match() only tells you whether and where a pattern matched. With captures, it can return substrings, structured tables, transformed values, or even fully built ASTs. LPeg provides several capture kinds, including simple, constant, position, group, table, and fold captures, and they compose freely inside a single pattern so you can mix capture styles.

Simple Capture: lpeg.C()

lpeg.C() wraps a pattern and captures the matched substring as a raw Lua string:

local C, R = lpeg.C, lpeg.R

local word = C(R("az")^1)
print(lpeg.match(word, "hello world"))  --> hello

lpeg.C() returns a new pattern, not the captured value. You still call lpeg.match() on the combined pattern to execute the match and retrieve results. This functional-composition model means you can nest captures inside larger patterns and LPeg will extract exactly the pieces you marked with C(), ignoring everything that sits outside capture wrappers.

Constant and position captures

lpeg.Cc() produces fixed values without consuming input. lpeg.Cp() returns the current position. These two captures are useful for tagging matched regions with metadata or tracking where in the input a match occurred:

local Cc, Cp, P = lpeg.Cc, lpeg.Cp, lpeg.P

-- Tag matched text with a category
local tagged = P("dog") * Cc("animal")
print(lpeg.match(tagged, "dog"))  --> animal

-- Capture the position
local pos = P("hello") * Cp()
print(lpeg.match(pos, "hello world"))  --> 6
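The two captures also combine. In this sketch (the word lists are made up for illustration), each alternative tags its match with a category via Cc(), and a trailing Cp() reports where the match ended, so lpeg.match() returns two values.

```lua
local lpeg = require("lpeg")
local Cc, Cp, P = lpeg.Cc, lpeg.Cp, lpeg.P

local animal   = (P("dog") + P("cat")) * Cc("animal")
local plant    = (P("fern") + P("oak")) * Cc("plant")
local classify = (animal + plant) * Cp()

-- Returns the category first, then the position after the matched word.
print(lpeg.match(classify, "cat nap"))  --> animal  4
print(lpeg.match(classify, "oak"))      --> plant   4
print(lpeg.match(classify, "rock"))     --> nil
```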

Table capture: lpeg.Ct()

lpeg.Ct() collects every capture produced by its sub-pattern into a Lua table, preserving the order in which captures were triggered. This is the mechanism for building an AST or returning structured records from a parse. When combined with lpeg.Cg() for named group captures, the resulting table uses string keys rather than numeric indices, giving you dictionary-style access to parsed fields:

local Ct, C, P, R, V = lpeg.Ct, lpeg.C, lpeg.P, lpeg.R, lpeg.V

local name = C(R("az")^1)

-- Named group captures
local pair = lpeg.Cg(name, "key") * P("=") * lpeg.Cg(name, "value")
local record = Ct(pair)

local result = lpeg.match(record, "foo=bar")
print(result.key, result.value)  --> foo  bar
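Named and positional captures can share one table. As a sketch, assuming a made-up command format of a name followed by space-separated arguments, the named group lands in a string key while the remaining captures fill the array part in order.

```lua
local lpeg = require("lpeg")
local C, Cg, Ct, P, R = lpeg.C, lpeg.Cg, lpeg.Ct, lpeg.P, lpeg.R

local word = C(R("az")^1)

-- First word goes into the "name" field; the rest become cmd[1], cmd[2], ...
local command = Ct(Cg(word, "name") * (P(" ") * word)^0)

local cmd = lpeg.match(command, "copy src dst")
print(cmd.name, cmd[1], cmd[2], #cmd)  --> copy  src  dst  2
```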

Fold capture: lpeg.Cf()

lpeg.Cf() aggregates multiple captured values by applying a Lua function cumulatively across them, like a left fold (reduce) in functional programming. The first argument to lpeg.Cf() is a pattern that must produce at least one capture, and the second is a two-argument function f(accumulator, next_value) that LPeg calls repeatedly. The first capture initializes the accumulator, and each subsequent capture feeds into the next invocation. This lets you compute running sums, build concatenated strings, or construct nested table structures directly during the parse:

local Cf, C, R = lpeg.Cf, lpeg.C, lpeg.R

local number = C(R("09")^1) / tonumber
local sum = Cf(number * ("," * number)^0, function(acc, x) return acc + x end)

print(lpeg.match(sum, "10,20,30"))  --> 60

The / operator applies a transformation to a capture. Here it converts the captured string "10" into the number 10 before the fold function processes it.
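A fold can also build a table. The sketch below follows a well-known idiom: an empty table capture seeds the accumulator, each anonymous group capture yields a key and a value, and rawset stores them and returns the table for the next step. The key=value input format is an assumption for illustration.

```lua
local lpeg = require("lpeg")
local C, Cf, Cg, Ct, P, R = lpeg.C, lpeg.Cf, lpeg.Cg, lpeg.Ct, lpeg.P, lpeg.R

local word = C(R("az")^1)

-- Anonymous group: passes its two captures (key, value) to the fold together.
local pair = Cg(word * P("=") * word)

-- Ct("") produces an empty table as the initial accumulator;
-- rawset(t, k, v) sets the field and returns t.
local map = Cf(Ct("") * pair * (P(",") * pair)^0, rawset)

local t = lpeg.match(map, "a=b,c=d")
print(t.a, t.c)  --> b  d
```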

Grammars for recursive structures

Real parsers need recursion to handle nested constructs like parenthesized expressions or blocks. A grammar in LPeg is a Lua table of named rules passed to lpeg.P(). The entry at index 1 designates the start rule: if it is a string, it names the rule to start from; otherwise it is taken as the start rule’s pattern itself. Each remaining key-value pair in the table maps a rule name to its pattern, and V("name") creates a non-terminal reference that defers matching to the named rule. LPeg grammars are PEG grammars, so ordered choice and greedy repetition apply at every level of the parse tree:

local P, R, S, V = lpeg.P, lpeg.R, lpeg.S, lpeg.V
local C, Cf, Cg = lpeg.C, lpeg.Cf, lpeg.Cg

-- Fold step: combine the running value with one (operator, operand) pair
local function apply(acc, op, rhs)
    if op == "+" then return acc + rhs
    elseif op == "-" then return acc - rhs
    elseif op == "*" then return acc * rhs
    else return acc / rhs end
end

local calc = P({
    "expr",  -- name of the start rule
    expr   = Cf(V"term" * Cg(V"addop" * V"term")^0, apply),
    term   = Cf(V"factor" * Cg(V"mulop" * V"factor")^0, apply),
    factor = V"number" + "(" * V"expr" * ")",
    number = C(R("09")^1) / tonumber,
    addop  = C(S("+-")),
    mulop  = C(S("*/")),
})

V("rule") references another rule by name. References are resolved when the whole table is passed to lpeg.P(), so rules can appear in any order and can refer to each other recursively. What LPeg does reject is left recursion: a rule that can reach itself without first consuming input.

Each level of the grammar folds its own operands, so calc returns a single number rather than a list of tokens. A thin wrapper adds an end-of-input check and an error value:

local function evaluate(text)
    -- "* -1" succeeds only at end of input, so trailing garbage is a parse error
    local result = lpeg.match(calc * -1, text)
    if result == nil then return nil, "parse error" end
    return result
end

print(evaluate("2+3"))       --> 5
print(evaluate("10-3*2"))    --> 4
print(evaluate("(10-3)*2"))  --> 14

The grammar enforces precedence: multiplication binds tighter than addition because term folds its factors before expr folds the resulting terms. 2+3*4 therefore evaluates as 2 + (3 * 4) = 14.
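Recursion is easier to see in a smaller grammar. The following sketch, with illustrative rule names, matches balanced parentheses: a group contains items, and an item is either an ordinary character or another group.

```lua
local lpeg = require("lpeg")
local P, S, V = lpeg.P, lpeg.S, lpeg.V

local balanced = P({
    "group",
    group = P("(") * V("item")^0 * P(")"),
    -- Any character that is not a parenthesis, or a nested group.
    item  = (P(1) - S("()")) + V("group"),
})

print(lpeg.match(balanced, "(a(b)c)"))  --> 8
print(lpeg.match(balanced, "()"))       --> 3
print(lpeg.match(balanced, "(a(b)c"))   --> nil (missing closing parenthesis)
```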

The re Module

Writing grammars with lpeg.P tables is powerful but verbose. The re module provides a conventional syntax:

local re = require("re")

Key differences from raw LPeg:

re syntax     LPeg equivalent
"abc"         P("abc")
[abc]         S("abc")
[a-z]         R("az")
p*            p^0
p+            p^1
p?            p^-1
p1 p2         p1 * p2
p1 / p2       p1 + p2
{p}           C(p)
{| p |}       Ct(p)
name          V"name"
&p            #p (positive lookahead)
!p            -p (negative lookahead)
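To make the correspondence concrete, here is the same small pattern written both ways. The key=number input is an assumption for illustration; in re syntax, plain braces are a simple capture and literal text is quoted.

```lua
local lpeg = require("lpeg")
local re = require("re")

-- Programmatic API: letters, "=", digits, with both sides captured.
local by_hand = lpeg.C(lpeg.R("az")^1) * lpeg.P("=") * lpeg.C(lpeg.R("09")^1)

-- The same pattern in re syntax.
local by_re = re.compile("{[a-z]+} '=' {[0-9]+}")

print(lpeg.match(by_hand, "width=80"))  --> width  80
print(by_re:match("width=80"))          --> width  80
```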

Compiling and Matching

re.compile() converts a re-syntax string into an LPeg pattern:

local re = require("re")

-- %space refers to the "space" entry of the definitions table
local p = re.compile("%space [a-z]+ %space", { space = re.compile("[ \t\n]*") })

The second argument to re.compile() is an environment table that maps custom definition names to pre-compiled patterns. Inside the re-syntax string, you reference these definitions with %name. This environment mechanism lets you build reusable pattern libraries without repeating the same sub-patterns across multiple grammar strings. For patterns that you apply repeatedly against different inputs, compile once and reuse the result.
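As a sketch of the compile-once pattern, the definitions table below (the names ws and ident are hypothetical) is built with the programmatic API and then referenced from an re string that is compiled a single time and applied to several inputs.

```lua
local lpeg = require("lpeg")
local re = require("re")

-- Hypothetical definitions, referenced as %ws and %ident in the re string.
local defs = {
    ws    = lpeg.S(" \t")^0,
    ident = lpeg.R("az", "AZ")^1,
}

-- Compile once; the braces capture the identifier without the padding.
local trimmed = re.compile("%ws {%ident} %ws", defs)

for _, line in ipairs({ "  hello  ", "World", "\tlua" }) do
    print(trimmed:match(line))
end
--> hello
--> World
--> lua
```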

For one-shot operations where you do not need a compiled pattern, the re module provides re.match(), re.find(), and re.gsub() as convenience functions. They resemble their string library counterparts: the subject string comes first and the pattern second. The pattern is written in re syntax, though, so literal text must be quoted:

local re = require("re")

print(re.match("hello world", "'hello'"))   --> 6
local s, e = re.find("say hello world", "'hello'")
print(s, e)                                 --> 5  9

local result = re.gsub("hello world", "[aeiou]", string.upper)
print(result)                         --> hEllO wOrld

Parsing key-value pairs with re

The re module’s real strength shows when you define a named grammar with the arrow syntax <-. Unlike the flat regex-style patterns used with re.match(), a full re grammar supports named rules, recursive definitions, and table captures, all expressed in a compact string rather than the more verbose lpeg.P() table syntax. The following parser extracts comma-separated key=value pairs into tables with named fields:

local re = require("re")

local grammar = re.compile([[
    record  <- {| pair (',' pair)* |}
    pair    <- {| {:key: key :} '=' {:value: value :} |}
    key     <- {[a-z]+}
    value   <- {[a-z]+}
]])

local result = grammar:match("foo=bar,baz=qux")
print(result[1].key, result[1].value)  --> foo  bar
print(result[2].key, result[2].value)  --> baz  qux

The {| ... |} delimiters create a table capture. Plain braces such as {[a-z]+} capture the matched text as a string, and {:name: p :} stores the value captured by p in the field called name of the enclosing table.

Practical example: parsing CSV

CSV parsing exposes the real value of LPeg. Quoted fields, escaped quotes, and newlines all create edge cases that regex-based approaches stumble over:

local lpeg = require("lpeg")

local comma     = lpeg.P(",")
local newline   = lpeg.P("\n")
local quote     = lpeg.P('"')
local dquote    = quote * quote  -- escaped quote: ""

-- Unquoted field: any character except comma or newline
local field_nq = lpeg.C((lpeg.P(1) - comma - newline)^0)

-- Quoted field: opening quote, content, closing quote
-- Content can be any char except quote, or escaped quote
local field_q  = lpeg.C(quote * (lpeg.P(1) - quote + dquote)^0 * quote)

local field = field_q + field_nq
local row   = lpeg.Ct(field * (comma * field)^0)
local csv   = lpeg.Ct(row * (newline * row)^0)

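-- No trailing newline here: one would add an extra row holding a single empty field
local data = lpeg.match(csv, 'name,age,city\nAlice,30,NYC\nBob,25,LA')

local unpack = table.unpack or unpack  -- Lua 5.2+ moved unpack into the table library

for _, row in ipairs(data) do
    print(unpack(row))
end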

Running this parser against a three-line CSV snippet produces a table of rows, where each row is itself a table of fields. The lpeg.Ct() around the row and CSV patterns ensures a nested table structure that matches the logical two-dimensional layout of the data. Each call to print(unpack(row)) expands the field table into separate arguments, producing tab-separated output that mirrors the original CSV columns:

name	age	city
Alice	30	NYC
Bob	25	LA

The pattern (lpeg.P(1) - comma - newline)^0 reads as “match any character that is not a comma and not a newline, zero or more times.” The - operator is pattern difference: p1 - p2 matches p1 only where p2 does not match, which for single-character patterns works like set subtraction.
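The quoted-field pattern above keeps the surrounding quotes and the doubled quotes in its capture. One possible refinement, sketched here with a hypothetical name field_unescaped, uses a substitution capture (lpeg.Cs) so that each escaped "" is replaced by a single quote and the outer quotes are dropped.

```lua
local lpeg = require("lpeg")
local P, Cs = lpeg.P, lpeg.Cs

local quote  = P('"')
local dquote = quote * quote  -- escaped quote: ""

-- Cs returns the matched text with every dquote replaced by one quote character.
local field_unescaped = quote * Cs(((P(1) - quote) + dquote / '"')^0) * quote

print(lpeg.match(field_unescaped, '"say ""hi"""'))  --> say "hi"
print(lpeg.match(field_unescaped, '"plain"'))       --> plain
```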

Common Gotchas

Strings match literally, not as regex. P("a.") matches the two-character string "a.", not an "a" followed by any character. Use P(1) for “any character.”

The + operator is ordered. P("a") + P("ab") can never match via the second branch: whenever "ab" would match, "a" has already succeeded and the choice is committed. Put the longer alternative first, as in P("ab") + P("a"). This is a PEG, not a regular expression.

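Grammars cannot be left-recursive. Rule order inside the table does not matter, because V() references are resolved when the table is compiled. A rule that can call itself before consuming any input, however, is rejected as possibly left recursive, so rewrite such rules with repetition instead.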

Neither Lua nor LuaJIT ships with LPeg. Most OpenResty installations use LuaJIT, which implements Lua 5.1 semantics, and the reference interpreter does not bundle LPeg either. You need luarocks install lpeg on those systems.

No built-in error messages. lpeg.match() returns nil on failure with no explanation. For better diagnostics, look at lpeglabel, which adds labeled error reporting to LPeg.
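Even without extra libraries, a pattern that has no captures returns the position where it stopped, which is enough for a rough diagnostic. The helper below is a hypothetical sketch, not part of LPeg: it reports where a comma-separated list of numbers stopped matching.

```lua
local lpeg = require("lpeg")
local P, R = lpeg.P, lpeg.R

local number = R("09")^1
local list   = number * (P(",") * number)^0  -- no captures: match returns a position

-- Hypothetical helper: returns true, or nil plus a message with the stop position.
local function check(text)
    local stop = lpeg.match(list, text)
    if not stop then
        return nil, "no match at position 1"
    elseif stop <= #text then
        return nil, ("unexpected input at position %d"):format(stop)
    end
    return true
end

print(check("1,2,3"))  --> true
print(check("1,2,x"))  --> nil  unexpected input at position 4
print(check("x"))      --> nil  no match at position 1
```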
