Finite Automata and Lexers in Lua
Overview
A finite automaton is a mathematical model of computation with a finite number of states. It reads an input string symbol by symbol and transitions between states according to defined rules. At the end, it either accepts the input (the string belongs to the language) or rejects it.
Lexers (tokenizers) use finite automata to split source code or text into tokens — identifiers, numbers, keywords, operators. Understanding automata gives you a foundation for building parsers, configuration file readers, and domain-specific languages in Lua.
Deterministic vs non-deterministic automata
A DFA (Deterministic Finite Automaton) has exactly one possible transition for each state-input pair. Given the current state and the next character, you always know exactly which state comes next.
An NFA (Non-deterministic Finite Automaton) can have zero, one, or many possible transitions for a given state-input pair. Multiple paths through the machine might be active simultaneously.
DFAs are faster to execute because there’s no backtracking or multiple active paths. Most lexer implementations compile NFAs or regular expressions into DFAs for this reason. Lua’s own pattern matcher and LPeg both use automaton-based techniques internally.
A simple DFA in Lua
Implement a DFA as a table mapping states to transition functions. Each state has a transition table or function that returns the next state.
Here’s a DFA that recognizes integers (sequences of digits):
-- DFA for non-negative integers
local int_dfa = {
start = "scan_digits",
states = {
scan_digits = {
accept = true,
transitions = {
["0"] = "scan_digits",
["1"] = "scan_digits",
["2"] = "scan_digits",
["3"] = "scan_digits",
["4"] = "scan_digits",
["5"] = "scan_digits",
["6"] = "scan_digits",
["7"] = "scan_digits",
["8"] = "scan_digits",
["9"] = "scan_digits",
}
}
}
}
local function run_dfa(dfa, input)
local state = dfa.start
local pos = 1
while pos <= #input do
local ch = input:sub(pos, pos)
local transitions = dfa.states[state].transitions
if not transitions or not transitions[ch] then
return false, "unexpected character " .. ch .. " at position " .. pos
end
state = transitions[ch]
pos = pos + 1
end
return dfa.states[state].accept, state
end
print(run_dfa(int_dfa, "12345")) --> true
print(run_dfa(int_dfa, "12a3")) --> false
This DFA is hardcoded for demonstration: every digit transition loops back to the same scan_digits state. A real lexer needs distinct states for identifiers, numbers, strings, and operators, with each state defining how to handle every possible input character. The run_dfa function walks through the string one character at a time, consulting the transition table before advancing. When the input ends, the current state’s accept flag determines success or failure.
Building a token lexer
A lexer maps character sequences to token types. It consumes input character by character, building up token values until it hits a transition that doesn’t match.
local function lex(input)
local pos = 1
local tokens = {}
while pos <= #input do
local ch = input:sub(pos, pos)
-- Skip whitespace
if ch == " " or ch == "\t" or ch == "\n" then
pos = pos + 1
-- Integer literal
elseif ch:match("%d") then
local start = pos
while pos <= #input and input:sub(pos, pos):match("%d") do
pos = pos + 1
end
table.insert(tokens, { type = "INT", value = input:sub(start, pos - 1) })
-- Identifier or keyword
elseif ch:match("[a-zA-Z_]") then
local start = pos
while pos <= #input and input:sub(pos, pos):match("[a-zA-Z0-9_]") do
pos = pos + 1
end
local value = input:sub(start, pos - 1)
local keywords = { ["if"] = true, ["then"] = true, ["else"] = true, ["end"] = true }
local type = keywords[value] and "KEYWORD" or "IDENT"
table.insert(tokens, { type = type, value = value })
-- Operators
elseif ch == "+" then
table.insert(tokens, { type = "PLUS", value = "+" })
pos = pos + 1
elseif ch == "-" then
table.insert(tokens, { type = "MINUS", value = "-" })
pos = pos + 1
elseif ch == "*" then
table.insert(tokens, { type = "STAR", value = "*" })
pos = pos + 1
elseif ch == "/" then
table.insert(tokens, { type = "SLASH", value = "/" })
pos = pos + 1
else
return nil, "unrecognized character: " .. ch .. " at " .. pos
end
end
table.insert(tokens, { type = "EOF", value = nil })
return tokens
end
local result = lex("if x + 42 then end")
for _, tok in ipairs(result) do
print(tok.type, tok.value)
end
This hand-written lexer recognizes four token types: integers, identifiers, keywords, and single-character operators. It uses Lua’s %d and [a-zA-Z_] patterns to classify characters, which is simpler than encoding a full DFA transition table for a first pass. The keyword check happens after the identifier is fully collected, a post-processing step rather than a state-machine transition. Running it on "if x + 42 then end" produces a sequence of typed tokens ready for a parser to consume.
Output:
KEYWORD if
IDENT x
PLUS +
INT 42
KEYWORD then
KEYWORD end
EOF nil
The output shows each token tagged with its type: KEYWORD for reserved words, IDENT for user-defined names, INT for numeric literals, and operator-specific tags like PLUS. The final EOF token signals the end of input, which parsers use to detect unexpected trailing content after a complete expression.
Handling multi-character operators
Extend the lexer to handle two-character operators like ==, !=, <=, >=:
local function lex(input)
local pos = 1
local tokens = {}
local function peek(offset)
return input:sub(pos + offset - 1, pos + offset - 1)
end
while pos <= #input do
local ch = input:sub(pos, pos)
-- Skip whitespace
if ch == " " or ch == "\t" or ch == "\n" then
pos = pos + 1
-- Two-character operators
elseif ch == "=" and peek(1) == "=" then
table.insert(tokens, { type = "EQ", value = "==" })
pos = pos + 2
elseif ch == "!" and peek(1) == "=" then
table.insert(tokens, { type = "NEQ", value = "!=" })
pos = pos + 2
elseif ch == "<" and peek(1) == "=" then
table.insert(tokens, { type = "LTE", value = "<=" })
pos = pos + 2
elseif ch == ">" and peek(1) == "=" then
table.insert(tokens, { type = "GTE", value = ">=" })
pos = pos + 2
-- Single-character operators
elseif ch == "+" then
table.insert(tokens, { type = "PLUS", value = "+" })
pos = pos + 1
elseif ch == "-" then
table.insert(tokens, { type = "MINUS", value = "-" })
pos = pos + 1
-- ... rest of lexer
end
end
return tokens
end
The peek helper is the key to multi-character recognition: it checks the next character without advancing the position, letting the lexer decide between a two-character operator like == and a single-character one like =. This lookahead technique avoids backtracking and keeps the scanning pass in linear time. For production lexers, the operator table would be data-driven rather than a chain of elseif branches, but the principle is the same.
String Literals
Add support for quoted strings:
local function lex(input)
local pos = 1
local tokens = {}
while pos <= #input do
local ch = input:sub(pos, pos)
if ch == " " or ch == "\t" or ch == "\n" then
pos = pos + 1
-- Double-quoted string
elseif ch == '"' then
local start = pos
pos = pos + 1
while pos <= #input do
local c = input:sub(pos, pos)
if c == '"' then
pos = pos + 1
break
elseif c == "\\" and pos + 1 <= #input then
pos = pos + 2 -- skip escape sequence
else
pos = pos + 1
end
end
table.insert(tokens, { type = "STRING", value = input:sub(start + 1, pos - 2) })
else
-- ... rest of lexer
pos = pos + 1
end
end
return tokens
end
local result = lex('hello = "world, \\"escaped\\""')
for _, tok in ipairs(result) do
print(tok.type, tok.value)
end
String lexing requires careful handling of escape sequences; the code above skips \" pairs to avoid terminating the string prematurely. In a full lexer, you would also handle single-quoted strings, Lua’s long-string syntax with [[ and ]], and unterminated string errors. Each of these adds a state to the DFA but follows the same character-by-character consumption pattern.
DFA-Driven Lexer
For a more scalable approach, encode the DFA as a state machine table and let the lexer follow transitions explicitly:
-- States: 0 = start, 1 = in_ident, 2 = in_number
local TRANSITIONS = {
[0] = {
ident_start = { next = 1, emit = nil },
digit_start = { next = 2, emit = nil },
whitespace = { next = 0, emit = nil },
op = { next = 0, emit = true },
eof = { next = nil, emit = true },
},
[1] = {
ident_cont = { next = 1, emit = nil },
default = { next = 0, emit = true },
},
[2] = {
digit_cont = { next = 2, emit = nil },
default = { next = 0, emit = true },
},
}
local function is_ident_start(ch)
return ch:match("[a-zA-Z_]") ~= nil
end
local function is_ident_cont(ch)
return ch:match("[a-zA-Z0-9_]") ~= nil
end
local function is_digit(ch)
return ch:match("%d") ~= nil
end
local function is_whitespace(ch)
return ch == " " or ch == "\t" or ch == "\n" or ch == "\r"
end
local function classify(value)
if value:match("^[a-zA-Z_][a-zA-Z0-9_]*$") then
local keywords = { ["if"] = true, ["then"] = true, ["else"] = true, ["end"] = true }
return keywords[value] and "KEYWORD" or "IDENT"
elseif value:match("^%d+$") then
return "INT"
else
return "SYMBOL"
end
end
local function lex_dfa(input)
local tokens = {}
local state = 0
local buffer = ""
local pos = 1
local function flush()
if #buffer > 0 then
table.insert(tokens, { type = classify(buffer), value = buffer })
buffer = ""
end
end
while pos <= #input + 1 do
local ch = pos <= #input and input:sub(pos, pos) or ""
local transitions = TRANSITIONS[state]
local next_state = nil
local emit = false
if is_whitespace(ch) then
flush()
next_state = 0
elseif state == 0 and is_ident_start(ch) then
buffer = buffer .. ch
next_state = 1
elseif state == 1 and is_ident_cont(ch) then
buffer = buffer .. ch
next_state = 1
elseif state == 0 and is_digit(ch) then
buffer = buffer .. ch
next_state = 2
elseif state == 2 and is_digit(ch) then
buffer = buffer .. ch
next_state = 2
elseif ch == "" then
flush()
table.insert(tokens, { type = "EOF", value = nil })
break
else
flush()
if ch ~= "" then
table.insert(tokens, { type = "SYMBOL", value = ch })
end
next_state = 0
end
-- Emit token when transition wants us to
if emit or (next_state ~= state and state ~= 0) then
flush()
end
if next_state == state and state == 0 and ch == "" then
break
end
state = next_state
pos = pos + 1
end
return tokens
end
Why use automata for lexing?
Regular expressions compile to automata internally. When you write lua pattern = "%d+", Lua’s pattern matcher translates it into a state machine that processes the input character by character. LPeg does this more explicitly — it compiles grammar rules into a deterministic automaton that runs in linear time.
Understanding automata helps you:
- Predict what Lua patterns can and can’t match (they correspond to regular languages)
- Debug why a pattern isn’t matching as expected
- Build custom tokenizers when Lua patterns aren’t sufficient
- Use LPeg effectively for complex grammars
The biggest practical payoff is knowing when to reach for a hand-written DFA lexer instead of LPeg or Lua patterns. Hand-written DFAs give you total control over error messages, token priorities, and edge cases that pattern libraries abstract away. For a simple config file reader, Lua patterns are enough. For a full programming language frontend, the state-machine approach scales better because each new token type adds a new state rather than requiring you to rewrite regex patterns that interact in unexpected ways.
Gotchas
Backtracking in naive lexers. The simple lexer above advances character by character and commits when a transition fails. This is linear time but can fail to handle longest-match semantics correctly. A DFA-based lexer always matches the longest possible token at each position.
Lua patterns aren’t full regex. Lua patterns use a simplified notation that corresponds to a subset of regular languages. They can’t match balanced constructs like parentheses, which require a pushdown automaton (a parser, not a lexer).
Unicode in Lua 5.3+.
Lua 5.3+ treats multi-byte UTF-8 characters as multiple bytes. Patterns like %a only match ASCII letters. For Unicode-aware lexing, either limit yourself to ASCII or handle UTF-8 byte sequences explicitly.
See Also
- /guides/lua-string-patterns/, Lua’s pattern language and what it can express
- /guides/lua-lpeg-parsing/, LPeg compiles grammars into deterministic automata for efficient parsing
- /tutorials/lua-fundamentals/pattern-matching/, pattern matching fundamentals in Lua