Finite Automata and Lexers in Lua
Overview
A finite automaton is a mathematical model of computation with a finite number of states. It reads an input string symbol by symbol and transitions between states according to defined rules. At the end, it either accepts the input (the string belongs to the language) or rejects it.
Lexers (tokenizers) use finite automata to split source code or text into tokens — identifiers, numbers, keywords, operators. Understanding automata gives you a foundation for building parsers, configuration file readers, and domain-specific languages in Lua.
Deterministic vs Non-deterministic Automata
A DFA (Deterministic Finite Automaton) has exactly one possible transition for each state-input pair. Given the current state and the next character, you always know exactly which state comes next.
An NFA (Non-deterministic Finite Automaton) can have zero, one, or many possible transitions for a given state-input pair. Multiple paths through the machine might be active simultaneously.
DFAs are faster to execute because there’s no backtracking or multiple active paths. Most lexer implementations compile NFAs or regular expressions into DFAs for this reason. Lua’s own pattern matcher and LPeg both use automaton-based techniques internally.
A Simple DFA in Lua
Implement a DFA as a table mapping states to transition functions. Each state has a transition table or function that returns the next state.
Here’s a DFA that recognizes integers (sequences of digits):
-- DFA for non-negative integers
local int_dfa = {
start = "scan_digits",
states = {
scan_digits = {
accept = true,
transitions = {
["0"] = "scan_digits",
["1"] = "scan_digits",
["2"] = "scan_digits",
["3"] = "scan_digits",
["4"] = "scan_digits",
["5"] = "scan_digits",
["6"] = "scan_digits",
["7"] = "scan_digits",
["8"] = "scan_digits",
["9"] = "scan_digits",
}
}
}
}
local function run_dfa(dfa, input)
local state = dfa.start
local pos = 1
while pos <= #input do
local ch = input:sub(pos, pos)
local transitions = dfa.states[state].transitions
if not transitions or not transitions[ch] then
return false, "unexpected character " .. ch .. " at position " .. pos
end
state = transitions[ch]
pos = pos + 1
end
return dfa.states[state].accept, state
end
print(run_dfa(int_dfa, "12345")) --> true
print(run_dfa(int_dfa, "12a3")) --> false
```lua
## Building a Token Lexer
A lexer maps character sequences to token types. It consumes input character by character, building up token values until it hits a transition that doesn't match.
```lua
local function lex(input)
local pos = 1
local tokens = {}
while pos <= #input do
local ch = input:sub(pos, pos)
-- Skip whitespace
if ch == " " or ch == "\t" or ch == "\n" then
pos = pos + 1
-- Integer literal
elseif ch:match("%d") then
local start = pos
while pos <= #input and input:sub(pos, pos):match("%d") do
pos = pos + 1
end
table.insert(tokens, { type = "INT", value = input:sub(start, pos - 1) })
-- Identifier or keyword
elseif ch:match("[a-zA-Z_]") then
local start = pos
while pos <= #input and input:sub(pos, pos):match("[a-zA-Z0-9_]") do
pos = pos + 1
end
local value = input:sub(start, pos - 1)
local keywords = { ["if"] = true, ["then"] = true, ["else"] = true, ["end"] = true }
local type = keywords[value] and "KEYWORD" or "IDENT"
table.insert(tokens, { type = type, value = value })
-- Operators
elseif ch == "+" then
table.insert(tokens, { type = "PLUS", value = "+" })
pos = pos + 1
elseif ch == "-" then
table.insert(tokens, { type = "MINUS", value = "-" })
pos = pos + 1
elseif ch == "*" then
table.insert(tokens, { type = "STAR", value = "*" })
pos = pos + 1
elseif ch == "/" then
table.insert(tokens, { type = "SLASH", value = "/" })
pos = pos + 1
else
return nil, "unrecognized character: " .. ch .. " at " .. pos
end
end
table.insert(tokens, { type = "EOF", value = nil })
return tokens
end
local result = lex("if x + 42 then end")
for _, tok in ipairs(result) do
print(tok.type, tok.value)
end
```lua
Output:
```lua
KEYWORD if
IDENT x
PLUS +
INT 42
KEYWORD then
KEYWORD end
EOF nil
```lua
## Handling Multi-character Operators
Extend the lexer to handle two-character operators like `==`, `!=`, `<=`, `>=`:
```lua
local function lex(input)
local pos = 1
local tokens = {}
local function peek(offset)
return input:sub(pos + offset - 1, pos + offset - 1)
end
while pos <= #input do
local ch = input:sub(pos, pos)
-- Skip whitespace
if ch == " " or ch == "\t" or ch == "\n" then
pos = pos + 1
-- Two-character operators
elseif ch == "=" and peek(1) == "=" then
table.insert(tokens, { type = "EQ", value = "==" })
pos = pos + 2
elseif ch == "!" and peek(1) == "=" then
table.insert(tokens, { type = "NEQ", value = "!=" })
pos = pos + 2
elseif ch == "<" and peek(1) == "=" then
table.insert(tokens, { type = "LTE", value = "<=" })
pos = pos + 2
elseif ch == ">" and peek(1) == "=" then
table.insert(tokens, { type = "GTE", value = ">=" })
pos = pos + 2
-- Single-character operators
elseif ch == "+" then
table.insert(tokens, { type = "PLUS", value = "+" })
pos = pos + 1
elseif ch == "-" then
table.insert(tokens, { type = "MINUS", value = "-" })
pos = pos + 1
-- ... rest of lexer
end
end
return tokens
end
```lua
## String Literals
Add support for quoted strings:
```lua
local function lex(input)
local pos = 1
local tokens = {}
while pos <= #input do
local ch = input:sub(pos, pos)
if ch == " " or ch == "\t" or ch == "\n" then
pos = pos + 1
-- Double-quoted string
elseif ch == '"' then
local start = pos
pos = pos + 1
while pos <= #input do
local c = input:sub(pos, pos)
if c == '"' then
pos = pos + 1
break
elseif c == "\\" and pos + 1 <= #input then
pos = pos + 2 -- skip escape sequence
else
pos = pos + 1
end
end
table.insert(tokens, { type = "STRING", value = input:sub(start + 1, pos - 2) })
else
-- ... rest of lexer
pos = pos + 1
end
end
return tokens
end
local result = lex('hello = "world, \\"escaped\\""')
for _, tok in ipairs(result) do
print(tok.type, tok.value)
end
```lua
## DFA-Driven Lexer
For a more scalable approach, encode the DFA as a state machine table and let the lexer follow transitions explicitly:
```lua
-- States: 0 = start, 1 = in_ident, 2 = in_number
local TRANSITIONS = {
[0] = {
ident_start = { next = 1, emit = nil },
digit_start = { next = 2, emit = nil },
whitespace = { next = 0, emit = nil },
op = { next = 0, emit = true },
eof = { next = nil, emit = true },
},
[1] = {
ident_cont = { next = 1, emit = nil },
default = { next = 0, emit = true },
},
[2] = {
digit_cont = { next = 2, emit = nil },
default = { next = 0, emit = true },
},
}
local function is_ident_start(ch)
return ch:match("[a-zA-Z_]") ~= nil
end
local function is_ident_cont(ch)
return ch:match("[a-zA-Z0-9_]") ~= nil
end
local function is_digit(ch)
return ch:match("%d") ~= nil
end
local function is_whitespace(ch)
return ch == " " or ch == "\t" or ch == "\n" or ch == "\r"
end
local function classify(value)
if value:match("^[a-zA-Z_][a-zA-Z0-9_]*$") then
local keywords = { ["if"] = true, ["then"] = true, ["else"] = true, ["end"] = true }
return keywords[value] and "KEYWORD" or "IDENT"
elseif value:match("^%d+$") then
return "INT"
else
return "SYMBOL"
end
end
local function lex_dfa(input)
local tokens = {}
local state = 0
local buffer = ""
local pos = 1
local function flush()
if #buffer > 0 then
table.insert(tokens, { type = classify(buffer), value = buffer })
buffer = ""
end
end
while pos <= #input + 1 do
local ch = pos <= #input and input:sub(pos, pos) or ""
local transitions = TRANSITIONS[state]
local next_state = nil
local emit = false
if is_whitespace(ch) then
flush()
next_state = 0
elseif state == 0 and is_ident_start(ch) then
buffer = buffer .. ch
next_state = 1
elseif state == 1 and is_ident_cont(ch) then
buffer = buffer .. ch
next_state = 1
elseif state == 0 and is_digit(ch) then
buffer = buffer .. ch
next_state = 2
elseif state == 2 and is_digit(ch) then
buffer = buffer .. ch
next_state = 2
elseif ch == "" then
flush()
table.insert(tokens, { type = "EOF", value = nil })
break
else
flush()
if ch ~= "" then
table.insert(tokens, { type = "SYMBOL", value = ch })
end
next_state = 0
end
-- Emit token when transition wants us to
if emit or (next_state ~= state and state ~= 0) then
flush()
end
if next_state == state and state == 0 and ch == "" then
break
end
state = next_state
pos = pos + 1
end
return tokens
end
```lua
## Why Use Automata for Lexing?
Regular expressions compile to automata internally. When you write `lua pattern = "%d+"`, Lua's pattern matcher translates it into a state machine that processes the input character by character. LPeg does this more explicitly — it compiles grammar rules into a deterministic automaton that runs in linear time.
Understanding automata helps you:
- Predict what Lua patterns can and can't match (they correspond to regular languages)
- Debug why a pattern isn't matching as expected
- Build custom tokenizers when Lua patterns aren't sufficient
- Use LPeg effectively for complex grammars
## Gotchas
**Backtracking in naive lexers.** The simple lexer above advances character by character and commits when a transition fails. This is linear time but can fail to handle longest-match semantics correctly. A DFA-based lexer always matches the longest possible token at each position.
**Lua patterns aren't full regex.** Lua patterns use a simplified notation that corresponds to a subset of regular languages. They can't match balanced constructs like parentheses, which require a pushdown automaton (a parser, not a lexer).
**Unicode in Lua 5.3+.**
Lua 5.3+ treats multi-byte UTF-8 characters as multiple bytes. Patterns like `%a` only match ASCII letters. For Unicode-aware lexing, either limit yourself to ASCII or handle UTF-8 byte sequences explicitly.
## See Also
- [/guides/lua-string-patterns/](/guides/lua-string-patterns/) — Lua's pattern language and what it can express
- [/guides/lua-lpeg-parsing/](/guides/lua-lpeg-parsing/) — LPeg compiles grammars into deterministic automata for efficient parsing
- [/tutorials/pattern-matching/](/tutorials/pattern-matching/) — pattern matching fundamentals in Lua