luaguides

Finite Automata and Lexers in Lua

Overview

A finite automaton is a mathematical model of computation with a finite number of states. It reads an input string symbol by symbol and transitions between states according to defined rules. At the end, it either accepts the input (the string belongs to the language) or rejects it.

Lexers (tokenizers) use finite automata to split source code or text into tokens — identifiers, numbers, keywords, operators. Understanding automata gives you a foundation for building parsers, configuration file readers, and domain-specific languages in Lua.

Deterministic vs non-deterministic automata

A DFA (Deterministic Finite Automaton) has exactly one possible transition for each state-input pair. Given the current state and the next character, you always know exactly which state comes next.

An NFA (Non-deterministic Finite Automaton) can have zero, one, or many possible transitions for a given state-input pair. Multiple paths through the machine might be active simultaneously.

DFAs are faster to execute because there’s no backtracking or multiple active paths. Most lexer implementations compile NFAs or regular expressions into DFAs for this reason. Lua’s own pattern matcher and LPeg both use automaton-based techniques internally.

A simple DFA in Lua

Implement a DFA as a table mapping states to transition functions. Each state has a transition table or function that returns the next state.

Here’s a DFA that recognizes integers (sequences of digits):

-- DFA for non-negative integers
local int_dfa = {
  start = "scan_digits",
  states = {
    scan_digits = {
      accept = true,
      transitions = {
        ["0"] = "scan_digits",
        ["1"] = "scan_digits",
        ["2"] = "scan_digits",
        ["3"] = "scan_digits",
        ["4"] = "scan_digits",
        ["5"] = "scan_digits",
        ["6"] = "scan_digits",
        ["7"] = "scan_digits",
        ["8"] = "scan_digits",
        ["9"] = "scan_digits",
      }
    }
  }
}

local function run_dfa(dfa, input)
  local state = dfa.start
  local pos = 1

  while pos <= #input do
    local ch = input:sub(pos, pos)
    local transitions = dfa.states[state].transitions

    if not transitions or not transitions[ch] then
      return false, "unexpected character " .. ch .. " at position " .. pos
    end

    state = transitions[ch]
    pos = pos + 1
  end

  return dfa.states[state].accept, state
end

print(run_dfa(int_dfa, "12345"))  --> true
print(run_dfa(int_dfa, "12a3"))   --> false

This DFA is hardcoded for demonstration: every digit transition loops back to the same scan_digits state. A real lexer needs distinct states for identifiers, numbers, strings, and operators, with each state defining how to handle every possible input character. The run_dfa function walks through the string one character at a time, consulting the transition table before advancing. When the input ends, the current state’s accept flag determines success or failure.

Building a token lexer

A lexer maps character sequences to token types. It consumes input character by character, building up token values until it hits a transition that doesn’t match.

local function lex(input)
  local pos = 1
  local tokens = {}

  while pos <= #input do
    local ch = input:sub(pos, pos)

    -- Skip whitespace
    if ch == " " or ch == "\t" or ch == "\n" then
      pos = pos + 1

    -- Integer literal
    elseif ch:match("%d") then
      local start = pos
      while pos <= #input and input:sub(pos, pos):match("%d") do
        pos = pos + 1
      end
      table.insert(tokens, { type = "INT", value = input:sub(start, pos - 1) })

    -- Identifier or keyword
    elseif ch:match("[a-zA-Z_]") then
      local start = pos
      while pos <= #input and input:sub(pos, pos):match("[a-zA-Z0-9_]") do
        pos = pos + 1
      end
      local value = input:sub(start, pos - 1)
      local keywords = { ["if"] = true, ["then"] = true, ["else"] = true, ["end"] = true }
      local type = keywords[value] and "KEYWORD" or "IDENT"
      table.insert(tokens, { type = type, value = value })

    -- Operators
    elseif ch == "+" then
      table.insert(tokens, { type = "PLUS", value = "+" })
      pos = pos + 1
    elseif ch == "-" then
      table.insert(tokens, { type = "MINUS", value = "-" })
      pos = pos + 1
    elseif ch == "*" then
      table.insert(tokens, { type = "STAR", value = "*" })
      pos = pos + 1
    elseif ch == "/" then
      table.insert(tokens, { type = "SLASH", value = "/" })
      pos = pos + 1
    else
      return nil, "unrecognized character: " .. ch .. " at " .. pos
    end
  end

  table.insert(tokens, { type = "EOF", value = nil })
  return tokens
end

local result = lex("if x + 42 then end")
for _, tok in ipairs(result) do
  print(tok.type, tok.value)
end

This hand-written lexer recognizes four token types: integers, identifiers, keywords, and single-character operators. It uses Lua’s %d and [a-zA-Z_] patterns to classify characters, which is simpler than encoding a full DFA transition table for a first pass. The keyword check happens after the identifier is fully collected, a post-processing step rather than a state-machine transition. Running it on "if x + 42 then end" produces a sequence of typed tokens ready for a parser to consume.

Output:

KEYWORD  if
IDENT    x
PLUS     +
INT      42
KEYWORD  then
KEYWORD  end
EOF      nil

The output shows each token tagged with its type: KEYWORD for reserved words, IDENT for user-defined names, INT for numeric literals, and operator-specific tags like PLUS. The final EOF token signals the end of input, which parsers use to detect unexpected trailing content after a complete expression.

Handling multi-character operators

Extend the lexer to handle two-character operators like ==, !=, <=, >=:

local function lex(input)
  local pos = 1
  local tokens = {}

  local function peek(offset)
    return input:sub(pos + offset - 1, pos + offset - 1)
  end

  while pos <= #input do
    local ch = input:sub(pos, pos)

    -- Skip whitespace
    if ch == " " or ch == "\t" or ch == "\n" then
      pos = pos + 1

    -- Two-character operators
    elseif ch == "=" and peek(1) == "=" then
      table.insert(tokens, { type = "EQ", value = "==" })
      pos = pos + 2
    elseif ch == "!" and peek(1) == "=" then
      table.insert(tokens, { type = "NEQ", value = "!=" })
      pos = pos + 2
    elseif ch == "<" and peek(1) == "=" then
      table.insert(tokens, { type = "LTE", value = "<=" })
      pos = pos + 2
    elseif ch == ">" and peek(1) == "=" then
      table.insert(tokens, { type = "GTE", value = ">=" })
      pos = pos + 2

    -- Single-character operators
    elseif ch == "+" then
      table.insert(tokens, { type = "PLUS", value = "+" })
      pos = pos + 1
    elseif ch == "-" then
      table.insert(tokens, { type = "MINUS", value = "-" })
      pos = pos + 1
    -- ... rest of lexer
    end
  end

  return tokens
end

The peek helper is the key to multi-character recognition: it checks the next character without advancing the position, letting the lexer decide between a two-character operator like == and a single-character one like =. This lookahead technique avoids backtracking and keeps the scanning pass in linear time. For production lexers, the operator table would be data-driven rather than a chain of elseif branches, but the principle is the same.

String Literals

Add support for quoted strings:

local function lex(input)
  local pos = 1
  local tokens = {}

  while pos <= #input do
    local ch = input:sub(pos, pos)

    if ch == " " or ch == "\t" or ch == "\n" then
      pos = pos + 1

    -- Double-quoted string
    elseif ch == '"' then
      local start = pos
      pos = pos + 1
      while pos <= #input do
        local c = input:sub(pos, pos)
        if c == '"' then
          pos = pos + 1
          break
        elseif c == "\\" and pos + 1 <= #input then
          pos = pos + 2  -- skip escape sequence
        else
          pos = pos + 1
        end
      end
      table.insert(tokens, { type = "STRING", value = input:sub(start + 1, pos - 2) })

    else
      -- ... rest of lexer
      pos = pos + 1
    end
  end

  return tokens
end

local result = lex('hello = "world, \\"escaped\\""')
for _, tok in ipairs(result) do
  print(tok.type, tok.value)
end

String lexing requires careful handling of escape sequences; the code above skips \" pairs to avoid terminating the string prematurely. In a full lexer, you would also handle single-quoted strings, Lua’s long-string syntax with [[ and ]], and unterminated string errors. Each of these adds a state to the DFA but follows the same character-by-character consumption pattern.

DFA-Driven Lexer

For a more scalable approach, encode the DFA as a state machine table and let the lexer follow transitions explicitly:

-- States: 0 = start, 1 = in_ident, 2 = in_number
local TRANSITIONS = {
  [0] = {
    ident_start  = { next = 1, emit = nil },
    digit_start = { next = 2, emit = nil },
    whitespace   = { next = 0, emit = nil },
    op          = { next = 0, emit = true },
    eof         = { next = nil, emit = true },
  },
  [1] = {
    ident_cont = { next = 1, emit = nil },
    default    = { next = 0, emit = true },
  },
  [2] = {
    digit_cont = { next = 2, emit = nil },
    default   = { next = 0, emit = true },
  },
}

local function is_ident_start(ch)
  return ch:match("[a-zA-Z_]") ~= nil
end

local function is_ident_cont(ch)
  return ch:match("[a-zA-Z0-9_]") ~= nil
end

local function is_digit(ch)
  return ch:match("%d") ~= nil
end

local function is_whitespace(ch)
  return ch == " " or ch == "\t" or ch == "\n" or ch == "\r"
end

local function classify(value)
  if value:match("^[a-zA-Z_][a-zA-Z0-9_]*$") then
    local keywords = { ["if"] = true, ["then"] = true, ["else"] = true, ["end"] = true }
    return keywords[value] and "KEYWORD" or "IDENT"
  elseif value:match("^%d+$") then
    return "INT"
  else
    return "SYMBOL"
  end
end

local function lex_dfa(input)
  local tokens = {}
  local state = 0
  local buffer = ""
  local pos = 1

  local function flush()
    if #buffer > 0 then
      table.insert(tokens, { type = classify(buffer), value = buffer })
      buffer = ""
    end
  end

  while pos <= #input + 1 do
    local ch = pos <= #input and input:sub(pos, pos) or ""
    local transitions = TRANSITIONS[state]

    local next_state = nil
    local emit = false

    if is_whitespace(ch) then
      flush()
      next_state = 0
    elseif state == 0 and is_ident_start(ch) then
      buffer = buffer .. ch
      next_state = 1
    elseif state == 1 and is_ident_cont(ch) then
      buffer = buffer .. ch
      next_state = 1
    elseif state == 0 and is_digit(ch) then
      buffer = buffer .. ch
      next_state = 2
    elseif state == 2 and is_digit(ch) then
      buffer = buffer .. ch
      next_state = 2
    elseif ch == "" then
      flush()
      table.insert(tokens, { type = "EOF", value = nil })
      break
    else
      flush()
      if ch ~= "" then
        table.insert(tokens, { type = "SYMBOL", value = ch })
      end
      next_state = 0
    end

    -- Emit token when transition wants us to
    if emit or (next_state ~= state and state ~= 0) then
      flush()
    end

    if next_state == state and state == 0 and ch == "" then
      break
    end

    state = next_state
    pos = pos + 1
  end

  return tokens
end

Why use automata for lexing?

Regular expressions compile to automata internally. When you write lua pattern = "%d+", Lua’s pattern matcher translates it into a state machine that processes the input character by character. LPeg does this more explicitly — it compiles grammar rules into a deterministic automaton that runs in linear time.

Understanding automata helps you:

  • Predict what Lua patterns can and can’t match (they correspond to regular languages)
  • Debug why a pattern isn’t matching as expected
  • Build custom tokenizers when Lua patterns aren’t sufficient
  • Use LPeg effectively for complex grammars

The biggest practical payoff is knowing when to reach for a hand-written DFA lexer instead of LPeg or Lua patterns. Hand-written DFAs give you total control over error messages, token priorities, and edge cases that pattern libraries abstract away. For a simple config file reader, Lua patterns are enough. For a full programming language frontend, the state-machine approach scales better because each new token type adds a new state rather than requiring you to rewrite regex patterns that interact in unexpected ways.

Gotchas

Backtracking in naive lexers. The simple lexer above advances character by character and commits when a transition fails. This is linear time but can fail to handle longest-match semantics correctly. A DFA-based lexer always matches the longest possible token at each position.

Lua patterns aren’t full regex. Lua patterns use a simplified notation that corresponds to a subset of regular languages. They can’t match balanced constructs like parentheses, which require a pushdown automaton (a parser, not a lexer).

Unicode in Lua 5.3+. Lua 5.3+ treats multi-byte UTF-8 characters as multiple bytes. Patterns like %a only match ASCII letters. For Unicode-aware lexing, either limit yourself to ASCII or handle UTF-8 byte sequences explicitly.

See Also