luaguides

Finite Automata and Lexers in Lua

Overview

A finite automaton is a mathematical model of computation with a finite number of states. It reads an input string symbol by symbol and transitions between states according to defined rules. At the end, it either accepts the input (the string belongs to the language) or rejects it.

Lexers (tokenizers) use finite automata to split source code or text into tokens — identifiers, numbers, keywords, operators. Understanding automata gives you a foundation for building parsers, configuration file readers, and domain-specific languages in Lua.

Deterministic vs Non-deterministic Automata

A DFA (Deterministic Finite Automaton) has exactly one possible transition for each state-input pair. Given the current state and the next character, you always know exactly which state comes next.

An NFA (Non-deterministic Finite Automaton) can have zero, one, or many possible transitions for a given state-input pair. Multiple paths through the machine might be active simultaneously.

DFAs are faster to execute because there’s no backtracking or multiple active paths. Most lexer implementations compile NFAs or regular expressions into DFAs for this reason. Lua’s own pattern matcher and LPeg both use automaton-based techniques internally.

A Simple DFA in Lua

Implement a DFA as a table mapping states to transition functions. Each state has a transition table or function that returns the next state.

Here’s a DFA that recognizes integers (sequences of digits):

-- DFA for non-negative integers
local int_dfa = {
  start = "scan_digits",
  states = {
    scan_digits = {
      accept = true,
      transitions = {
        ["0"] = "scan_digits",
        ["1"] = "scan_digits",
        ["2"] = "scan_digits",
        ["3"] = "scan_digits",
        ["4"] = "scan_digits",
        ["5"] = "scan_digits",
        ["6"] = "scan_digits",
        ["7"] = "scan_digits",
        ["8"] = "scan_digits",
        ["9"] = "scan_digits",
      }
    }
  }
}

local function run_dfa(dfa, input)
  local state = dfa.start
  local pos = 1

  while pos <= #input do
    local ch = input:sub(pos, pos)
    local transitions = dfa.states[state].transitions

    if not transitions or not transitions[ch] then
      return false, "unexpected character " .. ch .. " at position " .. pos
    end

    state = transitions[ch]
    pos = pos + 1
  end

  return dfa.states[state].accept, state
end

print(run_dfa(int_dfa, "12345"))  --> true
print(run_dfa(int_dfa, "12a3"))   --> false
```lua

## Building a Token Lexer

A lexer maps character sequences to token types. It consumes input character by character, building up token values until it hits a transition that doesn't match.

```lua
local function lex(input)
  local pos = 1
  local tokens = {}

  while pos <= #input do
    local ch = input:sub(pos, pos)

    -- Skip whitespace
    if ch == " " or ch == "\t" or ch == "\n" then
      pos = pos + 1

    -- Integer literal
    elseif ch:match("%d") then
      local start = pos
      while pos <= #input and input:sub(pos, pos):match("%d") do
        pos = pos + 1
      end
      table.insert(tokens, { type = "INT", value = input:sub(start, pos - 1) })

    -- Identifier or keyword
    elseif ch:match("[a-zA-Z_]") then
      local start = pos
      while pos <= #input and input:sub(pos, pos):match("[a-zA-Z0-9_]") do
        pos = pos + 1
      end
      local value = input:sub(start, pos - 1)
      local keywords = { ["if"] = true, ["then"] = true, ["else"] = true, ["end"] = true }
      local type = keywords[value] and "KEYWORD" or "IDENT"
      table.insert(tokens, { type = type, value = value })

    -- Operators
    elseif ch == "+" then
      table.insert(tokens, { type = "PLUS", value = "+" })
      pos = pos + 1
    elseif ch == "-" then
      table.insert(tokens, { type = "MINUS", value = "-" })
      pos = pos + 1
    elseif ch == "*" then
      table.insert(tokens, { type = "STAR", value = "*" })
      pos = pos + 1
    elseif ch == "/" then
      table.insert(tokens, { type = "SLASH", value = "/" })
      pos = pos + 1
    else
      return nil, "unrecognized character: " .. ch .. " at " .. pos
    end
  end

  table.insert(tokens, { type = "EOF", value = nil })
  return tokens
end

local result = lex("if x + 42 then end")
for _, tok in ipairs(result) do
  print(tok.type, tok.value)
end
```lua

Output:
```lua
KEYWORD  if
IDENT    x
PLUS     +
INT      42
KEYWORD  then
KEYWORD  end
EOF      nil
```lua

## Handling Multi-character Operators

Extend the lexer to handle two-character operators like `==`, `!=`, `<=`, `>=`:

```lua
local function lex(input)
  local pos = 1
  local tokens = {}

  local function peek(offset)
    return input:sub(pos + offset - 1, pos + offset - 1)
  end

  while pos <= #input do
    local ch = input:sub(pos, pos)

    -- Skip whitespace
    if ch == " " or ch == "\t" or ch == "\n" then
      pos = pos + 1

    -- Two-character operators
    elseif ch == "=" and peek(1) == "=" then
      table.insert(tokens, { type = "EQ", value = "==" })
      pos = pos + 2
    elseif ch == "!" and peek(1) == "=" then
      table.insert(tokens, { type = "NEQ", value = "!=" })
      pos = pos + 2
    elseif ch == "<" and peek(1) == "=" then
      table.insert(tokens, { type = "LTE", value = "<=" })
      pos = pos + 2
    elseif ch == ">" and peek(1) == "=" then
      table.insert(tokens, { type = "GTE", value = ">=" })
      pos = pos + 2

    -- Single-character operators
    elseif ch == "+" then
      table.insert(tokens, { type = "PLUS", value = "+" })
      pos = pos + 1
    elseif ch == "-" then
      table.insert(tokens, { type = "MINUS", value = "-" })
      pos = pos + 1
    -- ... rest of lexer
    end
  end

  return tokens
end
```lua

## String Literals

Add support for quoted strings:

```lua
local function lex(input)
  local pos = 1
  local tokens = {}

  while pos <= #input do
    local ch = input:sub(pos, pos)

    if ch == " " or ch == "\t" or ch == "\n" then
      pos = pos + 1

    -- Double-quoted string
    elseif ch == '"' then
      local start = pos
      pos = pos + 1
      while pos <= #input do
        local c = input:sub(pos, pos)
        if c == '"' then
          pos = pos + 1
          break
        elseif c == "\\" and pos + 1 <= #input then
          pos = pos + 2  -- skip escape sequence
        else
          pos = pos + 1
        end
      end
      table.insert(tokens, { type = "STRING", value = input:sub(start + 1, pos - 2) })

    else
      -- ... rest of lexer
      pos = pos + 1
    end
  end

  return tokens
end

local result = lex('hello = "world, \\"escaped\\""')
for _, tok in ipairs(result) do
  print(tok.type, tok.value)
end
```lua

## DFA-Driven Lexer

For a more scalable approach, encode the DFA as a state machine table and let the lexer follow transitions explicitly:

```lua
-- States: 0 = start, 1 = in_ident, 2 = in_number
local TRANSITIONS = {
  [0] = {
    ident_start  = { next = 1, emit = nil },
    digit_start = { next = 2, emit = nil },
    whitespace   = { next = 0, emit = nil },
    op          = { next = 0, emit = true },
    eof         = { next = nil, emit = true },
  },
  [1] = {
    ident_cont = { next = 1, emit = nil },
    default    = { next = 0, emit = true },
  },
  [2] = {
    digit_cont = { next = 2, emit = nil },
    default   = { next = 0, emit = true },
  },
}

local function is_ident_start(ch)
  return ch:match("[a-zA-Z_]") ~= nil
end

local function is_ident_cont(ch)
  return ch:match("[a-zA-Z0-9_]") ~= nil
end

local function is_digit(ch)
  return ch:match("%d") ~= nil
end

local function is_whitespace(ch)
  return ch == " " or ch == "\t" or ch == "\n" or ch == "\r"
end

local function classify(value)
  if value:match("^[a-zA-Z_][a-zA-Z0-9_]*$") then
    local keywords = { ["if"] = true, ["then"] = true, ["else"] = true, ["end"] = true }
    return keywords[value] and "KEYWORD" or "IDENT"
  elseif value:match("^%d+$") then
    return "INT"
  else
    return "SYMBOL"
  end
end

local function lex_dfa(input)
  local tokens = {}
  local state = 0
  local buffer = ""
  local pos = 1

  local function flush()
    if #buffer > 0 then
      table.insert(tokens, { type = classify(buffer), value = buffer })
      buffer = ""
    end
  end

  while pos <= #input + 1 do
    local ch = pos <= #input and input:sub(pos, pos) or ""
    local transitions = TRANSITIONS[state]

    local next_state = nil
    local emit = false

    if is_whitespace(ch) then
      flush()
      next_state = 0
    elseif state == 0 and is_ident_start(ch) then
      buffer = buffer .. ch
      next_state = 1
    elseif state == 1 and is_ident_cont(ch) then
      buffer = buffer .. ch
      next_state = 1
    elseif state == 0 and is_digit(ch) then
      buffer = buffer .. ch
      next_state = 2
    elseif state == 2 and is_digit(ch) then
      buffer = buffer .. ch
      next_state = 2
    elseif ch == "" then
      flush()
      table.insert(tokens, { type = "EOF", value = nil })
      break
    else
      flush()
      if ch ~= "" then
        table.insert(tokens, { type = "SYMBOL", value = ch })
      end
      next_state = 0
    end

    -- Emit token when transition wants us to
    if emit or (next_state ~= state and state ~= 0) then
      flush()
    end

    if next_state == state and state == 0 and ch == "" then
      break
    end

    state = next_state
    pos = pos + 1
  end

  return tokens
end
```lua

## Why Use Automata for Lexing?

Regular expressions compile to automata internally. When you write `lua pattern = "%d+"`, Lua's pattern matcher translates it into a state machine that processes the input character by character. LPeg does this more explicitly — it compiles grammar rules into a deterministic automaton that runs in linear time.

Understanding automata helps you:
- Predict what Lua patterns can and can't match (they correspond to regular languages)
- Debug why a pattern isn't matching as expected
- Build custom tokenizers when Lua patterns aren't sufficient
- Use LPeg effectively for complex grammars

## Gotchas

**Backtracking in naive lexers.** The simple lexer above advances character by character and commits when a transition fails. This is linear time but can fail to handle longest-match semantics correctly. A DFA-based lexer always matches the longest possible token at each position.

**Lua patterns aren't full regex.** Lua patterns use a simplified notation that corresponds to a subset of regular languages. They can't match balanced constructs like parentheses, which require a pushdown automaton (a parser, not a lexer).

**Unicode in Lua 5.3+.**
Lua 5.3+ treats multi-byte UTF-8 characters as multiple bytes. Patterns like `%a` only match ASCII letters. For Unicode-aware lexing, either limit yourself to ASCII or handle UTF-8 byte sequences explicitly.

## See Also

- [/guides/lua-string-patterns/](/guides/lua-string-patterns/) — Lua's pattern language and what it can express
- [/guides/lua-lpeg-parsing/](/guides/lua-lpeg-parsing/) — LPeg compiles grammars into deterministic automata for efficient parsing
- [/tutorials/pattern-matching/](/tutorials/pattern-matching/) — pattern matching fundamentals in Lua