
Profiling Lua Code for Bottlenecks

A profiler tells you which parts of your program are slow — the bottlenecks. Without profiling Lua code, you guess. You optimise what feels slow, miss the real culprit, and end up with code that is faster in all the wrong places.

Lua makes profiling straightforward because the standard library already provides the building blocks: os.clock() for timing and the debug library for hooks. You do not need special tooling or recompiled binaries. This guide walks through the main approaches, from quick manual timing to a statistical sampler written in pure Lua, and then covers LuaJIT's built-in profiler.

Manual Timing with os.clock()

The fastest way to measure a function is to call it between two os.clock() readings:

local start = os.clock()
my_function(arg1, arg2)
local elapsed = os.clock() - start
print(("Took %.4f seconds"):format(elapsed))

os.clock() returns an approximation of the CPU time used by the program, in seconds. Its resolution depends on the platform's C clock() implementation, so rely on a single reading only for functions that take at least a few milliseconds.

For sub-millisecond measurements, the resolution is too coarse, and system scheduling noise dominates the result. In that case, run the function many times and divide:

local N = 100000
local start = os.clock()
for i = 1, N do
  my_function(arg1, arg2)
end
local total = os.clock() - start
print(("%.6f ms per call"):format(total / N * 1000))

Running a function in a tight loop N times and dividing by N gives you the per-call average, which smooths out the jitter from os.clock()’s finite resolution. For very fast functions that complete in under a microsecond, you may need hundreds of thousands of iterations before the measurement stabilizes. The average also includes the cost of the loop and of the call itself, so treat the result for a trivial function as an upper bound. Even with this approach, one unpredictable factor remains: the garbage collector runs on its own schedule and can inflate any single timing run.
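The loop-and-divide pattern can be wrapped in a small helper so every measurement is taken the same way. This is a minimal sketch: bench and sum_to are hypothetical names chosen for illustration, and sum_to only stands in for whatever function you want to measure.

```lua
-- Hypothetical helper: average CPU time per call, in milliseconds.
local function bench(fn, n, ...)
  local start = os.clock()
  for _ = 1, n do
    fn(...)
  end
  return (os.clock() - start) / n * 1000
end

-- Example workload (an assumption; stands in for my_function).
local function sum_to(limit)
  local total = 0
  for i = 1, limit do total = total + i end
  return total
end

local ms = bench(sum_to, 10000, 100)
print(("%.6f ms per call"):format(ms))
```

Passing the arguments through ... keeps the helper generic without building a closure inside the timed loop.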

Disable the garbage collector

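The garbage collector runs unpredictably. A collection cycle that lands inside a timed section inflates that run, and the effect is largest for code that allocates heavily. Stop the GC before timing and restart it after. Memory grows while the collector is stopped, so keep the timed section short: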

collectgarbage("stop")
local start = os.clock()
for i = 1, N do
  my_function()
end
local elapsed = os.clock() - start
collectgarbage("restart")
print(("%.4f ms per call"):format(elapsed * 1000 / N))

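This removes collector pauses from the measurement and makes runs more repeatable. Keep in mind that the number then excludes the GC cost the code will pay in normal operation.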
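One way to make the stop/restart pattern safer is to guarantee that the collector is restarted even if the measured function raises an error, and to take the best of several trials. The sketch below assumes a hypothetical helper named bench_no_gc; the trial count and the string workload are arbitrary choices for illustration.

```lua
-- Hypothetical helper: best-of-k timing with the collector stopped.
local function bench_no_gc(fn, n, trials)
  local best = math.huge
  for _ = 1, trials do
    collectgarbage("collect")     -- start each trial from a clean heap
    collectgarbage("stop")
    local start = os.clock()
    local ok, err = pcall(function()
      for _ = 1, n do fn() end
    end)
    local elapsed = os.clock() - start
    collectgarbage("restart")     -- always restart, even if fn raised
    if not ok then error(err, 0) end
    if elapsed < best then best = elapsed end
  end
  return best / n * 1000          -- milliseconds per call
end

local ms = bench_no_gc(function() return ("x"):rep(10) end, 10000, 5)
print(("%.6f ms per call (best of 5)"):format(ms))
```

Taking the minimum rather than the mean is a design choice: it reports the run least disturbed by other activity on the machine.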

Statistical Sampling with debug.sethook()

Manual timing tells you how long a specific function takes. A statistical sampler tells you where your program spends its time overall — especially useful when you do not know where to start.

The debug.sethook() function registers a hook that fires on events. You can hook on "call", "return", "line", and "count" events; a count hook fires after every N VM instructions. The strategy is to use a count hook to sample the call stack at regular instruction intervals and build a picture of where the program was each time the hook fired.

Here is a minimal pure-Lua sampler:

local counts = {}

local function sampler()
  -- Walk the stack and record each active function.
  -- Level 1 is this hook itself, so start at level 2.
  for level = 2, 100 do
    local info = debug.getinfo(level, "nSl")
    if not info then break end
    local name = info.name or "(anon)"
    local key = ("%s:%d %s"):format(info.short_src, info.currentline, name)
    counts[key] = (counts[key] or 0) + 1
  end
end

-- Sample every 1000 VM instructions for 5 seconds
debug.sethook(sampler, "", 1000)
local start = os.clock()
while (os.clock() - start) < 5 do end  -- replace with your program
debug.sethook()

-- Print top hotspots (up to 20)
local sorted = {}
for k, v in pairs(counts) do sorted[#sorted + 1] = {k, v} end
table.sort(sorted, function(a, b) return a[2] > b[2] end)
for i = 1, math.min(20, #sorted) do
  print(("%6d  %s"):format(sorted[i][2], sorted[i][1]))
end

The output lists source locations and function names sorted by hit count. The highest counts are your hotspots. Because the sampler records every frame on the stack, the counts are inclusive: a caller is counted whenever any of its callees is running. This works in Lua 5.1 through 5.4 without any C extensions.

The main limitation: it samples by VM instruction count, not by wall time. Time spent inside C functions, or blocked on I/O or sleeping, executes no Lua instructions and is therefore under-represented. The hook also adds overhead of its own, so choose an interval large enough to keep the program close to its normal speed, and use it on representative workloads.
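For repeated use, the hook setup can be wrapped in a function that takes the code to profile and guarantees the hook is removed afterwards. The sketch below is one possible shape, not an established API: sample_self and hot are hypothetical names, and the interval of 1000 instructions is an assumption. Unlike a full stack walk, it records only the innermost Lua frame, which is cheaper and yields self counts.

```lua
-- Hypothetical helper: run fn under a count hook and return how often each
-- function was the innermost Lua frame when the hook fired (self samples).
local function sample_self(fn, every)
  local counts = {}
  local function hook()
    -- Level 2 is the function that was running; level 1 is this hook.
    local info = debug.getinfo(2, "S")
    if info then
      local key = ("%s:%d"):format(info.short_src, info.linedefined)
      counts[key] = (counts[key] or 0) + 1
    end
  end
  debug.sethook(hook, "", every)
  local ok, err = pcall(fn)
  debug.sethook()                 -- always remove the hook
  if not ok then error(err, 0) end
  return counts
end

local function hot()              -- hypothetical workload
  local x = 0
  for i = 1, 300000 do x = x + i % 3 end
  return x
end

local counts = sample_self(hot, 1000)
local total = 0
for key, n in pairs(counts) do
  total = total + n
  print(n, key)
end
```

Keys combine the source name with the line where the function is defined, so anonymous functions remain distinguishable.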

LuaJIT Profiling

If you are running on LuaJIT, the jit.profile module gives you built-in statistical sampling:

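local profile = require("jit.profile")

-- "l": report at line granularity; samples are taken on a timer
profile.start("l", function(thread, samples, vmstate)
  -- samples: number of samples accumulated since the last callback
  -- vmstate: "N" native, "I" interpreted, "C" C code, "G" GC, "J" JIT compiler
  print(samples, vmstate, profile.dumpstack(thread, "l", 1))
end)

-- your code here --

profile.stop()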

The "l" flag selects line-level granularity, so stack dumps identify the source line that was executing. Samples are taken on a timer (every 10 ms by default; append "i" and a number to the mode string to set the interval in milliseconds), not on every line executed. The callback receives the thread, the number of samples accumulated since the last callback, and the VM state, letting you distinguish between interpreted and JIT-compiled execution.
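A quick way to use the VM state is to tally samples per state and see how much of the run was interpreted versus compiled. The sketch below assumes LuaJIT 2.1 with the jit.profile module; the require is guarded so the block is skipped on plain Lua. The workload function and the 5 ms interval are illustrative assumptions.

```lua
-- Sketch: tally samples per VM state. Runs only under LuaJIT; on plain
-- Lua the require fails and the profiling branch is skipped.
local ok, profile = pcall(require, "jit.profile")
local vm_counts = {}

local function workload()          -- hypothetical stand-in for your program
  local t = {}
  for i = 1, 2000000 do t[i % 100 + 1] = i * 2 end
  return t
end

if ok then
  -- "l": line granularity; "i5": sample every 5 ms (assumed interval)
  profile.start("li5", function(thread, samples, vmstate)
    vm_counts[vmstate] = (vm_counts[vmstate] or 0) + samples
  end)
  workload()
  profile.stop()
  for state, n in pairs(vm_counts) do print(state, n) end
else
  print("jit.profile not available; run this under LuaJIT")
end
```

A run dominated by "I" (interpreted) samples suggests the hot code is not being compiled, which is a cue to look at the trace output.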

To see what the JIT compiler itself is doing, use the jit.dump module. Turn the dump on before the code you care about and off afterwards:

local dump = require("jit.dump")
dump.on("ti")   -- t: trace start/stop/abort events, i: IR

-- your code here --

dump.off()

This prints each trace as it is recorded, together with its IR. Large traces with many instructions, and traces that abort repeatedly, are worth examining for optimisation opportunities.

The strategy: narrow down, then measure

Profiling is not a one-shot operation. Follow this cycle:

  1. Sample broadly: use the debug hook profiler to find the general area (a module, a function, a loop)
  2. Isolate the culprit: comment out or stub sections until the hot spot disappears
  3. Time the fix: benchmark the specific function before and after

Most bottlenecks live in a small part of the code. Getting to that part is the hard part. Once you know where to look, fixing it is usually obvious.

The stub-and-time technique

When a program has deep call chains, isolate the slow layer by replacing the innermost function with a no-op:

local function original_slow(path)
  -- do complex parsing
end

-- Temporarily stub it out
local function stub(path) return {} end
original_slow = stub

-- Now measure
local start = os.clock()
main_loop()
print(("With stub: %.4f seconds"):format(os.clock() - start))

If the total time drops significantly, the bottleneck is in original_slow or its callees. If it barely changes, the bottleneck is somewhere else. This tells you where to dig without any tools.
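The swap can be made reversible by keeping the function in a table and restoring it after the timed run. The following is a minimal sketch under stated assumptions: parser, main_loop, and time_with_stub are hypothetical names, and the busy loop only simulates expensive parsing.

```lua
-- Hypothetical module with a slow inner function.
local parser = {}
function parser.parse(path)
  local n = 0
  for i = 1, 200000 do n = n + i % 7 end    -- simulated heavy work
  return { path = path, checksum = n }
end

local function main_loop()
  for i = 1, 20 do parser.parse("file" .. i) end
end

-- Time `workload` with mod[name] temporarily replaced by `stub`.
local function time_with_stub(mod, name, stub, workload)
  local original = mod[name]
  mod[name] = stub
  local start = os.clock()
  local ok, err = pcall(workload)
  local elapsed = os.clock() - start
  mod[name] = original                       -- always restore
  if not ok then error(err, 0) end
  return elapsed
end

local start = os.clock()
main_loop()
local baseline = os.clock() - start
local stubbed = time_with_stub(parser, "parse", function() return {} end, main_loop)
print(("baseline %.4f s, with stub %.4f s"):format(baseline, stubbed))
```

Storing the function in a table field means every caller sees the stub, whereas reassigning a local only affects code that captured that same local.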

Common bottlenecks to watch for

Even without a profiler, these patterns are frequent culprits:

Repeated table resizing. Every t[#t+1] = val is a potential resize if the array part is full. In hot loops building large tables, pre-size the table where the runtime allows it: table.create(n) in Lua 5.5 and Luau, or table.new(n, 0) in LuaJIT (loaded with require("table.new")). Lua 5.4 and earlier offer no pre-sizing function from pure Lua:

local large = table.create(100000)  -- pre-allocated (Lua 5.5 / Luau)
for i = 1, 100000 do
  large[i] = compute(i)
end

Pre-allocating tells the runtime to reserve n slots in the array part up front, so the loop body never triggers a resize. The gain is most visible when you know the final size ahead of time, as with fixed-size buffers or pre-computed result sets. LuaJIT's table.new(n, 0) takes separate array and hash part sizes.
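Because the pre-sizing function differs between runtimes, portable code can select it once at load time. The sketch below assumes a hypothetical wrapper named new_table that falls back to an ordinary empty table when no pre-sizing function is available; compute is a placeholder workload.

```lua
-- Pick whichever pre-sizing constructor this runtime offers.
local new_table
if type(table.create) == "function" then
  new_table = function(n) return table.create(n) end     -- Lua 5.5, Luau
else
  local ok, table_new = pcall(require, "table.new")      -- LuaJIT 2.1
  if ok then
    new_table = function(n) return table_new(n, 0) end
  else
    new_table = function() return {} end                 -- no pre-sizing here
  end
end

local function compute(i) return i * i end   -- hypothetical workload

local N = 100000
local large = new_table(N)
for i = 1, N do
  large[i] = compute(i)
end
print(#large)  --> 100000
```

The fallback keeps the code correct everywhere; only the allocation behaviour differs.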

Global lookups in tight loops. Each reference to math.sin costs two table lookups: one for the global math and one for its sin field. Cache it locally:

local sin, cos = math.sin, math.cos  -- look up once
local sum = 0
for i = 1, N do
  sum = sum + sin(i) + cos(i)
end

Localizing library functions is one of the cheapest performance wins in Lua. The speedup comes from replacing two table lookups per call with a single local variable access; the call itself costs the same. The same principle applies to any frequently called library function, not just math functions: table.insert, string.format, and io.write all benefit. In LuaJIT, the JIT compiler can sometimes hoist these lookups automatically, but explicit caching guarantees the optimization regardless of trace boundaries or compilation heuristics.
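The effect is easy to check on your own interpreter with two otherwise identical loops. This is a rough sketch: the iteration count is an arbitrary assumption, and the measured difference varies by Lua version and may vanish under LuaJIT.

```lua
local N = 200000

-- Variant 1: look up math.sin and math.cos on every iteration.
local start = os.clock()
local sum_global = 0
for i = 1, N do
  sum_global = sum_global + math.sin(i) + math.cos(i)
end
local t_global = os.clock() - start

-- Variant 2: cache the functions in locals first.
local sin, cos = math.sin, math.cos
start = os.clock()
local sum_local = 0
for i = 1, N do
  sum_local = sum_local + sin(i) + cos(i)
end
local t_local = os.clock() - start

print(("global: %.4f s, local: %.4f s"):format(t_global, t_local))
```

Both loops compute the same sum, so any timing difference comes from name resolution alone.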

String concatenation in loops. s = s .. chunk reallocates and copies on every iteration. Use a table:

local parts = {}
for line in lines do
  parts[#parts + 1] = process(line)
end
local result = table.concat(parts, "\n")

Using table.concat on a pre-built parts array converts an O(n²) string-building operation into O(n), because table.concat assembles the result in a buffer rather than creating a new, longer string on every iteration of the loop. The threshold where this difference matters varies by string length, but the gap between the naive and table-based approaches widens quickly as the iteration count grows.
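To find where the threshold lies for your data, time both approaches on the same input. The sketch below uses an assumed input of 20000 short lines; adjust N and the line contents to match your workload.

```lua
local N = 20000

-- Naive approach: the growing string is copied on every iteration.
local start = os.clock()
local s = ""
for i = 1, N do
  s = s .. "line " .. i .. "\n"
end
local t_concat = os.clock() - start

-- Table approach: collect the pieces, join once at the end.
start = os.clock()
local parts = {}
for i = 1, N do
  parts[#parts + 1] = "line " .. i .. "\n"
end
local joined = table.concat(parts)
local t_table = os.clock() - start

print(("..: %.4f s, table.concat: %.4f s"):format(t_concat, t_table))
```

Doubling N and rerunning shows the quadratic growth directly: the naive time roughly quadruples while the table-based time roughly doubles.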

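Metatable __index on every access. If you rely on __index to provide defaults, the metamethod is consulted on every access to a key that is missing from the table itself. When the same defaults are read repeatedly in hot code, copy them into the table after construction so that later reads are direct, taking care not to overwrite values the table already holds:

for k, v in pairs(defaults) do
  if rawget(t, k) == nil then
    t[k] = v  -- now direct, no __index overhead
  end
end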
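Counting metamethod invocations makes the cost visible. In this sketch the defaults table, its fields, and the index_calls counter are all hypothetical; __index is written as a function so each fallback can be counted.

```lua
local defaults = { host = "localhost", port = 8080 }
local index_calls = 0

local t = setmetatable({ port = 9090 }, {
  __index = function(_, key)
    index_calls = index_calls + 1      -- count every fallback lookup
    return defaults[key]
  end,
})

for _ = 1, 1000 do local h = t.host end
local before = index_calls
print(before)        --> 1000

-- Copy missing defaults into the table itself.
for k, v in pairs(defaults) do
  if rawget(t, k) == nil then t[k] = v end
end

index_calls = 0
for _ = 1, 1000 do local h = t.host end
print(index_calls)   --> 0
```

The trade-off is that the table no longer tracks later changes to defaults, so this suits values that are fixed after construction.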

Reading profilers correctly

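A profiler shows where time was spent, not why it was slow. A line that appears in 80% of samples may look like the culprit, but it could be calling a slow C function deep in a library. Follow the call chain in both directions: down into the callees to see what the line actually executes, and up to the callers to see why it runs so often.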

Also watch for self time vs total time. A function that shows 30% of total time may be doing real work, or it may spend 28 of those 30 percentage points inside a single callee. Most profilers report self time (inside the function only) and total time (including callees). Look at both. A sampler that counts every frame on the stack, like the hook-based one in this guide, reports total time.
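Given a set of stack samples, both figures can be derived in one pass. The sketch below uses made-up sample data with hypothetical function names; each sample lists frames innermost first.

```lua
-- Each sample is a stack, innermost frame first (hypothetical data).
local samples = {
  { "parse_line", "read_file", "main" },
  { "parse_line", "read_file", "main" },
  { "read_file", "main" },
  { "render", "main" },
}

local self_count, total_count = {}, {}
for _, stack in ipairs(samples) do
  local top = stack[1]
  self_count[top] = (self_count[top] or 0) + 1
  local seen = {}                    -- count recursive frames once per sample
  for _, name in ipairs(stack) do
    if not seen[name] then
      seen[name] = true
      total_count[name] = (total_count[name] or 0) + 1
    end
  end
end

for _, name in ipairs({ "main", "read_file", "parse_line", "render" }) do
  print(("%-12s self %d  total %d"):format(
    name, self_count[name] or 0, total_count[name]))
end
```

Here main appears in every sample but is never the innermost frame, so its total is 4 and its self count is 0: it is on the path to the hot code rather than being the hot code.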

See Also