Lua Performance Tips and Benchmarking
Lua performance starts with the language design itself: a single function call or table access takes on the order of nanoseconds. But as your program grows, small inefficiencies compound. The difference between a well-tuned loop and a careless one can be orders of magnitude.
This guide covers the most impactful Lua-specific performance techniques, explains when each one matters, and shows you how to measure improvements correctly.
Table Pre-allocation
Every time you append to a table with t[#t+1] = val, Lua checks whether the array part has capacity. When it runs out, Lua grows the table — typically doubling its size — and copies all existing elements to the new memory block. For a table built with thousands of iterations, these resize events add up.
table.create(n) pre-allocates n array slots up front. It is available in Luau and in Lua 5.5, but not in Lua 5.4 or earlier:
-- Grows in ~2x increments, causing repeated reallocations
local slow = {}
for i = 1, 100000 do
  slow[i] = i
end

-- Pre-allocated: no resizing, no copies
local fast = table.create(100000)
for i = 1, 100000 do
  fast[i] = i
end
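In LuaJIT 2.1, the equivalent is the table.new extension, which you load with require("table.new"). It is a built-in extension module, not part of the FFI library. The call table.new(narray, nhash) pre-sizes both parts of the table, so table.new(n, 0) reserves n array slots and no hash slots.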
For most scripts, the difference is negligible. But for programs that build large tables in hot loops, such as parsers, game entity systems, and data processors, pre-allocation can cut build time noticeably. How much depends on table size and the runtime, so measure before and after.
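Because the pre-allocation primitive differs between runtimes, a small wrapper can pick whichever one is present. The following is a minimal sketch, not an established API: new_array is a hypothetical helper name, and the fallback to a plain table simply gives up pre-allocation on runtimes that have neither function.

```lua
-- Hypothetical helper: use the best pre-allocation primitive available.
local function new_array(n)
  if type(table.create) == "function" then
    return table.create(n)            -- Luau, Lua 5.5
  end
  local ok, table_new = pcall(require, "table.new")
  if ok and type(table_new) == "function" then
    return table_new(n, 0)            -- LuaJIT 2.1 extension
  end
  return {}                           -- no primitive found: grow on demand
end

local squares = new_array(1000)
for i = 1, 1000 do
  squares[i] = i * i
end
```

The pcall around require keeps the helper safe on runtimes where the table.new module does not exist.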
Local variable caching
Every time Lua evaluates an expression like math.sin, it performs two lookups: it finds math in the global environment table, then finds sin in the math table. Each lookup is fast, but not free, especially inside tight loops that run hundreds of thousands of times.
Caching library functions in local variables replaces both lookups with a direct local access:
-- Per iteration: two global lookups (math) plus two field lookups (sin, cos)
local function test_uncached(n)
  local sum = 0
  for i = 1, n do
    sum = sum + math.sin(i) + math.cos(i)
  end
  return sum
end

-- Lookups happen once per call of test_cached, then the loop uses locals
local function test_cached(n)
  local sin, cos = math.sin, math.cos
  local sum = 0
  for i = 1, n do
    sum = sum + sin(i) + cos(i)
  end
  return sum
end
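The saving is a fixed pair of table lookups per call, so the relative speedup depends on how much work the called function does: it is largest for cheap functions in hot loops and shrinks when the call itself dominates. The technique applies to any math.*, string.*, table.*, or io.* function called repeatedly. LuaJIT can often optimise these lookups away in compiled traces, but explicit caching also helps interpreted code and costs nothing.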
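To see what caching buys on your own machine, you can time both variants side by side. This is a rough sketch: floor_uncached, floor_cached, and time_it are hypothetical names, and math.floor stands in for any cheap library function. A single run like this is noisy, so treat the numbers as indicative only.

```lua
local function floor_uncached(n)
  local sum = 0
  for i = 1, n do
    sum = sum + math.floor(i / 3)   -- looks up math, then floor, every time
  end
  return sum
end

local function floor_cached(n)
  local floor = math.floor          -- looked up once
  local sum = 0
  for i = 1, n do
    sum = sum + floor(i / 3)
  end
  return sum
end

-- Returns elapsed CPU seconds and the function's result.
local function time_it(f, n)
  local start = os.clock()
  local result = f(n)
  return os.clock() - start, result
end

local N = 1000000
local t_uncached, r1 = time_it(floor_uncached, N)
local t_cached, r2 = time_it(floor_cached, N)
print(("uncached: %.4f s, cached: %.4f s"):format(t_uncached, t_cached))
```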
String concatenation in loops
Lua strings are immutable. The .. operator does not append in place; it allocates a new string and copies both operands into it. When you concatenate inside a loop, you pay this allocation-and-copy cost on every iteration:
-- O(n²) in total string length. Each step copies everything so far
local s = ""
for i = 1, 5000 do
  s = s .. "item" .. i .. "\n"
end
Each s = s .. chunk allocates a new string the size of s plus chunk, copies both into it, and leaves the old string for the garbage collector. Over N iterations the total number of bytes copied grows in proportion to N², and the GC pressure mounts alongside it. The idiomatic fix collects the pieces in a table and joins them with a single table.concat call, so the growing result is never rebuilt inside the loop:
-- O(n): one table write per iteration, one join at the end
local parts = {}
for i = 1, 5000 do
  parts[#parts + 1] = "item" .. i .. "\n"
end
local s = table.concat(parts) -- same string the naive loop produces
For short loops the difference is negligible. At 5,000 iterations you start to feel it. At 50,000 the naive version can take seconds while the table version finishes in milliseconds.
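When pieces are produced in several places rather than one loop, the same idea can be wrapped in a tiny builder. This is a sketch under stated assumptions: new_builder, add, and build are hypothetical names, not a standard library API. The builder tracks its own length so it never has to evaluate the # operator.

```lua
-- Hypothetical string builder: buffer pieces, join once at the end.
local function new_builder()
  local parts, n = {}, 0
  local builder = {}

  function builder.add(piece)
    n = n + 1
    parts[n] = piece
  end

  function builder.build(sep)
    return table.concat(parts, sep, 1, n)
  end

  return builder
end

local b = new_builder()
for i = 1, 3 do
  b.add("item" .. i)
end
local out = b.build("\n")
print(out)
```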
Upvalues and closure creation
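Calling a closure in Lua costs the same as calling any other function; the closure mechanism adds nothing special at call time. The overhead lives in creating the closure, because evaluating a function expression that captures local variables generally allocates a new closure object.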
If you create a closure inside a frequently-called function, you pay the allocation cost on every call:
-- Creates a new closure on every invocation
function processValues(values, offset)
  local adder = function(x) return x + offset end
  local result = {}
  for i = 1, #values do
    result[i] = adder(values[i])
  end
  return result
end
The adder closure is re-created on every call to processValues. Each allocation consumes memory and adds GC pressure, which builds up in hot code paths that run thousands of times per frame. If the captured values do not change between calls, hoist the closure creation out of the function body so it happens once:
-- One closure created once
local function makeAdder(offset)
  return function(x) return x + offset end
end

local addOffset = makeAdder(100)

function processValues(values)
  local result = {}
  for i = 1, #values do
    result[i] = addOffset(values[i])
  end
  return result
end
The captured value offset is stored with the closure that makeAdder returns, and addOffset is a local that processValues reads as an upvalue. Once the closure exists, calling it allocates nothing.
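Hoisting only works when the captured value is fixed. If the offset really does vary from call to call, one option is to drop the closure and do the arithmetic inline. The sketch below assumes that case; add_offset_to_all is a hypothetical name used only for illustration.

```lua
-- No closure at all: the offset is an ordinary parameter.
local function add_offset_to_all(values, offset)
  local result = {}
  for i = 1, #values do
    result[i] = values[i] + offset
  end
  return result
end

local shifted = add_offset_to_all({ 1, 2, 3 }, 10)
print(shifted[1], shifted[2], shifted[3])
```

The only allocation per call is the result table, which the closure-based version needs as well.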
Metatable Overhead
Metatables add a layer of indirection on every table access that triggers a metamethod. The most common source of accidental overhead is __index being called on every missed key lookup:
local defaults = { width = 80, height = 24 } -- illustrative values
local mt = {
  __index = function(t, k)
    return rawget(defaults, k) or 0
  end
}
local t = setmetatable({}, mt)
for i = 1, 100000 do
  local v = t[i] -- key i is absent, so __index is called 100000 times
end
Routing every missed lookup through a function adds up fast when the table is hit tens of thousands of times: each miss costs a full function call into __index, which then performs its own lookup in defaults. For tables that are read far more often than they are modified, copying the defaults into the table once pays off quickly:
local t = setmetatable({}, mt)
-- populate it
for k, v in pairs(defaults) do
  t[k] = v
end
-- now t has all keys directly; __index only fires for truly missing keys
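rawget(t, k) and rawset(t, k, v) bypass metamethods, which is useful when you want to skip __index or __newindex intentionally. They are not a speedup for keys that already exist: a key that is present never triggers __index, and the raw functions add a call of their own. In performance-critical OOP patterns, storing methods directly on each instance (rather than routing through a shared metatable) can remove some call overhead, at the cost of extra memory per instance.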
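A counter inside __index makes the effect of copying defaults visible. This is an illustrative sketch: the defaults values and the index_calls counter are made up for the demonstration.

```lua
local defaults = { width = 80, height = 24 }  -- illustrative values
local index_calls = 0

local mt = {
  __index = function(_, k)
    index_calls = index_calls + 1
    return defaults[k]
  end,
}
local t = setmetatable({}, mt)

local w = t.width             -- miss: falls through to __index
for k, v in pairs(defaults) do
  t[k] = v                    -- no __newindex is set, so this is a plain store
end
for _ = 1, 1000 do
  w = t.width                 -- hit: served from t directly
end

print(index_calls)            --> 1
```

Only the first read reaches the metamethod; the thousand reads after the copy never do.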
Benchmarking Correctly
It is easy to write a benchmark that tells you nothing. These are the most common mistakes and how to avoid them.
Use a timer with enough resolution
os.time() has only second-level resolution. For measuring anything under a few seconds, use os.clock(), which returns an approximation of the CPU time used by the program, in seconds, as a floating-point number. When you need wall-clock time instead, socket.gettime() provides it with sub-second resolution in OpenResty or any other environment where LuaSocket is installed:
local start = os.clock()
-- operation under test --
local elapsed = os.clock() - start
print(("Took %.4f seconds"):format(elapsed))
os.clock() measures CPU time consumed by your process, not wall-clock time. This means it excludes time spent waiting for I/O or sleeping, which is what you want for benchmarking pure computation. It is built on the C clock() function, so its behaviour and resolution are platform-dependent: the resolution is typically far finer on Linux than the roughly 1 millisecond available on Windows, where the underlying clock() also reports elapsed wall-clock time. For sub-millisecond operations you still need to run many iterations.
Disable the garbage collector
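The garbage collector runs on its own schedule. Collection work that lands inside your timed section inflates the measurement by an amount that varies from run to run. Stop the collector for the duration of the measurement, keeping in mind that this also hides the collection cost your code will pay in production: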
collectgarbage("stop")
local start = os.clock()
-- benchmark code --
local elapsed = os.clock() - start
collectgarbage("restart")
print(("%.4f seconds"):format(elapsed))
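Stopping the collector also makes it possible to see how much memory the timed section allocates, since nothing is freed while it runs. The sketch below assumes a placeholder workload (the rows loop) and uses collectgarbage("count"), which reports the heap size in kilobytes.

```lua
collectgarbage("collect")             -- start from a freshly collected heap
collectgarbage("stop")
local kb_before = collectgarbage("count")
local start = os.clock()

-- Placeholder workload: replace with the code under test.
local rows = {}
for i = 1, 10000 do
  rows[i] = { id = i }
end

local elapsed = os.clock() - start
local kb_allocated = collectgarbage("count") - kb_before
collectgarbage("restart")

print(("%.4f s, %.1f KB allocated"):format(elapsed, kb_allocated))
```

Reporting allocation next to time is a useful hint: two variants with similar timings but very different allocation will behave differently once the collector is running again.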
Warm up the JIT compiler
If you are using LuaJIT, the JIT compiler needs to record and compile a trace before hot code runs at full speed. With default settings a loop counts as hot after 56 iterations (the hotloop parameter), but side exits and nested loops need further passes before every path is compiled. Always run the code you intend to measure for several thousand iterations first:
-- Warmup: let JIT compile the trace
for i = 1, 50000 do
  hot_function(i)
end

-- Now measure
collectgarbage("stop")
local start = os.clock()
for i = 1, 100000 do
  hot_function(i)
end
local elapsed = os.clock() - start
collectgarbage("restart")
print(("%.4f seconds"):format(elapsed))
The warmup phase is essential for LuaJIT benchmarks. Without it, your measurement includes trace recording, IR generation, and machine code generation, all of which are far slower than steady-state execution. Once a trace is compiled, LuaJIT runs native code for that path instead of interpreting bytecode. Measurements taken before compilation finishes are not representative of steady-state performance.
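A benchmark script that runs on both PUC Lua and LuaJIT can check for the jit library and warm up only when it is present. This is a minimal sketch: hot_function here is a placeholder workload, and the detection relies on LuaJIT exposing a global jit table with version and status fields.

```lua
-- rawget avoids tripping "strict globals" checkers when jit is absent.
local jit_lib = rawget(_G, "jit")
local is_luajit = type(jit_lib) == "table"

local function hot_function(i)        -- placeholder workload
  return i * 2 + 1
end

if is_luajit then
  -- jit.status() returns a boolean followed by flag strings; keep the boolean.
  print(jit_lib.version, "JIT enabled:", (jit_lib.status()))
  for i = 1, 50000 do                 -- warmup pass
    hot_function(i)
  end
else
  print(_VERSION, "(interpreter only, no warmup needed)")
end
```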
Run multiple iterations
A single os.clock() delta has too much variance from system scheduling noise. Run the operation N times, time the whole loop, and divide by N:
local N = 10000
collectgarbage("stop")
local start = os.clock()
for i = 1, N do
  operation()
end
local elapsed = (os.clock() - start) / N
collectgarbage("restart")
print(("Average: %.6f seconds per call"):format(elapsed))
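The individual steps in this section can be folded into one reusable function. The following is a sketch, not a definitive harness: bench and operation are hypothetical names, and taking the best of several repeats is one common way to damp scheduling noise, chosen here as an assumption.

```lua
-- Returns the lowest per-call time (in seconds) seen across `repeats` runs.
local function bench(fn, iterations, repeats)
  local best = math.huge
  for _ = 1, repeats do
    collectgarbage("collect")         -- same heap state for every run
    collectgarbage("stop")
    local start = os.clock()
    for _ = 1, iterations do
      fn()
    end
    local per_call = (os.clock() - start) / iterations
    collectgarbage("restart")
    if per_call < best then
      best = per_call
    end
  end
  return best
end

local function operation()            -- placeholder workload
  local t = {}
  for i = 1, 100 do
    t[i] = i
  end
  return t
end

local best = bench(operation, 10000, 5)
print(("Best of 5: %.9f seconds per call"):format(best))
```

On LuaJIT the first repeat doubles as a warmup, which is another reason to report the best run and not the first.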
Choosing a Lua version for performance
| Version | Notes |
|---|---|
| Lua 5.1 | LuaJIT targets 5.1 syntax. Still widely used for this reason. |
| Lua 5.3 | 64-bit integer subtype and native bitwise operators. Good balance of features and speed. |
| Lua 5.4 | Generational GC mode. Performance regressions in 5.4.6 relative to 5.4.1 and 5.4.4 have been reported; benchmark your own workload before pinning a release. |
| LuaJIT 2.0/2.1 | JIT compiler; typically 5–10x faster than PUC Lua for JIT-compatible code. Implements Lua 5.1 plus selected extensions from later versions. |
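LuaJIT delivers the highest raw throughput for numerically intensive, JIT-friendly code. But it implements Lua 5.1 with only a subset of later features, so 5.3 additions such as the integer subtype and native bitwise operators are unavailable, and some patterns (table.move in hot paths, for example) may not JIT-compile well, which gives up most of the advantage for that code. PUC Lua 5.3 is the right choice when you need modern language features and cross-version compatibility.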
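Code that has to run on more than one of these versions usually detects features at load time and does not parse the version string. The sketch below shows three such checks; the variable names are illustrative, and the checks assume the standard behaviour that unpack moved into the table library in 5.2 and that math.type arrived with the 5.3 integer subtype.

```lua
-- 5.1 and LuaJIT expose a global unpack; 5.2+ expose table.unpack.
local unpack = table.unpack or unpack

local has_jit = type(rawget(_G, "jit")) == "table"
local has_integers = math.type ~= nil   -- true on PUC Lua 5.3 and later

print(_VERSION, "jit:", has_jit, "integers:", has_integers)

local a, b, c = unpack({ 1, 2, 3 })
```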
See Also
- Tables Intro covers core table mechanics and the difference between array and hash parts
- Lua Garbage Collection explains how the GC works and what triggers it
- Lua Iterators Guide walks through writing efficient iterators with proper state management
- Modules and Require walks through require internals and module structuring for fast loading