Lua Performance Tips and Benchmarking
Lua is fast by design. A single function call or table access takes on the order of nanoseconds. But as your program grows, small inefficiencies compound. The difference between a well-tuned loop and a careless one can be orders of magnitude.
This guide covers the most impactful Lua-specific performance techniques, explains when each one matters, and shows you how to measure improvements correctly.
Table Pre-allocation
Every time you append to a table with t[#t+1] = val, Lua checks whether the array part has capacity. When it runs out, Lua grows the table — typically doubling its size — and copies all existing elements to the new memory block. For a table built with thousands of iterations, these resize events add up.
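Some runtimes expose a way to pre-allocate array slots. Lua 5.5 adds table.create(nseq [, nrec]) and Luau provides a similar table.create; Lua 5.4 and earlier have no built-in equivalent:
-- Grows in ~2x increments, causing repeated reallocations
local slow = {}
for i = 1, 100000 do
  slow[i] = i
end
-- Pre-allocated: no resizing, no copies (needs a runtime that provides table.create)
local fast = table.create(100000)
for i = 1, 100000 do
  fast[i] = i
end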
In LuaJIT 2.1, the equivalent is table.new, an extension module that must be loaded explicitly with require("table.new"); it is not part of the FFI library. table.new(n, 0) pre-allocates n array slots and no hash slots.
For most scripts, the difference is negligible. But for programs that build large tables in hot loops — parsers, game entity systems, data processors — pre-allocation can cut build time by half or more.
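Because the pre-allocation primitive differs between runtimes, portable code can feature-detect it once at load time. The following is a minimal sketch rather than a standard API: new_array is a hypothetical helper name, and the final fallback to a plain {} simply forgoes pre-allocation.

```lua
-- Detect the best available pre-allocation primitive once, at load time.
local prealloc
if type(table.create) == "function" then
  -- Runtimes that provide table.create (for example Lua 5.5 or Luau).
  prealloc = function(n) return table.create(n) end
else
  -- LuaJIT 2.1 ships table.new as a module that must be required explicitly.
  local ok, table_new = pcall(require, "table.new")
  if ok and type(table_new) == "function" then
    prealloc = function(n) return table_new(n, 0) end
  else
    -- No primitive available: fall back to an ordinary empty table.
    prealloc = function(_) return {} end
  end
end

-- Hypothetical helper: returns an empty table sized for n array elements.
local function new_array(n)
  return prealloc(n)
end

local squares = new_array(1000)
for i = 1, 1000 do
  squares[i] = i * i
end
print(#squares, squares[10])
```

Doing the detection outside new_array keeps the pcall(require, ...) off the hot path, so each call costs only one extra function call.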
Local Variable Caching
Every time Lua evaluates an expression like math.sin, it first looks up the global math in the environment table and then looks up the field sin in that table. Each lookup is fast, but not free, especially inside tight loops that run hundreds of thousands of times.
Caching library functions in local variables replaces both lookups with a direct local access:
-- Two global lookups and two field lookups per iteration
local function test_uncached(n)
  local sum = 0
  for i = 1, n do
    sum = sum + math.sin(i) + math.cos(i)
  end
  return sum
end
-- Lookups happen once per call; the loop itself only touches locals
local function test_cached(n)
  local sin, cos = math.sin, math.cos
  local sum = 0
  for i = 1, n do
    sum = sum + sin(i) + cos(i)
  end
  return sum
end
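The size of the gain depends on how much work the called function does: it can reach 30–50% when the loop body is dominated by lookups and call overhead, and is much smaller when the function itself is expensive. The gain is most noticeable for math.*, string.*, table.*, and io.* functions called repeatedly. LuaJIT's JIT compiler can sometimes elide global lookups automatically, but explicit caching guarantees the optimisation.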
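To see what caching is worth on a particular machine and runtime, the two variants can be timed side by side. This is a rough sketch: time_call is a hypothetical helper, N is an arbitrary iteration count, and a single run like this is noisy, so treat the printed numbers as indicative only.

```lua
local function test_uncached(n)
  local sum = 0
  for i = 1, n do
    sum = sum + math.sin(i) + math.cos(i)
  end
  return sum
end

local function test_cached(n)
  local sin, cos = math.sin, math.cos
  local sum = 0
  for i = 1, n do
    sum = sum + sin(i) + cos(i)
  end
  return sum
end

-- Hypothetical helper: CPU time and result of one call to fn(n).
local function time_call(fn, n)
  local start = os.clock()
  local result = fn(n)
  return os.clock() - start, result
end

local N = 200000 -- arbitrary; large enough to be measurable
local t_uncached, r_uncached = time_call(test_uncached, N)
local t_cached, r_cached = time_call(test_cached, N)

-- Both variants perform the same arithmetic in the same order.
assert(r_uncached == r_cached)
print(("uncached: %.4f s, cached: %.4f s"):format(t_uncached, t_cached))
```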
String Concatenation in Loops
Lua strings are immutable. The .. operator does not append in place — it allocates a new string and copies both operands into it. When you concatenate inside a loop, you pay this allocation-and-copy cost on every iteration:
-- O(n²) in total string length: each step copies everything so far
local s = ""
for i = 1, 5000 do
  s = s .. "item" .. i .. "\n"
end
The idiomatic fix is to accumulate parts in a table and call table.concat once at the end:
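-- O(n) copying overall: one small string and one table write per iteration,
-- then a single large allocation at the end
local parts = {}
for i = 1, 5000 do
  parts[#parts + 1] = "item" .. i
end
-- Unlike the loop above, the result has no trailing newline
local s = table.concat(parts, "\n")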
For short loops the difference is negligible. At 5,000 iterations you start to feel it. At 50,000 the naive version can take seconds while the table version finishes in milliseconds.
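When swapping one approach for the other, it is worth checking that both produce the same string, since a separator passed to table.concat goes between elements and not after the last one. The sketch below uses two hypothetical helper names, build_naive and build_buffered; appending an empty final element is one way to keep the trailing newline.

```lua
-- Naive version: repeated concatenation, trailing newline after every item.
local function build_naive(n)
  local s = ""
  for i = 1, n do
    s = s .. "item" .. i .. "\n"
  end
  return s
end

-- Buffered version: collect the parts, join once.
local function build_buffered(n)
  local parts = {}
  for i = 1, n do
    parts[i] = "item" .. i
  end
  -- An empty last element makes table.concat emit a final "\n".
  parts[n + 1] = ""
  return table.concat(parts, "\n")
end

assert(build_naive(100) == build_buffered(100))
print(#build_buffered(100))
```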
Upvalues and Closure Creation
Closures in Lua are cheap to call — the cost of a function call is what you pay, not anything special about the closure mechanism. The overhead lives in creating the closure.
If you create a closure inside a frequently-called function, you pay the allocation cost on every call:
-- Creates a new closure on every invocation
function processValues(values, offset)
  local adder = function(x) return x + offset end
  local result = {}
  for i = 1, #values do
    result[i] = adder(values[i])
  end
  return result
end
Move the inner function out if the captured values do not change between calls:
-- One closure, created once at load time
local function makeAdder(offset)
  return function(x) return x + offset end
end
local addOffset = makeAdder(100)
-- The offset is now fixed at 100, so it is no longer a parameter
function processValues(values)
  local result = {}
  for i = 1, #values do
    result[i] = addOffset(values[i])
  end
  return result
end
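The captured upvalue (offset inside the closure, addOffset inside processValues) is stored with the closure. Once the closure exists, calling it allocates nothing; reading an upvalue is only slightly more expensive than reading a local.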
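When the offset does vary between calls, hoisting a closure is not an option, but the allocation can still be avoided by passing the value as an ordinary argument. A minimal sketch, assuming a hypothetical helper named add:

```lua
-- Defined once at load time; takes the offset as a plain argument,
-- so nothing is captured and no closure is created per call.
local function add(x, offset)
  return x + offset
end

local function processValues(values, offset)
  local result = {}
  for i = 1, #values do
    result[i] = add(values[i], offset)
  end
  return result
end

local out = processValues({ 1, 2, 3 }, 100)
print(out[1], out[2], out[3])
```

This keeps the original two-argument signature of processValues, at the price of passing one extra argument on each inner call.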
Metatable Overhead
Metatables add a layer of indirection on every table access that triggers a metamethod. The most common source of accidental overhead is __index being called on every missed key lookup:
local defaults = { x = 10, y = 20 } -- illustrative default values
local mt = {
  __index = function(t, k)
    return rawget(defaults, k) or 0
  end
}
local t = setmetatable({}, mt)
for i = 1, 100000 do
  local v = t[i] -- key is absent from t, so __index is called 100000 times
end
Once a table is fully populated, consider replacing the metatable lookup with direct access:
local t = setmetatable({}, mt)
-- Copy every default into t itself
for k, v in pairs(defaults) do
  t[k] = v
end
-- Now t has all keys directly; __index only fires for truly missing keys
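rawget(t, k) and rawset(t, k, v) bypass metamethods; use them when you want to skip __index or __newindex intentionally, for example inside the metamethods themselves. They are ordinary function calls, so they are not a faster substitute for t[k] when the key is already present: a plain index on an existing key never consults the metatable. In performance-critical OOP patterns, storing methods directly on each instance (rather than routing through a shared metatable) can eliminate a meaningful fraction of call overhead, at the cost of extra memory per instance.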
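Another way to limit __index calls is to have the metamethod store its result in the table on the first miss, so later reads of the same key hit the table directly. The sketch below is one possible arrangement, not the only one; the defaults values and the misses counter are illustrative assumptions.

```lua
local defaults = { width = 80, height = 24 } -- illustrative values
local misses = 0 -- counts how often __index actually runs

local mt = {
  __index = function(t, k)
    misses = misses + 1
    local v = defaults[k]
    if v == nil then
      v = 0
    end
    -- Store the value on the instance; rawset avoids any __newindex handler.
    rawset(t, k, v)
    return v
  end,
}

local t = setmetatable({}, mt)
local total = 0
for _ = 1, 1000 do
  total = total + t.width -- only the first read triggers __index
end
print(total, misses)
```

The trade-off is that later changes to defaults are no longer visible through keys that have already been cached on the instance.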
Benchmarking Correctly
It is easy to write a benchmark that tells you nothing. These are the most common mistakes and how to avoid them.
Use a Timer with Enough Resolution
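os.time() has only second-level resolution. For measuring anything under a few seconds, use os.clock(), which returns an approximation of the CPU time used by the program, in seconds. Its resolution is platform-dependent but is normally far finer than one second. Because it reports CPU time on most platforms, time spent sleeping or blocked on I/O is not counted. In OpenResty or any other environment where LuaSocket is installed, socket.gettime() gives wall-clock time with sub-second resolution.
A minimal timing pattern with os.clock():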
local start = os.clock()
-- operation under test --
local elapsed = os.clock() - start
print(("Took %.4f seconds"):format(elapsed))
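If a script should use the wall-clock timer when LuaSocket happens to be installed and fall back to os.clock() otherwise, the choice can be made once up front. This is a sketch under that assumption; now is a hypothetical local name, and the two timers measure different things (wall-clock versus CPU time), so results from different environments are not directly comparable.

```lua
-- Prefer LuaSocket's wall-clock timer when the module is installed;
-- otherwise fall back to os.clock (CPU time).
local ok, socket = pcall(require, "socket")
local now
if ok and type(socket) == "table" and type(socket.gettime) == "function" then
  now = socket.gettime
else
  now = os.clock
end

local start = now()
local acc = 0
for i = 1, 100000 do
  acc = acc + i % 7 -- placeholder work for the operation under test
end
local elapsed = now() - start
print(("Took %.6f seconds"):format(elapsed))
```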
Disable the Garbage Collector
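The garbage collector runs at points determined by allocation, not by your benchmark. Collection work during your timed section adds time that has nothing to do with the code under test and makes runs hard to compare. Run a full collection first, then stop the collector for the duration of the measurement. This also hides the cost of the garbage your code produces, so leave the collector running when GC pressure is what you want to measure:
collectgarbage("collect") -- start from a clean heap
collectgarbage("stop")
local start = os.clock()
-- benchmark code --
local elapsed = os.clock() - start
collectgarbage("restart")
print(("%.4f seconds"):format(elapsed))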
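One hazard with this pattern is that an error in the code under test leaves the collector stopped. A small wrapper can restore it in every case. The following is a sketch; timed_without_gc is a hypothetical helper name, and the workload passed to it is only a placeholder.

```lua
-- Hypothetical helper: time fn(...) with the collector paused, and restart
-- the collector even when fn raises an error.
local function timed_without_gc(fn, ...)
  collectgarbage("collect")
  collectgarbage("stop")
  local start = os.clock()
  local ok, err = pcall(fn, ...)
  local elapsed = os.clock() - start
  collectgarbage("restart")
  if not ok then
    error(err, 0) -- re-raise after the collector is running again
  end
  return elapsed
end

-- Placeholder workload: allocate n small tables.
local elapsed = timed_without_gc(function(n)
  local t = {}
  for i = 1, n do
    t[i] = { i }
  end
end, 10000)
print(("%.4f seconds"):format(elapsed))
```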
Warm Up the JIT Compiler
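If you are using LuaJIT, the JIT compiler needs to record and compile a trace before hot code runs at full speed. With default settings a loop counts as hot after a few dozen iterations (the hotloop parameter defaults to 56), but side exits and nested loops can take considerably longer to settle. Always run the code you intend to measure for several thousand iterations first: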
-- Warmup: let JIT compile the trace
for i = 1, 50000 do
  hot_function(i)
end
-- Now measure
collectgarbage("stop")
local start = os.clock()
for i = 1, 100000 do
  hot_function(i)
end
local elapsed = os.clock() - start
collectgarbage("restart")
print(("%.4f seconds"):format(elapsed))
Measurements taken before JIT compilation finishes include compilation time and are not representative of steady-state performance.
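A benchmark script that runs under both PUC Lua and LuaJIT can skip the warm-up when no JIT compiler is active. The sketch below relies on the global jit table that LuaJIT provides; hot_function here is a trivial stand-in, and the warm-up count of 50000 is simply the figure used above.

```lua
-- The global `jit` table exists only under LuaJIT.
local has_jit = type(jit) == "table"
-- jit.status() returns true when the JIT compiler is enabled.
local jit_on = has_jit and jit.status() or false

-- Stand-in for the code under test (hypothetical).
local function hot_function(i)
  return i * 2 + 1
end

-- Only spend time on warm-up when a JIT compiler is active.
local warmup_iterations = jit_on and 50000 or 0
for i = 1, warmup_iterations do
  hot_function(i)
end

print(has_jit and jit.version or _VERSION, "warm-up iterations:", warmup_iterations)
```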
Run Multiple Iterations
A single os.clock() delta has too much variance from system scheduling noise. Run the operation N times, sum the total, and divide:
local N = 10000
collectgarbage("stop")
local start = os.clock()
for i = 1, N do
  operation()
end
local per_call = (os.clock() - start) / N -- average time of one call
collectgarbage("restart")
print(("Average: %.6f seconds per call"):format(per_call))
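An average over one batch can still be skewed by a single slow stretch. One common refinement is to repeat the whole batch several times and report the minimum and the median of the per-call averages. This is a sketch of that idea: bench is a hypothetical helper, operation is a placeholder workload, and the call and trial counts are arbitrary.

```lua
-- Placeholder workload (hypothetical): build a small table.
local function operation()
  local t = {}
  for i = 1, 100 do
    t[i] = i
  end
  return t
end

-- Hypothetical helper: run `trials` batches of `calls` invocations each and
-- return the best and the median per-call time in seconds.
local function bench(fn, calls, trials)
  local samples = {}
  for trial = 1, trials do
    collectgarbage("collect")
    collectgarbage("stop")
    local start = os.clock()
    for _ = 1, calls do
      fn()
    end
    samples[trial] = (os.clock() - start) / calls
    collectgarbage("restart")
  end
  table.sort(samples)
  local median = samples[math.floor((trials + 1) / 2)]
  return samples[1], median
end

local best, median = bench(operation, 1000, 5)
print(("best: %.8f s/call, median: %.8f s/call"):format(best, median))
```

The minimum approximates the cost with the least interference, while the gap between minimum and median gives a rough sense of how noisy the environment is.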
Choosing a Lua Version for Performance
| Version | Notes |
|---|---|
| Lua 5.1 | LuaJIT targets 5.1 syntax. Still widely used for this reason. |
| Lua 5.3 | Integer subtype, bitwise operators, table.move. Good balance of features and speed. |
| Lua 5.4 | Generational GC mode. Avoid 5.4.6: performance regressions vs 5.4.1 and 5.4.4. |
| LuaJIT 2.0/2.1 | JIT compiler; often 5–10x faster than PUC Lua for JIT-compatible code. Based on the Lua 5.1 language with a few later extensions. |
LuaJIT delivers the highest raw throughput for numerically-intensive, JIT-friendly code. But it implements the Lua 5.1 language plus a handful of later extensions (such as goto), not 5.3 features like the integer subtype and bitwise operators, and some patterns (like table.move in hot paths) do not JIT compile well. PUC Lua 5.3 or later is the right choice when you need modern language features and cross-version compatibility.
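Code that has to run on several of these runtimes can check for features at load time instead of parsing version strings. A minimal sketch, assuming a hypothetical runtime table whose field names are chosen for illustration:

```lua
-- Feature detection: test for the capability, not the version number.
local runtime = {
  version = _VERSION,                              -- e.g. "Lua 5.1"
  is_luajit = type(jit) == "table",                -- LuaJIT sets a global `jit`
  has_integers = type(math.type) == "function",    -- math.type arrived in 5.3
  has_table_move = type(table.move) == "function", -- 5.3, also LuaJIT 2.1
}

for _, key in ipairs({ "version", "is_luajit", "has_integers", "has_table_move" }) do
  print(key, tostring(runtime[key]))
end
```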
See Also
- Tables Intro: core table mechanics and the difference between array and hash parts
- Lua Garbage Collection: how the GC works and what triggers it
- Lua Iterators Guide: writing efficient iterators with proper state management
- Modules and Require: how require works and how to structure modules for fast loading