Cache Lines, Hot Data, and Frame-Time Spikes

A lot of performance work starts with the visible things: translucent VFX, expensive widgets, large materials, tick-heavy actors, or a pass that looks suspicious in GPU captures. Those are real costs. The quieter problem is the CPU waiting on memory because a hot loop keeps touching data in the wrong shape.

Cache behavior rarely announces itself with a clean label. It shows up as inconsistent spikes, a loop that scales badly with content count, or code that becomes fragile once the team adds enough real production data. The fix is not magic. It is mostly about making the data you touch every frame smaller, closer together, and easier for the CPU to predict. This is the kind of work that sits inside our optimisation and profiling support, especially when a project needs diagnosis before broad rewrites.


Start with the working set

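A cache line is a small block of adjacent memory that the CPU moves into cache as a unit. On many current CPUs that block is 64 bytes, though some architectures use a different size. If a hot update only needs position and velocity, but those values live inside a large object filled with names, debug flags, soft references, editor-only data, and rarely used state, every line fetched for the hot fields also brings the neighboring cold bytes with it. Those bytes cost memory bandwidth and occupy cache space that the next object's hot data could have used.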

In practice, this means the question is not only "how many objects are we updating?" It is also "how much irrelevant memory are we touching per object?" The gap between those two questions is where a lot of frame-time waste hides.

Hot data should be boring: compact, predictable, contiguous, and separated from the fields that only tooling, UI, debug, or one-off transitions need.
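
The "irrelevant memory per object" question can be made concrete with a small estimate. The sketch below is a simplified model, not a measurement: it assumes a 64-byte line, an array that starts on a line boundary, and a loop that reads one run of hot bytes per element. The function name and parameters are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache-line size. 64 bytes is common, but check the target hardware.
constexpr std::size_t kCacheLineBytes = 64;

// Counts the distinct cache lines touched when a loop reads `hotBytes`
// starting at `hotOffset` inside each element of an array of `count`
// elements laid out `stride` bytes apart. Assumes the array begins on a
// cache-line boundary. Ignores prefetching and everything else real
// hardware does; this is only a way to compare layouts.
std::size_t CacheLinesTouched(std::size_t count, std::size_t stride,
                              std::size_t hotOffset, std::size_t hotBytes)
{
    assert(hotBytes > 0 && hotOffset + hotBytes <= stride);

    std::size_t lines = 0;
    std::size_t nextUncounted = 0; // first line index not yet counted

    for (std::size_t i = 0; i < count; ++i)
    {
        const std::size_t begin = i * stride + hotOffset;
        const std::size_t first = begin / kCacheLineBytes;
        const std::size_t last = (begin + hotBytes - 1) / kCacheLineBytes;

        // Addresses only increase, so skip lines already counted
        // for the previous element.
        const std::size_t start = first > nextUncounted ? first : nextUncounted;
        if (last >= start)
        {
            lines += last - start + 1;
            nextUncounted = last + 1;
        }
    }
    return lines;
}
```

Under this model, reading 24 bytes of position and velocity from 1000 objects that are each 256 bytes wide touches 1000 lines, while the same 24 bytes packed back to back touch 375. The real gap depends on the hardware and access pattern, so treat the numbers as a reason to profile, not as a result.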

Separate hot state from cold state

For Unreal work, this often starts with a pass over structs that are updated every frame or many times per frame. If a system only needs transforms, velocities, IDs, and a small amount of status, those fields deserve a compact path. Everything else can usually sit behind an index, a handle, or a separate array that is only touched when the colder behavior runs.

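// Touched every frame. 29 bytes of fields, typically padded to 32,
// so roughly two agents share one 64-byte cache line.
struct FAgentHotState
{
    FVector3f Position;   // 3 floats, 12 bytes
    FVector3f Velocity;   // 3 floats, 12 bytes
    int32 StateId;
    uint8 bActive;        // flag: 0 or 1
};

// Touched rarely. Looked up by index or handle only when needed.
struct FAgentColdState
{
    FString DebugName;
    TArray<int32> InventoryIds;
    // Would need UPROPERTY() in a USTRUCT (or equivalent) to be GC-tracked.
    TObjectPtr<UObject> PresentationAsset;
};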

The exact split depends on the system. The useful habit is to make the hot path explicit. Once that is visible, the team can decide whether a simple array of hot structs is enough, or whether a structure-of-arrays layout is worth the extra complexity for a heavier loop.
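
For the heavier-loop case, a structure-of-arrays layout might look like the sketch below. It is written against the standard library so it compiles outside the engine: `Vec3f` stands in for `FVector3f`, `std::vector` stands in for `TArray`, and `AgentHotArrays`, `Add`, and `Integrate` are hypothetical names, not engine API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for FVector3f so the sketch compiles outside Unreal.
struct Vec3f
{
    float X, Y, Z;
};

// Hypothetical structure-of-arrays container. Index i across all four
// arrays describes one agent.
struct AgentHotArrays
{
    std::vector<Vec3f> Positions;
    std::vector<Vec3f> Velocities;
    std::vector<std::int32_t> StateIds;
    std::vector<std::uint8_t> Active;

    std::size_t Num() const { return Positions.size(); }

    // Appends one agent to every array and returns its index.
    std::size_t Add(const Vec3f& position, const Vec3f& velocity, std::int32_t stateId)
    {
        Positions.push_back(position);
        Velocities.push_back(velocity);
        StateIds.push_back(stateId);
        Active.push_back(1);
        return Positions.size() - 1;
    }
};

// The integration loop reads positions, velocities, and active flags.
// StateIds are never loaded here, so they take no cache space in this pass.
void Integrate(AgentHotArrays& agents, float deltaSeconds)
{
    assert(agents.Velocities.size() == agents.Num());
    assert(agents.Active.size() == agents.Num());

    for (std::size_t i = 0; i < agents.Num(); ++i)
    {
        if (!agents.Active[i])
        {
            continue;
        }
        Vec3f& p = agents.Positions[i];
        const Vec3f& v = agents.Velocities[i];
        p.X += v.X * deltaSeconds;
        p.Y += v.Y * deltaSeconds;
        p.Z += v.Z * deltaSeconds;
    }
}
```

The cost is bookkeeping: every add and remove has to keep the arrays in step, which is why an array of compact hot structs is usually the first thing to try.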

Prefer contiguous batches

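Pointer chasing is expensive because each dereference can land on a different cache line, and the CPU cannot know the next address until the previous load has finished, which leaves the hardware prefetcher little to work with. A linked list of small objects might look tidy from an ownership perspective, but it is rarely a good shape for work that has to run across thousands of items in a frame.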

In Unreal, a plain TArray is often the first tool to reach for. Reserve capacity when counts are predictable, compact data when deletion patterns leave holes, and process homogeneous work in batches instead of bouncing between unrelated systems per entity.
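
A minimal sketch of that pattern, using `std::vector` as a stand-in for `TArray` so it compiles on its own. `Particle`, `MakeParticles`, and `UpdateParticles` are hypothetical names. The local `RemoveAtSwap` helper mirrors what `TArray::RemoveAtSwap` does in the engine, and `reserve` plays the role of `TArray::Reserve`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-particle hot data, reduced to one dimension for brevity.
struct Particle
{
    float Position;
    float Velocity;
    float LifeSeconds;
};

// Reserve up front when the count is predictable, so growth does not
// reallocate and copy in the middle of a frame.
std::vector<Particle> MakeParticles(std::size_t expectedCount)
{
    std::vector<Particle> particles;
    particles.reserve(expectedCount);
    return particles;
}

// Removes items[index] by moving the last element into its slot.
// Constant time and leaves no hole, but does not preserve order.
void RemoveAtSwap(std::vector<Particle>& items, std::size_t index)
{
    assert(index < items.size());
    if (index + 1 != items.size())
    {
        items[index] = std::move(items.back());
    }
    items.pop_back();
}

// One batch pass over contiguous data: integrate, age, and compact.
void UpdateParticles(std::vector<Particle>& particles, float deltaSeconds)
{
    for (std::size_t i = 0; i < particles.size();)
    {
        Particle& p = particles[i];
        p.Position += p.Velocity * deltaSeconds;
        p.LifeSeconds -= deltaSeconds;

        if (p.LifeSeconds <= 0.0f)
        {
            // Slot i now holds an element that has not been processed yet,
            // so do not advance the index.
            RemoveAtSwap(particles, i);
        }
        else
        {
            ++i;
        }
    }
}
```

Swap-removal only suits systems where element order does not matter. If other code stores indices into the array, those indices need a remapping step or a handle layer before this approach is safe.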

Measure the change, not the theory

Cache-friendly code can still be wrong code if it makes the system harder to maintain without moving the real bottleneck. Before reshaping data, capture the problem. After reshaping data, capture it again. Look for lower frame variance, better scaling as content count rises, and fewer unexpected spikes in the loop you changed. For wider production context, this usually connects back to technical art support rather than a standalone optimisation pass.
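
In an Unreal project the captures would normally come from the engine's own profiling tools. For an isolated loop, a before-and-after comparison can be as small as the sketch below. It is a hypothetical harness, not a replacement for a real capture: `FrameStats`, `Summarize`, and `TimeMs` are illustrative names, and the standard deviation is used here as a simple proxy for frame variance.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Summary of a set of per-frame timings, in milliseconds.
struct FrameStats
{
    double MeanMs;
    double MaxMs;
    double StdDevMs; // population standard deviation
};

// Reduces raw samples to the numbers worth comparing before and after
// a data-layout change: the average, the worst case, and the spread.
FrameStats Summarize(const std::vector<double>& samplesMs)
{
    assert(!samplesMs.empty());

    double sum = 0.0;
    double maxMs = samplesMs[0];
    for (double s : samplesMs)
    {
        sum += s;
        maxMs = std::max(maxMs, s);
    }
    const double mean = sum / static_cast<double>(samplesMs.size());

    double squared = 0.0;
    for (double s : samplesMs)
    {
        squared += (s - mean) * (s - mean);
    }
    const double variance = squared / static_cast<double>(samplesMs.size());

    return FrameStats{mean, maxMs, std::sqrt(variance)};
}

// Times one call of `work` with a monotonic clock and returns milliseconds.
template <typename Fn>
double TimeMs(Fn&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Collect one `TimeMs` sample per simulated frame for the old layout and the new one, at several content counts, then compare the summaries. A lower mean with an unchanged maximum suggests the spikes come from somewhere else.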

The goal is not to rewrite every system into a benchmark. The goal is to recognize which loops are truly hot and give those loops a data layout that matches how the hardware wants to read memory.

A quick review checklist

Is the per-frame data compact enough to understand at a glance?
Are hot fields separated from debug, UI, editor, and rare transition data?
Can the loop process a contiguous array instead of following scattered pointers?
Has capacity been reserved for arrays that grow predictably?
Did profiling confirm the change helped the actual frame-time problem?