Improving Performance in Changes


5-24-2021

Changes is a pretty simple game from a programming perspective. When I started designing it, one of the goals was to make a game that didn't have a lot of the complexities that other projects had. Despite that, there were still a lot of difficulties that came up as we got closer to shipping. Some of these are because the game did get more complicated (graphical and audio improvements, plus features like screenshots and fading audio in and out). My goal here is to show how even a simple game has some surprising complexity. Many of these things will be obvious to anyone who has shipped a game before, and some of them are likely difficulties I could have avoided by making better decisions early (it's important not to add friction when designing a game, but I think many of these things actually reduce friction by making memory usage more deliberate).

This is not meant to show novel or sophisticated solutions; they're all rudimentary. But it's stuff I didn't think about when I began, so hopefully it will be useful to someone else who is getting started, or who wants some affirmation that the rudimentary solution can be sufficient.

Changes is a level-based game, so everything is very discrete and we don't need to stream in data. We really have five types of assets: textures, audio (clips and streams), text, meshes, and level data. On top of those there's a tessellated plane that has to be generated for each level. The meshes, text, and levels get loaded at the beginning; the textures are unique per "world" (there is no overlap); there is overlap in the audio between levels, but not in a clear or uniform way.

The game would chug at two points: startup and swapping between levels. The first step in improving performance is to measure. You can make educated guesses, but nothing beats actual data. The slow startup was a lower priority than the frame hitches when going between levels (which involves an animation that must run smoothly). [Show a gif here]
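The measuring itself doesn't need to be fancy. Here's a minimal sketch of the kind of timing I mean, assuming a POSIX monotonic clock; LoadLevel is just a placeholder for whatever work you suspect.

    #include <stdio.h>
    #include <time.h>

    /* Milliseconds on a monotonic clock (POSIX). */
    static double TimeMS(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
    }

    void LoadLevel(int levelIndex);    /* placeholder for the suspect work */

    void LoadLevelTimed(int levelIndex)
    {
        double start = TimeMS();
        LoadLevel(levelIndex);
        printf("LoadLevel(%d): %.2f ms\n", levelIndex, TimeMS() - start);
    }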

These were the culprits:

  1. Generating a tessellated plane in one frame (40-60 ms!)
  2. De/Initializing multiple textures for OpenGL (20 ms) [check this]
  3. Freeing memory (not a huge perf hit, but there's no need to do this)

Naturally, generating the plane was the top priority, and it had an obvious fix: time-slice the generation. When we transition between levels we fade out the grid lines for level A until they're invisible, then we swap to the plane for level B and fade its grid lines in. That means we don't need the plane for level B until the fade-out animation is complete.

Time-slicing is when you break up work into multiple parts and only do part of it each frame. Let's look at what's involved:

  1. Allocate enough memory for the mesh.
  2. Calculate the position of each vertex going from top left to bottom right.
  3. Calculate the indices for each triangle.

This is very simple to time-slice. Allocation needs to happen all at once (but that's trivial); calculating the positions and the triangles can happen in chunks. So I simply process N rows of the plane per frame, as in the sketch below. Again, it's important to measure here: I want N to be big enough that I've generated the whole mesh by the time the swap happens, but small enough that I'm not spending more than 16 ms on any given frame (this turned out to be about 100 rows per frame).
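Here's roughly what that looks like. The names (PlaneGenState, PlaneGenStep, ROWS_PER_FRAME) and the bare-bones vertex format are made up for illustration; the real code has more going on, but the shape is the same: a cursor into the preallocated mesh, advanced a fixed number of rows per call.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct { float x, y, z; } Vec3;

    typedef struct {
        int cols, rows;      /* number of quads in each direction */
        Vec3 *vertices;      /* (cols + 1) * (rows + 1), preallocated up front */
        uint32_t *indices;   /* cols * rows * 6, preallocated up front */
        int nextRow;         /* first vertex row that still needs generating */
    } PlaneGenState;

    #define ROWS_PER_FRAME 100   /* measured: keeps a frame well under 16 ms */

    /* Called once per frame until it returns true. The mesh isn't uploaded or
       used until it's complete, so it's fine that later rows are still garbage
       while we work. */
    bool PlaneGenStep(PlaneGenState *gen, float cellSize)
    {
        int end = gen->nextRow + ROWS_PER_FRAME;
        if (end > gen->rows + 1) end = gen->rows + 1;   /* rows + 1 vertex rows total */

        for (int r = gen->nextRow; r < end; ++r) {
            /* Vertex positions for this row, left to right. */
            for (int c = 0; c <= gen->cols; ++c) {
                gen->vertices[r * (gen->cols + 1) + c] =
                    (Vec3){ c * cellSize, 0.0f, r * cellSize };
            }

            /* Two triangles per cell in this row (no cells below the last row). */
            if (r < gen->rows) {
                for (int c = 0; c < gen->cols; ++c) {
                    uint32_t i0 = (uint32_t)(r * (gen->cols + 1) + c);
                    uint32_t i1 = i0 + 1;
                    uint32_t i2 = i0 + (uint32_t)(gen->cols + 1);
                    uint32_t i3 = i2 + 1;
                    uint32_t *idx = gen->indices + (r * gen->cols + c) * 6;
                    idx[0] = i0; idx[1] = i2; idx[2] = i1;
                    idx[3] = i1; idx[4] = i2; idx[5] = i3;
                }
            }
        }

        gen->nextRow = end;
        return gen->nextRow == gen->rows + 1;   /* done? */
    }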

It turns out that time-slicing was also a good solution to the problem of setting up textures for OpenGL. The engine has a multi-threaded job system which we use to load assets, but there is a sync point on the main thread for assets that need to do GL operations. Again, measuring revealed the bottleneck: on a given frame we were initializing multiple textures. Of course, I don't need the textures until the end of the animation, so it's fine to just initialize one per frame.
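Sketched out, the fix is just a queue that the main thread drains one entry per frame. PendingTexture and the queue here are illustrative names, not the engine's actual types; the pixel data is assumed to have already been decoded on a worker thread, so the only work left is the GL calls.

    #include <GL/gl.h>

    typedef struct {
        int width, height;
        unsigned char *pixels;   /* RGBA8, already decoded off the main thread */
        GLuint *outHandle;       /* where the finished GL handle should be written */
    } PendingTexture;

    static PendingTexture pendingQueue[64];
    static int pendingCount = 0;

    /* Called once per frame on the main thread (the one that owns the GL context). */
    void ProcessOnePendingTexture(void)
    {
        if (pendingCount == 0) return;

        PendingTexture p = pendingQueue[--pendingCount];

        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, p.width, p.height, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, p.pixels);

        *p.outHandle = tex;
    }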

Time slicing shows you can get a lot of leverage from knowing what your data is and when you need it. It may be less code to simply say "load all my textures now" (do it if you can!), but it's important to ask yourself "do I really need all that now?"

So what else did this leverage give me? Well, one thing I considered even before measuring was "I'm sure doing a lot of heap allocations." Part of this was laziness ("I'll figure out where to allocate this later"), and part of it was from code written before I gave that any thought. The asset system was a particularly suspicious character. It is one of the oldest parts of the engine (after all, isn't a multi-threaded job system the first thing you should write? It's not).

Like a lot of things in games, assets are a good use case for a memory arena. However, the asset system I had worked by being handed a load task for a single asset and sending it off to a thread. The solution here was to introduce a "catalog": a memory arena along with a list of assets to load. Then a worker thread can simply load each of the assets in the catalog, allocating into the memory arena.
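Here's roughly the shape of that, assuming a simple bump-allocator arena. Arena, AssetCatalog, and LoadAssetInto are illustrative names, not the engine's actual API.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint8_t *base;
        size_t size;
        size_t used;
    } Arena;

    void *ArenaPush(Arena *arena, size_t bytes)
    {
        /* (Alignment and out-of-memory handling omitted for brevity.) */
        if (arena->used + bytes > arena->size) return NULL;
        void *result = arena->base + arena->used;
        arena->used += bytes;
        return result;
    }

    void ArenaClear(Arena *arena) { arena->used = 0; }

    /* Hypothetical loader: reads the file and pushes its contents onto the arena. */
    void LoadAssetInto(Arena *arena, const char *path);

    typedef struct {
        Arena arena;         /* everything in the catalog allocates from here */
        const char **paths;  /* the assets this level needs */
        int assetCount;
    } AssetCatalog;

    /* Runs on a worker thread: load every asset in the catalog into its arena. */
    void LoadCatalog(AssetCatalog *catalog)
    {
        for (int i = 0; i < catalog->assetCount; ++i) {
            LoadAssetInto(&catalog->arena, catalog->paths[i]);
        }
    }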

Because Changes is a level-based game it's easy to bundle the assets that are needed for a level; when we leave the level we simply clear the arena. Previously I was doing work to see if two levels shared any audio assets between them, because I knew I wanted to avoid extra allocations, and because I didn't want to accidentally deallocate data that the next level needed. With memory arenas, though, it was trivial to just load the asset twice (after all, the memory is already being used). This is actually a case where the code became simpler once I thought about what my data was and when I needed it.
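Using the catalog from the sketch above, a level transition boils down to something like this (QueueJob stands in for whatever the job system's submit call actually is):

    /* Hypothetical job-system entry point. */
    void QueueJob(void (*fn)(void *), void *data);

    static void LoadCatalogJob(void *data) { LoadCatalog((AssetCatalog *)data); }

    void TransitionToLevel(AssetCatalog *current, AssetCatalog *next)
    {
        ArenaClear(&current->arena);      /* drops every old-level asset at once */
        QueueJob(LoadCatalogJob, next);   /* start loading the next level's assets */
    }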

There was one other place allocating memory indiscriminately: DynamicArrays. This is another data structure that's been around for a while; it simply does a heap allocation, and when it needs to grow it allocates a bigger block of memory and copies the data over. This was honestly not much of a problem, in this game at least, because the DynamicArrays don't often need to reallocate, especially if given a reasonable initial size (again, know your data).

However, they did do a heap allocation (most of these were at initialization). Because of this, DynamicArrays weren't well suited for temporary use, and you needed to keep track of deallocations (this was a source of a lot of leaks).
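For reference, this is roughly how the DynamicArray behaves, sketched for floats just to keep it concrete: a heap allocation that doubles and copies when it runs out of room, plus a free somebody has to remember to call.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        float *data;
        int count;
        int capacity;
    } DynamicArrayF32;

    void DynamicArrayPush(DynamicArrayF32 *array, float value)
    {
        if (array->count == array->capacity) {
            /* Grow: allocate a bigger block and copy the old data over. */
            int newCapacity = array->capacity ? array->capacity * 2 : 8;
            float *newData = malloc(newCapacity * sizeof(float));
            if (array->count) memcpy(newData, array->data, array->count * sizeof(float));
            free(array->data);
            array->data = newData;
            array->capacity = newCapacity;
        }
        array->data[array->count++] = value;
    }

    /* ...and this is the call that kept getting forgotten. */
    void DynamicArrayFree(DynamicArrayF32 *array)
    {
        free(array->data);
        array->data = NULL;
        array->count = array->capacity = 0;
    }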

I had another, more recent data structure, ChunkedArray, which allocates into an arena; instead of moving the data when it grows, it simply allocates a new chunk and keeps a pointer to the next chunk. I never supported memory arenas for DynamicArrays because that reallocation is bad, and I didn't like the idea of enforcing a fixed size or complicating the data structure by giving it two modes.
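A sketch of the ChunkedArray idea, reusing the Arena from the catalog sketch above; the chunk size and names are made up. Nothing ever moves, and nothing needs an explicit free, because clearing the arena reclaims every chunk at once.

    #define CHUNK_CAPACITY 64

    typedef struct Chunk {
        float items[CHUNK_CAPACITY];
        int count;
        struct Chunk *next;
    } Chunk;

    typedef struct {
        Arena *arena;    /* every chunk is pushed onto this arena */
        Chunk *first;
        Chunk *last;
    } ChunkedArrayF32;

    void ChunkedArrayPush(ChunkedArrayF32 *array, float value)
    {
        if (!array->last || array->last->count == CHUNK_CAPACITY) {
            /* Current chunk is full (or there isn't one yet): link in a new one. */
            Chunk *chunk = ArenaPush(array->arena, sizeof(Chunk));
            chunk->count = 0;
            chunk->next = NULL;
            if (array->last) array->last->next = chunk;
            else             array->first = chunk;
            array->last = chunk;
        }
        array->last->items[array->last->count++] = value;
    }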

The nice thing about a ChunkedArray is that it's clear where the data lives, so clearing that arena handles all the ChunkedArrays allocated from it. It also works with a temporary arena ("FrameMem"). The disadvantage is that the data can be non-local (I'm unsure how bad this actually is in practice, since it's easy to pick the maximum chunk size you'll need). One very nice thing about the DynamicArray was that you could simply pass the data pointer to a function.

ChunkedArrays also weren't supported by the de/serialization code (another early misstep: "if it doesn't go on the heap, how will we know where it goes!?"). Adding support for that did speed up the load time a bit, because now we aren't allocating an array for each level we load at startup (these levels will all be packed into one file come ship anyway).

Missing: screenshots

The remaining performance work probably needs to happen in the shader.