Intro
Hello! I’m Tomasz Mozolewski and together with Robin Maddock we are tech leads on League of Legends. In short, our job is maximizing game performance and providing a stable platform for designers to create awesome experiences for players.
Nobody likes when the game slows down or loses frames in the middle of a team fight. In a game like League, having your mouse click register half a second later can be the difference between executing that sweet combo and whiffing your skill shot.
So we spent some time last year really diving into a variety of ways we could improve performance for players by better allocating games to servers and reducing CPU usage. In the process we challenged some preconceived notions for how this would be possible, ran a variety of tests, and eventually found a solution that created a significant improvement.
This is the Riot Tech Blog so, from here on out, we’re going to dive deep into the tech choices. If you want to see the entire process with a bunch of graphs, the scoring system we set up to measure tests, and the conclusion we reached, we’ve got that for you. But if you want the TL;DR for what changed and how it impacted the player experience, here you go:
After a bunch of testing we identified a Round Robin strategy for allocating new games to servers that dramatically lowered CPU variability, allowed us to raise the auto scaling target by 10 percentage points (from 40% to 50%), and resulted in a 20% savings on hardware cost while achieving better, more consistent game quality.
Cloud-Based Context
We moved to the cloud for the ability to quickly obtain hardware for new features and scale for big events. But cloud environments are different from data centers, so we had to rethink parts of our environment to get the most out of this new tech. One major opportunity was around game server (the physical machine) allocations, and how we balanced performance needs, maintenance overhead, and cost.
The Old and the New
The old way of distributing games was based on a pull model. When a game was ready to be played, it was put on a queue. Each game server would pull a game from the queue when it was ready and start the next game process. To ensure even distribution of games, we introduced a delay proportional to how heavily loaded a particular game server was.
While this strategy resulted in fairly even loading of game servers, we noticed that some machines were very greedy and took on too many early games because of how little CPU a game uses at the beginning. We added heavier weighting for early games, and that solution served us relatively well for a long time.
The new way reversed the process. Instead of each server pulling games, we introduced an orchestrator that monitors all game servers and decides where to place incoming games. This approach eliminated the delay of the pull model and put all the data in a central place, allowing us to make more informed decisions about where to place games.
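As a rough sketch (the names and data shapes here are illustrative, not our production code), the push model reduces the orchestrator's job to a small placement step that any allocation strategy can plug into:

```python
def place_game(servers: list[dict], strategy) -> int:
    """Push model: the orchestrator, not the game server, decides placement.

    `servers` is the orchestrator's current view of the fleet; `strategy` is
    any callable that maps that view to the index of the chosen server.
    """
    index = strategy(servers)
    servers[index]["game_count"] += 1  # optimistic update until the next data refresh
    return index

# Example: place a game on the least-loaded server in a two-machine fleet.
fleet = [
    {"cpu": 35.0, "game_count": 120},
    {"cpu": 55.0, "game_count": 180},
]
chosen = place_game(fleet, strategy=lambda s: min(range(len(s)), key=lambda i: s[i]["cpu"]))
print(chosen)  # 0
```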
Our engineers configured auto scaling based on CPU utilization (currently the resource we exhaust first). Unfortunately, because CPU utilization varied so widely across game servers, we had to keep the target CPU threshold at 40% to prevent individual machines from going over 80%.
The Goals
We were looking to achieve four goals in this brave new world:
Improve performance quality for players by maintaining peak CPU below target almost all the time
Implement a strategy capable of handling differences in games, game modes, and player behavior
Reduce manual maintenance by requiring little or no configuration (especially between regions)
Remain cost optimal by allowing auto scaling at a higher threshold
We considered a variety of game server allocation strategies to meet these goals. Some strategies were fairly straightforward, such as:
LeastCPU – Assign new games to the server with the lowest CPU utilization
LeastGameCount – Assign new games to the server with the fewest active games
RoundRobin – Select the next server in line and assign a game
Random – Pick a game server at random and assign a game
We also considered combinations of strategies, such as “Pick N servers at random, then assign to the one server with the least CPU load.” This gave us a lot of possible directions, and the next step was figuring out the best one.
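A hedged sketch of what such a combination might look like (the function name and the sample size of five are illustrative):

```python
import random

def random_then_least_cpu(cpu_by_server: list[float], sample_size: int = 5) -> int:
    """Pick `sample_size` servers at random, then return the index of the one
    with the lowest reported CPU (sometimes called the 'power of a few random
    choices' approach)."""
    pool = range(len(cpu_by_server))
    candidates = random.sample(pool, k=min(sample_size, len(cpu_by_server)))
    return min(candidates, key=lambda i: cpu_by_server[i])
```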
Building a Simulation
Rather than trying to guess which strategy would prove best, we decided to test them all. To do that, we pulled per-minute game CPU utilization data from our game processes. Then we pulled the number of games we start every minute throughout the day. We repeated this process for all games and all shards. Here are some examples.
Game Cost
We noticed that different games behave differently over time. They all start with low CPU utilization that grows to a peak around the middle of the game (LoL) or closer to the ¾ mark (TFT). While the average CPU curve was fairly smooth, individual games were spiky, and a single game's maximum CPU usage could easily be more than double the average.
Game Duration
No surprise here: game duration is also distributed differently between games. This told us we needed to consider multiple scenarios, like running each game on its own set of game servers versus combining them all into a single set.
LoL Game Duration
Game Starts
The number of games starting changes throughout the day, week, year, and special events. There will be more players online on a Sunday evening in the middle of Anima Squad than there will be in the morning on a random Wednesday in spring. To account for these differences, we decided to simulate a 24-hour period based on data pulled from different shards. Depending on the region, players play at different times and sometimes have two peaks in a day. Graphs show a 24h period that starts at the trough for the region.
EUN
Auto Scaling
Operating in the cloud, we take advantage of auto scaling. To decide how many game servers we need to run, we look at average CPU across all game servers and attempt to maintain a target, for example 50%. If the average CPU rises above that target, we add more servers, but they don't immediately start accepting new games - they need about 5 minutes on average to start up first. On the other hand, when average utilization falls below the target, we remove servers, but they have to stay up until all of their games finish before we can shut them down.
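A minimal sketch of that target-tracking idea, assuming the fleet scales in proportion to how far average CPU sits from the target (the real system relies on the cloud provider's auto scaling, not code like this):

```python
import math

def desired_server_count(current_servers: int, average_cpu: float, target_cpu: float = 50.0) -> int:
    """Scale the fleet so that, at the current load, average CPU would land on the target.
    Newly requested servers take ~5 minutes to start accepting games, and servers marked
    for removal must drain (finish their games) before they can be shut down."""
    if average_cpu <= 0.0:
        return current_servers
    return max(1, math.ceil(current_servers * average_cpu / target_cpu))

# Example: 100 servers averaging 60% CPU against a 50% target -> request 20 more.
print(desired_server_count(100, 60.0))  # 120
```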
On one hand, selecting a higher auto scaling target allows us to save cost by better utilizing the available compute resources on each server. On the other hand, if we select too high of a target, we risk crossing the 80% performance-impact threshold and degrading game quality for players. Still, the potential savings shown in the table below make it a goal worth pursuing if we can find a strategy that lets us raise the target without sacrificing game quality.
Solution Criteria
The technical goals of a solution are:
Minimize cases where a single machine goes over 80% CPU to guarantee quality
Maximize auto scaling target CPU to save cost
Get closer to a perfect ideal where all machines have identical CPU usage
To keep the simulation simple, we need limits:
Games will only be placed on machines below 70% CPU utilization
Assume that all game servers are on uniform hardware (there are adjustments we can make to calculations to allow different sizes of servers if necessary)
To better reflect reality, we'll add a couple of twists to see how well strategies behave in different conditions (see the sketch after this list). We'll vary:
Number of orchestrators. For redundancy, game start requests can be sent to more than one orchestrator, each independently placing them on game servers based on the selected allocation strategy
Data refresh interval (how often input data such as CPU is updated)
Hardware size (number of vCPUs per game server)
Game server startup time (how long it takes from game server being requested to available to place games)
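Inside the simulator, those twists boil down to a small scenario description. A sketch with illustrative field names and default values (the vCPU count in particular is just a placeholder):

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    orchestrators: int = 1           # independent placers receiving game start requests
    data_refresh_seconds: int = 30   # how stale CPU / game-count data can be
    vcpus_per_server: int = 64       # hardware size (placeholder value)
    server_startup_minutes: int = 5  # delay before a newly requested server can host games
```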
Each strategy gets a single list of game servers as input, with two attributes per server (see the sketch after this list):
Current CPU
Current number of games running
And finally a single output number:
Index of selected game server to place the next game on
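In code, that contract keeps every strategy interchangeable. A sketch of the interface and a few of the simple strategies (again illustrative, not the production implementation):

```python
from dataclasses import dataclass
import itertools

@dataclass
class GameServer:
    cpu_percent: float   # current CPU utilization
    game_count: int      # current number of running games

# Every strategy has the same shape: a list of servers in, the chosen index out.
def least_cpu(servers: list[GameServer]) -> int:
    return min(range(len(servers)), key=lambda i: servers[i].cpu_percent)

def least_game_count(servers: list[GameServer]) -> int:
    return min(range(len(servers)), key=lambda i: servers[i].game_count)

_next_slot = itertools.count()

def round_robin(servers: list[GameServer]) -> int:
    # Ignores CPU and game count entirely; just cycles through the fleet.
    return next(_next_slot) % len(servers)
```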
Results
LeastCpu is the most intuitive strategy when the goal is to keep all machines at the same CPU load: you simply put new games on the machine with the lowest CPU. Surprisingly, it was actually the worst performing strategy.
Because input data is only refreshed every 30 seconds, all new games in that window get placed on the same newly added game server, whose CPU then peaks once those games reach the midpoint. This also pushed the whole fleet into an oscillating state, since the games on any one server were all starting and ending in a similar time window.
Random allocation strategy produced much better results than LeastCpu. While the average CPU utilization hovered around 50%, many game servers crossed our 70% limit multiple times throughout the day. We thought we could do better.
TenPercentCpuRandom selects the bottom 10% of game servers by CPU and picks one of them at random.
TenPercentGameCountRandom selects the bottom 10% of game servers by game count and picks one of them at random.
Round Robin simply sends each game to the next server in line. With this allocation strategy, we got much tighter CPU utilization.
The interesting finding here was that CPU-based strategies performed worse than game-count-based strategies. There are two main reasons for this (the toy simulation after this list makes the second one concrete):
When a game starts, its CPU utilization is low and grows over time, causing CPU-based strategies to allocate too many games to lightly utilized game servers without accounting for how much those games will grow.
The law of large numbers means that even though individual games differ in utilization by type and stage, with a sufficiently large number of games it all averages out. So game count is actually a better predictor of future demand than current CPU.
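A toy Monte Carlo makes that intuition concrete (the per-minute profile below is made up, not measured data): once a server hosts a couple hundred games, each at a random point in its lifetime, its total CPU is almost entirely determined by how many games it runs.

```python
import random
import statistics

# Toy per-minute CPU cost of a single game: low early, peaking mid-game.
PROFILE = [0.2, 0.4, 0.7, 1.0, 1.4, 1.8, 2.0, 1.7, 1.2, 0.6]

def server_cpu(num_games: int) -> float:
    """Total CPU for a server whose games sit at random points in their lifetime."""
    return sum(random.choice(PROFILE) for _ in range(num_games))

samples = [server_cpu(200) for _ in range(1_000)]
relative_spread = statistics.pstdev(samples) / statistics.mean(samples)
print(f"relative spread ≈ {relative_spread:.1%}")  # a few percent: count predicts load well
```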
Scoring
Looking at the graphs is fun, but to evaluate a large number of scenarios and compare them quickly, we had to come up with a scoring system that would accurately reflect player pain. We used a weighting system to reflect the amount of player pain expected at different CPU usage levels, and we set a score target of less than 1%. The score is a weighted percentage of game minutes, with step weights that grow as CPU climbs toward and past the 80% threshold.
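Here's a sketch of one way such a step-weighted, game-minute-based score could be computed; the buckets and weights below are illustrative assumptions, not the weights we actually use:

```python
def pain_weight(cpu_percent: float) -> float:
    """Illustrative step weights: no pain at comfortable CPU levels, rising
    sharply as a server approaches and crosses the 80% threshold."""
    if cpu_percent < 70.0:
        return 0.0
    if cpu_percent < 80.0:
        return 0.25
    return 1.0

def score(samples: list[tuple[float, int]]) -> float:
    """`samples` holds (server CPU %, game minutes on that server) for every
    server-minute of the simulated day. Returns the weighted share of game
    minutes spent at painful CPU levels, as a percentage; the target is < 1%."""
    total_minutes = sum(minutes for _, minutes in samples)
    painful_minutes = sum(pain_weight(cpu) * minutes for cpu, minutes in samples)
    return 100.0 * painful_minutes / total_minutes
```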
Here’s the score each strategy received on a single scenario with auto scaling threshold set at 50% CPU.
While a lot of strategies achieved our target score, our secondary objective was to save cost, so we wanted to see how those strategies performed at higher auto scaling CPU thresholds.
Now we can replace the raw quality score with the highest target CPU at which each strategy still achieves the desired score. This lets us save cost while delivering a high quality experience to players. We can also see how stable that result is across data from different shards.
And compare the top strategies across different games (by average max CPU across scenarios).
Based on LoL and TFT data, we chose Round Robin because it was consistently close to the top (even though, before Swarm was introduced during Anima Squad last summer, it was never the top strategy) and because of the simplicity of the mechanism. Round Robin doesn't rely on any input data, so it isn't susceptible to delays in updates and doesn't get confused by games that significantly increase resource use later in the game.
We also wanted to use an algorithm that would not have to be changed every time we launch a new game mode. This decision proved necessary when we launched Swarm. This game mode completely obliterated the other alternatives and showed that Round Robin was the only algorithm that could withstand the havoc it unleashed on our game servers.
Top strategies
Lower CPU variability allowed us to raise the auto scaling target by 10 percentage points (from 40% to 50%) and resulted in a 20% savings on hardware cost while achieving better, more consistent game quality.
Learnings
Four major factors were responsible for most of the differences in allocation strategy performance:
Input data synchronization delay - strategies overly relying on current data were hurt the most.
Variable CPU utilization during game - strategies relying on CPU were misled by games using very different CPU costs over the course of the game.
Law of large numbers - when you have a large number of games running at different stages of the game, the peaks and valleys of CPU utilization are offset from each other. This leads to some games using a lot of CPU while others still use little. The more uniformly distributed games are, the more stable CPU utilization remains.
Variable game duration - in a constant-load scenario, RoundRobin proved to be the best, but variation in game duration and the introduction of new game servers during auto scaling made it react too slowly to changing conditions to always hold the top position.
The obvious choice of using CPU-based strategies created the worst results
Given a sufficient variety of scenarios, the simplest strategies outperform more complex ones
There's always going to be some waste in the system; the less waste remains, the harder it is to get rid of it
Overoptimization for specific scenarios leads to poor performance in others and creates a very fragile system that can easily fail when conditions change.
Higher complexity might get you better results, but it will cost more to maintain than it saves in hardware. As other improvements are made (faster startup, larger game servers), the benefit of the added complexity quickly fades.
As long as your scoring formula makes sense, it doesn’t significantly change the results.
We also tried a version that took into account only game servers, without game minutes
And another with a smoother exponential formula instead of steps
All three formulas produced values within 2% of each other and the same ordering of strategies
Allocation strategies are only one part of the equation. Game code performance, new game server startup time, number of vCPUs on each game server, ability to autoscale ahead of time (predictive auto scaling) and many more factors contribute to the ultimate goal of ensuring the best player experience.