Hi there. My name is Justin, and I’m in charge of server engineering at Boss Fight Entertainment. I hope to provide a glimpse into what our server tech stack looks like, and how we came to choose the individual components. Although we started with a single game in mind, almost everything has proven reusable.
Dungeon Boss was the first game we built at Boss Fight Entertainment so the server architecture started with a blank slate. We knew it was going to be a free-to-play mobile game, so the ability to scale was paramount. We knew we wanted social features like chat, mail, friends, guilds, and PvP, so there was going to be a lot of back-end plumbing. We knew the game logic was in C# with Unity, but we really wanted to run exactly the same logic on the client and server.
Once you make that initial conceptual leap, however, Node becomes an easy and productive way to code. If you’re just now trying it, do not hesitate to use libraries like underscore and async. We also recommend treating yourself to the WebStorm editor.
Allegedly “productive” languages sweep through the industry biannually. What actually makes Node worth trying compared to something like Java or Go? Believe it or not, the most convenient and underrated aspect of Node.js is simply that JSON is native.
Imagine that you need to ingest some JSON, dynamically remove some fields and add some others, then hand off the result as JSON. Try doing it with an inheritance-based object model and an annotation-based system like Jackson. Soon you’ll give up on elegantly annotated object hierarchies and just make everything a map — maps are easy. But now you have an untyped, indirect way of expressing what would be utterly explicit in Node.js.
Node.js is not without faults. Node’s concurrency model is often cited as being easy to reason about. Unfortunately, if you are used reasoning about multiple threads in your process you’ll be disappointed. On the other hand, Node’s model makes a lot of sense in a distributed architecture where your servers spend most of their time waiting on external resources.
The validation of game logic presents a tradeoff between security and player experience. If everything happened locally on device then there would never be any network or server latency to interrupt the UI. Of course, if everything was local then nothing would stop savvy players from cheating.
The inverse isn’t particularly palatable either. If all game logic happened on the server then every player click would require a round trip before we could render the outcome — the experience would be terrible. It just wouldn’t work for a combat game.
We settled on a client-predictive, server-authoritative model that offered the best of both worlds. This approach requires that equivalent game logic can run in both environments. Although we debated it, keeping separate client and server implementations of the game logic looked like a maintenance nightmare.
The use of Unity on the client meant the game logic was in C#, which is not natively supported on linux. Luckily the mono project makes this possible — you can run exactly the same DLL on linux that you might use on obscure platforms like Windows.
Our game servers are stateless. Every time a request comes in from a client device we have to look up some lightweight account information (1KB) and, potentially, the game state (40-200KB) for that particular player. In either case, we have enough information to directly fetch what we need rather than search for it.
If you can express your operations in terms of a primary key you have the option of using a key-value store. Relational databases can work well in this capacity, up to a point, but they tend to scale vertically. Dedicated key-value stores can run on an arbitrary number of boxes and distribute replica data so your eggs are never all in one basket.
For us, at least, CouchBase’s speed was a distinguishing factor. Running a large cluster of anything is less fun than letting someone else manage it. Once it became clear that virtually any publisher allowed a reliance on AWS, there was no obvious reason not to use DynamoDB instead of CouchBase. So why in the world would we choose the manual solution? Well, we can almost always fetch 50KB values from a remote CouchBase instance in under 5ms. With DynamoDB that number was more like 40ms.
Bear in mind that CouchBase isn’t something that scales automagically. All instances in a cluster are identical — if you double the number of instances you also double the amount of RAM and disk space in the cluster. You can add more instances to a running cluster, but be proactive because the “rebalance” process itself puts the cluster in a diminished state and takes an inordinate amount of time.
We had early versions of the game running using only CouchBase, but the inescapable truth is that CouchBase works best when you can fetch documents by key. There are manual work-arounds when you want to index by one or two additional fields, but CouchBase’s general purpose “view” system was simply too slow to use for handling live requests.
Luckily, I happened to be familiar with another scalable key-value store that was also proficient at indexing documents. ElasticSearch scales horizontally with ease and its robust query system understands the structure of the JSON documents. Perhaps its best trick is “document routing” — the use of an auxiliary value which controls where a document is physically stored. When used judiciously, routing allows queries for a set of related documents to hit a single instance instead of every box in the cluster.
To see why ElasticSearch seemed interesting, consider the problem of PvP opponent sampling at scale. There are many potential dimensions involved in matchmaking and we tend to tweak these formulas regularly. In order to preserve maximum flexibility, we’d like to keep candidates in memory. We don’t need every potential opponent — we just want a diverse subset of recently active players.
Our goal is to find the N most-recently active opponents of player level X, whose player ID mod Y equals Z, where Y is chosen based on the size of the population and Z is chosen randomly for each scan. ²
In CouchBase the hard part is selecting the “N most-recently active” records. We would need to create a view with a complex key similar to [X,Y,Z,YEAR,MONTH,DAY,HOUR]. However, in order to get N records, we would need to keep querying the view for successively earlier hours and days until we met our threshold. Nasty.
With ElasticSearch, on the other hand, it’s easy to query based on multiple fields, then sort by time and limit the results, all in a single operation. It’s intuitive in the same way a SQL solution would be.
Another use-case that troubled us involved modeling one-to-many (parent-child) relationships. An example would be a player’s inbox, with the player being the parent and the messages being the children. There are viable approaches that might use one document total, or one per message, but the latter requires some way to “find the children” given just a parent key.
There are several ways to implement the inbox with CouchBase. The simplest approach is to have a single inbox document per player. Of course this requires us to lock and transmit the whole document if we want to add, remove, or update any single message, and it prohibits automatically expiring individual messages. Another approach is to have individual documents per message with a single index document per player. This sounds like a great way to reduce contention and I/O, but it adds a lot of complexity and still doesn’t allow automatic message expiration without invalidating the index document.
With ElasticSearch we can store the messages as individual documents routed by the inbox owner. Documents can be updated or deleted individually, and each document can have its own automatic expiration time. Routing by player ID allows for efficient fetching and counting. ³
There were several use cases where ElasticSearch could help us, but that didn’t automatically make it worthwhile. There was a good deal of soul searching about introducing yet another data store. In fact, we even considered dropping CouchBase so we only had to manage one cluster. In the end, however, we decided that CouchBase was more reliable and significantly more forgiving about document structure. We use ElasticSearch extensively, but everything it holds is expendable.
Our game server infrastructure is stateless, so we can automatically scale the number of instances based on load by simply spinning up a pre-baked machine image. In this paradigm, it’s common for multiple instances to calculate the same values and cache them locally. However, if the work is expensive enough then the savings from a shared cache can outweigh the benefits of total independence.
I previously mentioned the use of ElasticSearch to perform opponent sampling. Before a server can take traffic, it needs to have these opponent samples loaded (thereafter periodic cache reloading happens at a semi-random interval ⁴). Any event that caused all servers to start in unison, including simply deploying a new version, suddenly put a huge load on ElasticSearch. Requests would time out, delaying the moment a server declared itself ready. Some servers would get automatically killed and restarted. It was a mess.
After toying with several options, we came up with a pretty solid solution. As you have no doubt guessed, the solution was memcached (ElastiCache at AWS). In order to keep everything as robust as possible, the server code kept the ability to load samples from ElasticSearch, but it is only used as a fallback. We now check memcached first.
We designate a single instance to keep the cache warm ⁵, but instead of sampling randomly it would query for ALL result sets. For instance, if the number of active level 50 players implied modulo 11 sampling, we would make 11 different queries and cache the results of each under different memcached keys. As long as the cache warmer can cycle through all the combinations of all the levels before the cached results expire, then none of the other instances have to look beyond the cache.
This single change was probably our biggest stability improvement. Deploying new server versions — an almost daily occurrence — no longer generated CloudWatch warnings and generally became much less exciting.
When everything runs on a single box it’s easy to find the logs. As you add boxes to your environment the task becomes increasingly laborious. When you start auto-scaling and instances disappear overnight, it becomes impossible to access logs unless they were shipped elsewhere. The solution is log aggregation — all of the autoscaled boxes still have locally rotated logs, but they also send the data to a central location indexed by a search engine.
The ELK stack (ElasticSearch, Logstash, Kibana) is a group of free technologies from the brilliant folks who make ElasticSearch. Logstash is both a well-known format for log data and a corresponding set of libraries for different languages to help emit this data. Kibana is a web application that makes (typically daily) indexes for logstash data, and provides efficient searching and graphing of the aggregated information.
Even though we use ElasticSearch in our main game infrastructure, we keep an entirely separate cluster for Kibana. Our rationale is that a logging failure would be a small problem on its own, but a major disaster if it took down the live game. Additionally, log and non-log ElasticSearch clusters will often have significantly different sizing so it’s best to manage them separately.
The firehose of production log traffic can quickly bring an undersized cluster to its knees. The best rule we have found to maximize the traffic handled by a given number of boxes is this: one shard per index per box. You can control the number of shards and replicas created in the logstash template. If using the common settings of five shards per index with one replica, that means you need TEN elasticsearch instances. If you need fewer boxes, it’s best to also reduce the number of shards.
Kibana strives to be associated with data visualization, but it’s important to temper your expectations. First, if you want to be able to graph (or really aggregate) individual “fields” within your log messages, it’s important that logstash is configured to extract those fields before they get to ElasticSearch. Second, setting up visualizations is unintuitive and minimal control is offered over the final appearance. Third, certain types of graphs are simply impossible, invariably including the one kind you need.⁶
RDS Postgres and Redshift
At some point we decided that we really needed cool-looking graphs on the wall: installs per hour, unique users per hour, exceptions, daily retention, that sort of thing. The plan had always been to create Kibana dashboards, but when the time came we found that the most frequently requested graph was not possible!
Graphing “retention” requires that you know the specific users who install on a given day (D0), and count how many of the same users return on a specific subsequent day (D1..N). A graph of D3 retention would generally have dates on the X-axis and counts on the Y, and show the installs for each date in one color and the DN returns for that date in another.
Unfortunately, Kibana was unable to make a data series for DN that consisted only of players known to have installed on D0 and returned on DN. We initially responded with hacks — if you embed the days since install in your log messages you can sort of make it work. This was cumbersome, required updating the logstash template, and we could never quite get the output formatted and labelled properly.
Our next attempt was more successful. We used a Node.js app to query Kibana’s underlying daily indexes and graph the output ourselves. The results looked nice and the performance was acceptable for many queries, but eventually it became too slow for heavily aggregated graphs.
Eventually it became clear that the graphs we most wanted to explore were not the ones we could generate quickly, and so we made a radical change. We decided to use a database (RDS hosted Postgres). With a well designed schema and pre-aggregated data, it’s possible to support a complex charting UI without any expensive SQL. Our key table held “sessions”, but a secondary table held only aggregated data. We used a background process to continually update this secondary table with installs and unique user sessions aggregated by hour, day, platform and country.
The postgres solution completely solved all performance issues with the graphs, but soon more people wanted access to the underlying database and everyone had some pet data they needed to track. Over the course of a year, the people using postgres for ad hoc data analysis grew increasingly dissatisfied. So, finally, we went for the nuclear option — Redshift.
If you already have data in RDS, it’s not that hard to pipe it into Redshift. AWS Data Pipeline makes it easy to keep Redshift up to date — the only challenge is getting historical data migrated. Once we had Redshift working, our analysts rejoiced. Suddenly queries that were taking fifteen minutes started taking 15 seconds. Bear in mind that Redshift is not quite as snappy as a normal relational database for simple queries, but we’re so impressed with Redshift that we are looking at ways to use it without postgres in subsequent projects.
A complex technology ecosystem can make it difficult to set up isolated dev or test environments. Luckily, we’re able to pack everything we need on a single instance (usually an AWS C3.large). We use pre-configured machine images for this based on vanilla Amazon linux. The image contains Node.js, mono, CouchBase, ElasticSearch, and our chat server, along with the Node.js apps that allow us to manage what’s deployed and run concurrent versions.
An environment setup this way can be completely stand-alone, or it can use some shared, centralized resources. For instance, it can optionally ship it’s logs to a non-production Kibana instance, or use a non-production memcached. Most dev instances don’t write to postgres, but a shared, non-production instance is available.
An isolated stack is great for development but it’s an extremely poor predictor for real-life system performance. The general rule we follow when setting up a staging environment is that anything that would be a cluster in production must be at least a minimal cluster in staging. Anything that has a master-slave scheme in production has at least one slave in staging. Likewise, anything that would have a load balancer or autoscaling in production also has it in staging.
Consider our chat system. All writes come from inside the network and go to a master, all reads come from outside the network and go to a slave. On a dev stack we can co-locate these processes, but in production we have the chat slaves in an auto-scaling group behind an ELB. That means our staging environment also has a master and one slave in a very lonely auto-scaling group behind an ELB.
The main benefit of this approach is that you’re never surprised by emergent bugs which only happen in distributed versions of your environment. If it works in staging, it will work in production. It’s also much easier to scale up for load testing if you start with a realistic topology. The downside, of course, is that your baseline cost is higher, and a complex ecosystem is harder to spin down when you’re not using it.
It’s hard to talk about our tech stack without diving pretty deeply into how we leverage AWS. That topic is vast enough to deserve its own post.
Other areas we really glossed over are the build and deploy cycle, and our system for running multiple server versions concurrently.
*1 Don’t get too excited, it’s never going to replace Hadoop or EMR or even Cassandra.
*2 The modulus helps keep sufficient diversity in the results — the values are pre-calculated for ID mod 11, 53, and 199.
*3 The answer is not clear-cut, however, as we implemented friend lists the easy way with a single CouchBase document. The primary difference was that friends don’t benefit from per-entry expiration.
*4 We always use semi-random intervals to prevent a Thundering Herd.
*5 Each environment already has a single administrative instance, so this was the obvious choice.
*6 Kibana continues to improve and newer versions may have addressed this. We were using Kibana 4.0.