Continuous Infrastructure on Google Cloud

Aug 7, 2019 | Daniel Jones

Recently we helped the folks at Unit 2 Games automate the creation of their Google Cloud and Kubernetes infrastructure by using Concourse and EngineerBetter’s own best-practice approach to reliably deploying platforms.

“We’ve managed to take a huge step forward in our infrastructure automation with the help of EngineerBetter. Where our environments were once hand-crafted and fragile, they are now reliable, easy and flexible, and we’re getting really confident with the tooling they recommended.”

Tom Gummery - Senior Software Engineer, Unit 2 Games

Before going into how we worked and how we achieved this, here’s a summary of how Unit 2’s pipelines now work:

How We Worked

Unit 2 Games are in the midst of making Crayta, an awesomely powerful collaborative gameplay creation tool. Using the Unreal engine, Crayta offers collaborative game creation for all - think a much more advanced and aesthetically pleasing Roblox.

The folks at Unit 2 need to be able to deploy rock-solid Kubernetes-based platforms for large numbers of users. As a Game-as-a-Service, these platforms need to be updated continuously whilst live game sessions are going on.

A global online game doesn’t need just one monolithic platform - instead, multiple production environments are required in a number of geographies in order to provide low latency to users around the world. Not only that, but due to the viral nature of such a social experience, the folks at Unit 2 need to be able to deploy new environments in new territories at a moment’s notice.

“Undertaking this work was initially a daunting prospect, but with EngineerBetter’s help we broke the work down and delivered great value back to the business. We greatly reduced our fear of change, and so now reap many benefits such as being able to rapidly provision environments, audit infrastructure changes, and tighten system access control.”

Steve McDowell - Web Systems Director, Unit 2 Games

EngineerBetter always work collaboratively, so we used remote mob-programming to team up with Unit 2. We simultaneously helped build their infrastructure automation solution whilst sharing our experiences and educating their Kubernetes-savvy engineers in the ways of Concourse.

Unit 2 engineers booked out a meeting room, EngineerBetter dialled in via Zoom, and then we proceeded with the usual mobbing rules - 10-minute turns on the keyboard, and you can only type something if someone else has told you to.

Engineering days were not contiguous, and we’d spend maybe a few days a week collaborating in this way. This allowed the Unit 2 engineers to attend to other matters, and prevented us from getting remote-work fatigue. It also allowed ideas to percolate, meaning we got ‘creative tunnel vision’ less often.

“Mobbing over a remote video link turned out to be an effective way of skills transfer and education, whilst also building towards a real deliverable. It has been particularly easy to schedule these sessions in around other work, allowing us to maintain normal service while we develop improvements.”

Tom Gummery - Senior Software Engineer, Unit 2 Games

The Implementation

In this post we’ll refer to stages of QA as tiers - so think things like ‘dev’, ‘staging’ and ‘production’. We’ll call each freestanding and isolated instance of the Unit 2 stack an environment.

The environments themselves consist of a few different elements:

Infrastructure as Code

As is always the case when we’re pipelining platforms, everything that makes up an environment is made concrete in a Git repository.

A lot of this is Terraform, along with Kubernetes YAML, and the odd bit of custom scripting to glue things together. For example, we want our pipelines to be able to create the buckets they use to hold their state. That creates a bit of a chicken-and-egg problem, so we created a reusable Concourse task to do this.
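
One way to picture the chicken-and-egg fix is an idempotent "ensure the bucket exists" step that runs before anything needs remote state. The following is a minimal Python sketch of that idea, not the Concourse task itself (the post doesn't show it). `FakeStorage` and `ensure_bucket` are hypothetical names, and the in-memory class merely stands in for a real cloud storage client.

```python
class FakeStorage:
    """Hypothetical in-memory stand-in for a cloud storage client (not a real API)."""

    def __init__(self):
        self.buckets = set()

    def bucket_exists(self, name):
        return name in self.buckets

    def create_bucket(self, name):
        # Mimic a provider that rejects duplicate bucket names.
        if name in self.buckets:
            raise ValueError(f"bucket {name!r} already exists")
        self.buckets.add(name)


def ensure_bucket(storage, name):
    """Create the bucket only if it is missing; return True if it was created."""
    if storage.bucket_exists(name):
        return False
    storage.create_bucket(name)
    return True
```

Because the step is safe to re-run, it could plausibly sit at the start of every pipeline run: the first run creates the bucket, and later runs do nothing.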

Why Not Just Terraform?

If we’re using Terraform inside our pipelines, why not just use Terraform? Why bother with Concourse at all?

Applying Terraform is a one-off process. We want our environments to adopt the latest patches, updates and security fixes. Concourse is the driver for that - as a ‘continuous thing-doer’, it’s watching for upstream releases and then feeding those into Terraform and other tools.

The Same Code for All Environments

The biggest mistake we see experienced platform teams make when using Concourse is to have different pipeline definitions for production versus other environments.

If your production Infrastructure-as-Code is different to the tiers you tested in, then your tests are not valid. You are quite literally using untested code in production.

In order to have a cast-iron guarantee that production will work you need to be using the same assets all the way through, giving parity between tiers.

In practice, this means some sophisticated tricks with Concourse, such as either pipelines that set themselves, or pipelines that set other pipelines. This is explained later.

Promoting Versions with Stopover

Versions of components in an environment should only be promoted if they are known to work together. We need to integration test everything that goes into that environment, and then promote that set of versions to the next tier.

The simple view of this is that the version of every resource (Git repo, Helm chart, BOSH release) that we’re interested in is written to a YAML file upon completion of integration tests. We’ve written a tool to do this called Stopover, which we’ll be writing about in more detail some time soon.

This file is then uploaded to S3, or committed to Git. This act of creating a new version of the file triggers the next pipeline in the chain: so perhaps the dev pipeline writes a file that triggers the staging pipeline. Only this set of versions is used in the subsequent pipeline.
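
To make the promotion step concrete, here is a rough Python sketch of writing a versions file only after integration tests pass. It is an illustration of the idea rather than how Stopover works: the function names, the resource names and the flat file layout are all assumptions.

```python
def render_versions(versions):
    """Render {resource name: version string} as a flat, sorted YAML mapping.

    Assumes names and versions are simple strings without quotes or newlines.
    """
    lines = []
    for name in sorted(versions):
        # Quote values so a version such as 1.10 is not parsed as a number.
        lines.append(f'{name}: "{versions[name]}"')
    return "\n".join(lines) + "\n"


def promote(passed_tests, versions):
    """Return the versions file content to publish, or None if tests failed."""
    if not passed_tests:
        return None
    return render_versions(versions)
```

Sorting the keys keeps the output stable, so an identical set of versions produces an identical file and would not register as a new version downstream.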

Canaries and Rolling Upgrades

One production environment is the designated ‘canary’ environment. This is live, has real users, and is also the first of the production environments to deploy a new set of components. Only once this canary environment has been successfully updated do we consider upgrading the others.

The other production pipelines are configured to be triggered by updates to the versions file that has been ‘promoted’ by the canary pipeline. I use inverted commas as at this point we’re already making live production changes.

In order to facilitate a rolling upgrade of remaining production deployments, we use the Concourse Pool Resource. Each pipeline needs to acquire a lock on a member of the ‘in flight’ pool for its tier. If there are 10 production environments, then we have 2 members in this pool, meaning there’s a 20% max-in-flight. Other pipelines will block waiting for a lock to become available.
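
The max-in-flight behaviour can be sketched outside Concourse with a counting semaphore playing the part of the lock pool: 2 locks shared by 10 environments gives 2 / 10 = 20% in flight at most. This Python sketch is only an analogy for the Pool Resource, and `rolling_upgrade` and its parameters are hypothetical.

```python
import threading


def rolling_upgrade(environments, max_in_flight, upgrade):
    """Upgrade every environment, with at most max_in_flight running at once.

    Returns the list of upgraded environments and the peak concurrency seen.
    """
    pool = threading.BoundedSemaphore(max_in_flight)  # the 'in flight' locks
    guard = threading.Lock()                          # protects the bookkeeping
    state = {"running": 0, "peak": 0}
    done = []

    def run(env):
        with pool:  # block until a lock is available
            with guard:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            try:
                upgrade(env)
                with guard:
                    done.append(env)
            finally:
                with guard:
                    state["running"] -= 1

    threads = [threading.Thread(target=run, args=(env,)) for env in environments]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return done, state["peak"]
```

Calling `rolling_upgrade([f"prod-{i}" for i in range(10)], 2, upgrade_fn)` would upgrade all ten environments while never holding more than two locks at a time.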

Consistency

Each pipeline is locked to only accept versions of resources that have passed previous pipelines. This means that there is every expectation that they should work, and if they don’t, we’re discovering something new about environmental differences.

There are additional consistency safeguards too.

Our pipelines set themselves. Every time the automation runs, it ensures that it is running with the latest tested version of the automation itself, as well as the latest tested versions of all the other things that make up the platform. Novice Concourse usage involves developers setting pipelines ‘by hand’, which is error-prone and oft-forgotten.

Our pipelines are also atomic. We make further use of the Concourse Pool Resource to ensure that pipeline config can’t be reconfigured halfway through a run. Without this, a pipeline could be 75% complete when a new run is triggered, which would then update the runtime configuration of the pipeline whilst the prior run was still executing!

Addressing this consistency issue is of such importance that setting pipelines is on the roadmap to become a first-class construct in Concourse.

On-Demand Production Environments

Environments in any tier are defined via YAML. In fact, what makes each environment unique is a set of parameters to the pipeline itself - so things like name, region, cluster size, and so on.

Adding a new environment is as simple as adding a new YAML file. How so? We have a ‘meta-pipeline’ that watches this Git repository for changes, and when it detects a new file, sets a new Concourse pipeline using the parameters in that file! Once the pipeline is set, it will automatically keep itself up-to-date via self-setting.

Of course, the meta-pipeline also sets itself in order to ensure consistency with the latest pipeline definition in Git.
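
As a rough illustration of what the meta-pipeline does for each environment file, the Python sketch below builds one `fly set-pipeline` command per file, reusing a single pipeline definition and loading that environment's parameters with `-l`. The post doesn't say how the meta-pipeline actually invokes Concourse, so the target name `ci`, the `environments/` layout, the `pipeline.yml` filename and the choice to name each pipeline after its file are all assumptions.

```python
from pathlib import PurePosixPath


def set_pipeline_commands(env_files, target="ci", config="pipeline.yml"):
    """Build one `fly set-pipeline` command per environment parameter file."""
    commands = []
    for env_file in sorted(env_files):
        name = PurePosixPath(env_file).stem  # e.g. environments/dev.yml -> dev
        commands.append([
            "fly", "-t", target, "set-pipeline",
            "--non-interactive",
            "-p", name,      # one pipeline per environment
            "-c", config,    # the same definition for every environment
            "-l", env_file,  # environment-specific parameters
        ])
    return commands
```

The function only builds the commands; running them (for example with `subprocess.run`) is left out so that the sketch has no side effects. Every environment shares the same `-c` file, which is what keeps the tiers in parity.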

Bringing it Together

You can expand the image for a full view of this rough design diagram.

Unit 2 Games already had a capable infrastructure automation team, and EngineerBetter were able to accelerate their progress towards having fully-automated GitOps-style infrastructure-as-code on Google Cloud.

The patterns that we use have been tried and tested in banking, fintech, security, publishing and now gaming. Our technical approach transcends IaaSes, and once again mob-programming has proven a flexible method to both upskill and deliver.
