“GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD.”
That is how GitHub describes its built-in CI/CD tooling. I must say that I really like it, and it is something GitHub has been missing. Many of the Git-as-a-Service providers, GitLab, Bitbucket, Azure DevOps, have had bundled CI/CD tooling. With GitHub you always needed to use an outside tool.
This is part two in the series on how to create and set up your own self-hosted runner in AWS.
All code used can be found in my GitHub repo
Part one, short recap
In part one I showed how to run self-hosted runners using Fargate; the conclusion was that it wasn't a good match. If you missed it, you can find it here
Part two, EC2
Instead, in this part I will show how to use EC2 and an Auto Scaling Group to run and host the runners. As always, we need to start by setting up a VPC. I run everything in a simple VPC with two public subnets. The CloudFormation template used is located on GitHub
Automatically add and register a runner
Create an Auto Scaling Group
We start by creating an Auto Scaling Group that can add and remove instances as we see fit.
We set the desired capacity to one instance, with the possibility to scale up to five instances.
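As a rough sketch, the Auto Scaling Group resource could look like this in CloudFormation (the resource and subnet names here are my assumptions, not necessarily the exact ones used in the repo):

```yaml
GitHubRunnerAsg:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "1"
    DesiredCapacity: "1"
    MaxSize: "5"
    # The two public subnets from the VPC template in part one.
    VPCZoneIdentifier:
      - !Ref PublicSubnetOne
      - !Ref PublicSubnetTwo
    LaunchTemplate:
      LaunchTemplateId: !Ref RunnerLaunchTemplate
      Version: !GetAtt RunnerLaunchTemplate.LatestVersionNumber
```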
Create a Launch Template
For an EC2 instance to install and set up everything when it is started from an Auto Scaling Group, we create a Launch Template. We use the User Data part of the Launch Template to install and register the runner. This way, every time a new instance is created by the AWS Auto Scaling Group (ASG), the instance will register with GitHub.
Once again we need to set RUNNER_ALLOW_RUNASROOT to true, since the User Data script runs as root. When the instance has started and registered with GitHub, it's ready to start serving build jobs.
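A sketch of what the Launch Template with its User Data could look like. The parameter names (GitHubPat, GitHubOwner, GitHubRepo, LatestAmiId), the instance type, and the pinned runner version are my assumptions for illustration; the registration-token endpoint and config.sh/svc.sh commands are the real GitHub runner ones:

```yaml
RunnerLaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateData:
      ImageId: !Ref LatestAmiId
      InstanceType: t3.micro
      IamInstanceProfile:
        Arn: !GetAtt RunnerInstanceProfile.Arn
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          yum install -y jq
          # Download and unpack the runner (version pinned as an example).
          mkdir /actions-runner && cd /actions-runner
          curl -O -L https://github.com/actions/runner/releases/download/v2.267.1/actions-runner-linux-x64-2.267.1.tar.gz
          tar xzf actions-runner-linux-x64-2.267.1.tar.gz
          # User Data runs as root, so allow that explicitly.
          export RUNNER_ALLOW_RUNASROOT=true
          # Fetch a registration token from the GitHub API with a PAT.
          TOKEN=$(curl -s -X POST -H "Authorization: token ${GitHubPat}" \
            https://api.github.com/repos/${GitHubOwner}/${GitHubRepo}/actions/runners/registration-token \
            | jq -r .token)
          ./config.sh --url https://github.com/${GitHubOwner}/${GitHubRepo} --token $TOKEN --unattended
          # Give ec2-user access so the remove script can later run un-elevated.
          chown -R ec2-user:ec2-user /actions-runner
          ./svc.sh install
          ./svc.sh start
```

How you store the PAT is up to you; passing it as a template parameter as shown here is the simplest option, but SSM Parameter Store is a better fit for anything real.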
Automatically remove a runner
Automatically registering and starting to serve jobs is just one part of the chain. If an instance is removed by the Auto Scaling Group, we also want it to remove itself from the pool of runners in GitHub. To do that we tie a couple of features and services together.
Lifecycle Hooks
To get notified when an instance is being removed, and to be able to pause the termination process, we use lifecycle hooks in the Auto Scaling Group. This way the instance goes into a wait state, giving us the possibility to run scripts that remove it as a runner before it terminates.
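The hook itself is a small CloudFormation resource; the name and the timeout value are my choices for the sketch:

```yaml
TerminationLifecycleHook:
  Type: AWS::AutoScaling::LifecycleHook
  Properties:
    AutoScalingGroupName: !Ref GitHubRunnerAsg
    LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING
    # Give the deregistration script up to five minutes; if it never
    # completes, terminate the instance anyway.
    HeartbeatTimeout: 300
    DefaultResult: CONTINUE
```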
AWS Systems Manager
When an instance enters the Terminating:Wait state we want to run a script on the instance. To do that we use AWS Systems Manager (SSM) Documents, and then send a command to the SSM Agent to run the Document.
The Document runs a shell script that stops and uninstalls the GitHub runner service, fetches a remove token, and removes the runner from GitHub.
The Document runs as root, and since the remove command ignores the RUNNER_ALLOW_RUNASROOT flag, we make sure to run it as the ec2-user instead. This is really important: removal will not work otherwise. That is why we give ec2-user full access to the runner's folder when we install and start the service in the Launch Template.
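The Document could be sketched like this (again, resource and parameter names are my assumptions; the remove-token endpoint and `config.sh remove` are the real GitHub runner mechanisms, and note the `sudo -u ec2-user` on the last step):

```yaml
RemoveRunnerDocument:
  Type: AWS::SSM::Document
  Properties:
    DocumentType: Command
    Content:
      schemaVersion: "2.2"
      description: Deregister the GitHub Actions runner before termination.
      mainSteps:
        - action: aws:runShellScript
          name: removeRunner
          inputs:
            runCommand:
              - cd /actions-runner
              # Stop and uninstall the runner service (runs as root).
              - ./svc.sh stop
              - ./svc.sh uninstall
              # Fetch a remove token from the GitHub API.
              - !Sub >-
                  TOKEN=$(curl -s -X POST -H "Authorization: token ${GitHubPat}"
                  https://api.github.com/repos/${GitHubOwner}/${GitHubRepo}/actions/runners/remove-token
                  | jq -r .token)
              # Removal must NOT run as root, so drop to ec2-user.
              - sudo -u ec2-user ./config.sh remove --token $TOKEN
```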
To get notified when a lifecycle event happens it is possible to use SNS. However, I decided to use EventBridge instead, primarily because EventBridge supports more targets.
So I set up an Events::Rule to detect the state change and trigger a Lambda function that runs the SSM Document.
When EventBridge detects the lifecycle hook event, the Lambda function is triggered. The Lambda function calls the SSM RunCommand API to run the SSM Document. As discussed earlier, the Document runs a script that calls GitHub and removes the runner.
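The rule matches the terminate lifecycle action event for our ASG; the Lambda resource name is an assumption:

```yaml
LifecycleEventRule:
  Type: AWS::Events::Rule
  Properties:
    EventPattern:
      source:
        - aws.autoscaling
      detail-type:
        - EC2 Instance-terminate Lifecycle Action
      detail:
        AutoScalingGroupName:
          - !Ref GitHubRunnerAsg
    Targets:
      - Arn: !GetAtt RemoveRunnerFunction.Arn
        Id: remove-runner-lambda
```

You also need an AWS::Lambda::Permission that allows events.amazonaws.com to invoke the function, which I have left out here.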
When the Document is done running we need to notify the ASG to continue terminating the instance.
So the Lambda function first calls the SSM RunCommand API, and then calls the ASG API to continue the termination.
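A minimal sketch of what that Lambda function could look like in Python with boto3. The Document name is an assumption; the event fields are the ones EventBridge delivers for an EC2 Instance-terminate Lifecycle Action:

```python
def extract_lifecycle_action(event):
    """Pull the fields we need out of the EventBridge lifecycle event."""
    detail = event["detail"]
    return {
        "instance_id": detail["EC2InstanceId"],
        "asg_name": detail["AutoScalingGroupName"],
        "hook_name": detail["LifecycleHookName"],
        "action_token": detail["LifecycleActionToken"],
    }


def handler(event, context):
    # boto3 is available in the Lambda runtime; import it lazily so the
    # pure helper above stays testable without AWS dependencies.
    import time
    import boto3

    action = extract_lifecycle_action(event)
    ssm = boto3.client("ssm")
    asg = boto3.client("autoscaling")

    # Run the remove-runner Document on the terminating instance.
    command = ssm.send_command(
        InstanceIds=[action["instance_id"]],
        DocumentName="remove-github-runner",  # assumed Document name
    )
    command_id = command["Command"]["CommandId"]

    # Poll until the Document has finished running. The short initial
    # sleep avoids asking for the invocation before it exists.
    time.sleep(2)
    while True:
        invocation = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=action["instance_id"]
        )
        if invocation["Status"] not in ("Pending", "InProgress"):
            break
        time.sleep(5)

    # Let the ASG resume the termination.
    asg.complete_lifecycle_action(
        LifecycleHookName=action["hook_name"],
        AutoScalingGroupName=action["asg_name"],
        LifecycleActionToken=action["action_token"],
        LifecycleActionResult="CONTINUE",
    )
```

Even if the Document fails we still complete the lifecycle action, so the instance is not left hanging in the wait state; the lifecycle hook's heartbeat timeout acts as a backstop on top of that.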
Running the GitHub self-hosted runners on EC2 instances in an Auto Scaling Group, with lifecycle hooks for removal, works really well. We can now add and remove instances in the ASG and have them register and remove themselves.
But I'm still not happy. What if we get a spike in the number of jobs and the queue grows? Even though we run in an ASG, we still don't auto scale.
Time to throw auto scaling into the pot… Stay tuned for part 3.
All code in this blog series can be found on GitHub