GitHub Self hosted runners in AWS - part 2 - EC2

2020-06-29

"GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD."

That is how GitHub describes its built-in CI/CD tooling. I must say that I really like it, and it is something GitHub had been missing. Many of the Git-as-a-Service providers, such as GitLab, Bitbucket, and Azure DevOps, have had bundled CI/CD tooling for a long time. With GitHub you always needed to use an outside tool.

This is part two in the series on how to create and set up your own self-hosted runner in AWS.

All code used can be found in my GitHub repo

Part one, short recap

In part one I showed how to run self-hosted runners using Fargate. The conclusion was that it wasn't a good match. If you missed it, you can find it here

Part two, EC2

Instead, in this part I will show how to use EC2 and Auto Scaling Groups to run and host the runners. As always we need to start by setting up a VPC. I run everything in a simple VPC with two public subnets. The CloudFormation template used is located on GitHub

Automatically add and register a runner

Create an Auto Scaling Group

We start by creating an Auto Scaling Group that can add and remove instances as we see fit.

  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: github-runners-asg
      Cooldown: 300
      DesiredCapacity: 1
      MaxSize: 5
      MinSize: 0
      HealthCheckGracePeriod: 300
      HealthCheckType: EC2
      LaunchConfigurationName: !Ref LaunchConfiguration
      VPCZoneIdentifier:
        - Fn::ImportValue: !Sub ${VpcStackName}:PublicSubnet1
        - Fn::ImportValue: !Sub ${VpcStackName}:PublicSubnet2

We set the desired capacity to one instance, with the possibility to scale up to five instances.
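The capacity can also be changed later without updating the stack. The following is a minimal sketch, assuming boto3 is available and reusing the group name and size limits from the template above. The helper names `clamp_capacity` and `scale_runners` are hypothetical, and the client is passed in so the function can be tried out without AWS access.

```python
def clamp_capacity(desired, min_size=0, max_size=5):
    """Keep the requested capacity inside the group's MinSize/MaxSize."""
    return max(min_size, min(max_size, desired))


def scale_runners(asg_client, desired, group_name="github-runners-asg"):
    """Ask the Auto Scaling Group for `desired` instances.

    `asg_client` is expected to be boto3.client("autoscaling").
    Returns the capacity that was actually requested.
    """
    capacity = clamp_capacity(desired)
    asg_client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=capacity,
        HonorCooldown=False,  # apply right away instead of waiting for the cooldown
    )
    return capacity
```

With a real client, `scale_runners(boto3.client("autoscaling"), 3)` would request three runner instances.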

Create a Launch Configuration

For an EC2 instance to install and set up everything when started from an Auto Scaling Group, we create a Launch Configuration. We use the User Data part of the Launch Configuration to install and register the runner. This way, every time a new instance is created by the Auto Scaling Group (ASG), the instance will register with GitHub.

  LaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: !Ref EC2ImageId
      InstanceType: t3.micro
      IamInstanceProfile: !GetAtt EC2InstanceProfile.Arn
      KeyName: !Ref SSHKeyName
      SecurityGroups:
        - !Ref SecurityGroup
      UserData:
        Fn::Base64:
          Fn::Sub: |
            #!/bin/bash -xe
            yum update -y
            yum install docker -y
            yum install git -y
            yum install jq -y
            sudo usermod -a -G docker ec2-user
            sudo systemctl start docker
            sudo systemctl enable docker
            export RUNNER_ALLOW_RUNASROOT=true
            mkdir actions-runner
            cd actions-runner
            curl -O -L https://github.com/actions/runner/releases/download/v2.262.1/actions-runner-linux-x64-2.262.1.tar.gz
            tar xzf ./actions-runner-linux-x64-2.262.1.tar.gz
            PAT=<Super Secret PAT>
            token=$(curl -s -XPOST \
                -H "authorization: token $PAT" \
                https://api.github.com/repos/<GitHub_User>/<GitHub_Repo>/actions/runners/registration-token |\
                jq -r .token)
            sudo chown ec2-user -R /actions-runner
            ./config.sh --url https://github.com/<GitHub_User>/<GitHub_Repo> --token $token --name "my-runner-$(hostname)" --work _work
            sudo ./svc.sh install
            sudo ./svc.sh start
            sudo chown ec2-user -R /actions-runner

Once again we need to set RUNNER_ALLOW_RUNASROOT to true, since the User Data script is run as root. When the instance has started and registered with GitHub, it's ready to start serving build jobs.
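The token request made with curl in the User Data script can also be written in Python, which makes it easier to see what is being sent. This is only a sketch using the standard library. The function names `build_token_request` and `fetch_runner_token` are hypothetical, and the owner, repository and PAT are placeholders you would supply yourself.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_token_request(owner, repo, pat, kind="registration-token"):
    """Build the POST request for a runner registration or remove token."""
    if kind not in ("registration-token", "remove-token"):
        raise ValueError("kind must be 'registration-token' or 'remove-token'")
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/runners/{kind}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"token {pat}"},
    )


def fetch_runner_token(owner, repo, pat, kind="registration-token"):
    """Call GitHub and return the short-lived token from the JSON response."""
    request = build_token_request(owner, repo, pat, kind)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["token"]
```

Passing `kind="remove-token"` targets the endpoint that is needed when a runner is taken out of the pool again.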

Automatically remove a runner

Automatically registering and starting to serve jobs is just one part of the chain. If an instance is removed by the Auto Scaling Group, we also want it to remove itself from the pool of runners in GitHub. To do that we tie a couple of features and services together.

Lifecycle Hooks

To get notified when an instance is about to be removed, and to be able to pause the termination process, we use Lifecycle Hooks in the Auto Scaling Group. This way the instance is held in the Terminating:Wait state, giving us the possibility to run scripts that remove it as a runner before it terminates.

  TerminateLifecycleHook:
    Type: AWS::AutoScaling::LifecycleHook
    Properties:
      AutoScalingGroupName: !Ref AutoScalingGroup
      LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING

AWS Systems Manager

When an instance enters the Terminating:Wait state we want to run a script on the instance. To do that we use an AWS Systems Manager Document and then send a command to the SSM Agent to run the Document.

  RemoveDocument:
    Type: AWS::SSM::Document
    Properties:
      DocumentType: Command
      Tags:
        - Key: Name
          Value: github-actions-install-register-runner
      Content:
        schemaVersion: "2.2"
        description: Command Document de-register GitHub Actions Runner
        mainSteps:
          - action: "aws:runShellScript"
            name: "deregister"
            inputs:
              runCommand:
                - "cd /actions-runner"
                - "sudo ./svc.sh stop"
                - "sudo ./svc.sh uninstall"
                - "PAT=<Super Secret PAT>"
                - 'token=$(curl -s -XPOST -H "authorization: token $PAT" https://api.github.com/repos/<GitHub_User>/<GitHub_Repo>/actions/runners/remove-token | jq -r .token)'
                - 'su ec2-user -c "./config.sh remove --token $token"'

The Document runs a shell script. We stop and uninstall the GitHub runner service, fetch a remove token, and remove the runner from GitHub.
The Document is run as root, and since the remove command ignores the RUNNER_ALLOW_RUNASROOT flag, we make sure to run it as ec2-user instead. This is really important: remove will not work otherwise. That is why we give ec2-user full access to the runner folder when we install and start the service in the Launch Configuration.

EventBridge

To get notified when a Lifecycle event happens it is possible to use SNS. However, I decided to use EventBridge instead, primarily because EventBridge supports more targets.
So I set up an Events::Rule to detect the change and trigger a Lambda function that will run the SSM Document.

  TerminatingRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern: !Sub |
        {
          "source": [
            "aws.autoscaling"
          ],
          "detail-type": [
            "EC2 Instance-terminate Lifecycle Action"
          ]
        }
      Targets:
        - Arn: !GetAtt LifeCycleHookTerminatingFunction.Arn
          Id: target
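To make the rule a bit more concrete, here is a small sketch of what such an event looks like and how the pattern selects it. The `matches` function is a heavily simplified stand-in for EventBridge matching that only handles top-level fields, and the hook name and instance id in the sample are made-up values.

```python
# Trimmed-down lifecycle event; only the fields used in this post are included.
SAMPLE_EVENT = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "LifecycleHookName": "example-hook",      # hypothetical value
        "AutoScalingGroupName": "github-runners-asg",
        "EC2InstanceId": "i-0123456789abcdef0",   # hypothetical value
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    },
}

# Same pattern as in the Events::Rule.
PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
}


def matches(pattern, event):
    """Simplified matching: each pattern key must exist in the event and the
    event's value must be one of the values listed in the pattern."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())
```

The `detail` part of the event is what the Lambda function receives and uses to find the instance, the hook and the Auto Scaling Group.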

Lambda

When the Lifecycle hook event is detected by EventBridge, a Lambda function is triggered. The Lambda function calls the SSM Run Command API (SendCommand) to run the SSM Document. As discussed earlier, the Document runs a script that calls GitHub and removes the runner.
When the Document is done running we need to notify the ASG to continue terminating the instance.

  LifeCycleHookTerminatingFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: github-runners-asg-lifecycle-hook-terminate
      Runtime: python3.6
      MemorySize: 256
      Timeout: 30
      CodeUri: ./lambdas
      Handler: terminate.handler
      Role: !GetAtt LifeCycleHookTerminatingFunctionRole.Arn
      Environment:
        Variables:
          SSM_DOCUMENT_NAME: !Ref RemoveDocument

As said, we need to call the SSM Run Command API and then call the ASG API to continue the termination.

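import logging
import os

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# These constants were not shown in the original excerpt. The event keys follow
# the "detail" section of the lifecycle event, and the environment variable
# name matches the one set on the function in the template.
LIFECYCLE_KEY = 'LifecycleHookName'
ASG_KEY = 'AutoScalingGroupName'
EC2_KEY = 'EC2InstanceId'
SSM_DOCUMENT_KEY = 'SSM_DOCUMENT_NAME'


def handler(event, context):
    message = event['detail']
    if LIFECYCLE_KEY in message and ASG_KEY in message:
        life_cycle_hook = message[LIFECYCLE_KEY]
        auto_scaling_group = message[ASG_KEY]
        instance_id = message[EC2_KEY]
        ssm_document = os.environ[SSM_DOCUMENT_KEY]
        # Ask SSM to run the de-register Document on the instance.
        success = run_ssm_command(ssm_document, instance_id)
        result = 'CONTINUE'
        if not success:
            result = 'ABANDON'
        # Tell the ASG that it can go on terminating the instance.
        notify_lifecycle(life_cycle_hook, auto_scaling_group,
                         instance_id, result)
    return {}


def run_ssm_command(ssm_document, instance_id):
    ssm_client = boto3.client('ssm')
    try:
        instances = [str(instance_id)]
        ssm_client.send_command(DocumentName=ssm_document,
                                InstanceIds=instances,
                                Comment='Remove GitHub Runner',
                                TimeoutSeconds=1200)
    except Exception as e:
        logger.error("SSM command could not be sent: %s", str(e))
        return False
    return True


def notify_lifecycle(life_cycle_hook, auto_scaling_group, instance_id, result):
    asg_client = boto3.client('autoscaling')
    try:
        asg_client.complete_lifecycle_action(
            LifecycleHookName=life_cycle_hook,
            AutoScalingGroupName=auto_scaling_group,
            LifecycleActionResult=result,
            InstanceId=instance_id
        )
    except Exception as e:
        logger.error(
            "Lifecycle hook notification could not be executed: %s", str(e))
        raise e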
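One thing to be aware of is that send_command only confirms that SSM accepted the command, so the handler shown here completes the lifecycle action without knowing whether the script has finished. A possible way to wait is to poll the command status first. This is a sketch under assumptions, not part of the code in the repo: `wait_for_command` is a hypothetical helper, the client is passed in, and the command id would have to be taken from the send_command response (`response['Command']['CommandId']`).

```python
import time

# Statuses after which SSM will not change the invocation any more.
FINISHED = {"Success", "Cancelled", "TimedOut", "Failed"}


def wait_for_command(ssm_client, command_id, instance_id,
                     attempts=10, delay=2.0, sleep=time.sleep):
    """Poll SSM until the command reaches a final status or attempts run out.

    `ssm_client` is expected to be boto3.client("ssm").
    Returns True only when the final status is "Success".
    """
    for _ in range(attempts):
        sleep(delay)
        try:
            invocation = ssm_client.get_command_invocation(
                CommandId=command_id, InstanceId=instance_id)
        except Exception:
            # Simplification: the invocation may not be visible right after
            # send_command, so any error is treated as "not ready yet".
            continue
        if invocation["Status"] in FINISHED:
            return invocation["Status"] == "Success"
    return False
```

With the defaults the function waits at most about 20 seconds, which stays below the 30 second Lambda timeout set in the template. A longer wait would require raising that timeout as well.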

Conclusion

Running the GitHub self-hosted runners on EC2 instances in an Auto Scaling Group with Lifecycle Hooks for removal works really well. We can now add and remove instances in the ASG and have them register and remove themselves.
But I'm still not happy. What if we get a spike in the number of jobs and the queue grows? Even though we run in an ASG, we still don't auto scale.

Time to throw auto scaling into the pot... Stay tuned for part 3.

Code

All code in this blog series can be found on GitHub