Terraform your EKS fleet - PART 1

Ever wondered how to provision and manage a fleet of EKS clusters without spending your days clicking through the AWS interface? You're in luck because it's exactly what I will show you in this multi-part tutorial. If you are ready, let's get started!

Part 1: Introduction to Infrastructure as Code
Part 2: First Terraform resources

Yann Irbah

March 30, 2022 · 9 min read


AWS · Terraform

Table of Contents

Disclaimer
A story about cats in the snow
- Snowflakes
- Pets VS Cattle
Infrastructure As Code
- Declarative infrastructure
- Terraform
- Terragrunt
Conclusion

As a Developer Advocate at Qovery, I end up talking daily with people looking to begin their Cloud journey. A lot of them are willing to try out AWS EKS and don’t know how to get started. Some can’t make sense of it, some others manage to get a working EKS cluster after many clicks in the AWS console, but soon more questions come to their minds:

How do I reproduce all those steps if I need another cluster?
What about updates? How do I ensure uniformity across my whole infrastructure?
What services should I install to my clusters?
Is there a way to automate the process of creating and bootstrapping a cluster?

All those questions are valid. We’ve all known that moment of excitement when hearing about Kubernetes for the first time: suddenly everything is possible and we feel powerful. And with managed services like EKS, our first cluster is only a few clicks away. Then the hard reality catches up with us: creating a Kubernetes cluster is only a small part of the problem.

Before even considering the difficulty of deploying applications with a CI/CD workflow, day-to-day infrastructure operations can quickly become a nightmare without the proper workflows and tools in place.

I’ve been there before, and through this multi-part article series I’m gonna share with you what I learned and the method I used to handle several EKS clusters, create and delete clusters on demand, manage updates to the whole infrastructure …

#Disclaimer

Keep in mind that there is no silver bullet and the ecosystem around IaC is moving fast. I’m only presenting a method that served me well when I was in charge of the infrastructure as a DevOps Engineer in a previous job. It doesn’t mean it is the best or the optimal one.

Also, being able to put your own IaC workflow in place doesn’t mean you should. Even if the tools we present here can make your life easier, it still takes a lot of effort to put them in place, maintain them, keep them up to date …

If your team doesn’t have a serious need to put its own IaC system in place and doesn’t have dedicated DevOps Engineers, maybe you should consider a fully managed solution like Qovery.

#Goal

In this series we will:

Start from the basics of IaC (Infrastructure as code) and Terraform.
See some tips and best practices I learned along the way.
Learn how to leverage a tool called Terragrunt to extend Terraform.
Set up several Git repositories to version our infrastructure code.
Use a continuous delivery tool called Flux to bootstrap our cluster with some base services.
Provision a fully working EKS cluster and deploy an application to it.

In the end, you will have a fully functional application running on an EKS cluster in your AWS account. You will be able to provision and delete clusters with a single command.

So if you’re ready, let’s get started.

#A story about cats in the snow

First I would like to quickly introduce you to two important notions, so you can understand the benefit of managing your infrastructure in a declarative way with IaC tools.

#Snowflakes

Think about snowflakes. I mean actual snowflakes, the ones falling from the sky and painting landscapes in white during winter in some regions of the world. Those you can aggregate and throw at your friends, slide on with various kinds of boards …

From a macro point of view, they all look the same. We all know the classic representation of a snowflake: ❄️.

But if you zoom in with a microscope, you’ll notice slight differences in shapes. Each one is unique in a way. Why should we care? Well, we shouldn’t unless we’re scientists studying snow formation.

But now think about your infrastructure. You configured your servers and clusters by hand, clicking on a Web UI, running commands through SSH. Best case you documented the procedure somewhere. Then you needed to create new servers or clusters, so you did it all over again.

Time passed. You fixed some server configurations, applied some updates, provisioned more servers …

Did you keep track of everything?
Can you guarantee all your servers are running with the exact same configuration?
Can you tell the exact configuration of all of your servers without any doubt?

If yes, congratulations but you wasted a lot of time. If not, you’re not alone. You’ve been caught by the reality of day-to-day operations, outages, emergencies …

Now chances are, all of your servers look the same, but they most likely slightly diverged in terms of configuration.

To quote Martin Fowler:

"The result is a unique snowflake - good for a ski resort, bad for a data center."

You can be sure this situation will lead to some head-banging, seemingly unsolvable bugs at one point or another.

I encourage you to check Martin Fowler’s article on the topic.

#Pets VS Cattle

TW: I apologize to my fellow vegan folks for this analogy, but keep in mind we’re ultimately talking about non-sentient machines (virtual most of the time).

I have 3 cats. I love them (when they’re not mewing in the middle of the night). They have names (China, Gruik, and Relou. Don’t judge).

They’re all very different and I can recognize them at first glance.

China is obese and requires a special diet.
Gruik is too skinny and needs to eat more than the others.

They all get specific care tailored to their needs. I guess anyone with pets can relate to that (if you don’t have any pets, use your imagination).

But as a geek, I also had non-biological pets in the past. As a student, I was the SysAdmin of my major’s lab. I almost get emotional when thinking about Yggdrasill, the master server, Odin, Thor, Loki, and all the other machines on the network …

I see you looking at me but know that it was a common naming scheme at the time, along with Greek gods and The Simpsons characters.

As with my cats, each machine got special treatment, and even though they downloaded their configuration from Yggdrasill at boot time, they would slightly differ in hardware and some other details.

Treating your machines like pets turns them into snowflakes.

Now think about the farming industry. I’m not talking about your small local organic farm but about big, industrial ones. Cows get an ear tag with a number instead of a name. They all look mostly the same. If one dies, it can be replaced and it makes no difference. They will all be replaced over time anyway. From the outside, nothing changes: the headcount stays the same and the herd looks similar.

All of this is for a good reason. It would be highly inefficient to treat each cow as a pet. Their goal is to provide food, not be a life companion. That’s what we call cattle.

Now back to our infrastructure. If we want to be efficient, we need to treat our machines like cattle rather than pets. If one gets sick (I mean it misbehaves due to configuration or hardware issues), we just terminate it and replace it with an identical new one. We can even replace all of them on a regular basis to make sure they always behave properly. Their role is to serve our applications and we only care about the big picture. Like numbers for cows, we will give them unique identifiers with no special meaning.

And more than anything, we avoid the snowflakes situation we mentioned earlier.

No wonder that one of the most popular companies in the Kubernetes ecosystem is called Rancher. Now you know why.

#Infrastructure As Code

Now that we agreed to treat our infrastructure like cattle, what is the most efficient way to do it?

#Declarative infrastructure

In the early days of infrastructure automation, system administrators used to write scripts to provision and configure machines. This might seem like a good idea, but it has major drawbacks:

You have no way to know if the script was already executed for a particular resource or not.
It is most likely not idempotent (unless you add complicated checks everywhere), which means executing the script several times for the same resources will lead to unexpected behavior.

This is what we call an imperative approach. We are describing the steps to execute to reach our end goal. Looking at the code doesn’t tell us anything about the current state of our infrastructure. We just know how it was or would be provisioned.

At the opposite end of the spectrum, we now have tools allowing us to take a declarative approach. Instead of stating “How”, we describe “What”.

Each tool has its specificities and ways to describe our infrastructure, but what they basically do is make sure we get what we asked for, in an idempotent way. We can run them as many times as we want and we will always get what we described (or sometimes a crash in the process).

The advantage is that, provided the provisioning process ran successfully, reading the code gives us a clear idea of the current state of our infrastructure. The infrastructure is described as code, hence the term “Infrastructure As Code”.

Of course, behind the scenes, they all end up executing imperative commands. But that’s transparent to you.
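
To make the “what” versus “how” distinction concrete, here is a minimal sketch written in Terraform's configuration language, the tool this series focuses on. The resource name `worker`, the count of 3, and the AMI ID are placeholders chosen for illustration, not values from a real setup.

```hcl
# Declarative sketch: we state WHAT we want (three identical instances),
# not HOW to create them. All values below are illustrative placeholders.
resource "aws_instance" "worker" {
  count         = 3                       # desired number of instances
  ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI ID (region-specific)
  instance_type = "t2.micro"

  tags = {
    # Anonymous, numbered identifiers rather than pet names
    Name = "worker-${count.index}"
  }
}
```

Running `terraform apply` a second time against this unchanged file should report that there is nothing to do, and changing `count` to 5 should only create the two missing instances. That is the idempotence described above.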

Another big advantage of using an IaC tool is the ability to version the infrastructure code and adopt the same kind of workflow as software development. Push your infrastructure code to Git and you can now have pull-requests, reviews, commits history, diffs, rollbacks …

A non-exhaustive list of IaC tools would include:

All those tools have their own use cases and describing all of them is out of the scope of this series. We will focus on Terraform.

#Terraform

First, let’s start by describing what Terraform is. I’ll quote the official documentation:

"HashiCorp Terraform is an infrastructure as code tool that lets you define both cloud and on-prem resources in human-readable configuration files that you can version, reuse, and share. You can then use a consistent workflow to provision and manage all of your infrastructure throughout its lifecycle. Terraform can manage low-level components like compute, storage, and networking resources, as well as high-level components like DNS entries and SaaS features."

We pick Terraform because we will be using AWS for our infrastructure. And when it comes to cloud resources, it really shines. It has providers for almost any cloud resource you can think of. And it’s not limited to computing resources. You can even handle GitHub repositories with it (and we will later in the series).
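
As a taste of how far providers go beyond compute resources, here is a rough sketch of a GitHub repository declared with the GitHub provider. The repository name and description are made up for illustration, and the sketch assumes the provider picks up its credentials from the `GITHUB_TOKEN` environment variable. Version constraints are left out to keep it short.

```hcl
terraform {
  required_providers {
    github = {
      source = "integrations/github"
    }
  }
}

# Assumption: authentication comes from the GITHUB_TOKEN environment variable,
# so nothing secret is written in the configuration.
provider "github" {}

# Hypothetical repository, named here only for the example
resource "github_repository" "infrastructure" {
  name        = "eks-fleet-infrastructure"
  description = "Terraform code for our EKS clusters"
  visibility  = "private"
}
```

The point is less the repository itself than the fact that it is described with the same language and workflow as any cloud resource.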

It is the most popular IaC tool in the Cloud Native world at the time of this writing, even with Pulumi gaining more and more traction.

Terraform uses a configuration language created by HashiCorp called HCL (HashiCorp Configuration Language). It is quite straightforward and easy to learn. You won’t code applications with it because it was specifically designed for configuration purposes.

As an example, provisioning an AWS EC2 instance would look like this:

resource "aws_instance" "example" {
  # ID of the machine image to boot (AMI IDs are region-specific)
  ami           = "ami-0c55b159cbfafe1f0"
  # Size of the instance
  instance_type = "t2.micro"
}

Not too scary, huh?

If you’re eager to learn more about Terraform, you can check the official tutorials, but we’ll go through every step in the following articles.
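
On its own, a resource block is not quite enough to run anything: Terraform also needs to know which provider to download and how to configure it. The following is a minimal sketch of a complete file, under a few assumptions: the region `us-east-2` is my guess for where the example AMI lives (swap in a region and AMI that exist in your account), and credentials are expected to come from your usual AWS environment variables or profile.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  # Assumption: pick the region where your AMI actually exists
  region = "us-east-2"
}

resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI ID
  instance_type = "t2.micro"
}

# Typical workflow, run from the directory containing this file:
#   terraform init     # download the AWS provider
#   terraform plan     # preview what would be created
#   terraform apply    # create the instance
#   terraform destroy  # delete it again
```

Keep in mind that applying this creates a real instance in your AWS account, so remember to destroy it once you are done experimenting.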

#Terragrunt

Terragrunt is a tool created by Gruntwork. It is in fact a wrapper around Terraform, extending it with useful functionalities helping us to follow best practices and make our code DRY. Maybe future versions of Terraform will include these features, but for now, it’s a very useful tool.

I’ll give more details about Terragrunt when we get to the point of needing it.
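
In the meantime, here is a rough idea of what “DRY” means in practice. This is only a sketch of a per-cluster `terragrunt.hcl` file, not the layout we will end up with: the `root.hcl` file name, the module path, and the input names are all hypothetical.

```hcl
# Hypothetical file: clusters/staging/terragrunt.hcl

# Pull in settings shared by every cluster (for example the remote state
# configuration) from a root.hcl file found in a parent folder.
include "root" {
  path = find_in_parent_folders("root.hcl")
}

# Point to the Terraform module to run; the path is made up for the example.
terraform {
  source = "../../modules/eks-cluster"
}

# Only the values that differ from one cluster to another live here.
inputs = {
  cluster_name = "staging"
  node_count   = 3
}
```

With this kind of layout, adding a cluster would mostly mean copying a small file like this one and changing its inputs, instead of duplicating the Terraform code itself.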

#Conclusion

Now you hopefully have an understanding of what Infrastructure as Code and Terraform are, and how they can help us manage our fleet of EKS clusters. In the rest of this series, we will get our hands dirty and start terraforming our infrastructure.

Stay tuned for Part 2. We will use Terraform and Terragrunt to provision our first EKS cluster.