Terraform vs Ansible on GCP

Hey, I am not dead yet! Back from the Cloud ...

We're going through a big change: we are moving to the Cloud, so there has been a long quiet time while I stepped into this whole new world! As you can see, all my knowledge so far has been about on-premise setups. Now I'm moving with the trend, yahooo!

To tell the truth, there was a month-long blackout while I tried to set up a Kubernetes cluster in-house. You can imagine how much it hurts to run and bang your head against the wall :D. I tried kubeadm and then rke from Rancher, but it was not as stable as I expected, so it was dropped! That's why you don't get that funny story here!

So I moved directly to GCP and started my journey into IaaS and IaC. At first I wanted to begin with Ansible, but then I read through some comparisons such as this great article and that one, and decided to combine the two tools, since we should differentiate them:
  • Terraform: provisioning tool
  • Ansible: configuration tool
Although each tool overlaps a bit with the other, you should use each one where it is most effective. So we use Terraform to provision the infrastructure resources on GCP (Google Cloud Platform); and since we already have all the Ansible roles defined for our on-prem solution, we will use Terraform's 'local-exec' provisioner to trigger the Ansible playbook that configures the new instances.

There is another solution, 'early binding', which builds a custom instance image with Packer and then uses Terraform to provision instances from that custom image. We would then need another mechanism to trigger an image refresh whenever our infrastructure scripts change. I will check that solution and update in detail later.
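
For reference, here is a minimal sketch of what that early-binding flow could look like, assuming Packer's HCL2 template format with the Google Compute builder and the Ansible provisioner; the project, zone, image names and playbook path are placeholders, not something I have tested yet.

// packer/jenkins_image.pkr.hcl - bake the Jenkins configuration into a custom image
locals {
    // strip separators so the timestamp can be used in an image name
    timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "googlecompute" "jenkins_base" {
    project_id          = "<my_project>"
    source_image_family = "ubuntu-1804-lts"
    zone                = "us-west1-a"
    image_name          = "jenkins-base-${local.timestamp}"
    image_family        = "jenkins-base"
    ssh_username        = "packer"
}

build {
    sources = ["source.googlecompute.jenkins_base"]

    // run the existing playbook at build time instead of at boot time
    provisioner "ansible" {
        playbook_file = "./ansible/jenkins_pb.yml"
    }
}

Terraform would then pick up the latest baked image with a 'google_compute_image' data source filtering on the image family, similar to the lookup used later in this post.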

Alright, let's dig into detail!
GCP Environment
All our infrastructure will be based on GCP, so you have to understand some basic terms of this cloud provider. You can create an account with $300 of free credit. Everything is organized under a Project, like a Solution in your IDE :). Let's start with the access credentials: since we will use scripted access, the best practice is to use a service account. When you create a service account, you assign the appropriate roles to this IAM member. Let's jump a little bit ahead before coming back to this service account creation.

From my point of view (actually I learned it from the internet :P), every cloud provider offers three (3) main services: network, compute and storage. The other fancy services are wrappers with more features and functions that move closer to the PaaS layer.

Why is network mentioned first? If you come from the on-prem world, you set up compute, then storage, and the network is the last thing you establish. However, with a cloud provider you have to plan how your solution is split up from an early stage, and how those pieces link to each other, just like the way you design a microservices application/solution; that is the 'network' of your resource topology. So when you look at the fundamentals of any cloud service, they all have a solid network offering. Then compute is where virtualization becomes king, and storage is the other key piece of the modern IT world, where everything is a data solution.

Alright, enough of the fancy talk. When you create a service account, make sure you add the Compute Admin role, an appropriate storage role and (this is most important) the Service Account User role. For an easier path you can ignore all that and use the Google defaults, which add the most common roles to your default service account. As the last step, create the key file; you can pick whichever format you're familiar with, but I'd suggest JSON, which we might reuse with a modern application. Save the JSON key of your service account for later use. That is your initial action from the GCP web console. To use the SDK from your local machine, install the gcloud SDK with this guide. One note: when you create a service account, Google creates an IAM member for it, but when you delete the service account it does not delete the associated IAM member (or at least it had not yet in my test). So if you create a service account with the same name, it will reuse the previous IAM member and you will not see the changes from your new account. This is quite tricky; you have to delete the IAM member manually if you want a fresh new service account with the same name.
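
As a rough sketch, creating such a service account from the command line could look like the commands below; the account name 'terraform-sa' is only an example, and you would add a storage role the same way as the other roles.

# create the service account (the name is only an example)
gcloud iam service-accounts create terraform-sa --display-name="Terraform automation"

# grant the roles discussed above on the project
gcloud projects add-iam-policy-binding <my_project> --member="serviceAccount:terraform-sa@<my_project>.iam.gserviceaccount.com" --role="roles/compute.admin"
gcloud projects add-iam-policy-binding <my_project> --member="serviceAccount:terraform-sa@<my_project>.iam.gserviceaccount.com" --role="roles/iam.serviceAccountUser"

# export the JSON key that Terraform and gcloud will use later
gcloud iam service-accounts keys create key.json --iam-account=terraform-sa@<my_project>.iam.gserviceaccount.com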

Now you can start using the gcloud SDK from the command line, beginning with

gcloud init

This command helps set up the connection to your GCP account. You can switch your credentials to the service account with this command:

gcloud auth activate-service-account --key-file=<path/to/json_key>

This article will not go into the details of the gcloud SDK, so I will only give you a sample command that creates a compute instance:

gcloud compute instances create 'my-test-vm' --machine-type='f1-micro' --image-project='ubuntu-os-cloud' --image='ubuntu-1804-bionic-v20190628'

The command above creates 'my-test-vm' with the smallest configuration (1 shared vCPU, 614 MB RAM) running Ubuntu 18.04. I strongly suggest you get comfortable with the gcloud SDK, because the other tools just wrap these commands, so knowing them gives you more insight when troubleshooting.
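
To double-check the result (and avoid paying for a forgotten test VM), a couple of gcloud commands are enough; the zone below is just an assumption, use whichever default you picked during gcloud init.

# list the project's instances and inspect the new VM
gcloud compute instances list
gcloud compute instances describe my-test-vm --zone=us-west1-a

# delete the test VM when you are done
gcloud compute instances delete my-test-vm --zone=us-west1-a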

Terraform Provisioning

Coming to the main part. You can practice with this small getting-started guide. I'm not sure why HashiCorp has its own learning track for AWS but not for GCP, so you have to dig around for samples or rely on their impressive documentation of the GCP provider. As mentioned above, I will use Terraform to provision the GCP resources and then execute the Ansible playbook on the newly created instance. There is also a choice between 'local-exec', which runs on the control machine executing the Terraform script, and 'remote-exec', which would run the Ansible script on the new instance. The remote option requires quite a few extra steps, such as:

  • Set up Ansible on the remote instance: use a startup script
  • Copy the Ansible scripts to the remote instance: use the file provisioner
Unfortunately, those tasks are executed asynchronously on the remote instance, so you would have to add further tasks just to synchronize those actions. So I decided to run Ansible from the local control machine and copy the control user's ssh public key to the instance at creation time (via the instance metadata), so that an ssh connection to the remote instance can be opened with this keypair.

The 1st sample for this session is to:
  • Create a compute instance
  • Create a Filestore instance and mount this storage on the compute instance as NFS (do not confuse FiLestore with FiRestore)
  • Set up compute firewall rules for the instance above
  • Set up a Jenkins master on this instance with JENKINS_HOME on the NFS share
The project folder structure is as below:
    /project
          |__ main.tf
          |__ ansible/
                     |__ jenkins_pb.yml


Here is a runnable sample:

// Configure the Google Cloud provider
provider "google" {
    credentials = "${file("<path/service_account/key.json>")}"
    project     = "<my_project>"
    region      = "us-west1"
}


resource "google_filestore_instance" "shared_nfs" {
    name = "shared-nfs"
    zone = "us-west1-a"
    tier = "STANDARD"

    file_shares {
        capacity_gb = 1024
        name        = "shared"
    }

    networks {
        network = "default"
        modes   = ["MODE_IPV4"]
    }
}


// Create a single Compute Engine instance
data "google_compute_image" "shield_ubuntu18" {
    family = "ubuntu-1804-lts"
    project = "gce-uefi-images"
}

resource "google_compute_instance" "jenkins_master" {
    name = "jenkins-master"
    machine_type = "n1-standard-4"
    zone = "us-west1-a"

    boot_disk {
        initialize_params {
            image = "${data.google_compute_image.shield_ubuntu18.self_link}"
        }
    }
    network_interface {
        network = "default"
        access_config {
            // Include this section to give the VM an external IP address
        }
    }
    metadata = {
        ssh-keys = "<my_user>:${file("<path/my_user_public_key>")}"
        startup-script = <<-SCRIPT
            apt-get -y update
            apt-get install -yq apt-transport-https nfs-common
            # the mount point must exist before mounting the Filestore share
            mkdir -p /mnt/tools
            mount ${google_filestore_instance.shared_nfs.networks.0.ip_addresses.0}:/shared /mnt/tools
            mkdir -p /mnt/tools/jenkins_home
        SCRIPT
    }

    connection {
        host        = "${self.network_interface.0.access_config.0.nat_ip}"
        type        = "ssh"
        user        = "<my_user>"
        private_key = "${file("<path/my_user_private_key>")}"
        agent       = "false"
    }
    provisioner "remote-exec" {
        inline = [
            "hostname"
        ]
    }
    provisioner "local-exec" {
        command = <<EOT
            sleep 120
            cd /project/ansible
            export ANSIBLE_HOST_KEY_CHECKING=False
            ansible-playbook -u <my_user> --private-key <path/my_user_private_key> -i ${self.network_interface.0.access_config.0.nat_ip}, jenkins_pb.yml --extra-var jenkins_home=/mnt/tools/jenkins_home
        EOT
    }
    //service_account {
    //    scopes = ["https://www.googleapis.com/auth/devstorage.read_write"]
    //}
    
}
resource "google_compute_firewall" "allow_http" {
    name    = "jenkins-firewall"
    network = "default"

    allow {
        protocol = "tcp"
        ports    = ["80", "8080"]
    }
}

output "ip" {
    value = "${google_compute_instance.jenkins_master.network_interface.0.access_config.0.nat_ip}"

}

You may have noticed why I make a call to remote-exec: in order to verify the ssh channel between my control machine and the new instance, I reuse Terraform's error handling to wait until the environment is ready before executing my Ansible script against the remote host. Another note: the connection block is defined at instance scope; however, when the local Ansible script runs, it uses the user you specify in the ansible-playbook command. I create the firewall rule for ports 80 and 8080 on the 'default' network; there will be another sample where we create a custom network.

So you run these commands to set up the environment:

terraform init
terraform plan
echo yes | terraform apply

To tear down this deployment, simply call:

echo yes | terraform destroy

Ansible Configuration

We can re-use the existing Ansible roles here, so I will not write much; you can refer to my old posts here and here. There are some notes on using Ansible (a minimal playbook sketch follows this list):

  • I copied the code from here
  • Why do I wait 120 seconds before calling the Ansible script? Because even when the ssh connection is ready, other services on the new instance are still initializing (I hit an 'unable to acquire the apt lock' error when trying to run apt-get update)
  • Add the environment variable to skip the ssh host key check, because the public IP of the new instance is not yet known to our control machine.
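
For completeness, here is a minimal sketch of what ansible/jenkins_pb.yml could look like if the Jenkins setup already lives in a role; the role name 'jenkins' and the default jenkins_home value are assumptions, the real playbook comes from the roles in the posts linked above.

# ansible/jenkins_pb.yml - apply the existing Jenkins role to the new instance
- hosts: all
  become: yes
  vars:
    # overridden by --extra-var jenkins_home=/mnt/tools/jenkins_home from Terraform
    jenkins_home: /var/lib/jenkins
  roles:
    - jenkins
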
My suggestion is that you organize your Ansible scripts into Ansible roles, so you can re-use them and organize them hierarchically. Speaking of Ansible roles, let's take a deeper look at Terraform: it has modules, which help organize your script structure the same way Ansible roles do.

Terraform Module 


I created the 2nd sample to:

  • Create a custom project network
  • Create public and private subnets with reserved IP ranges
  • Create internal and external IP addresses
  • Assign a new compute instance to those resources
I will use a Terraform module to organize this use case (a great read for Terraform beginners), with the project structure as below:
    /project
          |__ main.tf
          |__ variables.tf
          |__ network.tf
          |__ modules/
                       |__ devops/        # module name
                                   |__ main.tf
                                   |__ variables.tf
                                   |__ subnets.tf

I try to use the same convention as Ansible roles: put the core tasks into 'main.tf' and the variables into 'variables.tf'. Terraform automatically includes all the '.tf' files in the same folder at run time, so you don't have to include them explicitly.

  • main.tf: initialize the GCP provider, feed data to the module and return the main output. The variables passed to the module have to be defined in the module, and any output consumed from the module has to be defined as an output of the module.

// Configure the Google Cloud provider
provider "google" {
    credentials = "${file("<path/service_account/key.json>")}"
    project     = "<my_project>"
    region      = "us-west1"
}

// Other global resources would be handled in separate scripts
// DevOps environment module
module "devops" {
    source             = "./modules/devops"
    vpc_network        = "${google_compute_network.vpc.name}"
    network_self_link  = "${google_compute_network.vpc.self_link}"
    prefix_name        = "${var.company}-${var.team}-${var.env}"
    var_public_subnet  = "${var.devops_public_subnet}"
    var_private_subnet = "${var.devops_private_subnet}"
}


// Display Output Resources
output "devops_master_address"  { value = "${module.devops.devops_master_ip}" }

output "vpc_self_link" { value = "${google_compute_network.vpc.self_link}" }


  • variables.tf: define the default values for the environment

variable "company" { 
    default = "<my_company>"
}
variable "team" {
    default = "devops"
}
variable "env" {
    default = "dev"
}
variable "devops_private_subnet" {
    default = "192.168.1.0/24"
}
variable "devops_public_subnet" {
    default = "10.125.1.0/24"
}


  • network.tf: create the custom network and its firewall rules for the compute instances in this project's network

resource "google_compute_network" "vpc" {
    name          =  "${format("%s","${var.company}-${var.env}-vpc")}"
    auto_create_subnetworks = "false"
    routing_mode            = "GLOBAL"
}
resource "google_compute_firewall" "allow-internal" {
    name    = "${var.company}-fw-allow-internal"
    network = "${google_compute_network.vpc.name}"
    allow {
        protocol = "icmp"
    }
    allow {
        protocol = "tcp"
        ports    = ["0-65535"]
    }
    allow {
        protocol = "udp"
        ports    = ["0-65535"]
    }
    source_ranges = [
        "${var.devops_public_subnet}"
    ]
}
resource "google_compute_firewall" "allow-http" {
    name    = "${var.company}-fw-allow-http"
    network = "${google_compute_network.vpc.name}"
    allow {
        protocol = "tcp"
        ports    = ["80"]
    }
}
resource "google_compute_firewall" "allow-ssh" {
    name    = "${var.company}-fw-allow-ssh"
    network = "${google_compute_network.vpc.name}"
    allow {
        protocol = "tcp"
        ports    = ["22"]
    }
}

  • modules/devops/main.tf: create a compute instance using the subnet and IP address resources defined in subnets.tf

data "google_compute_image" "shield_ubuntu18" {
    family = "ubuntu-1804-lts"
    project = "gce-uefi-images"
}

resource "google_compute_instance" "jenkins-master" {
    name          = "${format("%s","${var.prefix_name}-${var.devops_region}-master")}"
    machine_type  = "f1-micro"
    zone          = "${format("%s","${var.devops_region}-a")}"
    
    boot_disk {
        initialize_params {
            image = "${data.google_compute_image.shield_ubuntu18.self_link}"
        }
    }
    metadata = {
        ssh-keys = "<my_user>:${file("<path/my_user_public_key>")}"
    }
    network_interface {
        subnetwork = "${google_compute_subnetwork.public_subnet.name}"
        network_ip = "${google_compute_address.internal.address}"
        access_config {
            nat_ip = "${google_compute_address.external.address}"
        }
    }
    connection {
        host        = "${self.network_interface.0.access_config.0.nat_ip}"
        type        = "ssh"
        user        = "<my_user>"
        private_key = "${file("<path/my_user_private_key>")}"
        agent       = "false"
    }
    provisioner "file" {
        source      = "<path/src_files>"
        destination = "/tmp"
    }
    provisioner "remote-exec" {
        inline = [
            "hostname"
        ]
    }
}

# Return output to main script
output "devops_master_ip" { value = "${google_compute_instance.jenkins-master.network_interface.0.access_config.0.nat_ip}" }

  • modules/devops/subnets.tf: create the public and private subnets in the custom VPC network, plus an internal IP address in the public subnet and an external IP address
resource "google_compute_subnetwork" "public_subnet" {
    name          = "${var.prefix_name}-${var.devops_region}-pub-net"
    ip_cidr_range = "${var.var_public_subnet}"
    network       = "${var.network_self_link}"
    region        = "${var.devops_region}"
}

resource "google_compute_subnetwork" "private_subnet" {
    name          = "${var.prefix_name}-${var.devops_region}-pri-net"
    ip_cidr_range = "${var.var_private_subnet}"
    network       = "${var.network_self_link}"
    region        = "${var.devops_region}"
}

resource "google_compute_address" "internal" {
  name         = "${var.prefix_name}-${var.devops_region}-pub-int-ip"
  subnetwork   = "${google_compute_subnetwork.public_subnet.name}"
  address_type = "INTERNAL"
  region       = "${var.devops_region}"
}

resource "google_compute_address" "external" {
  name         = "${var.prefix_name}-${var.devops_region}-pub-ext-ip"
  address_type = "EXTERNAL"
  region       = "${var.devops_region}"
}

  • modules/devops/variables.tf: define the variables used by the devops module and provided from the project root
variable "devops_region" { 
    default = "us-west1"
}
variable "vpc_network" {
}
variable "network_self_link" {
}
variable "prefix_name" {
}
variable "var_pubic_subnet" {
}
variable "var_private_subnet" {
}

This sample mostly works on the GCP network side; you can take a look here to add a load balancer as the gateway to the public internet for your microservice design. The difference between the public and private subnet is the Private Google Access option of the subnet.
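
As an illustration of that option (and only as a sketch, the sample above does not rely on it), the private subnet in subnets.tf could enable Private Google Access so that instances with only internal IPs can still reach Google APIs:

resource "google_compute_subnetwork" "private_subnet" {
    name          = "${var.prefix_name}-${var.devops_region}-pri-net"
    ip_cidr_range = "${var.var_private_subnet}"
    network       = "${var.network_self_link}"
    region        = "${var.devops_region}"

    // let instances without external IPs reach Google APIs and services
    private_ip_google_access = true
}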

Here is another source where you can find more Terraform samples.

Alright, you can then run terraform apply with the sample above and check the result.

That should be it for a long quiet time!

PS: I will set up my GitHub repository to store the code, for convenience.
