Eiara Blog

The last 5 posts to the Eiara Blog

Deployment Roles

Fri 13 October 2017

One of the core aspects of the modern DevOps process is the CI/CD pipeline, where newly built deployment artefacts can be easily pushed into our various environments, automating the tedium of deployment and going live.

In terms of convenience, this is great. Complicated deployment procedures become turn-key automation, new software gets into testing or production quickly, and getting feedback on our work has never been easier.

However, when we’re building these systems, security is often a secondary thought, or not considered at all, and it’s all too easy to use highly-privileged credentials to set up these deployment roles.

We recently built out a deployment system for AWS Lambda using Terraform, and it took a lot of thought to work out what, exactly, our deployment role should be able to do, and what we trade away when we limit it like that.

The Need

AWS Lambda is easy to deploy and easy to work with, and there’s a myriad of opinions on how to deploy code to it.

We wanted to be able to use our standard TravisCI-based build process for pushing new versions of a codebase to AWS Lambda, by using the industry-standard tool Terraform to manage the deployment process.

As Terraform requires AWS API access, the fastest way to achieve this goal is just to create standard AWS access keys and insert them into the TravisCI UI. While this is largely safe, in that TravisCI is well-defended and manages its own breach detection, putting highly-privileged access keys here is still a risk if someone breaches your source code repository or one of your developers’ workstations.

Because TravisCI has to be trusted, those access keys are available to it in the clear, and so anyone with access to your source code repository effectively has access to those keys. This would be the same with any CI provider, because any CI provider would be in the same position of high trust.

So what should we be doing?

Principle of Least Access

In this case, and in every case, we should be thinking about how little access we need to get the job done.

When we built our system to deploy into AWS Lambda, we asked what we needed and how we should be thinking about deploying functions.

We decided on using a versioned S3 bucket to provide a historic record of function payloads, and a limited execution role to be passed to the Lambda itself.

From those decisions, the smallest set of permissions needed to deploy a function would be:

- Write to an S3 bucket
- List, create and delete functions
- Pass in an IAM role for execution

It’s important to note that our deployment isn’t going to be creating the S3 bucket, or the IAM role needed for execution.
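
As a rough sketch, that out-of-band setup might look something like the following Terraform; the bucket and role names simply mirror the placeholders used in the policies below, and attaching the stock AWSLambdaBasicExecutionRole policy is just one possible choice of execution permissions.

# Created separately, outside of the CI deployment role's permissions.

# Versioned bucket holding the historic record of function payloads
resource "aws_s3_bucket" "lambda_payloads" {
  bucket = "example-deployment-bucket"

  versioning {
    enabled = true
  }
}

# Execution role that deployed functions will run as
resource "aws_iam_role" "lambda_execution" {
  name = "role-name"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Basic CloudWatch Logs access for the functions themselves
resource "aws_iam_role_policy_attachment" "lambda_basic_execution" {
  role       = "${aws_iam_role.lambda_execution.name}"
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}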

S3 Bucket

The S3 bucket permission is fairly straightforward: we need to be able to upload payload files, and to know where to upload them. This part of our role won’t need to delete or modify existing files, as we’re focussing solely on letting S3 manage the historic record.

List, Create, Modify

The second piece is being able to modify functions. Because the CI system is authoritative when it comes to the deployment of functions, it needs to have the permissions to make these modifications.

However, the CI system should only be authoritative over its own functions. In our design, we implemented this by requiring a CI-specific prefix for the function names, ensuring that functions created through other means couldn’t be touched.

Pass a Role

Finally, a newly-created function needs to have an execution role associated. This role determines what a function is capable of doing, and this is probably the most important aspect of ensuring a consistent security profile when building this sort of CI service.

In order to create a function, the deployment role needs to be able to pass a role. In general, this could be any role, from the most basic AWSLambdaBasicExecutionRole to the core admin role.

In order to ensure we’re not able to assign an admin role to our function, we set up our deployment role so it can assign only a single, pre-determined execution role, and nothing else.

By doing this we can be assured that our functions can’t leak elevated access, and that functions can never do anything more than we expect.

Implications

Limiting our deployment role to this extent does come with a major potential downside: it introduces a gatekeeping stage every time we need to deploy functions that require different levels of access to our AWS accounts.

This kind of gatekeeping can be one of the major drivers behind creating a shadow IT environment, because it interferes in rapid testing and iteration processes.

This level of deployment role lockdown may not be appropriate for your environment, but it is worth having the conversation with your team about its necessity and impact, as well as the impact of not implementing these ideas.

Code Examples

So what does the setup for this deployment role look like in practice? Let’s look at some Terraform code to set it up:

The Policies

This first block of code defines the IAM policies required to create AWS Lambda functions. These policies are the core of what our CI role is able to do.

data "aws_iam_policy_document" "lambda_create" {

  # So that CI can bind the Lambda to the execution role

  statement {
    actions = [
      "iam:PassRole",
      "iam:GetRole",
    ]
    # Allows us to only provide one role to our lambda function
    resources = [
      "arn:aws:iam::account-id:role/role-name",
    ]
  }

  # So that we can create and modify all the Lambdas

  statement {
    actions = [
      "lambda:CreateAlias",
      "lambda:CreateFunction",
      "lambda:GetPolicy",
      "lambda:DeleteFunction",
      "lambda:GetFunction*",
      "lambda:ListFunctions",
      "lambda:ListVersionsByFunction",
      "lambda:PublishVersion",
      "lambda:UpdateAlias",
      "lambda:UpdateFunctionCode",
      "lambda:UpdateFunctionConfiguration",
    ]

    # But only the ones with the prefix
    resources = [
      "arn:aws:lambda:region:account-id:function:cicd_prefix_*",
    ]
  }
}


data "aws_iam_policy_document" "s3" {
  # Allows our CI provider to list the payloads bucket

  statement {
    actions = [
      "s3:ListBucket",
      "s3:GetBucketLocation",
    ]

    resources = [
      "arn:aws:s3:::example-deployment-bucket",
    ]
  }

  statement {
    # So we can update files in our own bucket, but not get or delete them.

    actions = [
      "s3:PutObject",
      "s3:PutObjectAcl",
    ]
    resources = [
      "arn:aws:s3:::example-deployment-bucket/*",
    ]
  }
}

The User

The second part is the user itself, and its group membership. By creating the user in this way and isolating it from the existing roles in AWS, we’re able to strongly control the role’s capabilities, and provide a skeleton for creating new deployment roles in the future.

# Creates an AWS user to hold the CI deployment role

resource "aws_iam_user" "ci" {
  name = "ci_user"
}

resource "aws_iam_group" "ci_group" {
  name = "ci"
  path = "/ci/"
}

resource "aws_iam_group_membership" "ci_group_membership" {
  name = "ci-group-membership"

  users = [
    "${aws_iam_user.ci.name}",
  ]
  group = "${aws_iam_group.ci_group.name}"
}


# Create the policy that permits creating Lambda functions

resource "aws_iam_policy" "allow_create_lambda" {
  name        = "allow_create_lambda"
  path        = "/ci/"
  description = "allows limited lambda source creation and modification"
  policy      = "${data.aws_iam_policy_document.lambda_create.json}"
}

# Create the policy that permits uploading Lambda functions to S3

resource "aws_iam_policy" "allow_ci_S3" {
  name        = "allow_ci_S3"
  path        = "/ci/"
  description = "allows limited S3 access for CI"
  policy      = "${data.aws_iam_policy_document.s3.json}"
}

# Connect the policies to the group that our CI user is a part of

resource "aws_iam_group_policy_attachment" "lambda_attach" {
  group      = "${aws_iam_group.ci_group.name}"
  policy_arn = "${aws_iam_policy.allow_create_lambda.arn}"
}

resource "aws_iam_group_policy_attachment" "s3_allow" {
  group      = "${aws_iam_group.ci_group.name}"
  policy_arn = "${aws_iam_policy.allow_ci_S3.arn}"
}

# Creates credentials that can be used in your CI provider of choice

resource "aws_iam_access_key" "ci_credentials" {
  user = "${aws_iam_user.ci.name}"
}

Implications

Pulling the credentials as we are with aws_iam_access_key does mean that these credentials are written into the Terraform state file, which may be inappropriate for your threat model.

If it is inappropriate, generating the access keys from the AWS console will be a better option, and should be explored instead.
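
A middle ground, if you’d rather keep key creation in Terraform, is to give the aws_iam_access_key resource a pgp_key, so that only a PGP-encrypted secret lands in the state file. A minimal sketch, assuming a Keybase identity (the username here is a placeholder):

resource "aws_iam_access_key" "ci_credentials" {
  user = "${aws_iam_user.ci.name}"

  # Placeholder identity; substitute your own Keybase user or a
  # base64-encoded PGP public key
  pgp_key = "keybase:example_user"
}

# Decrypt locally with your PGP key to retrieve the secret
output "ci_secret_encrypted" {
  value = "${aws_iam_access_key.ci_credentials.encrypted_secret}"
}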


Encryption is Not Enough

Fri 29 September 2017

With the rise of DevOps, we've seen a shift from the old world of delivery handoffs, where developers were disconnected from operational concerns, to delivery teams with a full gradient of skills, ranging from development best practices to appropriate SRE1 policies.

Coinciding with the rise of infrastructure-as-code has been a growing awareness of how some practices are harmful, the prime example being "don't check your secrets into source control."

We hear numerous stories of this having happened, from AWS keys compromised or SSL certificates leaked because they were inadvertently checked into Github, to trusted self-signed certificates being used to mask malfeasance. From these examples, one of the patterns we've picked up is to encrypt our secrets, to ensure that they are never stored unencrypted. This is an important step, and we can be assured to some extent that our secrets will remain secret, remain safe.

Security requirements and recommendations often touch on this necessity, pointing to a lack of encryption as a flaw in your information security architecture.

So we encrypt our files. We feel safe. We feel secure. We are secure. Right?

Goals

In this case, encryption has been added because our secrets must be encrypted. But, why? Because our security requirements said we must.

But, why? So that the secrets cannot be read by parties unknown.

So how does encryption prevent that?

If an attacker gains access to the ciphertext, say through a Github account breach, we feel as though we have security, that they will not be able to access the file or our secrets. But we won't know if they ever succeed in gaining access except through secondary means, such as detecting unauthorised queries.

If an attacker gains access to our deployed servers, servers which must have the secrets in the clear in order to use them, all our encryption is for nothing because anyone on that server can access those secrets, and use them as they see fit.

A core axiom of security planning is the awareness that we will, one day, lose. Once we treat our secrets as though they could be breached and misused at any moment, we start to see that just encrypting our secrets in our Github accounts or configuration management data stores is not, and cannot be, enough.

Instead, we need to think about access.

Guaranteed Access

Let's consider a standard three-tier web application deployment, with our load balancers, web application tier and database. Each layer of this design has access to a set of secrets, from the load balancers holding SSL keys to encrypt traffic to our clients, to the web servers holding passwords to log in to the database.

At each tier, the machines are trusted users of those passwords, not just expected to have them but required. The web server can't do its job if it can't connect to the databases, after all!

From that, a user with access to that machine is also trusted, to the same degree.

This tells us that we must consider a world where our secrets may be accessed at any time, by people outside of our control, and since the location of access is trusted, we cannot rely on encryption to enforce access control.

So, what if we accepted that? That anyone or anything on that server could reasonably access those secrets?

At that point, we start thinking not of restricting access but of awareness of access. If I know that my web application restarted and decrypted the database password, then I have external awareness of that fact. If my secrets are accessed and there is no correlated application restart, I can now ask if it was a legitimate access, and react appropriately if it wasn't.

By changing our focus from performing an action to considering a result, we move from encrypting secret material to being aware of illegitimate access of the secret material by using encryption technology.
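
As a rough, hypothetical sketch of what that awareness could look like on AWS, the Terraform below counts KMS Decrypt events recorded by CloudTrail and raises an alarm whenever one occurs; the log group name, namespace and thresholds are illustrative, and a real setup would correlate these alarms with known deployments or restarts.

# Assumes CloudTrail is already delivering events to this CloudWatch Logs group
resource "aws_cloudwatch_log_metric_filter" "secret_decrypts" {
  name           = "secret-decrypt-events"
  log_group_name = "cloudtrail-log-group"
  pattern        = "{ ($.eventSource = \"kms.amazonaws.com\") && ($.eventName = \"Decrypt\") }"

  metric_transformation {
    name      = "SecretDecryptCount"
    namespace = "Security"
    value     = "1"
  }
}

# Fires on any decrypt, so "was this a legitimate access?" becomes a
# question we can actually ask
resource "aws_cloudwatch_metric_alarm" "unexpected_decrypt" {
  alarm_name          = "unexpected-secret-decrypt"
  namespace           = "Security"
  metric_name         = "SecretDecryptCount"
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  alarm_description   = "A secret was decrypted; check for a matching application restart or deployment"
}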

Tools

The approach of treating access as the fundamental point of instrumentation is one of the underlying ideas behind Vault and Eiara's own Ratchet proposal, where encryption is used to provide a gateway, and accesses are always logged. We are given new ways to frame "is this a valid access?" and "should I continue to trust this server?", questions that are difficult to answer without this consideration.


  1. Site Reliability Engineering 


The Principles of DevOps

Fri 29 September 2017

The DevOps movement has been running for close to a decade, and has produced some of the most powerful tools for managing and deploying software.

We talk about how automation improves our lives, but we don’t talk about the underlying goals that drive why we focus on automation. We talk about moving from specialist servers to classes of machine, with only an implicit idea of the problems we are solving.

As part of our journey, we’ve identified three core principles that underpin, on a technical level, everything that DevOps is trying to achieve and improve: the principles of isolation, repeatability and determinism. These principles are the guiding forces behind the technical choices that our tools are making, and frame the way we think about reliability and resilience.

Isolation

In DevOps terms, the isolation principle covers clean lines of authority within our technical components, where each component of our infrastructure must be separated from every other piece to the greatest extent possible.

In a deployment sense, this tells us that our product is distinct, and must be considered distinct, from the operating system, and isolated from changes to it.

In a microservices sense, we’re taught that each service must have a strong, defined interface with other components, one that is static and unlikely to change.

Repeatability

The next principle speaks to our processes, insisting that repeating a process must be possible. In order to achieve that, we are required to fully capture every detail of the process, in a form that can be consumed by others.

A process isn’t repeatable if it relies on steps in a single person’s head, after all, and so that specialist knowledge must be documented, each step fully explored, and each stage thoroughly captured.

In a server sense, this tells us that a new server can be created from scratch in a known and knowable way, that we can give that task to anyone on our team and be reasonably assured that the process will be completed.

This principle is the first time we begin to capture the necessity for automation.

Determinism

The final principle tells us that the process we’ve defined must always produce the same result. If I set up a server, then it must be the same as the server you set up. If I install the product, it must be the same as if you install the product.

This final tenet demands automation due to the difficulty inherent in achieving consistent results without automated testing and build processes.

Determinism seems like it should be a part of Repeatability. After all, doesn’t a repeatable process imply that the results will be the same? Rather, Determinism posits that people performing a process will produce similar, not identical, results, and the goal of Determinism is to fully control for that variation.
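
As a small, hypothetical illustration (the AMI ID and instance type are placeholders), this is what repeatability and determinism look like in infrastructure code: a fully captured, fully pinned definition that produces the same server no matter who applies it.

# Pinned image and size: every apply of this definition yields the same server
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t2.micro"
}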

Properties

Using these core principles we are able to see the very powerful properties that DevOps practices offer us, enabling new forms of deployment and maintenance of our infrastructure.

Lowering the Costs

For instance, the DevOps principles insist we invest in both automation and consistent and decoupled interfaces between our components, which allows us to begin to treat our infrastructure as disposable. Why would we modify a system in an attempt to repair it when rebuilding it from scratch is an automated, repeatable, and deterministic act?

But a pure replacement does not help us diagnose the failure. Fortunately, instead of removing and replacing failed systems, we can disconnect a failed machine and isolate it for forensics and analysis, all without impacting the core of our infrastructure.

Being able to discard and recreate our infrastructure in this manner is only possible because we can guarantee the result of our process, because we know that there is no tight coupling that will break when we remove critical infrastructure, and that possibility only exists by adherence to the DevOps principles.

Scaling Further

Strongly isolated components in our infrastructure also allow access to the true power of the cloud: scale. By providing standard interfaces and decoupled properties, any component that communicates with another component shouldn’t need to communicate with just one instance of it; it could communicate with one of many.

With determinism and repeatability we are able to reach the promise of the elastic cloud, because we are never in doubt of our ability to guarantee the consistency of our processes.

Fail Fast, Fail Often

We can start to talk about the reliability and resilience properties that DevOps unlocks. When we guarantee that what has failed will be replaced with an identical result, we can treat failures as a known quantity. Test environments to reproduce the failure can be constructed without complicated work because our automation ensures the process will be identical.

This allows transient failures themselves to be managed automatically, reducing the workload of operational firefighting and freeing up resources to work on improving resilience in other areas of the infrastructure.

Test It All

Finally, adherence to DevOps principles allows for development and quality assurance to happen with access to an environment as close to production as possible. Changes can be tested, tried, and retested without complicated work creating environments or long change requests, shortening the turnaround from development to testing.

Complete automation also allows for tested environments to be vetted before introduction into production, with full, automated visibility into the differences now being within reach.

Potential

These are the properties that the DevOps principles unlock. By viewing our deployments as integral to our software development, and by adhering to the principles, we are no longer limited to “I think”, but are much more able to say “I know.”


Ratchet

Sat 12 August 2017

One of the core axioms of security for modern web deployments is to not check your keys into source control. From our API keys to our database passwords, these secrets being in our Git repository is a recipe for multiple kinds of disaster.

From the firing of a new engineer who used production credentials left in onboarding documentation to the numerous reports of plaintext credentials being used to pivot further into a breach, we’re instructed over and over to encrypt credentials, and decrypt only once we’ve deployed.

But.

Encryption is presented as the answer, as the secure approach, but encryption grants us nothing. We still need to be able to decrypt the secrets, and inject them into our app during startup. In testing, we often do this with the equivalent of .env files, and the Twelve Factor Application approach recommends we load our secrets into environment variables.

What this process loses is visibility. Our secrets, theoretically, only get decrypted and used once, during application startup. But, how do we know? In a development context, this is largely irrelevant, because our .env files or startup scripts have disposable credentials for disposable services. In production, though, do we know if credentials have been decrypted outside of our normal process?

Tools such as Hashicorp’s Vault are intended to solve this problem, by providing an interface that logs every request and regularly rotates key material, which both makes rogue access visible and ensures that any breach can be quickly repaired.

However, integrating Vault requires changes to development and deployment workflows. Instead of loading secrets from a file, developers need to modify the application to request credentials from the Vault API. This is fine, but it involves both extra developer load and a different code path in production and development.

Not to mention the extra maintenance and operational requirements that adding another service to the deployment stack entails.

What if there was a simpler idea?

Simpler Idea

Ratchet is a similar, simpler idea.

Ratchet works by providing a Docker container that mounts a directory and creates a .env in that directory. But instead of being a regular file, that .env is backed by a FIFO buffer.

Once the FIFO is opened by the application, to read its environment as normal, Ratchet reaches out to an API, fetches the standard credentials file and writes it to the FIFO, then ratchets around to wait for any subsequent read.

The API backing Ratchet is nothing more than a simple AWS Lambda or equivalent Azure serverless function, with access control handled by standard IAM roles or API Gateway access keys, and that Lambda function is able to transparently use AWS KMS or Azure Key Vault to decrypt credentials dynamically, sending them over secure HTTPS to the requesting Ratchet daemon.1

By running in this way, Ratchet ensures that credentials are always encrypted at rest by the Lambda function and cloud key management, and always encrypted in transit to the Ratchet daemon, offering similar visibility and security semantics to tools like Vault.

Crucially, this strategy also allows Ratchet to be seamlessly integrated with existing development and deployment workflows, as the same code path is always used - reading a file from disk with the credentials we need. These credentials are never stored on disk on the deployed machines, and any attacker reading from that file generates an access notification.

This technique also ensures easy rotation of secrets, as we merely have to update the Lambda function’s environment with new, KMS-secured environment variables, and perform a rolling restart of our services.
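
As a rough Terraform sketch of the Lambda side of this (the function name, handler, filename and ARNs are illustrative placeholders, not Ratchet’s actual configuration), rotation really is just an update to this resource followed by a rolling restart:

resource "aws_lambda_function" "ratchet_api" {
  function_name = "ratchet_credentials"
  filename      = "ratchet_api.zip"
  handler       = "handler.fetch_credentials"
  runtime       = "python3.6"
  role          = "arn:aws:iam::account-id:role/ratchet-api-role"

  # Lambda encrypts these environment variables at rest with this KMS key,
  # and the function decrypts and serves them to the Ratchet daemon on request
  kms_key_arn = "arn:aws:kms:region:account-id:key/key-id"

  environment {
    variables = {
      DATABASE_PASSWORD = "KMS-encrypted-ciphertext-goes-here"
    }
  }
}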

ECS

Ratchet uses a Docker image because it specifically came out of the difficulty involved in securely distributing credentials to containers running on ECS hosts. While direct API calls to AWS KMS are always a possibility, this requires updates and ongoing developer attention and support to maintain.

By using the Ratchet Docker image, the deployment and startup is seamlessly handled by ECS, and ensures that only a single interface point exists for managing key material.
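
A hypothetical sketch of what that looks like as an ECS task definition; the image names, paths and memory sizes are placeholders, with the important part being the shared volume that carries the FIFO-backed .env between the Ratchet sidecar and the application container.

resource "aws_ecs_task_definition" "app_with_ratchet" {
  family = "app-with-ratchet"

  # Shared volume holding the FIFO-backed .env
  volume {
    name = "secrets"
  }

  container_definitions = <<EOF
[
  {
    "name": "ratchet",
    "image": "example/ratchet:latest",
    "essential": true,
    "memory": 64,
    "mountPoints": [
      { "sourceVolume": "secrets", "containerPath": "/secrets" }
    ]
  },
  {
    "name": "app",
    "image": "example/app:latest",
    "essential": true,
    "memory": 256,
    "mountPoints": [
      { "sourceVolume": "secrets", "containerPath": "/app/secrets" }
    ]
  }
]
EOF
}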

Ratchet aims to be fully application-agnostic by providing a simple, cross-language interface that’s easy to integrate with. This simplicity and generic approach ensures that any web stack can benefit from secure, reliable secret storage with a minimum of developer time or maintenance.


  1. While Ratchet is currently built for AWS services, Azure offers identical functionality, and Ratchet would run seamlessly on the Azure platform.