If you’ve followed me on Twitter for the last few months, you’ve seen me talk a lot about what I’m doing with Cloud Island[^1]. I also wrote an article on the design behind Cloud Island and the decisions that we made.
While I was talking about this project on Twitter, I was approached by a developer who asked how I managed to plan the infrastructure for Cloud Island.
They mentioned that they often feel overwhelmed while trying to figure out what the infrastructure even is or needs to be, and asked me if I could explain what goes into an infrastructure deployment and how to think about it themselves.
I thought this was a good point to touch on how Eiara plans our infrastructure, and the ideas that we bring to our customers.
The Three-Tier Design
It’s a basic reality that almost everything we deploy nowadays is a web service of some variety, which gives us an incredibly consistent set of concepts and semantics. This allows us to focus on the specifics of your service, instead of trying to create new concepts and designs.
The three-tier design works extremely well because it describes how web software is built, which causes new web software to be built to this concept, creating a positive feedback loop that improves design for everyone.
The three tiers are:

- The web server,
- The application, and
- The database.
These conceptual tiers require us to think about a system in terms of what pieces are needed to even run it, and also provide a useful metaphor for considering how trusted each tier is.
The web server tier is the nginx or loadbalancer tier. This is the tier of systems that directly face the end user, which may be the internet at large. As a result, this tier is the least trusted, and has the greatest exposure to potential threats.
This is the tier where the fewest services should run, limited to the webserver or loadbalancer, and a server running the SSH bastion or VPN endpoint.
Anything else that runs in this tier is directly vulnerable.
Putting database systems like Elasticsearch or Redis in this public-facing tier has been the cause of a huge number of security breaches and database dumps over the years.
The application tier is where the custom software runs. This is generally the Rails, Python, Java, or other application stack that the developer team is working on.
This tier is conceptually separate from the webserver tier, but is often merged on the same physical hardware.
This tier also carries more trust, as it’s running the business logic that makes the service work. It has heightened access to the underlying databases. This is often the tier where an attacker is most likely to achieve a breach through exploiting the application software.
The database tier holds the database servers relevant to the application. Where possible, this tier should be completely isolated from the internet, with access only possible through a VPN or SSH bastion.
This tier has the highest level of trust possible, as it holds the crown jewels. Not only is the application data available here; this tier often also finds itself home to a variety of other sensitive databases and data sets.
I’ve used the term “trust” when describing these tiers, and the idea I’m trying to communicate is a combination of a tier’s level of exposure to the outside world and direct attack, and the importance of the software and data present.
“Trust” helps us highlight that different systems have different levels of responsibility, and reminds us that we should be thinking about how we can restrict movement between these levels, and how well we can monitor the points where those transitions occur.
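As a rough illustration of “restrict movement between these levels” (the tier names and adjacency rule here are my own toy model, not any particular platform’s behaviour), the idea can be sketched as a simple policy check:

```python
# A toy model of tiered trust: traffic may only flow between
# adjacent tiers, so a compromised web server can't talk
# straight to the database. Names and ordering are illustrative.
TIERS = ["web", "application", "database"]  # least to most trusted

def may_connect(src: str, dst: str) -> bool:
    """Allow traffic only between a tier and its immediate neighbour."""
    return abs(TIERS.index(src) - TIERS.index(dst)) == 1

assert may_connect("web", "application")       # nginx -> Rails: fine
assert may_connect("application", "database")  # Rails -> Postgres: fine
assert not may_connect("web", "database")      # nginx -> Postgres: blocked
```

The transition points between tiers are then exactly the places where monitoring belongs.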
So now that we have a good, easy-to-use set of design criteria for a webservice deployment, we can start thinking about what our plan is going to look like. This breaks into several major parts: “what do we need?”, “who is going to attack me?”, and “what are my access controls?”
What do we need?
What you’re going to need is, as always, highly dependent on how your webservice is built. The common pattern, as above, will be the webserver (nginx), the application code, and the supporting databases.
When we’re planning at this level, we’re thinking about the specifics of servers themselves. What kinds of servers are needed? How many are needed? Do we need to think about high availability or other complex redundancy?
What kind of storage do we need? This will include the database software and storage medium, with thought given to what sort of workload we expect - write heavy or read heavy, but also database backups and where those backups are going to be stored.
If your service offers user uploads, where will those be stored? Do they need to be backed up? If they do, where will they be backed up?
For Cloud Island, we needed a web server to run the Rails application, a background processing server to handle both standard running tasks and other maintenance, and three database servers.
Two of the database servers, Redis and Elasticsearch, don’t need backups as they’re used for caching, and their data can be regenerated from Postgres at any time.
The main data store, Postgres, does require backups. Those backups are handled by swiftbackmeup, and are automatically uploaded to the Catalyst Cloud object storage before being mirrored for redundancy.
It’s hard to plan for capacity for a service you’ve never run before, and it’s very easy to fall into a pit of chasing research for exactly what you’re going to need to run your system.
The problem is, you will get it wrong.
Every system is different and will have different scaling needs and different load patterns, and it’s very hard to get a good sense of those needs in advance.
Instead of focussing on what could be, the best option for planning for capacity is to make it easy to scale.
In the case of Cloud Island, even though we put the three core Rails services on a single server, using docker-compose means that it will be trivial to move those services onto their own servers as performance requirements dictate.
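As a sketch of what that looks like (an abridged fragment in the shape of Mastodon’s stock docker-compose.yml; the exact commands may differ between versions), each Rails-side service is a separate compose entry, so splitting them across machines is mostly a matter of running a different subset of services on each host:

```yaml
# Abridged, illustrative docker-compose sketch. Because web,
# streaming, and sidekiq are separate services, any of them
# can later be moved to its own server.
services:
  web:
    image: tootsuite/mastodon
    command: bundle exec puma -C config/puma.rb
  streaming:
    image: tootsuite/mastodon
    command: node ./streaming
  sidekiq:
    image: tootsuite/mastodon
    command: bundle exec sidekiq
```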
In the case of our database servers, we’re using custom volumes that can be detached and reattached to bigger servers later if needed, or merged onto a single server if cost reductions become necessary.
Access Controls and Security Planning
The second major part of any infrastructure plan is considering what basic security controls are necessary and important.
For example, any number of off-the-shelf tutorials for infrastructure will default to using high-permission utility accounts or IAM rules, leading to escalation paths in the event of a breach.
The default three-tier architecture brings some security controls in by default, but they’re a baseline.
On top of that baseline, for each component we add to the service we need to ask “What does this need to access?” and “How do I prevent other things from also accessing that?”
The framework for asking these questions is called threat modelling. Threat modelling can take many forms, from formal and in-depth processes to a quick and ad-hoc conversation about common or expected attackers.
The major questions that threat modelling asks us are who wants to attack us, what are they after, and what resources do they have. Answers to these questions can be as unique as the service you’re running, and aren’t always what immediately comes to mind.
For Cloud Island, I think the major attack scopes are normal background scanning and attacks, DDoSes, and people trying to dump our database to get our account data.
Our Access Controls
For Mastodon, the access controls we’ve added are:
- Restricting external SSH to the Bastion,
- Restricting internal SSH to only originate from the Bastion,
- Restricting what servers can access Redis, Postgres, or Elasticsearch, and
- Restricting how servers can access the monitoring service.
Cloud platforms offer security groups as a basic unit of access control, and we use them with the access label pattern, with a lock-and-key naming convention. For example, Postgres is assigned the postgres_database_lock security group, which restricts connections to only be permitted from servers that hold the postgres_database_key security group.
By doing this, we can add or remove new web servers that need access to Postgres without needing to adjust IP ranges as servers come online or go offline. We can simply define what security groups should be applied to the server, and trust that it will Just Work.
It also becomes easy to audit what servers have been granted access to the database server, without needing to try to map IP addresses to individual servers.
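The behaviour of the lock-and-key pattern can be sketched in a few lines (the group names follow the Postgres example above; the evaluation logic is a simplified stand-in for what the cloud platform actually does):

```python
# Toy evaluation of the lock-and-key security group pattern:
# a server may reach Postgres only if it holds the key group.
def may_reach_postgres(server_groups: set[str]) -> bool:
    return "postgres_database_key" in server_groups

# Hypothetical servers and their attached security groups.
servers = {
    "web-1":   {"web_lock", "postgres_database_key"},
    "bastion": {"bastion_lock"},
}

assert may_reach_postgres(servers["web-1"])       # holds the key: allowed
assert not may_reach_postgres(servers["bastion"])  # no key: refused

# Auditing is a lookup over group membership, not IP mapping:
with_access = [name for name, groups in servers.items()
               if may_reach_postgres(groups)]
assert with_access == ["web-1"]
```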
Cloud Service Users
For Cloud Island, we need to be able to upload images to the Catalyst Cloud object store, which means we need a user account.
It’s tempting to use a root account, or an account that has full object store permissions, for that service account. After all, it solves our problem of “uploading images” quickly and easily.
But if we do that, what else can that account do that we wouldn’t want to happen?
In the case of Cloud Island, an attacker that gets onto the web serving machine would have access to that password. If that’s a full root account they would have access to everything I have access to, like deleting running instances or creating new ones to deploy mining software.
Even if we create a user limited to just full object store permissions, an attacker would be able to download all the images without issue, delete every piece of media ever uploaded, or download and delete all the backups.
That’d be pretty catastrophic.
So, a service account should probably be limited to just being able to upload and delete images, not have any other capabilities, and have the permissions to do those uploads explicitly granted.
This limits our potential risk to “an attacker can read the database to figure out what all the images are, and download them”. This is still a risk, but it’s a less severe risk than “can delete everything”.
Given we also have an idea of what this attack would look like, we could put monitoring on the database to let us know if anyone is crawling it for images to download, or trying to dump the database as a whole.
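One hedged sketch of what that monitoring could look like (the threshold and window are invented numbers, not anything Cloud Island actually runs): flag any client that performs media lookups much faster than a normal user would.

```python
from collections import deque

# Toy rate monitor: flag a client that performs more than `limit`
# media lookups inside a sliding `window` of seconds.
# The numbers are illustrative, not tuned values.
class CrawlMonitor:
    def __init__(self, limit: int = 100, window: float = 60.0):
        self.limit, self.window = limit, window
        self.events: dict[str, deque] = {}

    def record(self, client: str, timestamp: float) -> bool:
        """Record a lookup; return True if the client looks like a crawler."""
        q = self.events.setdefault(client, deque())
        q.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while q and q[0] < timestamp - self.window:
            q.popleft()
        return len(q) > self.limit

monitor = CrawlMonitor(limit=5, window=60.0)
assert not monitor.record("10.0.0.5", 0.0)   # one lookup: normal
for t in range(1, 6):
    monitor.record("10.0.0.5", float(t))
assert monitor.record("10.0.0.5", 6.0)       # 7 lookups in a minute: flagged
```

In practice this kind of check would more likely live in a query log pipeline or an alerting system than in the application itself.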
Other Security Concerns
This is just one of the security concerns that I thought about while designing Cloud Island, but it encapsulates the core concept of how “security” should be considered as part of a broader whole.
Because we asked ourselves what a potential attacker may want to do and what they may want to access, we could think about what security made sense, and make some educated guesses about what the goals would be. From those guesses (and like all things, they are guesses) we can build in friction, make it harder to achieve those goals, and try to limit access to ways that we can monitor.
Planning for Failure
One of the core tenets of a good DevOps practice is the idea that everything will fail, and that failure is inevitable.
Instead of using this fact to try to resist failure, the ideal of a modern system is to make repair or replacement easy. The easier it is to replace and recover a system, the less stressed we will be about potential failures.
There is no One True Disaster Recovery Plan. Every system not only has unique ways it may fail, but also unique requirements for which components are important to recovery.
Disasters aren’t limited to just the web service, either. A disaster can impact any component at any step of the deployment and operation of the service. Is the customer-facing website, not the webapp, part of your disaster plan? Does it matter if it goes down?
Does it matter if your system goes down for a day? A week? A month? What have you agreed to support with your users, and how upset will they be if there’s a failure?
What about your billing system, if that’s applicable?
For Cloud Island I think there are two major areas of catastrophic failure, one of which is both easier to manage and more critical to get right. The two areas are the media storage and the database backups.
Media storage is, by far, the hardest part to offer any level of recovery in the event of catastrophic loss. Mastodon generates gigabytes of media, the vast majority from other instances, but does not currently offer any automated tools to extract and back up the local images.
The much more critical component of recovery for Cloud Island will be the database itself. While it would be a huge loss if the uploaded media was unrecoverable, it would be a service-ending event to lose the database.
Knowing that, I’ve prioritised database backups, and ensuring that they’re hard to delete. Catalyst Cloud is a modern, multi-region cloud provider with data centres spread across New Zealand, and any one of those regions experiencing a disaster would not cause me to lose the database backup.
The way the backups are set up, accidentally deleting the database server or having the data deleted by an attacker won’t cost me the backups. It’ll be a huge pain, it’ll take the service down for days, but it won’t be the end of everything.
Of course, an untested backup is not a backup, just a hope that things might get better.
Cloud Island has a backup of the Postgres databases. It’s quite straightforward to reload a Postgres database backup, and the entirety of the infrastructure behind Cloud Island is set up via automation.
As a result, if we need to redeploy to Catalyst Cloud, we could be back online from absolutely nothing in about half a day.
If we needed to migrate to another OpenStack provider, we’d be facing a much longer recovery period as some of the automation makes some Catalyst Cloud-specific assumptions that would have to be reconsidered. We might be looking at a week or so for a full recovery.
Half a day? I’m pretty happy with that, for a young service like Cloud Island.
A week? I’m not so comfortable with that, but, I have confidence that Catalyst Cloud isn’t going anywhere and that their multi-region deployment means a very low risk of catastrophic failure.
Much of this design ideal comes from experience built up over years of weighing the benefits of added abstractions against their added complexity.
These abstractions aren’t free, after all. Setting up individual service accounts, maintaining passwords, designing the access layouts, and thinking about how data will flow in a system is a lot of up-front work.
It’s work that, for Cloud Island, I considered a fundamentally critical aspect of deploying the system. Thinking about backups, about potential security risks, about how to ensure the system is here today and can be relied on to be here tomorrow is my responsibility.
[^1]: A Mastodon instance, a part of the world-wide federated social network.