Renting a virtual computer on Amazon Web Services (AWS)
Once upon a time, organizations that worked with large datasets needed to own and maintain their own servers. These days, renting time and space on someone else’s servers (a practice referred to as cloud computing) is often more cost-effective. Amazon Web Services (AWS) is currently the most widely used cloud computing platform. As a result, there is a growing need for people who work with large datasets to understand how to use AWS.
Getting started with AWS can be intimidating: the platform provides a large number of services, many of which are referred to by acronyms. My goal in this post is to provide a conceptual framework that makes the platform easier to learn. We will focus primarily on the services that allow you to “rent” a virtual computer and storage space.
Renting a “computer”
Imagine a distant room full of desktop computers, each of which is configured differently. Some use Linux as their operating system, while others use Windows. Some have more memory and/or processing power than others. Different computers also have different networking capabilities.
Now imagine being able to use your personal computer from home to operate one of the computers in that room. You must pay to “rent” this computer, with higher-end computers costing more money to use.
Amazon’s Elastic Compute Cloud (EC2) service is sort of like that. Instead of a room full of independent desktop computers, there is a building full of servers, which are high-performance computers (without monitors) that can be accessed through a network. Rather than renting a desktop computer, you are renting a portion of a server that is set up to behave kind of like a desktop computer. This is called a virtual machine. There may be multiple virtual machines running on a single server.
In order to set up a virtual machine, you must select (or create) a template that describes the type of computer you would like to use. Each of these templates is referred to as an Amazon Machine Image (AMI). The AMI determines, for example, which operating system and applications your virtual computer starts out with.
Each running copy of a virtual machine is referred to as an instance. “Turning on” a virtual computer for the first time is referred to as launching an instance. You can pay for a virtual machine on demand for a short amount of time (e.g. seconds to hours), or you can commit to a long amount of time (e.g. one to three years) in exchange for a lower rate. Stopping an instance is analogous to turning off a computer; you can start it up again later. Terminating an instance, on the other hand, is analogous to using a magic wand to make a computer disappear: the part of the server that was being used to host your virtual machine is now freed up to do something else. After you terminate an instance, you can “recreate” your virtual computer from a “template” by launching a new instance using the same AMI. The new instance starts from the state captured in the AMI, not from wherever the old instance left off.
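To make the difference between stopping and terminating concrete, here is a minimal sketch of the lifecycle in plain Python. It is a simplified toy model, not the AWS API: the `Instance` class, the `launch` function, and the `"ami-example"` ID are all made up for illustration, and real instances pass through a few more intermediate states.

```python
# Toy model of the instance lifecycle. All names here are hypothetical;
# nothing in this block talks to AWS.

class Instance:
    def __init__(self, ami_id):
        self.ami_id = ami_id     # the template this instance was launched from
        self.state = "running"   # launching leaves the instance running

    def stop(self):
        # Like turning off a computer: it can be started again later.
        if self.state != "running":
            raise ValueError(f"cannot stop an instance that is {self.state}")
        self.state = "stopped"

    def start(self):
        if self.state != "stopped":
            raise ValueError(f"cannot start an instance that is {self.state}")
        self.state = "running"

    def terminate(self):
        # The magic wand: there is no way back from this state.
        self.state = "terminated"


def launch(ami_id):
    """Create a brand-new instance from a template (AMI)."""
    return Instance(ami_id)


first = launch("ami-example")
first.stop()
first.start()        # fine: a stopped instance can be started again
first.terminate()

try:
    first.start()    # not fine: a terminated instance is gone
    restarted = True
except ValueError:
    restarted = False

# The way to "get the computer back" is to launch a new instance from the same AMI.
second = launch(first.ami_id)
print(first.state, restarted, second.state)
```

The point of the sketch is that `second` is a different object from `first`: it shares the template, but nothing else.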
You control what kind of software your virtual computer has by choosing an AMI, but you control what kind of hardware your virtual computer has by choosing an instance type. Different instance types have different processing and networking capabilities, as well as different amounts of short-term memory and long-term storage space. As a result, different instance types are optimized for different uses. For example, instance types with more processing power may be better suited for running machine learning algorithms, while instance types with more long-term storage space may be better suited to retrieving information from large databases. The full list of instance types is quite large. We will limit our discussion to a few of the parameters to consider when choosing an instance type.
Processing power
One factor to consider when choosing an instance type is the number of virtual central processing units (vCPUs) you want your machine to have. In order to understand vCPUs, we need to discuss “real” central processing units (CPUs). A CPU is a piece of hardware that executes the instructions that make up a computer’s software. Each sequence of instructions is called a thread. If a computer were a brain, the CPU would do the “thinking” and each “train of thought” would be a thread.
Different CPUs can process different amounts of information at different speeds. The processing power of a CPU is determined in part by the number of cores it has. Each core is an individual processing unit. Once upon a time, each CPU had a single core, but thanks to advances in technology, multiple cores now fit on a single piece of hardware.
The processing power of a CPU is also determined by how many sets of instructions each core can process at once. Using a technique called simultaneous multithreading (Intel’s version is known as Hyper-Threading), a single core can process multiple threads, typically two, at the same time.
Earlier, we discussed how a single physical server can be subdivided into multiple virtual computers. This is performed in part by subdividing a physical CPU into multiple virtual CPUs. When you rent a virtual machine with one vCPU, you are generally renting the ability to process one thread of instructions on one core of a physical CPU. The number of vCPUs you need to rent depends on how you plan to use your virtual computer.
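The arithmetic behind this subdivision is simple enough to sketch. The example below assumes a hypothetical physical CPU with 24 cores and two threads per core; the function names are made up, and a real server would also set some capacity aside for its own management software.

```python
def vcpus_on_cpu(cores, threads_per_core):
    """Total hardware threads, each of which can back one vCPU."""
    return cores * threads_per_core


def machines_that_fit(total_vcpus, vcpus_per_machine):
    """How many equally sized virtual machines the vCPUs can be split into."""
    return total_vcpus // vcpus_per_machine


total = vcpus_on_cpu(cores=24, threads_per_core=2)
print(total)                        # 48
print(machines_that_fit(total, 4))  # 12
```

In other words, under these assumptions one physical CPU could host twelve virtual machines with four vCPUs each.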
You should also consider whether to choose an instance type with one or more graphics processing units (GPUs). GPUs are similar to CPUs, but they are designed to perform many small tasks at once. Although they are often used for processing graphics, they can also be used for other functions, such as deep learning.
Short-term memory
Computers use different components to store data for immediate access or long-term storage. Short-term data storage is referred to as memory. A hardware component called random access memory (RAM) stores data that the processor needs to access quickly, such as data used to run applications. You can select instance types with different amounts of RAM; a virtual computer with more RAM can keep more applications and data in memory at once, which helps it multitask without slowing down.
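Once you know roughly how many vCPUs and how much RAM you need, choosing an instance type becomes a filtering exercise. Here is a minimal sketch of that idea. The four entries are real instance type names with the vCPU and memory figures as I understand them, but treat them as illustrative and confirm against AWS’s current listing; the `InstanceType` class and `candidates` function are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float


# A tiny, illustrative slice of the catalog (verify figures before relying on them).
CATALOG = [
    InstanceType("t3.micro", 2, 1),
    InstanceType("m5.large", 2, 8),
    InstanceType("c5.xlarge", 4, 8),
    InstanceType("r5.large", 2, 16),
]


def candidates(min_vcpus, min_memory_gib):
    """Names of instance types that meet both minimum requirements."""
    return [
        t.name
        for t in CATALOG
        if t.vcpus >= min_vcpus and t.memory_gib >= min_memory_gib
    ]


print(candidates(2, 8))  # ['m5.large', 'c5.xlarge', 'r5.large']
print(candidates(4, 4))  # ['c5.xlarge']
```

In practice you would then compare the remaining candidates on price, storage, and networking.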
Long-term storage
A computer’s long-term “memory” is referred to as storage. A physical computer will generally use either a hard disk drive or a solid-state drive to perform long-term data storage. A hard disk drive features a spinning disk that is used for storing and accessing data. A solid-state drive stores data on chips and does not feature any moving components. Hard disk drives are generally less expensive, while solid-state drives are faster and more durable.
Intuitively, it might make sense that you would choose an instance type with either a hard disk drive or a solid-state drive. The options provided by Amazon, however, are slightly more complicated. This is because Amazon rents out storage space through multiple services. When selecting an instance type, you may see that the instance includes direct access to a certain amount of space on a hard disk drive or a solid-state drive that is physically attached to the server. This is referred to as an EC2 instance store. Data in an instance store is temporary: it does not survive stopping or terminating the instance. Alternatively, you may see that the instance requires the use of another Amazon service for storage.
So far, we have been talking about the Amazon service known as EC2, which allows you to rent a virtual computer. You can rent long-term data storage separately through multiple Amazon services. The three services we will discuss below use different methods to store and retrieve data: object, block, or file storage.
The least complex of these options is the Simple Storage Service (S3), which uses a method called object storage. In S3, you can create data storage containers known as buckets. The things you store in buckets are referred to as objects. S3 does not use the type of “file folder” organizational system that you are used to using on your home computer: the objects are all just “sitting” on the bottom of the bucket. You can use a web interface to create a bucket and upload objects. You can also access the objects in an S3 bucket from an EC2 instance, either through the S3 API and command-line tools or by using a tool that mounts the bucket like a drive. Object storage is generally less expensive and slower than block or file storage.
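If the objects are all sitting on the bottom of the bucket, how do people keep them organized? Each object has a name (its key), and a key can contain slashes, so a shared prefix can stand in for a folder. The sketch below models a bucket as a plain Python dictionary to show the idea. It is a toy model, not the S3 API, and the keys and the `list_keys` function are made up.

```python
# A bucket modelled as a flat mapping from object key to object contents.
# There are no nested folders here, only keys that happen to contain slashes.
bucket = {
    "photos/2023/cat.jpg": b"...",
    "photos/2023/dog.jpg": b"...",
    "photos/2024/bird.jpg": b"...",
    "notes.txt": b"...",
}


def list_keys(bucket, prefix=""):
    """Return every key that starts with the given prefix, in sorted order."""
    return sorted(key for key in bucket if key.startswith(prefix))


# "Opening a folder" is really just filtering the flat list by prefix.
print(list_keys(bucket, "photos/2023/"))
# ['photos/2023/cat.jpg', 'photos/2023/dog.jpg']
```

This is why a bucket can look like it has folders in the web interface even though the storage underneath is flat.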
In contrast, the Elastic Block Store (EBS) service uses a method called block storage. In block storage, data is broken up into chunks called blocks before it is stored. To use EBS, you start by creating a virtual drive called a volume. You can specify how big you want this volume to be and whether you want it to be backed by a hard disk drive or a solid-state drive. You can access your EBS volume after attaching it to an EC2 instance and mounting it. To see a step-by-step example of this process, check out this article. Block storage is generally more expensive than object storage, but you can access data more quickly.
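The idea of breaking data into blocks can be sketched in a few lines. This is only a conceptual illustration with a made-up function name and a tiny block size; it says nothing about how EBS is actually implemented.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


data = b"abcdefghij"
blocks = split_into_blocks(data, 4)
print(blocks)     # [b'abcd', b'efgh', b'ij']

# Any single block can be fetched by its position without reading the rest.
print(blocks[1])  # b'efgh'

# Putting the blocks back together in order recovers the original data.
assert b"".join(blocks) == data
```

Being able to read or rewrite one block without touching the others is part of what makes block storage quick to access.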
Amazon offers several services that use a method called file storage: files are organized in a hierarchical system that can be accessed simultaneously by multiple applications and users. This method is similar to the way you access files on your home computer. You may want to use the file storage method if you have a manageable number of organized files and are looking for a familiar interface.
Distinguishing between the various Amazon services that use file storage is challenging; this presentation is the most useful resource on this topic that I have found. The Elastic File System (EFS) service can be used with Linux and macOS and only charges you for the storage you actually use (as opposed to charging you for a set amount of reserved storage space, as in EBS). FSx for Windows File Server is compatible with Windows-based applications, while FSx for Lustre may be suitable for high-performance computing.
Pricing
Discussing AWS pricing in depth is a topic for another post, but you can find information on EC2 pricing here. AWS provides certain services for free, often up to a specified usage limit; you can find more information on what is included in their Free Tier here. You can also control your costs by setting up budgets and budget alerts; you can find more information on setting up AWS budgets here.
Regions
The infrastructure that Amazon uses to support its services is spread out all over the world. When you use AWS services, you can choose to use servers in a specific geographical region. For more information on AWS regions, click here.
In summary
- You can use Amazon’s EC2 service to rent a virtual computer
- You control what kind of operating system and software you use by selecting or creating an AMI
- You control what kind of hardware you use by selecting an instance type
- You may need to use another Amazon service (e.g. S3, EBS, EFS, or FSx) to provide data storage
Try it yourself
Free online resources that clearly explain AWS at the introductory level are hard to find. If you are interested in a guide that walks you through steps like creating an account and launching an instance, try this tutorial from AWS and W3Schools. If you have a learning resource to recommend, please drop it in the comments below!