Recent developments in data storage and processing have been motivated by the increasing amount and complexity of data available to individuals and companies. Most of these advances require sophisticated and powerful hardware, and aspiring data scientists must be able to understand and master these new tools.
For instance, given the computing power required to train algorithms, training is usually carried out “on the cloud” (i.e. by remote access to a virtual machine), which avoids buying and maintaining very expensive hardware, especially if peak usage is only occasional. Cloud computing makes these very intensive tasks possible and must now be part of any data scientist's toolbox.
One of the most popular service providers is Amazon, which provides IaaS (infrastructure as a service) through AWS. I have to admit that the AWS interface is daunting and might be very off-putting for a beginner. I started using AWS back in 2012 to run Monte Carlo simulations: back then I had a cheap computer, I was taking classes on two campuses, and I only went to the one with the powerful computers every other week. I recently stumbled upon my notes while trying to figure out how to launch an AWS instance and realized how much the interface had changed since then. I thus decided to update them and, in the process, make them available to anyone starting out with AWS.
I assume that the reader already has an AWS account and a basic knowledge of Linux terminal commands. If not, here is a quick tutorial. Setting up an EC2 (Elastic Compute Cloud) instance is a matter of 5 steps:
1. generate a key pair
As I said before, since the computing resource is in the cloud, we need to access it remotely. To do so securely, we use an SSH connection, which requires a key pair. You can generate one by going into the AWS EC2 dashboard:
And then you need to select the Key Pairs tab:
You can now create a key pair:
You only need to give the key pair a name:
This will download the private key of the key pair as a .pem file. It is not possible to connect to the instance without this file, so do not lose it. I repeat: DO. NOT. LOSE. IT. If you do lose it, you may still be able to recover access to your instance; AWS provides some information here. Some additional precautions must be taken to ensure that only the file's owner can read it, by using this command in the terminal (after a cd into the directory where the file was downloaded):
chmod 400 myec2instance.pem
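For readers who prefer the terminal, the console steps above could be sketched with the AWS CLI. This is only a sketch under assumptions: it presumes the AWS CLI is installed and configured with credentials (not covered in this post), and create_key_pair and key_is_private are hypothetical helper names, not AWS commands.

```shell
# Sketch: create a key pair from the terminal and lock down the .pem file.
# Assumption: the AWS CLI is installed and configured (e.g. via "aws configure").

create_key_pair() {
    key_name="$1"                 # e.g. myec2instance
    key_file="${key_name}.pem"

    # KeyMaterial is the private key; it is only returned at creation time.
    aws ec2 create-key-pair \
        --key-name "$key_name" \
        --query 'KeyMaterial' \
        --output text > "$key_file" || return 1

    # Same precaution as the chmod command above: owner read-only.
    chmod 400 "$key_file"
}

# Succeeds only if the file mode is exactly 400 (-r--------).
key_is_private() {
    [ "$(ls -l "$1" | cut -c1-10)" = "-r--------" ]
}

# Example usage (commented out so that nothing runs by accident):
# create_key_pair myec2instance && key_is_private myec2instance.pem
```

The permission check matters because ssh typically refuses a private key file that other users can read.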
2. create a new user (optional but recommended)
Now that you have your key pair, you need to create a new user. This step is not required if only one user connects to the instance, but it is necessary if the connection is shared (to set up access rights). To do so, go to the IAM (Identity and Access Management) section:
Then add a new user:
The new user's name must be filled in and its access type chosen (programmatic and/or through the AWS console):
Once this is done, the user's permissions must be chosen (this is the most important step). Here, since we have only one user, we set up administrator rights:
Then you can download the user access IDs:
Now that the new user is created, it is possible to access the AWS platform with the link provided in the downloaded csv:
New users do not have root access; only the original account does. In any case, I would advise against using root access for this tutorial: it is simply best practice not to learn the interface while logged in as root.
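The user creation described in this step could also be sketched with the AWS CLI. As before, this assumes a configured AWS CLI run by an identity allowed to manage IAM. The function name create_admin_user is hypothetical, and the sketch only covers programmatic access (an access key), not the console login.

```shell
# Sketch: create an IAM user with administrator rights and an access key.
# Assumption: the caller's credentials are allowed to manage IAM.

create_admin_user() {
    user_name="$1"

    aws iam create-user --user-name "$user_name" || return 1

    # AdministratorAccess is the AWS-managed policy granting full rights,
    # matching the single-user choice made in this tutorial.
    aws iam attach-user-policy \
        --user-name "$user_name" \
        --policy-arn arn:aws:iam::aws:policy/AdministratorAccess || return 1

    # Prints the access key ID and secret; the secret is shown only once.
    aws iam create-access-key --user-name "$user_name"
}

# Example usage (commented out so that nothing runs by accident):
# create_admin_user my-tutorial-user
```

For a shared account, a narrower policy than AdministratorAccess would usually be the safer choice.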
3. create billing alert (optional but strongly recommended)
Not only is this best practice, it is also in your best interest to set up a billing alert. This is no joke. I have seen plenty of horror stories on reddit on this subject. It only takes 5 minutes and it can save you some headaches. You first need to go to CloudWatch in the US East (N. Virginia) region, where the billing alerts are handled.
Then type billing in the metric search bar and select Total Estimated Charge:
Then select the EstimatedCharges metric in the “All Metric” tab:
And then in the “Graphed Metric” tab we can now set an alarm:
Then it is possible to choose the threshold amount and the email address to contact when the estimated cost reaches this level:
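The billing alarm set up in this step could be sketched from the terminal as follows. This is a sketch under assumptions: the AWS CLI is configured, billing alerts are already enabled in the account's billing preferences, and the notification goes through an SNS topic. The function name create_billing_alarm, the topic name billing-alarm and the 6-hour period are illustrative choices, not values from this post.

```shell
# Sketch: e-mail alert when estimated charges reach a threshold (in USD).
# Billing metrics live in us-east-1 (N. Virginia), hence the fixed region.

create_billing_alarm() {
    threshold_usd="$1"            # e.g. 10
    email="$2"                    # address to notify

    # The SNS topic carries the notification; its ARN is reused below.
    topic_arn=$(aws sns create-topic --name billing-alarm \
        --region us-east-1 --query 'TopicArn' --output text) || return 1

    # The address receives a confirmation e-mail that must be accepted.
    aws sns subscribe --region us-east-1 --topic-arn "$topic_arn" \
        --protocol email --notification-endpoint "$email" || return 1

    aws cloudwatch put-metric-alarm \
        --region us-east-1 \
        --alarm-name "billing-over-${threshold_usd}-usd" \
        --namespace AWS/Billing \
        --metric-name EstimatedCharges \
        --dimensions Name=Currency,Value=USD \
        --statistic Maximum \
        --period 21600 \
        --evaluation-periods 1 \
        --threshold "$threshold_usd" \
        --comparison-operator GreaterThanOrEqualToThreshold \
        --alarm-actions "$topic_arn"
}

# Example usage (commented out so that nothing runs by accident):
# create_billing_alarm 10 you@example.com
```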
4. create a new instance
Now that we have created an alert, we can dive in. Virtual machines can be instantiated in the EC2 Dashboard. It can be accessed as follows:
Back in the EC2 dashboard, in the “Instances” section, select “Launch Instance”:
The new instance OS is chosen among the available images on AWS (here I choose Ubuntu Server 16.04 LTS).
Following this, the computing power of the instance must be chosen. There are many options. For a test, it is better to choose a small, general purpose configuration:
Before launching the virtual machine, the last step is to choose the key pair to associate with it.
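The launch choices made in this step (image, instance size, key pair) map onto a single AWS CLI call, sketched below. The assumptions are loud ones: the image ID must be looked up for your region (none is given in this post), t2.micro is only an illustrative default for a small general purpose instance, and the sketch relies on the account's default security group, which may not allow inbound SSH. The function name launch_instance is hypothetical.

```shell
# Sketch: launch one instance and print its instance ID.
# Assumption: the AWS CLI is configured for the region holding the image.

launch_instance() {
    ami_id="$1"                       # image ID of the chosen OS (region-specific)
    key_name="$2"                     # name of the key pair created in step 1
    instance_type="${3:-t2.micro}"    # illustrative default, not from the post

    aws ec2 run-instances \
        --image-id "$ami_id" \
        --instance-type "$instance_type" \
        --key-name "$key_name" \
        --count 1 \
        --query 'Instances[0].InstanceId' \
        --output text
}

# Example usage (commented out; the image ID is a placeholder):
# launch_instance ami-xxxxxxxx myec2instance
```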
5. connect remotely to instance
After a few minutes, the new instance should be running. It is possible to check in the EC2 dashboard:
By clicking on the 1 Running Instances link, we get:
Now, in order to connect to the instance remotely, we need to go through an SSH client. The key pair will allow us to connect securely to the server:
ssh -i myec2instance.pem ubuntu@[EC2 Public DNS]
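To avoid copying the public DNS by hand, the lookup and the connection settings could be scripted. The sketch below assumes a configured AWS CLI and an Ubuntu image (hence the ubuntu user, as in the command above). The names instance_public_dns and ssh_config_entry, as well as the myec2 alias, are hypothetical.

```shell
# Sketch: wait for the instance, fetch its public DNS, build an SSH shortcut.

# Blocks until the instance is running, then prints its public DNS name.
instance_public_dns() {
    instance_id="$1"
    aws ec2 wait instance-running --instance-ids "$instance_id" || return 1
    aws ec2 describe-instances \
        --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].PublicDnsName' \
        --output text
}

# Prints an OpenSSH client configuration block for the instance.
ssh_config_entry() {
    alias_name="$1"               # short name used on the command line
    public_dns="$2"               # value printed by instance_public_dns
    key_file="$3"                 # path to the .pem file
    cat <<EOF
Host $alias_name
    HostName $public_dns
    User ubuntu
    IdentityFile $key_file
    IdentitiesOnly yes
EOF
}

# Example usage (commented out; the instance ID is a placeholder):
# dns=$(instance_public_dns i-xxxxxxxxxxxxxxxxx)
# ssh_config_entry myec2 "$dns" "$HOME/myec2instance.pem" >> "$HOME/.ssh/config"
# ssh myec2
```

Keep in mind that the public DNS name usually changes when the instance is stopped and started again, so the entry may need to be regenerated.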
And you are all set. Some of my upcoming posts (Docker, API with Flask,…) will require setting up an EC2 instance, so this tutorial will come in handy if you are still following me then.