CloudForest, An Overview

On-Demand Cloud Computing at Stanford GSB


W. Ross Morrow  Research Computing Specialist and Developer , DARC
Alex Storer  Director , DARC

01 May 2019

GSB faculty, their research staff, and collaborators use a lot of computing power, often in excess of what is available in our dedicated, on-premise interactive and batch computing environments. Occasionally research might be improved using hardware accelerators, such as GPUs or FPGAs, that aren’t readily available on campus to GSB researchers. Here we provide an overview of CloudForest, an on-demand cloud computing solution built at Stanford GSB.

Over the past year or so, GSB’s DARC team has been rebuilding a system to allow GSB faculty and their collaborators to easily spin up cloud instances in AWS EC2 for research computing. Our model is more of an interactive than a batch computing model: researchers get access to a remote machine that, for all intents and purposes, looks like a machine in our on-premise clusters, except that there isn’t a myriad of users simultaneously using it for various tasks. They can log into these machines via ssh and start work immediately, just as they would on our on-premise computers. We are not using batch tools like AWS ParallelCluster largely because (i) researchers at GSB are currently used to interactive environments, and (ii) we already have an on-premise batch submission environment in Sherlock.

As illustrated in the figure below, CloudForest 3+ couples a web-based frontend with a backend built in full-stack javascript (react.js/redux.js, node.js, and MongoDB), uses TICK stack monitoring, automatically stops instances that idle for too long, and initializes instances using continuous deployment (CD) tools. Our goal is to make it easy, fast, and secure for GSB researchers to obtain excess computing resources in the cloud, while making it cost efficient and easily auditable, maintainable, and upgradable for us in DARC.

Technologies and their relationships in a running CloudForest platform.

The rest of this (very) high-level overview proceeds as follows: First we cover the key features in CloudForest 3+. Second we briefly review the AWS services used. Third we give a more detailed look at CloudForest’s dashboard, and the functionality it gives users. Then we discuss four technical pieces that really make CloudForest tick: the primary backend server, cfserver; the database (in MongoDB); monitoring and automation; and how we set instances up after launch. Then we discuss how we maintain CloudForest in a tiered test/staging/production environment. Finally we provide an overview of usage during our “beta” phase which started around October 2018. For deeper details on all these topics see our documentation.

Features

CloudForest 2

Before describing our new system, it is probably helpful to review the old one.

CloudForest 2 was a terminal-based python application (literally cf2) that ran on our on-premise research servers, the yens. Users could create, start, stop, and ssh into instances and clusters of instances through this application. They could also maintain a directory on the yens that was automatically synced to any instances in their “project”. Technically, instances and clusters were managed using terraform.

While this platform generally functioned and provided GSB researchers with cloud computing capacity, we eventually recognized a number of key problems we felt required comprehensive attention:

  • Not having built the system ourselves, it was very difficult to debug any issues users confronted without external support.

  • CloudForest 2 was hosted and depended on the yen computers, with two consequences: First, CloudForest could not be used when there were problems with the yens. Second, an ssh-from-ssh structure (users ssh to the yens, and from there implicitly ssh to an instance) limited the means via which researchers could access their cloud instances.

  • Because the CloudForest 2 tool depended on the yens, the tool occasionally failed due to various vague permissions errors when users tried to connect to their systems.

  • When users connected to instances via ssh, they connected using ssh keys for a sudo-enabled user, breaking typical Unix permissioning models.

  • The sync process to maintain consistency between specific directories on the yen systems and CloudForest instances was unclear, in that it could not inform users of progress in syncing data, and wasteful, in that it could only sync the same content to all instances in a group.

  • We had to rely on users to stop instances in the terminal-based application when they were finished using them. Mostly users did not stop their instances, and much of the cost DARC paid was for idle time.

  • terraform and AWS services did not always “play nice” together. In one case, 4 out of 5 instances were launched by terraform before hitting AWS resource limits, without any notification that all 5 were not launched. When the user stopped the cluster, terraform and/or AWS launched the last instance, which remained unused in our system for 2 weeks before it was noticed and stopped. This incident alone cost DARC about $1,500 (with no productive work taking place on the “phantom” instance).

New Features

A number of features of the new system differentiate it from the old system:

  • Independence from On-Premise Computing: The new CloudForest is completely independent of GSB’s on-premise computing environments. Everything runs from the cloud. The yen computers can be unplugged and CloudForest won’t care.

  • Web Interface: CloudForest now appears to users as a web-based application. Many of our users are more accustomed to GUI applications than terminal-based ones, and this type of interface is much easier for most of them to use. This includes notebooks for interactive computing right from the browser.

  • Easy SSH Login: Users are defined on the system as is typical on multi-user, permissioned systems and are able to ssh/scp into their machines using standard tools (when complying with firewall rules, of course). We also have a simplified ssh wrapper, cfssh, wherein users can ssh into a machine using only its CloudForest name, as in
    $ cfssh my-cool-instance
    
  • Stanford 2FA: Both the web interface and ssh/scp into any instance rely on Stanford 2-Factor Authentication, harmonizing with other Stanford systems.

  • Instance Autostop: CloudForest now has an autostop protocol, wherein instances are stopped after “idling” for 24 hours or whatever length of time costs 20 USD. “Idling” for H hours means that the maximum, 5-minute average total CPU utilization over H hours was less than 5% per CPU core.

  • Self-Managed User Groups: Faculty have tools in the dashboard to add or remove users to their user groups themselves, rather than relying on DARC to act on requests for access.

  • Cost Visibility: Although DARC pays the bill for EC2 usage, it is still useful for faculty to understand what charges are incurred by their work on EC2 instances. Moreover, it is good for us to have an independent view of the costs incurred by the CloudForest platform. Cost rates and total costs are clearly visible in our dashboard.

  • No Clusters: We have found that running clusters is difficult and error prone, while research at GSB does not really depend on running highly parallelized, multi-machine codes. In the rare cases where this is warranted, we have other available tools. So for CloudForest 3+ we allow only “instances” not “clusters”.

  • On-Instance Stop: Users can call a simple executable cfstop to stop an instance from itself, for example when work on some task has completed. Using this tool can further improve cost efficiencies.

  • Issue Reporting: Through code.stanford.edu DARC formally tracks issues related to the platform.

  • Built In-House: The new system was built entirely in house. We feel this gives us a better chance to debug and maintain the system when users encounter errors.

AWS Services Used

CloudForest is essentially a platform to make it easier for faculty and their research collaborators to use EC2 instances for research purposes. Of course, this requires using a variety of AWS services, not just EC2. This section reviews the AWS services CloudForest depends on, and very briefly describes how.

Identity and Access Management (IAM)

AWS IAM helps provide control over who can do what in your AWS accounts. IAM tools are used to provide account credentials for our backend to securely execute calls to the other services used. Users of CloudForest never themselves see or use IAM credentials.

Elastic Compute Cloud (EC2)

AWS EC2, AWS’s “generic” computers-in-the-cloud service, is the backbone of CloudForest. All our machines and their data are launched within EC2 as EC2 instances and EBS volumes, respectively. Our webserver, backend server, and monitoring server also all run in EC2 (as reserved instances).

Simple Storage Service (S3)

AWS S3 is a cloud data hosting service. Formally, S3 services are used to host EBS volume snapshots and AMIs associated with CloudForest. We are also considering S3 as a transfer storage location for data researchers might want to move across computing platforms.

Simple Email Service (SES)

AWS SES is an outgoing email management system within AWS. We use the SES service to send notification emails to our users when instances are requested, when instances are ready to use, and when instances are about to autostop (and when they have).
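As a rough, hedged sketch of what one of these notifications involves (the region, addresses, and sender below are hypothetical placeholders, not our actual configuration), a node.js call to SES looks something like:

// Hedged sketch: sending a CloudForest notification email via SES from node.js.
// The region, addresses, and message content below are hypothetical placeholders.
const AWS = require("aws-sdk");
const ses = new AWS.SES({ region: "us-east-1" });

async function sendNotification(to, subject, body) {
  // sendEmail takes a Destination, a Message (Subject/Body), and a Source
  const result = await ses.sendEmail({
    Destination: { ToAddresses: [ to ] },
    Message: {
      Subject: { Data: subject },
      Body: { Text: { Data: body } }
    },
    Source: "cloudforest-noreply@example.stanford.edu" // hypothetical sender
  }).promise();
  return result.MessageId; // the SES messageId kept with the email record
}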

CodeDeploy

CodeDeploy is a cloud software deployment service for AWS. We use CodeDeploy as the engine behind instance setup to make sure any instances launched are properly instantiated for use by CloudForest users. CodeDeploy also makes it easy for us to deploy new features or upgrades to the pool of running instances.

Lambda

AWS Lambda is a “serverless” computing platform wherein you submit code that can run after certain triggers, and AWS handles execution of this code in a highly available, scalable fashion. We use Lambda in tandem with CloudWatch events to notify our backend when EC2 instances and/or EBS volume snapshots change state. This helps us write stateless backend code, rather than code that polls the AWS system for data when certain highly asynchronous tasks occur.
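A minimal sketch of such a Lambda handler is below. The cfserver hostname, route, and CF_API_KEY environment variable are hypothetical placeholders, and the real functions carry more context, but the shape is the same: pull the instance id and new state out of the CloudWatch event and POST it to the backend.

// Hedged sketch of a Lambda handler for an EC2 state-change CloudWatch event.
// The cfserver hostname, route, and CF_API_KEY variable are hypothetical.
const https = require("https");

exports.handler = async (event) => {
  // EC2 state-change events carry the instance id and the new state
  const payload = JSON.stringify({
    instanceId : event.detail["instance-id"],
    state      : event.detail.state,
    time       : event.time
  });

  await new Promise((resolve, reject) => {
    const req = https.request({
      hostname : "cloudforest.example.stanford.edu",   // placeholder
      path     : "/alert/aws/instance-state",          // placeholder route
      method   : "POST",
      headers  : {
        "Content-Type"   : "application/json",
        "Content-Length" : Buffer.byteLength(payload),
        "Authorization"  : process.env.CF_API_KEY      // an API key cfserver issued
      }
    }, (res) => { res.on("data", () => {}); res.on("end", resolve); });
    req.on("error", reject);
    req.write(payload);
    req.end();
  });
};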

The Dashboard

The centerpiece of CloudForest for users is probably the (SSO-protected) dashboard. In this section we sketch the appearance and major features of this part of the platform.

Registration

When you first go to the dashboard, if you aren’t yet known to the system you get a “registration” page where you can verify your information and get added as a CloudForest user.

However, you can’t add yourself to any groups. An administrator of a particular group has to actively add you for this “user status” to be meaningful.

An Example Dashboard

A non-trivial example dashboard is shown below:

You can choose your group (if you are in more than one) and see the instances associated with that group. For each instance you get to see its name, type, cost rate, total cost, state, current CPU and memory usage, and whether it is “synced” with cfserver for state updates, followed by a sequence of action buttons. These actions are start, stop, connect, open a notebook, open the monitor for detailed CPU/memory usage data, view/add notes, submit an issue, and delete.

The monitor queries data from the TICK stack for this instance and displays it in the browser:

Notes can be added by users to describe what is happening with the instance or added by the system to describe important events:

Requesting Instances

The dashboard has an embedded request form that appears as a large modal dialog when the + button is clicked. This form looks something like the following:

In filling out this form the user provides general information about what they are doing with the instance, what type of instance they want, how much storage they want, how restrictive they want the connection firewall to be, and how long they think they will use it for. There are a lot of instance types, so we include the ability to filter what instance types show up based on CPUs, memory, and special hardware like GPUs. Note that you have to agree to the usage and data policies to enable the “submit” button and actually create an instance.

Implementation

The dashboard is programmed entirely in react.js/redux.js on top of our rssreacts template for Stanford SSO-enabled apps. Most of the UI components come from Material UI, though we use some draft.js text fields. The dashboard interacts with cfserver using axios for http requests and socket.io for sockets. The dashboard is actually a set of static assets served by apache, as built by yarn (yarn build). We do have our own build “superscript” bundleup.py from rssreacts, though, that makes the build-and-deploy process easier and configuration-driven. Changes to the front-end can be deployed to development and production environments with a few simple commands.
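For a flavor of the http-plus-sockets pattern, here is a hedged sketch; the server URL, route, event name, and store module path are hypothetical placeholders rather than our actual API:

// Hedged sketch of dashboard-to-cfserver communication: axios for http
// requests, socket.io for pushed updates. Names below are placeholders.
import axios from "axios";
import io from "socket.io-client";
import { store } from "./store"; // the dashboard's redux store (hypothetical path)

const CF_SERVER = "https://cloudforest.example.stanford.edu"; // placeholder

// fetch a group's instances over http, passing the login token for authorization
export async function fetchInstances(groupId, token) {
  const res = await axios.get(`${CF_SERVER}/group/${groupId}/instances`, {
    headers: { Authorization: token }
  });
  return res.data;
}

// subscribe to pushed instance state changes over a socket
const socket = io(CF_SERVER);
socket.on("instance-state", (update) => {
  // dispatch a redux action so the relevant dashboard components re-render
  store.dispatch({ type: "INSTANCE_STATE", payload: update });
});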

Custom Server(s)

CloudForest has a custom backend, cfserver, a node.js (express) server that manages groups, users, requests, instances, alerts, and more. cfserver mediates all communication with the MongoDB database (described below). cfserver is also almost “stateless” – http requests are stateless, but not socket connections – although we have not formally tested this property. By “stateless” we mean any particular process running cfserver could respond to any request made to cfserver; a particular process shouldn’t need to respond to particular subsets of requests, like those coming from a single dashboard session. A stateless server is useful in that it is, in principle, horizontally scalable.

APIs

cfserver “installs” and runs various (versioned) http APIs to work with groups, users, requests, instances, volumes, alerts, errors, and more. Thus cfserver mediates interactions with the database and takes actions in AWS corresponding to specific requests (like creating, starting and stopping instances). All interactions with the database and AWS occur through appropriate cfserver API calls.

Security

If enabled, cfserver can require authorization for all http requests as well as implement Role-Based Access Control (RBAC). Both are optional, but important in that cfserver can create, change, and delete EC2 resources as well as interact with the database. Without API access restrictions, malicious (or partially competent) actors could do a lot of damage to a running CloudForest platform.

Authorization

When authorization is enabled, every http request made to cfserver must have an Authorization header with a suitable value for making requests. Suitable values are either (i) an API key created by cfserver and stored in the database or (ii) a login token from our rsslogins server that has not yet expired. express middleware ensures that, before any actions take place, any request is checked for this Authorization header and the value evaluated. The value is first checked against the current list of API keys in the database and, if it is not an API key, then checked with the login server to see if it is a valid token. Requests with a valid API key or login token are allowed to proceed.
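Schematically, the check looks something like the following hedged sketch, where lookupApiKey and verifyLoginToken are hypothetical stand-ins for the real database and rsslogins calls:

// Hedged sketch of the Authorization check as express middleware.
const express = require("express");
const app = express();

// hypothetical stand-ins for the real database query and login-server check
const lookupApiKey     = async (value) => false; // is value a stored API key?
const verifyLoginToken = async (value) => false; // is value an unexpired login token?

app.use(async (req, res, next) => {
  const value = req.get("Authorization");
  if (!value) return res.status(401).json({ error: "missing Authorization header" });
  if (await lookupApiKey(value))     return next(); // (i) API key from the database
  if (await verifyLoginToken(value)) return next(); // (ii) valid token from rsslogins
  return res.status(403).json({ error: "invalid API key or login token" });
});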

Role-Based Access Control (RBAC)

Authorizing requests does not in itself provide any granularity for permissions to execute certain commands via API routes. cfserver implements Role-Based Access Control (RBAC) with casbin in express middleware. The middleware checks the roles currently adoptable by the Identity associated with the request (as stored in a header, though it may be required to match API key and/or token values) against permissions for those roles to ensure the caller can execute the route.
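A minimal sketch of this kind of check with node casbin is below; the model/policy files, the Identity header, and the mapping from routes to resources are simplified placeholders, not cfserver’s actual configuration:

// Hedged sketch of RBAC enforcement with casbin as express middleware.
const { newEnforcer } = require("casbin");

// in practice the enforcer is built once at startup, not per request
const enforcerPromise = newEnforcer("rbac_model.conf", "rbac_policy.csv"); // placeholder files

async function rbac(req, res, next) {
  const enforcer = await enforcerPromise;
  const identity = req.get("Identity");          // who is making the call (placeholder header)
  const resource = req.baseUrl + req.path;       // which route they want
  const action   = req.method.toLowerCase();     // what they want to do with it

  // enforce() checks the identity's roles against the policy for this resource/action
  const allowed = await enforcer.enforce(identity, resource, action);
  if (!allowed) return res.status(403).json({ error: "not permitted" });
  next();
}

module.exports = rbac;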

Logs

We uniformly log all requests made to cfserver as well as all associated responses. Request and Response log entries look something like the following:

[2019-05-06T10:26:52.771] [INFO] info - REQLOG,GET,/user/ayurukog,6e9fc245e5a923666cd928f9,WHVsKHJ...
[2019-05-06T10:26:52.853] [INFO] info - RESLOG,6e9fc245e5a923666cd928f9,GET,/USER/:ID,304,JSON,1557163612.771,1557163612.853,0.082
[2019-05-06T10:26:52.910] [INFO] info - REQLOG,GET,/group/656d24e258f6d0fda608ba08,7becdedb1b82910aa783254b,WHVsKHJ...
[2019-05-06T10:26:53.546] [INFO] info - RESLOG,7becdedb1b82910aa783254b,GET,/GROUP/:ID,304,JSON,1557163612.910,1557163613.546,0.636

REQLOG declares that the line describes a request, with format

<method>,<URL>,<response key>,<login token>

RESLOG declares that the line describes a response, with format

<response key>,<method>,<URL>,<status>,<response method>,<start time>,<end time>,<duration>

The two are linked by the response keys, e.g. 6e9fc245e5a923666cd928f9 and 7becdedb1b82910aa783254b here.

Testing

Launching cfserver creates a test script test/routes.sh you can run locally. You can also remotely make a GET HTTP request to /tests on a running cfserver instance and get back a testing script tailored for that server.

This testing script is created entirely from tests defined in the code itself. Specifically, any API route in cfserver is defined by adding a field to an (exported) object routes as follows:

// generic "root" call
routes.root = { 
  route    : "/" , 
  method   : "get" , 
  tests    : [ { status : 200 } ] , 
  callback : ( req , res ) => { 
    res.send( "Hi! I'm the DARC CloudForest server at Stanford GSB.\n" ); 
  }
};

The “tests” field specifies a list of tests to run as objects whose fields define the call and its expected result. In this case, the single test for the route / is a straight-up call to GET / with an expected return status of 200 (success). Tests can be specified with parameters too (the things specified by the /: parts of route URLs) and need not be expected to succeed. For example,

const defaultGroupNameTests = [ 
  { params : { name : "test" } , status : 200 } , 
  { params : { name : "notagroup" } , status : 404 } , 
];

specifies two tests: a test for group name calls with parameter test that should succeed (200) and one with parameter notagroup that should fail (404 NotFound). In this way the code itself, near the routes themselves, can specify how they should be tested.
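To make this concrete, here is a hedged sketch of how a routes object like this might be both mounted on express and turned into a test script; the real cfserver logic handles more cases, and the curl form of the test lines is illustrative only.

// Hedged sketch: mount each route on an express app and collect its tests
// into shell (curl) lines that could be written out as test/routes.sh.
const express = require("express");
const app = express();
const routes = require("./routes"); // hypothetical module exporting the routes object

const testLines = [];
for (const name of Object.keys(routes)) {
  const r = routes[name];
  app[r.method](r.route, r.callback); // e.g., app.get("/", callback)
  for (const t of (r.tests || [])) {
    // substitute any :param segments in the route with the test's parameter values
    let url = r.route;
    for (const p of Object.keys(t.params || {})) {
      url = url.replace(`:${p}`, t.params[p]);
    }
    testLines.push(
      `curl -s -o /dev/null -w "%{http_code}\\n" -X ${r.method.toUpperCase()} "$SERVER${url}" # expect ${t.status}`
    );
  }
}
// testLines.join("\n") is (roughly) the body of the generated test script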

CloudForest’s Database

Like many services, CloudForest depends on a database. We use a MongoDB Atlas instance to store data related to the new CloudForest platform. This database has a variety of collections, briefly reviewed below, that serve both static and dynamic needs. That is, the database stores AWS EC2 resource data which should rarely change as well as instance data which can change rapidly. There are actually two databases, one for (sandboxed) development and one for production, each having the same structure but different content.

Technically, we have an M20-series instance hosted in AWS’s us-east-1 region. This class provides 2 vCPUs, 4 GB RAM, 20 GB storage, a maximum of 700 connections, and continuous backups. Given that cfserver mediates all interaction with the database, we aren’t yet greatly concerned about database performance other than read/write latency. In fact, for most of our beta run we operated out of the free-tier M0 class, with only shared resources, 512 MB of storage, and no backup, and for all intents and purposes everything worked fine.

Each collection used in CloudForest is described below:

Static Data

By “static data” we mean data held in the database that does not change particularly often, and certainly not through regular use of the CloudForest platform. This could perhaps be thought of as the high-level configuration data for the system.

options

Our options collection stores configuration options that influence how cfserver operates. There are four main categories:

  • server, containing options related to server setup and operation

  • aws, containing AWS-related options (including credentials)

  • comms, containing communication-related options (e.g., Slack channel POST url)

  • and metrics, containing a description of the metrics server

By storing these options online in the database we make online administration possible. Options can be changed and the server “reloaded” with these updated settings from our Admin page, without ever having to actually log in to the instance running cfserver.
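For illustration only, an options document might have a shape like the following; the specific fields shown are hypothetical, not our actual configuration:

{
    "_id" : ObjectId("xxxxxxxxxxxxxxxxxxxxxxxx"),
    "server"  : { "port" : 3000 , "logLevel" : "info" },
    "aws"     : { "region" : "us-east-1" , "credentials" : { ... } },
    "comms"   : { "slack" : "https://hooks.slack.com/services/..." },
    "metrics" : { "host" : "metrics.example.stanford.edu" , "port" : 443 }
}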

awsdata

The awsdata collection holds data on the AWS EC2 instance types that can be used in the platform. Specifically, awsdata is a collection of json objects describing EC2 instance types along with any internal flags like availability (whether we allow the instance type or not).
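A hypothetical awsdata document, with illustrative field names only, might look like:

{
    "_id" : ObjectId("xxxxxxxxxxxxxxxxxxxxxxxx"),
    "instanceType" : "c5.4xlarge",
    "vcpus" : 16,
    "memory" : 32,
    "gpus" : 0,
    "available" : true
}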

templates

The templates collection keeps the pug templates used to construct the emails we send corresponding to various events. Templates are identified by their name and have body and subject fields (as strings in pug markup form), a description, and a dictionary of variables the pug templates for the body and subject will expect.

Dynamic Data

By “dynamic data” we mean data held in the database that does change often, and primarily through regular use of the CloudForest platform.

shadow

The shadow collection holds platform secrets like user passwords (in unix-hashed form, never cleartext) and API keys. There are no large-scale access routines for data from shadow written into cfserver.

users

Data about who can use the system is stored in the users collection. Each user is primarily identified by their uid, has name details, and stores group memberships by MongoDB _id and name:

{ 
    "_id" : ObjectId("xxxxxxxxxxxxxxxxxxxxxxxx"), 
    "uid" : "cfuser", 
    ... 
    "groups" : {
        "xxxxxxxxxxxxxxxxxxxxxxxx" : "test-group", 
        "xxxxxxxxxxxxxxxxxxxxxxxx" : "facultyA"
    }, 
    "name" : {
        "first" : "Some", 
        "middle" : "C. F.", 
        "last" : "User", 
        "display" : "C. F. User"
    }
}

There is other data as well, particularly a user profile, history, and created/updated times.

groups

The groups collection stores information about the CloudForest groups. Groups are mainly identified by their name, but can also hold a description, have a list of admins (the uids of the admin users), a list of members (by _id, with name and “rights” in the group), a list of instances the group currently has (by _id and name), and a history of all instances the group has had (by _id and name).
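By analogy with the users example above, a hypothetical group document might look something like the following; the exact field layout here is illustrative only:

{
    "_id" : ObjectId("xxxxxxxxxxxxxxxxxxxxxxxx"),
    "name" : "test-group",
    "description" : "An example group",
    "admins" : [ "facultyuid" ],
    "members" : {
        "xxxxxxxxxxxxxxxxxxxxxxxx" : { "name" : "C. F. User" , "rights" : "member" }
    },
    "instances" : {
        "xxxxxxxxxxxxxxxxxxxxxxxx" : "my-cool-instance"
    },
    "history" : {
        "xxxxxxxxxxxxxxxxxxxxxxxx" : "an-older-instance"
    }
}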

requests

instances

volumes

alerts

Any alert sent to cfserver from an instance, AWS Lambda functions, or from kapacitor is stored in the alerts collection.

emails

The emails collection holds a permanent record of the emails cfserver sends to users of the platform via AWS’s SES service. This includes the Destination (all To, Cc, and Bcc recipients), Message (body and subject), source, state (e.g., sent), and an SES messageId.

errors

The errors collection holds errors that are POSTed to cfserver by the ErrorBoundary in our react.js dashboard.

Monitoring and Automation

CloudForest uses the TICK stack to monitor instance use and facilitate automation. Specifically, we install and run telegraf to collect and ship monitoring data from running instances to another instance running influxdb, chronograf, and kapacitor. DARC uses a single reserved instance running the “ICK” part of the stack for all our monitoring purposes, including CloudForest. We use kapacitor templates to instantiate scripts, when instances start, that monitor for various conditions such as disks reaching capacity or instances idling. chronograf is a convenient web UI to explore data, but accesses influxdb through a user with permissions only to read certain databases.

We have found the TICK stack to be a fantastic set of products for this purpose. It is relatively easy to install and maintain, data is stored very compactly and easy to query locally and remotely, and telegraf ships highly detailed metrics out of the box. Our automation strategy is based entirely on kapacitor, which would be impossible were it not flexible and relatively easy to work with.

Stopping Idling Instances

Our main example is idling warnings and instance stopping, which we review here. Whenever an instance is running, telegraf is shipping metric data, including CPU utilization, to our influxdb instance. kapacitor is also running on that instance, and subscribes to updates from influxdb. There are kapacitor tasks set up for any running instance that watch CPU utilization – technically 100.0-usage.idle – and create (i) a stream of averages of CPU utilization over non-overlapping five-minute windows and (ii) a windowed maximum of those averages over an hour, over an instance-dependent idling time limit, and over an hour less than that limit. Part of the tasks logs the resultant maximum data for auditing and review purposes. But when the maximum values are too small, another part of the tasks can trigger alerts to cfserver via https requests according to the time period over which the maximum value was too small. cfserver can then act based on these alerts, for example by sending email notifications that an instance might stop soon if not used, or actually stopping an instance if it has been idling for too long.
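On the cfserver side, the alert handling reduces to something like the following hedged sketch, in which the route, the payload shape, and the helper functions are hypothetical stand-ins for the real EC2, SES, and database calls:

// Hedged sketch of cfserver acting on idle alerts POSTed by kapacitor.
const express = require("express");
const app = express();
app.use(express.json());

// hypothetical stand-ins for the real EC2 / email / database logic
const idleLimitFor = (instanceId) => 24;              // hours; instance-dependent in reality
const stopInstance = async (instanceId) => { /* ec2.stopInstances(...) */ };
const notify       = async (instanceId, template) => { /* templated SES email */ };

app.post("/alert/idle", async (req, res) => {
  const { instanceId, hours } = req.body;             // how long the instance has idled

  if (hours >= idleLimitFor(instanceId)) {
    await stopInstance(instanceId);                   // idled too long: stop it
    await notify(instanceId, "instance-stopped");
  } else {
    await notify(instanceId, "idle-warning");         // warn that a stop is coming
  }
  res.sendStatus(200);
});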

cfmetrics

CloudForest monitoring relies on another server, called cfmetrics, that runs on the instance that holds the influxdb installation. The purpose of this server is just to “proxy” requests for data from influxdb and task definitions in kapacitor. This way, we don’t need to code detailed calls for data and alerts right into cfserver, although that is an aesthetic decision as much as a technical one. cfmetrics also serves as an https proxy: influxdb can run with a self-signed certificate while cfmetrics runs with proper CA certs to satisfy requests from in-browser apps.
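A bare-bones sketch of the proxying idea follows; the route, database name, and query are placeholders, and the real cfmetrics does more (kapacitor task definition, authorization, and so on):

// Hedged sketch of a cfmetrics route that proxies a CPU-usage query to the
// local influxdb HTTP API and relays the result to the browser.
const express = require("express");
const axios = require("axios");
const https = require("https");
const app = express();

app.get("/data/:host", async (req, res) => {
  // influxdb 1.x exposes GET /query?db=...&q=...; ask for 5-minute CPU averages
  const q = `SELECT mean("usage_idle") FROM "cpu" WHERE "host" = '${req.params.host}' ` +
            `AND time > now() - 1h GROUP BY time(5m)`;
  try {
    const r = await axios.get("https://localhost:8086/query", {
      params: { db: "telegraf", q },
      // influxdb may sit behind a self-signed certificate; cfmetrics terminates
      // the properly-certified https connection for the browser
      httpsAgent: new https.Agent({ rejectUnauthorized: false })
    });
    res.json(r.data);
  } catch (err) {
    res.status(502).json({ error: err.message });
  }
});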

Instance Setup

One important goal is that when users log into a CloudForest instance, it feels familiar: the same software is there, the same layout for home and data directories, and so on. For us, the admins, we also want to make sure software like telegraf is running so we can collect data on instances and act on events described by that data, among other things.

Some of these setup tasks can be handled by installing software in the AMI: all our analysis software (python, R, MATLAB, Stata SE/MP, etc) is pre-installed, as is telegraf and many other useful packages. Take telegraf as an example, though. If it is installed and running in the AMI, then as soon as an EC2 instance boots up it will send data tagged with the host configured in the AMI. This isn’t useful at all, because we need instance-specific host tags in the telegraf data to show users their data or alert based on events for particular machines. We have to do some telegraf setup after the instance launches for this to be useful. There are quite a few other “setup” tasks that must be done. For example, AWS EBS volumes are allocated and attached to instances when they are launched, but those volumes are not necessarily formatted or mounted to the instance. If they aren’t, the volumes are useless.

Our Setup

As of this writing our setup tasks include:

  • Installing the latest versions of useful CloudForest CLI tools on each instance
  • Correctly specifying the hostname of an instance, based on its given name
  • Correctly configuring and starting telegraf monitoring, given the correct host name
  • Obtaining https certificates (via LetsEncrypt) for each instance under its correct name
  • Starting security services like splunk log forwarding and qualys software scanning
  • Formatting and mounting volumes to an instance
  • Identifying and creating the users that should be defined on the system, and deleting users that shouldn’t be
  • Configuring and starting a jupyter notebook server to enable in-browser computing

We actually accomplish all these tasks in a fairly general way using a single script install.sh that searches a directory for subfolders with their own install.sh files and runs any of those. Our implementation also allows for:

  • install-specific environment variable definitions in an install.env file,
  • “prerequisites” as listed in a single after file (e.g., starting telegraf and splunk should follow the definition of the hostname),
  • and failure tracking.

In this way adding new installs for, say, new CloudForest CLI tools, security features, or software upgrades is actually fairly easy.

Implementation

We store all these “installs” in a bitbucket repository and use pipelines connected to AWS CodeDeploy to define what repository should be deployed to running instances. Valid running instances to deploy to are identified with tags. Our development and production environments actually use completely different AWS accounts, specified in pipelines using different repository and deployment variables. However we also have a “staging” environment within our production CloudForest defined by a particular group to enable trial deployment of our setup routines in production. This is an important feature; bad updates to our setup routines deployed directly to production have deleted user data.

Deployment Environments

Testing/Development

We maintain a sandboxed testing/development environment in a completely separate AWS account. This account holds two EC2 instances: one running cfserver and one running the TICK stack as well as cfmetrics for monitoring.

The main git branch for development in cfserver, cfmetrics, and cfinstancedeploy is aptly named development.

Staging

Our “staging” environment is a subset of the production environment consisting of instances in a particular group, staging. In particular, staging enables us to roll out changes to cfinstancedeploy to production before they affect users’ running instances. Changes to cfinstancedeploy should be committed and pushed in the following order:

development -> staging -> production

Production

Our production environment is our main AWS account with two servers (on AWS reserved EC2 instances): one hosting the website and cfserver (among other tools provided by DARC) and another hosting the TICK stack installations and cfmetrics.

The webserver machine is an m4.2xlarge instance with 8 vCPUs, 32 GB memory, and 256 GB of storage. This machine serves CloudForest-related webpages, including the dashboard, using apache and runs a single cfserver process on another port. This machine is multi-purpose, hosting several of DARC’s tools and service domains including CloudForest.

The TICK stack server is an r5d.2xlarge instance with 8 vCPUs, 64 GB memory, and 300 GB of NVMe SSD instance storage on which we hold the influxdb database. We use an attached EBS volume and a cron/rsync job to capture daily snapshots of the data stored there. This server is our TICK stack instance for all computing metrics collection at DARC, not only CloudForest.

The main git branch for production in cfserver, cfmetrics, and cfinstancedeploy is conventionally named master.

Usage to Date

Conclusions

Acknowledgements

We aren’t too specific here about who helped with what. The entire DARC team, and more, have contributed a lot to this project, including (in no particular order) Arun Aaggar, Luba Gloukhova, Mason Jiang, Wonhee Lee, Sal Mancuso, Amy Ng, and Jason Ponce, among others.