CloudForest, An Overview
On-Demand Cloud Computing at Stanford GSB
GSB faculty, their research staff, and collaborators use a lot of computing power, often in excess of what is available in our dedicated, on-premise interactive and batch computing environments. Occasionally research might be improved using hardware accelerators, such as GPUs or FPGAs, that aren't readily available on campus to GSB researchers. Here we provide an overview of CloudForest, an on-demand cloud computing solution built at Stanford GSB.
Over the past year or so, GSB's DARC team has been rebuilding a system to allow GSB faculty and their collaborators to easily spin up cloud instances in AWS EC2 for research computing. Our model is more of an interactive than a batch computing model: researchers get access to a remote machine that, for all intents and purposes, looks like a machine in our on-premise clusters, with the exception that there isn't a myriad of users simultaneously using it for various tasks. They can log into these machines via ssh and start work immediately just as they would on our on-premise computers. We are not using batch tools like AWS ParallelCluster largely because (i) researchers at GSB are currently used to interactive environments, and (ii) we already have an on-premise batch submission environment in Sherlock.
As illustrated in the figure below, CloudForest 3+ is a web-based frontend coupled with a backend built in full-stack JavaScript (react.js/redux.js, node.js, and MongoDB), uses TICK stack monitoring, automatically stops instances that idle for too long, and does instance initialization using continuous deployment (CD) tools. Our goal is to make it easy, fast, and secure for GSB researchers to obtain excess computing resources in the cloud, while making it cost efficient and easily auditable, maintainable, and upgradable for us in DARC.

CloudForest platform.
The rest of this (very) high-level overview proceeds as follows: First we cover the key features in CloudForest 3+. Second we briefly review the AWS services used. Third we give a more detailed look at CloudForest's dashboard and the functionality it gives users. Then we discuss four technical pieces that really make CloudForest tick: the primary backend server, cfserver; the database (in MongoDB); monitoring and automation; and how we set instances up after launch. Then we discuss how we maintain CloudForest in a tiered test/staging/production environment. Finally we provide an overview of usage during our "beta" phase, which started around October 2018. For deeper details on all these topics see our documentation.
Features
CloudForest 2
Before describing our new system, it is probably helpful to review the old one.
CloudForest 2 was a terminal-based Python application (literally cf2) that ran on our on-premise research servers, the yens. Users could create, start, stop, and ssh into instances and clusters of instances through this application. They could also maintain a directory on the yens that was automatically synced to any instances in their "project". Technically, instances and clusters were managed using terraform.
While this platform generally functioned and provided GSB researchers with cloud computing capacity, we eventually recognized a number of key problems that we felt required comprehensive attention:
- Not having built the system ourselves, it was very difficult to debug any issues users confronted without external support.
- CloudForest 2 was hosted on and depended on the yen computers, with two consequences: First, CloudForest could not be used when there were problems with the yens. Second, an ssh-from-ssh structure (users ssh to the yens, and from there implicitly ssh to an instance) limited the means via which researchers could access their cloud instances.
- Because the CloudForest 2 tool depended on the yens, the tool occasionally failed due to various vague permissions errors when users tried to connect to their systems.
- When users connected to instances via ssh, they connected using ssh keys for a sudo-enabled user, breaking typical Unix permissioning models.
- The sync process that maintained consistency between specific directories on the yen systems and CloudForest instances was unclear, in that it could not inform users of progress in syncing data, and wasteful, in that it could only sync the same content to all instances in a group.
- We had to rely on users to stop instances in the terminal-based application when they were finished using them. Mostly users did not stop their instances, and much of the cost DARC paid was for idle time.
- terraform and AWS services did not always "play nice" together. In one case, 4 out of 5 instances were launched by terraform before hitting AWS resource limits, without any notification that all 5 were not launched. When the user stopped the cluster, terraform and/or AWS launched the last instance, which remained unused in our system for 2 weeks before it was noticed and stopped. This incident alone cost DARC about $1,500 (with no productive work taking place on the "phantom" instance).
New Features
A number of features of the new system differentiate it from the old system:
- Independence from On-Premise Computing: The new CloudForest is completely independent of GSB's on-premise computing environments. Everything runs from the cloud. The yen computers can be unplugged and CloudForest won't care.
- Web Interface: CloudForest now appears to users as a web-based application. Many of our users are more used to GUI applications than terminal-based applications, and this type of interface is much easier to use for most of our users. This includes notebooks for interactive computing right from the browser.
- Easy SSH Login: Users are defined on the system as is typical on multi-user, permissioned systems and are able to ssh/scp into their machines using standard tools (when complying with firewall rules, of course). We also have a simplified ssh wrapper, cfssh, wherein users can ssh into a machine using only its CloudForest name, as in $ cfssh my-cool-instance
- Stanford 2FA: Both the web interface and ssh/scp into any instance rely on Stanford 2-Factor Authentication, harmonizing with other Stanford systems.
- Instance Autostop: CloudForest now has an autostop protocol, wherein instances are stopped after "idling" for 24 hours or whatever length of time costs 20 USD. "Idling" for H hours means that the maximum 5-minute average total CPU utilization over those H hours was less than 5% per CPU core.
- Self-Managed User Groups: Faculty have tools in the dashboard to add or remove users in their user groups themselves, rather than relying on DARC to act on requests for access.
- Cost Visibility: Although DARC pays the bill for EC2 usage, it is still useful for faculty to understand what charges are incurred by their work on EC2 instances. Moreover, it is good for us to have an independent view of the costs incurred by the CloudForest platform. Cost rates and total costs are clearly visible in our dashboard.
- No Clusters: We have found that running clusters is difficult and error prone, while research at GSB does not really depend on running highly parallelized, multi-machine codes. In the rare cases where this is warranted, we have other available tools. So for CloudForest 3+ we allow only "instances", not "clusters".
- On-Instance Stop: Users can call a simple executable, cfstop, to stop an instance from itself, for example when work on some task has completed. Using this tool can further improve cost efficiencies.
- Issue Reporting: Through code.stanford.edu DARC formally tracks issues related to the platform.
- Built In-House: The new system was built entirely in house. We feel this gives us a better chance to debug and maintain the system when users encounter errors.
AWS Services Used
CloudForest is essentially a platform to make it easier for faculty and their research collaborators to use EC2 instances for research purposes. Of course, this requires using a variety of AWS services, not just EC2. This section reviews the AWS services CloudForest depends on, and very briefly describes how.
Identity and Access Management (IAM)
AWS IAM helps provide control over who can do what in your AWS accounts. IAM tools are used to provide account credentials for our backend to securely execute calls to the other services used. Users of CloudForest never themselves see or use IAM credentials.
Elastic Compute Cloud (EC2)
AWS EC2, AWS's "generic" computers-in-the-cloud service, is the backbone of CloudForest. All our machines and their data are launched within EC2 as EC2 instances and EBS volumes, respectively. Our webserver, backend server, and monitoring server also all run in EC2 (as reserved instances).
Simple Storage Service (S3)
AWS S3 is a cloud data hosting service. Formally, S3 services are used to host EBS volume snapshots and AMIs associated with CloudForest. We are also considering S3 as a transfer storage location for data researchers might want to move across computing platforms.
Simple Email Service (SES)
AWS SES is an outgoing email management system within AWS. We use the SES service to send notification emails to our users when instances are requested, when instances are ready to use, and when instances are about to autostop (and when they have).
CodeDeploy
CodeDeploy is a cloud software deployment service for AWS. We use CodeDeploy as the engine behind instance setup to make sure any instances launched are properly instantiated for use by CloudForest users. CodeDeploy also makes it easy for us to deploy new features or upgrades to the pool of running instances.
Lambda
AWS Lambda is a "serverless" computing platform wherein you submit code that can run after certain triggers, and AWS handles execution of this code in a highly available, scalable fashion. We use Lambda in tandem with CloudWatch events to notify our backend when EC2 instances and/or EBS volume snapshots change state. This helps us keep the backend code stateless, rather than relying on polling the AWS system for data when certain highly asynchronous tasks occur.
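To make the idea concrete, here is a rough sketch of what such a Lambda handler could look like. The cfserver hostname, route, and payload shape below are placeholders for illustration, not the actual internal API; only the CloudWatch EC2 state-change event fields are standard.
// Hypothetical sketch of a Lambda handler for CloudWatch EC2 state-change events.
// The cfserver endpoint, API key variable, and payload shape are placeholders.
const https = require("https");

exports.handler = async (event) => {
  const body = JSON.stringify({
    instanceId: event.detail["instance-id"], // e.g. "i-0123456789abcdef0"
    state: event.detail.state,               // e.g. "running", "stopped"
    time: event.time,
  });

  // Forward the state change to cfserver so it can update its database.
  await new Promise((resolve, reject) => {
    const req = https.request(
      {
        hostname: "cfserver.example.stanford.edu", // placeholder host
        path: "/alert/ec2-state",                  // placeholder route
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: process.env.CF_API_KEY,   // API key kept as a Lambda env var
        },
      },
      (res) => resolve(res.statusCode)
    );
    req.on("error", reject);
    req.write(body);
    req.end();
  });
};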
The Dashboard
The centerpiece of CloudForest for users is probably the (SSO-protected) dashboard. In this section we sketch out the major appearance and features of this part of the platform.
Registration
If you aren't yet known to the system, when you first go to the dashboard you actually get a "registration" page wherein you can verify your information and get added as a CloudForest user.

However, you can’t add yourself to any groups. An administrator of a particular group has to actively add you for this “user status” to be meaningful.
An Example Dashboard
A non-trivial example dashboard is shown below:

You can choose your group (if you are in more than one) and see the instances associated with that group. For each instance you get to see its name, type, cost rate, total cost, state, current CPU and memory usage, and whether it is "synced" with cfserver for state updates, followed by a sequence of action buttons. These actions are start, stop, connect, open a notebook, open the monitor for detailed CPU/memory usage data, view/add notes, submit an issue, and delete.
The monitor queries data from the TICK stack for this instance and displays it in the browser:

Notes can be added by users to describe what is happening with the instance or added by the system to describe important events:

Requesting Instances
The dashboard has an embedded request form that appears as a large modal dialog when the + button is clicked. This form looks something like the following:




In filling out this form the user provides general information about what they are doing with the instance, what type of instance they want, how much storage they want, how restrictive they want the connection firewall to be, and how long they think they will use it for. There are a lot of instance types, so we include the ability to filter which instance types show up based on CPUs, memory, and special hardware like GPUs. Note that you have to agree to the usage and data policies to enable the "submit" button and actually create an instance.
Implementation
The dashboard is programmed entirely in react.js/redux.js on top of our rssreacts template for Stanford SSO-enabled apps. Most of the UI components come from Material UI, though we use some draft.js text fields. The dashboard interacts with cfserver using axios for http requests and socket.io for sockets. The dashboard is actually a set of static assets served by apache, as built by yarn (yarn build). We do have our own build "superscript", bundleup.py from rssreacts, though, that makes the build-and-deploy process easier and configuration-driven. Changes to the front-end can be deployed to development and production environments with a few simple commands.
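For a sense of what that interaction looks like in code, here is a minimal sketch of an http fetch plus a socket subscription. The URL, route, and event names are illustrative placeholders, not the actual CloudForest API.
// Illustrative only: the host, route, and event names are placeholders.
import axios from "axios";
import io from "socket.io-client";

const API = "https://cfserver.example.stanford.edu"; // placeholder URL

// One-off http request for the current state of a group's instances.
export async function fetchInstances(groupId, token) {
  const res = await axios.get(`${API}/group/${groupId}`, {
    headers: { Authorization: token },
  });
  return res.data.instances;
}

// Socket connection for pushed updates (e.g., instance state changes).
export function subscribe(onUpdate) {
  const socket = io(API);
  socket.on("instance-update", onUpdate); // placeholder event name
  return () => socket.disconnect();
}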
Custom Server(s)
CloudForest has a custom backend, cfserver, a node.js (express) server that manages groups, users, requests, instances, alerts, and more. cfserver mediates all communication with the MongoDB database (described below). cfserver is also almost "stateless" (http requests are stateless, but socket connections are not), although we have not formally tested this property. By "stateless" we mean any particular process running cfserver could respond to any request made to cfserver; a particular process shouldn't need to respond to particular subsets of requests, like those coming from a single dashboard session. A stateless server is useful in that it is, in principle, horizontally scalable.
APIs
cfserver "installs" and runs various (versioned) http APIs to work with groups, users, requests, instances, volumes, alerts, errors, and more. Thus cfserver mediates interactions with the database and takes actions in AWS corresponding to specific requests (like creating, starting, and stopping instances). All interactions with the database and AWS occur through appropriate cfserver API calls.
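A minimal sketch of how a routes object (like the routes.root example shown in the Testing section below) could be installed onto an express app under a version prefix. The loader function here is hypothetical; cfserver's actual install mechanism may differ.
// Hypothetical route installer; cfserver's real loader may differ.
const express = require("express");
const routes = require("./routes"); // objects shaped like routes.root below

function installRoutes(app, routes, prefix = "/v1") {
  Object.values(routes).forEach((r) => {
    // r.method is e.g. "get"; r.route is e.g. "/group/:id"
    app[r.method](prefix + r.route, r.callback);
  });
}

const app = express();
app.use(express.json());
installRoutes(app, routes);
app.listen(3000);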
Security
If enabled, cfserver can require authorization for all http requests as well as implement Role-Based Access Control (RBAC). Both are optional, but important in that cfserver can create, change, and delete EC2 resources as well as interact with the database. Without API access restrictions, malicious (or partially competent) actors could do a lot of damage to a running CloudForest platform.
Authorization
When authorization is enabled, every http request made to cfserver must have an Authorization header with a suitable value for making requests. Suitable values are either (i) an API key created by cfserver and stored in the database or (ii) a login token from our rsslogins server that has not yet expired. express middleware ensures that, before any actions take place, every request is checked for this Authorization header and the value evaluated. The value is first checked against the current list of API keys in the database and, if it is not an API key, then checked with the login server to see if it is a valid token. Requests with an API key and/or valid token are allowed to proceed.
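The middleware logic reduces to a short check like the sketch below, where isApiKey and verifyLoginToken are hypothetical helpers standing in for the database lookup and the rsslogins check.
// Sketch only: isApiKey and verifyLoginToken are hypothetical helper functions.
async function authorize(req, res, next) {
  const value = req.headers.authorization;
  if (!value) {
    return res.status(401).json({ error: "missing Authorization header" });
  }
  // First see if the value matches a stored API key ...
  if (await isApiKey(value)) return next();
  // ... otherwise ask the login server whether it is a valid, unexpired token.
  if (await verifyLoginToken(value)) return next();
  return res.status(403).json({ error: "invalid API key or login token" });
}

// Applied before any route handlers, e.g.:
// app.use(authorize);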
Role-Based Access Control (RBAC)
Authorizing requests does not in itself provide any granularity for permissions to execute certain commands via API routes. cfserver implements Role-Based Access Control (RBAC) with casbin in express middleware. The middleware checks the roles currently adoptable for the Identity associated with the request (as stored in a header, but which may be required to match API key and/or token values) against permissions for those roles to ensure the caller can execute the route.
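A rough sketch of how a casbin check can be wired into express middleware follows. The model/policy file names and the identity header are placeholders, and the subject/object/action mapping is only one reasonable choice, not necessarily the one cfserver uses.
// Sketch of casbin-based RBAC middleware; file names and header are placeholders.
const { newEnforcer } = require("casbin");

async function makeRbacMiddleware() {
  const enforcer = await newEnforcer("rbac_model.conf", "rbac_policy.csv");
  return async (req, res, next) => {
    const identity = req.headers.identity;      // who is asking (placeholder header)
    const resource = req.baseUrl + req.path;    // what they want to touch
    const action = req.method.toLowerCase();    // how (get, post, delete, ...)
    if (await enforcer.enforce(identity, resource, action)) return next();
    return res.status(403).json({ error: "not permitted" });
  };
}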
Logs
We uniformly log all requests made to cfserver as well as all associated responses. Request and response log entries look something like the following:
[2019-05-06T10:26:52.771] [INFO] info - REQLOG,GET,/user/ayurukog,6e9fc245e5a923666cd928f9,WHVsKHJ...
[2019-05-06T10:26:52.853] [INFO] info - RESLOG,6e9fc245e5a923666cd928f9,GET,/USER/:ID,304,JSON,1557163612.771,1557163612.853,0.082
[2019-05-06T10:26:52.910] [INFO] info - REQLOG,GET,/group/656d24e258f6d0fda608ba08,7becdedb1b82910aa783254b,WHVsKHJ...
[2019-05-06T10:26:53.546] [INFO] info - RESLOG,7becdedb1b82910aa783254b,GET,/GROUP/:ID,304,JSON,1557163612.910,1557163613.546,0.636
REQLOG declares that the line describes a request, with format
<method>,<URL>,<response key>,<login token>
RESLOG declares that the line describes a response, with format
<response key>,<method>,<URL>,<status>,<response method>,<start time>,<end time>,<duration>
The two are linked by the response keys (e.g. 6e9fc245e5a923666cd928f9 and 7becdedb1b82910aa783254b here), which appear in both the REQLOG and RESLOG lines.
Testing
Launching cfserver creates a test script, test/routes.sh, that you can run locally. You can also remotely make a GET HTTP request to /tests on a running cfserver instance and get back a testing script tailored for that server.
This testing script is created entirely from tests defined in the code itself. Specifically, any API route in cfserver is defined by adding a field to an (exported) object routes as follows:
// generic "root" call
routes.root = {
  route: "/",
  method: "get",
  tests: [ { status: 200 } ],
  callback: (req, res) => {
    res.send("Hi! I'm the DARC CloudForest server at Stanford GSB.\n");
  }
};
The "tests" field specifies a list of tests to run as objects whose fields define the call and its expected result. In this case, the single test for the route / is a straight-up call to GET / with an expected return status of 200 (success). Tests can be specified with parameters too (the things specified by the /: parts of route URLs) and need not be expected to succeed. For example,
const defaultGroupNameTests = [
  { params: { name: "test" }, status: 200 },
  { params: { name: "notagroup" }, status: 404 },
];
specifies two tests: one for group name calls with parameter test that should succeed (200) and one with parameter notagroup that should fail (404 Not Found). In this way the code itself, near the routes themselves, can specify how they should be tested.
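One way such definitions can be turned into an executable script is simply to emit one curl line per test. The generator below is a hypothetical sketch of that idea, not cfserver's actual script builder.
// Hypothetical generator turning route test definitions into a bash test script.
function testScript(routes, base = "http://localhost:3000") {
  const lines = ["#!/bin/bash"];
  for (const r of Object.values(routes)) {
    for (const t of r.tests || []) {
      // Substitute any :param placeholders in the route with test parameters.
      let url = r.route;
      for (const [k, v] of Object.entries(t.params || {})) {
        url = url.replace(`:${k}`, v);
      }
      lines.push(
        `curl -s -o /dev/null -w "%{http_code} (expect ${t.status}) ${url}\\n" ` +
          `-X ${r.method.toUpperCase()} ${base}${url}`
      );
    }
  }
  return lines.join("\n");
}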
CloudForest’s Database
Like many services, CloudForest depends on a database. We use a MongoDB Atlas instance to store data related to the new CloudForest platform. This database has a variety of collections, briefly reviewed below, that serve both static and dynamic needs. That is, the database stores AWS EC2 resource data, which should rarely change, as well as instance data, which can change rapidly. There are actually two databases, one for (sandboxed) development and one for production, each having the same structure but different content.
Technically, we have an M20 series instance hosted in AWS's us-east-1 region. This class provides 2 vCPUs, 4 GB RAM, 20 GB storage, a maximum of 700 connections, and continuous backups. Given that cfserver mediates all interaction with the database, we aren't yet greatly concerned about database performance other than read/write latency. In fact, for most of our beta run we operated out of the free-tier M0 class, with only shared resources, 512 MB of storage, and no backups, and for all intents and purposes everything worked fine.
Each collection used in CloudForest is described below:
Static Data
By "static data" we mean data held in the database that does not change particularly often, and certainly not through regular use of the CloudForest platform. This could perhaps be thought of as the high-level configuration data for the system.
options
Our options collection stores configuration options that influence how cfserver operates. There are four main categories:
- server, containing options related to server setup and operation
- aws, containing AWS-related options (including credentials)
- comms, containing communication-related options (e.g., a Slack channel POST url)
- and metrics, containing a description of the metrics server
By storing these options online in the database we make online administration possible. Options can be changed and the server "reloaded" with these updated settings from our Admin page, without ever having to actually log in to the instance running cfserver.
awsdata
The awsdata collection holds data on AWS instances that can be used in the platform. Specifically, awsdata is a collection of json objects describing EC2 instances along with any internal flags like availability (whether we allow the instance or not).
templates
The templates collection keeps the pug templates used to construct the emails we send corresponding to various events. Templates are identified by their name and have body and subject fields (as strings in pug markup form), a description, and a dictionary of variables that the pug templates for the body and subject will expect.
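To give a flavor of how such a stored template gets consumed, rendering comes down to a pug.render call with the expected variables. The template document and variable names below are made up for illustration.
// Illustrative only: the template document content is invented.
const pug = require("pug");

// A document shaped roughly like those in the templates collection.
const template = {
  name: "instance-ready",
  subject: "| Your instance #{instanceName} is ready",
  body: "p Your CloudForest instance #{instanceName} is now running.",
  variables: { instanceName: "name of the instance" },
};

const vars = { instanceName: "my-cool-instance" };
const subject = pug.render(template.subject, vars);
const body = pug.render(template.body, vars);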
Dynamic Data
By "dynamic data" we mean data held in the database that does change often, primarily through regular use of the CloudForest platform.
shadow
The shadow collection holds platform secrets like user passwords (in unix-hashed form, never cleartext) and API keys. There are no large-scale access routines for data from shadow written into cfserver.
users
Data about who can use the system is stored in the users collection. Each user is primarily identified by their uid, has name details, and stores group memberships by MongoDB _id and name:
{
"_id" : ObjectId("xxxxxxxxxxxxxxxxxxxxxxxx"),
"uid" : "cfuser",
...
"groups" : {
"xxxxxxxxxxxxxxxxxxxxxxxx" : "test-group",
"xxxxxxxxxxxxxxxxxxxxxxxx" : "facultyA"
},
"name" : {
"first" : "Some",
"middle" : "C. F.",
"last" : "User",
"display" : "C. F. User"
}
}
There is other data as well, particularly a user profile, history, and created/updated times.
groups
The groups collection stores information about the CloudForest groups. Groups are mainly identified by their name, but can also hold a description, a list of admins (the uids of the admin users), a list of members (by _id, with name and "rights" in the group), a list of instances the group currently has (by _id and name), and a history of all instances the group has had (by _id and name).
requests
instances
volumes
alerts
Any alert sent to cfserver from an instance, from AWS Lambda functions, or from kapacitor is stored in the alerts collection.
emails
The emails collection holds a permanent record of the emails cfserver sends to users of the platform via AWS's SES service. This includes the Destination (all To, Cc, and Bcc recipients), Message (body and subject), source, state (e.g., sent), and an SES messageId.
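For reference, the corresponding SES call from node.js looks roughly like the following (addresses and content are placeholders); the MessageId returned by SES is what gets stored with the email record.
// Sketch of the SES call whose inputs and MessageId end up in the emails collection.
// Addresses and message content are placeholders.
const AWS = require("aws-sdk");
const ses = new AWS.SES({ region: "us-east-1" });

async function sendNotification() {
  const result = await ses
    .sendEmail({
      Source: "cloudforest@example.stanford.edu",
      Destination: { ToAddresses: ["cfuser@stanford.edu"] },
      Message: {
        Subject: { Data: "Your instance is ready" },
        Body: { Text: { Data: "Your CloudForest instance is now running." } },
      },
    })
    .promise();
  return result.MessageId; // stored alongside the email record
}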
errors
The errors collection holds errors that are POSTed to cfserver by the ErrorBoundary in our react.js dashboard.
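A minimal error boundary along these lines is sketched below; the /error route and payload fields are placeholders for the actual endpoint.
// Sketch of an error boundary that reports render errors to cfserver.
// The endpoint and payload shape are placeholders.
import React from "react";
import axios from "axios";

class ErrorBoundary extends React.Component {
  state = { hasError: false };

  static getDerivedStateFromError() {
    return { hasError: true };
  }

  componentDidCatch(error, info) {
    // Fire-and-forget report; failures to report are ignored.
    axios
      .post("/error", { message: error.message, stack: info.componentStack })
      .catch(() => {});
  }

  render() {
    if (this.state.hasError) return <p>Something went wrong.</p>;
    return this.props.children;
  }
}

export default ErrorBoundary;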
Monitoring and Automation
CloudForest uses the TICK stack to monitor instance use and facilitate automation. Specifically, we install and run telegraf to collect and ship monitoring data from running instances to another instance running influxdb, chronograf, and kapacitor. DARC uses a single reserved instance running the "ICK" part of the stack for all our monitoring purposes, including CloudForest. We use kapacitor templates to instantiate scripts when instances start that monitor for various conditions such as disk reaching capacity or instances idling. chronograf is a convenient web UI to explore data, but accesses influxdb through a user with permissions only to read certain databases.
We have found the TICK stack to be a fantastic set of products for this purpose. It is relatively easy to install and maintain, data is stored very compactly and is easy to query locally and remotely, and telegraf ships highly detailed metrics out of the box. Our automation strategy is based entirely on kapacitor, which would be impossible were it not flexible and relatively easy to work with.
Stopping Idling Instances
Our main example is idling warnings and instance stopping, which we review here. Whenever an instance is running, telegraf is shipping metric data, including CPU utilization, to our influxdb instance. kapacitor is also running on that instance, and subscribes to updates from influxdb. There are kapacitor tasks set up for any running instance that watch CPU utilization (technically 100.0 - usage.idle) and create (i) a stream of averages of CPU utilization over non-overlapping five-minute windows and (ii) a windowed maximum of those averages over an hour, over an instance-dependent idling time limit, and over an hour less than that limit. Part of the tasks logs the resultant maximum data for auditing and review purposes. But when the maximum values are too small, another part of the tasks can trigger alerts to cfserver via https requests according to the time period over which the maximum value was too small. cfserver can then act based on these alerts, for example by sending email notifications that an instance might stop soon if not used, or by actually stopping an instance if it has been idling for too long.
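As a sketch of how cfserver's side of this might branch on such alerts, consider the route below, written in the same routes-object style as the Testing section. The route name, payload fields, and helper functions (stopInstance, emailGroup) are all hypothetical placeholders, not the actual alert API.
// Illustrative alert handler; route, payload fields, and helpers are placeholders.
routes.idleAlert = {
  route: "/alert/idle",
  method: "post",
  tests: [],
  callback: async (req, res) => {
    const { instanceId, hoursIdle, limitHours } = req.body;
    if (hoursIdle >= limitHours) {
      // Over the idle limit: stop the instance and record why.
      await stopInstance(instanceId, "autostopped after idling");
    } else if (hoursIdle >= limitHours - 1) {
      // Within an hour of the limit: warn the owners by email.
      await emailGroup(instanceId, "instance-idle-warning");
    }
    res.json({ ok: true });
  },
};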
cfmetrics
CloudForest monitoring relies on another server, called cfmetrics, that runs on the instance that holds the influxdb installation. The purpose of this server is just to "proxy" requests for data from influxdb and task definitions in kapacitor. This way, we don't need to code detailed calls for data and alerts right into cfserver, although that is an aesthetic decision as much as a technical one. cfmetrics also serves as an https proxy: influxdb can run with a self-signed certificate while cfmetrics runs with proper CA certs to satisfy requests from in-browser apps.
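A bare-bones sketch of such a proxy route is shown below. The influxdb location, port, and error handling are placeholders; only the influxdb /query endpoint with db and q parameters is standard.
// Sketch of an express route proxying queries to a local influxdb instance;
// the influxdb URL and handling details are placeholders.
const express = require("express");
const axios = require("axios");
const https = require("https");

const app = express();
// influxdb runs locally with a self-signed cert, so skip verification on this hop only.
const agent = new https.Agent({ rejectUnauthorized: false });

app.get("/query", async (req, res) => {
  try {
    const r = await axios.get("https://localhost:8086/query", {
      params: { db: req.query.db, q: req.query.q },
      httpsAgent: agent,
    });
    res.json(r.data);
  } catch (err) {
    res.status(502).json({ error: "influxdb query failed" });
  }
});

app.listen(9443);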
Instance Setup
One important goal is that when users log into a CloudForest instance, it feels familiar. The same software is there, the same layout for home and data directories, etc. For us, the admins, we want to make sure software like telegraf is running so we can collect data on instances and act on events described by that data. And more.
Some of these setup tasks can be handled by installing software in the AMI: all our analysis software (Python, R, MATLAB, Stata SE/MP, etc.) is pre-installed, as is telegraf and many other useful packages. Take, however, telegraf as an example. If it is installed and running in the AMI, then as soon as an EC2 instance boots up, whatever data it sends will be tagged with the host set up in the AMI. This won't be useful at all, because we need instance-specific host tags in the telegraf data to show users their data or alert based on events for particular machines. We have to do some telegraf setup after the instance launches for this to be useful. There are quite a few other "setup" tasks that must be done. For example, AWS EBS volumes are allocated and attached to instances when they are launched, but those volumes are not necessarily formatted or mounted to the instance. If they aren't, the volumes are useless.
Our Setup
As of this writing our setup tasks include:
- Installing the latest versions of useful CloudForest CLI tools on each instance
- Correctly specifying the hostname of an instance, based on its given name
- Correctly configuring and starting telegraf monitoring, given the correct host name
- Obtaining https certificates (via LetsEncrypt) for each instance under its correct name
- Starting security services like splunk log forwarding and qualys software scanning
- Formatting and mounting volumes to an instance
- Identifying and creating the users that should be defined on the system, or deleting users that shouldn't be
- Configuring and starting a jupyter notebook server to enable in-browser computing
We actually accomplish all these tasks in a fairly general way using a single script, install.sh, that searches a directory for subfolders with their own install.sh files and runs any of those. Our implementation also allows for:
- install-specific environment variable definitions in an install.env file,
- "prerequisites" as listed in a single file after (e.g., starting telegraf and splunk should follow the definition of the hostname),
- and failure tracking.
In this way adding new installs for, say, new CloudForest CLI tools, security features, or software upgrades is actually fairly easy (see the sketch of the driver logic below).
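The actual driver is a shell script, but the underlying logic (find subfolders with an install.sh, defer anything whose "after" prerequisites haven't run yet, and track failures) can be sketched in node.js. Everything below is an illustration under those stated assumptions, not the real driver.
// Node sketch of the install driver logic described above; the real driver is a
// shell script, and file/directory names here just mirror the description.
const fs = require("fs");
const path = require("path");
const { execSync } = require("child_process");

// Return the names listed in a subfolder's "after" file (its prerequisites).
function prereqs(dir) {
  const f = path.join(dir, "after");
  if (!fs.existsSync(f)) return [];
  return fs.readFileSync(f, "utf8").split(/\s+/).filter(Boolean);
}

function runInstalls(root) {
  const failed = [];
  let pending = fs
    .readdirSync(root)
    .filter((name) => fs.existsSync(path.join(root, name, "install.sh")));

  // Run installs in passes, deferring any whose prerequisites haven't run yet.
  while (pending.length > 0) {
    const ready = pending.filter(
      (name) => !prereqs(path.join(root, name)).some((p) => pending.includes(p))
    );
    if (ready.length === 0) break; // circular or missing prerequisite
    for (const name of ready) {
      try {
        execSync("bash install.sh", { cwd: path.join(root, name), stdio: "inherit" });
      } catch (err) {
        failed.push(name); // failure tracking
      }
    }
    pending = pending.filter((name) => !ready.includes(name));
  }
  return failed;
}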
Implementation
We store all these "installs" in a bitbucket repository and use pipelines connected to AWS CodeDeploy to define what should be deployed to running instances. Valid running instances to deploy to are identified with tags. Our development and production environments actually use completely different AWS accounts, specified in pipelines using different repository and deployment variables. However, we also have a "staging" environment within our production CloudForest, defined by a particular group, to enable trial deployment of our setup routines in production. This is an important feature; bad updates to our setup routines deployed directly to production have deleted user data.
Deployment Environments
Testing/Development
We maintain a sandboxed testing/development environment in a completely separate AWS account. This account holds two EC2 instances: one for running cfserver and one for monitoring with the TICK stack as well as cfmetrics.
The main git branch for development in cfserver, cfmetrics, and cfinstancedeploy is aptly named development.
Staging
Our "staging" environment is a subset of the production environment consisting of instances in a particular group, staging. In particular, staging enables us to roll out changes to cfinstancedeploy to production before they affect users' running instances. Changes to cfinstancedeploy should be committed and pushed in the following order:
development -> staging -> production
Production
Our production environment is our main AWS account with two servers (on AWS reserved EC2 instances): one hosting the website and cfserver (among other tools provided by DARC) and another hosting the TICK stack installations and cfmetrics.
The webserver machine is an m4.2xlarge instance with 8 vCPUs, 32 GB memory, and 256 GB of storage. This machine serves CloudForest-related webpages, including the dashboard, using apache and runs a single cfserver process on another port. This machine is multi-purpose, hosting several of DARC's tools and service domains including CloudForest.
The TICK stack server is an r5d.2xlarge instance with 8 vCPUs, 64 GB memory, and 300 GB of NVMe SSD instance storage on which we hold the influxdb database. We use an attached EBS volume and a cron/rsync job to capture daily snapshots of the data stored there. This server is our TICK stack instance for all computing metrics collection at DARC, not only CloudForest.
The main git branch for production in cfserver, cfmetrics, and cfinstancedeploy is conventionally named master.
Usage to Date
Conclusions
Acknowledgements
We aren't too specific here about who helped with what. The entire DARC team, and more, has contributed a lot to this project, and that includes (in no particular order) Arun Aaggar, Luba Gloukhova, Mason Jiang, Wonhee Lee, Sal Mancuso, Amy Ng, and Jason Ponce among others.