Red Eye

remlite: The Light and Fluffy Red Eye Monitor

Author: Geoff Howland <geoff.at.gmail.dot.com>

Red Eye Monitor (REM) is a comprehensive Total Systems Automation project, which attempts to manage all aspects of your systems across their full life cycle. All failures are planned for, and as many conditions as possible are handled automatically, coded and tested so that no human intervention is needed, creating a flexible, scalable and self-healing system in vendor cloud environments (like Amazon's EC2 and others) and in your own data centers, integrating the worlds of utility computing and fixed assets into a true Cloud.

remlite is a significantly scope-reduced automation system that does not intend to be a Total Systems Automation project. remlite is run out of YAML files instead of a relational database. remlite does not take versioning into account in all of its components, and is not meant to support different sets of administrators operating in different environments without stepping on each other's feet. remlite is meant to provide a minimal layer of automation to handle the upkeep of machines providing services in the cloud, in different accounts.

remlite will:

remlite will not:

Deciding to use remlite is predicated on wanting a system that can be installed with minimal testing, to get a centralized management server up quickly.

 

Overview

remlite is configured with YAML files, which may be checked into a version control system to provide auditing and an indirect method for updating files, so logging into the remlite Manager machine is not necessary to update running information.

Once files have been updated, the remlite Manager will notice the updated file modification times and reload any files that are already loaded in memory. No restart or HUP is necessary for this.
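
A minimal sketch of that reload check, assuming PyYAML and a simple in-memory cache keyed by file path (names here are illustrative):

import os
import yaml

# Hypothetical in-memory cache: file path -> (mtime, parsed data)
_loaded = {}

def get_conf(path):
    """Return the parsed YAML for path, reloading it if the file time has changed."""
    mtime = os.path.getmtime(path)
    cached = _loaded.get(path)
    if cached is None or cached[0] != mtime:
        with open(path) as fp:
            _loaded[path] = (mtime, yaml.safe_load(fp))
    return _loaded[path][1]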

remlite is made to manage multiple accounts, which could contain fixed assets or Amazon EC2 instances, but the goal of remlite is management of EC2 instances, so all initial code is directed at that purpose.

These accounts are listed in the overview diagram as prod, test, webeng and systems. They would represent different Amazon EC2 accounts, with their own authorization and access configurations, and their own AMIs and auto-scale groups.

Each account can have any number of services, listed in the diagram as Service A, Service B and Service C. A service is a distinct class of machine: all machines in the same service should have the same installation requirements and be used for essentially the same purpose.

Each machine in the service, listed as Host 0, Host 1 and Host 2, is configured independently, and can be passed its instance number out of the instance count of machines in this service. Host 0 would be 0 of 3 in this example. This can be useful for sharding data on machines of the same service type: they contain the same packages and executables, and run the same services, but they load different data sets based on their instance number.

Each machine in the service can have any number of EBS volumes associated with it. A service defines the volumes associated with the service, and whether the volume is mounted on every machine or not. These volumes are tracked by the instance number of the machine instance they are assigned to. On the death of that machine instance, and the birth of a new instance by Auto-Scale, the remlite Manager will run its Provisioning script and assign the new machine instance to the YAML file of the now-vacant instance number. With the instance number populated again by a running machine instance, the remlite Manager Configuration script will assign the EBS volumes to the new machine instance at the previous device mount points, so the new instance will start back up with populated storage.
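
A rough sketch of the volume re-attachment step, assuming the boto EC2 library and a machine YAML that records each volume's id and device (the field names volumes, volume_id and device are illustrative):

import boto.ec2

def reattach_volumes(region, aws_key, aws_secret, machine_conf, new_instance_id):
    """Attach the volumes recorded for this instance number to the replacement instance."""
    conn = boto.ec2.connect_to_region(region,
                                      aws_access_key_id=aws_key,
                                      aws_secret_access_key=aws_secret)
    for volume in machine_conf.get('volumes', []):
        # Re-use the device the previous instance had, so storage comes back as before
        conn.attach_volume(volume['volume_id'], new_instance_id, volume['device'])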

 

Auto-Scale

Auto-Scaling is used to create pools of machines for a given service, and once the instances are created, the remlite Manager begins the Provisioning process.

Life Cycle

The life cycle of a machine is initiated by EC2 Auto-Scale. Auto-Scale should be configured to the same number of machines as specified in the service instance_count. This is not enforced or managed by remlite (unlike REM, which manages and enforces these types of details).

Once a machine instance has been created in EC2, the remlite Manager will determine which service it belongs to by its Auto-Scale group. The remlite Manager will then assign this instance to one of the depopulated machine YAML files for this service.
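
A sketch of how that assignment might work, assuming machines are matched on their auto-scale group and that an empty instance_id field marks a depopulated machine YAML (instance_id is an illustrative field name; autoscale_group and instance_status appear in the example files below):

import glob
import yaml

def assign_instance(instance_id, autoscale_group, machine_conf_path):
    """Record a new EC2 instance in the first depopulated machine YAML for its service."""
    for path in sorted(glob.glob('%s/*.yaml' % machine_conf_path)):
        with open(path) as fp:
            machine = yaml.safe_load(fp)
        if machine.get('autoscale_group') == autoscale_group and not machine.get('instance_id'):
            machine['instance_id'] = instance_id
            machine['instance_status'] = 'provisioned'
            with open(path, 'w') as fp:
                yaml.safe_dump(machine, fp, default_flow_style=False)
            return path
    return None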

Once the YAML file has been updated, the service machine for that instance number is marked as Provisioned, and is ready for configuration.

The remlite Manager starts the configuration process locally, by assigning all the EC2 EBS volumes to the specified devices that were assigned to the previous machine instance, so the storage configuration will remain the same. remlite will not attempt to verify the storage is still usable and in good condition (unlike REM, which performs tests at every level of the storage's devices and software that uses the storage).

After the volumes have been successfully mounted on the target machine, the remlite Manager will copy all current remlite files, including scripts, configuration files, templates and everything else that is not security related, to the target machine. This ensures the target machine has all the latest remlite YAML and script files. No security-related files are ever copied to a target machine, keeping EC2 instances totally unaware of their own account information.
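
One way this copy could be done, sketched here with rsync over SSH; the target path and the decision to exclude the certs/ directory are illustrative, but nothing security related is sent:

import subprocess

def push_remlite_files(host, ssh_key, remlite_root):
    """Copy scripts, configuration and templates to the target machine, excluding security files."""
    subprocess.check_call([
        'rsync', '-az', '--delete',
        '--exclude', 'certs/',                 # never ship account keys or certificates
        '-e', 'ssh -i %s' % ssh_key,
        '%s/' % remlite_root,
        'root@%s:/opt/remlite/' % host,        # target path is illustrative
    ])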

After the remlite YAML and script files have been copied, the remlite Manager will start the remlite Local server. This is a long-running process which runs on the target instance, performs local monitoring on the machine, and holds the monitoring data to be collected by the remlite Manager's Monitor Collector. Scripts may also be run as cron-style jobs through the monitoring mechanism, with the results stored the same way monitoring results are stored. Monitoring is meant to be interpreted very loosely, to take advantage of the powerful nature of storing dictionaries of collected data in RRD files, which can be graphed and queried as time series data for alerting and capacity planning.
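
A minimal sketch of the Local server loop, assuming each script entry from the service YAML runs on its run_delay and its result dictionary is buffered in memory for the Monitor Collector (run_script is a hypothetical helper):

import time

# In-memory buffer of monitoring results: script name -> list of (timestamp, result dict)
RESULTS = {}

def run_local_monitors(service_conf, run_script):
    """Run each configured script on its run_delay and buffer the results for collection."""
    last_run = {}
    while True:
        now = time.time()
        for name, spec in service_conf.get('scripts', {}).items():
            if now - last_run.get(name, 0) >= spec.get('run_delay', 60):
                RESULTS.setdefault(name, []).append((now, run_script(spec)))
                last_run[name] = now
        time.sleep(1)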

Once the remlite Manager has started the Local server, the Configuration script will be run. At this point the target machine is considered Configured, even though the configuration may take some period of time.

A target machine marked as Configured is now in the Verification process. Verification is simply running the monitors against the target machine until all the monitors pass successfully. There is a timeout period specified in the verify section of the service scripts specification. Once this timeout period has passed, the contact_on_timeout list is used to alert administrators of the failure to verify that this configured server is working. This timeout stops remlite's progress, and a human must intervene to set the situation right. (This differs from REM, which will take every action possible to automatically restore the service, whether through a restored snapshot or by running repair scripts over the data; many levels of repair are automated.)
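
A sketch of the Verification loop, assuming the verify section supplies a timeout in seconds alongside the contact_on_timeout list, and that monitor and alert helpers exist elsewhere (both are hypothetical here):

import time

def verify_machine(machine, verify_conf, run_monitors, send_alert):
    """Run monitors until they all pass, or alert contact_on_timeout and stop."""
    deadline = time.time() + verify_conf['timeout']
    while time.time() < deadline:
        if run_monitors(machine):          # True only when every monitor passes
            machine['instance_status'] = 'verified'
            return True
        time.sleep(15)
    # Verification failed: remlite stops here and a human must intervene
    send_alert(verify_conf['contact_on_timeout'],
               'Failed to verify %s within %d seconds' % (machine['name'], verify_conf['timeout']))
    return False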

As soon as all the monitors pass, the target machine is marked Verified, and the Activation process begins. This is just another script run from the remlite Manager, which will let other services and machines know that this instance is now Active and can take traffic.

Once Active, a machine will be monitored indefinitely according to its service specification, with alerts being sent. If the machine instance is terminated, fails monitors and is killed by automated scripts, or simply disappears, the Auto-Scale and Provisioning process will repeat.

 

Stages

Configuration

YAML Files

Configuration of remlite is done through YAML files, so it is easily edited by humans, and changes can be made to its data and scripts easily.

accounts.yaml

systems:
  manager: ec2

  account:
    
    number: 020661041473
    user: systems@netflix.com
    
    pk: certs/pk_systems.pem
    cert: certs/cert_systems.pem
    
    aws_key: certs/systems.key
    aws_secret: certs/systems.secret
  
  ssh:
    low_security: certs/sa.pem

This YAML snippet describes a single account in accounts.yaml.
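
As a usage sketch, this block could be read with PyYAML like so (the load path is illustrative):

import yaml

# Load the accounts file (the path is illustrative)
accounts = yaml.safe_load(open('conf/accounts.yaml'))

systems = accounts['systems']
print(systems['account']['user'])        # systems@netflix.com
print(systems['account']['pk'])          # certs/pk_systems.pem
print(systems['ssh']['low_security'])    # certs/sa.pem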

sites.yaml

systems:
  
  # Last site version given out to a deployment.  Each new QA deployment gets
  #   a new version, and then it keeps that version number for the rest of its
  #   life.  New QA versions made from that version get a new version number.
  #TODO(g): Ensure the function that updates this locks.
  last_site_version: 0
  
  # Configuration file name format.  Example: conf/systems/www_00000.yaml
  conf_name: "conf/systems/%(name)s_%(site_version)s.yaml"
  
  # Autoscale group name format, ex: ha_proxy_00001
  autoscale_group_format: "%(name)s_%(site_version)05d"
  
  # Deployments in this site
  deployment_conf: conf/deployments/systems.yaml
  
  # Any dynamic deployments (QA deployments), will be written here, in the same
  #   style as the above deployments
  deployment_directory: conf/systems/deployments/

This YAML snippet describes a single site in sites.yaml. More sites can be written in sites.yaml so that multiple accounts can be managed. Each site tracks its last site version, its configuration file and auto-scale group naming formats, and where its deployment configurations live.
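
A short sketch of how these format strings expand with Python %-formatting, using the example values from the comments above:

# conf_name expects a pre-padded site version string
conf_values = {'name': 'www', 'site_version': '00000'}
print('conf/systems/%(name)s_%(site_version)s.yaml' % conf_values)
# -> conf/systems/www_00000.yaml

# autoscale_group_format zero-pads the numeric site version itself
asg_values = {'name': 'ha_proxy', 'site_version': 1}
print('%(name)s_%(site_version)05d' % asg_values)
# -> ha_proxy_00001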

deployment_conf: systems.yaml

# The production deployment is based on this
production:
  # This is a production deployment.  There is only one of these, ever.
  deployment: production
  
  # This is an actively running deployment
  type: running
  
  account: systems
  
  # Production deployments cannot be promoted
  promote: null
  
  # Deployment Configuration file for this deployment type
  conf: conf/systems/deployments/systems_00000.yaml



# The staging deployment is based on this (uses Production databases)
staging:
  # This is a staging deployment.  There is only one of these, ever.
  deployment: staging
  
  # Type is "running" when active, and "off" when inactive.  When there is not
  #   a QA deployment being promoted to test as staging, then staging is
  #   off, and instances are not being used or paid for.
  type: off
  
  account: systems
  
  # Staging deployments get promoted to production
  promote:
    #NOTE(g): When promotion is done from staging to production, we update
    #   production's services to point to the new versions, and we kill
    #   the previous production AS group (once the transition has been
    #   completed properly; we leave it up as fall-back until then).
    deployment: production
    
    # Staging must be at 100 percent production traffic to be promoted to
    #   production
    production_traffic_percent: 100
  
  
  # Deployment Configuration file for this deployment type
  conf: null



# All QA deployments are based on this (uses per-QA-deploy databases)
#NOTE(g): This is both an actual deployment, and a template for users
#   to do their own QA deployments for manual testing before promoting
#   to staging, and finally staging to production.
qa:
  # This is a QA deployment, there can be an unlimited number of these.
  #   This is how deployments are changed, an existing deployment is duplicated
  #   and a new QA deployment is added to dynamic deployments, with the
  #   deployment config copied from this (qa), and the service YAML files copied
  #   from the service version specified to duplicate.  These files are then
  #   updated as staging's service files, and staging instance_count levels
  #   are used, and production_traffic_percent is set to 0, so manual testing
  #   can be done on staging first.
  deployment: qa
  
  # This is a template, and will not run instances.  This is used to
  #   create user specific test environments, using template_format,
  #   which allows a testname (ie, user account name or project name)
  #   to be inserted to the temporary qa_* deployment, which is either
  #   discarded after manual testing, or promoted to staging.
  type: template
  
  # This template file formatter is appended to the QA running deployments
  #   that are created from this deployment template.
  #
  # QA systems are run in a different EC2 account, so that different ACLs
  #   can be used (ie. running in a VPC or global deny ACLs).
  account: systems_qa
  
  # QA deployments get promoted into the "systems" site, out of
  #   the "systems_qa" site, and are promoted into the staging deployment.
  promote:
    deployment: staging
    
    # QA deployments will be at 0 percent production traffic when they are
    #   promoted to staging
    production_traffic_percent: 0
  
  
  # This configuration is a template to be used for our QA instances, it does
  #   not ever run any instances itself (which is why type==template), so
  #   we will never set this conf variable.  Just here as a place-holder.
  #NOTE(g): When actual QA site deployment configurations are loaded, they
  #   use this deployment's specification as default dictionary values, and then
  #   update their values over them, so that type and conf both get updated.
  conf: null

This YAML snippet describes the deployments in the systems deployment configuration file: production, staging and qa. Production is always running; staging only runs while a QA deployment is being promoted and tested; qa is a template used to create per-user QA deployments.

users.yaml

# Administrators
admins:
  
  oncall:
    
    2010-01-29:
      hour_start: 10
      users: [ghowland, mchesnut, rkubica]
      
    2010-02-05:
      hour_start: 10
      users: [mchesnut, rkubica, ghowland]
      
    2010-02-12:
      hour_start: 10
      users: [rkubica, ghowland, mchesnut]
  
  
  # When user contact types are available
  access:
    
    weekday_days:
      
      accept:
        weekdays:
          day: [1, 2, 3, 4, 5]
          hour_start: 10
          hour_end: 22
      deny:
        all_except: [weekdays]
      
    weekday_nights:
      accept:
        weekdays:
          days: [1, 2, 3, 4, 5]
          hour_start: 22
          hour_end: 10
      deny:
        all_except: [weekdays]
    
    weekends:
      accept:
        weekends:
          days: [0, 6]
      deny:
        all_except: [weekends]
  
  
  users:
    
    ghowland:
      name: Geoff Howland
      
      uid: 1000
      shell: /bin/bash
      ssh_public_key: conf/users/ghowland_ssh.pub
      
      notify:
        email: ghowland@netflix.com
      
      alert:
        email: 4083484645@att.wireless.net
        #access: [weekday_days]
    
    mchesnut:
      name: Mike Chesnut
      
      uid: 1001
      shell: /bin/bash
      ssh_public_key: conf/users/mchesnut_ssh.pub
      
      notify:
        email: mchesnut@netflix.com
      
      alert:
        email: 4083484645@att.wireless.net
        #access: [weekday_nights]
    
    rkubica:
      name: Ryan Kubica
      
      uid: 1002
      shell: /bin/bash
      ssh_public_key: conf/users/rkubica_ssh.pub
      
      notify:
        email: rkubica@netflix.com
      
      alert:
        email: 4083484645@att.wireless.net
        #access: [weekends]

This YAML snippet describes an administrator group (admins), its on-call rotation and contact-time access windows, and its users (ghowland, mchesnut and rkubica). Users are referenced by these labels in other YAML files, and their contact and other information is drawn from this source.
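
A sketch of how the oncall schedule might be consulted, assuming the data has been loaded with PyYAML and that the first user listed in the most recent rotation is the primary on-call (that convention is an assumption, not something the file states):

import datetime

def current_oncall(users_conf, now=None):
    """Return the first listed user of the most recent oncall rotation that has started."""
    now = now or datetime.datetime.now()
    started = []
    for date_key, rotation in users_conf['admins']['oncall'].items():
        # PyYAML parses unquoted keys like 2010-01-29 as datetime.date objects
        start = datetime.datetime.combine(date_key, datetime.time(rotation['hour_start']))
        if start <= now:
            started.append((start, rotation))
    if not started:
        return None
    return max(started, key=lambda item: item[0])[1]['users'][0]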

service: www_middle.yaml

This file is not actually called service.yaml; it is named after the service specified in the services section of the sites.yaml file. In this example the service is www_middle, which resides in conf/systems/www_middle.yaml.

name: www_middle

site_version: 0

autoscale_group: www_middle_00000

# AMI
ami: ami-d7d03cbe


# Each of our machine's profiles is kept in this path, as YAML files
#NOTE(g): All deployments of systems/ha_proxy go in this path, to keep this
#   mechanism simple.  This may change once everything is built.
machine_conf_path: conf/systems/www_middle

# Machine name format
machine_name: "%(name)s_%(instance)02d"

# Machine domain name format
machine_domain: "%(machine_name)s.lb.%(deployment)s.%(zone)s.cloud.netflix.net"


# Availability zones to keep instances
zones: [us-east-1c]

# EC2 instance type
instance_type: m1.small


# Security groups to create these instances with
security_groups: [test]

# Keypair to use for SSH access
keypair: test


# Configure our HA Proxy load balancers to target this for edge traffic
load_balancer: middle


# Image
#NOTE(g): Build the AMIs automatically from this list of packages, which
#   can get new updates, and then spawns new AMIs.
image:
  os: centos_54
  
  packages:
  - something.rpm
  - else.rpm
  - also.rpm


# Deployment specific variables
deployment:
  
  production:
    # Number of instances for this service (per zone)
    instance_count: 2
  
  staging:
    # Number of instances for this service (per zone)
    instance_count: 2
  
  # QA Deployment
  qa:
    # Number of instances for this service (per zone)
    instance_count: 2


# The offset to begin the first instance number at.  The last instance number
#   will be instance_count_start+instance_count.
#NOTE(g): instance_count_start is typically 0, but if we are spreading
#   instances over different accounts for the same service, and we want to keep
#   the instance ids unique, then choosing a different count_start and
#   count_stop could be useful.
instance_count_start: 0


# Number of shards, different data
#NOTE(g): The difference between instance_count_start and instance_count_stop
#   should be evenly divisible by shard_count, as the total number should be
#   a repeat of the shard_count, to create a master group and then groups
#   of replicas.  The ordering is striped, so in a shard size of 2 with 6
#   total instances, the shard number would be: 0, 1, 0, 1, 0, 1
#   This means the 1st, 3rd and 5th instance have the same data, with the
#   1st instance being the write-master and 3rd and 5th being read-replicas.
#   The 2nd, 4th and 6th instances would have the other half of the data, and
#   the 4th and 6th instance would slave off the 2nd instance.
#NOTE(g): shard=1 means all instances are the same, and can be accessed the
#   same way.  If there is a write-master, this should be instance 0.
#NOTE(g): remlite does not enforce sharding, but the variable "shard"
#   is always set.  If shard_count==1, then shard will always be 0.
shard_count: 1


# SSH Bastion Servers, for access to the service machines
#NOTE(g): Users connect to this machine from these bastion machines.  Ensure
#   the bastion machines are the right security level to be the jump point
#   to this service's machines.
ssh_bastions: [cloudbastion100, cloudbastion200]


# Alerts
alert:
  
  configuration:
    
    info: Status of this configuration process.
  
  
  http_admin_server:
    
    info: This machine uses an HTTP admin server for maintenance.
    
    contact_groups: [admins, webeng_developers]
    
    # Populate the alert template with the results of the template_script
    #TODO(g): What data gets passed in to the template_script?  We want data
    #   from whatever event caused this, which I think 
    template: conf/templates/http_admin_server.txt
    template_script: scripts/templates/http_admin_server.py
    
    # Number of seconds to delay between sending alerts (1800=30 minutes)
    alert_delay: 1800
    
    # State change thresholds
    #NOTE(g): Should testing for RRD/buffer data be done here?  If not, where?
    #   This area makes the most sense, and then we have the alerts go out
    #   based on data we collect.
    #
    #   So first we COLLECT data, without caring about how it is.  Then we
    #   FETCH/PROCESS the data, and decide to alert or take an action.
    #
    #   I like this, it separates collection from alerting, which is good.
    #   the alert script can be run just like all the other scripts, every
    #   15 seconds on the local machine, which is reading from in-memory
    #   buffers.
    #
    #   Local buffers will have to keep enough data to satisfy the longest
    #   fetch series, and then it will always be able to do the tests itself.
    #
    #   Wrap all the tests inside a script, so this file does not get any more
    #   complex.  Simplify this file.
    thresholds:
      yellow:
        contact_type: notify
      red:
        contact_type: alert
    
    # Time ranges can be listed here, to suppress this alert
    suppressed:
      



# Paths owned by this group
paths:
  - data:
    path: /data/
    mode: 770
    owner: root
    group: developer
  
  - www:
    path: /var/www/
    mode: 774
    owner: apache
    group: developer



# Users on this machine (get default data from users.yaml)
users:
  ghowland:
    sudo:
      - /usr/sbin/service httpd restart
  
  mchesnut:
    sudo:
      - /usr/sbin/service httpd restart
  
  rkubica:
    sudo:
      - /usr/sbin/service httpd restart




# Local scripts
scripts:
  
  configure_www_middle:
    run_delay: 30
    
    if:
      instance_status: [configured, verified, active]
    
    run:
      - script: scripts/configure/www_middle.py
      - except:
          alert:
            configure_www_middle: red


  monitor_www_middle:
    run_delay: 15
    
    if:
      instance_status: [configured, verified, active]
    
    run:
      - script: scripts/monitor/apache_stats.py
      - except:
          alert:
            monitor_www_middle: red
    
    rrd:
      apache:
        store_delay: 15
        
        columns:
          requests:
            key: ["requests"]
            type: GAUGE
          
          errors:
            key: ["errors"]
            type: GAUGE
          
          security:
            key: ["security"]
            type: GAUGE
          
          connections:
            key: ["connections"]
            type: GAUGE
        
        graph:
          traffic:
            columns: [requests, errors]
            label_vertical: Requests
          
          connections:
            columns: [connections]
            label_vertical: Connections

The service YAML file details a number of things to track about this service: its name, site version and auto-scale group, the AMI and image packages, machine naming formats, zones and instance types, per-deployment instance counts, sharding, SSH bastions, alerts, paths, users, and the local scripts and monitoring to run.
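
As a worked example of the sharding and naming comments above, a sketch of how the shard number and machine name could be derived for each instance (helper names are illustrative; the striping rule follows the NOTE in the file):

def shard_for_instance(instance, shard_count):
    """Striped sharding: with shard_count=2 and 6 instances, shards are 0, 1, 0, 1, 0, 1."""
    return instance % shard_count

def machine_name(service_conf, instance):
    """Expand the machine_name format, e.g. www_middle_00 for instance 0."""
    return service_conf['machine_name'] % {'name': service_conf['name'], 'instance': instance}

# Shard 0 is instances 0, 2 and 4, with instance 0 as the write-master
print([shard_for_instance(i, 2) for i in range(6)])   # [0, 1, 0, 1, 0, 1]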

Deployment

YAML Files

Deployment is based on three distinct areas:

Production is assumed to be serving traffic for a service with some level of importance. Production deployments have their own accounts, separate from QA and Development, for security and manageability.

Staging is a QA or Development deployment that has been promoted and is being tested. Initially, Staging deployments are configured in the Production account, but do not take any production traffic. After manual testing confirms the Staging servers seem to be working, the percentage of production traffic can be slowly increased (1%, 10%, 25%, 50%, 75%, 100%) until all the production traffic is running on the Staging servers. At this point they can be promoted to become the Production servers, and the Staging deployment can be de-activated until another QA or Development deployment is promoted.

QA and Development deployments are run in their own accounts, and are the only machines developers and QA personnel have login access to. This allows manual configuration and testing, but ensures that any deployed servers do not require hand tuning, because they cannot be hand tuned after they have been promoted to Staging. If a problem is found, then a new QA deployment needs to be created from the Staging deployment, and the Staging deployment is de-activated.

This process is repeated until a Staging deployment makes it to 100% production traffic and is promoted to Production.