Using Custom Metrics to Measure User Load in CloudWatch

Published by Ryan

Being a part-time devops engineer in a small shop limits the time (and money) you can spend load testing a web application or server API. And yet if you’re publishing a consumer application with hopes of growing your user base, it’s important to keep a finger on the pulse of your infrastructure. So how does a stretched dev team decide when to scale or optimize before the whole house of cards comes crashing down?

There are a lot of right answers, some even better than the one described here (e.g., auto-scaling groups), but for now we have EC2 instances and a pretty bomb monitoring tool provided by our friends at Amazon. Along with an unlimited supply of Diet Mountain Dew, our stretched devops person needs to work with what they’ve been given.

Enter CloudWatch

Setting up Amazon’s pre-defined metrics and alerts is pretty trivial. It’s a table-stakes measure for a team and a good start. Since there are a ton of resources available on the procedure, we’re going to skip straight to strategy. If you need white papers, Amazon’s are, so far, the best.

Anecdotally, our application suffered some common modes of failure as it scaled. The most common were:

  • Application server CPU peaks, causing requests to slow down or grind to a halt and in some cases, time out.
  • Unrotated application logs, phantom logs, unchecked operating-system logs, and stray pointer or session files filling disks
  • Database server CPU peaks caused by slow queries


Enter some basic graphs. CPU utilization and network traffic are a good start; during normal usage the two should change roughly in proportion to each other. We set up alerts for when CPU usage peaked above a threshold (~75%).
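As a sketch, that kind of alert can be created with the AWS CLI’s `put-metric-alarm` command. The instance ID and the SNS topic ARN below are placeholders, not values from our setup:

```shell
# Sketch: alarm when average CPU exceeds 75% over two 5-minute periods.
# Instance ID and SNS_TOPIC_ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-app-server" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 75 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "$SNS_TOPIC_ARN"
```

The same alarm can of course be clicked together in the console; the CLI form is just easier to show here.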

Disk space for EC2 instances was also important given our failure modes, but disk usage isn’t available among CloudWatch’s default metrics. Instead, we needed Amazon’s Perl-based custom monitoring scripts.
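For reference, once those scripts are installed, disk metrics are reported with flags like the following (the flags come from Amazon’s monitoring scripts documentation; the install path is an assumption on my part):

```shell
# Report root-volume disk utilization to CloudWatch.
# /opt/aws-scripts-mon is a typical install location, not a requirement.
/opt/aws-scripts-mon/mon-put-instance-data.pl \
  --disk-space-util --disk-space-used --disk-space-avail \
  --disk-path=/ --from-cron
```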

These aren’t the only graphs we set up for our infrastructure: we have MySQL/Aurora and Dynamo usage graphs, queue server and queue size monitoring, as well as network and load balancer monitoring. These are all great; we have a pulse, we can monitor trends, and we can get alerts when things start going amiss. But we still don’t really understand how many users our servers can support…
Enter The Meat

Our native mobile game connects to one of many Node servers, which process and return data during the user session. In a glorious moment, the dev team delivered an API endpoint that returns a local and global count of connected users on each server. We’re almost there; let’s get that into CloudWatch too!
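The response shape I’ll assume for that endpoint looks roughly like this; the field name matches what the script below reads, but the exact payload is your API’s business:

```shell
# Hypothetical response from the user-count endpoint; the monitoring
# script reads the local count from the top-level "data" field.
cat <<'EOF'
{ "data": 1289, "global": 5120 }
EOF
```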

To do this we need three things: our chopping block, our machete, and the AWS custom monitoring scripts mentioned above. Since Amazon’s team has done the heavy lifting for us, there’s no reason to reinvent the wheel, but as you are about to see, I AM NOT A PERL DEVELOPER. This was my first foray into Perl scripting: I needed to pull the necessary pieces from Amazon’s script and add an HTTP request library to contact my server API. I also added a snippet that counts established TCP connections at the operating-system level via netstat. If this file is confusing, I’d highly recommend reviewing the original monitoring scripts package first.
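That TCP connection count boils down to a small netstat pipeline. Here it is run against a canned sample of `netstat -ant` output so it works anywhere; in real use the sample function is replaced by `netstat -ant` itself, and port 81 matches the script’s `$asaPort`:

```shell
# Canned sample standing in for `netstat -ant` output.
sample_netstat() {
cat <<'EOF'
tcp        0      0 10.0.0.5:81     203.0.113.7:52314    ESTABLISHED
tcp        0      0 10.0.0.5:81     203.0.113.9:52390    ESTABLISHED
tcp        0      0 10.0.0.5:81     203.0.113.11:40112   TIME_WAIT
tcp        0      0 10.0.0.5:22     198.51.100.4:55012   ESTABLISHED
EOF
}

# Count ESTABLISHED connections on the app port, as the Perl script does.
sample_netstat | grep ':81 ' | grep EST | wc -l   # prints 2
```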

This new file (mon-put-user-data.pl) is designed to live in the same directory as the existing AWS examples (mon-put-instance-data.pl):

 

#!/usr/bin/perl
BEGIN
{
  use File::Basename;
  my $script_dir   = &File::Basename::dirname($0);
  push @INC, $script_dir;
}

use strict;
use warnings;
use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use JSON;
use Sys::Hostname;
use Getopt::Long;
use Sys::Syslog qw(:DEFAULT setlogsock);
use Sys::Syslog qw(:standard :macros);
use CloudWatchClient;
use constant { NOW => 0 };
use Data::Dumper;

#
# For cloudwatch
#

my $version = '0.1';
my $client_name = 'CloudWatch-PutUserData';
my $enable_compression;
my $aws_credential_file;
my $aws_access_key_id;
my $aws_secret_key;
my $aws_iam_role;
my $from_cron;
my $parse_result = 1;
my $parse_error = '';
my $argv_size = @ARGV;
my $mcount = 0;
my %params = ();
my $now = time();
my $timestamp   = CloudWatchClient::get_offset_time(NOW);
my $instance_id = CloudWatchClient::get_instance_id();

#
# Set default input DS, Namespace, Dimensions
#

$params{'Input'} = {};
my $input_ref = $params{'Input'};
$input_ref->{'Namespace'}="System/Linux";
my %xdims = (("InstanceId"=>$instance_id));

#
# Adds a new metric to the request
#

sub add_single_metric
{
  my $name = shift;
  my $unit = shift;
  my $value = shift;
  my $dims = shift;
  my $metric = {};

  $metric->{"MetricName"} = $name;
  $metric->{"Timestamp"} = $timestamp;
  $metric->{"RawValue"} = $value;
  $metric->{"Unit"} = $unit;

  my $dimensions = [];
  foreach my $key (sort keys %$dims)
  {
    push(@$dimensions, {"Name" => $key, "Value" => $dims->{$key}});
  }
  $metric->{"Dimensions"} = $dimensions;
  push(@{$input_ref->{'MetricData'}},  $metric);
  ++$mcount;
}

#
# Prints out or logs an error and then exits.
#

sub exit_with_error
{
  my $message = shift;
  report_message(LOG_ERR, $message);
  exit 1;
}

#
# Prints out or logs a message
#

sub report_message
{
  my $log_level = shift;
  my $message = shift;
  chomp $message;

  if ($from_cron)
  {
    setlogsock('unix');
    openlog($client_name, 'nofatal', LOG_USER);
    syslog($log_level, $message);
    closelog;
  }
  elsif ($log_level == LOG_ERR) {
    print STDERR "\nERROR: $message\n";
  }
  elsif ($log_level == LOG_WARNING) {
    print "\nWARNING: $message\n";
  }
  elsif ($log_level == LOG_INFO) {
    print "\nINFO: $message\n";
  }
}

{
  # Capture warnings from GetOptions
  local $SIG{__WARN__} = sub { $parse_error .= $_[0]; };

  $parse_result = GetOptions(
    'from-cron' => \$from_cron,
    'aws-credential-file:s' => \$aws_credential_file,
    'aws-access-key-id:s' => \$aws_access_key_id,
    'aws-secret-key:s' => \$aws_secret_key,
    'enable-compression' => \$enable_compression,
    'aws-iam-role:s' => \$aws_iam_role,
    );
}

if (!defined($instance_id) || length($instance_id) == 0) {
  exit_with_error("Cannot obtain instance id from EC2 meta-data.");
}

#
# Params for connecting with and talking to the server API
#

my $clientId     = '';
my $clientSecret = '';
my $clientPass   = '';
my $authEndpoint = 'https://path.to.auth';
my $userEndpoint = 'https://path.to.data';
my $asaPort      = 81;

#
# Collect data from netstat command
#

my $cxns = `netstat -ant | grep $asaPort | grep EST | wc -l`;
chomp $cxns;
add_single_metric("TCP Connections","Count",$cxns,\%xdims);

#
# Get auth token from core
#

my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(POST => $authEndpoint);

$req->header('response_type'=>'json');
$req->content_type('application/x-www-form-urlencoded');
$req->content('grant_type=client_credentials&client_id='.$clientId
   .'&client_secret='.$clientSecret);
my $res = $ua->request($req);

#
# check the authorization outcome
#
if ($res->is_success) {

   my $auth = decode_json($res->decoded_content);
   my $token= $auth->{'access_token'};

   #
   # Make the call for active user data
   #

   my $req  = HTTP::Request->new(GET => $userEndpoint);
   $req->header('Authorization'=>'Bearer '.$token);
   $req->header('response_type'=>'json');
   my $res = $ua->request($req);
   if ($res->is_success) {
      my $data = decode_json($res->decoded_content);
      my $users = $data->{'data'};
      # add_single_metric already increments $mcount
      add_single_metric("Active Users","Count", $users, \%xdims);
   }
   else {
      report_message(LOG_WARNING, "User data request failed: " . $res->status_line);
   }

   if($mcount > 0) {

      #
      # Attempt to send them to cloudwatch
      #

      my %opts = ();
      $opts{'aws-credential-file'} = $aws_credential_file;
      $opts{'aws-access-key-id'}   = $aws_access_key_id;
      $opts{'aws-secret-key'}      = $aws_secret_key;
      $opts{'retries'} = 2;
      $opts{'user-agent'} = "$client_name/$version";
      $opts{'enable_compression'} = 1 if ($enable_compression);
      $opts{'aws-iam-role'} = $aws_iam_role;

      my $response = CloudWatchClient::call_json('PutMetricData', \%params, \%opts);
      my $code    = $response->code;
      my $message = $response->message;

      if ($code == 200 && !$from_cron) {
        my $request_id = $response->headers->{'x-amzn-requestid'};
        print "Successfully reported metrics to CloudWatch. Reference Id: $request_id\n";
      }
      elsif ($code < 100) {
        exit_with_error($message);
      }
      elsif ($code != 200) {
        exit_with_error("Failed to call CloudWatch: HTTP $code. Message: $message");
      }

   } else {
      exit_with_error("No metrics were collected; nothing sent to CloudWatch.");
   }
}
else {
   exit_with_error("Auth request failed: " . $res->status_line);
}


View the script on GitHub.

Lastly, as with any custom monitoring script, a cron job has to be installed to run it on a schedule… mischief managed. With the metrics being sent, we can now access them in CloudWatch and add them to graphs, measuring active users against resource usage!
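A minimal crontab entry for this might look like the following; the five-minute interval and the install path are my choices, not requirements:

```shell
# Run the custom metrics script every five minutes, logging errors to syslog.
*/5 * * * * /opt/aws-scripts-mon/mon-put-user-data.pl --from-cron
```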

[Graph: active users vs. resource usage per server]

Looks like it’s time to take action on server 3. We have a few options available, from growing the instance size to better distributing the user load.

This opens up a variety of metric and devops KPI options, including estimating how much a single user costs to support in infrastructure, and lets us predict and project resource requirements at various growth rates. In the end, this is just one approach… the tip of the iceberg in terms of infrastructure strategy. But for a small shop with a single devops guy and limited resources, it’s invaluable insight that keeps servers running and the business growing.
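For instance, a back-of-the-envelope cost-per-user figure falls straight out of the new metric. The dollar amount and user count here are made-up numbers for illustration:

```shell
# Hypothetical monthly infra spend (USD) and peak concurrent users
# observed via the new CloudWatch metric.
monthly_cost=1200
peak_users=4000

# Monthly infrastructure cost per concurrent user.
awk "BEGIN { printf \"%.2f\n\", $monthly_cost / $peak_users }"   # prints 0.30
```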

Stay tuned: I’ll add more devops posts as we continue to refine our strategy.

Up next: optimizing at the application level to improve per-user cost.

