Improving web service downtime alerts by comparing Pingdom and Assertible

06/08/2017 Engineering Cody Reichert

At Assertible, we use several different API and web service monitoring tools to receive downtime notifications. We do this for several reasons:

To dogfood our own product (Assertible is an API monitoring tool)
To compare response times, which helps us improve alerts in Assertible
To analyze and compare error messages (we love helpful error messages)

One of our favorite tools is Pingdom, both for its simplicity and its power. However, Pingdom's default notifications don't include enough data to tell us anything useful about the problem.

Recently, there was a brief outage in SimplyRETS, one of the APIs we monitor with both Assertible and Pingdom. Our whole development team was away from the keyboard when the problem occurred, which sparked a conversation about just how important effective alerts and notifications are in an API monitoring tool.

When this downtime occurred, our Pingdom and Assertible alerts came in at roughly the same time. Yet, our Pingdom alerts were lacking a few critical pieces of information about the fault that Assertible provides by default. This information is vital to identifying and quickly hypothesising about the severity level of a fault at a glance.

AWS Elastic Beanstalk monitoring environment health

The web service that experienced downtime is hosted on AWS. It is critical for users whose own services depend on its availability.

As mentioned before, we dogfood our own product extensively, and SimplyRETS is one of several real-world web services we monitor with both Pingdom and Assertible. Both monitoring services check the API frequently and are configured to alert us when failures happen.

At about 1:22 PM, we received an alert from Pingdom that the service was down.

Pingdom uptime failure alert

The alert didn't tell us much, only that the system was DOWN. With the default Pingdom configuration, this was all the information we had to start hypothesising about potential problems.

SimplyRETS uses a continuous delivery pipeline that deploys the app every time a branch is merged to master. When we first received the Pingdom notification, we did notice that the service had just finished a recent deployment.

While our team began discussing the issue, the second notification from Assertible hit our emails within a minute of the Pingdom alert.

Assertible uptime failure alert

The Assertible notification immediately gave us more information about the failure: a simple test expecting 200 OK had received an HTTP 503 status code instead. From that alone, I was aware there was likely a transient problem with the deployment to AWS.

Unfortunately, we have experienced random 503 status codes on deployments before. This error occurs rarely enough that we haven't investigated it deeply, but often enough that we can identify the problem just by knowing the failing status code. Fortunately, the error immediately corrects itself with no noticeable downtime.

When I was able to open the AWS console and look at what was going on, I noticed a Degraded instance. Because I could still load the service manually in a browser, I concluded that the problem was the deployment, confirming my earlier hypothesis based simply on the Assertible notification.

AWS Elastic Beanstalk events log warning

I monitored the logs for some time and everything seemed fine. It took 6 minutes for the Pingdom notification to correct its status (back to UP).

The moral of the story is that key pieces of data from the HTTP request can be vital to responding to web service faults. Pingdom is a great service and is very useful. However, its default notifications could stand to be improved when it comes to HTTP-related errors.

Assertible solves this problem with more detailed alert messages (we take special care in crafting our error messages so that they are as relevant and quiet as possible).
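To make the difference concrete, here is a minimal sketch of an alert message that carries the HTTP details we found so useful: the expected status code, the one actually received, and the response time. This is not how Assertible or Pingdom build their notifications; `CheckResult`, `format_alert`, and the example URL are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    """Outcome of one HTTP check (hypothetical structure, for illustration only)."""
    url: str
    expected_status: int
    actual_status: Optional[int]  # None when no HTTP response was received
    elapsed_ms: float


def format_alert(result: CheckResult) -> str:
    """Build a one-line alert that includes the HTTP details of the check."""
    if result.actual_status == result.expected_status:
        return (
            f"UP: {result.url} returned {result.actual_status} "
            f"in {result.elapsed_ms:.0f} ms"
        )
    if result.actual_status is None:
        return (
            f"DOWN: {result.url} gave no HTTP response "
            f"(expected {result.expected_status})"
        )
    return (
        f"DOWN: {result.url} returned {result.actual_status}, "
        f"expected {result.expected_status} ({result.elapsed_ms:.0f} ms)"
    )


if __name__ == "__main__":
    # Example values only; the URL and timing are placeholders.
    failed = CheckResult("https://api.example.com/health", 200, 503, 412.0)
    print(format_alert(failed))
```

Compared with a bare "DOWN", a line like this lets whoever is on call tell a 503 during a deployment apart from a timeout or a 500 without opening a dashboard.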

In general, Assertible is better suited for data integrity checks across a wider array of endpoints. Furthermore, Assertible's Slack and Zapier integrations make it possible to send notifications to SMS or anywhere else, so that teams can respond quickly and have access to important data points immediately.

:: Cody Reichert
