At Assertible, we use several different API and web service monitoring tools to receive downtime notifications. We do this for several reasons:
- To dogfood our own product (Assertible is an API monitoring tool)
- To compare response times, which helps us improve alerts in Assertible
- To analyze and compare error messages (we love helpful error messages)
One of our favorite tools is Pingdom, both for its simplicity and power. However, Pingdom's default notifications don't provide enough data to tell us anything useful about the problem.
Recently, there was a brief outage in SimplyRETS, one of the APIs we monitor with both Assertible and Pingdom. Our whole development team was away from the keyboard when this problem occurred, which sparked a conversation about just how important effective alerts and notifications are in an API monitoring tool.
When this downtime occurred, our Pingdom and Assertible alerts came in at roughly the same time. Yet, our Pingdom alerts were lacking a few critical pieces of information about the fault that Assertible provides by default. This information is vital to identifying and quickly hypothesising about the severity level of a fault at a glance.
The web service that experienced downtime is hosted on AWS. It is critical for users who depend on its availability to run their own services.
As mentioned before, we dogfood our own product extensively, and SimplyRETS is one of several real-world web services we monitor with both Pingdom and Assertible. Both monitoring services check the API frequently and are set to alert us when failures happen.
At about 1:22 PM, we received an alert from Pingdom that the service was down.
The alert didn't tell us much, only that the system was DOWN. With the default Pingdom configuration, this was all the information we had to start hypothesising about potential problems.
SimplyRETS uses a continuous delivery pipeline that deploys the app every time a branch is merged to master. When we first received the Pingdom notification, we did notice that the service had just finished a recent deployment.
While our team began discussing the issue, the second notification from Assertible hit our emails within a minute of the Pingdom alert.
The Assertible notification immediately gave us more information about the failure. I was aware there was likely a transient problem with the deployment to AWS. The Assertible notification indicated there was an HTTP 503 status code on a simple 200 OK test.
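To make the difference concrete, here is a minimal sketch of a "simple 200 OK test" whose alert text carries the observed status code. It is not how Assertible or Pingdom implement their checks; the names `CheckResult`, `format_alert`, and `check_status` and the example URL are hypothetical, and the code uses only Python's standard library.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    """Outcome of one status check (hypothetical structure for illustration)."""
    url: str
    expected_status: int
    actual_status: Optional[int]  # None when no HTTP response was received
    error: Optional[str] = None

    @property
    def passed(self) -> bool:
        return self.actual_status == self.expected_status


def format_alert(result: CheckResult) -> str:
    """Build a one-line alert that includes the observed status code."""
    if result.passed:
        return f"UP: {result.url} returned {result.actual_status}"
    if result.actual_status is None:
        return f"DOWN: {result.url} gave no HTTP response ({result.error})"
    return (
        f"DOWN: {result.url} returned HTTP {result.actual_status}, "
        f"expected {result.expected_status}"
    )


def check_status(url: str, expected_status: int = 200,
                 timeout: float = 10.0) -> CheckResult:
    """Request the URL once and record the status code that came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return CheckResult(url, expected_status, response.status)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses raise HTTPError; the code is still useful data.
        return CheckResult(url, expected_status, exc.code)
    except (urllib.error.URLError, OSError) as exc:
        # DNS failures, refused connections, and timeouts land here.
        return CheckResult(url, expected_status, None, error=str(exc))
```

For a result such as `CheckResult("https://api.example.com/health", 200, 503)`, `format_alert` produces `DOWN: https://api.example.com/health returned HTTP 503, expected 200`. That one extra number is the detail a bare "DOWN" message leaves out.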
Unfortunately, we have experienced random 503 status codes on deployments before. This error occurs rarely enough that we haven't investigated it deeply, but often enough that we can identify the problem just by knowing the failing status code. Fortunately, the error immediately corrects itself with no noticeable downtime.
When I was able to open up the AWS console and look at what was going on, I noticed a Degraded instance. Because I was able to load the service in a browser manually, I immediately concluded that the problem was the deployment, confirming my earlier hypothesis based simply on the Assertible notification.
I monitored the logs for some time and everything seemed fine. It took 6 minutes for the Pingdom notification to correct its status (back to UP).
The moral of the story is that key pieces of data from the HTTP request can be vital to responding to web service faults. Pingdom is a great service and is very useful. However, its default notifications could stand to be improved when it comes to HTTP-related errors.
Assertible solves this problem with more detailed alert messages (we take special care in crafting our error messages so that they are as relevant and quiet as possible).
In general, Assertible is better suited for data integrity checks across a wider array of endpoints. Furthermore, Assertible's Slack and Zapier integrations make it possible to send notifications to SMS or anywhere else, so that teams can respond quickly and have immediate access to important data points.
:: Cody Reichert