Probably few of you noticed, but we suffered some downtime, mainly early yesterday morning from around 12:30 to 8:30 a.m. Our main WWW server tanked completely. Fortunately, we were able to get it back up and running relatively quickly, but as with any business, this isn’t something we want happening on a regular basis. The last thing we need is a reputation like Twitter’s for downtime (though we would love their popularity). We figured others might enjoy hearing how we are working to rectify this problem.

1. Full system redundancy is in our long-term plan. Every system will be in a cluster, and the failure of any single node will have no visible effect on the system.

2. We have an account with mon.itor.us. It’s a free ping service with a nice dashboard (you should take a look; they also let us compare our response times to those of competitors). We also purchased an SMS package from them. This means that morning, noon, or night (24x7x365), staff will now receive text messages any time the website goes down. This will allow rapid and effective response to issues (though perhaps to the detriment of our engineers’ sleep cycles).

3. We have an account with ScoutApp. ScoutApp allows us to monitor and receive notifications relating to CPU, memory, and disk utilization - not to mention the ability to monitor slow MySQL queries.
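
For the curious, here is a rough idea of what checks like these boil down to. This is a minimal Python sketch for illustration only, not how mon.itor.us or ScoutApp actually work internally. The function names, the `https://www.example.com/` URL, and the "three failed checks in a row" rule are all our own assumptions.

```python
import http.client
import shutil
import urllib.error
import urllib.request
from typing import Sequence


def site_is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Timeouts, DNS failures, refused connections, HTTP error statuses,
        # and malformed URLs all count as "down" for this sketch.
        return False


def disk_percent_used(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def should_alert(recent_checks: Sequence[bool], failures_needed: int = 3) -> bool:
    """Alert only after several consecutive failed checks (hypothetical rule).

    recent_checks holds the results of past checks, oldest first,
    where True means the check passed.
    """
    if len(recent_checks) < failures_needed:
        return False
    return not any(recent_checks[-failures_needed:])


# Example usage (hypothetical URL, so left commented out):
# history = [site_is_up("https://www.example.com/") for _ in range(3)]
# if should_alert(history):
#     print("Site looks down, time to wake somebody up")
```

Requiring a few failures in a row before alerting is one way to avoid sending a text message over a single dropped request. The catch, of course, is that a script like this only helps if it runs somewhere other than the servers and network it is watching.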

Some may wonder why we have chosen to utilize hosted solutions to provide our systems monitoring rather than an on-site solution. It’s SaaS baby! Okay, I just felt like saying that. But as most of you probably already know, SaaS stands for Software as a Service and has significant advantages over on-site hosting in many situations. In fact, we have built our entire service on a combination of SaaS and cloud-based technologies.

Because we haven’t deployed a traditional monitoring solution like AdventNet’s OpManager or Quest’s Big Brother, we are able to reduce our attack surface significantly (OpManager, for example, runs its own instance of Apache and MySQL). We’ve also decreased in-house system requirements. While we could theoretically have installed a monitoring solution on a web or database server, we like (and believe in) separating roles onto separate servers (whether physical or virtual), and that would have meant additional server acquisition, setup, and management costs. Finally, we’ve detached our monitoring solution from our in-house network. What happens when your in-house network crashes entirely (e.g. someone digs up both of your fiber connections at the exact same moment)? You don’t receive notifications, because your monitoring application is down as well.

Using outsourced systems monitoring thus provides three significant advantages: (a) increased security, (b) decreased in-house equipment and management, and (c) better monitoring of major system issues.
