We’d like to give you an update on our recent stability issues.
As you know, BC usage keeps growing and we serve billions of page requests per month, so we are continuously upgrading the underlying infrastructure for BC: operating systems, database versions, a move to SSD storage for faster performance, and various other subsystems. Because we are also preparing to deploy BC.next, we are accelerating some of these upgrades so that we enter next year ready for more growth with no customer impact.
One of the subsystems we upgraded was our load balancers, a pair of gateways to the whole BC datacenter designed to distribute load evenly across all web servers. Unfortunately, we are facing a load balancer bug that causes the underlying machine(s) to restart randomly, which in turn causes downtime for the affected sites.
We have installed a premium load balancer with premium support, and have been working closely both on our side and with the vendor to get this fixed. The vendor has identified the cause of the instability in their code and promised to issue a patch for it.
However, given the critical situation we are facing, and given that we don’t have an ETA for the load balancer fix, we have decided to double down on our investment in this area, which caused most of the recent incidents.
- First, we have already added more load balancer machines, so that if anything happens to one of them, we have a failover instance up and running, ready to take over.
- Second, we are working on a fast failover mechanism that will reduce the impact of a load balancer crash to 1-2 minutes. We plan to deploy this by Wednesday.
- Third, we are also investigating the possibility of returning the currently upgraded load balancer software to the vendor and switching to a different provider, after rigorous testing.
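To illustrate the general idea behind a failover mechanism like the one in the second point, here is a minimal sketch. It is not our production code: the class name `FailoverMonitor`, the instance names, and the `max_failures` threshold are all hypothetical, and real health checks would probe the machines over the network rather than receive a boolean.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailoverMonitor:
    """Tracks which load balancer is active (hypothetical illustration)."""
    balancers: List[str]      # e.g. ["lb-a", "lb-b"]; names are made up
    max_failures: int = 3     # assumed threshold, not a real setting
    active_index: int = 0
    failures: int = 0

    @property
    def active(self) -> str:
        return self.balancers[self.active_index]

    def record_check(self, healthy: bool) -> str:
        """Record one health-check result and return the active balancer."""
        if healthy:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Too many consecutive failures: switch to the next instance.
                self.active_index = (self.active_index + 1) % len(self.balancers)
                self.failures = 0
        return self.active
```

The shorter the interval between checks and the lower the failure threshold, the faster traffic moves to the standby instance, at the cost of more false alarms.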
Above all, we remain alert to minimize any impact in the meantime, and we will keep you updated on our progress.
Another area where we’re making significant upgrades is the BC databases, where we’re moving to SSD and upgrading to the latest SQL Server version.
So far we have completed the upgrades in Sydney, and after one month of active load testing we can now continue deploying the upgrades in the EU and US datacenters.
For the EU datacenter, we are performing database maintenance this Sunday, scheduled at an hour that will ensure minimal customer impact. The US datacenter will be upgraded in late November.
Finally, we are working to make the BC Status Page monitors more accurate.
Today, these monitors show up as “green” if the majority of customers in a datacenter are up, even when some customers might have issues accessing their sites. This is a consequence of our “cell architecture”, where we split a datacenter into relatively independent cells to ensure that the rest of the sites stay up if one cell has problems.
We are changing the monitors to be more aggressive and show up as “yellow” or “red” even when just a subset of the sites is down.
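The difference between the two behaviours can be sketched roughly as follows. This is only an illustration: the function names and the `red_fraction` threshold are hypothetical, and what the current monitor shows when most sites are down is not described above, so “red” is assumed for that case.

```python
def majority_status(sites_up: int, sites_total: int) -> str:
    """Current behaviour: green as long as most sites are up."""
    if sites_total <= 0:
        raise ValueError("sites_total must be positive")
    # Assumption: anything short of a majority is reported as red.
    return "green" if sites_up * 2 > sites_total else "red"


def strict_status(sites_up: int, sites_total: int,
                  red_fraction: float = 0.5) -> str:
    """Planned behaviour: any site being down changes the colour."""
    if sites_total <= 0:
        raise ValueError("sites_total must be positive")
    sites_down = sites_total - sites_up
    if sites_down == 0:
        return "green"
    # red_fraction is a made-up threshold for this sketch.
    if sites_down / sites_total >= red_fraction:
        return "red"
    return "yellow"
```

With 95 of 100 sites up, the first function still reports “green”, while the second reports “yellow”.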
We sincerely thank you for the patience you have shown during these incidents, and we assure you we are doing our best to keep them under control. The incidents are transient in nature, as we are finalizing the massive wave of upgrades needed for BC.next, and we are changing our processes to test changes more rigorously before deployment to minimize impact.
The Business Catalyst Team.