Incident report - recent downtime for AU data center

Cristinel Anastasoaie - Wednesday, October 01, 2014

As you are probably aware, sites on all data centers have experienced some downtime over the past two weeks. First, we apologize for any inconvenience this might have caused you. Below is a detailed explanation of what happened and of the measures we're taking to prevent this in the future.

Starting on September 26th, sites on all data centers began experiencing intermittent downtime. Sites on our Asia Pacific data center experienced longer and more frequent periods of downtime than we had announced in the AWS maintenance blog post.

The downtime was caused by three distinct events and was amplified by their timing:

  • Amazon AWS infrastructure upgrade - this operation involved many server restarts and failing over from one Amazon availability zone to another and then back (basically, we had to execute a scheduled disaster recovery procedure). Our team worked 24/7 to make this major AWS-wide infrastructure upgrade as smooth as possible for all our customers. During these procedures, sites on the data center under maintenance became totally unavailable, while sites on the other two data centers kept their front-ends running but had most of the back-end services disabled, because we needed to stop the data replication between data centers. Although Amazon performed the restarts outside business hours for each region, the restarts of the NA and Europe data centers fell during AU business hours and thus had some impact on all sites by preventing customers from accessing some of the back-end services. We are looking into architectural changes that will limit the impact of such operations in one data center on the others.
  • Load balancer crash - this week we encountered a load balancer crash. We worked with the vendor to identify the root cause and decided to upgrade the system’s firmware. This procedure is almost complete, and we are closely monitoring the load balancer for any unforeseen issues that might arise.
  • Network connectivity issue - a network connectivity issue between Amazon data centers triggered an automatic failover of the database servers to the backup servers. This type of operation usually causes downtime of up to several minutes. We are currently trying to identify a network architecture change that could help mitigate this type of occurrence.

Once again, our apologies for any inconvenience this incident might have caused. Both our team and Amazon are fully committed to providing the utmost level of security and reliability to all our customers, and we continuously dedicate effort to improving on these fronts.

Sincerely,

The Adobe Business Catalyst Team
