Incident report for May 3rd 2017

Magda Neagu - Wednesday, May 03, 2017

As you are probably aware, our Australian data center has been experiencing outages since Sunday, and they are continuing to this day. We want to apologize for any inconvenience this might have caused you, and to offer a detailed explanation of what happened and the measures we are taking to prevent it in the future.

Three days ago we were confronted with a DDoS (Distributed Denial of Service) attack that caused downtime and/or performance degradation in our Australian data center. This has been the most sustained and aggressive incident in the history of Business Catalyst, and our operations team, together with our partners at Amazon, is still working to contain the impact on the sites hosted in the affected area. We usually hold off on incident announcements until the incident is over and we can conduct a proper investigation, but in this particular case we recognize that both partners and customers need information now, so we are providing this interim report. A full analysis will only take place after the work is complete, but at this time these are the most important elements of the incident, in the order in which they happened.

  • The incident started on May 1st at 07:49 AEST, with performance degradation alerts raised by our monitors.
  • Our incident team was assembled shortly thereafter to establish the nature of the incident and the possible solutions.
  • Sites with DNS hosted by BC were back online at 13:32 AEST, as they gave the Ops team more tools to work with and extra flexibility.
  • We reached a first level of stabilization at 17:27 AEST, with extra throttling capacity introduced by the DDoS response team at AWS.
  • The attack continued through Tuesday and Wednesday, with tens of millions of fraudulent IPs blocked by our team in the meantime in an attempt to keep the malicious traffic from reaching BC sites. The IPs were distributed across multiple geographies, and the Adobe and AWS teams together deployed several counter-measures at the network device level.
  • On Wednesday at 13:00 AEST, the Amazon team deployed an additional layer of throttling to increase the availability of the data center.
  • As we write this incident report, on May 3rd at 18:00 AEST, the incident is still ongoing. We have managed to bring most of the sites back online, with some connections timing out but becoming available after a simple page refresh. At the moment, some legitimate IPs may also be blocked by the aggressive filtering put in place, but the Ops team is working right now on separating the legitimate traffic from the malicious traffic (a simplified illustration of this kind of filtering follows this timeline).
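
For readers curious what this kind of filtering involves, the sketch below is a deliberately simplified illustration of per-IP rate limiting, written in Python. It is not the actual counter-measure deployed by the Adobe and AWS teams, which operates at the network device level; the window length, request threshold, and blocklist handling are assumptions made purely for illustration.

    # Simplified illustration only; not the actual BC/AWS counter-measures,
    # which operate at the network device level. The idea: an IP that sends
    # far more requests than a normal visitor inside a short time window is
    # treated as malicious and its traffic is dropped.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10    # assumed sliding-window length
    MAX_REQUESTS = 100     # assumed per-IP threshold within the window

    recent_requests = defaultdict(deque)   # ip -> timestamps of recent requests
    blocked_ips = set()                    # IPs currently treated as malicious

    def allow_request(ip, now=None):
        """Return True if the request should reach the site, False if dropped."""
        now = time.time() if now is None else now
        if ip in blocked_ips:
            return False
        window = recent_requests[ip]
        window.append(now)
        # Forget timestamps that have fallen out of the sliding window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_REQUESTS:
            # Rate exceeded: block the IP. Aggressive thresholds like this
            # are also how legitimate IPs can occasionally get caught.
            blocked_ips.add(ip)
            return False
        return True

The sketch also shows why aggressive thresholds can temporarily catch legitimate IPs, which is the behavior some visitors are seeing at the moment.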

We understand that this extended incident has put a strain on you and your customers and partners, but we want to assure you that the severity and scale of this incident are on a level never before reached in the history of the product. We are doing our best to contain it and bring all sites back online, and our team has been working around the clock, together with our partners at Amazon, to make sure the impact is as limited as possible under the circumstances.

Any other important details will be covered in a future report once the system returns to normal operating parameters. As a measure that would help increase the availability of your site in this instance (as well as in any similar future incident), we strongly advise you to switch your DNS management to BC. This allows our Ops team to dynamically change the IP of your site and have it back up in a matter of hours while they work on containing the incident.
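
To make the reasoning behind this recommendation concrete: when BC hosts your DNS, the Ops team can re-point your site's A record to a healthy IP, and visitors pick up the change as soon as the record's TTL expires. The short Python sketch below, using the third-party dnspython package, simply checks which A record and TTL a domain currently serves; example.com is a placeholder for your own domain, and none of this is a BC API.

    # Illustration only: check which A record and TTL a domain currently serves.
    # A low TTL means a re-pointed IP reaches visitors quickly.
    # Requires the third-party "dnspython" package (pip install dnspython);
    # "example.com" is a placeholder for your own domain.
    import dns.resolver

    answer = dns.resolver.resolve("example.com", "A")
    print("Current A record(s):", [rdata.address for rdata in answer])
    print("TTL (seconds):", answer.rrset.ttl)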

We apologize for any inconvenience this incident may have caused you, and assure you of our team’s full commitment to bringing the service back to normal parameters.

The BC Team

