Intermittent Connectivity currently in one U.S. Datacenter

Update 6/3:

At approximately 01:30 UTC on May 30, 2015, the power utility (PG&E) experienced an outage affecting one of our infrastructure provider’s datacenters. Seven of the facility’s eight generators started correctly and provided uninterrupted power. Unfortunately, one generator suffered an electromechanical failure and did not start. This caused an outage that affected all servers we host in that datacenter.

PG&E was in contact with the datacenter and gave an initial estimated time of restoration (ETR) for utility power of 04:30 UTC. This was later revised to 05:00 UTC and then to 06:30 UTC. Utility power was actually restored at 06:05 UTC.

The maintenance vendor for the generator dispatched a technician to the datacenter, who determined that a battery used for starting the generator had failed under load. The technician subsequently replaced the batteries. The generators are tested monthly, and the failed generator passed all of its checks two weeks prior to the outage. It was also tested under load earlier in the month.

The UPS system and its batteries did not suffer a failure.

As soon as the outage occurred, our infrastructure provider’s engineers verified that it was indeed power related and remained on standby for over four hours, waiting for power to be restored.

Several servers did not survive the sudden loss of power and needed individual attention. Our infrastructure provider’s engineers worked well after the power was restored to repair these systems and make them operational again, using both hot and cold spare components. They were able to recover every system.

WP Engine apologizes for this power interruption and any inconvenience it has caused you. We sincerely appreciate your business and are committed to providing the best service possible. Our infrastructure provider is reevaluating its maintenance procedures and adding tests for this battery condition.

Update 8:15AM Central: All but 3 servers are fully functional and back online. We will be reaching out directly to the customers on those 3 servers with further steps and notifications. As such, we will be closing this issue on our Status Page.

Update 3:00AM Central: The vast majority of servers are fully functional. We are working with our infrastructure provider on some that have not come back gracefully.

Update 2:40AM Central: We have received an update that our infrastructure provider is beginning to bring services back online. Some affected customers have now seen service restored, while others are still impacted. We’ll continue to monitor until everything is fully resolved.

Update 12:35AM Central: We have received an update from our infrastructure provider with a new tentative ETA of 1:30AM Central to resume service. We are in constant communication with them and will update this post if anything changes.

Update 11:38PM Central: We have received an update from our infrastructure provider with a new tentative ETA of 12:00AM Central to resume service. We are in constant communication with them and will update this post if anything changes.

Update 10:01PM Central: We have received an update from our infrastructure provider with a very tentative ETA of 11:30PM Central to resume service. We are in constant communication with them and will update this post if anything changes.

Update 9:31PM Central: Our infrastructure provider has informed us that a power failure has impacted the network and they are working with the power provider to restore service.

Hello,

One of our infrastructure providers has alerted us to intermittent connectivity in one of their datacenters. We have observed monitoring alerts for servers that reside in this datacenter. If your site is currently not loading or is loading slowly, this is most likely the cause. Our infrastructure provider is currently investigating, and we will provide an update as soon as we have one.