Amazon's move off Oracle caused Prime Day outage in one of its biggest warehouses, internal report says

Amazon is learning how hard it can be to move off Oracle's database software.

On Prime Day, while the e-retailer was dealing with a major website glitch that slowed sales, the company was also dealing with a technical problem in Ohio at one of its biggest warehouses, leading to thousands of delayed package deliveries, according to an internal report obtained by CNBC.

The problem was in large part due to Amazon's migration from Oracle's database to its own technology, the documents show. The outage underscores the challenge Amazon faces as it looks to move completely off Oracle's database by 2020, and how difficult it is to re-create Oracle's level of reliability. It also shows that Oracle's database is more efficient in some respects than Amazon's rival software, a point that Oracle will likely emphasize during this week's annual OpenWorld conference in San Francisco.

Following the Prime Day outage, Amazon engineers filled out a 25-page report, which Amazon calls a correction of error. It's a standard process that Amazon uses to try to understand why a particular incident took place and how to keep it from happening in the future.

The report shows that Amazon struggled to identify the root cause of the Prime Day issue because of a feature it lost after the database was moved over. It also failed to come up with a contingency plan in case of an error in its newly installed database, called Aurora PostgreSQL, the documents show.

In one question, engineers were asked why Amazon's warehouse database didn't face the same problem "during the previous peak when it was on Oracle." They responded by saying that "Oracle and Aurora PostgreSQL are two different [database] technologies" that handle "savepoints" differently.

Savepoints are an important database tool for tracking and recovering individual transactions. On Prime Day, an excessive number of savepoints was created, and Amazon's Aurora software wasn't able to handle the pressure, slowing down the overall database performance, the report said.
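For readers unfamiliar with the mechanism, the following is a minimal sketch of what a savepoint does. It is an illustration only: it uses Python's built-in sqlite3 module as a stand-in, not Aurora PostgreSQL or Oracle, and the table name, savepoint name and function name are hypothetical. It does not reproduce the behavior described in Amazon's report.

```python
import sqlite3


def demo_savepoint():
    """Roll back one step of a transaction without discarding the rest."""
    # isolation_level=None puts the connection in autocommit mode, so the
    # explicit BEGIN / COMMIT statements below control the transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE packages (id INTEGER PRIMARY KEY, status TEXT)")

    conn.execute("BEGIN")
    conn.execute("INSERT INTO packages VALUES (1, 'picked')")

    # Mark a point inside the transaction that can be returned to later.
    conn.execute("SAVEPOINT before_update")
    conn.execute("UPDATE packages SET status = 'lost' WHERE id = 1")

    # Undo only the work done since the savepoint; the INSERT survives.
    conn.execute("ROLLBACK TO SAVEPOINT before_update")
    # Releasing the savepoint removes the marker once it is no longer needed.
    conn.execute("RELEASE SAVEPOINT before_update")
    conn.execute("COMMIT")

    rows = conn.execute("SELECT id, status FROM packages").fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo_savepoint())
```

In this sketch the application creates one savepoint and releases it. An application that keeps creating savepoints inside a retry loop would accumulate many more of them, and how well a database copes with that depends on the engine.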

Could have happened anyway

"It's quite possible the outage would not have occurred if Amazon had stuck with Oracle," said Matt Caesar, a computer science professor at the University of Illinois at Urbana-Champaign, after CNBC shared the details of the document. "Also, it appears they would have been able to diagnose the problem sooner if they were using Oracle's database, which could possibly have reduced the outage duration."

An Amazon spokesperson played down the issue in an emailed statement and said there was no outage, even though the internal document states that the database "degradation resulted in lags and complete outages."

"It is important to point out that there was never an outage at the facility, and the issue only resulted in delaying shipping of about one percent of packages for a short period of time," the spokesperson said. "This issue was quickly diagnosed and resolved."

The Ohio warehouse is the largest of the 13 warehouses that moved their databases off Oracle prior to Prime Day. During the Prime Day period, it handled over 1.1 million packages per day, the documents say. All services and software that handle inventory and shipping data had been migrated to Aurora in those warehouses.

The outage, which lasted for hours on Prime Day, resulted in over 15,000 delayed packages and roughly $90,000 in wasted labor costs, according to the report. Those costs don't include all the lost hours spent by engineers troubleshooting and fixing the errors or any potential lost sales.

In a section titled "Lessons Learned," Amazon engineers wrote that "Savepoint behaves differently in Aurora PostgreSQL than in Oracle," suggesting Oracle's software would have handled the issue more efficiently. The report also says SQL statement data did not exist for analysis in PostgreSQL, and having access to that data "would have helped pinpoint" the root cause of the problem.

The outage may have been less severe had Amazon been more prepared. In one part of the document, the company said it "took a long time to mitigate" the problem because of a "lack of a reaction plan when the underlying PostgreSQL DB experiences performance issues." The document also said a "well-established reaction plan or runbook" could have helped "mitigate the impact sooner."

"My guess is that they changed databases a while ago, didn't test the exact load model that occurred during their Amazon Prime Day and got surprised, badly," Henning Schulzrinne, a computer science professor at Columbia University, said after reviewing the document.

Amazon and Oracle have been in a heated battle of words in recent years, as Amazon has expanded its software offerings to more directly compete with Oracle. CNBC reported in August that Amazon is working on moving its entire database off Oracle by early 2020.

'It's really, really hard'

Oracle Chairman and co-founder Larry Ellison isn't buying it. On the company's earnings call in December, Ellison said Amazon "is not moving off of Oracle." He reiterated his point at an August event, saying, "I don't think they can do it."

"They've had 10 years to get off Oracle, and they're still on Oracle," he said. "And it's not going to be easy for them to use their own technology. It's not going to be cost-effective. I mean, it's really, really hard."

Patrick Moorhead, principal analyst at Moor Insights & Strategy, said the incident shows how hard it is for older applications, like those used in Amazon's warehouses, to move off Oracle, which has spent decades working with the world's largest enterprises.

"AWS Aurora is designed for forward-looking applications and Oracle for more legacy applications," he said.

Amazon emphasizes that the problem in this warehouse was completely unrelated to the outages on the Amazon website on Prime Day, and offered the following statement:

There is a separate internal document from the Amazon Retail team, that CNBC apparently obtained, detailing an issue in a single fulfillment center (out of more than 185 worldwide) that led to the slowing of processing in the fulfillment operations and possibly a slight delay in shipping of products from that facility alone. It is important to point out that there was never an outage at the facility, and the issue only resulted in delaying shipping of about one percent of packages for a short period of time. It's also worth noting that Amazon Aurora did not create a large number of savepoints for the simple reasons that applications create savepoints, and in this instance the application created too many when entering an error loop. This issue was quickly diagnosed and resolved.

Update: This story has been updated to reflect the fact that Amazon creates error reports for incidents of various sizes, not only major incidents. It has also been updated to include Amazon's response.
