Pay for usage sounds good, but how you use the database is key.
Migrating our production database from Postgres to Aurora was easy, until we noticed that our daily database costs more than doubled. Our increased costs were driven by storage IO, a type of usage we’d never before optimized or even measured. We resized our instances, shifted reads to the read replica, and re-packed our tables to reduce bloat. Together, these changes brought our costs back down close to pre-migration levels, and improved the overall efficiency of our database.
Our production Postgres database was running fine on AWS Postgres RDS. There was one issue: we spent a lot on storage. We provisioned and paid for 6 terabytes of storage, but only using about 3 terabytes. The only way to reduce the storage size was to do a full migration to a new instance. We’d still need to strike a balance between room for growth and paying for unused storage.
Aurora interested us because it does not require provisioning a fixed amount of storage when creating the instance. Instead, storage grows or shrinks elastically as needed. It’s also Postgres-compatible, requiring no code changes. Finally, the migration itself couldn’t be simpler: spin up an Aurora read replica of the current live Postgres database, wait for it to catch up, then promote it to read-write and update our servers to connect to the new instance.
Earlier in the year, we moved our smaller Workflow database to Aurora, and everything went well. We were ready to move the much larger and busier database for our Energy analytics service. Unlike our Workflow database, our Energy database runs tens of thousands of batch analytics jobs: each day, we read utility meter readings and bills and write hundreds of thousands of analytic results. We set up a new Aurora deployment with similar instance sizes to our Postgres instances, switched our applications to the new instance, and verified that the application was working fine. We thought we were finished with a very smooth production database upgrade.
A few days later, we noticed that our database costs were much higher than we expected:
What’s an IOP?!
From AWS Cost Explorer, we could tell that our database costs were now dominated by something called StorageIOUsage. This was not billed separately for Postgres, but is part of Aurora’s pay for usage pricing model. The Amazon Aurora pricing page includes a “I/O Rate $0.22 per 1 million requests” line item, but doesn’t explain further. The FAQ page has more detail:
IOs are input/output operations performed by the Aurora database engine against its SSD-based virtualized storage layer. Every database page read operation counts as one IO. The Aurora database engine issues reads against the storage layer in order to fetch database pages not present in the buffer cache. Each database page is 16KB in Aurora MySQL and 8KB in Aurora PostgreSQL.
We’d never measured or considered this metric before. Even if we had, it likely would not have helped: in a forum post, an AWS engineer writes that comparing IOPs between Postgres and Aurora is “less comparing apples and oranges as it is comparing apples and, say, a cheeseburger. Aurora uses a log based storage system as opposed to a data block based system for traditional PostgreSQL. As such the writes are very different.”
We didn’t fully understand what was driving this usage, but we knew we needed to start tracking it. But how? The StorageIOUsage line item in Cost Explorer doesn’t match any CloudWatch metrics. The bottom of the RDS Monitoring page shows charts for [Billed] Volume Read IOPS (Count) and [Billed] Volume Write IOPS (Count), but we didn’t see metrics with these names in our Datadog metrics. A table on the Monitoring Amazon Aurora metrics with Amazon CloudWatch page explains that some metrics have different display names: [Billed] Volume read IOPs and [Billed] Volume write IOPs are also known as VolumeReadIOPs / VolumeWriteIOPs. We set up Datadog charts for
aws.rds.volume_write_iops so that we could track this important new-to-us metric in near real-time. It’s unfortunate that such a key billing measure is so difficult to track down.
An article from the AWS Database blog Reducing Aurora PostgreSQL storage I/O costs suggested making sure the Buffer Cache Hit Ratio (
aws.rds.buffer_cache_hit_ratio in Datadog) stays near 99%. Our AWS account team told us that 75% of the instance memory is used for cache, and that any data loaded from the cache is not a billed IOP. The easiest way to increase cache size is to upsize to a large instance with more memory. We upsized the read-write instance to the next size (db.r5.2xl), doubling the memory.
Our account team also suggested we shift read-only traffic to the read replica instance. But IOPs from either instance cost the same; why does this help? The reason is memory and cache: sending traffic to the read replica means we can make use of the memory allocated to both instances, instead of just one. After moving a large chunk of our read traffic, we noticed that the read replica wasn’t nearly as busy as the read-write instance. We downsized our read replica to a smaller instance (db.r5.large); the Buffer Cache Hit Ratio remained consistently over 98%.
Buffer cache hit ratio before and after upsizing the read write instance (blue) and downsizing the read replica (purple) on Jan 6. Before changing the sizes, the read replica cache was underutilized, and the read write instance often dipped below a 98% cache hit ratio.
Our multi-availability zone Postgres RDS deployment ran on a single m4.xlarge instance with 4 vCPU and 16G of memory. We paid for another m4.xlarge instance as a hot standby, but it was not usable by our application. We saw regular spikes to over 80% CPU utilization, so it was probably time to upsize these instances. With Aurora, we send read-write traffic to an r5.2xlarge instance with 8 vCPU and 64G of memory, and read traffic to an r5.large with 2 vCPU and 16G of memory. Unlike Postgres RDS, we can take full advantage of the CPU and memory on the read replica for everyday read traffic, while still being able to quickly promote it to a read-write instance in case of disaster.
Table bloat drives IO usage
Our AWS team suggested that we look for tables with high bloat, as an indication that vacuum processes were not keeping up. This turned out to be the key to reducing our IO usage. Using a query from the Postgres Wiki, we found that several of our large analytic tables were quite bloated: they had many times more dead tuples than live tuples. This didn’t matter when we had more storage than we needed, and didn’t pay for IO operations. We had appropriate indexes on our tables, and our queries seemed fast enough. But now that we’re paying to read each database page, the content of those pages makes a big difference. For example, we might need to load six pages (48KB) of data from storage to get one page (8KB) worth of live data. This additional data was not at all useful (or even visible) to our application, but still charged by AWS as storage IO usage.
Once we identified the bloated tables, we needed to clean them up. We decided to try pg_repack, as described in Dealing with significant Postgres database bloat. While this generates a lot of IO usage, it has the major advantage of not requiring locks on the tables, or downtime for the application that use the tables. Over the course of two days, we ran pg_repack on more than a dozen tables. This reduced the size of the tables from hundreds of gigabytes to tens of gigabytes, and reduced our total database size from three terabytes to one. After a spike in IO usage during the repacking, we saw a dramatic drop in IO usage. After cleaning up a few more smaller but often-used tables, we reduced our database costs to close to what they had been before the migration.
aws.rds.volume_read_iops (light blue) and
aws.rds.volume_write_iops (dark blue). The higher values on Jan 5 and Jan 8 indicate when we ran pg_repack to reduce bloat. The dashed line shows the IOPs cost of 1.2 times the instance cost, the average reported by AWS.
Was it worth it?
One of the reasons we considered moving to Aurora was that it would be a simple change. Instead, we ended up spending a lot of time learning about database internals and Aurora-specific monitoring. The Aurora pricing model is different than the Postgres model, and how we use the database turned out to make a big difference. We still don’t fully understand how (or how not) to write our application to minimize IOPs. It’s not as simple as running an explain on a query and checking for index usage: IO usage depends on the size and competing demand for the Aurora instance’s cache. This makes it difficult to test without production-size servers and load. To avoid surprises when moving to Aurora, it’s essential to run a test that mirrors production as much as possible. This is the only way to get a full picture of how costs might change.
There are definitely benefits to being on Aurora. Our AWS account team was very responsive and provided useful suggestions and guidance. It’s nice to take advantage of our second (read replica) instance, versus having a hot standby locked away for emergency use only. After repacking our tables to reduce IO usage, our database is now a third the size it was before. This reduced both our live storage and backup costs. It likely improved our query times too, although our batch processes are not especially sensitive to this. We no longer need to commit to a hard storage limit that would require a migration; we know that we can easily scale up when needed. Finally, while testing ways to reduce IO, we developed a capability to do quick and easy experiments with production-size data. We used this for another round of optimization, which I’ll describe in an upcoming post.
It sounds a bit like if you had done all the cleanups and modifications that you where forced to do when the bill for AWS came, then your current solution would still have been sufficient 🙂 The only benefit I see at the bottom line is “it’s easier to scale”, which I have always found to be a bit of McGuffin; everybpdy wants to be ablt to do it but nobody knows why they want it and pretty much nobody actually needs it.
Yes, that may be true, and we will evaluate moving back to Postgres on our next major version upgrade. The table bloat did not seem to be an issue on Postgres since we weren’t paying for IO and our query performance was fine.
Nicely written article. One consideration if you size your read replica smaller than the write instance is that in a failover, you end up with a write instance that is now undersized and unlikely to carry the production load. You’re then forced to take a second outage to resize the instance or manually fail back to the original writer once it is back online. This is totally dependent on your business requirements though. As we have seen unplanned failovers with Aurora, we have elected to size the first failover candidate the same as the writer and accept the increased cost.
Thanks for time you took to share and explain this clearly.
Sounds like a deep dive on Aurora I/O of the type AWS ought to do in support of its customers.
I was going to say “check the indexes” for bloat, too, but “repack” hits the indexes as well as the tables. Good thing, that.
Going forward, a kind of “pg-repack lite” would be to drop and rebuild indexes (only.)