In a previous article, I described how table bloat in our Postgres database drove large Storage IO costs when we moved to Aurora. We successfully reduced our Storage IO costs by repacking the tables. Now that we know how to monitor both Storage IO and table bloat, our database costs are stable. But how did we get into this state, and how can we prevent bloat from creeping back into our tables? We used AWS Batch to run production-scale tests with several query strategies. From these results, we found that using SQLAlchemy to modify only changed rows reduced our Storage IO costs, as well as the accumulation of dead tuples in our database.
Initial conditions
A very simplified version of our daily analytics process is: read from the database, do math, write results to the database. Our application is written in Python and uses SQLAlchemy. Each analytic run calculates results for a single utility meter. After each run, the application deleted all of the previous results for a meter, then inserted the new results. Autovacuum ran regularly on all of the analytic result tables, and this seemed to be working fine.
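In SQLAlchemy terms, that persistence step looked roughly like the sketch below. The UsePrediction model and its columns are illustrative stand-ins for our actual schema: every run wiped the meter’s previous rows and re-inserted the full result set, whether or not anything had changed.

```python
# Illustrative sketch of the original delete-then-insert pattern.
# UsePrediction is a stand-in model, not our actual schema.
from sqlalchemy import JSON, Column, DateTime, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class UsePrediction(Base):
    __tablename__ = "use_prediction"
    meter = Column(String, primary_key=True)
    occurred = Column(DateTime, primary_key=True)
    predictions = Column(JSON)

def save_results(session: Session, meter: str, new_rows: list[UsePrediction]) -> None:
    # Delete every previous result row for this meter...
    session.query(UsePrediction).filter(UsePrediction.meter == meter).delete()
    # ...then insert the complete set of freshly computed rows.
    session.add_all(new_rows)
    session.commit()
```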
Updating data with this pattern is simple, and works from a data consistency point of view, but it’s very hard on a Postgres-compatible database. Deleting a row does not reclaim the storage space; it just marks that data as invisible as of the transaction that made the change — these are dead tuples. When autovacuum runs, it marks the space as available for reuse if it can detect that no active transactions need it. But the autovacuum process runs only when the database detects that enough dead tuples have accumulated that it’s worth cleaning them up. By default, this is 20% of the table size, which is millions of tuples in some of our large tables. Our analytic process inserted new data immediately after deleting the old data, so it was very unlikely the deleted space was available for reuse. When the database can’t find enough reusable space, it needs to allocate new space. Understanding autovacuum in Amazon RDS for PostgreSQL environments has a detailed explanation of how this works. In short, autovacuum couldn’t keep up — our storage layout resembled an ever-growing paper snowflake.
Illustration of 10 batches of delete + insert followed by a vacuum. Although no additional elements were added, 129 additional squares are allocated. See how this image was generated.
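A rough calculation shows why autovacuum struggles with this pattern on large tables. Autovacuum kicks in when dead tuples exceed autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times the table’s estimated row count; with the defaults (50 rows and 0.2), a big result table can accumulate millions of dead tuples before cleanup even begins. The row count below is illustrative, not one of our actual tables.

```python
# Dead tuples a table can accumulate before autovacuum triggers,
# using the Postgres defaults.
AUTOVACUUM_VACUUM_THRESHOLD = 50       # default autovacuum_vacuum_threshold
AUTOVACUUM_VACUUM_SCALE_FACTOR = 0.2   # default autovacuum_vacuum_scale_factor

def dead_tuples_before_autovacuum(estimated_rows: int) -> int:
    """Autovacuum runs when n_dead_tup exceeds threshold + scale_factor * reltuples."""
    return int(AUTOVACUUM_VACUUM_THRESHOLD + AUTOVACUUM_VACUUM_SCALE_FACTOR * estimated_rows)

# An illustrative 50-million-row result table can carry roughly 10 million
# dead tuples before autovacuum even starts.
print(dead_tuples_before_autovacuum(50_000_000))  # 10000050
```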
Production-scale experiments
We knew this query pattern wasn’t ideal, and wanted to measure the IO and database impact of different patterns before committing effort to updating all of our queries. We wanted to test with conditions as close as possible to production, but without affecting production at all. We use AWS Batch to submit and run our daily analytics jobs; the jobs are docker containers, and run on spot instances. Our experimental apparatus consists of an AWS Batch stack, a database cluster restored from a snapshot, and a docker container with updated code and a specific tag. This setup made it very easy to test the impact of different query strategies with production data.
We restored a recent Aurora snapshot into a new cluster. Although it takes longer to restore a snapshot than to create a clone, we wanted to run our experiments with identical starting conditions. After running our tests, we stopped and deleted the cluster, then restored the same snapshot for the next test.
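For reference, the restore step can be scripted with boto3 along these lines. The cluster, snapshot, and instance identifiers and the instance class are placeholders, not our real names or sizes.

```python
# Sketch: restore the production snapshot into a fresh experiment cluster.
import boto3

rds = boto3.client("rds")

# Restore the snapshot into a brand-new cluster so every experiment
# starts from identical data.
rds.restore_db_cluster_from_snapshot(
    DBClusterIdentifier="analytics-experiment",        # placeholder name
    SnapshotIdentifier="analytics-prod-snapshot",      # placeholder snapshot
    Engine="aurora-postgresql",
)

# An Aurora cluster has no compute until it gets at least one instance.
rds.create_db_instance(
    DBInstanceIdentifier="analytics-experiment-writer",
    DBClusterIdentifier="analytics-experiment",
    DBInstanceClass="db.r6g.2xlarge",                   # placeholder sizing
    Engine="aurora-postgresql",
)
```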
For each experiment, we updated just a small portion of the code to use a different query pattern. We chose a section that was small enough to change easily, but that touched enough rows to make a difference in the total number of database operations. We built the image as usual, and tagged it with a distinct tag.
We created an alternate AWS Batch stack to run our experiments. In AWS Batch, a compute environment defines the instance types and how many to run at once; we created a new environment with the same settings as production. A job definition defines the location of the docker image to run as well as the environment variables needed for the job. We created a new job definition identical to production, except for two changes: the database connection string environment variable points to the snapshot-restored cluster, and the docker image has a different tag. Finally, a job queue collects jobs and runs them using an associated compute environment. We created a new queue to run the experimental jobs.
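The experimental job definition can be expressed with boto3 roughly as follows. The job definition name, image URI, resource sizes, and the DATABASE_URL variable are placeholders standing in for our actual configuration.

```python
# Sketch: a job definition identical to production except for the image tag
# and the database connection string. All names and values are placeholders.
import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="analytics-experiment",
    type="container",
    containerProperties={
        # Same application image as production, but with the experiment tag.
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/analytics:experiment-1",
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "4096"},
        ],
        "environment": [
            # Point the jobs at the snapshot-restored cluster, not production.
            {"name": "DATABASE_URL",
             "value": "postgresql://analytics-experiment.cluster-abc.us-east-1.rds.amazonaws.com/analytics"},
        ],
    },
)
```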
With a small amount of up-front work, we now have an environment where we can quickly make a code change, run hundreds of jobs against a 1 TB+ database, then measure the effect on the database. Each experiment follows the same steps, and we record observations in a Datadog notebook for easy reference. The steps are:
- tag a docker image containing the experimental code
- restore a database snapshot
- record initial conditions
- submit a single job to confirm configuration is correct
- submit hundreds of jobs and wait for them to complete (see the sketch after this list)
- measure IO usage and dead tuples on the restored snapshot cluster
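Submitting the experimental jobs is a short boto3 loop along these lines; the queue and job definition names and the METER_ID variable are placeholders for our actual setup.

```python
# Sketch: submit one experimental job per meter to the experiment queue.
# Queue, job definition, and METER_ID are placeholder names.
import boto3

batch = boto3.client("batch")

def submit_experiment_jobs(meter_ids: list[str]) -> None:
    for meter_id in meter_ids:
        batch.submit_job(
            jobName=f"experiment-{meter_id}",
            jobQueue="analytics-experiment",
            jobDefinition="analytics-experiment",
            containerOverrides={
                "environment": [{"name": "METER_ID", "value": meter_id}],
            },
        )
```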
Experiment 0: baseline
To set up the baseline experiment, we tagged our current production docker image with the special tag. We restored a database snapshot, put 400+ jobs on the queue, and measured IOPs usage when they were done. The production code generated 4.3 million (M) read IOPs and 5.2M write IOPs, for a total of 9.5M IOPs. We also recorded live and dead tuples with a query like:
select sum(n_live_tup), sum(n_dead_tup)
from pg_stat_user_tables
where relname in ('use_prediction', 'peak_prediction');
Experiment 1: Do upserts reduce IOPs or dead tuples?
The first experiment I ran was to test whether upserting rows is more efficient than the delete-then-insert pattern. Postgres supports upserts using the insert…on conflict statement. It requires a unique constraint on the table (to decide if there’s a conflict), and looks like this:
insert into use_prediction (meter, occurred, predictions)
values (%(meter)s, %(occurred)s, %(predictions)s)
on conflict (meter, occurred) do update set predictions=%(predictions)s
This statement will attempt to insert a new row; if there’s already a row with the specified meter and occurred values (a conflict), it will update it instead. We thought this could reduce write IOPs, because we wouldn’t need to delete existing rows. Postgres could do either an insert (for new rows) or an update (for existing rows). We’d still need to delete rows that were no longer needed. For this test, we changed the code to use upsert for about 20% of the total writes in each job.
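The same statement can also be built with SQLAlchemy’s PostgreSQL dialect; the sketch below is one way to express it, reusing the illustrative model from earlier rather than our exact implementation.

```python
# Sketch: the insert ... on conflict statement via SQLAlchemy's
# PostgreSQL dialect. UsePrediction is the illustrative model from earlier.
from sqlalchemy import JSON, Column, DateTime, String
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class UsePrediction(Base):
    __tablename__ = "use_prediction"
    meter = Column(String, primary_key=True)
    occurred = Column(DateTime, primary_key=True)
    predictions = Column(JSON)

def upsert_rows(engine, rows: list[dict]) -> None:
    """rows: dicts with meter, occurred, and predictions keys."""
    insert_stmt = insert(UsePrediction)
    upsert_stmt = insert_stmt.on_conflict_do_update(
        # The (meter, occurred) key decides between INSERT and UPDATE.
        index_elements=["meter", "occurred"],
        set_={"predictions": insert_stmt.excluded.predictions},
    )
    with engine.begin() as conn:
        conn.execute(upsert_stmt, rows)
```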
Experiment 2: Does merging with SQLAlchemy reduce IOPs or dead tuples?
Our application uses SQLAlchemy, which can determine at flush time whether an object has been modified and either write it or do nothing. Overall, only about 40% of our analytic result rows actually change from one run to the next, so writing only changed data could significantly reduce IOPs. To test this, we changed the code to create new SQLAlchemy model objects from the incoming data, and to read the existing data from the database into a dictionary of SQLAlchemy model objects keyed by the set of fields that uniquely identifies a row. If an incoming row already existed in the database, we copied the new data into its model object. We then added the model objects to the session and let SQLAlchemy decide whether to issue an insert, an update, or nothing at all.
Illustration of 10 batches of 15% inserts, 15% deletes, and 43% updates, followed by a vacuum. After vacuuming, 54 additional squares are allocated, compared to 129 in the delete + insert simulation. See how this image was generated.
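A condensed sketch of that merge logic, again using the illustrative UsePrediction model rather than our real schema, looks like this. Because SQLAlchemy only writes objects whose attributes actually changed, rows whose incoming data matches the database generate no SQL at all.

```python
# Sketch of the merge strategy: load existing rows into a dict keyed by
# (meter, occurred), copy incoming data onto existing objects, add new
# objects, delete obsolete ones, and let SQLAlchemy's unit of work decide
# what to write. UsePrediction is the illustrative model from earlier.
from sqlalchemy import JSON, Column, DateTime, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class UsePrediction(Base):
    __tablename__ = "use_prediction"
    meter = Column(String, primary_key=True)
    occurred = Column(DateTime, primary_key=True)
    predictions = Column(JSON)

def merge_results(session: Session, meter: str, incoming: list[UsePrediction]) -> None:
    existing = {
        (row.meter, row.occurred): row
        for row in session.query(UsePrediction).filter(UsePrediction.meter == meter)
    }
    incoming_keys = set()
    for new in incoming:
        key = (new.meter, new.occurred)
        incoming_keys.add(key)
        if key in existing:
            # Copy data onto the persistent object; if nothing changed,
            # SQLAlchemy issues no UPDATE for this row at flush time.
            existing[key].predictions = new.predictions
        else:
            session.add(new)        # brand-new row -> INSERT
    # Rows in the database that no longer appear in the results are obsolete.
    for key, row in existing.items():
        if key not in incoming_keys:
            session.delete(row)     # obsolete row -> DELETE
    session.commit()
```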
Results
| Description | Read IOPs | Write IOPs | Total IOPs | Dead tuples |
|---|---|---|---|---|
| Delete then insert | 4.3M | 5.2M | 9.5M | 1.4M |
| Upsert | 4.7M | 4.0M | 8.7M | 1.4M |
| SQLAlchemy merge | 2.5M | 2.4M | 4.9M | 1.3M |
Using upsert instead of delete then insert increased read IOPs from 4.3M to 4.7M, but decreased write IOPs from 5.2M to 4.0M, for a net reduction of about 0.8M IOPs, or 8%. This was not as dramatic a change as I’d hoped. Dead tuples were the same as in the baseline delete-then-insert test. What happens if the incoming data is exactly the same as the existing row? I couldn’t find any definitive answers. My theory is that an update statement with unchanged data isn’t a no-op, and that the modest savings come from no longer including the key in the write payload.
Using SQLAlchemy to merge incoming and existing data was much more effective at reducing IOPs usage. This experiment generated 2.5M read IOPs and 2.4M write IOPs, for a total of 4.9M, a reduction of roughly 48% from the baseline test. Dead tuples were also about 10% lower than in previous experiments.
Final thoughts
With Aurora’s pay-for-usage pricing model, our bill depends on how efficiently we read and write data. Running experiments with AWS Batch on production-size data clearly showed that deleting and inserting all of our analytic results each run was not as efficient as updating only what changed. With the results of these experiments, we updated more of the persistence steps in our analytics process to use the SQLAlchemy merge strategy. Overall, about 15% of incoming rows result in inserts and 43% result in updates. Each run also deletes obsolete rows equivalent to about 15% of the incoming rows. We regularly measure table bloat to confirm that autovacuum is keeping up. Our Storage IO costs are now in line with our target of 1.2 times the instance costs, and our total Aurora bill is 13% lower than our pre-migration Postgres bill.