I’ve had occasion recently to dig into how Evergreen uses space in a PostgresSQL database. Before sharing a couple queries and observations, here’s my number one rule for configuring a database server for Evergreen: allocate enough disk space. If you’re using a dedicated database server, you’ll need disk space to store the following:
- the database files storing the actual data,
- current WAL (write-ahead log) files,
- archived WAL files (which should be backed up off the database server as well),
- current database snapshots and backups (again, these should be backed up offline as well),
- scratch space for migrations and data loads,
- future data, particularly if you’re using Evergreen for a growing consortium, and
- the operating system, the Postgres software, etc.
Of course, the amount of disk space required just to store the data depends on the number of records you have. A complete sizing formula would take into account the number of bibs, items, patrons, circulation transactions, and monetary transactions you expect to have, but here’s a rule of thumb based on looking at several production Evergreen 1.6 databases and rounding up a bit: allocate at least 50K per bibliographic record.
That’s only the beginning, however. Postgres uses write-ahead logging to record database transactions; this has the effect of adding a 16M file to the
pg_xlog directory every so often as users catalog and do circulation. In turn, the WAL files should get archived periodically by enabling archive mode so that copies exist both on the database server itself and on backup media.
In a busy system, read “quite often” for “every so often” in the second sentence of the previous paragraph. In a system where you’re actively loading data, particularly if you’re also keeping that database up for production use and are therefore keeping WAL archiving on, read “fast and furious”. Why does this matter? If your database server crashes, you will need the most recent full backup and the accumulated archived WAL files since that backup to recover your database. If you don’t keep your WAL files, be prepared for an involuntary fine amnesty and angry catalogers. Conversely, what happens if you run out of space for your archived WAL files? Postgres will lock up, bringing your Evergreen system to a halt, yielding angry patrons and angry circulation staff.
Archived WAL files don’t need to be kept on the database server forever, fortunately. After each periodic full backup, archived WAL files made prior to that backup won’t be needed in case you need to do a point in time recovery. Of course, that assumes everything goes well during the recovery, so you will still want to keep at least a couple generations of full backups and WAL file sequences, include offline backup copies, and also periodically create logical database dumps using
pg_dump. LOCKSS isn’t just for digitized scholarly papers.
So, what’s my rule of thumb for estimating total disk space needed for an Evergreen database server? 200K per bibliographic record that you expect to have in your database three years from now. I admit that this is on the high side, and this is not the formula that Equinox’s server people necessarily use for hardware recommendations. However, while disk space may or may not be “cheap”, it is often cheaper than a 2 a.m. wake-up call from the library director.
How does this disk space get used? I’ll close with a couple queries to run against your Evergreen database:
pg_total_relation_size(schemaname || '.' || tablename)
)::bigint) AS used
group by schemaname
sum(pg_total_relation_size(schemaname || '.' || tablename))::bigint desc;
This gives you the amount of space used in each schema. The
metabib schema, which contains the indexing tables, will almost certainly be #1. Depending on how long you’ve been using your Evergreen system, either
biblio will be #2.
select schemaname || '.' || tablename AS tab,
schemaname || '.' || tablename
) AS used
order by pg_total_relation_size(schemaname || '.' || tablename) desc;
This will give you the space used by each table.
metabib.real_full_rec will be #1, usually followed by
biblio.record_entry. It is interesting to note that although those two tables essentially store exactly the same data,
metabib.real_full_rec will typically consume five times as much space as