Yesterday I gave a lightning talk at the Evergreen conference on being wrong. Appropriately, I started out the talk on the wrong foot. I intended to give the talk today, but when I signed up for a slot, I failed to notice that the signup sheet I used was for yesterday. It was a good thing that I had decided to listen to the other lightning talks yesterday, as that way the facilitator was able to find me to tell me that I was up next.

Oops.

When she did that, I initially asked to do it today as I had intended… but changed my mind and decided to charge ahead. Lightning talks are all about serendipity, right?

The talk went something like this: after mentioning my scheduling mix-up, I spoke about how I have been active in the Evergreen project for almost nine years. I’ve worn a variety of project hats over that time, including those of developer, core committer, release manager, member of the Evergreen Oversight Board, chair of the EOB, and so forth. While I am of course proud of the contributions I’ve made, my history with the project also includes being wrong about many things and failing a lot.

I’ve been wrong about coding issues. I’ve been responsible for regressions. I’ve had my share of brown-bag releases. I’ve misunderstood what library staff and patrons were trying to accomplish. I’ve made assumptions about the working conditions and circumstances of users that were very wrong indeed. Some of my bug reports and test plans have not been particularly clear.

Why bring up my wrongness? Prior to the talk, I had been part of a couple conversations about how some folks feel intimidated about writing bug reports or posting to the mailing lists for fear of being judged if their submission was not perfect. Of course, I don’t want people to feel intimidated; the project needs bug reports and contributions from anybody who cares enough about the software to make the effort. By mentioning how I — as somebody who is unquestionably a senior contributor to the project — have been repeatedly wrong, I hoped to humanize people like me: we’re not perfect. Perfection is not a requirement for gaining status in the community as a respected contributor — and that’s a good thing.

I also wanted to give permission for folks to be wrong, in the hopes that doing so might help lower a barrier to participating.

So much for the gist of the lightning talk. People in the audience seemed to enjoy it, and I got a couple nice comments about it, including somebody mentioning how they wished they had heard something like that as they were making their first contributions to the project.

However, I would also like to expand a bit on a couple points.

Permission to be wrong is not something I can grant all by myself. While I can try to model good ways of providing feedback (and get better at it myself; I’ve certainly been wrong many a time about how to do so), it sometimes doesn’t take much for an interaction with a new contributor (or an experienced one!) to become unwelcoming to the point where we lose the contributor forever. This is not a theoretical concern; while I think we have gotten much better over the years, there were certainly times and circumstances where it was very rational to feel intimidated about participating in the project in certain ways for fear of getting dismissive feedback.

Giving ourselves permission to be wrong is a community responsibility; by doing so we can give ourselves permission to improve. However, this can’t be treated as a platitude: it takes effort and thoughtfulness both to ensure that the community is welcoming at all levels, and to ensure that permission to be wrong isn’t accorded only to people who look like me.

One of the things that the conference keynote speaker Crystal Martin asked the community to consider was this: “Lift as you climb.” I challenge senior contributors to the Evergreen project — including myself — to take this to heart. I have benefited a lot by being able to be wrong; we should act to ensure that everybody else in the community can be allowed to be wrong as well.

In August I made a map of Koha installations based on geolocation of the IP addresses that retrieved the Koha Debian package. Here’s an equivalent map for Evergreen:

[map: downloads of Evergreen tarballs in the past 52 weeks]

As with the Koha map, this is based on the last 52 weeks of Apache logs as of the date of this post. I included only complete downloads of Evergreen ILS tarballs and excluded downloads done by web crawlers.  A total of 1,317 downloads from 838 distinct IP addresses met these criteria.

The interactive version can be found on Plotly.
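In case you want to reproduce the tally from your own logs, here’s a minimal sketch in Python of the counting step. The log path and tarball filename pattern are assumptions on my part, and the geolocation and plotting are left out:

# Count complete Evergreen tarball downloads (HTTP 200 as a rough proxy
# for "complete") in an Apache access log, skipping obvious web crawlers.
import re

LOG_PATH = 'access.log'  # hypothetical path to the Apache log
TARBALL = re.compile(r'GET \S*Evergreen-ILS-[\d.]+\.tar\.gz')
CRAWLER = re.compile(r'bot|crawl|spider', re.IGNORECASE)

downloads = 0
ips = set()
with open(LOG_PATH) as log:
    for line in log:
        if not TARBALL.search(line) or CRAWLER.search(line):
            continue
        # Combined log format: client IP comes first, then the quoted
        # request, then the status code.
        m = re.match(r'(\S+).*?" (\d{3}) ', line)
        if m and m.group(2) == '200':
            downloads += 1
            ips.add(m.group(1))

print(downloads, 'downloads from', len(ips), 'distinct IP addresses')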

While looking to see what made it into the upcoming 2.9 beta release of Evergreen, I had a suspicion that something unprecedented had happened. I ran some numbers, and it turns out I was right.

Evergreen 2.9 will feature fewer zombies.

Considering that I’m sitting in a hotel room taking a break from Sasquan, the 2015 World Science Fiction Convention, zombies may be an appropriate theme.

But to put it more mundanely, and to reveal the unprecedented bit: more files were deleted in the course of developing Evergreen 2.9 (as compared to the previous stable version) than entirely new files were added.

To reiterate: Evergreen 2.9 will ship with fewer files, even though it includes numerous improvements, including a big chunk of the cataloging section of the web staff client.

Here’s a table counting, for each jump from the last release in a stable series to the first release in the next series, the number of entirely new files, deleted files, and renamed or moved files.

Between release   …and release   Entirely new files   Files deleted   Files renamed
rel_1_6_2_3       rel_2_0_0      1159                 75              145
rel_2_0_12        rel_2_1_0      201                  75              176
rel_2_1_6         rel_2_2_0      519                  61              120
rel_2_2_9         rel_2_3_0      215                  137             2
rel_2_3_12        rel_2_4_0      125                  30              8
rel_2_4_6         rel_2_5_0      143                  14              1
rel_2_5_9         rel_2_6_0      83                   31              4
rel_2_6_7         rel_2_7_0      239                  51              4
rel_2_7_7         rel_2_8_0      84                   30              15
rel_2_8_2         master         99                   277             0

The counts were made using git diff --summary --find-renames FROM..TO | awk '{print $1}' | sort | uniq -c, ignoring file mode changes. For example, to get the counts between release 2.8.2 and the master branch as of this post, I did:

$ git diff --summary  --find-renames origin/tags/rel_2_8_2..master|awk '{print $1}'|sort|uniq -c
     99 create
    277 delete
      1 mode

Why am I so excited about this? It means that we’ve made significant progress in getting rid of old code that used to serve a purpose, but no longer does. Dead code may not seem so bad — it just sits there, right? — but like a zombie, it has a way of going after developers’ brains. Want to add a feature or fix a bug? Zombies in the code base can sometimes look like they’re still alive — but time spent fixing bugs in dead code is, of course, wasted. For that matter, time spent double-checking whether a section of code is a zombie or not is time wasted.

Best for the zombies to go away — and kudos to Bill Erickson, Jeff Godin, and Jason Stephenson in particular for removing the remnants of Craftsman, script-based circulation rules, and JSPac from Evergreen 2.9.

Tomorrow I’m flying out to Hood River, Oregon, for the 2015 Evergreen International Conference.

I’ve learned my lesson from last year — too many presentations at one conference make Galen a dull boy — but I will be speaking a few times:

Hiding Deep in the Woods: Reader Privacy and Evergreen (Thursday at 4:45)

Protecting the privacy of our patrons and their reading and information seeking is a core library value – but one that can be achieved only through constant vigilance. We’ll discuss techniques for keeping an Evergreen system secure from leaks of patron data; policies on how much personally identifying information to keep, and for how long; and how to integrate Evergreen with other software securely.

Angling for a new Staff Interface (Friday at 2:30)

The forthcoming web-based staff interface for Evergreen uses a JavaScript framework called AngularJS. AngularJS offers a number of ways to put new interfaces together quickly, such as tight integration of promises/deferred objects, extending HTML via local directives, and an integrated test framework – and it can help make Evergreen UI development (even more) fun. During this presentation, which will include some hands-on exercises, Bill, Mike, and Galen will give an introduction to AngularJS with a focus on how it’s used in Evergreen. By the end of the session, attendees will have gained knowledge that they can immediately apply to working on Evergreen’s web staff interface. To perform the exercises, attendees are expected to be familiar with JavaScript.

Jane in the Forest: Starting to do Linked Data with Evergreen (Saturday at 10:30)

Linked Data has been on the radar of librarians for years, but unless one is already working with RDF triple-stores and the like, it can be a little hard to see what the Linked Data future will look like for ILSs. Adapting some of the ideas of the original Jane-athon session at ALA Midwinter 2015 in Chicago, we will go through an exercise of putting together small sets of RDA metadata as RDF… then see how that data can be used in Evergreen. By the end, attendees will have learned a bit not just about the theory of Linked Data, but also about how working with it plays out in practice.

I’m looking forward to hearing other presentations and the keynote by Joseph Janes, but more than that, I’m looking forward to having a chance to catch up with friends and colleagues in the Evergreen community.

The next few days promise to be busy.  Tuesday morning, my first stop is Sea-Tac to pick up a couple other conference attendees who are flying in from the East Coast.  After a stop for lunch, it’s a straight shot up I-5 and BC-99 to Vancouver.

Wednesday morning I’ll be bouncing around among the IG and committee meetings, and making a particular point of joining the Web Team and the Cataloging Working Group meetings.  I plan to spend most of the afternoon at the hackfest.

Thursday looks to be mostly sessions, but you may also find me distributing Evergreen t-shirts.

On Friday, I’ll be part of two presentations.  At noon, I’ll be talking about data quality and Evergreen, and at 2:30 I’ll be joining Rogan Hamby and Robin Johnson to talk about how networking affects Evergreen.

Friday morning I’ll be joining the other members of the Evergreen Oversight Board (and, I hope, other interested community members!) for our business meeting.  Later Saturday morning the Oversight Board will give an update to the conference.

And other than that?  I’m looking forward to attending the keynotes and catching some sessions.  But most of all, I’m looking forward to seeing friends old and new.

Update 9 April 2013: The Evergreen Oversight Board meeting was rescheduled to 8 a.m. on Friday.

Libraries are sneaky, crafty places.  If you walk into one, things may never look the same when you walk out.

Libraries are dangerous places.  If you open your mind in one, you may be forever changed.

And, more mundanely, university libraries are places that employ a lot of work-study students.  I was one of them at Ganser Library at Millersville University.  Although I’ve always been a bookish lad, when I started as a reference shelver at Ganser I wasn’t thinking of the job as anything more than a way to pay the rent while I pursued a degree in mathematics.  And, of course, there were decided limits to how much fascination I could find in filing updated pages in a set of the loose-leaf CCH tax codes.  While some of the cases I skimmed were interesting, I can safely say that a career in tax accountancy was not in my future, either then or now.

Did I mention that libraries are crafty?  Naturally, much of the blame for that attaches to the librarians. As time passed, I ended up working in just about every department of the library, from circulation to serials to systems, as if there were a plot to have me learn to love every nook and cranny of that building and the folks who made it live.  By the time I graduated, math degree in hand, I had accepted a job with an ILS vendor, directly on the strength of the work I had done to help the library migrate to the (at the time) hot new ILS.

While writing this post, it has hit me hard just how deep a debt of gratitude I owe to my mentors at Ganser.  To name some of them, Scott Anderson, Krista Higham, Barbara Hunsberger, Sally Levit, Marilyn Parrish, Elaine Pease, Leo Shelley, Marjorie Warmkessel, and David Zubatsky have each taught me much, professionally and personally.  To be counted among them as a member of the library profession is an honor.

Today I have an opportunity to toot my horn a bit, having been named one of the “Movers and Shakers” this year by Library Journal.  I am grateful for the recognition, as well as the opportunity to sneak a penguin into the pages of LJ.

[image of Tux the penguin — original image by Larry Ewing]
Why a penguin? In part, simply because that’s how my whimsy runs. But there’s also a serious side to my choice, and I’m happy that the photographer and editors ran with it. Tux the penguin is a symbol of the open source Linux project, and moreover is a symbol that the Linux community rallies behind. Why have I emphasized community? Because it’s the strength of the library open source communities, particularly those of the Koha and Evergreen projects, that inspire me from day to day. Not that it’s all sunshine and kittens — any strong community will have its share of disappointments and conflicts. However, I deeply believe that open source software is a necessary part of librarians (I use that term broadly) building their own tools with which to share knowledge (and I use that term very broadly) with the wider communities we serve.

The recognition that LJ has given me for my work for Koha and Evergreen is very flattering, but for me it is at heart an opportunity to reflect, and to thank the many friends and mentors in libraryland I have met over the years.

Thanks, and may the work we share ever continue.

Both Koha and Evergreen use memcached to cache user sessions and data that would be expensive to continually fetch and refetch from the database. For example, Koha uses memcached to cache MARC frameworks, while Evergreen caches search results, bibliographic added content, search suggestions, and other data.

Even though the data that gets cached is transitory, at times it can be useful to look at it. For example, you may need to check to see if some stale data is present in the cache, or you may want to capture some statistics about user sessions that would otherwise be lost when the cache expires.

The libMemcached library includes several command-line utilities for interrogating a memcached server. We’ll look at two of them, memcdump and memccat.

memcdump prints a list of keys that are (or were, since the data may have expired) stored in a memcached server. Here’s an example of the sorts of keys you might see in an Evergreen system:

memcdump --servers 127.0.0.1:11211
oils_AS_21a5dc5cd2aa42ee7c0ecc239dcb25b5
ac.toc.html.0531301990
open-ils.search_9fd0c6c3553e6979fc63aa634a78b362_facets
open-ils.search_9fd0c6c3553e6979fc63aa634a78b362
oils_auth_8682b1017b7b27035576fecbfc7715c4

The --servers 127.0.0.1:11211 bit tells memcdump to check memcached running on the local server.

A list of keys, however, doesn’t tell you much. To see the value that’s stored under that key, use memccat. Here’s an example of looking at a user session record in Koha (assuming you’ve set the SessionStorage system preference to use memcached):

memccat --servers 127.0.0.1:11211 KOHA78c879b9942dee326710ce8e046acede
---
_SESSION_ATIME: '1363060711'
_SESSION_CTIME: '1363060711'
_SESSION_ID: 78c879b9942dee326710ce8e046acede
_SESSION_REMOTE_ADDR: 192.168.1.16
branch: CPL
branchname: Centerville
cardnumber: cat
emailaddress: ''
firstname: ''
flags: 1
id: cat
ip: 192.168.1.16
lasttime: '1363060711'
number: 51
surname: cat

And here’s an example of an Evergreen user session cached object:

memccat --servers 127.0.0.1:11211 oils_auth_8682b1017b7b27035576fecbfc7715c4
{"authtime":420,"userobj":{"__c":"au","__p":[null,null,null,null,null,null,null,null,null,"186",null,"t",null,"f",119284,38997,0,0,"2011-05-31T11:17:16-0400","0.00","1-888-555-1234","1923-01-01T00:00:00-0500","user@example.org",null,"2015-10-29T00:00:00-0400","User","Test",186,654440,3,null,null,null,"1358890660.7173220299.6945940294",119284,"f",1,null,"",null,null,10,null,1,null,"t",654440,"user",null,"f","2013-01-22T16:37:40-0500",null,"f"]}}

We’ll let the YAMLites and JSONistas square off outside, and take a look at a final example. This is an excerpt of a cached catalog search result in Evergreen:

memccat --servers 127.0.0.1:11211 open-ils.search_4b81a8a59544e8c7e9fdcda357d7b05f
{"0":{"summary":{"checked":630,"visible":"546","excluded":84,"deleted":0,"total":630,"complex_query":1},"results":[["74093"],["130197"], ...., ["880940"],["574457"]]}}

There are other tools that let you manipulate the cache, including memcrm to remove keys and memccp to load key/value pairs into memcached.
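If you’d rather do this sort of poking around from a script, the same operations are easy from Python. Here’s a minimal sketch using the python-memcached client (my choice of client library is just an assumption — any memcached client will do, and depending on how the application stored a value, you may get raw bytes back):

import memcache

# Connect to the same local memcached instance as the CLI examples.
mc = memcache.Client(['127.0.0.1:11211'])

# The scripted equivalent of memccat: fetch a value by key (key taken
# from the memcdump output above).
session = mc.get('oils_auth_8682b1017b7b27035576fecbfc7715c4')
print(session)

# The scripted equivalent of memcrm: remove a (possibly stale) key.
mc.delete('open-ils.search_9fd0c6c3553e6979fc63aa634a78b362')

# The scripted equivalent of memccp: store a key/value pair.
mc.set('my.test.key', 'hello from Python')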

For a complete list of the command-line tools provided by libMemcached, check out its documentation. To install them on Debian or Ubuntu, run apt-get install libmemcached-tools. Note that the Debian package renames the tools from ‘memdump’ to ‘memcdump’, ‘memcat’ to ‘memccat’, etc., to avoid a naming conflict with another package.

I had a great time at the Evergreen conference this year. Some of the highlights for me are:

  • I got to see a lot of friends, both old and new.
  • We started signing the fiscal sponsorship agreement (PDF) with the Software Freedom Conservancy. After the document crosses the country and back and gets the last few signatures, Evergreen will officially become the latest member project of the SFC.
  • We made great progress setting up for the project’s move from SVN to Git. (If you’re a developer, please weigh in now on the vote taking place on open-ils-dev.)
  • I talked a lot.
  • Every restaurant I ate at was good, without exception.
  • I learned how to play Munchkin.
  • My wife and I found a house to rent.
  • I even managed to get some sleep one night.

Many thanks to the conference committee for organizing a wonderful event. I’ll close with this image from the presentation by Rogan Hamby and Shasta Brewer:

[image of kitten shouting huzzah!]

I’ve had occasion recently to dig into how Evergreen uses space in a PostgreSQL database. Before sharing a couple of queries and observations, here’s my number one rule for configuring a database server for Evergreen: allocate enough disk space. If you’re using a dedicated database server, you’ll need disk space to store the following:

  • the database files storing the actual data,
  • current WAL (write-ahead log) files,
  • archived WAL files (which should be backed up off the database server as well),
  • current database snapshots and backups (again, these should be backed up offline as well),
  • scratch space for migrations and data loads,
  • future data, particularly if you’re using Evergreen for a growing consortium, and
  • the operating system, the Postgres software, etc.

Of course, the amount of disk space required just to store the data depends on the number of records you have. A complete sizing formula would take into account the number of bibs, items, patrons, circulation transactions, and monetary transactions you expect to have, but here’s a rule of thumb based on looking at several production Evergreen 1.6 databases and rounding up a bit: allocate at least 50K per bibliographic record.

That’s only the beginning, however. Postgres uses write-ahead logging to record database transactions; this has the effect of adding a 16M file to the pg_xlog directory every so often as users catalog and do circulation.  In turn, the WAL files should get archived periodically by enabling archive mode so that copies exist both on the database server itself and on backup media.
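For reference, archiving is switched on with a couple of settings in postgresql.conf. A minimal sketch — the archive destination here is just a placeholder, and on PostgreSQL versions before 9.0 the wal_level line should be omitted:

# excerpt from postgresql.conf
wal_level = archive        # 9.0+; not present on earlier versions
archive_mode = on          # takes effect after a server restart
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'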

In a busy system, read “quite often” for “every so often” above. In a system where you’re actively loading data, particularly if you’re also keeping that database up for production use and are therefore keeping WAL archiving on, read “fast and furious”. Why does this matter? If your database server crashes, you will need the most recent full backup and the accumulated archived WAL files since that backup to recover your database. If you don’t keep your WAL files, be prepared for an involuntary fine amnesty and angry catalogers. Conversely, what happens if you run out of space for your archived WAL files? Postgres will lock up, bringing your Evergreen system to a halt, yielding angry patrons and angry circulation staff.

Archived WAL files don’t need to be kept on the database server forever, fortunately.  After each periodic full backup, archived WAL files made prior to that backup won’t be needed for a point-in-time recovery. Of course, that assumes everything goes well during the recovery, so you will still want to keep at least a couple of generations of full backups and WAL file sequences, including offline backup copies, and also periodically create logical database dumps using pg_dump. LOCKSS isn’t just for digitized scholarly papers.

So, what’s my rule of thumb for estimating total disk space needed for an Evergreen database server? 200K per bibliographic record that you expect to have in your database three years from now. I admit that this is on the high side, and this is not the formula that Equinox’s server people necessarily use for hardware recommendations. However, while disk space may or may not be “cheap”, it is often cheaper than a 2 a.m. wake-up call from the library director.
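To make the arithmetic concrete, here’s a back-of-the-envelope sketch of both rules of thumb; the starting bib count and growth rate below are purely illustrative:

# Rough Evergreen database server sizing, per the rules of thumb above:
# 50K per bib for the data alone, 200K per bib for the whole server,
# projected three years out.
def evergreen_disk_estimate(bibs_today, annual_growth, years=3):
    projected_bibs = bibs_today * (1 + annual_growth) ** years
    data_gb = projected_bibs * 50 / 1024.0 ** 2    # 50K per bib
    total_gb = projected_bibs * 200 / 1024.0 ** 2  # 200K per bib
    return data_gb, total_gb

# A hypothetical consortium: 1.5 million bibs today, growing 10%/year.
data, total = evergreen_disk_estimate(1500000, 0.10)
print('data alone: ~%d GB; whole server: ~%d GB' % (data, total))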

How does this disk space get used? I’ll close with a couple queries to run against your Evergreen database:

select schemaname,
       pg_size_pretty(sum(
         pg_total_relation_size(schemaname || '.' || tablename)
       )::bigint) AS used
from pg_tables
group by schemaname
order by 
  sum(pg_total_relation_size(schemaname || '.' || tablename))::bigint desc;

This gives you the amount of space used in each schema. The metabib schema, which contains the indexing tables, will almost certainly be #1.  Depending on how long you’ve been using your Evergreen system, either auditor or biblio will be #2.

select schemaname || '.' || tablename AS tab,
       pg_size_pretty(
         pg_total_relation_size(
           schemaname || '.' || tablename
         )
       ) AS used
from pg_tables
order by pg_total_relation_size(schemaname || '.' || tablename) desc;

This will give you the space used by each table. metabib.real_full_rec will be #1, usually followed by biblio.record_entry. It is interesting to note that although those two tables store essentially the same data, metabib.real_full_rec will typically consume five times as much space as biblio.record_entry: storing a row (and index entries) for every field of every MARC record simply takes more room than storing one compressed blob per record.