I am glad to see that PTFS and its LibLime division have contributed the developments that PTFS has been working on for the past year or so, including features commissioned by the East Brunswick and Middletown Township public libraries and others. The text of LibLime’s announcement makes it clear that this is meant as a submission to Koha 3.2 and (more so) 3.4:

The code for the individual new features included in this version has also been made available for download from the GIT repository. The features in this release were not ready for 3.2, but, pending acceptance by the 3.4 Koha Release Manager, could be included in release 3.4.

Chris Cormack (as 3.4 release manager) and I (as 3.2 release manager) have started work on integrating this work into Koha. Since 3.2 is in feature freeze, for the most part only the bugfixes from Harley will be included in 3.2, although I am strongly considering bringing in the granular circulation permissions work as well. The majority of the features will make their way into 3.4, although they will go through QA and discussion like any other submission.

So far, so good. As a set of contributions for 3.2 and 3.4, “Harley” continues PTFS’ ongoing submissions of code to Koha over the past year. Further, if PTFS is serious about its push for “agile” programming, I hope it will make a habit of submitting works in progress for discussion and public QA sooner; in some cases, “Harley” features that were obviously completed months ago were not submitted until now.

But here is where the mixed messages come in: “Harley” is prominently listed on koha.org as a release of Koha. Since no PTFS staff are among the elected release managers or maintainers for Koha, that is overreaching. Ever since Koha expanded beyond New Zealand, no vendor has unilaterally implied that it was making mainstream releases of Koha outside the framework of the elected release managers.

Before I go further, let me get a couple things out of the way. If somebody wants to enhance Koha and create installation packages of their work in addition to contributing their changes to the Koha project, that’s fine. In fact, if somebody wants to do that without formally submitting their changes, that’s certainly within the bounds of the GPL, although obviously I’d prefer that we have one Koha instead of a bunch of forks of it. If any library wants to download, install, test, and use “Harley”, that’s fine as well. Although there could be some trickiness upgrading from “Harley” to Koha 3.2 or Koha 3.4, it will certainly be possible to do so in the future.

What I am objecting to is the overreach.  Yes, “Harley” is important.  Yes, I hope it will help open a path to resolve other issues between PTFS/LibLime and the rest of the Koha community.  Yes, I thank PTFS for releasing the code, and in particular publishing it in their Git repository.  That doesn’t make it an official release of Koha; it is still just another contribution to the Koha project, the same as if it came from BibLibre, software.coop, Catalyst, Equinox, one of the many individual librarians contributing to Koha, or any other source.

“Harley” is available for download from LibLime’s website at http://www.liblime.com/downloads.  This is where it belongs.  Any vendor-specific distribution of Koha should be retrievable from the vendor’s own website, but it should not be presented as a formal release.  Perhaps there is room to consider having the Koha download service also offer vendor-specific distributions in addition to the main releases, but if that is desired, it should be proposed and discussed on the community mailing lists.

Updating koha.org to remove the implication that “Harley” is an official release is a simple change to make, and I call upon PTFS to do so.

Please see my disclosure statement. In particular, I am release manager for Koha 3.2, and I work for a competitor of PTFS. This post should not be construed as an official statement by Equinox, however; I stand by my words as my own.

I’ve had occasion recently to dig into how Evergreen uses space in a PostgreSQL database. Before sharing a couple queries and observations, here’s my number one rule for configuring a database server for Evergreen: allocate enough disk space. If you’re using a dedicated database server, you’ll need disk space to store the following:

  • the database files storing the actual data,
  • current WAL (write-ahead log) files,
  • archived WAL files (which should be backed up off the database server as well),
  • current database snapshots and backups (again, these should be backed up offline as well),
  • scratch space for migrations and data loads,
  • future data, particularly if you’re using Evergreen for a growing consortium, and
  • the operating system, the Postgres software, etc.

Of course, the amount of disk space required just to store the data depends on the number of records you have. A complete sizing formula would take into account the number of bibs, items, patrons, circulation transactions, and monetary transactions you expect to have, but here’s a rule of thumb based on looking at several production Evergreen 1.6 databases and rounding up a bit: allocate at least 50K per bibliographic record.
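
If you want a rough sense of the corresponding figure for a database you already have, a query along these lines will do; it divides the size of the whole database (indexes, auditor tables, and all) by the bib count, so treat the result as a generous per-record figure:

-- rough, all-inclusive bytes-per-bib figure for an existing Evergreen database
select pg_size_pretty(
         (pg_database_size(current_database())
          / nullif((select count(*) from biblio.record_entry), 0))::bigint
       ) as space_per_bib;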

That’s only the beginning, however. Postgres uses write-ahead logging to record database transactions; this has the effect of adding a 16M file to the pg_xlog directory every so often as users catalog and do circulation.  In turn, those WAL files should be archived, by enabling archive mode, so that copies exist both on the database server itself and on backup media.
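
If you haven’t turned archiving on yet, it takes just a couple of settings in postgresql.conf (PostgreSQL 8.3 or later); here is a minimal sketch, where the staging directory is a placeholder for wherever you collect WAL segments before copying them off the server:

# minimal WAL archiving sketch -- the staging path is a placeholder
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'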

In a busy system, read “quite often” for “every so often” in the previous paragraph. In a system where you’re actively loading data, particularly if you’re also keeping that database up for production use and are therefore keeping WAL archiving on, read “fast and furious”. Why does this matter? If your database server crashes, you will need the most recent full backup plus all of the archived WAL files accumulated since that backup to recover your database. If you don’t keep your WAL files, be prepared for an involuntary fine amnesty and angry catalogers. Conversely, what happens if you run out of space for your archived WAL files?  Postgres will lock up, bringing your Evergreen system to a halt, yielding angry patrons and angry circulation staff.

Archived WAL files don’t need to be kept on the database server forever, fortunately.  Once a new full backup completes, the archived WAL files from before that backup are no longer needed for point-in-time recovery. Of course, that assumes everything goes well during the recovery, so you will still want to keep at least a couple generations of full backups and WAL file sequences, including offline backup copies, and also periodically create logical database dumps using pg_dump. LOCKSS isn’t just for digitized scholarly papers.
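
For the logical dumps, a nightly cron job running something like this is sufficient; the database name, user, and destination path here are placeholders for your own setup:

# nightly logical dump in pg_dump's custom (compressed) format -- names and paths are placeholders
pg_dump -U evergreen -Fc --file=/srv/backups/evergreen-$(date +%Y%m%d).dump evergreen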

So, what’s my rule of thumb for estimating total disk space needed for an Evergreen database server? 200K per bibliographic record that you expect to have in your database three years from now. I admit that this is on the high side, and this is not the formula that Equinox’s server people necessarily use for hardware recommendations. However, while disk space may or may not be “cheap”, it is often cheaper than a 2 a.m. wake-up call from the library director.
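
To put that in concrete terms: a system that you expect to grow to a million bibs over the next three years would get roughly 200 GB under this rule, as against about 50 GB under the data-only figure above.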

How does this disk space get used? I’ll close with a couple queries to run against your Evergreen database:

select schemaname,
       pg_size_pretty(sum(
         pg_total_relation_size(schemaname || '.' || tablename)
       )::bigint) AS used
from pg_tables
group by schemaname
order by 
  sum(pg_total_relation_size(schemaname || '.' || tablename))::bigint desc;

This gives you the amount of space used in each schema. The metabib schema, which contains the indexing tables, will almost certainly be #1.  Depending on how long you’ve been using your Evergreen system, either auditor or biblio will be #2.

select schemaname || '.' || tablename AS tab,
       pg_size_pretty(
         pg_total_relation_size(
           schemaname || '.' || tablename
         )
       ) AS used
from pg_tables
order by pg_total_relation_size(schemaname || '.' || tablename) desc;

This will give you the space used by each table. metabib.real_full_rec will be #1, usually followed by biblio.record_entry. It is interesting to note that although those two tables store essentially the same data, metabib.real_full_rec will typically consume five times as much space as biblio.record_entry.

Apologies to Ranganathan.

Say you have a Git repository you want to publish, and you’ve set up a Gitweb for it at http://git.example.com/?p=myrepo.git;a=summary.  So far, so good: others can browse your commits and download packages and tarballs.  Suppose you’ve also configured git-daemon(1) to publish the repo using the Git protocol.  Great!  Now suppose you’ve told the world to go to http://git.example.com. The world looks at what you have wrought, and then asks: How can we clone your repository?

Even assuming that you’ve used the default options in your git-daemon configuration, the Git clone URL could be any of the following depending on where your OS distribution’s packagers decided to put things:

  • git://git.example.com/myrepo
  • git://git.example.com/myrepo.git
  • git://git.example.com/git/myrepo
  • git://git.example.com/git/myrepo.git
  • and there are even more possibilities if you did tweak the config (one typical setup is sketched below)
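
For instance, one common way of invoking the daemon looks something like this (the path and hostname are hypothetical); with this setup, a repository at /srv/git/myrepo.git is cloned as git://git.example.com/myrepo.git, while a different --base-path (or none at all) yields one of the other forms above:

# serve every repository under /srv/git (path is a placeholder)
git daemon --reuseaddr --export-all --base-path=/srv/git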

The rub is that Gitweb doesn’t know and can’t know until you tell it.  If you don’t tell it, somebody who wants to clone your repo and who is looking at the Gitweb page can only guess.  If they guess wrong a few times, they may give up.

Fortunately, the solution is easy: to make the Git clone URL display in your Gitweb, go to the repository’s base directory, create a new file called cloneurl, and enter the correct clone URL(s), one per line. While you’re at it, make sure that the description file is accurate as well.
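
Concretely, for a bare repository living at /srv/git/myrepo.git (a placeholder path), that amounts to something like:

cd /srv/git/myrepo.git
# tell Gitweb which clone URL(s) to display, one per line
echo 'git://git.example.com/myrepo.git' > cloneurl
# and give the repository a meaningful one-line description while you're at it
echo 'My example project' > description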

I saw a particularly annoying form of comment spam in Dorothea Salo’s excellent summary of various kinds of open information:

[screenshot of plagiaristic comment spam]

The author link points to the site of what appears to be a Turkish dietary supplement vendor.  Just a bit off-topic, unless this is somehow a subtle way of announcing that they’re releasing their supplement under an open recipe license.  What really steams me: the text was copied from one of my comments on the post.

Failing grade for plagiarism.

On Wednesday, two committees of the Florida state legislature recommended removing funding for the Florida State Aid to Public Libraries program. This is the second time in as many years that this has happened. To compound the problem, the elimination of state aid would also mean that Florida libraries would no longer qualify for some forms of federal aid.

While a handful of library systems in Florida are independent taxing districts and could (painfully) weather this, elimination of state aid would mean that a lot of rural and city libraries would have to close branches, cut hours, and lay off library staff. Many rural libraries are already operating on shoestrings.

Do you live in Florida? Call your state representative and senator today and ask them to vote to continue funding for state aid to Florida libraries. Also, please ask them to stop this proposal from becoming an annual tradition. No brinkmanship with our libraries, please!

Update 2010-04-28: State aid has been restored! [PDF link] Can we not play this game again next year?

Not paying close attention to Perl’s definition of truth can sometimes lead to subtle bugs. Consider a simple scalar $x that should contain a string exactly one character wide. If the original value of $x can be undefined and you want to make sure it has a default value of a single space, do not do the following:

$x ||= ' ';

Why not? If $x starts off as '0', a permitted value, this line will change it to ' '. Instead, do this:

$x = ' ' unless defined $x;

Remember, 0, '0', '', and undef all evaluate to Perl’s notion of false.
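
As an aside, if you can count on Perl 5.10 or later, the defined-or assignment operator expresses the same intent more compactly:

$x //= ' ';   # assigns ' ' only when $x is undef; a '0' already in $x is left alone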

Earlier today Chris Cormack and I were chatting on IRC about various ways to manage patches, and decided to stage a little tutorial about how to pull from remote Git repositories:

<chris> speaking of public repo’s … i have been pushing to github for a while
<chris> but i have set up git-daemon on my machine at home too
<gmcharlt> chris: anything you’re ready to have me look at to pull?
<chris> not really
<chris> one interesting thing is the dbix_class branch
<chris> http://git.workbuffer.org/cgi-bin/gitweb.cgi?p=koha.git;a=summary
<gmcharlt> even if it’s trivial, it occurs to me that doing it and writing up how we did it might be useful material for a tutorial blog post or mailings to koha-devel
<chris> lemme check
<chris> tell ya what
<chris> ill do a history.txt update
<chris> and commit that, and we can pull that
<chris> gmcharlt: http://git.workbuffer.org/cgi-bin/gitweb.cgi?p=koha.git;a=shortlog;h=refs/heads/documentation
<chris> so you can add a remote for my repo
<chris> git remote add workbuffer.org git://git.workbuffer.org/git/koha.git
<chris> then git checkout -b documentation --track workbuffer.org/documentation
<chris> (probably need a git fetch workbuffer.org too)
<chris> then you can cherry-pick that commit over
<chris> thats one way to do it
<chris> or you could just checkout a branch
<gmcharlt> chris: yeah, I think I’ll do it as a pull
<chris> checkout -b mydocumentation
<chris> git pull workbuffer.org/documentation
<chris> i think that will do it anyway
<gmcharlt> yeah, then into my staging branch
<gmcharlt> git checkout test
<gmcharlt> git merge mydocumentation/documentation
<gmcharlt> or directly
<gmcharlt> git merge workbuffer.org/documentation
<chris> yep
<chris> i think the pull will do fetch + merge for ya
<gmcharlt> it does indeed
<gmcharlt> fetch first, though
<gmcharlt> lets you do git log --pretty=oneline test workbuffer.org/documentation
<chris> good point
<gmcharlt> chris: well, let’s make it official – send a pull request to the patches list
<chris> will do
<gmcharlt> e.g., Subject: PULL – git://git.workbuffer.org/koha.git – documentation – history changes
<gmcharlt> brief description of changes in body
<gmcharlt> something like that
<chris> works for me
<gmcharlt> “Welcome, all, to DVCS performance theatre”
<chris> off it goes
<chris> this was our first git tutorial right there .. quick someone take photos or something 🙂
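
For anyone who would rather not pick the commands out of the chat, here is the gist of what we did, using the remote name, branch names, and staging branch from the conversation above:

# add Chris's repository as a remote and fetch its branches
git remote add workbuffer.org git://git.workbuffer.org/git/koha.git
git fetch workbuffer.org

# either work on a local branch that tracks his documentation branch...
git checkout -b documentation --track workbuffer.org/documentation

# ...or merge the remote branch straight into a staging branch
git checkout test
git merge workbuffer.org/documentation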

Catalogers and technical services managers in public libraries are needed more than ever, but for a variety of reasons their numbers have been declining over the years. ALCTS CRG has convened a forum on careers in technical services in public libraries, but as the forum was not listed in the program guide, here’s the description. Full disclosure: I highly recommend this because, among other reasons, my wife, Marlene Harris, is one of the participants.


ALCTS CRG Forum in Anaheim Focused on Careers for Public Librarians in Technical Services

Sunday, June 29, 8:00-9:30 a.m., Disney Paradise Pier Hotel, Pacific C/D

Want to get the scoop on the advantages and disadvantages of a technical services career in public libraries? Be sure to catch the CRG forum, Technical Services Careers in Public Libraries: Getting Started, Building Your Career, or Making the Switch, on Sunday, June 29, 2008, from 8 to 9:30 a.m., in Room Pacific C/D of the Disney Paradise Pier Hotel, when Carolyn Goolsby, Technical Services Manager at the Tacoma Public Library, and Marlene A. Harris, Division Chief, Technical Services at the Chicago Public Library, will offer advice and describe from personal experience the ups and downs, ins and outs, of a career in technical services within the public library setting. Ample time will be provided for questions and answers after presentations by both panelists.

The moderator is Elaine Yontz from the faculty of the Library and Information Science program at Valdosta State University.

Sponsored by ALCTS CRG (Council of Regional Groups)


I’ll be attending the following programs during ALA Annual this year.

Friday, 27 June

  • 10:30 to 12:00: Old Records, New Records, New Interfaces (ALCTS Catalog Form and Function Interest Group)

Saturday, 28 June

  • 13:30 to 15:30: Metadata Mashup: Creating and Publishing Application Profiles (ALCTS) or There’s No Catalog Like No Catalog (LITA)
  • 16:00 to 18:00: Getting Ready for RDA and FRBR (ALCTS) or Science Fiction and Fantasy: Looking at IT and the Information Rights of the Individual. Hmm, RDA or Cory Doctorow? Decisions, decisions…

Sunday, 29 June

  • 08:00 to 12:00: Creating the Future of the Catalog and Cataloging (ALCTS and LITA). And where did I leave Hermione’s hourglass?
  • 10:30 to 12:30: The Open Library, Promise and Peril (LITA)
  • 13:30 to 15:00: Top Technology Trends (LITA)
  • 15:30 to 17:00: Koha Interest Group Meeting (leaving early)

Monday, 30 June

  • 10:30 to 12:00: Legal Issues in Developing Open Source Systems for Libraries: Understanding Free/Open Source Software Licenses, Project Forms, and Project Governance Options
  • 13:30 to 15:30: Open Source Systems Interest Group (LITA)

Other than that, I’ll be variously at the LibLime booth, in meetings, or hacking Koha.

Last Wednesday I gave a lightning talk at Code4LibCon on some musings about Git qua distributed version control system and ideas for distributed cataloging. Check out my slides.

Slides from the other lightning talks are being posted here. Be sure to check out Andy Mullen’s presentation when his slides and the video are posted — making player piano MIDI files from OCRs of scanned scores is special enough, but his sense of dramatic timing during his presentation was marvelous.