Visualizing the global distribution of Koha installations from Debian packages

A picture is worth a thousand words:

Downloads of Koha Debian packages in past 52 weeks
Click to get larger image.

This represents the approximate geographic distribution of downloads of the Koha Debian packages over the past year. Data was taken from the Apache logs from debian.koha-community.org, which MPOW hosts. I counted only completed downloads of the koha-common package, of which there were over 25,000.

Making the map turned out to be an opportunity for me to learn some Python. I first adapted a Python script I found on Stack Overflow to query freegeoip.net and get the latitude and longitude corresponding to each of the 9,432 distinct IP addresses that had downloaded the package.

I then fed the results to OpenHeatMap. While that service is easy to use and is written with GPL3 code, I didn’t quite like the fact that the result is delivered via an Adobe Flash embed.  Consequently, I turned my attention to Plotly, and after some work, was able to write a Python script that does the following:

  1. Fetch the CSV file containing the coordinates and number of downloads.
  2. Exclude as outliers rows where a given IP address made more than 100 downloads of the package during the past year — there were seven of these.
  3. Truncate the latitude and longitude to one decimal place — we need not pester corn farmers in Kansas for bugfixes.
  4. Submit the dataset to Plotly with which to generate a bubble map.

Here’s the code:

#!/usr/bin/python

# adapted from example found at https://plot.ly/python/bubble-maps/

import plotly.plotly as py
import pandas as pd

df = pd.read_csv('http://example.org/koha-with-loc.csv')
df.head()

# scale factor the size of the buble
scale = 3

# filter out rows where an IP address did more than
# one hundred downloads
df = df[df['value'] <= 100]

# truncate latitude and longitude to one decimal
# place
df['lat'] = df['lat'].map('{0:.1f}'.format)
df['lon'] = df['lon'].map('{0:.1f}'.format)

# sum up the 'value' column as 'total_downloads'
aggregation = {
    'value' : {
        'total_downloads' : 'sum'
    }
}

# create a DataFrame grouping by the truncated coordinates
df_sub = df.groupby(['lat', 'lon']).agg(aggregation).reset_index()


coords = []
pt = dict(
    type = 'scattergeo',
    lon = df_sub['lon'],
    lat = df_sub['lat'],
    text = 'Downloads: ' + df_sub['value']['total_downloads'],
    marker = dict(
        size = df_sub['value']['total_downloads'] * scale,
        color = 'rgb(91,173,63)', # Koha green
        line = dict(width=0.5, color='rgb(40,40,40)'),
        sizemode = 'area'
    ),
    name = '')
coords.append(pt)

layout = dict(
        title = 'Koha Debian package downloads',
        showlegend = True,
        geo = dict(
            scope='world',
            projection=dict( type='eckert4' ),
            showland = True,
            landcolor = 'rgb(217, 217, 217)',
            subunitwidth=1,
            countrywidth=1,
            subunitcolor="rgb(255, 255, 255)",
            countrycolor="rgb(255, 255, 255)"
        ),
    )

fig = dict( data=coords, layout=layout )
py.iplot( fig, validate=False, filename='koha-debian-downloads' )

An interactive version of the bubble map is also available on Plotly.

CC BY-SA 4.0 Visualizing the global distribution of Koha installations from Debian packages by Galen Charlton is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.