The 'koboloadeR' package for R

The “koboloadeR” R package is designed to make it easy to retrieve data collected using KoBo Toolbox or other services that use the same API (for example, KoBo Humanitarian Response or Ona). Of these, KoBo Toolbox is quite generous, with a limit of 1,500 submissions per user per month (and, I believe, something like 5 GB per month for data submissions, such as photos, videos, or audio recordings that can be linked to a survey).

The package is available on GitHub. Install it using:

source("http://news.mrdwab.com/install_github.R")
install_github("mrdwab/koboloadeR")

(Note: install_github via @jtilly)
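
If you already have the devtools package installed, installing directly from the same repository should also work:

devtools::install_github("mrdwab/koboloadeR")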

I won’t go into details here about KoBo. It’s an awesome tool for quick mobile data collection. Developing survey forms is very easy, and you get a good range of question types. You can collect data while offline and sync it with the server when you have a data connection available. And, you can export the data in different formats for analysis later on.

Read on for more details!

My typical workflow with KoBo used to be:

  1. Log in to the site.
  2. Navigate to my “projects” page.
  3. Click on the project I want to work with.
  4. Click on a “download data” button that would ask me for the export format I wanted (CSV).
  5. Click on a button to generate a “new export” (assuming that there have been new entries since the last time I accessed the data).
  6. Wait a few seconds for the new export to be generated before being provided with the new link.
  7. Download the file.
  8. Open R.
  9. Read the data using fread or read.csv.
  10. Repeat whenever I needed fresh data.

That’s not too bad, but KoBo also offers free API access, so I thought this would be an opportunity to simplify the process of accessing and using the data with R.

Introducing the “koboloadeR” package

This package presently includes three main functions:

  • kobo_datasets: Lists the datasets available for a given user. Returns a data.table with the basic metadata about the available datasets.
  • kobo_submission_count: Lists the number of submissions for a particular data collection project. Returns a single integer. This function is mostly for use within the kobo_data_downloader function but can be useful if you have developed your own functions and want to know whether new data is available to download.
  • kobo_data_downloader: Downloads a specified dataset via the KoBo API. Returns a data.table of the entire dataset requested. If the dataset already exists in your Global Environment, the function first checks to see whether the number of rows matches with the number of submissions. If the number of submissions is different, it re-downloads the data.
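
As a rough sketch of how the first two of these fit together from the console (the argument names here, and NULL as the anonymous credential, are assumptions on my part; the downloader itself appears again below, and the package help files have the definitive signatures):

library(koboloadeR)

# List the datasets available to a given user; NULL is assumed here to give
# anonymous access to the publicly available datasets.
available <- kobo_datasets(user = NULL)
head(available)    # a data.table of basic metadata, including id and title

# Check how many submissions a particular form currently has. The form id
# below is hypothetical; substitute an id listed by kobo_datasets().
kobo_submission_count("123456", user = NULL)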

However, to make things even more direct and interactive, I’ve also included a Shiny app that lets you browse the datasets available and have them automatically downloaded.

The app is runnable using kobo_apps("data_viewer").
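From a fresh R session, that amounts to just two lines:

library(koboloadeR)
kobo_apps("data_viewer")

Running this would result in the following: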

On this screen, you can enter your user credentials and click on “List Available Datasets” (1). If you do not have an account or just want to test out the app with publicly available datasets, enter NULL for both the username and the password. Basic details are included in the main window area (2).

Once you have done that, the screen would change to look like the following.

Here, you get a datatable view (1) of the available datasets (by the form’s “id” and “title”). The values also automatically populate a dropdown menu (2) where you can type or select the name of the dataset you want to view. Once the selection is made, click on “Load Requested Dataset” (3) and switch over to the “Requested Dataset” tab (4) to view the dataset. Depending on the size of your dataset and the speed of your internet connection, this may take a few moments.

I’ve logged in with NULL for the username and NULL for the password, and requested a dataset named “Coffee locations”. The requested dataset looks like this:

Again, this returns the datatable view (1) of the requested dataset. Notice that you can filter by the values in individual columns, or you can search the entire dataset using the search box.

That’s all fun, but the other interesting part comes when you switch back from the Shiny app to the console.

You should have a view like this:

In the “Source” pane (1), you can see that we just needed to load the package and run the “data_viewer” app. (I used suppressPackageStartupMessages() just to keep things a little quiet in the console.) The console itself shows what happened while the app was running. Notice that upon first requesting the “Coffee locations” dataset, we got the message (2) “No local dataset found. Downloading remote file.” The file is read in as a data.table and assigned to the user’s Global Environment (3).

On subsequent runs, if this were being used as a part of a project, the function would first check the number of rows in the local dataset and compare it with the submission count for the online dataset. If they differ, the data would be downloaded afresh.
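
The same check applies when calling kobo_data_downloader() directly, which is convenient for scripts that get re-run as new submissions come in. A minimal sketch, with a hypothetical form id and the same assumptions about credentials and argument names as above:

# First call in a session: no local copy exists, so the remote file is
# downloaded, returned as a data.table, and placed in the Global Environment.
kobo_data_downloader("123456", user = NULL)

# Subsequent calls: the number of rows in the local copy is compared with the
# online submission count, and the data are only downloaded again when the
# two differ.
kobo_data_downloader("123456", user = NULL)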

Future Developments

There are several directions in which this package could develop from here.

  • The package already includes a utility function for parsing the date values in the dataset. Perhaps this can be applied to the known date columns that are present in the dataset to allow filtering by date.
  • The API includes access to media files, which I have not considered in this version of the package.
  • Your dataset may also collect GPS information. Perhaps additional apps can be developed to make use of this data.
  • The API is supposed to allow queries that limit how much data is downloaded. I haven’t figured out how to make that work :-( so it has not been integrated.
  • Right now, deciding whether to download the data or not is based solely on the number of rows in the dataset. However, it may be better to use some form of an identifier available via the metadata of each dataset, for instance, by using the values in the _uuid column. It might be good to see whether a single column can be retrieved via the API, use that to compare against the local dataset, and then download only the data which are missing. For a small project, however, downloading the full dataset should not require too much bandwidth.

Ideas, issues, and contributions would be appreciated. Please use the issue tracker for this project for such purposes.
