Nodobo

Nodobo Data Release

Download:

Introduction

Nodobo-2011-01-v1 is the data gathered during a study of the mobile phone usage of 27 high-school students, from September 2010 to February 2011. This dataset includes 13035 call records, 83542 message records, 5292103 presence records, and other related data.

All code in this release is licensed under the terms in LICENSE.txt. We ask that if you use this dataset for research, you cite the most relevant Nodobo publication available. You can find a list of our publications here.

If you have any questions, or if you find any bugs in the code or discrepancies in the data, please contact us:

Alisdair McDiarmid <a.mcdiarmid@strath.ac.uk>
James Irvine <j.m.irvine@strath.ac.uk>
University of Strathclyde

Recreating the database

db.sqlite3.dump.bz2 is a bzipped SQL dump of the sqlite3 database. You can recreate the database by doing the following:

bzcat db.sqlite3.dump.bz2 | sqlite3 db.sqlite3

The resulting db.sqlite3 file will be approximately 1 GB. We have also prepared a Ruby interface for this dataset, intended for casual investigation of the data. More details on using it are in the section on “nodobo.rb” below.

Database schema

The following tables are used:

Calls and Messages

CellTowers

Devices

Presences

Users

Wifis

All tables

Anonymisation

This section of the document describes how we transformed the data to anonymise its contents.

Note: we have not munged timestamps or added random noise to the data to achieve any kind of k-anonymity. As a result, it is possible that some or all users could be identified by an attacker with sufficient resources. Please don’t bother doing so just to prove a point: we know it’s possible.

The following fields have been altered to remove personal information from the dataset:

Each real value for these fields maps 1:1 to a randomly-generated anonymous value. The process for generating these values is as follows:
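
As an illustration only of what a 1:1 mapping to random values involves, the following Ruby sketch gives each distinct real value a stable random pseudonym. The Pseudonymiser class, its method names, and the 16-hex-digit format are hypothetical and chosen for this example; this is not the procedure used for this release.

```ruby
require 'securerandom'

# Hypothetical helper (not part of the Nodobo release): maps each real
# value to exactly one random pseudonym, and never reuses a pseudonym.
class Pseudonymiser
  def initialize
    @forward = {} # real value => pseudonym
    @used = {}    # pseudonyms already handed out
  end

  # Returns the same pseudonym every time it is given the same real value.
  def pseudonym_for(real)
    @forward[real] ||= fresh_pseudonym
  end

  private

  # Draws random 16-hex-digit strings until one is unused, so that two
  # different real values can never share a pseudonym.
  def fresh_pseudonym
    loop do
      candidate = SecureRandom.hex(8)
      next if @used[candidate]
      @used[candidate] = true
      return candidate
    end
  end
end

anonymiser = Pseudonymiser.new
first  = anonymiser.pseudonym_for('real-value-a')
second = anonymiser.pseudonym_for('real-value-a')
other  = anonymiser.pseudonym_for('real-value-b')
puts first == second # same input, same pseudonym
puts first == other  # different inputs, different pseudonyms
```

Because the mapping is kept only in memory, discarding the object after use leaves no way to recover real values from pseudonyms.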

Location Information

For this release of the data, we have been fairly cautious about data anonymisation. In the future we may release the real CID/LAC/BSSID information, to better allow location and movement patterns to be estimated.

Software and studies

Also included in the dataset download are programs for three sample studies. These are detailed below.

Each program can be run with ruby: for example, “ruby conversation-length.rb”. The programs assume that your current working directory is the one with the database and the nodobo.rb code.

Software used:

Ruby interface: nodobo.rb

We have supplied a simple ActiveRecord interface to the database, “nodobo.rb”. This gives classes and relations for each of the types of data in the dataset.

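The interface can be used by running “irb -r ./nodobo.rb”, or by using “require 'nodobo'” in your own programs (on Ruby 1.9.2 and later, where the current directory is no longer on the load path, use “require './nodobo'” instead). A sample irb session is given below: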

>> u = User.find(19)
=> #<User id: 19, name: "Anonymous", number: "07102745960", created_at: "2010-11-11 10:19:34", updated_at: "2010-11-11 10:19:34">
>> u.calls.size
=> 976
>> study_calls = u.calls.select {|c| c.other != nil }; study_calls.size
=> 133
>> Hash[study_calls.group_by(&:other_id).map {|k,v| [k, v.size]}]
=> {16=>2, 19=>1, 25=>2, 14=>4, 21=>124}
>> v = User.find(21)
=> #<User id: 21, name: "Anonymous", number: "07456622368", created_at: "2010-11-11 10:19:35", updated_at: "2010-11-11 10:19:35">
>> v.calls.select {|c| c.other != nil }.size
=> 175

Note that this interface is not particularly efficient, and is intended for basic exploration of the data. We have added indices to the database to improve performance where possible, but many computations (especially those involving presence data) still require significant CPU time.

Conversation Length

This study examines the number of messages in an SMS conversation. A recent publication with a smaller dataset found that most SMS conversations are two messages long, with the number of conversations rapidly decreasing as the conversation length increases. Our data reproduces this result.

Results are output to the csv directory, and plots can be shown with “gnuplot conversation-length.gnuplot”.
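
One plausible way to count conversation lengths is sketched below. It assumes a conversation is a run of messages between the same two users with no gap longer than a fixed threshold. The 10-minute threshold and the function names are assumptions made for this sketch, and conversation-length.rb may define a conversation differently.

```ruby
# Assumed threshold: a gap longer than this starts a new conversation.
GAP_SECONDS = 600

# timestamps: message times (in seconds) exchanged between one pair of users.
# Returns the number of messages in each conversation, in time order.
def conversation_lengths(timestamps, gap = GAP_SECONDS)
  lengths = []
  previous = nil
  timestamps.sort.each do |t|
    if previous.nil? || t - previous > gap
      lengths << 1      # start a new conversation
    else
      lengths[-1] += 1  # extend the current conversation
    end
    previous = t
  end
  lengths
end

# Tallies how many conversations there are of each length.
def length_histogram(lengths)
  lengths.each_with_object(Hash.new(0)) { |n, h| h[n] += 1 }
end

times = [0, 60, 5000, 5100, 5200, 20_000]
lengths = conversation_lengths(times)
p lengths                   # => [2, 3, 1]
p length_histogram(lengths) # => {2=>1, 3=>1, 1=>1}
```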

Daily/Hourly Stats

We binned calls, messages, and presence records by hour of the day and by day of the week. This shows how the study users use different aspects of their phones over time.

Results are output to the csv directory, and plots can be shown with “gnuplot daily-hourly-stats.gnuplot”.
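
The binning step can be sketched in a few lines of Ruby. The function name and the sample timestamps below are illustrative assumptions; the real program reads its records from the database and may treat time zones differently.

```ruby
# Counts events per hour of day (0..23) and per day of week
# (0 = Sunday .. 6 = Saturday, following Ruby's Time#wday).
def bin_by_hour_and_weekday(times)
  by_hour = Array.new(24, 0)
  by_wday = Array.new(7, 0)
  times.each do |t|
    by_hour[t.hour] += 1
    by_wday[t.wday] += 1
  end
  [by_hour, by_wday]
end

# Illustrative timestamps only, given in UTC to keep the example unambiguous.
sample = [
  Time.utc(2010, 11, 11, 10, 19, 34),
  Time.utc(2010, 11, 11, 10, 45, 0),
  Time.utc(2010, 11, 13, 23, 0, 0)
]

by_hour, by_wday = bin_by_hour_and_weekday(sample)
by_hour.each_with_index { |n, hour| puts "#{hour},#{n}" if n > 0 }
by_wday.each_with_index { |n, wday| puts "#{wday},#{n}" if n > 0 }
```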

Dichotomous Social Graph

Our most complex program is an initial attempt to estimate the social graph of the study users. This is achieved by using three dichotomous links between users: one each for calls, messages, and presence.

For calls and messages, a link exists between two users if A has contacted B and B has contacted A: that is, the communication is reciprocal. For presence, a link exists if the users were in proximity on a certain percentage of days, for a certain number of minutes (these parameters default to 4/7 days and 30 minutes per day).
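
The two link rules can be sketched as follows. The function names and input shapes are hypothetical, and treating both thresholds as inclusive ("at least") is an assumption of this sketch, not something the description above specifies.

```ruby
# contacts: array of [from_id, to_id] pairs, one per call or message.
# Returns each reciprocated pair once, as [smaller_id, larger_id].
def reciprocal_links(contacts)
  directed = contacts.uniq
  seen = {}
  directed.each { |pair| seen[pair] = true }
  directed.select { |a, b| a < b && seen[[b, a]] }.sort
end

# minutes_by_day: hash of day => minutes two users spent in proximity.
# total_days: number of days in the observation window.
# Assumed rule: link if at least min_fraction of days have at least
# min_minutes of proximity.
def presence_link?(minutes_by_day, total_days,
                   min_minutes = 30, min_fraction = Rational(4, 7))
  return false if total_days.zero?
  qualifying = minutes_by_day.values.count { |m| m >= min_minutes }
  Rational(qualifying, total_days) >= min_fraction
end

calls = [[19, 21], [21, 19], [19, 14], [19, 21]]
p reciprocal_links(calls) # => [[19, 21]]

week = { mon: 45, tue: 30, wed: 10, thu: 60, fri: 35 }
p presence_link?(week, 7) # => true
```

Here users 19 and 21 are linked because each contacted the other, while the one-way contact from 19 to 14 produces no link.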

This study is by far the most computationally expensive, and with this naive implementation will take around an hour on a top-end workstation.

Results are output to “dichotomous-social-graph.dot”, a dot-format network description, which can be inspected manually or rendered as a graph with GraphViz or other visualisation software.
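
For readers unfamiliar with the dot format, the sketch below writes a small undirected graph in that syntax. The to_dot function, the graph name, and the use of edge labels for the link type are assumptions made for illustration; the file written by the study program may be laid out differently.

```ruby
# links: array of [user_a, user_b, kind], where kind names the link type
# (for example "calls", "messages" or "presence").
# Returns an undirected graph description in dot syntax.
def to_dot(links, name = 'nodobo')
  lines = ["graph #{name} {"]
  links.each do |a, b, kind|
    lines << "  #{a} -- #{b} [label=\"#{kind}\"];"
  end
  lines << '}'
  lines.join("\n") + "\n"
end

puts to_dot([[19, 21, 'calls'], [19, 21, 'presence'], [14, 19, 'messages']])
```

Output in this form can be rendered with GraphViz, for example with “dot -Tpng graph.dot -o graph.png”.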


Alisdair McDiarmid, 2011