Applying Differential Privacy to Anonymize User Mobility on a Large-Scale WiFi Campus Network
The ubiquitous collection of real-world, fine-grained user mobility data
from WiFi access points (APs) has the potential to revolutionize the
development and evaluation of mobile network research. However, access to
real-world network data is hard to come by; and public releases of network
traces without adequate privacy guarantees can reveal users’ visit
locations, network usage patterns, the number of devices each user is
associated with, and even identify co-traveller information. Past efforts
in anonymization, including k-anonymity based models, have been
insufficient in protecting personally identifiable information, in the
presence of auxiliary data [1].
In this work, we utilize differentially private algorithms to provide a
public release of 802.11 WiFi network traces of a large-scale campus
network with more than 3,000 access points and nearly 40,000 daily users. A
differentially private (DP) algorithm adds sufficient random “noise” to any
output computed from a sensitive collection of data, so that a precise
statistical privacy condition is met. The cost of achieving this precise
privacy guarantee is a potential decrease in dataset utility. To provide
the best trade-off between privacy gained and data utility for network
researchers, we focused on two representative classes of queries - network
resource dimensioning for anticipating congested network APs; and
characterization of individual mobile node trajectories, including
inter-visit times between locations, and in-network session times. We
evaluated a range of state-of-the-art DP algorithms (MWEM, DAWA, Hb,
GreedyH, Identity, Uniform), aided by an open-source framework, Ektelo [1].
We present the following early results from applying DP algorithms for our
two representative use cases; (1) we released noisy histograms for
campus-arrival rates, service time distributions and inter-AP transitions
over a range of noise thresholds for the DP algorithms that offer the best
utility to noise trade-off, (2) we found that most of the data-dependent DP
algorithms (those that exploit features of the dataset), performed poorly
compared to data-independent DP algorithms, and finally, (3) we built
privatized per-AP visit trajectories for each user, based on a
variable-length n-gram DP-approach [2].
Our work lies at the intersection of network measurement and differential
privacy and we plan to investigate the following open questions following
our initial results; (1) how well can we predict future network usage based
on querying the noisy DP dataset as opposed to the private dataset and does
the added noise cause changes to the ground-truth insights? (2) how do we
embed timing information in a differentially-private dataset release while
still providing strong privacy guarantees? (3) we found the n-gram approach
to creating noisy trajectories suffer from significant false-positives, or
fake trajectories. We plan to improve on the n-gram approach by using a
constrained prefix-tree approach that takes into account network topology,
and human mobility constraints.
[1] Zhang, D., et., al., (2018). Ektelo: A framework for defining differentially-private computations. *Proceedings of ACM ICMD, 2018* (115–130).
[2] Chen, R., et. al., (2012). Differentially private sequential data publication via variable-length n-grams. *Proceedings of ACM CCS, 2012* (638–649).