Any city with data initiatives faces the same questions: How much information should the government release, and in what form? Seattle’s proactive but cautious approach could provide the answer.
You don’t have to dig deep to find out what can go wrong with open data initiatives. Just look back to 2014, when the New York City Taxi and Limousine Commission released hundreds of millions of records on taxi trips in the city, with data anonymized to protect identifiable details—at least in theory. In reality, the data was recorded in a format that allowed a software engineer to re-identify the license numbers of the taxis and drivers. A Gawker journalist then linked this to celebrities taking cab rides across the city months after the initial release, speculating on the routes they had taken and even how much they tipped.
It’s certainly not the most sensitive data to ever be leaked or hacked, but it is an important illustration of the risks cities face in releasing data from their many constituent agencies; the implications aren’t always apparent until long after the information is out in the public domain.
In recent years, as the Seattle metro area has grown into a thriving tech hub, the city has been pioneering a progressive, carefully considered approach to releasing public data. Slowly but surely, city authorities are crafting what they hope can serve as a model for “smart cities” around the world. One key part of this plan came in 2016, when the city adopted a resolution that all civic data would be “open by preference,” rather than “open by default.” As David Doyle, Open Data Program Manager for Seattle Information Technology, explains, this extra layer of caution aims to make the city more deliberate about its data practices from the start.
“Policies were first developed with ‘open by default’ in mind, but that isn’t really feasible when you consider factors like privacy,” he says. “Seattle took a more nuanced approach of being open by preference: This means we can be open [with data] once we mitigate for privacy risks, release of personally identifying information, and other kinds of harm.”
Essentially, an open by default policy would mean “publish first, ask questions later”: datasets collected by all city government agencies—police department, housing authority, department of transportation, etc.—would be released online unless and until there was a clear reason not to. Open by preference speaks to a more measured approach: Civic datasets are evaluated proactively with a view to releasing them wherever possible, but only after they’ve been reviewed by city officials.
In one example, Doyle explains that data from the city’s Aging and Disability Services was released to support a hackathon focused on solving accessibility challenges. Since the data on disability, income, ethnicity, and exact location was extremely sensitive, teams from multiple departments worked together to group it into neighborhood segments and age brackets, reducing its identifiability while still providing a useful resource to participants in the event.
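To make that kind of generalization concrete, here is a minimal sketch in Python. The field names, bracket widths, and record structure are all hypothetical, not drawn from Seattle’s actual process; the point is only that precise values are collapsed into coarse buckets and the most sensitive fields are dropped before release.

```python
# Hypothetical sketch of pre-release generalization: exact values are
# collapsed into coarse buckets so fewer records stand out as unique.

def generalize(record):
    """Replace precise attributes with coarser, less identifying ones."""
    decade = (record["age"] // 10) * 10
    return {
        # An exact age becomes a 10-year bracket, e.g. 37 -> "30-39".
        "age_bracket": f"{decade}-{decade + 9}",
        # A street address is replaced by a neighborhood-level label.
        "neighborhood": record["neighborhood"],
        # Fields like exact address and income are dropped entirely.
    }

raw = {"age": 37, "address": "123 Main St", "neighborhood": "Ballard",
       "income": 52000}
print(generalize(raw))  # {'age_bracket': '30-39', 'neighborhood': 'Ballard'}
```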
As part of the 2016 resolution on open data, Seattle also committed to an annual, publicly released risk assessment of its open data program. This year the privacy-focused nonprofit Future of Privacy Forum (FPF) was commissioned to undertake the task, which culminated in the release of a draft report in August that is currently open to public input.
The report aims not only to analyze the city’s progress with data release, but also to develop a framework for evaluating the risks of open data initiatives overall. The idea is to lay out clear criteria for judging the benefits and drawbacks of publishing a certain dataset, leading to a score that can inform a decision on how to proceed.
Still, correctly gauging the risks of a privacy breach is a difficult task: Some personal data is easy to classify as too sensitive for public release; Social Security numbers are an obvious example (or at least should be). But other data falls into a gray area. Making some medical information public is important for epidemiological research, for example, but details on specific medical conditions should not be traceable back to individual patients.
In order to reap the potential benefits of the sensitive-but-not-secret category, data is usually anonymized before being released, but truly guaranteeing anonymity is much easier said than done. In a widely cited study from 2000, Harvard professor Latanya Sweeney (then at Carnegie Mellon) found that 87 percent of Americans could be uniquely identified in a dataset by gender, date of birth, and ZIP code alone. That combination can then be cross-referenced with public voter records to identify each individual by name.
This is the central problem that cities like Seattle face when trying to release anonymized data: Details that are non-identifying when isolated can easily become unique in combination.
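A toy example, with entirely invented records, shows how quickly this happens: simply counting how many records share each (gender, birth date, ZIP) combination reveals which individuals are unique on those fields and therefore potentially re-identifiable.

```python
# Rough sketch of why combined fields re-identify: count how many
# records share each (gender, birth date, ZIP) combination. Records
# in a group of size 1 are unique on those fields alone, and thus
# potentially re-identifiable. The sample data is invented.
from collections import Counter

records = [
    ("F", "1985-04-12", "98101"),
    ("M", "1990-07-30", "98101"),
    ("F", "1985-04-12", "98103"),
    ("M", "1990-07-30", "98101"),  # shares a group with record 2
]

group_sizes = Counter(records)
unique = [r for r in records if group_sizes[r] == 1]
print(f"{len(unique)} of {len(records)} records are unique "
      "on these three fields alone")  # -> 2 of 4
```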
Given how pressing the need for privacy controls has become in the digital age, it might come as a surprise to learn that legal definitions of personally identifying information have not been updated for decades. In the U.S., “Personally Identifying Information” is a legal term with specific meaning in the Privacy Act of 1974, but the law draws the line between identifying and non-identifying information in a way that ignores the realities of modern information security.
“The problem is, as an actual technical matter, this is a distinction without meaning,” says Joseph Jerome, a policy counsel at the Center for Democracy and Technology. “When you look at open data policy, there’s a question over how many different indirect identifiers can be put into data before you have something that completely identifies someone. So in some respects, this is a legal policy debate, but it’s also a technical debate… and the answer isn’t clear.”
The debate is complicated by the need to release information that is useful for analysis while also protecting individual privacy, two goals that are often in direct opposition. Technically, data is least identifying when every individual in the group has exactly the same value for every variable, but then the data is effectively meaningless. By definition, useful data must be identifying to some degree, and a judgment must be made about where to draw the line between utility and privacy.
As a guideline, statistical de-identification expert Khaled El Emam has suggested that no more than six to eight indirect identifiers should be included in any dataset, which should also be modified to ensure a certain threshold of “k-anonymity”: a guarantee that, even when the indirect identifiers are combined, a minimum number of individuals (k) will always share the same values, so that no one record is completely unique.
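Under that definition, checking k-anonymity is straightforward to sketch: find the smallest group of records that share the same values across the chosen quasi-identifier columns. The code below is a minimal illustration with hypothetical column names and data, not a production de-identification tool.

```python
# Minimal k-anonymity check: every combination of quasi-identifier
# values must be shared by at least k records. Column names and data
# here are hypothetical.
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows sharing the same
    values across all quasi-identifier columns."""
    keys = [tuple(row[col] for col in quasi_identifiers) for row in rows]
    return min(Counter(keys).values())

rows = [
    {"age_bracket": "30-39", "zip3": "981", "gender": "F"},
    {"age_bracket": "30-39", "zip3": "981", "gender": "F"},
    {"age_bracket": "40-49", "zip3": "981", "gender": "M"},
    {"age_bracket": "40-49", "zip3": "981", "gender": "M"},
]

k = k_anonymity(rows, ["age_bracket", "zip3", "gender"])
print(f"dataset is {k}-anonymous")  # -> dataset is 2-anonymous
```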
All of these technical and legal constraints can make it difficult for cities to know when data has been processed well enough to be safely released. It can be even harder for citizens to know whether they could be identified from a given dataset. Compounding this problem is the fact that municipal governments, unlike private corporations, can also be compelled to disclose information under public records laws.
It was exactly this situation that led New York’s Taxi and Limousine Commission to release the insufficiently anonymized data on taxi trips. Had it not been for a request from a freedom of information activist, the data would never have been publicly disclosed in the first place.