Techblog: Human Web - data protection when building web statistics
The Human Web proves that it is possible to collect statistical data without endangering the users' anonymity and privacy. This Techblog post explains how it works.
Cliqz needs statistical data to power the services it offers: search, tracking protection, anti-phishing, etc. This data is collected in a very different way than typical data collection. Human Web is a methodology and system developed by Cliqz to collect statistical data from the collective of Cliqz users while protecting their privacy and anonymity.
Cliqz wants to depart from the current standard model, where users must trust that the company collecting the data will not misuse it, ever, under any circumstances. Legal obligations aside, there are many ways this trust model can fail. Hackers can steal data. Governments can issue subpoenas, or get direct access to the data. Unethical employees can dig through the data for personal interest. Companies can go bankrupt and have their data auctioned to the highest bidder. Finally, companies can unilaterally decide to change their privacy policies.
In the current standard model, the user has little control. This is not something that Cliqz wants to be part of. The Human Web is our proposal for a more responsible and less invasive approach to data collection.
Fundamentals
The fundamental idea of the Human Web data collection is simple: to actively prevent record linkage.
Record linkage is the ability to know that multiple data elements, e.g. messages or records, come from the same user. This linkage leads to sessions, and these sessions are very dangerous with regard to privacy. For instance, Google Analytics data can be used to build sessions that can sometimes be de-anonymized by anyone who has access to them. Was it intentional? Most likely not. Will Google Analytics try to de-anonymize the data? We bet not. But still, the session is there, stored somewhere, and trust that it is not going to be misused is the only protection users have.
The Human Web is a methodology and system designed to collect data that cannot be turned into sessions once it reaches Cliqz. How? Any user identifier (UID) that could be used to link records as belonging to the same person is strictly forbidden, not only explicit UIDs but also implicit ones. Consequently, aggregation of users’ data on the server side (on Cliqz premises) is not technically feasible, as we have no means of knowing who the original owner of the data is. This is a strong departure from the industry standard for data collection.
Problem: Server-side aggregation of data produces privacy side-effects
Let us illustrate that with an example (a real one): Since Cliqz is a search engine we need to know for which queries our results are not good enough. A very legitimate use case, let’s call it bad-queries. How do we achieve this?
It is easy to do if the users help us with their data. Simply observe the event in which a user does a query q in Cliqz and then, within one hour, does the same query on a different search engine. That would be a good signal that Cliqz’s results for query q need to be improved. There are several approaches to collect the data needed for quality assessment. We want to show you why the industry standard approach has privacy risks.
Let’s first start with the typical way to collect data: the server-side aggregation.
We would collect URLs for search engine result pages; the query and the search engine can be extracted from the URL. We would also need to keep a timestamp and a UID so that we know which queries were done by the same person. With this data, it is then straightforward to implement a script that finds the bad-queries we are looking for.
The data that we would collect with the server-side aggregation approach would look like this:
...
SERP=cliqz.com/q=firefox hq address, UID=X, TIMESTAMP=2017...
SERP=google.com/q=firefox hq address, UID=X, TIMESTAMP=2017...
SERP=google.com/q=facebook cristina grillo, UID=X, TIMESTAMP=2017...
SERP=google.com/q=trump for president, UID=Y, TIMESTAMP=2017...
...
A simple script would traverse the file(s), checking for repetitions of the tuple (UID, query) within a one-hour interval. By doing so, in the example, we would find that the query “firefox hq address” seems to be problematic. Problem solved? Yes.
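The server-side script described above could look roughly like the following Python sketch. It is only an illustration, not Cliqz's actual code: the record layout, the field names, the concrete timestamps and the function name find_bad_queries are all assumptions made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical records mirroring the example log; the timestamps are invented.
RECORDS = [
    {"engine": "cliqz", "query": "firefox hq address", "uid": "X",
     "ts": datetime(2017, 1, 1, 10, 0)},
    {"engine": "google", "query": "firefox hq address", "uid": "X",
     "ts": datetime(2017, 1, 1, 10, 5)},
    {"engine": "google", "query": "facebook cristina grillo", "uid": "X",
     "ts": datetime(2017, 1, 1, 11, 30)},
    {"engine": "google", "query": "trump for president", "uid": "Y",
     "ts": datetime(2017, 1, 1, 12, 0)},
]


def find_bad_queries(records, window=timedelta(hours=1)):
    """Return queries done on Cliqz and repeated on another engine
    by the same UID within `window`."""
    # Index the times of all Cliqz queries by (UID, query).
    cliqz_times = {}
    for r in records:
        if r["engine"] == "cliqz":
            cliqz_times.setdefault((r["uid"], r["query"]), []).append(r["ts"])

    bad = set()
    for r in records:
        if r["engine"] == "cliqz":
            continue
        for t in cliqz_times.get((r["uid"], r["query"]), []):
            # The other engine must be queried after Cliqz, within the window.
            if timedelta(0) <= r["ts"] - t <= window:
                bad.add(r["query"])
    return bad


print(find_bad_queries(RECORDS))
```

Note that the script can only work because every record carries a UID, which is exactly the property that makes this data risky.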
This data can in fact be used to solve many other use cases. The problem is that some of these additional use cases are extremely privacy-sensitive. With this data, we could build a session for a user, let’s say the user with the anonymous UID X:
user=X, queries={'firefox hq address','facebook cristina grillo'}
Suddenly we have the full history of that person’s search queries! On top of that, perhaps one of the queries contains personally identifiable information (PII) that puts a real name to the user X. That was never the intention of whoever collected the data. But now the data exists, and the user can only trust that her search history is not going to be misused by the company that collected it.
This is what happens when you collect data that can be aggregated by UID on the server-side. It can be used to build sessions. And the scope of the session is virtually unbounded, for the good, solving many use cases, and for the bad, compromising the user’s privacy.
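To make the risk concrete, here is a minimal Python sketch of how little code it takes to turn UID-tagged records into sessions. The (uid, query) tuples and the function name build_sessions are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical (uid, query) pairs taken from the example log.
LOG = [
    ("X", "firefox hq address"),
    ("X", "firefox hq address"),
    ("X", "facebook cristina grillo"),
    ("Y", "trump for president"),
]


def build_sessions(log):
    """Group every query by the UID that issued it."""
    sessions = defaultdict(set)
    for uid, query in log:
        sessions[uid].add(query)
    return dict(sessions)


for uid, queries in build_sessions(LOG).items():
    print(f"user={uid}, queries={sorted(queries)}")
```

Nothing in the collected data prevents this grouping; the only safeguard is that nobody with access to the data decides to run it.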
Solution: Move aggregation of data to the client-side
We do not want to aggregate the data on the server due to the privacy implications. Yet at some point, all the queries of the user in a certain timeframe must be accessible somewhere, otherwise we cannot resolve the use case. But that place does not need to be the server; the aggregation can be done on the client, in the browser. We call this client-side aggregation.
What we do is to move the script that detects bad_queries to the browser, run it against the queries that the user does in real-time and then, when all conditions are met, send the following data back to our servers:
...
type=bad_query,query=firefox hq address,target=google
...
This is exactly what we were looking for, examples of bad queries. Nothing more, nothing less.
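The client-side detection could be sketched as follows. This is a simplified illustration in Python (the real logic would run in the browser), and the class name BadQueryDetector, its observe method and the in-memory dictionary are hypothetical choices for this example, not the actual Human Web implementation. Only the shape of the emitted message follows the record shown above.

```python
import time


class BadQueryDetector:
    """Hypothetical client-side detector. Recent Cliqz queries are kept
    only in local memory and never leave the device."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._cliqz_queries = {}  # query -> time of the last Cliqz search

    def observe(self, engine, query, now=None):
        """Record one search; return a message to send, or None."""
        now = time.time() if now is None else now
        # Forget Cliqz queries that are older than the window.
        self._cliqz_queries = {
            q: t for q, t in self._cliqz_queries.items()
            if now - t <= self.window_seconds
        }
        if engine == "cliqz":
            self._cliqz_queries[query] = now
            return None
        if query in self._cliqz_queries:
            del self._cliqz_queries[query]  # report each bad query only once
            # No UID and no timestamp: nothing that could link two messages.
            return {"type": "bad_query", "query": query, "target": engine}
        return None


detector = BadQueryDetector()
detector.observe("cliqz", "firefox hq address", now=0)
print(detector.observe("google", "firefox hq address", now=300))
```

The design point is that the state needed for the one-hour comparison lives only on the user's device, and the message that leaves it contains just the fields required by the use case.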
The aggregation of data can always be done on the client side, i.e. on the user’s device and therefore under the full control of the user. That is the place to do it. As a matter of fact, this is the only place where it should be allowed.
The snippet above satisfies the bad_queries use case and most likely will not be reusable for other use cases, that is true, but it comes without any privacy implication or side-effect.
The query itself could contain sensitive information, of course. But even if that allowed us to associate the record with a real person, that would be the only information learned. Think about what happens in the server-side aggregation model: the complete session of that user would be compromised, all the queries in her history. Or only a fraction of it, if the company collecting the data was sensible enough not to use permanent UIDs. Still, unnecessary. And sadly, server-side aggregation is the norm, not the exception.
Client-side aggregation has some drawbacks, namely:
- It requires a change of mindset by the developers.
- Processing and mining data requires code to be deployed and run on the client side.
- The data collected might not be suitable to satisfy other use cases. Because the data has already been aggregated per user on the client, it might not be reusable.
- Aggregating past data might not be possible as the data to be aggregated may no longer be available on the client.
However, these drawbacks are a very small price to pay in return for the peace of mind of knowing that the data being collected cannot be transformed into sessions with uncontrollable privacy side-effects.
The goal of Human Web is not so much to anonymize data; for that purpose there are good methods like differential privacy, l-diversity, etc. Rather than trying to preserve the privacy of a data set that contains sensitive information, the aim of Human Web is to prevent such data sets from being collected in the first place.
Conclusion
In this Techblog post we’ve explained the fundamental idea and basic functionality of Human Web. We can get rid of UIDs and the sessions they generate by moving the aggregation of data to the client side. In the next post, we will take a closer look at UIDs and explain how we make sure that no private information reaches Cliqz’s servers.