Techblog: How Human Web reliably removes any UIDs

Our second Techblog post about Human Web explains how we ensure that no user identifier (UID) or other private information is stored on Cliqz's servers.

Fingerprinting (Image: iStock / Maksim Kabakou)

Josep M. Pujol, Chief Scientist

As we showed in the first Techblog post about Human Web, there are alternatives to standard server-side aggregation. We can get rid of user identifiers (UIDs) and the sessions they generate by moving data collection to the client side. This approach is general and satisfies a wide range of use cases. As a matter of fact, we have yet to find a use case that cannot be satisfied by client-side aggregation alone.

UIDs are pervasive

Client-side aggregation is how we remove explicit UIDs, that is, the UIDs that are added to make the data linkable on the server side. However, even if you remove all explicit UIDs, the job is not done. There are more UIDs than the explicit ones…

Communication UIDs

Data needs to be transported from the user’s device to the data collection servers. If this communication is direct, it can be used to establish record linkage through network-level information such as the IP address, which doubles as a UID.

Anonymous communication is a well-studied problem with off-the-shelf solutions such as Tor. Unfortunately, we cannot rely on Tor because it was not designed to account for message replays: a malicious actor could send the same message many times, illegitimately inflating the popularity of a page of their choice and, consequently, affecting the ranking of our search engine. To achieve both replay protection and anonymous communication, we had to devise an alternative sub-system called HPN (Human Web Proxy Network).

For instance, suppose we want to measure the audience of a certain domain. When a user visits a web page whose domain has not yet been visited during the current calendar day, the following message is emitted:

{url-visited: 'http://josepmpujol.net/', timestamp: '2017-10-10'}

If all users are normative, that is, well-behaved, we can assume that receiving the above message 100 times means that 100 different users visited that domain on October 10th, 2017. However, there is a non-zero chance that not all users are “normative”.
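The once-per-day emission rule described above can be sketched on the client roughly as follows. This is a minimal illustration, not Cliqz's actual code: the names `lastEmittedDay`, `toDay`, and `maybeBuildVisitMessage` are hypothetical, and treating a "day" as the UTC calendar day is an assumption.

```javascript
// Illustrative sketch only; the names below are hypothetical, not Cliqz APIs.

// domain -> 'YYYY-MM-DD' of the last emitted message.
// This state is kept on the client and is never sent anywhere.
const lastEmittedDay = new Map();

// Assumption: a "day" is the UTC calendar day.
function toDay(date) {
  return date.toISOString().slice(0, 10);
}

// Returns the message to emit, or null if this domain was already reported today.
function maybeBuildVisitMessage(url, now = new Date()) {
  const { origin, hostname } = new URL(url);
  const day = toDay(now);
  if (lastEmittedDay.get(hostname) === day) {
    return null; // already reported today: stay silent
  }
  lastEmittedDay.set(hostname, day);
  // Only the domain's root URL and the day are reported: no path, no UID.
  return { 'url-visited': `${origin}/`, timestamp: day };
}
```

Under these assumptions, the first visit to any page on josepmpujol.net on a given day yields a message, and further visits to that domain on the same day yield `null`. The deduplication state lives only in the browser, so an honest client contributes at most one message per domain per day without the server learning who sent it.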

A malicious actor can exploit this setup to artificially inflate the popularity of a site. They only need to replay the message as many times as they want. Given that we have absolutely no information about the user sending the statistical data, how can we know whether 100 messages come from 100 different users and not from a single malicious one?

HPN solves this issue with heavy use of cryptography, which allows us to filter out repeated messages from the same user without ever knowing anything about that user. If you want to learn more about HPN, the source code is always available.

Implicit UIDs

We have seen that client-side aggregation removes the need for explicit UIDs, and that the Human Web Proxy Network eliminates communication UIDs. However, there is still another big group of user identifiers: the implicit UIDs.

Content Independent Implicit UIDs

Even with anonymous communication, the way and the time in which the data arrives can still be used to achieve some record linkage; a weak one, but still a session. For instance:

Spatial correlations: Messages need to be atomic. If messages are grouped or batched in the same network request for efficiency, the receiver will be able to tag them as coming from the same user.
Temporal correlations: Even if messages are sent atomically in different requests, an attacker could still use the time at which messages arrive to probabilistically link multiple messages to the same user. Messages should be sent at random intervals to remove such correlations.

Human Web already takes care of these two cases of implicit UIDs. Whenever a message is sent via CliqzHumanWeb.sendMessage, it is placed into a queue that is emptied at random intervals. Messages are never grouped or pipelined: each (encrypted) message uses a brand-new HTTP request. Encryption keys are always used one time only, so that a key cannot become a UID.
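The queueing behaviour can be pictured with a small sketch. It is not the actual CliqzHumanWeb implementation: the class `RandomizedQueue`, its options, and the delay bounds are hypothetical, and encryption with a one-time key is assumed to happen inside the caller-supplied `sendOne` function.

```javascript
// Illustrative sketch only; not the real CliqzHumanWeb code.
class RandomizedQueue {
  // sendOne(message) is assumed to encrypt the message with a fresh key
  // and send it in its own HTTP request.
  constructor(sendOne, options = {}) {
    const {
      minDelayMs = 1000,   // hypothetical bounds for the random delay
      maxDelayMs = 60000,
      random = Math.random,
      schedule = (fn, ms) => setTimeout(fn, ms),
    } = options;
    this.sendOne = sendOne;
    this.minDelayMs = minDelayMs;
    this.maxDelayMs = maxDelayMs;
    this.random = random;
    this.schedule = schedule;
    this.queue = [];
    this.timer = null;
  }

  // Enqueue a message; it is sent later, never immediately.
  push(message) {
    this.queue.push(message);
    if (this.timer === null) this.scheduleNext();
  }

  // Wait a random time before the next send to blur temporal correlations.
  scheduleNext() {
    const delay = this.minDelayMs + this.random() * (this.maxDelayMs - this.minDelayMs);
    this.timer = this.schedule(() => this.flushOne(), delay);
  }

  // Send exactly one message per request: no batching, no pipelining.
  flushOne() {
    this.timer = null;
    const message = this.queue.shift();
    if (message !== undefined) this.sendOne(message);
    if (this.queue.length > 0) this.scheduleNext();
  }
}
```

A caller would create one queue, for example `new RandomizedQueue(msg => sendEncrypted(msg))` with a hypothetical `sendEncrypted`, and call `push` for every message. Because each message leaves alone and after an independent random delay, the receiver cannot use co-arrival or regular timing to link messages to one another.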

Content Dependent Implicit UIDs

Content-dependent implicit UIDs are, as the name suggests, specific to the content of the message and thus application dependent. For that reason, it is not possible to offer a general solution: it varies from message to message or, in other words, from use case to use case.

We can, however, provide some examples of good practices and explain how we make sure that implicit UIDs, or other private information, never reach Cliqz’s servers for some of our more complex messages. Learn more at GitHub.

Final Words

Human Web is not a closed system and is constantly evolving to offer the maximum privacy guarantees to its users. We at Cliqz firmly believe that this methodology is a major step forward from the typical server-side aggregation widely used by the industry. Client-side aggregation at Cliqz is done at the browser level. However, it is perfectly possible to do the same using only standard JavaScript and HTML5; check out a prototype of a Google Analytics look-alike.

With our unique approach, we mitigate the risk of gathering information that we would rather not have. The risk of privacy leaks is close to zero, although there is no formal proof of privacy. We would never be able to know things like the list of queries a particular person has made in the last year. This is not because our security and privacy policy prevents us from doing so, but because it cannot be done: it is not technically possible, even if we were asked to do it. In our opinion, Human Web is a Copernican shift in the way data is collected.