Tracking is so much more than just cookies

Ever since a discussion with Joel Purra at one of our Homebrew Website Club meetings back in spring, I have been curious to look deeper into his thesis "Swedes Online: You Are More Tracked Than You Think" (full text PDF) - a piece of Computer Science research built on a premise that redefines "tracking" and visualizes just how big an issue both users and website providers are facing when the concept is looked at in a broader way.

In this post, I try to summarize some of the highlights from Joel's work and connect it with my philosophy of privacy-aware design.

Researching tracking with a holistic aim

As I have pointed out many times before, the issue with tracking is not only one of obvious trackers - those little pieces of JavaScript code that leave a pile of, commonly third-party, cookies in your browser and follow you as you navigate the WWW - but that, at least in theory, every piece of content loaded from sites other than the originating domain enables some degree of third-party tracking.
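To make this concrete, here is a minimal sketch (the function name and all values are my own illustration) of what a third-party server can record when a page embeds even a single static resource - say, an image - from its domain. No JavaScript and no cookies are involved; everything arrives with the plain HTTP request:

```python
# Sketch: the tracking-relevant fields a third-party server sees when a
# browser fetches any embedded resource, cookie or no cookie.

def log_entry(client_ip: str, headers: dict) -> dict:
    """Build a tracking-relevant log record from a plain HTTP request."""
    return {
        "ip": client_ip,                                # network identity
        "page": headers.get("Referer", ""),             # full URL of the page visited
        "browser": headers.get("User-Agent", ""),       # browser and OS details
        "language": headers.get("Accept-Language", ""), # locale preferences
    }

# A request for a static image embedded on a news site might yield:
entry = log_entry("198.51.100.7", {
    "Referer": "https://example-news.se/articles/some-article",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/38.0",
    "Accept-Language": "sv-SE,sv;q=0.8",
})
print(entry["page"])  # the third party learns which article was read
```

The Referer header alone tells the external server exactly which page was being read, on which site, by which network address.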

Joel shares this perspective, elaborating as he defines his methodology:

One assumption is that all resources external to the initially requested (origin) domain can act as trackers, even for static (non-script, non-executable) resources with no capabilities to dynamically survey the user’s browser, collecting data and tracking users across domains using for example the referer (sic) HTTP header. While there are lists of known trackers, used by browser privacy tools, they are not 100% effective due to not being complete, always up to date or accurate. Lists are instead used to emphasize those external resources as confirmed and recognized trackers.

This leads to a methodology that looks well beyond what is commonly seen as "tracking" (namely cookies) and instead assumes that tracking is always possible based on variables accessible from the mere request of a resource from a server, even regardless of the presence of actual "tracking cookies":

While cookies used for tracking have been a concern for many, they are not necessary in order to identify most users upon return, even uniquely on a global level. Cookies have not been considered to be an indicator of tracking, as it can be assumed that a combination of other server and client side techniques can achieve the same goal as a normal tracking cookie.
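How users can be re-identified without any cookie at all can be sketched as follows (this is my own illustration of the general technique, not code from the thesis): the server hashes a handful of passively observed request attributes into a pseudo-identifier that recurs on every visit.

```python
import hashlib

# Sketch of cookieless, server-side identification: combine attributes that
# arrive with every request into a stable pseudo-identifier.

def passive_fingerprint(ip: str, headers: dict) -> str:
    parts = [
        ip,
        headers.get("User-Agent", ""),
        headers.get("Accept-Language", ""),
        headers.get("Accept-Encoding", ""),
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]

# The same browser returning later produces the same identifier -
# no cookie was ever set.
h1 = passive_fingerprint("198.51.100.7",
                         {"User-Agent": "Firefox/38.0", "Accept-Language": "sv-SE"})
h2 = passive_fingerprint("198.51.100.7",
                         {"User-Agent": "Firefox/38.0", "Accept-Language": "sv-SE"})
assert h1 == h2  # same user, same identifier
```

Real-world fingerprinting draws on many more signals (screen size, fonts, canvas rendering), but even this handful of headers narrows users down considerably.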

In the light of these methodological choices, things become interesting as we look at some of the results of the study:

  • of a list of top Swedish websites, over 90% load external resources, hence enabling some form of tracking

  • visiting Swedish media websites leads to requests for resources from at least 57 organisations, enabling each of these to track the user

  • with a blocking service in place (Disconnect.me, in the context of this research), only 10% or less of all requests to potentially tracking external resources are blocked

  • traffic data of at least 90% of the websites of importance to the Swedish general public ends up on Google servers (this includes not only Google Analytics but, in keeping with the broader mode of inquiry of this study, any other resource retrieved from a service under Google ownership)

In his discussion, Joel makes yet another point (p. 35) that supports a long-standing perspective among privacy-concerned designers: blocker software commonly differentiates between data transfers made purely for the purpose of tracking (blocked) and tracking that occurs on delivery of content from an external source (not commonly blocked, as it is considered part of a site's content).

Google tracking users as they watch a YouTube video or browse a Google Map embedded in a website is a simple example; probably even more ubiquitous is the use of Google's CDNs (Content Delivery Networks, a means to speed up the loading times of files on the internet) for fonts, JavaScript libraries and the like.

The design perspective

Where does this leave us in our day-to-day work on building "the web"? Loading third-party content is an essential mechanism of how the web works (and comes with a range of beneficial use cases), yet whenever a website loads content from elsewhere, there is a possibility that those third parties engage in some form of tracking, connecting users across websites into user profiles.

On the other hand, as Joel Purra points out, browser-side blocking is always a trade-off between user experience and reliability:

It does seem as if the blacklist model needs to be improved – perhaps by using whitelisting instead of blacklisting. The question then becomes an issue of weighing a game of cat and mouse – if the whitelist is shared by many users – against convenience – if each user maintains their own whitelist. At the moment it seems convenience and blacklists are winning, at the cost of playing cat and mouse with third parties who end up being blocked.

...not to mention (this is the sociologist in me speaking) that improved blocking mechanisms, as long as they are not default features in browsers, only serve those who understand the impact tracking has on their lives. So, in addition to developing smarter tools to detect and block tracking for those concerned (and, really, everybody should be - not least after reading Joel's report), this topic is of the highest relevance for those designing the web and building these sites.

While being able to profile or even identify users is obviously a core aspect of how a lot of business is done online today, I strongly believe that a shift away from this reliance on often ethically questionable intrusions into users' privacy, towards new, sustainable yet privacy-aware paradigms, is in order: just like "fair trade" chocolate is a business, "privacy-aware websites" should be too. With an increasingly aware public, these aspects are likely to become even more important.

With the current backlash against tracking, it is only a question of time until deep, non-consensual tracking is rendered worthless; accelerated by policy regulations such as the GDPR in the EU, which, once in effect, will render illegal any tracking that is not based on the user's explicit and informed consent. Looking at the research findings cited above, that may result in a lot of information to be provided and consent to be collected!

Disconnect.me demonstrates how a privacy policy does not have to be a multi-page legal document; in an age of increasing privacy awareness, this could well be a valuable brand asset (and, once tracking becomes opt-in in the EU as of May 2018, would solve a whole lot of looming user experience issues).

As for the designer or software engineer facing such questions: trust is at the core here. Since it will never be possible to determine with certainty the intent of a party providing content on their servers for use on external sites, assessing the trustworthiness of such a source is just one of many requirements for the privacy-aware designer. And when in doubt, content should only be embedded after the user expresses their consent (this is already done by many websites, making users aware that embedding YouTube videos or Tweets will "tell" Google that they are using that site; this website of mine has had such a solution in place for many years).

Berlin-Brandenburg public broadcaster RBB only embeds third-party social media content after users have acknowledged a warning about the privacy consequences ("when loading this content, data will be transmitted to the service provider and potentially other third parties").
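The logic of such a consent gate can be sketched in a few lines. This is a hypothetical server-side rendering helper (names and markup are my own illustration): the page initially ships only a local placeholder, and the iframe markup that actually triggers requests to the third party is produced solely after the user opts in.

```python
# Sketch of a consent gate for third-party embeds: no request leaves for the
# third party until the user has explicitly agreed.

def render_embed(video_id: str, consented: bool) -> str:
    if not consented:
        # Local placeholder only - the browser contacts nobody but the origin.
        return (
            f'<div class="embed-consent">'
            f"When loading this content, data will be transmitted to the "
            f"service provider and potentially other third parties. "
            f'<button data-video="{video_id}">Load video</button></div>'
        )
    # Only this markup causes the browser to contact the third-party server.
    return (f'<iframe src="https://www.youtube-nocookie.com/embed/{video_id}" '
            f'allowfullscreen></iframe>')

print(render_embed("abc123", consented=False))  # no third-party request yet
```

A small click handler on the button would swap the placeholder for the consented markup; the essential point is that the third-party URL never appears in the page until then.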

Analytics data can already be gathered today in ways that comply with existing law (in Germany, for example, by anonymizing IP addresses) while preserving an often sufficient share of the relevant information. And no matter how detailed the data collection has to be, there are brilliant, decentralised alternatives to Google Analytics that - always depending on an organisation's use case, obviously - may be used to keep user data in-house rather than enabling Big Data corporations to cross-link it with data from millions of other websites.
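The IP anonymization mentioned above is simple in practice. A common approach (used, with variations, by several analytics tools; the function here is my own sketch) is to zero the last octet of IPv4 addresses, and a larger trailing portion of IPv6 addresses, before anything is stored:

```python
import ipaddress

# Sketch: truncate IP addresses before storage so that individual users can
# no longer be singled out, while coarse geographic analysis still works.

def anonymize_ip(address: str) -> str:
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        # Keep the /24 network, drop the host octet.
        network = ipaddress.ip_network(address + "/24", strict=False)
    else:
        # For IPv6, keep only the /48 prefix.
        network = ipaddress.ip_network(address + "/48", strict=False)
    return str(network.network_address)

print(anonymize_ip("198.51.100.73"))            # -> 198.51.100.0
print(anonymize_ip("2001:db8::8a2e:370:7334"))  # -> 2001:db8::
```

The truncated address still supports per-region statistics, but it no longer identifies a single connection.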

Last but not least: loading code libraries and fonts from one's own server is often seen as archaic and bad for the user experience (granted, loading times on slow networks benefit from a pre-cached CDN copy of a JavaScript library). But that holds only if "user experience" is defined as data transfer times, not as the privacy a user may value even higher than how fast a website loads. Besides, maybe the real question is whether all that code is really needed in the first place.
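Moving from CDN-hosted to self-hosted assets can even be automated at build time. As a purely hypothetical sketch (the URLs, paths and helper name are my own), a small rewrite step maps known CDN references to local copies so that no asset request ever leaves the site's own domain:

```python
# Hypothetical build-time helper: rewrite known CDN references in generated
# HTML to locally hosted copies of the same files.

CDN_TO_LOCAL = {
    "https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js":
        "/assets/js/jquery.min.js",
    "https://fonts.googleapis.com/css?family=Open+Sans":
        "/assets/css/open-sans.css",
}

def localize(html: str, mapping: dict = CDN_TO_LOCAL) -> str:
    """Replace each known CDN URL with the path of a self-hosted copy."""
    for cdn_url, local_path in mapping.items():
        html = html.replace(cdn_url, local_path)
    return html

page = ('<script src="https://ajax.googleapis.com/ajax/libs/'
        'jquery/1.11.3/jquery.min.js"></script>')
print(localize(page))  # -> '<script src="/assets/js/jquery.min.js"></script>'
```

The local copies themselves would be downloaded once and committed alongside the site; after that, visitors never contact the CDN at all.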

Bibliography

J. Purra. 2015. Swedes Online: You Are More Tracked Than You Think. Master's thesis. Linköping University (LiU), Linköping, Sweden. https://joelpurra.com/projects/masters-thesis/

J. Purra and N. Carlsson. 2016. Third-party Tracking on the Web: A Swedish Perspective. In Proceedings of the IEEE Conference on Local Computer Networks (LCN), Dubai, UAE. https://joelpurra.com/projects/masters-thesis/