When a browser loads a website, depending on its HTML structure, it can end up connecting to a variety of different web services. This interconnection among web services has been rock solid since the 90s. In this project, two issues have been investigated: (i) the impact of these interconnected elements on user privacy - because third-party services know what the user is doing online, while the user is often not aware of what those services are doing - and (ii) their impact on security - because the more third-party web services a user connects to, the higher the risk of the user being attacked and exposed.

Privacy researchers, security advocates, and NGOs for digital safety have been investigating this phenomenon for several years. It can't really be seen as a recent hot topic: it is old enough to have been named "the original sin of the internet" by Ethan Zuckerman, creator of the pop-up ad. And it remains a fundamental political issue because of the alleged manipulative consequences of targeted electoral advertising, the lack of accountability, and browser security.

Interconnected elements

I consider four high-level interconnected aspects to interpret the findings and the tool usage in this project. I am also grateful for the support of my host organization, CodingRights: the interdisciplinary approach adopted here comes from the effort we made in explaining such technical results to a non-technical audience.

Technology: specialized tools such as Google Analytics, Google Fonts, Adobe Fonts, and Cloudflare are intended to offer advanced functionality to people who need these products without having the resources to deploy such services internally. Even such services themselves include additional third-party elements, and this chain of inclusion has nowadays become a widespread practice. Technical specialization explains many of these third-party inclusions.

Security: third-party authentication options (such as "login with Facebook"), or services that delegate statistics aggregation and anonymization to obtain insightful analytics, are specialized services, and they have proven to be a key factor in the progress of web services. On the one hand, third-party inclusion is a way to offer stronger data protection; on the other hand, it implicitly increases the attack surface: if the security of the third party is not robust enough within the threat model of the website under analysis, it becomes the weak link of the chain.

Cultural: the technical complexity and the abstraction of behavioral profiling are hard concepts to explain. It is even harder to explain their impact without real-world cases such as Cambridge Analytica. This confirms how hard it is to do advocacy and raise awareness on issues which have not yet exploded into a public scandal.

Political: identifying protection methods matters more to people belonging to a minority or a community at risk. Information (technically known as selectors) such as membership of or affinity to a specific religion can be a serious issue in certain regions and not represent a problem somewhere else. The risk assessment in analyses like this needs to take into account both the website and its audience. The complexity mentioned above is a key factor in the dissemination of the results: previous advocacy experiences show how a technical analysis without a political framing doesn't trigger any change.

Existing work and competitors' successes

Considering the public debate on data privacy, the fairness of monetization built upon massive user profiling, and HTTPS security, other researchers produced significant studies and developed valuable tools in the same time frame.
During my fellowship, other projects covered some aspects I had planned to implement; this made me slightly reframe the scope of my tool and my research.

Readers wishing to explore other related and complementary projects could look at:

Research development from the 2016-2017 scenario

The automated analysis of the top 1 million websites contributed to giving the scientific community a grasp of the phenomenon; unfortunately, it is rarely reported in mainstream news media. The reporting tends to get diluted, if not banalized, as "Google is present in 80+% of the websites we navigate", which somehow neutralizes the complexity of the phenomenon. This project is a technological framework which enables analysis of the web trackers belonging to a selected cluster of websites. The communication effort is delegated to groups promoting the open web agenda and reducing the influence of user profiling and algorithm-driven personalized experiences. The project's goal is to provide a data stream, through an API or CSV data, to groups able to translate the technical complexity into effective advocacy.

This project has not taken the ad-blocking phenomenon into account. Ad blocking is, indeed, the most efficient self-defense tool available, but only an expert and educated audience can make use of it, and this project aims at increasing such awareness. My educated guess is that the victims of mass profiling are often the less tech-savvy. The phenomenon can be observed in the development of new tracking techniques, and in new actors trying to change the narrative around ad blocking. Although the topic is fascinating to me, it was out of scope here.

Risk scenario: when third-party trackers decrease navigation security

CitizenLab released a report named "Bad Traffic". The report explains how, in some regions, network devices invisible in the network topology inject malicious code into non-encrypted web sessions. This attack pattern targets individuals by discriminating internet traffic (we can't know which selectors are used, but the vantage point is that of network interception).

Researchers from Politecnico di Milano and the University of California, Santa Barbara describe an attack model in which third parties can take control of the main browser window via JavaScript code injection. In this case, the target can be selected with the advertising network's capabilities (such as paying to target a set of individuals profiled with the granularity permitted by the online advertising industry).

Target selection and attack deployments

As human subjects, we each have a group of websites and topics we consume more often than others. Let's depict this cluster of sites as the "information diet" of a user. Such habits have a recurring pattern, which the network can tap and/or the advertising industry can use to profile user behaviour. In the medium term the adversary harvests enough data points to categorize the user with sufficient precision to judge whether or not to target them.
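The categorization step above can be sketched as a toy model. The host-to-category map and the visit threshold below are hypothetical illustrations, not part of the project's actual pipeline; a real adversary would use a far richer taxonomy and statistical model.

```python
from collections import Counter

# Hypothetical mapping from visited hosts to interest categories
# (a real profiler would use a far larger taxonomy).
CATEGORY_OF_HOST = {
    "clinic.example": "health",
    "news.example": "news",
    "forum.example": "politics",
}

def profile(visited_hosts, threshold=3):
    """Count visits per category and flag the categories seen at least
    `threshold` times -- a toy model of the user's "information diet"."""
    diet = Counter(CATEGORY_OF_HOST.get(h, "unknown") for h in visited_hosts)
    return {category for category, count in diet.items() if count >= threshold}

# A user who repeatedly visits health-related sites becomes
# targetable on the "health" selector.
history = ["clinic.example"] * 4 + ["news.example"] * 2
print(profile(history))  # {'health'}
```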

The techniques mentioned above imply the user can be exploited through the sites they interact with and are exposed to: those sites are the attack surface for client-side attacks.

We can group websites or URLs into categories to build an effective communication campaign, because a given cluster of content or values likely maps to a specific threat model, with more apparent regional conflicts or political struggles. The tool aims to analyze a cluster of websites because the output can then be more effectively linked to a concrete threat, making a third-party awareness campaign compelling.
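One way to make a cluster's exposure concrete is to invert the site-to-trackers relation and rank each third party by how many sites of the cluster include it. This is a minimal sketch under assumed data: the `observed` map below is hypothetical example data, not output of the actual tool.

```python
from collections import defaultdict

# Hypothetical observations: for each first-party site in a cluster,
# the set of third-party hosts its pages include.
observed = {
    "clinic-a.example": {"analytics.example", "fonts.example"},
    "clinic-b.example": {"analytics.example", "ads.example"},
    "clinic-c.example": {"analytics.example"},
}

def tracker_coverage(observed):
    """Invert the site -> third-parties map and rank each third party
    by the number of cluster sites that include it."""
    coverage = defaultdict(set)
    for site, third_parties in observed.items():
        for tp in third_parties:
            coverage[tp].add(site)
    return sorted(coverage.items(), key=lambda kv: len(kv[1]), reverse=True)

for tp, sites in tracker_coverage(observed):
    print(f"{tp}: seen on {len(sites)}/{len(observed)} sites")
```

A tracker present on most sites of a themed cluster (here, clinics) is exactly the kind of finding that links a concrete threat to an audience.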

Security evaluations, remediation and responsibilities

The most effective way experts control exposure to third-party trackers is to monitor, and deny by default, unwanted web connections; NoScript is the most flexible tool available, Privacy Badger is the most interesting compromise between safe defaults and simplicity of user training, and Ghostery and the Brave browser represent more user-friendly, whitelist/blacklist approaches.

Self-defense tools are great for individuals' privacy and security awareness, but if we consider an organization as a whole, its security depends on the weakest link of the chain. It is therefore clear that these solutions are neither viable nor scalable as an organization policy.

Besides the justification above, the political issue looks like this: "a technical self-defense tool works well only in a technocratic society, where the informed are safer than those without the means." To address the systemic issue, we should leverage the websites themselves: they are the ultimate enablers of third-party information leakage and user exposure. They are the websites we trust, and we are their audience. I feel that it is the invisibility and abstraction of such violations of trust that permit the misuse of private information. This project's intention is to provide a re-usable tool which shows evidence, tracing the link between third parties and website responsibilities.
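The core evidence-gathering step - listing which external hosts a page pulls in - can be sketched with the Python standard library alone. The class name, the sample HTML, and the `clinic.example` first party below are illustrative assumptions, not the project's actual implementation (which would also need to follow redirects, run JavaScript, and observe dynamic requests).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ThirdPartyFinder(HTMLParser):
    """Collect external hosts referenced by src/href attributes
    in a page's static HTML."""
    def __init__(self, first_party):
        super().__init__()
        self.first_party = first_party
        self.third_parties = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).netloc
                # Relative URLs have no netloc and stay first-party.
                if host and host != self.first_party:
                    self.third_parties.add(host)

page = """
<html><head>
  <script src="https://analytics.example/t.js"></script>
  <link href="/local.css" rel="stylesheet">
  <img src="https://ads.example/pixel.gif">
</head></html>
"""
finder = ThirdPartyFinder("clinic.example")
finder.feed(page)
print(sorted(finder.third_parties))  # ['ads.example', 'analytics.example']
```

Output like this is the raw material for the responsibility argument: every listed host is a party the website, not the user, chose to involve.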

Analysis report

This report is composed of three analyses. The chapters below document the context of the experiments, the technology used, and the findings.

1st Analysis. We address the issue of leakage of users' health data via third-party trackers. CodingRights, my host organization, selected websites belonging to online clinics in a region where private healthcare insurers can access the data-broker market and discriminate on insurance prices. Where a data protection regulation is in place, it may prohibit the use of algorithm-driven decision making for tasks impacting individuals' lives. It is still an issue today, and it will likely remain an open debate for years.

2nd Analysis. With the collaboration of four Latin American NGOs, we analyzed government websites; the chapter also introduces the framework we implemented.

3rd Analysis. Monitoring of social media links during electoral campaigns (the Argentinian 2017 and Italian 2018 campaigns).

Technical references

On GitHub: the project repository and the campaign repositories (clinics analysis, latam institutions).