Abstract

To test the concept of a context-targeted campaign, we analyzed web trackers on the websites of online clinics. The pages of such websites contain explanations of the medical treatments and exams available in a hospital. Websites like these can leak, through the navigation sequence, the medical issue of a user (possibly correlated across multiple clinics and with the time spent on each page).

Third-party trackers may know nothing about the website in which they run, but metadata such as a user's interest in a cluster of web pages is attributed by AI analyzing the analytics, and in the case of a medical condition, that is a protected selector.

Our findings show how advanced interactive experiences for the user leak potentially sensitive (medical!) information to third parties. It is a known phenomenon, but the detailed per-site analysis provides actionable input for data-driven advocacy.

Clinics, a Chupadados analysis

Besides the clear business interest in detecting users' behavior and searches, we should consider that the leakage of users' interests also happens when third-party trackers are not present. Website Detection Using Remote Traffic Analysis explores the condition in which attackers can find out which website, or which page on a website, a user is accessing simply by monitoring the packet size distribution. This analysis has to be considered quite dangerous also because it applies to skilled users protected with PETs. That is to say, the effort made by web services dealing with protected selectors should be stronger than elsewhere. This contribution can begin to raise attention, because while we can hope that passive website fingerprinting is not a massively deployed attack yet, we know for sure that user behavior profiling is.

"It turn out it is extremely lucrative known you have diagnosed with a deasese. or you haven't even get diagnosed, but you have the a propensity to have the deases. only your doctor and hospital would have known, but there are probably hundreds of business model looking for this information."

-- BBC Giving Away Data

Recurring terminology

In this report, the following are often mentioned:

The analysis in this report focuses on B and D. The visualizations used have a code number: Viz#1 covers C, Viz#2 covers B, and Viz#3 covers D.

Primary insight

In the next visualization, we'll see the Content-Type correlated to third-party resources; each class of content has its own color, consistent across most of the graphs in this report, except where otherwise explained.
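As an illustration of this grouping, here is a minimal sketch of how a raw Content-Type header could be mapped to broad content classes for coloring; the class names and sample data are illustrative, not the report's exact taxonomy.

```python
# Minimal sketch: map raw Content-Type headers to broad content classes.
# Class names and sample resources are illustrative assumptions.

CONTENT_CLASSES = {
    "javascript": "script",
    "css": "style",
    "html": "markup",
    "json": "data",
    "image": "image",
    "font": "font",
}

def classify(content_type: str) -> str:
    """Return the broad class of a Content-Type header, or 'other'."""
    lowered = content_type.lower()
    for needle, klass in CONTENT_CLASSES.items():
        if needle in lowered:
            return klass
    return "other"

# Example: (third-party domain, Content-Type) pairs as a crawler might record them.
resources = [
    ("connect.facebook.net", "application/javascript; charset=utf-8"),
    ("static.wixstatic.com", "image/png"),
]
for domain, ctype in resources:
    print(domain, "->", classify(ctype))
```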

Commonly, analysts don't talk of third-party trackers; after all, it is a technical term often replaced by the more broadly acknowledged concept of "digital advertising". But the two do not always coincide: on specialized websites such as healthcare ones, advertising is extremely rare.

It is not my scope to analyze the companies' presence and behavior, but as you can see below, they represent a significant share of the third-party connections; using Tableau, it is simple to highlight the set of company-attributed third-party resources and exclude them from further analysis.

Brasil

Viz#1 This graph highlights the number of third-party connections attributed to known platforms. They can be judged by the advocacy team; for our analysis, the intention was to highlight these companies and remove them from the deeper analysis.

Using Tableau, it has been simple to select all the third-party inclusions with an attributed company and hide them in order to generate the next visualization.
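Outside Tableau, the same filtering step could be reproduced with a few lines of pandas; the sketch below assumes a CSV export with one row per third-party inclusion and a company column left empty when no company could be attributed (file name and column names are hypothetical).

```python
# Minimal sketch of the Tableau step: split attributed and unattributed inclusions.
# File name and column names ("company", "thirdparty_domain") are hypothetical.
import pandas as pd

inclusions = pd.read_csv("thirdparty_inclusions.csv")

# Viz#1 input: inclusions attributed to a known platform.
attributed = inclusions[inclusions["company"].notna()]
print(attributed["company"].value_counts())

# Viz#2 input: what remains once the attributed companies are hidden.
unattributed = inclusions[inclusions["company"].isna()]
print(unattributed.groupby("thirdparty_domain").size().sort_values(ascending=False))
```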

Viz#2 The picture below highlights which third-party domains are included; the bar size is proportional to the number of remote inclusions, and the colors, as always, follow the scheme described above.

Considering "attack surface" the amount of javascript and resources served by a third party, the most contributors are not the well-known corporation observed in the Viz#1, but rather specialized services such as wix.com; wix.com is a technology to create websites, it serves the web app framework, and besides the traditional cookies uses localStorage in two different instances:

Although carloschagas.com.br looks like the website with the most inclusions, the invasiveness is relatively low.

The www.fleaming-lab.com.br hospital displays a different pattern: it does not include third parties except the well-known Google and Facebook, and the JavaScript performing the more invasive operations is native to the first-party domain:

This is a logic in which a website offering a complex web app could gain knowledge about usability locally, rather than delocalizing the information. The pattern resembles that of an evercookie, a mechanism which can re-link user behavior even if the user deletes cookies, although no hardware fingerprinting is involved. An open question is whether the system links the profiles collected when the user is not logged in to the customer profile (a logged-in user).
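A heuristic that could flag this pattern is sketched below: it scans per-script records and reports which scripts touch localStorage, separating first-party from third-party origins. The record format and sample URLs are hypothetical, not the tool's actual output.

```python
# Minimal sketch: flag scripts that use localStorage, a storage API that can
# re-link a user after cookies are cleared. Record format and URLs are
# hypothetical sample data, not invi.sible.link's actual export.

scripts = [
    {"origin": "first-party",
     "url": "https://www.example-clinic.com.br/app.js",
     "apis": ["document.cookie", "localStorage.setItem"]},
    {"origin": "third-party",
     "url": "https://static.parastorage.com/services/wix.js",
     "apis": ["localStorage.getItem", "localStorage.setItem"]},
]

def persistence_apis(record):
    """Return the storage calls that survive cookie deletion."""
    return [api for api in record["apis"] if api.startswith("localStorage")]

for record in scripts:
    hits = persistence_apis(record)
    if hits:
        print(f"{record['origin']}: {record['url']} -> {hits}")
```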

Mexican

Check out the interactive visualizations.

Below you can see Viz#1 and Viz#2 for the Mexican clinics, derived from the site list, available in CSV format here on GitHub.

Below is the first example of Viz#3; it analyzes the JavaScript calls usable to fingerprint a device and is explained better in the next chapter.
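As a rough idea of what Viz#3 aggregates, the sketch below counts, per site, the observed JavaScript calls commonly usable for device fingerprinting; the call list, site names, and data layout are illustrative assumptions, not the exact set used by the tool.

```python
# Minimal sketch: count fingerprint-capable JavaScript calls per site.
# The call set and the sample data are illustrative assumptions.
from collections import Counter

FINGERPRINT_CALLS = {
    "navigator.plugins", "navigator.userAgent", "screen.width",
    "canvas.toDataURL", "AudioContext", "WebGLRenderingContext",
}

observed = {
    "clinica-ejemplo.mx": ["navigator.userAgent", "screen.width", "document.cookie"],
    "hospital-ejemplo.mx": ["canvas.toDataURL", "navigator.plugins"],
}

counts = Counter({site: sum(1 for call in calls if call in FINGERPRINT_CALLS)
                  for site, calls in observed.items()})

for site, n in counts.most_common():
    print(f"{site}: {n} fingerprint-capable calls")
```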

Nothing special here; of note:

Colombian

APIs are available to retrieve report data. They have been developed to display visualizations on advocacy websites. In the API below, the URL contains a variable which is the code name of the Chupadados healthcare-related analysis in Colombia. The analyst can create their own campaign by updating this file.
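A minimal sketch of how an advocacy website could consume such an API is shown below, assuming an endpoint that takes the campaign code name in the URL and returns JSON; the host, path, and campaign name are placeholders, not the documented routes.

```python
# Minimal sketch: fetch a campaign report by its code name.
# BASE_URL, the /api/v1/campaign/<name> path, and CAMPAIGN are placeholders,
# not the documented invi.sible.link routes.
import json
import urllib.request

BASE_URL = "https://invi.sible.link"      # placeholder host
CAMPAIGN = "chupadados-colombia"          # illustrative code name for the Colombian analysis

def fetch_campaign(campaign: str) -> dict:
    url = f"{BASE_URL}/api/v1/campaign/{campaign}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)

if __name__ == "__main__":
    report = fetch_campaign(CAMPAIGN)
    print(f"{CAMPAIGN}: {len(report.get('sites', []))} sites in the report")
```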

Chilean

How much can we judge whether a website is protecting our data well enough or not? And how do the Chilean clinics on our list fare?

I include Viz#1 because it displays one of the well-known companies offering a session-replay service: Mouseflow. The Princeton University publication on session replay raised awareness of the issue, and Mouseflow issued an answer in which they explain their compliance with the GDPR (but would they give the same data protection to Chilean users?), the collection of anonymized datasets only, and even a smart way to obfuscate data inserted in a form. They also suggest, correctly, that their customers be fully transparent. The clinics don't make any mention of it. It is interesting because, among these competing interests, the privacy of the user looks considered, but this session monitoring begins the moment you load the webpage.

Regarding the invasiveness of the JavaScript, nothing atypical seems to happen, but it is insightful that the only session replay caught accesses localStorage only once. Implicitly, the tool invi.sible.link cannot yet spot such behavior in a script without reverse engineering these activities and then implementing a high-level selector to perform such attribution.

Reliability of time-comparison

One of the aspects I intended to explore with invi.sible.link is the possibility of monitoring how third-party trackers change over time. This is actually possible: the API implemented permits downloading past results, but how reliably can we compare the trend?

In the visualization below we get a numeric comparison of the consistency of the checks: it displays the inclusion counts. This assessment has little value for security purposes but helps to get a grasp of the stability of the tests performed with PhantomJS. The number represents the amount of JavaScript loaded, every day, by all the Chilean clinics under analysis:
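Under the assumption that each nightly run is exported as a JSON file listing, per clinic, the scripts it loaded, the daily counts behind this comparison could be derived as in the sketch below (file layout and field names are hypothetical).

```python
# Minimal sketch: count, per daily export, how many scripts were loaded by
# all the clinics under analysis. File layout and field names are hypothetical.
import json
from pathlib import Path

def inclusions_per_day(directory: str) -> dict:
    counts = {}
    for day_file in sorted(Path(directory).glob("*.json")):
        with day_file.open() as fh:
            records = json.load(fh)          # one entry per clinic tested that day
        counts[day_file.stem] = sum(len(site.get("scripts", [])) for site in records)
    return counts

if __name__ == "__main__":
    for day, total in inclusions_per_day("exports/chilean-clinics").items():
        print(f"{day}: {total} scripts loaded")
```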

How can we explain the frequent changes?

The benefit of having an analysis performed every night is the long-term comparison. It is quite disappointing to discover how frequently the third-party trackers change. The analysis can even be performed from three different vantage points, and the time dedicated to every session reaches 1 minute each, to permit every resource, including the third-party ones, to load on the test machines.

Some reasons could be changes in the web structure, infrastructural reasons such as CDNs and caches, or marketing agreements. The last is not expected in the clinics, because their business model is not (or should not be) advertising, but in a different context the dynamic loading of ads causes highly variable inclusion links.

Consider the dependency map of the third parties loaded (courtesy of Evidon Trackermap):

Looking at the example above, if we consider resources such as Taboola, which load additional third parties after being rendered, we see how hard it can be to replicate this graph across multiple tests.

Conclusions

In this chapter, we saw how an advocacy group can organize a test on a selected group of websites and, via the API, retrieve the results of the analysis to run a data-driven campaign. There is material for advocacy, and the day-by-day analysis can measure improvements in the web ecosystem.

The test used as the example, Latin-American healthcare companies, displays an expected lack of care on the subject; at least we didn't observe a significant presence of data brokers. Our analysis stops at the browser level: if someone later sells, loses, or suffers a breach of healthcare data, it is not something we can spot from here.