Abstract

According to the 2017 Web Foundation report on algorithm-driven experiences, and to Staltz's analysis, the majority of user interactions on the web are algorithm driven. Moreover, the trend toward deep linking means users increasingly skip homepages and reach content directly from social media.

Most of these interactions follow a personalized flow and cannot be determined a priori. Since the goal of this project is to bring transparency to the kind of third-party scripts our browsers execute, this unpredictability is one of the accountability issues we face as researchers.

To be concrete, recalling the previous chapter of this report: when I analyze the pages visited by an individual seeking information about a medical therapy, how can I be sure the pages I selected are the right ones?

Personalized experiences

The interactions also form an increasingly tight feedback loop in which the user's response to selected inputs is continuously tested: study the user, observe the reaction, test again, and repeat for months. The complexity of data brokering and the current deep-learning trend both push in this direction.

I put these thoughts in the context of the current state of the art, trying to see its limits:

The Argentinian experiment

For a different class of problem, the World Wide Web Foundation organized the following analysis:

The way we experience the web today is largely through algorithms. Search algorithms determine the results we see. Targeting algorithms decide which ads we are shown. Algorithms on social media services select what content makes it to our news feeds — and what is hidden.

This role of curation gives tech companies a huge degree of power over our public discourse. Yet, the opaque nature of these algorithms means we have little comprehension of how they work and how they are affecting our information diets.

Seeking to better understand how these algorithms curate content, this research focuses on Facebook’s News Feed — one of the world’s most important algorithms, selecting content for Facebook’s nearly two billion users.

We ran a controlled experiment — based in Argentina — setting up six identical profiles following the same news sources on Facebook and observed which stories each profile received.

During the fellowship, I took a short pause from invi.sible.link to support this research, The Invisible Curation of Content, and it became an opportunity to address the aforementioned issue.

We cannot analyze all web pages because of computational limits, but by observing the social media experience of a puppet user we can take the user's unique point of view and analyze the websites Facebook selects for them.

The research question is: how many different third parties are present in the content pages compared to the homepage?

The integration

Using the browser extension from the facebook.tracking.exposed project, the following workflow becomes possible:

  1. When the browser extension is installed, it slightly changes the appearance of the Facebook interface (a green-blue row and an optional control panel).
  2. In the control panel you can declare that you belong to a certain group, represented in this example by "myUniqueId".
  3. When the user scrolls on facebook.com/, the public posts (only posts shared with everyone, marked with the world icon as privacy setting) are copied to a server as individual records.
  4. On the server, parsing scripts extract the external links.
  5. Using the unique ID (myUniqueId) as part of the input, it is possible to retrieve the derived links through an API (see below) and then feed them to the invi.sible.link pipeline:
    http https://facebook.tracking.exposed/api/v1/metaxpt/myUniqueId/href/2650
    
    HTTP/1.1 200 OK
    Access-Control-Allow-Origin: *
    Connection: keep-alive
    Content-Length: 48382
    Content-Type: application/json; charset=utf-8
    Date: Sun, 01 Apr 2018 13:16:41 GMT
    ETag: W/"bcfe-bXzGxrJtDtjmK6UtiTWzkw"
    Server: nginx/1.10.3 (Ubuntu)
    X-Powered-By: Express
    
    {
        "queryInfo": {
            "hoursback": 2650, 
            "now": "2018-04-01T15:16:41+02:00", 
            "selector": {
                "groupId": "myUniqueId"
            }, 
            "timeId": "0525adf095f653e34aea6bc6e7f8e7f0f89fec5d", 
            "times": [
                "2017-12-12T02:00:00.000Z", 
                "2017-12-12T03:00:00.000Z"
            ]
        }, 
        "results": [
            {
                "link": "http://bit.ly/2B4AsMd", 
                "permaLink": "/DiarioTiempoArgentino/posts/1757759997589646", 
                "putime": "2017-12-11T19:45:00.000Z", 
                "source": "Diario Tiempo Argentino"
            }, 
            {
                "link": "http://www.lanacion.com.ar/2090422-agustin-rossi-la-palabra-de-carrio-vale-muy-poco?utm_campaign=Echobox&utm_medium=Echobox&utm_source=Facebook#link_time=1513031400", 
                "permaLink": "/lanacion/posts/10155581744944220", 
                "putime": "2017-12-12T01:15:37.000Z", 
                "source": "LA NACION"
            }]
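
As a minimal sketch of step 5, the links returned by the API can be collected and handed to the pipeline. This assumes the endpoint and the JSON shape shown above; "myUniqueId" and 2650 are the example group ID and hours-back window, and the script name is hypothetical:

    # fetch_links.py -- sketch of retrieving the derived links from the API.
    # Endpoint and JSON structure are taken from the example response above;
    # this is not the pipeline's actual ingestion code.
    import requests

    API = "https://facebook.tracking.exposed/api/v1/metaxpt/myUniqueId/href/2650"

    payload = requests.get(API, timeout=30).json()

    # Each result carries the external link, the post permalink, the publication
    # time and the source page name (see the JSON above).
    links = [entry["link"] for entry in payload["results"]]

    # These URLs are what gets fed to the invi.sible.link pipeline.
    for url in links:
        print(url)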
    

Terminology

Comparison between homepage and content URLs

To perform this comparison I took 245 links that appeared on a user timeline over 2 hours, then extracted the homepage of each of the 245 links, resulting in 32 unique sites.
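
A minimal sketch of this reduction step, assuming the 245 content links are available one per line in a hypothetical file (content-links.txt is not part of the actual pipeline):

    # homepage_reduction.py -- sketch of reducing content URLs to unique homepages.
    from urllib.parse import urlparse

    with open("content-links.txt") as f:
        content_links = [line.strip() for line in f if line.strip()]

    # Reduce every content URL to its homepage (scheme + host).
    homepages = {
        f"{urlparse(url).scheme}://{urlparse(url).netloc}/"
        for url in content_links
    }

    print(len(content_links), "content links")
    print(len(homepages), "unique homepages")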

The graph has one URL per line, with a bar on the left representing the number of unique companies attributed and, on the right, the number of third-party resources loaded.

The color code is tied to the company detected. The reason there is a lot of blue in the right column is the "unattributed" third parties; the large amount is likely due to CDNs with dedicated domain names.
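
As an illustration of how such an attribution can work, here is a hedged sketch; the COMPANY_MAP entries are a tiny hypothetical excerpt, not the attribution list actually used by the pipeline:

    # attribution_sketch.py -- sketch of mapping third-party domains to companies.
    from urllib.parse import urlparse

    # Illustrative assumptions, not the real invi.sible.link attribution data.
    COMPANY_MAP = {
        "google-analytics.com": "Google",
        "doubleclick.net": "Google",
        "facebook.net": "Facebook",
        "scorecardresearch.com": "comScore",
    }

    def attribute(third_party_url):
        """Return the company owning a third-party resource, or 'unattributed'."""
        host = urlparse(third_party_url).netloc
        for domain, company in COMPANY_MAP.items():
            if host == domain or host.endswith("." + domain):
                return company
        # Dedicated CDN hostnames typically fall through to here.
        return "unattributed"

    print(attribute("https://www.google-analytics.com/analytics.js"))  # Google
    print(attribute("https://cdn.example-news-site.com/bundle.js"))    # unattributed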

This visualization gives some takeaways compared to the previous reports:

The URL list is a mix of different websites (the homepages of the links that appeared in the Facebook experience). The top entries are all news media, and they embed a more substantial number of companies in their pages than institutions and clinics do. This is not the same for all news media, however.

Below we can see the same visualization applied to the content actually displayed in the Facebook experience: the number of companies and third-party resources is higher, but the visualization could look misleading, because there were more than two hundred content URLs and the top entries mostly belong to the same publishers (perfil.com and pagina12.com.ar). I suggest looking at the complete visualizations: content URLs and homepage URLs.

To better understand this large phenomenon, I zoomed in on a single domain name (pagina12.com.ar).

The table below makes it easier to see that there is no clear determination. In black, the homepage numbers represent the baseline I want to check against; below, a link-by-link assessment shows how many JavaScript resources (D) and companies (C) are detected. Half of the time the content page has more than the homepage, and half of the time it has less.
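
A minimal sketch of this link-by-link comparison; the numbers below are invented placeholders, since the real values come from the invi.sible.link measurements:

    # compare_sketch.py -- sketch of the homepage vs. content-link comparison.
    homepage = {"javascripts": 40, "companies": 12}   # baseline (shown in black)

    content_links = [
        {"url": "https://www.pagina12.com.ar/article-1", "javascripts": 55, "companies": 14},
        {"url": "https://www.pagina12.com.ar/article-2", "javascripts": 31, "companies": 9},
    ]

    # Positive deltas mean the content page loads more than the homepage.
    for page in content_links:
        delta_js = page["javascripts"] - homepage["javascripts"]
        delta_co = page["companies"] - homepage["companies"]
        print(f'{page["url"]}: D {delta_js:+d}, C {delta_co:+d}')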

There are differences, which we can observe with the JavaScript fingerprinting approach (Viz#3).

This test and the one above relate to two URL shorteners (redirect services). The question answered is whether the redirect injects any content into the page or just acts as a passive intermediary. This approach can follow up on, confirm, or find new insights starting from the research led by the University of California, Santa Barbara and Politecnico di Milano, Stranger Danger: Exploring the Ecosystem of Ad-based URL Shortening Services.
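
A hedged sketch of how one could check whether a shortener merely redirects; the short URL is taken from the API example above, and interpreting "passive intermediary" as "only 3xx hops without injected markup" is an assumption of this sketch:

    # redirect_check.py -- follow a short URL and inspect the redirect chain.
    import requests

    response = requests.get("http://bit.ly/2B4AsMd", allow_redirects=True, timeout=15)

    for hop in response.history:
        # A passive redirect should be a 3xx response without injected content.
        print(hop.status_code, hop.url, "body bytes:", len(hop.content))

    print("final:", response.status_code, response.url)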

The content links point to a variety of different services.

The presence of persistent storage is an open question: how much control should we, as users, have over the disk space we give to third parties and web applications? How could we exert more control over these caches?

Conclusions

In this chapter, we saw how a browser extension can scrape the links social media offers to you, or to your group, in order to analyze a set of websites that are exactly the pages observed by the user.

This method permits focusing tracker research on a targeted audience, and tuning the communication accordingly. Otherwise, the scientific approach tends to use lists which are, at best, indicative or statistically meaningful.

Finally, we saw how content pages and homepages contain diverse ecosystems of trackers, sometimes fewer and sometimes more, but the diversity may imply a content-based preference, such as "pages talking about X want to have data broker K". Other explanations can exist, and this could be an interesting point to explore. In this research, I also applied semantic analysis to the linked content, but finding a correlation between the third-party trackers and the keywords was harder than expected. I have no definitive determination on this hypothesis.