⛅ invi.sible.link, operative reporting

This page gets updated near the beginning of every month; to understand the overall picture, consult the Project Plan.

Task list

  1. Improve browser emulation and JavaScript sandboxing by integrating the Honeynet Project's Thug; technically, this allows us to get a list of all the JavaScript functions executed, going beyond a static source code analysis
    Phase:
  2. Have a data sharing capability in every node, and look for differences between tracking code
    Phase:
  3. In-browser visualization of the results, usable to monitor trends or visually identify anomalies
    Phase:
  4. Import the browser history of a person to map their exposure profile / support community-driven input (through GitHub files); this approach would allow a more personalized analysis that goes beyond just looking at the Alexa top 500 sites for each country
    Phase:
  5. Integrate the tool developed by Princeton University for tracker fingerprinting; this will provide an intermediate level of detail, still lower than the Thug code analysis capability
    Phase:
  6. Research into how to identify anomalies and tracking-related functionality based on the dynamic code analysis provided by point 1
    Phase:
  7. Research into the privacy implications and device fingerprinting used in tracking
    Phase:
  8. Support Latin American communities running the tool, interpolating their results
    Phase:
  9. Write a research report
    Phase:
  10. Work with CodingRights in disseminating the results in Latin American communities
    Phase:
  11. Researcher visualization: the difference between this and point 3 is the amount of detail provided
    Phase:
  12. Wrapping up the project and performing last touches and cleanups
    Phase:

August, September and October 2017

This report, which is supposed to be monthly, is being published after three months. I collaborated with a third-party organization on an experimental analysis based on scraping links from social media and analyzing their trackers. At the time of writing, the beginning of November 2017, a report is in development (not related to tracker analysis), and the opportunity has opened a new branch in this research.

Social media observer and realtime analysis

One of the challenges in web analysis is the impossibility of testing all the conditions in which a user will browse. By making common assumptions, you can end up doing a mass study but fail to spot targeted web surveillance, because the script you hope to catch only triggers on specific content.

The state-of-the-art solutions are:

Considering how Facebook and Google are the de facto gatekeepers of the WWW, the most appropriate way to monitor the quality of web pages is to look at what is actually served by these platforms. But, in order to do so, you have to observe the personalized experience of the users. With a browser extension it is possible to collect the links appearing in a user's timeline (or those shared by a selected community) and analyze them.

This approach is currently in use, and will provide output during the month.
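
A minimal sketch of the extension idea described above could look like the following content script: it collects the outbound links visible in the page and submits them for analysis. The selector and the collection endpoint are hypothetical, not the actual extension code.

    // Content script sketch: gather outbound links from the timeline and
    // submit them to a collector. Selector and endpoint are placeholders.
    (function collectTimelineLinks() {
      const links = Array.from(document.querySelectorAll('a[href^="http"]'))
        .map((a) => a.href)
        .filter((href) => !href.includes(window.location.hostname)); // outbound only
      fetch('https://collector.example/api/links', {                 // hypothetical endpoint
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ links, observedAt: new Date().toISOString() }),
      });
    })();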

LatAm realtime tracking experiment

A work in progress is the development of argentinin.tracking.exposed, which analyzes the links shared by Argentinian media and political figures.

Analysis of JavaScript tracking behavior

The usage of Chrome with Privacy Badger has been implemented as part of the testing suite; this permits the first JavaScript fingerprinting and visual reporting.

"VP": "virtual", 
"_id": "59b3ef87f50635667354b8b6", 
"acquired": "2017-09-09T13:06:06.072Z", 
"campaign": "germany", 
"href": "http://www.gulli.com/", 
"id": "2db2fd4cbf447b63c7be59cb0d0993f95164b68e", 
"inclusion": "imagesrv.adition.com", 
"navigator_plugins": 7, 
"navigator_userAgent": 1, 
"needName": "badger", 
"promiseId": "e70987de9f779c3924cffa601cd53f2c926d7973", 
"screen_width": 1, 
"scriptHash": "32f51deedfd3dd9a2f3a52a352367f6d6bae6a67", 
"scriptacts": 9, 
"subjectId": "a6074ec62e6d2d3775919ed4a8b53ba4f990e885", 
"version": 1, 
"when": "2017-09-09T13:41:24.310Z"
}{
"VP": "virtual", 
"_id": "59b3ef87f50635667354b8b7", 
"acquired": "2017-09-09T13:06:06.072Z", 
"campaign": "germany", 
"href": "http://www.gulli.com/", 
"id": "2db2fd4cbf447b63c7be59cb0d0993f95164b68e", 
"inclusion": "pagead2.googlesyndication.com", 
"navigator_userAgent": 7, 
"needName": "badger", 
"promiseId": "e70987de9f779c3924cffa601cd53f2c926d7973", 
"screen_availWidth": 2, 
"screen_width": 2, 
"scriptHash": "63d92bc0c16414f12fd8bb04e6e8189fb9a62ea6", 
"scriptacts": 18, 
"subjectId": "a6074ec62e6d2d3775919ed4a8b53ba4f990e885", 
"version": 1, 
"when": "2017-09-09T13:41:24.310Z", 
"window_localStorage": 7

This result can be fetched via the API (e.g. https://invi.sible.link/api/v1/details/itatopex ); it contains most of the JavaScript calls usable in browser fingerprinting.
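
For example, a minimal sketch of consuming that endpoint from Node.js, assuming it returns a JSON array of records shaped like the two shown above:

    // Fetch the campaign details and count which inclusions read navigator.userAgent.
    // The endpoint is the one cited above; the array shape is an assumption.
    const https = require('https');

    https.get('https://invi.sible.link/api/v1/details/itatopex', (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        const records = JSON.parse(body);
        const uaReaders = records.filter((r) => r.navigator_userAgent > 0);
        console.log(uaReaders.length + ' of ' + records.length + ' scripts read navigator.userAgent');
      });
    });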

JS details visualization

The interface aims to show graphically what the third parties can potentially access. Every campaign produces this output daily; the (still incomplete) visualization can be seen in this example from the Chilean online clinics: Clinics-CL

July 2017

The last month has been used for advocacy, networking and personal time.

In July, I was in Cartagena (Colombia), then on the Rio Magdalena meeting an indigenous community during their pacification process, then in Lima (Perù) meeting some local digital rights groups, and then at the SHA2017 hacker camp near Amsterdam.

The most exciting developments are the conferences and the meetings I had with CodingRights in the Latin American countries.

I have explained the implications of tracking over the last years, with mixed results. I am pleased to see how an analysis including only a contextual group of sites enables compelling narratives.

The storyline used with CodingRights in Cartagena followed this logic:

  1. You, as a citizen, could have a health issue, and health insurance is necessary to cover it.
  2. In the so-called quantified society, personalized services are a means that permits more customer exploitation. Policy and common knowledge do not seem ready yet to face these offers.
  3. If an online clinic includes third-party trackers, your activity on its website is only one link away from your physical person.
  4. Your navigation on an online clinic website could leak some patterns: a particular exam you are looking for, symptoms, prescriptions.
  5. This information can be used by the data processor, sold through data brokers, and end up at an insurance company, where it is used to increase profit.
  6. The business model of hospitals, public or private, is not ads-based; third-party trackers do not have the same justification used, for example, in the news media debate.

This simple sequence worked in explaining the necessity of ad-blockers, the responsibility of websites, and the power dynamics of the data processors.

The visualizations elaborated follow the same pattern for every country, and with Tableau it has been quite simple to do some prototyping.

The outreach I was looking for is meeting partners who try to figure out their concrete problems, and how data brokers could exploit the context they live in.

If I had to make a list of the topics raised in the discussions, the ones I recall most often are:

I am exploring a theory: every connected human belongs to many social contexts.

Moreover, we as humans are not vulnerable in all of these environments: you could even belong to a gang of thugs, and nobody in your town would harm you, but you risk not finding any insurance coverage because a multinational denies your health care.

This approach tries to simplify the creation of campaigns aimed at the aspect of life in which a person is vulnerable. The content produced is designed to speak to a group of people who feel themselves at risk.

Imagine two characters: "a political opponent in Iran" and "a poker addict looking for a new job." They face different risks, your empathy for their situation is probably different, and so is the assistance they deserve. InviSibleLink is a framework; it can be used by two separate groups, one speaking Persian and the other talking to addicts, because massive web profiling can harm both of them.

The "nothing to hide" narrative want to be addressed making many websites. The hopes are anybody will find the one who speaks to the part in which is vulnerable. Few persons feel completely safe, and they are not the target.

This approach has been confirmed and will define the cultural inheritance left after the fellowship; speaking of which, I have to hurry a little bit now to catch up with the planned deliverables.

June 2017

The month has been used for research and advocacy more than development.

IACAP conference and presentation

I gave my first presentation about third-party tracker analysis; the content collected and the experience gained will be reused in future presentations, and a blogpost explaining the context will be published as soon as I produce new visualizations with Tableau. I'm doing data investigation with that tool because it is much more efficient than developing my own visualizations before having understood the complexity of the data. I've written a blogpost to test a simplified way of communicating about algorithms, profiling and political impact: profiling, algorithm surveillance and religious freedom.

General software improvements for campaign checking

A simple approach to monitor the trend of how websites are doing has been implemented (a random example); using the last-activities interface, other participants and I check the trends.

urlscan.io

I got in touch and obtained a key for urlscan.io, a service which monitors website inclusions from their own infrastructure. It can be useful as a comparison. A driver to use the service still has to be implemented.
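
A sketch of what such a driver could look like, assuming urlscan.io's documented submission endpoint (POST /api/v1/scan/ with an API-Key header); treat the endpoint details as assumptions, the real driver still has to be written:

    // Submit a URL to urlscan.io and resolve with the JSON response
    // (which includes the scan uuid for fetching the result later).
    const https = require('https');

    function submitScan(url, apiKey) {
      return new Promise((resolve, reject) => {
        const payload = JSON.stringify({ url });
        const req = https.request({
          hostname: 'urlscan.io',
          path: '/api/v1/scan/',
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            'API-Key': apiKey,
            'Content-Length': Buffer.byteLength(payload),
          },
        }, (res) => {
          let body = '';
          res.on('data', (c) => { body += c; });
          res.on('end', () => resolve(JSON.parse(body)));
        });
        req.on('error', reject);
        req.write(payload);
        req.end();
      });
    }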

WebTAP and their publications

WebTAP is the Web Transparency and Accountability Project of Princeton University. I subscribed to get access to their data as a researcher, but I haven't yet had a chance to try it. The research team has also released some inspiring papers, which I had the opportunity to read this month:

PhantomJS will be unmaintained

This is not a big deal, considering the probe diversification I have in mind, but given the capability of collecting OpenGraph data, I will probably have to replace the PhantomJS support with NightmareJS; in general, I look forward to integrating Thug once and for all.
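
As a note on that replacement, a minimal NightmareJS sketch limited to the OpenGraph collection could look like this; the target URL is illustrative and this is not yet integrated into the probes:

    // Open a page with NightmareJS and extract its OpenGraph <meta> tags.
    const Nightmare = require('nightmare');

    Nightmare({ show: false })
      .goto('http://www.example.com/')
      .evaluate(() => {
        const tags = document.querySelectorAll('meta[property^="og:"]');
        return Array.from(tags).map((t) => ({
          property: t.getAttribute('property'),
          content: t.getAttribute('content'),
        }));
      })
      .end()
      .then((openGraph) => console.log(JSON.stringify(openGraph, null, 2)))
      .catch((err) => console.error('scrape failed', err));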

Experimental visualizations

I'm experimenting with visualizations on Tableau Public, on this account: Claudio at tracking exposed. I will use these for the conference in Cartagena.

May 2017

In May 2017 the first two campaigns got released; I mostly worked as a facilitator, editing text, revising visualizations, double-checking the results, and using this achievement to show a vision to partners.

Publication of results to a broad audience

The month of May hasn't led to any particular improvement, rather a stabilization of the interfaces and the workflow. The month saw the two expected releases and progress with CodingRights on our presentation in July.

Specifically, the episode of the TV show using the analysis website got 20k unique IP accesses and this spike in web traffic:

Deflect.ca offers the CDN and the technological interface to query the users.

At the moment, the experience gained has been useful to stabilize a tangible result for a broader audience. Minor initiatives are also running, and currently I'm in Istanbul to make progress on an analysis in Turkey.

April 2017

Most of the time in April 2017 has been dedicated to separating the analysis content from the campaign content. Campaigns have to be delegated as much as possible to the campaigners (the local communities aware of the social and digital issues), and this has required some polishing on my side. An example campaign, with 100% HTML and zero code, is implemented here.

Organizing documentation

With the first adoption of the technology, I've started to organize the documentation and define how the project might be integrated into other projects doing the same analysis. The README on the campaigns is the reference for them and is currently kept up to date by those organizing them.

Academia and outreach

Three important events involving my research in the fellowship are going to happen between May and July.

  1. I'm working with an investigative journalism team to explain third-party trackers and their privacy and security implications to a (very) large, non-technical audience. It might be done in the second half of May.
  2. Big Data for the South, in Cartagena, has accepted the application by Joana of CodingRights and me, about third-party analysis in Latin America compared with Western countries.
  3. I got accepted at the annual meeting of the International Association for Computing and Philosophy, at Stanford University. A talk of mine about third-party profiling and its impact on racial and religious discrimination has been accepted.

For the first and third points above, the potential outcome is quite large visibility for the code repository, in the hope that some open source developers with free time take an interest in the project.

Experiment with OpenWPM

Princeton University, after webXray, improved their technology with OpenWPM. It is a nicely developed tool that might represent a valid integration and extension of my analysis. It uses a different format, supports much more interaction with a non-headless browser, and is less orchestrated.

March 2017

RightsCon and the search for local supporters

In the month of March (and the first days of April) I attended RightsCon and the International Journalism Festival. My presence there was justified by some talks I gave about algorithm accountability, and I had meetings with teams from different countries and contexts. Discussions are proceeding, in order to begin an analysis campaign.

The countries of interest are Iran and Turkey. Finding local supporters is getting more vital, and I'm expanding the side of the project intended to communicate the results. The goal is to split my technical analysis and the graphs from the advocacy material. Having a clear separation of duties would, in theory, let me and the local supporters work toward the same goal without blocking each other. A clear separation between the technical analysis and the local adaptation is the intention.

Human rights researchers and Internet policy analysts have been my targets to get in touch with.

Side life

I started a trip to and around Europe to meet a number of potential collaborators; I'm traveling during these months and my update schedule is getting delayed. The third-party tracker analysis keeps raising political (for example, the Sleeping Giant) and technical interest, as highlighted below:

Updates from academia and NGOs

Interesting updates are happening in the academic environment. Below are the most meaningful ones, which provide additional arguments usable in this project's outreach: Security and harm caused by third-party inclusions, Amnesty International on data brokers (and their impact on religious profiling), Cross-browser hardware fingerprinting.

Research plan with CodingRights

The CodingRights team and I applied for a conference; in the coming months we'll do a comparative analysis of sensitive (and less sensitive) websites across Latin American and, as a comparison, Western countries. It is expected to be one of the core results of this fellowship, or at least a lasting example of the analysis method. In the meantime, analytics and deeper script analysis with Thug will be supported. Results are scheduled to be available in June 2017.

February 2017

Managing a campaign-based website

I'm building three campaigns with a Western audience in mind. These campaigns do not strictly fit my fellowship goals, but they are useful steps for outreach and early feedback.

I started the development of the tool named social-pressure.

At RightsCon, at the end of March 2017, I'll meet partners from Turkey, Iran and other countries to discuss how to begin some analysis campaigns.

As a technical improvement, I've extended the campaign manager to import CSV: this enables collaborators to work with a spreadsheet and GitHub without dealing with a more technically complex format (I use JSON natively). The CSV can also be edited directly on GitHub, lowering the entrance barrier.
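
The conversion itself is trivial; below is a minimal sketch of the CSV-to-subjects step, with illustrative column names (href, rank, iso3166) and a naive comma split that does not handle quoted fields:

    // Turn a collaborator-maintained CSV into the JSON subject list used internally.
    const fs = require('fs');

    function csvToSubjects(path) {
      const [header, ...rows] = fs.readFileSync(path, 'utf8').trim().split('\n');
      const columns = header.split(',').map((c) => c.trim());
      return rows.map((row) => {
        const values = row.split(',').map((v) => v.trim());
        const subject = {};
        columns.forEach((col, i) => { subject[col] = values[i]; });
        return subject;
      });
    }

    console.log(JSON.stringify(csvToSubjects('campaign-targets.csv'), null, 2));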

Campaigns in progress and testing of the workflow

In the three campaigns I'm testing, the d3 plugin sankey is helping generate a graphically appealing, scalable visualization, like:

Joint application with Coding Rights

A research paper comparing 10 South American countries and 5 Western countries is a work in progress; we'll apply to an academic conference.

Stable monitoring of the analysis

The monitoring pipeline has proved to be stable; the statistics are available here, showing the last two days, and there, showing the last 20 days (the graphs might take a while to load).

Below you can see a strange pattern that happens only on the machine located in Washington.

Details: I'm using three boxes, in Washington, Amsterdam and Hong Kong. They share the same software and execute the same commands at the same time; this is done to reduce the differences across tests. On a specific website under investigation, different code is sent to the Washington box. This has the side effect of freezing PhantomJS and keeping it running. From the load average graph I spotted this first anomaly:

In the coming months, with the integration of Thug, it will be easier to perform JavaScript inspection and investigate the reasons.

January 2017

An update on the vision

Inspired by this title, Hacking the attention economy, it has become clear that my output cannot just be a website full of results, because a website has these limits:

This is a pipeline, a series of tools that constantly process input to produce output. My output has to be:

Therefore, in this post-prototype phase, the campaign pipeline has to be the priority. This permits experiencing from the beginning how to reach out to different social circles, and will force the project to keep an operating workflow despite the technical challenges that will be faced later.

Flexibility in target specification

2075 ۞  ~/Dev/invi.sible.link DEBUG=* filter='{"iso3166":"BR"}' bin/directionTool.js 
  directionTool Unspecified 'needsfile' ENV, using default config/dailyNeeds.json +0ms
  directionTool Using config/dailyNeeds.json as needs generator +2ms
  directionTool content {"needName":"basic","lastFor":{"amount":24,"unit":"h"},"startFrom":"midnight"} +31ms
  directionTool Processing timeframe: startFrom {"amount":24,"unit":"h"} (options: midnight|now), lastFor "midnight" +0ms
  lib:mongo read in subjects by {"iso3166":"BR"} sort by {} got 1 results +13ms
  directionTool Remind, everything with rank < 100 has been stripped off +1ms
  directionTool Generated 80 needs +5ms
  directionTool The first is {"subjectId":"65bdefee473b2aa910ff52efdcb0425f3d4201d6","href":"http://google.com.br","rank":1,"needName":"basic","start":"2017-01-11T02:00:39.251Z","end":"2017-01-12T02:00:39.251Z","id":"d8dcdbc594dbe873a3b5d4378420ea5eddc1ce9c"} +0ms
  lib:mongo writeMany in promises of 80 objects +0ms

This approach will be experimented with in February, for the first targeted campaign. The goal of such a campaign is to get visibility and constructive criticism, and to see the overall reaction to this kind of monitoring approach.

CodingRights campaign progress

We had a meeting at CodingRights to plan the Chupadados campaign, and currently I'm running a prototype experiment outside the fellowship scope, in order to test the infrastructure and the content production pipeline.

Long-term monitoring is working properly

The stats page has been working smoothly for a while; I'm using it to keep the multiple operations performed in check, and new graphs might be added for specific campaigns. Studying which kinds of graphs to use is still a work in progress, but I've already done a successful experiment integrating rawgraphs.io.

December 2016

Web crawling and orchestration work for me

The structure in operation is simple and easy to distribute. It involves a few components.

Vigile (central authority)

At 5 AM GMT, a command is executed that creates a list of tasks to be completed. This list is derived from the list of subjects under analysis, and can be reached publicly via the API:

۞  ~/Dev/invi.sible.link http http://invi.sible.link:7200/api/v1/getTasks/Aname/20
    [   {
            "AMS": true,
            "HK": true,
            "end": "2017-01-03T00:00:03.467Z",
            "href": "http://baidu.com",
            "id": "c0ff474789434497abdf3b335f1bdb1def18993a",
            "Aname": false,
            "needName": "basic",
            "rank": 1,
            "start": "2017-01-02T00:00:03.467Z",
            "subjectId": "b4ef98150c7eeb7c03afb40437ab4c34ec0620ad"
        }, {
            "AMS": true,
            "HK": true,
            "end": "2017-01-03T00:00:03.467Z",
            "href": "http://qq.com",
            "id": "54a5d8d9ce08d5899a782efb5adc73e29f62dd3d",
            "Aname": false,
            "needName": "basic",
            "rank": 2,
            "start": "2017-01-02T00:00:03.467Z",
            "subjectId": "35eabd32318c6082cb645fbacfe5bca8f2baeb50"
        }

This model has technical properties that help me with the orchestration:

  1. the field needName specifies the need. At the moment, the only need is named basic and it means: crawl with PhantomJS. This permits specialization in distribution, because if a vantage point doesn't support that test, it can just skip to the next need (see the sketch after this list).
  2. the fields HK, AMS and Aname have boolean values. They indicate whether the vantage point (specified in the request) has fulfilled the task: false means the VP has only received the task, true means it has completed the task and confirmed the execution.
  3. start and end describe the window of time in which the task can be fulfilled.
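
A minimal sketch of how a vantage point could consume this task list, based only on the fields shown above; how a VP confirms the execution is not shown here, so that step is left out:

    // Poll getTasks and keep only the tasks this vantage point can and should serve.
    const http = require('http');

    const SUPPORTED_NEEDS = ['basic'];   // this VP only supports the phantomjs crawl

    http.get('http://invi.sible.link:7200/api/v1/getTasks/Aname/20', (res) => {
      let body = '';
      res.on('data', (c) => { body += c; });
      res.on('end', () => {
        const tasks = JSON.parse(body);
        const pending = tasks.filter((t) =>
          SUPPORTED_NEEDS.includes(t.needName) &&  // skip needs this VP cannot serve
          t.Aname === false                        // false: received but not yet fulfilled
        );
        pending.forEach((t) => console.log('would crawl', t.href, 'rank', t.rank));
      });
    });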

Chopsticks

Quick and dirty approach: every two minutes, crontab asks for tasks to be done. It asks for 30, to be executed 10 at a time; the maximum time is 30 seconds, and after 35 the process is killed. I'll measure performance and failure ratio later on.
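
A sketch of the batching and timeout logic described above; the phantomjs script name and arguments are placeholders, not the actual chopsticks code:

    // Run one crawl with phantomjs, hard-killing it after 35 seconds.
    const { execFile } = require('child_process');

    function runTask(task) {
      return new Promise((resolve) => {
        execFile('phantomjs', ['crawl.js', task.href],
          { timeout: 35 * 1000 },                  // kill after 35s
          (err, stdout) => resolve({ task, err, stdout }));
      });
    }

    // Execute the 30 fetched tasks, 10 at a time.
    async function runBatch(tasks) {
      const results = [];
      for (let i = 0; i < tasks.length; i += 10) {
        const chunk = tasks.slice(i, i + 10);
        results.push(...await Promise.all(chunk.map(runTask)));
      }
      return results;
    }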

It saves the results and imports them into MongoDB.

Above you can see the level of detail currently being experimented with. Having many descriptive fields will help in finding correlations, trends and patterns.

Exposer

It makes the results available to whoever needs them, referenced by the promiseId. It displays some basic graphs of the stored data. It was working with only one day of results; with more than one, it requires an optimization of the analytics, because the dataset is too big.
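
One possible direction for that optimization is pre-aggregating per inclusion instead of loading the raw records; a sketch with the MongoDB Node driver, where the database and collection names are assumptions (only the field names follow the records shown earlier):

    // Summarize how many scripts each third-party inclusion injected per page,
    // so the graphs do not need the full day of raw records.
    const { MongoClient } = require('mongodb');

    async function dailySummary() {
      const client = await MongoClient.connect('mongodb://localhost:27017');
      const rows = await client.db('invisiblelink').collection('details').aggregate([
        { $group: {
            _id: { href: '$href', inclusion: '$inclusion' },
            scripts: { $sum: 1 } } },
        { $sort: { scripts: -1 } },
      ]).toArray();
      await client.close();
      return rows;
    }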

As part of this improvement, the component machete will be completed.

Visualisation with RAW and c3

Work in progress is integrating the RAW framework and c3, to begin with a decent visualisation of the technical results.

Logical workflow of the pipeline

Decent distribution and resiliency are being achieved with the designed pipeline; here are the scheduled tasks. The next component will fetch from the results and complete the pipeline.

November 2016

Components design

This month defined the components that should work together in order to accomplish the pipeline. The goal is pretty ambitious, because the system has to operate on many vantage points and be centrally coordinated, enable the analyst to get results easily, and enable CodingRights and me to set up tailored campaigns without effort. The current schema is composed of six components, each with a small dedicated task. I have not reused the prototype named Trackography-2, because the risk was an increase in complexity. The reason for splitting into components is to keep the design as "simple and stupid" as possible.

  1. Component "storyteller": is the on running in the public website https://invi.sible.link, will contain information for technical audience and all the research tool developed this year. Will serve the results as open data, enabling third parties like CodingRights to integrate the data in their advocacy.
  2. Component "machete": aggregate the results from the vantage point and perform analysis, correlation, high level function to produce results. For example: rank the most invasive trackers, find correlations among the last day result and the last month. Will be the tool operating over the database and producing data-driven-insights.
  3. Component "vigile": will orchestrate the test on the vantage point, the analysis of machete, and keep track of the infrastructure performances
  4. Component "chopstick": inheritance from the Littlefork pipeline in which I worked following Christo's of TacticalTech directions. Is the component wrapping the execution of phantomjs and Thug, being a specialized micro-service on the vantage point.
  5. Component "exposer": The technical service needed to export the results from the vantage point to machete
  6. Component "social pressure": as the name evoke, is one of the key experiment of this project. A components containing the libraries, API keys to be a simple social media bots feed by machete.

Setting up the boxes infrastructure

Thanks to the OTF cloud, I easily set up four boxes to run the components, creating a situation in which, with one box fewer or one more, the system can continue to operate and be easily migrated, if another organization shows interest in maintaining the project after the fellowship, or just wants to run their own set of tests.

I recovered the lists I was using for the previous experiments; they are nicely visualized with DataTables here:

chupadados.na.tracking.exposed

CodingRights launched, at the end of November, a campaign website targeting Latin American communities, named Chupadados; it is a campaign exploring different narratives to raise awareness of data surveillance, governmental and corporate, for Spanish- and Portuguese-speaking audiences. The first adaptation of invi.sible.link will be on a selected list of Brazilian websites, all related to sexual health services. This will be an experiment in advocating to a target community outside our common audience.

Test webXray on OTF cloud

The tool webXray has many things in common with this project; I started to assess whether its code base can be re-used. As a first step, I tested webXray on the three vantage points on the OTF cloud, and it worked smoothly with little effort. It is an interactive tool, therefore some of the assumptions behind its architecture might differ from my needs; still, looking at the internals:

In my current design these blocking operations are a small engineering problem. In the examples below you'll see the effect of not managing such blocks: without manual intervention the pipeline remains blocked forever, and spotting all the possible conditions is a complex problem.

When you see 7 days and 7 minutes, it is because I killed the process manually. webXray solved this problem with a hardcoded time limit; I'll probably use the same, if a smarter solution keeps failing.