Data analysis on Mitacs Globalink 2015 projects: Part 1 – The Data


Mitacs Globalink Research Internships is a Mitacs program that allows undergraduate students from Brazil, France, China, India, Mexico, Saudi Arabia, Turkey, or Vietnam to do a 3-month internship in a university research lab in Canada.

I am interested in taking part in the program, and one of the steps of the application process is to choose between 3 and 7 projects from their list of 1,782 projects (as I write this article). Using their website, you can filter those projects by university, province, language, and keywords.


I started running queries with keywords such as “web” and other areas I am familiar with, but I soon realized there were many other cool projects I could also apply to, so I ended up manually looking through all 1,700+ project titles and noting in a text file the ones whose prerequisites and descriptions deserved a closer read.

Mitacs Globalink 2015 projects list.

When I was done, I got really curious about the data. “Which province is offering the most projects?”, “What is the average number of projects offered per professor?”, “What would a word cloud of the project titles look like?”
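Questions like the second one boil down to a simple aggregation. As a quick sketch (in Python, with made-up sample data — the real dataset comes later in this series), counting projects per professor could look like this:

```python
from collections import Counter

def projects_per_professor(records):
    """Count projects per professor and return (counts, average).

    `records` is a list of (professor, project_title) pairs —
    hypothetical data, just to illustrate the aggregation.
    """
    counts = Counter(prof for prof, _ in records)
    avg = sum(counts.values()) / len(counts)
    return counts, avg

# Toy example: two professors, three projects
records = [("Prof. A", "p1"), ("Prof. A", "p2"), ("Prof. B", "p3")]
counts, avg = projects_per_professor(records)  # avg == 1.5
```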

Getting the data

I could not find any “export” link on the page that lists all the projects. I have some experience with web scraping, but unfortunately their platform is built with Flash :(.

How could I get the data I needed? My first thought was: “Hmmm. Maybe it is possible to do some reverse engineering on the Flash side!”, but then I noticed the data is pulled asynchronously as soon as I hit the “Projects” page. It has to come from somewhere, so I decided to monitor the network.

It is important to note that, while the page that loads the Flash component uses HTTPS, the requests made by the application itself use plain HTTP! That allowed me to see the traffic in Wireshark, for example. I decided, however, to use a “higher-level” approach.

In Chrome’s Developer Tools I was able to find the request I was looking for. Things would have been a lot easier if the data came as plain JSON or XML, so I could poke around directly, but the headers told me the response was “Content-Type: application/x-amf”.

I am not familiar with Flash development, so I had no idea what AMF was or whether it could be opened. It turns out that AMF (Action Message Format) “is a binary format used to serialize object graphs such as ActionScript objects and XML, or send messages between an Adobe Flash client and a remote service” (source: Wikipedia).
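Being binary, AMF cannot simply be read in a text editor: even the integers are variable-length. As a toy illustration (written after the fact in Python — this is a sketch of two AMF3 primitives, nowhere near a full parser), the U29 integer and inline-string encodings can be decoded like this:

```python
def read_u29(data: bytes, pos: int = 0):
    """Decode an AMF3 variable-length U29 integer.

    The first three bytes each contribute 7 bits (the high bit means
    "more bytes follow"); a fourth byte, if present, contributes all
    8 bits. Returns (value, new_position).
    """
    n = 0
    for i in range(4):
        b = data[pos + i]
        if i == 3:
            return (n << 8) | b, pos + 4  # fourth byte: full 8 bits
        n = (n << 7) | (b & 0x7F)
        if not (b & 0x80):
            return n, pos + i + 1

def read_utf8_vr(data: bytes, pos: int):
    """Decode an AMF3 inline string (length-prefixed UTF-8).

    The U29 prefix packs the length in its upper bits; the lowest bit
    flags an inline string (1) vs. a string-table reference (0).
    References are not handled in this sketch.
    """
    ref, pos = read_u29(data, pos)
    assert ref & 1, "string-table references not handled here"
    length = ref >> 1
    return data[pos:pos + length].decode("utf-8"), pos + length

# A hand-built AMF3 fragment: type marker 0x06 (string),
# then U29 prefix 0x07 (length 3, inline), then the bytes "foo"
sample = bytes([0x06, 0x07]) + b"foo"
text, _ = read_utf8_vr(sample, 1)  # skip the type marker
```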

A quick Google search revealed lots of tools to decode AMF, but none of them worked for this particular payload. I ended up trying lots of tools, including: Charles proxy, a Ruby library (rocketamf), three JavaScript libraries (JSAMF, amf, and amfjs), two Python libraries (pyamf and amfast), two Firebug plugins (AMF Explorer and Flashbug), two PHP libraries (Amfphp and SabreAMF), a JMeter plugin, Wireshark, two Fiddler plugins (AMFParser and Fiddle AMF Parser), the ServiceCapture web proxy, the WebScarab web proxy, minerva, FlashDevelop, and some others!

For each tool, I had to read how to install and use it. Some libraries had no documentation at all, so I had to dig directly into their source code or unit tests.

All the tools failed to some extent. Almost all of them gave some sort of decoding error with no further details. Some of the libraries were more specific, reporting problems with a DSQ externalizable class. The only tool that kept my hopes up was Charles: it was able to decode the AMF message and show it in its user interface, but there was no way to export the data!

Charles proxy showing the decoded AMF.

After almost 8 hours with all the tools mentioned above, I decided to try something else. In Charles I could copy selected elements from the tree, but I had to expand and select the nodes manually. I spent an additional hour recording macros to expand Charles’s nodes, select them, and paste the result into a text file, but the Charles trial version closes every 30 minutes, and this would take forever!

It was late at night, so I decided to throw in the towel. It was time to ask the gods for help: I posted a question on Stack Overflow.

The next morning, there was no answer to my question. I was close to giving up. But then I thought: what if this were a Mitacs project assignment? Would I just give up like that? Of course not!

More research, more tools, and no success.

I was very close to writing my own AMF deserializer when I found FlashFirebug, a professional tool for debugging Flash applications. The license costs $34.99/yr, but you can get a 2-day trial for $0.50. At that point I thought, “Why not?” and paid for the trial.

The AMF decoder, once again, failed. But while messing around, I noticed that FlashFirebug let me inspect the Flash object just like an HTML page! Soon I found a DataGrid element holding all the data I needed! But how could I export it? The tool also provided a live ActionScript console. A few minutes on Google and I had written my first ever ActionScript snippet: a loop iterating over the data used to fill the projects table!

FlashFirebug ActionScript 3 console.

Boy, was I happy! After almost 10 hours of work, I finally had what I needed to start the real work: analyzing the data. What is usually the most trivial part for me (getting the data) turned out to be a real challenge this time. I learned a lot in the process, but I am glad it is over.

Coming up, I will try to extract some useful information from all this data.

Sharing the data

From the loop in the last screenshot, I was able to generate a text file with all the project titles, one per line. Using this code, I generated an XML file with all the information I needed, including project descriptions, university names, and professor names.
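The actual export ran in the ActionScript console, but the shape of the conversion is simple. A minimal Python equivalent (the field names below are illustrative assumptions, not the platform’s real schema) that turns scraped records into XML might look like:

```python
import xml.etree.ElementTree as ET

def projects_to_xml(projects):
    """Serialize scraped project records to an XML string.

    `projects` is a list of dicts; the keys used below (title,
    university, professor) are placeholders for the real fields.
    """
    root = ET.Element("projects")
    for p in projects:
        node = ET.SubElement(root, "project")
        for field in ("title", "university", "professor"):
            ET.SubElement(node, field).text = p.get(field, "")
    return ET.tostring(root, encoding="unicode")

# Hypothetical record, just to show the output shape
sample = [{"title": "Web data mining",
           "university": "Example University",
           "professor": "Dr. Example"}]
xml_text = projects_to_xml(sample)
```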

Just as a reminder: this is data gathered from the Mitacs student platform. You do not even need to log in to see it; I am just “reorganizing” it. Also, this post was written on August 19th, 2014, and it seems more projects may have been added to the list afterwards.

This entry was posted in Development.

4 Responses to Data analysis on Mitacs Globalink 2015 projects: Part 1 – The Data

  1. Pingback: Data analysis on Mitacs Globalink 2015 projects: Part 2 – Word Cloud | Fernando Brito

  2. Apurv says:

    However, the “XML file with all the information” isn’t working that well.
