Mitacs Globalink Research Internships is a Mitacs program that allows undergraduate students from Brazil, France, China, India, Mexico, Saudi Arabia, Turkey or Vietnam to complete a three-month research internship at a university lab in Canada.
I am interested in taking part in the program, and one of the steps in the application process is to choose between 3 and 7 projects from their list of 1,782 projects (as I write this article). Using their website, you can filter those projects by university, province, language and keywords.
I started by running queries with keywords such as “web” and other areas I am familiar with, but I soon realized that there were many other cool projects I could also apply to. I ended up going through all 1,700+ project titles by hand and noting in a text file the ones whose prerequisites and descriptions deserved a closer read.
When I was done, I got really curious about the data. “Which province is offering the most projects?”, “What is the average number of projects offered per professor?”, “What would a word cloud of the project titles look like?”
Getting the data
I could not find any “export” link on the page where they list all the projects. I have some experience with web scraping, but unfortunately their platform is built on Flash :(.
How could I get the data I needed? My first thought was: “Hmmm. Maybe it is possible to do some reverse engineering on Flash!”, but then I realized they pull the data asynchronously, as soon as I hit the “Projects” page. This data must be coming from somewhere, so I decided to try monitoring the network.
It is important to note that, while the page that loads the Flash component uses HTTPS, the requests made by the application itself use plain HTTP! This would have allowed me to see the traffic in Wireshark, for example. I decided, however, to use a “higher-level” approach.
In Chrome’s Developer Tools I was able to find the request I was looking for. Things would have been a lot easier if this data had come in plain JSON or XML, so I could poke around in it, but the headers told me they were using “Content-Type:application/x-amf”.
I am not familiar with Flash development, so I had no idea what AMF was or whether it was possible to open it. It turns out that AMF (Action Message Format) “is a binary format used to serialize object graphs such as ActionScript objects and XML, or send messages between an Adobe Flash client and a remote service” (source: Wikipedia).
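Instead of scanning the Network tab by eye, one option is to export the captured traffic as a HAR file (Chrome's Developer Tools can save one) and filter it by response MIME type. Below is a minimal Python sketch of that idea, not what I actually did. The names `find_amf_requests` and `load_har` are hypothetical, and the code assumes the standard HAR layout (`log.entries[].response.content.mimeType`).

```python
import json

AMF_MIME = "application/x-amf"


def find_amf_requests(har):
    """Return (method, url) pairs for HAR entries whose response is AMF."""
    matches = []
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        # The MIME type may carry parameters such as "; charset=utf-8".
        if mime.split(";")[0].strip().lower() == AMF_MIME:
            request = entry.get("request", {})
            matches.append((request.get("method"), request.get("url")))
    return matches


def load_har(path):
    """Read a HAR file (plain JSON) exported from the browser."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With a saved capture, something like `find_amf_requests(load_har("projects.har"))` (the file name is made up) would list the requests worth a closer look.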
I tried several tools and libraries that claim to decode AMF. For each one, I had to read how to install it and how to use it. Some libraries had no manual, so I had to look directly into their source code or unit tests.
All of the tools failed to some extent. Almost all of them gave some sort of decoding error with no further details. Some of the libraries were more specific, reporting problems with the DSQ externalizable class. The only tool that kept my hopes up was Charles. Charles was able to decode the AMF message and show it in its user interface, but there was no way to export the data!
After almost 8 hours with all of those tools, I decided to try something else. In Charles I was able to copy selected elements from the tree, but I had to manually expand the nodes and select them. I spent an additional hour recording macros to expand nodes in Charles, select them and paste them into a text file, but the Charles trial version closes every 30 minutes, so this would have taken forever!
It was late at night, so I decided to throw in the towel. It was time to ask the gods for help. I posted a question on Stack Overflow.
The next morning, there was still no answer to my question. I was close to giving up. But then I thought: what if this were a Mitacs project assignment? Would I just give up like that? Of course not!
More research, more tools, and no success.
I was very close to writing my own AMF deserializer when I found FlashFirebug, a professional tool for debugging Flash applications. The license costs $34.99/yr, but you can get a 2-day trial for $0.50. At this point I thought, “Why not?” and paid for the trial version.
The AMF decoder, once again, failed. But, messing around, I noticed that FlashFirebug allowed me to inspect the Flash object, just like an HTML page! Soon I found a DataGrid element holding all the data I needed! But how to export this data? The tool also provided an ActionScript live console. A few minutes on Google and I was able to write my first ever ActionScript snippet: a loop iterating over the data used to fill the projects table!
Boy, I was happy! After almost 10 hours of work, I finally had what I needed to start the real work, which is analyzing the data. Getting the data, which is usually the most trivial part for me, turned out to be a real challenge this time. I learned a lot during the process, but I am glad it is over.
Coming up, I will try to extract some useful information from all this data.
Sharing the data
With the loop shown in the last screenshot, I was able to generate a text file with all the project titles, one per line. Using this code, I was able to generate an XML file with all the information I needed, including project descriptions, university names and professor names.
Just as a reminder: this is data gathered from the Mitacs student platform. You do not even need to log in to see this data. I am just “reorganizing” it. Also, this post was written on August 19th, 2014. It seems that more projects may have been added to the list afterwards.
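As a taste of what such an XML file makes possible, here is a minimal Python sketch that answers two of the questions from the beginning of this post. The element names (`project`, `title`, `province`, `professor`) and the sample records are assumptions made up for illustration. The real file may be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample data; the element names are assumed, not the real schema.
SAMPLE_XML = """<projects>
  <project><title>Sample project A</title><province>Ontario</province><professor>Prof. X</professor></project>
  <project><title>Sample project B</title><province>Ontario</province><professor>Prof. X</professor></project>
  <project><title>Sample project C</title><province>Quebec</province><professor>Prof. Y</professor></project>
</projects>"""

FIELDS = ("title", "province", "professor")


def load_projects(xml_text):
    """Parse the XML into a list of dicts, one per <project> element."""
    root = ET.fromstring(xml_text)
    return [
        {field: (project.findtext(field) or "").strip() for field in FIELDS}
        for project in root.iter("project")
    ]


def projects_per_province(projects):
    """Count how many projects each province offers."""
    return Counter(p["province"] for p in projects)


def average_projects_per_professor(projects):
    """Total projects divided by the number of distinct professors."""
    professors = {p["professor"] for p in projects}
    return len(projects) / len(professors) if professors else 0.0


projects = load_projects(SAMPLE_XML)
print(projects_per_province(projects).most_common())
print(average_projects_per_professor(projects))
```

`Counter.most_common()` returns the provinces sorted by project count, which answers the “which province offers the most projects” question directly.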