Introduction
Mitacs Globalink Research Internships is a Mitacs program that allows undergraduate students from Brazil, France, China, India, Mexico, Saudi Arabia, Turkey or Vietnam to do a three-month internship in a university research lab in Canada.
This series of posts is a personal attempt to perform some basic data analysis on project information, such as project titles and descriptions. Check Part 1 to see my saga of collecting the data.
Motivation
The first question I wanted to answer was: “What are most of the projects about?” I decided that generating a word cloud from the project titles would be a quick and easy way to get an overview of the keywords and topics used to describe the projects.
In Part 1 of this series, I posted a link to a text file containing all the project titles, one per line, exactly as they were written on the Mitacs application platform. This is the data I chose to work with.
I have a professor who really likes word clouds. He uses one in the first lecture of every course he teaches to give his students an overview of what the class is going to be about. I asked him which tool he uses and he recommended Wordle.net. Creating a word cloud on this website is as simple as copying and pasting your data and tweaking the colors and fonts.
The first word cloud I generated ended up looking like this:
Even this initial version surprised me. Some words, like “cancer”, were totally unexpected. However, after a quick glance, I noticed some room for improvement.
Also, it is important to look at the word counts before drawing any conclusions. The size of each word is relative to the counts of the other words, not to the total number of words. In our example, we might think that hundreds of projects have the word “development” in their title, but this is true for only 82 projects out of 1700+, that is, fewer than 5% of them. This shows how heterogeneous our data is.
In other words, a word cloud alone shows us only which words appear the most, not how often they actually appear in our data. This may vary a lot depending on the input.
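To make this concrete, here is a minimal Python sketch that counts how many titles contain each word and reports that as a share of all titles. The function name `title_counts` and the sample titles are made up for illustration; the real input would be the one-title-per-line file from Part 1.

```python
import re
from collections import Counter


def title_counts(titles):
    """Count, for each word, the number of titles it appears in."""
    counts = Counter()
    for title in titles:
        # A set makes a word repeated inside one title count only once.
        words = set(re.findall(r"[^\W\d_]+", title.lower()))
        counts.update(words)
    return counts


# Hypothetical sample, not the real Mitacs data.
titles = [
    "Development of a mobile health system",
    "Cancer cell modelling",
    "Development of polymer systems",
    "Energy systems analysis",
]

counts = title_counts(titles)
for word in ("development", "cancer"):
    n = counts[word]
    print(f"{word}: {n} of {len(titles)} titles ({n / len(titles):.0%})")
```

Printing the share next to the raw count is what keeps a big word in the cloud from being mistaken for a word that appears in most titles.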
Preprocessing
Wordle already does some preprocessing for us, which is very nice. It removes stop words (common words such as “of”, “and”, “for”, etc.) and it is case insensitive (“ANALYSIS” and “analysis” are grouped together).
However, looking at the image above, you may notice the word “systems” on the left side and “system” on the right side. I would like to count them as a single word. As the data set is not that big, I could manually search and replace all the words I would like to group, but this would require me to inspect each generated word cloud and repeat the process many times.
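For readers who want to reproduce these two steps outside Wordle, the sketch below lowercases each title and drops stop words before counting. The stop word list is a tiny hand-picked one for illustration, not Wordle's actual list, and `keywords` is a hypothetical helper name.

```python
import re
from collections import Counter

# A tiny, hand-picked stop word list for illustration only;
# Wordle's actual list is not reproduced here.
STOP_WORDS = {"of", "and", "for", "the", "a", "in", "on", "to"}


def keywords(title):
    """Lowercase a title, split it into words and drop stop words."""
    words = re.findall(r"[^\W\d_]+", title.lower())
    return [w for w in words if w not in STOP_WORDS]


word_counts = Counter()
for title in ["ANALYSIS of polymers", "Analysis and design for energy"]:
    word_counts.update(keywords(title))

print(word_counts["analysis"])  # 2: both spellings are grouped
```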
This can be automated with a preprocessing technique called stemming, which reduces words to their “root” form (stem) by removing plurals, conjugations and derivations. Wordle does not offer stemming, so I had to do this kind of preprocessing in an external tool. For didactic purposes, I used KNIME, an open-source data analytics tool with a graphical user interface.
In KNIME I was able to remove French stop words (Wordle would let me remove either English or French stop words, but not both). This was important because some project titles are in French. I also applied English stemming. KNIME works with workflows. This is the one I made:
Now that I had done the preprocessing I wanted, I returned to Wordle to make a new word cloud. KNIME is also able to create word clouds, but the layout options offered by Wordle are far more appealing (in my opinion). My word cloud “version 2” looked like this:
Some words got bigger (like “system” and “model”), but this word cloud is somewhat harder to read, as it is now made of stems. Take for example the stem “applic”. We are not used to reading this “word”.
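To give an idea of what stemming does, here is a deliberately naive Python sketch that only strips a few suffixes. It is not the algorithm KNIME applies, and a real stemmer (such as Porter's) handles many more cases. The suffix list and the name `naive_stem` are assumptions made for this example.

```python
from collections import Counter

# Suffixes tried in order, longest first. This is a naive stand-in
# for a real stemmer, for illustration only.
SUFFIXES = ("ations", "ation", "ings", "ing", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["system", "Systems", "model", "models", "modeling"]
stem_counts = Counter(naive_stem(w) for w in words)
print(stem_counts["system"], stem_counts["model"])  # 2 3
```

A rule set this small over-strips some words (for example, “analysis” loses its final “s”), which is one reason to rely on an established stemmer for real data.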
Conclusion
So, which version is “better”? The first one or the second one? Which one should you use?
Overall, I believe each version has its own value. Applying stemming can reveal some interesting information, but it may also hide other things. Now both “mobile” and “mobility” appear as “mobil”, even though they may refer to very different things in different contexts.
I would recommend playing with different settings (stemming, no stemming, etc.) and comparing all the results, instead of just looking for a “final absolute” version.
Looking at both word clouds, I was happy with the results. Stems like “cancer”, “health”, “energi”, “polymer” and “sustain” helped me understand what kind of projects are on Mitacs this year. I hope you also had the chance to learn something new in this post. In the next post I will try to find which provinces are offering the most projects, drawing the results on a map.
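One way to spot this kind of merge is to list, for every stem, the distinct original words behind it. The sketch below assumes you have some stemming function available. Here a small hand-written lookup table (`EXAMPLE_STEMS`, built from the examples in this post) stands in for a real stemmer, and `words_per_stem` is a hypothetical helper name.

```python
from collections import defaultdict


def words_per_stem(words, stem):
    """Map each stem to the set of distinct original words behind it."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return groups


# Hand-written lookup standing in for a real stemmer, using the
# examples from this post.
EXAMPLE_STEMS = {
    "mobile": "mobil",
    "mobility": "mobil",
    "system": "system",
    "systems": "system",
}

groups = words_per_stem(
    ["mobile", "mobility", "systems", "system", "mobile"],
    lambda w: EXAMPLE_STEMS.get(w, w),
)
for stem, originals in sorted(groups.items()):
    if len(originals) > 1:
        print(stem, "<-", sorted(originals))
```

Stems that collect words with unrelated meanings are the ones worth checking by hand before trusting the stemmed word cloud.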