How to collect geographic website rankings from the internet?
The data for the social network sites by country visualisations was collected from the Alexa country rankings site. As far as I know, this is the only free source that provides this information for a reasonable number of countries. I wrote some Perl scripts to automate the process and to allow the same data to be collected in the future for comparison. (I can send the scripts to anyone who’s interested; just drop me an email.) This is a brief description of how the data was collected:
1. Extract the top websites in each country
The main ranking page for each country was first saved. The site ranking, site name and site description were then extracted from the HTML source.
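As an illustration of this step, here is a minimal sketch in Python (not the Perl of the original scripts). The markup it parses is an assumption: the class names `site-listing`, `rank`, `site-name` and `description` are hypothetical and do not reflect the real structure of the Alexa pages, so the selectors would need adjusting to the saved HTML.

```python
from html.parser import HTMLParser

# Hypothetical markup: these class names are assumptions for illustration,
# not the real structure of the Alexa ranking pages.
FIELD_CLASSES = {"rank": "rank", "site-name": "name", "description": "description"}


class RankingParser(HTMLParser):
    """Collect one dict per <li class="site-listing"> entry."""

    def __init__(self):
        super().__init__()
        self.sites = []      # finished entries
        self._entry = None   # entry currently being filled
        self._field = None   # field whose text is being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "site-listing" in classes:
            self._entry = {"rank": "", "name": "", "description": ""}
        elif self._entry is not None:
            for css_class, field in FIELD_CLASSES.items():
                if css_class in classes:
                    self._field = field

    def handle_data(self, data):
        if self._entry is not None and self._field is not None:
            self._entry[self._field] += data

    def handle_endtag(self, tag):
        if tag == "li" and self._entry is not None:
            entry = {key: value.strip() for key, value in self._entry.items()}
            entry["rank"] = int(entry["rank"])  # assumes the rank is a plain number
            self.sites.append(entry)
            self._entry = None
        self._field = None  # any closing tag ends the current field


def extract_sites(html_text):
    """Return a list of {"rank", "name", "description"} dicts from one saved page."""
    parser = RankingParser()
    parser.feed(html_text)
    return parser.sites


# Made-up sample page in the assumed markup.
SAMPLE_HTML = """
<ul>
<li class="site-listing"><span class="rank">1</span>
<a class="site-name" href="http://search.example">search.example</a>
<div class="description">A search engine.</div></li>
<li class="site-listing"><span class="rank">2</span>
<a class="site-name" href="http://friends.example">friends.example</a>
<div class="description">A social network for friends.</div></li>
</ul>
"""

for site in extract_sites(SAMPLE_HTML):
    print(site["rank"], site["name"], site["description"])
```

Running the parser once per saved country page gives one list of entries per country, which is the shape the later steps work on.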
2. Identify the social networking sites
Most websites listed in Alexa have a brief description. All the sites that had the word "social" in this description were saved in the social network definition file.
This process proved to be a good first guess at generating a list of social networking (SN) sites. Some of the identified sites (like badoo, yonja and perfspot) weren’t in the Wikipedia list of social networking sites. As expected, the process also identified sites with the word "social" in their description that aren’t SN sites. These sites were manually deleted.
When the results of the SN extraction process were analysed, some countries, especially non-English-speaking countries, had very low scores for the social networking sites that were originally identified. This was because these countries were in fact using social networking sites in their own language. This was most evident in Russian-speaking countries, where the website Vkontakte is very popular.
Finally a few popular sites that weren’t identified from the above process were manually inserted from the Wikipedia list of social networking websites.
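The keyword filter and the two manual corrections can be sketched as follows. This is a Python illustration under assumptions, not the original Perl: the function name `build_sn_list`, its parameters and the sample sites (all on the reserved `.example` domain, with made-up descriptions) are hypothetical.

```python
def build_sn_list(sites, keyword="social", exclude=(), include=()):
    """Return the set of site addresses treated as social networking sites.

    sites   -- iterable of dicts with "name" and "description" keys
    exclude -- false positives to delete by hand
    include -- popular sites the keyword search missed, added by hand
    """
    candidates = {
        site["name"].lower()
        for site in sites
        if keyword in site["description"].lower()
    }
    candidates -= {name.lower() for name in exclude}
    candidates |= {name.lower() for name in include}
    return candidates


# Made-up entries for illustration only.
sample_sites = [
    {"name": "friends.example", "description": "A social network for friends."},
    {"name": "news.example", "description": "Social and political news."},
    {"name": "localnet.example", "description": "Description in another language."},
]

sn_sites = build_sn_list(
    sample_sites,
    exclude=["news.example"],      # contains "social" but is not an SN site
    include=["localnet.example"],  # SN site whose description lacks the keyword
)
print(sorted(sn_sites))
```

Keeping the exclusions and additions as explicit lists means the manual decisions are recorded and can be reapplied when the data is collected again.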
The list of selected SN sites is by no means exhaustive, and there are some popular sites with embedded social networking features that make them almost SN sites. QQ.com, the popular Chinese site, and Flickr are two prime examples. These sites were excluded from the list because they are not purely social networking sites: they evolved into social networking sites from another type of website (instant messaging in the case of QQ and photo sharing in the case of Flickr).
This is the final list of sites that were used in the selection process.
3. Retrieve the highest ranking website from the country rankings
The final extraction process selected the highest ranking SN site in each country's list, along with its ranking and site name. The website address was used to match the websites, to avoid any language problems with different character sets. The original idea was to include only SN sites with a ranking between 1 and 20; however, to populate the world map visualisation better, this ranking restriction was dropped. In some cases more than one SN site is listed in a country's top 10, which suggests that no single SN site clearly dominates in that country. Unfortunately, in the world map visualisation it was difficult to split a country in two to show two different SN sites, so only the highest ranking site in each case was considered.
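A minimal Python sketch of this selection, assuming each country's ranking is a list of dicts with `rank` and `name` keys and the SN list is a set of lower-case addresses. The function `top_sn_site` and the optional `max_rank` cut-off are hypothetical names used for illustration; the sample data is made up.

```python
def top_sn_site(country_sites, sn_sites, max_rank=None):
    """Return the best-ranked entry whose address is in sn_sites, or None.

    Matching is done on the site address rather than a display name,
    so character-set differences between countries do not matter.
    max_rank optionally restores a cut-off such as the top 20.
    """
    matches = [
        site
        for site in country_sites
        if site["name"].lower() in sn_sites
        and (max_rank is None or site["rank"] <= max_rank)
    ]
    # The lowest rank number is the highest ranking site.
    return min(matches, key=lambda site: site["rank"], default=None)


# Made-up ranking for one country.
country_sites = [
    {"rank": 1, "name": "search.example"},
    {"rank": 4, "name": "Friends.example"},
    {"rank": 9, "name": "localnet.example"},
]
sn_sites = {"friends.example", "localnet.example"}

best = top_sn_site(country_sites, sn_sites)
print(best)
```

Here two SN sites appear in the same top 10, and only the better ranked one is returned, which mirrors the one-site-per-country limit of the map.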
4. Adding data from other sources
To help with the visualisation of the collected data, some additional data that wasn’t available in the original data source was required: the continent of each country and the number of internet users in each country. At first I tried to look for this data using the new data search engine graphwise; however, the results returned were far from satisfactory. A Google search later pointed me to the data sources that were finally used to get this information.
Continent Data
Internet Users Data
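Combining these two extra data sets with the ranking results might look like the Python sketch below. It assumes all three data sets are keyed by the same country identifier; the function `merge_country_data`, the country names and every figure in the sample are invented for illustration.

```python
def merge_country_data(top_sites, continents, internet_users):
    """Build one row per country for the visualisation.

    top_sites      -- {country: {"name": ..., "rank": ...}}
    continents     -- {country: continent name}
    internet_users -- {country: number of internet users}
    Countries missing from an extra data set get None for that field.
    """
    rows = []
    for country, site in sorted(top_sites.items()):
        rows.append({
            "country": country,
            "site": site["name"],
            "rank": site["rank"],
            "continent": continents.get(country),
            "internet_users": internet_users.get(country),
        })
    return rows


# Invented sample values, not real statistics.
top_sites = {
    "Country B": {"name": "localnet.example", "rank": 2},
    "Country A": {"name": "friends.example", "rank": 4},
}
continents = {"Country A": "Europe", "Country B": "Asia"}
internet_users = {"Country A": 1000000}

rows = merge_country_data(top_sites, continents, internet_users)
for row in rows:
    print(row)
```

Leaving missing values as `None` rather than dropping the country keeps gaps in the extra data visible when the map is drawn.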