KiBeKi - Searching the Persian Blogosphere by a Robot
KiBeKi (which means “What’s up?” in Persian) is a project I have been working on since early December 2007. The aim of this self-funded project is to sketch the connection graph of the Persian blogosphere. The latest reports of project KiBeKi can be found here. For more information, especially if you are interested in using the results of this work in your research, please drop me a line at arash@kamangir.net.
Long Version:
Essentially, KiBeKi is a page crawler that starts from a set of seeds and extracts links from the pages it visits. After each page is scanned, the resulting links are added to a pool of links. The exact way this procedure is carried out is described below. The code has seen two major revisions (generations 0 and 1), and generation 2 is underway.
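The seed-and-pool procedure described above can be sketched roughly as follows. This is only an illustration using the Python standard library, not the actual KiBeKi code: the names LinkExtractor, extract_links, and crawl, the fetch callback, and the max_pages limit are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute http(s) targets of all <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the links found in one page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(seeds, fetch, max_pages=100):
    """Scan pages starting from the seeds.

    `fetch(url)` is a hypothetical callback returning the page HTML,
    or None if the page could not be retrieved.
    Returns the list of (source, target) links that were found.
    """
    pool = deque(seeds)   # links waiting to be scanned
    seen = set(seeds)     # everything ever added to the pool
    edges = []
    scanned = 0
    while pool and scanned < max_pages:
        url = pool.popleft()
        scanned += 1
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            edges.append((url, link))
            if link not in seen:
                seen.add(link)
                pool.append(link)
    return edges
```

Passing the fetch function in as a parameter keeps the sketch independent of how pages are actually downloaded, so the same loop can be tried on a small dictionary of saved pages.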
Generation 0 - Starting from persian.kamangir.net, every website the crawler stumbled upon was analyzed. As of January 14, 2008, the system had discovered 284,000 potential sources, of which 78,000 were Persian blogs that had been completely analyzed. A further 94,000 were known to be Persian blogs and awaited analysis. A very early graphical analysis of these results can be found at Statistics of 78,000 Persian Blogs - Report on KiBeKi’s results so far.
A while after these results were collected, the system's database reached its capacity, which brought this generation to an end.
Generation 1 - Finished on March 4, 2008. In this generation the crawl again starts from persian.kamangir.net, but rather than scanning every source discovered, a more mature procedure is carried out. At each stage, the database is sorted by number of incoming links. Then the top 1% of the sources that have not yet been scanned are selected and the page-parsing procedure is applied to them. After a few hours of crawling, a database of over 3,000 scanned sources was collected. The complete report can be found here.
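One possible reading of this selection step is sketched below. It is a guess at the logic, not the original code: the function names are made up, the 1% is taken over the sources not yet scanned (rounded up so that at least one source is always picked), and repeated links from the same source to the same target are counted once.

```python
import math
from collections import Counter


def count_incoming(edges):
    """Count, for each target, how many distinct sources link to it."""
    return Counter(target for _source, target in set(edges))


def select_next_batch(edges, scanned, fraction=0.01):
    """Pick the most linked-to sources that have not been scanned yet.

    `edges` is a list of (source, target) pairs and `scanned` is the set
    of sources already processed. Both the 1% default and the tie-break
    by URL are assumptions made for this sketch.
    """
    incoming = count_incoming(edges)
    candidates = [url for url in incoming if url not in scanned]
    if not candidates:
        return []
    # Most incoming links first; ties are broken alphabetically.
    candidates.sort(key=lambda url: (-incoming[url], url))
    batch_size = max(1, math.ceil(len(candidates) * fraction))
    return candidates[:batch_size]
```

Each round would then scan the returned batch, add the newly found links to the database, and call select_next_batch again.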
It turned out that many blogs on the popular Persian blogging service blogfa are infected with suspicious code that hides a huge number of links in their templates, out of sight, and therefore makes irrelevant blogs seem important. As a result, the output from any blog on blogfa was ignored (take a look at the source of this blog as an example).
Generation 1 suffers from two major shortcomings. First, because an Internet Explorer-based API is used to collect information from websites, blogrolls embedded in pages through the popular service blogrolling.com are not included in the results. These links, as well as links posted through other services such as delicious, are seen as JavaScript code rather than as the actual links. Second, the blogfa effect described above has to be dealt with at an earlier stage of the process.
Generation 2 - Starting on March 5, 2008, the code evolved into generation 2. The major change to date is the move from one seed to ten seeds. These ten blogs are the top ten blogs in the report published on March 4, 2008. Furthermore, in this generation the outgoing links from blogs on blogfa are disregarded.
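Disregarding the outgoing links of one blogging service could look something like the sketch below. The helper names are hypothetical, and it is assumed here that blogfa blogs live on blogfa.com and its subdomains; the real filter in KiBeKi may work differently.

```python
from urllib.parse import urlsplit

# Assumption: blogs hosted on the service use this domain or a subdomain of it.
IGNORED_DOMAINS = ("blogfa.com",)


def is_on_ignored_service(url, ignored=IGNORED_DOMAINS):
    """True if the URL's host is an ignored domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in ignored)


def drop_outgoing_links(edges, ignored=IGNORED_DOMAINS):
    """Remove every (source, target) link whose source is on an ignored service.

    Links pointing *to* such blogs are kept; only the links they emit are dropped.
    """
    return [
        (source, target)
        for source, target in edges
        if not is_on_ignored_service(source, ignored)
    ]
```

Applying the filter before links are counted means the hidden template links never reach the incoming-link ranking.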
Last update: 4 March 2008











