Zipfiles of link structures of collections of academic websites, as crawled by SocSciBot, are available on Figshare.

The Academic Web Link Database Project

Databases of academic web links 2000-2006 for the world research community

This project was created for research into web links, including web link mining and the creation of link metrics. It aims to provide the raw data and software that researchers need to analyse link structures without having to rely upon commercial search engines or run their own web crawler. You may use all of the resources on this site for non-commercial purposes, but please notify us if you publish an academic paper or book that uses the data in any way (so that we know the site is getting good use). This site contains the following.

What can the data and software be used for? To count links between universities or departments in various countries; to produce random lists of web pages or URLs; and to analyse the topological structure of national academic webs. Another team has also used this data for statistical research on semi-supervised learning on graphs.
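As an illustration of the first use, counting links between universities, here is a minimal Python sketch. It assumes the link data has already been read into (source URL, target URL) pairs; that layout, the function names, and the sample URLs are assumptions made for this example and do not describe the actual database file format or the tools offered on this site.

```python
from collections import Counter
from typing import Iterable, Optional, Set, Tuple
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the lower-cased host name of a URL without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def university_of(url: str, domains: Set[str]) -> Optional[str]:
    """Map a URL to the university domain it falls under, or None."""
    host = site_of(url)
    for domain in domains:
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def count_interlinks(
    links: Iterable[Tuple[str, str]], domains: Set[str]
) -> Counter:
    """Count links whose source and target belong to different universities."""
    counts: Counter = Counter()
    for source, target in links:
        src = university_of(source, domains)
        dst = university_of(target, domains)
        if src and dst and src != dst:
            counts[(src, dst)] += 1
    return counts


# Made-up sample data in the assumed (source, target) layout.
SAMPLE_LINKS = [
    ("http://www.wlv.ac.uk/a.html", "http://www.ox.ac.uk/"),
    ("http://scit.wlv.ac.uk/b.html", "http://www.ox.ac.uk/research/"),
    ("http://www.ox.ac.uk/c.html", "http://www.wlv.ac.uk/"),
    ("http://www.wlv.ac.uk/d.html", "http://www.wlv.ac.uk/e.html"),  # internal link
    ("http://www.wlv.ac.uk/f.html", "http://www.example.com/"),  # outside the set
]
DOMAINS = {"wlv.ac.uk", "ox.ac.uk"}

if __name__ == "__main__":
    for (src, dst), n in sorted(count_interlinks(SAMPLE_LINKS, DOMAINS).items()):
        print(f"{src} -> {dst}: {n}")
```

Links within a single university and links to sites outside the chosen set are ignored here; whether to count those, and whether to count by page, directory, or domain, is a design decision for the analysis at hand.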

Databases - Tools for mining the data - Crawling methodology - Web link research - Research group


Database 23: UK university Web sites June-July 2006

Database 22: Australian university Web sites February-April 2006

Database 21: New Zealand university Web sites January 2006

Database 20: UK university Web sites June-July 2005

Database 19: Australian university Web sites January-March 2005

Database 18: New Zealand university Web sites January 2005

Database 17: Some US university Web sites July 2004

Database 16: UK university Web sites June 2004

Database 15: Australian university Web sites February 2004

Database 14: New Zealand university Web sites December 2003

Database 13: UK university web sites June 2003

Database 12: Australian university Web sites February-March 2003

Database 11: New Zealand university Web sites January 2003

Database 10: Spanish university web sites July 2002

Database 9: UK university web sites June-July 2002

Database 8: Taiwanese university web sites February-March 2002

Database 7: Mainland China university web sites December 2001-January 2002

Database 6: New Zealand university web sites January-February 2002

Database 5: Australian university web sites October 2001-January 2002 (slow crawl)

Database 4: UK university web sites July 2001

Database 3: UK university web sites June-July 2000

Database 2: New Zealand university web sites July-August 2000

Database 1: Australian university web sites July-August 2000

Tools

Some of these programs require large amounts of memory. Please save any work before running them in case they crash your computer. The programs are offered as a free service and have therefore been tested only in our own environment, although they should run on most versions of Windows. Please email us if you encounter any problems. Some of the programs may take a long time to run (days, if you have a slow computer and are processing the large database files). We apologise for the programs' poor interfaces, and are happy to advise researchers on which programs suit the type of analysis they are interested in.

Description of Crawling Methodology

Basic crawling techniques and issues are described in: A Free Database of University Web Links: Data Collection Issues. Additional crawling issues and techniques are discussed in the following article.

Thelwall, M. (2001). A Web Crawler Design for Data Mining. Journal of Information Science, 27(5), 319-326.

Web Link Research

For our publications, please see the Statistical Cybermetrics Research Group home page. There is a large list of related work available on the web site of the e-journal Cybermetrics. A much bigger Unix-based archive that is similar in spirit is available at http://www.archive.org/.

About this project

This project is run by the Statistical Cybermetrics Research Group at the University of Wolverhampton. We do not charge for any of the data or tools placed here because we feel obliged to make our raw data freely available, since we collected it for free from the Web sites covered. Crawling is resource-intensive and time-consuming, so we are unfortunately unable to respond to requests such as "please crawl country X". If any bodies, such as national research agencies, would like to see their countries' universities included, this would involve a charge. We would expect, but not insist, that the data resulting from such an arrangement would subsequently be made available on this site, also without charge.

Other web page or web link datasets

Please tell us if you have one and we will link to it.