The Academic Web Link Database Project
Making available databases of academic web links to the world research community
This project was created for research into web links, including web link mining and the creation of link metrics. It aims to provide the raw data and software that researchers need to analyse link structures without having to rely upon commercial search engines and without having to run their own web crawler. You may use all of the resources on this site for non-commercial purposes, but please notify us if you have an academic paper or book published that uses the data in any way (so that we know the site is getting good use). This site contains the following:
- Complete databases of link structures of collections of academic web sites.
- Files of summary statistics about the link databases. A prototype system and a Windows-based statistical analyser are available. *NEW* Extra statistics about the graph structure of database 20 created by Tobias Escher (see below).
- Software tools for researchers to extract the information that they are particularly interested in.
- Frequency lists of words found in each entire downloaded corpus.
- Descriptions of the methodologies used to crawl the web so that the information provided can be critically evaluated.
- Files of information used in the web crawling process.
What can the data and software be used for? To count links between universities or departments in various countries; to produce random lists of web pages or URLs; and to analyse the topological structure of national academic webs. Here is an example of statistical research about semi-supervised learning on graphs from another team using this data.
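As a rough illustration of the first use, the sketch below counts links between universities. It does not read the database files: the link list, the domain-to-university mapping, and the function names `owner_of` and `count_inter_university_links` are all hypothetical, and the toy URLs are invented for the example. It only assumes that links can be obtained as (source URL, target URL) pairs and that each university has one or more known domain names.

```python
from collections import Counter
from urllib.parse import urlsplit


def owner_of(url, domain_to_university):
    """Return the university that owns a URL's host, or None if unknown."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Try the host itself, then each parent domain
    # (www.cs.ucl.ac.uk -> cs.ucl.ac.uk -> ucl.ac.uk -> ...).
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in domain_to_university:
            return domain_to_university[candidate]
    return None


def count_inter_university_links(links, domain_to_university):
    """Count links per (source university, target university) pair."""
    counts = Counter()
    for source_url, target_url in links:
        source = owner_of(source_url, domain_to_university)
        target = owner_of(target_url, domain_to_university)
        # Skip links to or from unknown hosts and links within one university.
        if source and target and source != target:
            counts[(source, target)] += 1
    return counts


# Toy data, invented for illustration only.
domains = {"wlv.ac.uk": "Wolverhampton", "ucl.ac.uk": "UCL"}
links = [
    ("http://www.wlv.ac.uk/a.html", "http://www.ucl.ac.uk/"),
    ("http://www.wlv.ac.uk/b.html", "http://www.cs.ucl.ac.uk/x.html"),
    ("http://www.ucl.ac.uk/", "http://www.wlv.ac.uk/"),
    ("http://www.wlv.ac.uk/a.html", "http://www.wlv.ac.uk/b.html"),  # internal
    ("http://www.wlv.ac.uk/a.html", "http://example.com/"),  # outside the set
]
link_counts = count_inter_university_links(links, domains)
for (source, target), n in sorted(link_counts.items()):
    print(f"{source} -> {target}: {n}")
```

Matching on parent domains means that links from departmental subdomains are credited to the parent university, which may or may not suit a given study.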
Database 23: UK university Web sites June-July 2006
- Full database of the link structure of all 112 UK universities, as of June-July 2006 (Very big file, 196Mb). Note: the new Chester University was missed by mistake; sorry, Chester. Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it.
- The list of the known domain names of the universities covered is in the main zip file. Note that some universities changed their names and/or domain names and/or primary domain name since previous crawls (e.g., ucreative.ac.uk for surrart.ac.uk) and these are reflected in the domain_names.txt file in the archive.
- The list of URLs and partial URLs that were ignored by the crawler is in the main zip file. University name list (partial).
Database 22: Australian university Web sites February-April 2006
Database 21: New Zealand university Web sites January 2006
Database 20: UK university Web sites June-July 2005
- Full database of the link structure of all 112 UK universities, as of June-July 2005 (Very Big file, 192Mb). Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it.
- The list of the known domain names of the universities covered is in the main zip file. Note that the University of Manchester and UMIST merged for this crawl (but not for any previous crawls). Also, the primary names of some universities swapped with secondary domain names, so the domain_names.txt file has some re-orderings from previous years.
- The list of URLs and partial URLs that were ignored by the crawler is in the main zip file. University name list.
- ADDITIONAL GRAPH PROPERTIES SPSS FILE created by Tobias Escher, a research fellow at UCL, UK. See also http://www.governmentontheweb.org and The Web Structure of E-Government.
- Properties included: directed diameter; average directed distance; number of unreachable pairs; Bow-Tie Partition; all measures once for the full site and once for navigable content only (excluding pdfs, docs, images etc.)
- Tobias comments that: "Please note that I did only include links that are INTERNAL to the site. For some reason Plymouth did not seem to have been crawled correctly and is therefore not included in the database."
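For readers unfamiliar with the graph properties listed above, here is a minimal sketch of how they might be computed for a small directed graph of pages. It is not the procedure used to produce the SPSS file. It assumes that the directed diameter is the longest finite shortest path, that the average is taken over reachable ordered pairs only, and, as a simplification, that the bow-tie core is the strongly connected component containing a chosen seed page rather than the largest one. The function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(graph, start):
    """Shortest directed distances from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def all_nodes(graph):
    """Every node that appears as a link source or a link target."""
    return set(graph) | {t for targets in graph.values() for t in targets}


def distance_stats(graph):
    """Diameter, average distance and unreachable ordered pairs."""
    nodes = all_nodes(graph)
    total = reachable = longest = 0
    for source in nodes:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                reachable += 1
                longest = max(longest, d)
    ordered_pairs = len(nodes) * (len(nodes) - 1)
    return {
        "directed_diameter": longest,
        "average_directed_distance": total / reachable if reachable else 0.0,
        "unreachable_pairs": ordered_pairs - reachable,
    }


def bow_tie(graph, seed):
    """Split nodes into core, in, out and other, relative to seed's component."""
    nodes = all_nodes(graph)
    reverse = {n: [] for n in nodes}
    for source, targets in graph.items():
        for target in targets:
            reverse[target].append(source)
    forward = set(bfs_distances(graph, seed))     # pages the seed can reach
    backward = set(bfs_distances(reverse, seed))  # pages that can reach the seed
    core = forward & backward
    return {
        "core": core,
        "in": backward - core,
        "out": forward - core,
        "other": nodes - forward - backward,
    }


# Toy site: page -> pages it links to (invented for illustration).
site = {"a": ["b"], "b": ["c"], "c": ["b", "d"], "d": [], "e": ["a"]}
stats = distance_stats(site)
parts = bow_tie(site, "b")
print(stats)
print(parts)
```

Running one breadth-first search per page is fine for a toy graph but would be slow on a full university site, where sampling source pages is a common compromise.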
Database 19: Australian university Web sites January-March 2005
Database 18: New Zealand university Web sites January 2005
Database 17: Some US university Web sites July 2004
- Full database of the link structure of 23 US universities, as of July 2004 (Very Big file, 112Mb). Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it. This is for testing purposes and not for research because the sample of universities is not systematic.
- The zip file contains the domain names and banned list. It is in a *NEW STRUCTURE* designed for compatibility with SocSciBot Tools. It must be unzipped into a subfolder of the folder "crawler_data" created by SocSciBot/SocSciBot Tools so that the path structures created are "crawler_data/US_2004/link results/" and "crawler_data/US_2004/info/". SocSciBot Tools will then be able to analyze the data most easily.
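The unzipping step described above could be scripted along the following lines. This is only a sketch: the function name `unpack_crawl` is hypothetical, and it assumes that the archive's top-level folders are "link results" and "info", which is inferred from the paths quoted above rather than confirmed. Instead of trusting that assumption, the function reports any expected folder that is missing after extraction.

```python
import zipfile
from pathlib import Path


def unpack_crawl(archive_path, crawler_data_dir, crawl_name):
    """Extract a crawl archive into crawler_data/<crawl_name>/.

    Returns the target folder and a list of expected subfolders that
    were not found after extraction (empty if the layout is as assumed).
    """
    target = Path(crawler_data_dir) / crawl_name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    # Folder names taken from the paths quoted in the description above.
    expected = [target / "link results", target / "info"]
    missing = [str(folder) for folder in expected if not folder.is_dir()]
    return target, missing
```

A call might look like `unpack_crawl("us_2004.zip", "crawler_data", "US_2004")`, where the archive file name is a placeholder for whatever the downloaded file is called. If the returned list of missing folders is not empty, the archive layout differs from the assumption and the files would need moving by hand.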
Database 16: UK university Web sites June 2004
- Full database of the link structure of 125 UK universities, as of June 2004 (Very Big file, 189Mb). Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it.
- The zip file contains the domain names and banned list. It is in a *NEW STRUCTURE* designed for compatibility with SocSciBot Tools. It must be unzipped into a subfolder of the folder "crawler_data" created by SocSciBot/SocSciBot Tools so that the path structures created are "crawler_data/UK_2004/link results/" and "crawler_data/UK_2004/info/". SocSciBot Tools will then be able to analyze the data most easily.
Database 15: Australian university Web sites February 2004
Database 14: New Zealand university Web sites December 2003
Database 13: UK university web sites June 2003
Database 12: Australian university Web sites February-March 2003
Database 11: New Zealand university Web sites January 2003
Database 10: Spanish university web sites July 2002
Database 9: UK university web sites June-July 2002
Database 8: Taiwanese university web sites February-March 2002
Database 7: Mainland China university web sites December, 2001 - January, 2002
Database 6: New Zealand university web sites January 2002 to February 2002
Database 5: Australian university web sites October 2001 to January 2002 (slow crawl)
Database 4: UK university web sites July, 2001
Database 3: UK university web sites June-July 2000
Database 2: New Zealand university web sites July-August 2000
Database 1: Australian university web sites July-August 2000
Tools
Some of these programs require large amounts of memory. Please save any work before running them in case they crash your computer. The programs are being offered as a free service and as a result they have not been tested in other environments. They should run on most versions of Windows. Please email if there is any problem. Some of the programs may take a long time to run (days if you have a slow computer and are processing the large database files). We are sorry for the awful interfaces provided on the programs but are happy to advise researchers on which programs will be useful to conduct the type of analysis that they are interested in.
Description of Crawling Methodology
Basic crawling techniques and issues are described in: A Free Database of University Web Links: Data Collection Issues. Additional crawling issues and techniques are discussed in the following article.
Thelwall, M. (2001). A Web Crawler Design for Data Mining. Journal of Information Science, 27(5), 319-326.
Web Link Research
For our publications, please see the Statistical Cybermetrics Research Group home page. There is a large list of related work available on the web site of the e-journal Cybermetrics. A much bigger Unix-based archive that is similar in spirit is available at http://www.archive.org/.
About this project
This project is run by the Statistical Cybermetrics Research Group at the University of Wolverhampton. We do not charge for any of the data or tools placed here because we feel that we have an obligation to make our raw data available for free since we collected it for free from the Web sites covered. The crawling is resource intensive and time-consuming so we are unfortunately not able to respond to requests such as "please crawl country X". If any bodies, such as national research agencies, would like to see their countries' universities included, then this would involve a charge. We would expect, but not insist, that the data resulting from such an arrangement would be subsequently made available on this site, also without charge.
Other web page or web link datasets
Please tell us if you have one and we will link to it.