The Academic Web Link Database Project
Databases of academic web links 2000-2006 for the world research community
This project was created for research into web links, including web link mining and the creation of link metrics. It aims to provide the raw data and software that researchers need to analyse link structures without relying on commercial search engines or running their own web crawler. You may use all of the resources on this site for non-commercial purposes, but please notify us if you publish an academic paper or book that uses the data in any way (so that we know the site is getting good use). This site contains the following.
- Complete databases of link structures of collections of academic web sites, stored on Figshare.
- Files of summary statistics about the link databases. A prototype system and a Windows-based statistical analyser are available. Extra statistics about the graph structure of database 20, created by Tobias Escher (see below).
- Software tools for researchers to extract the information that they are particularly interested in.
- Frequency lists of words found in each entire downloaded corpus.
- Descriptions of the methodologies used to crawl the web so that the information provided can be critically evaluated.
- Files of information used in the web crawling process.
What can the data and software be used for? To count links between universities or departments in various countries; to produce random lists of web pages or URLs; and to analyse the topological structure of national academic webs. Here is an example of statistical research about semi-supervised learning on graphs from another team using this data.
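As an illustration of the first use, here is a minimal Python sketch of counting links between universities. It assumes the links have already been read into (source URL, target URL) pairs; the actual layout of the link files is not described on this page, so the `LINKS` sample, the `UNIVERSITY_DOMAINS` list and both function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical (source page, target URL) pairs. The real link files have
# their own layout, which must be parsed into pairs like these first.
LINKS = [
    ("http://www.wlv.ac.uk/a.html", "http://www.sussex.ac.uk/"),
    ("http://www.wlv.ac.uk/b.html", "http://www.sussex.ac.uk/research"),
    ("http://www.sussex.ac.uk/x.html", "http://www.wlv.ac.uk/"),
    ("http://www.wlv.ac.uk/c.html", "http://www.wlv.ac.uk/d.html"),
    ("http://www.wlv.ac.uk/c.html", "http://www.example.com/"),
]

# Illustrative domain list; a real analysis would use the domain names
# file supplied with each database.
UNIVERSITY_DOMAINS = ["wlv.ac.uk", "sussex.ac.uk"]


def university_of(url):
    """Return the university domain that a URL belongs to, or None."""
    host = (urlsplit(url).hostname or "").lower()
    for domain in UNIVERSITY_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return domain
    return None


def count_interuniversity_links(links):
    """Count links whose source and target lie in different universities."""
    counts = Counter()
    for source, target in links:
        src, dst = university_of(source), university_of(target)
        if src and dst and src != dst:
            counts[(src, dst)] += 1
    return counts


counts = count_interuniversity_links(LINKS)
```

Internal links and links to sites outside the domain list are ignored, so `counts` holds only the directed university-to-university totals.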
***Zip files of the link structures of collections of academic websites, as crawled by SocSciBot, are available via Figshare.***
Database 23: UK university Web sites June-July 2006
- Full database of the link structure of all 112 UK universities, as of June-July 2006 (Very big file, 196Mb). I MISSED THE NEW CHESTER UNIVERSITY BY MISTAKE, SORRY CHESTER. Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it.
- The list of the known domain names of the universities covered is in the main zip file. Note that some universities changed their names and/or domain names and/or primary domain name since previous crawls (e.g., ucreative.ac.uk for surrart.ac.uk) and these are reflected in the domain_names.txt file in the archive.
- The list of URLs and partial URLs that were ignored by the crawler is in the main zip file. University name list (partial).
Database 22: Australian university Web sites February-April 2006
- Full database of the link structure of all 38 Australian universities, as of January-April 2006 (Very Big file, 114Mb). Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it. The file contains some processed results from SocSciBot Tools in subfolders.
- The list of the known domain names of the universities covered is in the main zip file.
- The list of URLs and partial URLs that were ignored by the crawler is in the main zip file. University name list.
Database 21: New Zealand university Web sites January 2006
- Full database of the link structure of all 8 New Zealand universities, as of January 2006 (Big file, 10Mb). Zip file needs an Unzip program to extract the link files. Expect to need 0.5Gb of disk space to extract this file and work with it.
- The list of the known domain names of the universities covered is in the main zip file.
- The list of URLs and partial URLs that were ignored by the crawler is in the main zip file. University name list.
Database 20: UK university Web sites June-July 2005
- Full database of the link structure of all 112 UK universities, as of June-July 2005 (Very Big file, 192Mb). Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it.
- The list of the known domain names of the universities covered is in the main zip file. Note that the University of Manchester and UMIST merged for this crawl (but not for any previous crawls). Also, the primary names of some universities swapped with secondary domain names, so the domain_names.txt file has some re-orderings from previous years.
- The list of URLs and partial URLs that were ignored by the crawler is in the main zip file. University name list.
- ADDITIONAL GRAPH PROPERTIES SPSS FILE created by Tobias Escher, a research fellow at UCL, UK; see also http://www.governmentontheweb.org and The Web Structure of E-Government.
- The file covers: directed diameter; average directed distance; number of unreachable pairs; Bow-Tie Partition. All measures are given once for the full site and once for navigable content only (excluding pdfs, docs, images etc.).
- Tobias comments that: "Please note that I did only include links that are INTERNAL to the site. For some reason Plymouth did not seem to have been crawled correctly and is therefore not included in the database."
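For readers unfamiliar with these graph measures, the following Python sketch computes them for a tiny invented site graph. It is not the procedure used to build the SPSS file: the definitions (diameter as the longest finite shortest path, average taken over reachable ordered pairs only, bow-tie parts found from a chosen core page) are common conventions assumed here, and `GRAPH` and all function names are hypothetical.

```python
from collections import deque

# Invented internal link graph: page -> pages it links to.
GRAPH = {
    "home": ["a"],
    "a": ["b"],
    "b": ["a", "out"],
    "out": [],
    "island": [],
}


def bfs_distances(graph, start):
    """Shortest directed distances from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def all_nodes(graph):
    return set(graph) | {t for targets in graph.values() for t in targets}


def distance_summary(graph):
    """Return (directed diameter, average directed distance, unreachable pairs)."""
    nodes = all_nodes(graph)
    finite = []
    unreachable = 0
    for source in nodes:
        dist = bfs_distances(graph, source)
        finite.extend(d for node, d in dist.items() if node != source)
        unreachable += len(nodes) - len(dist)
    diameter = max(finite) if finite else 0
    average = sum(finite) / len(finite) if finite else 0.0
    return diameter, average, unreachable


def bow_tie(graph, core_member):
    """Split nodes into core, in, out and other, relative to one core page."""
    reverse = {}
    for src, targets in graph.items():
        for target in targets:
            reverse.setdefault(target, []).append(src)
    forward = set(bfs_distances(graph, core_member))
    backward = set(bfs_distances(reverse, core_member))
    core = forward & backward
    return {
        "core": core,
        "in": backward - core,
        "out": forward - core,
        "other": all_nodes(graph) - forward - backward,
    }


diameter, average, unreachable = distance_summary(GRAPH)
parts = bow_tie(GRAPH, "a")
```

Running one breadth-first search per page is fine for a sketch but slow on large sites; a real analysis of a full university web would more likely use a dedicated graph library.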
Database 19: Australian university Web sites January-March 2005
- Full database of the link structure of all 38 Australian universities, as of January-March 2005 (Very Big file, 78Mb). Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it.
- The list of the known domain names of the universities covered is in the main zip file.
- The list of URLs and partial URLs that were ignored by the crawler is in the main zip file. University name list.
Database 18: New Zealand university Web sites January 2005
- Full database of the link structure of all 8 New Zealand universities, as of January 2005 (Big file, 10Mb). Zip file needs an Unzip program to extract the link files. Expect to need 0.5Gb of disk space to extract this file and work with it.
- The list of the known domain names of the universities covered is in the main zip file.
- The list of URLs and partial URLs that were ignored by the crawler is in the main zip file. University name list.
Database 17: Some US university Web sites July 2004
- Full database of the link structure of 23 US universities, as of July 2004 (Very Big file, 112Mb). Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it. This is for testing purposes and not for research because the sample of universities is not systematic.
- The zip file contains the domain names and banned list. It is in a *NEW STRUCTURE* designed for compatibility with SocSciBot Tools. It must be unzipped into a subfolder of the folder "crawler_data" created by SocSciBot/SocSciBot Tools so that the resulting paths are "crawler_data/US_2004/link results/" and "crawler_data/US_2004/info/". SocSciBot Tools will then be able to analyse the data most easily.
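The unzipping step can be scripted. The Python sketch below extracts an archive into a project subfolder of "crawler_data" and checks that the two expected folders exist. It assumes the archive holds "link results/" and "info/" at its top level, which this page does not confirm; the function names and the demo archive contents are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_database(zip_path, crawler_data_dir, project_name):
    """Unzip an archive into crawler_data/<project_name>/ and return that path."""
    target = Path(crawler_data_dir) / project_name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


def has_expected_layout(project_dir):
    """Check for the two subfolders named in the instructions above."""
    project_dir = Path(project_dir)
    return (project_dir / "link results").is_dir() and (project_dir / "info").is_dir()


# Demonstration with a tiny stand-in archive (placeholder files only).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("link results/example_links.txt", "placeholder")
        archive.writestr("info/example_info.txt", "placeholder")
    project = extract_database(demo_zip, Path(tmp) / "crawler_data", "US_2004")
    layout_ok = has_expected_layout(project)
```

If `has_expected_layout` returns False for a real download, the archive probably has an extra top-level folder and the files need moving up one level.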
Database 16: UK university Web sites June 2004
- Full database of the link structure of 125 UK universities, as of June 2004 (Very Big file, 189Mb). Zip file needs an Unzip program to extract the link files. Expect to need 3.5Gb of disk space to extract this file and work with it.
- The zip file contains the domain names and banned list. It is in a *NEW STRUCTURE* designed for compatibility with SocSciBot Tools. It must be unzipped into a subfolder of the folder "crawler_data" created by SocSciBot/SocSciBot Tools so that the resulting paths are "crawler_data/UK_2004/link results/" and "crawler_data/UK_2004/info/". SocSciBot Tools will then be able to analyse the data most easily.
Database 15: Australian university Web sites February 2004
Database 14: New Zealand university Web sites December 2003
Database 13: UK university web sites June 2003
- Full database of the link structure of 125 UK university institutions, as of June, 2003 (Huge file, 180Mb). Zip file needs an Unzip program to extract the link files. Expect to need 4Gb of disk space to extract this file and work with it. The 125 institutions include all official UK universities (with institutions from the universities of London and Wales counted separately) and all independent Higher Education institutions that conduct research above a minimum threshold (calculated from RAE 2001 values).
- The list of the known domain names of the universities covered. University name list.
- List of URLs and partial URLs that were ignored by the crawler.
- Frequency list of all words found in the web pages. A word is any consecutive string of non-whitespace characters inside the body or title of the page but outside HTML tags; words containing any non-alphabetic character other than ' are excluded. Note in particular that words with numbers or hyphens in the middle are completely excluded and that the maximum word length is 25 characters. [about 5M zip file]
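The word rule above can be approximated in a few lines of Python. This is a sketch rather than the code that produced the frequency list. It assumes the text has already been stripped of HTML tags, that "alphabetic" means the ASCII letters A-Z and a-z, that words longer than 25 characters are dropped rather than truncated, and that case is left unchanged; none of these points is stated on this page.

```python
import re
from collections import Counter

MAX_WORD_LENGTH = 25
# Letters and apostrophes only (ASCII letters are an assumption).
WORD_PATTERN = re.compile(r"[A-Za-z']+")


def word_frequencies(text):
    """Count whitespace-delimited tokens that pass the word rule."""
    counts = Counter()
    for token in text.split():
        if len(token) <= MAX_WORD_LENGTH and WORD_PATTERN.fullmatch(token):
            counts[token] += 1
    return counts


sample = "The university's web-site lists 3 courses and the courses page"
freq = word_frequencies(sample)
```

In the sample, "web-site" and "3" are rejected outright, while "university's" is kept because the apostrophe is the one permitted non-alphabetic character.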
Database 12: Australian university Web sites February-March 2003
Database 11: New Zealand university Web sites January 2003
Database 10: Spanish university web sites July 2002
Database 9: UK university web sites June-July 2002
Database 8: Taiwanese university web sites February-March 2002
Database 7: Mainland China university web sites December, 2001 - January, 2002
- Full database of the link structure of 76 mainland Chinese universities, December 2001 - January, 2002 (Huge file, 240Mb). Self-extracting zip file: run the program to extract the link files. Expect to need 3Gb of disk space to extract this file and work with it.
- The list of the known domain names of the universities covered.
- Files of summary statistics will be added.
Database 6: New Zealand university web sites January 2002 to February 2002
Database 5: Australian university web sites October 2001 to January 2002 (slow crawl)
Database 4: UK university web sites July, 2001
Database 3: UK university web sites June-July 2000
- Full database of the link structure of 107 UK universities and similar institutions, as of July, 2000 (Huge file, 46Mb). Self-extracting zip file: run the program to extract the link files. Expect to need 850Mb of disk space to extract this file and work with it.
- A summary file containing counts of links to external URLs from each British university web site (Very big file, 14Mb).
- This data file is less reliable than the 2001 version because only one domain name root was used per institution. This has little impact on places like Wolverhampton (almost all pages use wlv.ac.uk) but a greater one on places like Sussex (many pages use both sussex.ac.uk and susx.ac.uk). University name list.
- When this data was collected there was no intention to make it publicly available, so it is presented here 'as is' and has not been cleaned up to remove universities that were subsequently not used in any analysis, e.g. the federal University of Wales. The ignored URLs list is believed to have been lost but will be posted if found.
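The effect of using a single domain name root can be seen in a short Python sketch. The two domain tables and the sample URLs are invented for illustration, and `institution_of` is a hypothetical helper rather than part of the project's software.

```python
from urllib.parse import urlsplit

# One root per institution, as in the 2000 crawl described above.
SINGLE_ROOT = {"sussex.ac.uk": "Sussex", "wlv.ac.uk": "Wolverhampton"}
# The same table with the alternative Sussex root added.
MULTI_ROOT = {**SINGLE_ROOT, "susx.ac.uk": "Sussex"}


def institution_of(url, roots):
    """Map a URL to an institution using a table of domain name roots."""
    host = (urlsplit(url).hostname or "").lower()
    for root, name in roots.items():
        if host == root or host.endswith("." + root):
            return name
    return None


# Invented example URLs.
urls = [
    "http://www.sussex.ac.uk/",
    "http://www.susx.ac.uk/Users/",
    "http://www.wlv.ac.uk/",
]
single = [institution_of(u, SINGLE_ROOT) for u in urls]
multi = [institution_of(u, MULTI_ROOT) for u in urls]
```

With the single-root table the susx.ac.uk page is not attributed to any institution, so links to or from it would be counted as external.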
Database 2: New Zealand university web sites July-August 2000
- Full database of the link structure of all NZ universities, as of July-August, 2000 (Big file, 3.3Mb). Self-extracting zip file: run the program to extract the link files. Expect to need 100Mb of disk space to extract this file and work with it.
- A summary file containing counts of links to external URLs from each New Zealand university web site (940k).
- This data file is less reliable than the 2001 version because only one domain name root was used per institution. University name list.
- When this data was collected there was no intention to make it publicly available, so it is presented here 'as is' and has not been cleaned up. The ignored URLs list is believed to have been lost but will be posted if found.
Database 1: Australian university web sites July-August 2000
- Full database of the link structure of all known Australian universities, as of July-August, 2000 (Very big file, 13.5Mb). Self-extracting zip file: run the program to extract the link files. Expect to need 300Mb of disk space to extract this file and work with it.
- A summary file containing counts of links to external URLs from each Australian university web site (Big file, 5Mb).
- This data file is less reliable than the 2001 version because only one domain name root was used per institution. University name list.
- When this data was collected there was no intention to make it publicly available, so it is presented here 'as is' and has not been cleaned up. The ignored URLs list is believed to have been lost but will be posted if found.
Tools
Some of these programs require large amounts of memory. Please save any work before running them in case they crash your computer. The programs are being offered as a free service and as a result they have not been tested in other environments. They should run on most versions of Windows. Please email if there is any problem. Some of the programs may take a long time to run (days if you have a slow computer and are processing the large database files). We are sorry for the awful interfaces provided on the programs but are happy to advise researchers on which programs will be useful to conduct the type of analysis that they are interested in.
Description of Crawling Methodology
Basic crawling techniques and issues are described in: A Free Database of University Web Links: Data Collection Issues. Additional crawling issues and techniques are discussed in the following article.
Thelwall, M. (2001). A Web Crawler Design for Data Mining. Journal of Information Science, 27(5), 319-326.
Web Link Research
For our publications, please see the Statistical Cybermetrics Research Group home page. There is a large list of related work available on the web site of the e-journal Cybermetrics. A much bigger Unix-based archive that is similar in spirit is available at http://www.archive.org/.
About this project
This project is run by the Statistical Cybermetrics Research Group at the University of Wolverhampton. We do not charge for any of the data or tools placed here because we feel that we have an obligation to make our raw data available for free since we collected it for free from the Web sites covered. The crawling is resource intensive and time-consuming so we are unfortunately not able to respond to requests such as "please crawl country X". If any bodies, such as national research agencies, would like to see their countries' universities included, then this would involve a charge. We would expect, but not insist, that the data resulting from such an arrangement would be subsequently made available on this site, also without charge.
Other web page or web link datasets
Please tell us if you have one and we will link to it.