Using wget to retrieve web pages

30 May 2006

Wget is a utility for non-interactive download of files from the Web. It is particularly useful for downloading multiple pages, or for inclusion in scripts. It is a Unix program, included in most Linux distributions, and is also available for Windows through Cygwin (see Cygwin Linux Commands on Windows, choosing wget during the setup stage).

You can test that you have wget by just running it on the command line.


$ wget
wget: missing URL
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.
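
Inside a script the same check can be made without reading the output by eye. The following is a minimal sketch, assuming a POSIX-style shell; the function name have_wget is only illustrative and is not part of wget.

```shell
# Report whether a wget command can be found by the shell.
# have_wget is a hypothetical helper name, not part of wget itself.
have_wget() {
    command -v wget >/dev/null 2>&1
}

if have_wget; then
    echo "wget is available"
else
    echo "wget was not found on the PATH" >&2
fi
```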

The most basic operation is to download a single web page. This is achieved by including the URL (web page address) after the command.

wget http://www.zyxwebsite.com/index.html

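will result in a file called index.html containing the contents of the web page. If the file already exists, wget will not normally overwrite it, but will instead save the new copy with .1, .2, .3 etc. appended to the name. The -O option can be used to set the output file name, in which case an existing file of that name is overwritten. (The -r option, covered below, also overwrites existing copies rather than numbering them, but its real purpose is recursive download and it saves the files under a directory named after the host.)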

wget http://www.zyxwebsite.com/index.html -O retrievedfile.html

You should however note that if you use the -O option while downloading multiple files, all the pages will end up concatenated into a single file.
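
One way to avoid the concatenation when several pages are needed is to run wget once per URL, each with its own -O name. A rough sketch follows; the helper names output_name and fetch_each, the saved- prefix and the naming rule are all assumptions made for illustration.

```shell
# Derive a file name from a URL: the text after the last slash,
# or index.html when the URL ends in a slash. (Illustrative rule only.)
output_name() {
    name=${1##*/}
    [ -n "$name" ] || name=index.html
    printf '%s\n' "$name"
}

# Download every URL given as an argument into its own file.
fetch_each() {
    for url in "$@"; do
        wget "$url" -O "saved-$(output_name "$url")"
    done
}

# Example call:
#   fetch_each http://www.zyxwebsite.com/index.html http://www.zyxwebsite.com/about.html
```

Because each wget run has exactly one URL and one -O name, no two pages share an output file unless their last path components are the same.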

I use this format of the command in one of my scripts. I can't run cron jobs on my hosted Penguin Tutor web site, so to create my Google Sitemap file I use wget to pull information from the website, process it on my home Linux computer and then upload it back to the hosted site using sitecopy.

For more on Google Sitemaps see:

Another use of the wget tool is recursive downloads. Using this it is possible to save a copy of a number of linked web pages, or even an entire website. Obviously this should be used responsibly. The -r (--recursive) option is used.

wget -r -l1 http://www.zyxwebsite.com/index.html

The -l (--level) option sets the maximum depth of links to follow; if it is omitted, the default depth is 5. Setting the level to 0 (or inf) removes the limit, so all linked pages on that site would be downloaded.
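
As an illustration of using recursion responsibly, the sketch below wraps the command in a small function. The function name fetch_linked is hypothetical, and the --no-parent and --wait options are standard wget options not otherwise covered in this article; the one second delay is an arbitrary choice.

```shell
# fetch_linked is a hypothetical wrapper; the depth defaults to 1.
fetch_linked() {
    url=$1
    depth=${2:-1}
    # -r           follow links recursively
    # -l           limit how many levels of links are followed
    # --no-parent  never climb above the starting directory
    # --wait=1     pause one second between requests (arbitrary value)
    wget -r -l"$depth" --no-parent --wait=1 "$url"
}

# Example call:
#   fetch_linked http://www.zyxwebsite.com/index.html 2
```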

When performing recursive downloads wget will attempt to download a robots.txt file and, if it finds one, will honour its rules. To ignore the robots.txt file, the option
-e robots=off
should be included. Again this should be used responsibly, and you should consider why the site owner doesn't want a search engine to download certain files. This option can be useful if you want to download things from your own website when it has a robots.txt file blocking certain downloads.
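
Put together with the recursive options, a complete command might look like the sketch below. The function name fetch_ignoring_robots is hypothetical, and the example assumes the target is a site you control.

```shell
# fetch_ignoring_robots is a hypothetical name for this wrapper.
fetch_ignoring_robots() {
    # -e runs a wgetrc-style command for this invocation only;
    # here it switches off robots.txt handling.
    wget -r -l1 -e robots=off "$1"
}

# Example call (against your own site):
#   fetch_ignoring_robots http://www.zyxwebsite.com/index.html
```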

The final option that I am going to detail is how to use wget through a proxy server, which is not very well detailed in the man page. In some environments (particularly from inside company intranets) you may need to use a proxy server to act as an intermediary.

First you should set the http_proxy environment variable using:
export http_proxy=http://proxy.yourdomain.com:8000
replacing the address and port number with those of your proxy server.

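Then use the --proxy=on option (wget normally uses the proxy automatically once the environment variable is set, so this simply makes the choice explicit). E.g.: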
wget --proxy=on http://www.zyxwebsite.com/index.html

If your proxy server requires authentication, the following two options supply the login details.
--proxy-user=user
--proxy-password=password
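
The proxy steps can be combined in a script along the following lines. This is only a sketch: the function name fetch_via_proxy and the variables PROXY_USER and PROXY_PASS are hypothetical, and the proxy address is the placeholder used above.

```shell
# fetch_via_proxy is a hypothetical wrapper. The body runs in a subshell
# (note the parentheses) so that http_proxy is set only for this download.
fetch_via_proxy() (
    export http_proxy=http://proxy.yourdomain.com:8000
    # PROXY_USER and PROXY_PASS are assumed to be set by the caller.
    wget --proxy-user="$PROXY_USER" --proxy-password="$PROXY_PASS" "$1"
)

# Example call:
#   PROXY_USER=user PROXY_PASS=password
#   fetch_via_proxy http://www.zyxwebsite.com/index.html
```

Bear in mind that options given on the command line may be visible to other users of the same machine (for example through ps), so on a shared system it may be worth keeping the password in a protected wgetrc file instead.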

There are a lot more options and different ways in which the wget command can be used, but these should at least give you an idea of how useful it can be.