Highland Linux User Group

 Post subject: Wget Usage
Posted: Wed Feb 17, 2010 6:45 pm
Simple Usage

* Say you want to download a URL. Just type:

Code:
     wget http://fly.cc.fer.hr/


The response will be something like:
Code:
      --13:30:45--  http://fly.cc.fer.hr:80/
                 => `index.html'
      Connecting to fly.cc.fer.hr:80... connected!
      HTTP request sent, fetching headers... done.
      Length: 1,749 [text/html]

          0K -> .

      13:30:46 (68.32K/s) - `index.html' saved [1749/1749]


* But what happens if the connection is slow and the file is large? The connection will probably fail, perhaps more than once, before the whole file is retrieved. In that case Wget keeps retrying until it either has the whole file or exceeds the default number of retries (20). It is easy to raise the number of tries to 45 to ensure that the whole file arrives safely:
Code:
      wget --tries=45 http://fly.cc.fer.hr/jpg/flyweb.jpg


* Now let's leave Wget to work in the background, and write its progress to log file `log'. It is tiring to type `--tries', so we shall use `-t'.

Code:
      wget -t 45 -o log http://fly.cc.fer.hr/jpg/flyweb.jpg &


The ampersand at the end of the line makes Wget run in the background. To allow an unlimited number of retries, use `-t inf'.
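
If you start the download in the background from a script, you usually want to know later whether it worked. Here is a rough sketch of one way to do that with the shell's own `$!' and `wait'. The function name `fetch_and_wait' and the `WGET' override variable are made up for this example; they are not part of Wget.

```shell
# Start a download in the background, then block until it finishes and
# report whether Wget succeeded.
# WGET is a hypothetical override hook so the command can be swapped out
# (for example when trying the function without a network); it defaults
# to plain wget.
fetch_and_wait() {
    url=$1       # what to download
    logfile=$2   # where Wget writes its progress (-o)

    "${WGET:-wget}" -t 45 -o "$logfile" "$url" &
    pid=$!       # process ID of the background job

    # ... other work could happen here while the download runs ...

    # wait returns the exit status of the background job.
    if wait "$pid"; then
        echo "done: $url"
    else
        echo "failed: $url (see $logfile)"
        return 1
    fi
}

# Example call (not run here):
#   fetch_and_wait http://fly.cc.fer.hr/jpg/flyweb.jpg log
```

While the job is still running you can also peek at the log with `tail -f log'.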

* FTP is just as simple to use. Wget will take care of the login and password.

Code:
      $ wget ftp://gnjilux.cc.fer.hr/welcome.msg
      --23:35:55--  ftp://gnjilux.cc.fer.hr:21/welcome.msg
                 => `welcome.msg'
      Connecting to gnjilux.cc.fer.hr:21... connected!
      Logging in as anonymous ... Logged in!
      ==> TYPE I ... done.  ==> CWD not needed.
      ==> PORT ... done.    ==> RETR welcome.msg ... done.
      Length: 1,340 (unauthoritative)
     
          0K -> .
     
      23:35:56 (37.39K/s) - `welcome.msg' saved [1340]


* If you specify a directory, Wget will retrieve the directory listing, parse it and convert it to HTML. Try:
Code:
      wget ftp://prep.ai.mit.edu/pub/gnu/
      lynx index.html


Advanced Usage

* You would like to read the list of URLs from a file? Not a problem:

Code:
      wget -i file


If you specify `-' as file name, the URLs will be read from standard input.
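
Reading from standard input is handy when the list needs tidying first. A small sketch, assuming a hand-maintained list called `urls.txt' that may contain blank lines and `#' comments (the file name and the helper `clean_urls' are just for illustration):

```shell
# Write a sample URL list, one URL per line, with a comment and a blank line.
cat > urls.txt <<'EOF'
http://fly.cc.fer.hr/
# a comment line that should not be fetched
ftp://gnjilux.cc.fer.hr/welcome.msg

EOF

# Print only the real URLs: drop blank lines and lines starting with '#'.
clean_urls() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Show what would be handed to Wget.
clean_urls urls.txt

# To actually download, feed the cleaned list to Wget on standard input:
#   clean_urls urls.txt | wget -i -
```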

* Create a mirror image of the GNU WWW site (with the same directory structure as the original), with only one try per document, saving the activity log to `gnulog':

Code:
      wget -r -t1 http://www.gnu.ai.mit.edu/ -o gnulog


* Retrieve the first layer of Yahoo links:

Code:
      wget -r -l1 http://www.yahoo.com/


* Retrieve the index.html of `www.lycos.com', showing the original server headers:

Code:
      wget -S http://www.lycos.com/


* Save the server headers with the file:
Code:
      wget --save-headers http://www.lycos.com/
      more index.html


* Retrieve the first two levels of `wuarchive.wustl.edu', saving them to /tmp.

Code:
      wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/


* You want to download all the GIFs from an HTTP directory. `wget http://host/dir/*.gif' doesn't work, since HTTP retrieval does not support globbing. In that case, use:

Code:
      wget -r -l1 --no-parent -A.gif http://host/dir/


It is a bit of a kludge, but it works. `-r -l1' means to retrieve recursively (See section Recursive Retrieval), with maximum depth of 1. `--no-parent' means that references to the parent directory are ignored (See section Directory-Based Limits), and `-A.gif' means to download only the GIF files. `-A "*.gif"' would have worked too.
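
The accept list given to `-A' can hold several comma-separated suffixes, so the same trick works for more than one file type. As a sketch, the helper below (the name `build_accept_cmd' is invented here) only prints the command it would run, so you can check it before letting it loose:

```shell
# Build (but do not run) a one-level recursive wget command that accepts
# only the given suffixes.
# Usage: build_accept_cmd URL SUFFIX...
build_accept_cmd() {
    url=$1
    shift
    # Join the remaining arguments with commas: .gif .jpg -> .gif,.jpg
    # (IFS is changed only inside the command substitution's subshell).
    list=$(IFS=,; printf '%s' "$*")
    printf 'wget -r -l1 --no-parent -A %s %s\n' "$list" "$url"
}

build_accept_cmd http://host/dir/ .gif .jpg

# To run the printed command instead of just looking at it:
#   build_accept_cmd http://host/dir/ .gif .jpg | sh
```

If you use wildcard patterns instead of plain suffixes, keep them quoted (as in `-A "*.gif"') so the shell does not expand them against your local files first.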

* Suppose you were in the middle of downloading when Wget was interrupted, and you do not want to clobber the files already present. Use `-nc':

Code:
      wget -nc -r http://www.gnu.ai.mit.edu/


* If you want to supply your own username and password for HTTP or FTP, put them in the URL using the appropriate syntax (See section URL Format).

Code:
      wget ftp://hniksic:mypassword@jagor.srce.hr/.emacs


* If you do not like the default retrieval visualization (1K dots with 10 dots per cluster and 50 dots per line), you can customize it through dot settings (See section Wgetrc Commands). For example, many people like the "binary" style of retrieval, with 8K dots and 512K lines:
Code:
      wget --dot-style=binary ftp://prep.ai.mit.edu/pub/gnu/README


You can experiment with other styles, like:

Code:
      wget --dot-style=mega ftp://ftp.xemacs.org/pub/xemacs/xemacs-20.4/xemacs-20.4.tar.gz
      wget --dot-style=micro http://fly.cc.fer.hr/


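To make these settings permanent, put them in your `.wgetrc' (See section Sample Wgetrc). Note that newer Wget releases spell this option `--progress=dot:STYLE' (for example `--progress=dot:binary') and no longer offer the `micro' style.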

Guru Usage

* If you wish Wget to keep a mirror of a page (or FTP subdirectories), use `--mirror' (`-m'), which is the shorthand for `-r -N'. You can put Wget in the crontab file asking it to recheck a site each Sunday:

Code:
      # crontab entry (edit your crontab with: crontab -e)
      0 0 * * 0 wget --mirror ftp://ftp.xemacs.org/pub/xemacs/ -o /home/me/weeklog

* You may wish to do the same with someone's home page. But you do not want to download all those images--you're only interested in HTML.

Code:
      wget --mirror -A.html http://www.w3.org/


* But what about mirroring the hosts networkologically close to you? It seems so awfully slow because of all that DNS resolving. Just use `-D' (See section Domain Acceptance).

Code:
      wget -rN -Dsrce.hr http://www.srce.hr/


Now Wget will correctly find out that `regoc.srce.hr' is the same as `www.srce.hr', but will not even take into consideration the link to `www.mit.edu'.
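
The idea behind `-D' is simply "does this host name belong to the domain I named?". The snippet below is only an approximation of that test written in plain shell, not Wget's actual code; the function name `in_domain' is made up for the illustration.

```shell
# Return success if host $1 is the domain $2 itself or a host inside it.
# This mimics the spirit of -D matching; Wget's real matching rules may
# differ in detail.
in_domain() {
    case $1 in
        "$2" | *."$2") return 0 ;;
        *)             return 1 ;;
    esac
}

# Try it on the hosts mentioned above.
for h in regoc.srce.hr www.srce.hr www.mit.edu; do
    if in_domain "$h" srce.hr; then
        echo "$h: follow"
    else
        echo "$h: skip"
    fi
done
```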

* You have a presentation and would like the dumb absolute links to be converted to relative? Use `-k':

Code:
      wget -k -r URL


* You would like the output documents to go to standard output instead of to files? OK, but Wget will automatically shut up (turn on `--quiet') to prevent mixing of Wget output and the retrieved documents.

Code:
    wget -O - http://jagor.srce.hr/ http://www.srce.hr/


You can also combine the two options and make weird pipelines to retrieve the documents from remote hotlists:

Code:
      wget -O - http://cool.list.com/ | wget --force-html -i -


Extracted from http://www.editcorp.com/Personal/Lars_A ... get_7.html


