Wget Download All Files In Directory

Wget command syntax: wget [options] [URL]. To save downloaded files to a specific directory, use -P or --directory-prefix=prefix. From the wget man page: "-P prefix, --directory-prefix=prefix: Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree." Wget is a popular, non-interactive and widely used network downloader which supports protocols such as HTTP, HTTPS, and FTP, as well as retrieval through HTTP proxies. By default, wget downloads files into the current working directory where it is run. Read Also: How to Rename File While Downloading with Wget in Linux.
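For example, the following saves a file into a chosen directory instead of the current one (the URL here is only a placeholder):

```sh
# download file.mp3 into ~/Downloads/audio, creating the directory if needed
wget -P ~/Downloads/audio https://example.com/file.mp3
```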

This evening I came across a Google Search result for Game of Thrones audiobooks, and my immediate question was: how can I download all of these .mp3 files to my computer at once? Going through each of those directories and clicking on the files one by one to download them was, well, a bit too boring and time-consuming.

  1. Wget is a free and very powerful file downloader that comes with a lot of useful features, including resume support, recursive download, and FTP/HTTPS support. In the movie “The Social Network”, Mark Zuckerberg is seen using the wget tool to download all the student photos from his university.
  2. The wget command will put additional strain on the site’s server because it will continuously traverse the links and download files. A good scraper would therefore limit the retrieval rate and include a wait period between consecutive fetch requests to reduce the server load, as in the example below.
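wget has built-in flags for both; a minimal sketch (the URL is a placeholder):

```sh
# recursive fetch that waits 2 seconds between requests and caps bandwidth at 200 KB/s
wget --wait=2 --limit-rate=200k -r -np https://example.com/files/
```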

So I gave it some thought and played with the very famous shell command wget, and that’s it! After a few trials it gave me what I needed.
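Assembled from the options explained below, the command looks like this (with <website-url> standing in for the actual site):

```sh
wget --execute robots=off --mirror --convert-links --no-parent --wait=5 <website-url>
```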

Explanation of each option

  • wget: the command-line utility that makes the HTTP request and downloads the remote files to our local machine.
  • --execute robots=off: ignore the robots.txt file while crawling through pages. It is helpful if you’re not getting all of the files.
  • --mirror: This option will basically mirror the directory structure for the given URL. It’s a shortcut for -N -r -l inf --no-remove-listing which means:
    • -N: don’t re-retrieve files unless newer than local
    • -r: specify recursive download
    • -l inf: maximum recursion depth (inf or 0 for infinite)
    • --no-remove-listing: don’t remove ‘.listing’ files
  • --convert-links: make links in downloaded HTML or CSS point to local files
  • --no-parent: don’t ascend to the parent directory
  • --wait=5: wait 5 seconds between retrievals, so that we don’t hammer the server.
  • <website-url>: the URL of the website to download the files from.

Happy Downloading

When you request a downloaded dataset from the Data Portal, there are many ways to work with the results. Sometimes, rather than accessing the data through THREDDS (such as via .ncml or the subset service), you just want to download all of the files to work with on your own machine.


There are several methods you can use to download your delivered files from the server en masse, including:

  • shell – curl or wget
  • python – urllib2
  • java – java.net.URL

Below, we detail how you can use wget or python to do this.

It’s important to note that the email notification you receive from the system will contain two different web links. They look very similar, but the directories they point to differ slightly.

First Link: https://opendap.oceanobservatories.org/thredds/catalog/ooi/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument/catalog.html

The first link (which includes thredds/catalog/ooi) will point to your dataset on a THREDDS server. THREDDS provides additional capabilities to aggregate or subset the data files if you use a THREDDS or OpenDAP compatible client, like ncread in Matlab or pydap in Python.

Second Link: https://opendap.oceanobservatories.org/async_results/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument

The second link points to a traditional Apache web directory. From here, you can download files directly to your machine by simply clicking on them.


Using wget

First you need to make sure you have wget installed on your machine. If you are on a Mac and have the Homebrew package manager installed, you can type the following in the terminal:
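```sh
brew install wget
```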

Alternatively, you can grab wget from GitHub: https://github.com/jay/wget


Once wget is installed, you can recursively download an entire directory of data using a command of the following shape, assembled from the flags explained below (make sure you use the second, Apache web link provided by the system):
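```sh
# the example Apache (async_results) link from above; substitute your own
wget -r -l1 -nd -nc -np -e robots=off -A.nc --no-check-certificate https://opendap.oceanobservatories.org/async_results/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument
```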

A simpler version, dropping the robots.txt and certificate overrides, may also work:
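```sh
# a plausible simpler form; keep the extra flags if the server requires them
wget -r -l1 -nd -nc -np -A.nc https://opendap.oceanobservatories.org/async_results/sage-marine-rutgers/20171012T172409-CE02SHSM-SBD11-06-METBKA000-telemetered-metbk_a_dcl_instrument
```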

Here is an explanation of the specified flags.

  • -r signifies that wget should recursively download data in any subdirectories it finds.
  • -l1 sets the maximum recursion to 1 level of subfolders.
  • -nd saves all matching files to the current directory instead of recreating the remote directory tree. If two files have identical names, wget appends a numeric suffix.
  • -nc does not download a file if it already exists.
  • -np prevents files from parent directories from being downloaded.
  • -e robots=off tells wget to ignore the robots.txt file. If this flag is left out and the site’s robots.txt disallows crawlers, wget will refuse to download anything.
  • -A.nc restricts downloading to the specified file types (with .nc suffix in this case)
  • --no-check-certificate skips the SSL certificate check. This is useful if the SSL certificate is set up incorrectly, but make sure you only do this on servers you trust.

Using python

wget is rather blunt and will download all the files it finds in a directory, though as we noted, you can restrict it to a specific file extension.

If you want to be more granular about which files you download, you can use Python to parse through the data file links it finds and have it download only the files you really want. This is especially useful when your download request results in a lot of large data files, or if the request includes files from many different instruments that you may not need.

Here is an example script that uses the THREDDS service to find all .nc files included in the download request. Under the hood, THREDDS provides a catalog.xml file which we can use to extract the links to the available data files. This XML file is much easier to parse than raw HTML.
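A minimal sketch of such a script (written for Python 3, where urlretrieve lives in urllib.request; it assumes the standard THREDDS <dataset urlPath="..."> catalog layout and the /thredds/fileServer/ download endpoint):

```python
import os
import urllib.request
import xml.dom.minidom

# Update these two variables for your own download request.
server_url = 'https://opendap.oceanobservatories.org'
request_url = ('/thredds/catalog/ooi/sage-marine-rutgers/'
               '20171012T172409-CE02SHSM-SBD11-06-METBKA000-'
               'telemetered-metbk_a_dcl_instrument/catalog.xml')


def get_data_urls():
    """Parse the THREDDS catalog.xml and return download URLs for all .nc files."""
    catalog = urllib.request.urlopen(server_url + request_url).read()
    doc = xml.dom.minidom.parseString(catalog)
    urls = []
    for node in doc.getElementsByTagName('dataset'):
        url_path = node.getAttribute('urlPath')
        if url_path.endswith('.nc'):  # keep only netCDF files
            # Files are served over plain HTTP via THREDDS' fileServer endpoint.
            urls.append(server_url + '/thredds/fileServer/' + url_path)
    return urls


def main():
    # First part: build the list of files we would like to download.
    urls = get_data_urls()
    # Second part: actually download them into the current directory.
    for url in urls:
        filename = os.path.basename(url)
        print('Downloading ' + filename)
        urllib.request.urlretrieve(url, filename)


if __name__ == '__main__':
    main()
```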

The first part of the main() function creates an array of all of the files we would like to download (in this case, only ones ending in .nc), and the second part actually downloads them using urlretrieve(). If you want to download only files from particular instruments, or within specific date ranges, you can customize the code to filter out just the files you want (e.g. using a regex).


Don’t forget to update the server_url and request_url variables before running the code. You may also need to install the required libraries if you don’t already have them on your machine.

— Last revised on May 31, 2018 —