# How to Use wget to Download Files

## Table of Contents

1. [Introduction](#introduction)
2. [Prerequisites](#prerequisites)
3. [Basic wget Syntax](#basic-wget-syntax)
4. [Installation Guide](#installation-guide)
5. [Basic File Downloads](#basic-file-downloads)
6. [Advanced Download Options](#advanced-download-options)
7. [Downloading Multiple Files](#downloading-multiple-files)
8. [Website Mirroring and Recursive Downloads](#website-mirroring-and-recursive-downloads)
9. [Authentication and Cookies](#authentication-and-cookies)
10. [Network Configuration](#network-configuration)
11. [Common Use Cases](#common-use-cases)
12. [Troubleshooting](#troubleshooting)
13. [Best Practices](#best-practices)
14. [Advanced Tips and Tricks](#advanced-tips-and-tricks)
15. [Performance Optimization](#performance-optimization)
16. [Conclusion](#conclusion)

## Introduction

wget (a contraction of "World Wide Web" and "get") is a powerful, free command-line utility for downloading files from web servers. Originally developed for Unix-like systems, wget has become an essential tool for system administrators, developers, and power users who need to download files efficiently from the internet.

This comprehensive guide will teach you everything you need to know about using wget, from basic file downloads to advanced website mirroring techniques. Whether you're downloading a single file, creating backups of websites, or automating download processes in scripts, wget provides the flexibility and reliability you need.

Unlike web browsers, wget operates entirely from the command line and can handle interrupted downloads, follow redirects, and work with various authentication methods.

## Prerequisites

Before diving into wget usage, ensure you have:

- **Operating System**: Linux, macOS, or Windows with WSL/Cygwin
- **Command Line Access**: Terminal or command prompt
- **Internet Connection**: Active network connection
- **Basic Command Line Knowledge**: Understanding of basic terminal commands
- **Permissions**: Appropriate write permissions for download directories

### System Requirements

- **Memory**: Minimal RAM requirements (typically under 50 MB)
- **Storage**: Sufficient disk space for downloaded files
- **Network**: Stable internet connection for reliable downloads

## Basic wget Syntax

The fundamental syntax of wget follows this pattern:

```bash
wget [OPTIONS] [URL]
```

### Essential Components

- **wget**: The command itself
- **OPTIONS**: Flags that modify wget's behavior
- **URL**: The web address of the file or resource to download

### Simple Example

```bash
wget https://example.com/file.zip
```

This basic command downloads `file.zip` from the specified URL to the current directory.

## Installation Guide

### Linux Systems

Most Linux distributions include wget by default. If not installed:

**Ubuntu/Debian:**

```bash
sudo apt update
sudo apt install wget
```

**CentOS/RHEL/Fedora:**

```bash
sudo yum install wget
# or, on newer versions:
sudo dnf install wget
```

**Arch Linux:**

```bash
sudo pacman -S wget
```

### macOS

**Using Homebrew:**

```bash
brew install wget
```

**Using MacPorts:**

```bash
sudo port install wget
```

### Windows

**Windows Subsystem for Linux (WSL):**

```bash
sudo apt install wget
```

**Git Bash or Cygwin:** Download and install wget through their respective package managers.
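In provisioning or CI scripts, it can help to install wget automatically when it is missing. A minimal sketch, assuming a Debian-based system (swap in your distribution's package manager as needed):

```bash
#!/bin/bash
# Install wget if it is not already on the PATH (assumes apt is available)
if ! command -v wget >/dev/null 2>&1; then
    echo "wget not found, installing..."
    sudo apt update && sudo apt install -y wget
fi
```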
### Verification

Confirm installation by checking the version:

```bash
wget --version
```

## Basic File Downloads

### Single File Download

The simplest wget operation downloads a single file:

```bash
wget https://releases.ubuntu.com/20.04/ubuntu-20.04.3-desktop-amd64.iso
```

This command:

- Downloads the Ubuntu ISO file
- Saves it in the current directory
- Uses the original filename
- Shows download progress

### Specifying Output Filename

Use the `-O` (capital O) option to specify a custom filename:

```bash
wget -O ubuntu-desktop.iso https://releases.ubuntu.com/20.04/ubuntu-20.04.3-desktop-amd64.iso
```

### Downloading to a Specific Directory

Use the `-P` option to specify the download directory:

```bash
wget -P /home/user/downloads/ https://example.com/file.pdf
```

### Background Downloads

For large files, run wget in the background:

```bash
wget -b https://example.com/largefile.zip
```

The download progress is logged to the `wget-log` file.

## Advanced Download Options

### Resume Interrupted Downloads

Use the `-c` (continue) option to resume partial downloads:

```bash
wget -c https://example.com/largefile.zip
```

This is particularly useful for large files or unstable connections.

### Limiting Download Speed

Control bandwidth usage with the `--limit-rate` option:

```bash
wget --limit-rate=200k https://example.com/file.zip
```

Common rate formats:

- `200k` - 200 kilobytes per second
- `1m` - 1 megabyte per second
- `500` - 500 bytes per second

### Timeout Settings

Configure timeout values for better reliability:

```bash
wget --timeout=30 --tries=3 https://example.com/file.pdf
```

Options explained:

- `--timeout=30`: Wait up to 30 seconds for a response
- `--tries=3`: Attempt the download 3 times before giving up

### User Agent Modification

Some servers block or restrict access based on user agents:

```bash
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" https://example.com/file.zip
```

### Quiet and Verbose Modes

Quiet mode (suppress output):

```bash
wget -q https://example.com/file.pdf
```

Verbose mode (detailed output):

```bash
wget -v https://example.com/file.pdf
```

## Downloading Multiple Files

### From a Text File List

Create a text file with URLs (one per line):

```bash
# urls.txt
https://example.com/file1.pdf
https://example.com/file2.zip
https://example.com/file3.tar.gz
```

Download all files:

```bash
wget -i urls.txt
```

### Brace Expansion and Patterns

Download multiple numbered files using the shell's brace expansion (a bash feature, expanded before wget runs):

```bash
wget https://example.com/files/document{1..10}.pdf
```

This downloads `document1.pdf` through `document10.pdf`.

### Sequential Downloads

For numbered files:

```bash
for i in {1..5}; do
    wget https://example.com/file$i.zip
done
```

## Website Mirroring and Recursive Downloads

### Basic Website Mirroring

Mirror an entire website:

```bash
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com
```

Options explained:

- `--mirror`: Enable mirroring options
- `--convert-links`: Convert links for local viewing
- `--adjust-extension`: Add appropriate file extensions
- `--page-requisites`: Download CSS, images, and other page assets
- `--no-parent`: Don't ascend to the parent directory
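When mirroring a site you don't control, it is considerate to combine the flags above with request throttling. A hedged sketch (the delay and rate values are illustrative; tune them for the target server):

```bash
# Politer mirror: pause between requests and cap bandwidth
wget --mirror --convert-links --adjust-extension \
    --page-requisites --no-parent \
    --wait=1 --random-wait --limit-rate=500k \
    https://example.com
```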
### Recursive Download with Depth Limit

Control recursion depth:

```bash
wget -r -l 2 https://example.com/documentation/
```

- `-r`: Recursive download
- `-l 2`: Limit recursion to 2 levels deep

### Filtering File Types

Download only specific file types:

```bash
wget -r -A "*.pdf,*.doc,*.txt" https://example.com/documents/
```

Reject specific file types:

```bash
wget -r -R "*.gif,*.jpg,*.jpeg,*.png" https://example.com/
```

### Domain Restrictions

Stay within specific domains:

```bash
wget -r --domains=example.com,subdomain.example.com https://example.com/
```

## Authentication and Cookies

### HTTP Authentication

**Basic authentication:**

```bash
wget --http-user=username --http-password=password https://example.com/protected/file.zip
```

**Prompt for password:**

```bash
wget --http-user=username --ask-password https://example.com/protected/file.zip
```

### FTP Authentication

```bash
wget --ftp-user=username --ftp-password=password ftp://ftp.example.com/file.zip
```

### Using Cookies

Save cookies:

```bash
wget --save-cookies cookies.txt --keep-session-cookies https://example.com/login
```

Load cookies:

```bash
wget --load-cookies cookies.txt https://example.com/protected/file.zip
```

### Certificate Handling

Ignore SSL certificate errors (use cautiously):

```bash
wget --no-check-certificate https://example.com/file.zip
```

Specify a CA certificate:

```bash
wget --ca-certificate=mycert.pem https://example.com/file.zip
```

## Network Configuration

### Proxy Settings

wget has no dedicated proxy flags; it honors the standard `http_proxy`, `https_proxy`, and `ftp_proxy` environment variables, or the equivalent wgetrc settings passed with `-e`:

```bash
# Via an environment variable (for this command only)
https_proxy=http://proxy.example.com:8080 wget https://example.com/file.zip

# Via wgetrc-style settings
wget -e use_proxy=yes -e https_proxy=proxy.example.com:8080 https://example.com/file.zip
```

**SOCKS proxies:** wget has no built-in SOCKS support. To route wget through a SOCKS server, wrap it with an external tool such as `proxychains`, or point it at a local HTTP proxy that forwards to the SOCKS server.

### IPv4/IPv6 Preferences

Force IPv4:

```bash
wget -4 https://example.com/file.zip
```

Force IPv6:

```bash
wget -6 https://example.com/file.zip
```

### Connection Settings

wget uses a single connection per download, but you can make that connection more resilient:

```bash
wget --max-redirect=5 --retry-connrefused https://example.com/file.zip
```

## Common Use Cases

### 1. Downloading Software Releases

```bash
#!/bin/bash
# Download the latest software release
VERSION="3.2.1"
wget -O software-${VERSION}.tar.gz \
    "https://github.com/project/releases/download/v${VERSION}/software-${VERSION}.tar.gz"
```

### 2. Website Backup

```bash
#!/bin/bash
# Complete website backup
SITE="example.com"
DATE=$(date +%Y%m%d)

mkdir -p backups/${DATE}
cd backups/${DATE}

wget --mirror \
    --convert-links \
    --adjust-extension \
    --page-requisites \
    --no-parent \
    --directory-prefix=${SITE} \
    https://${SITE}
```

### 3. Downloading Documentation

```bash
# Download all PDF documentation
wget -r -l1 -A "*.pdf" -np https://example.com/docs/
```

### 4. API Data Retrieval

```bash
# Download JSON data from an API
wget --header="Authorization: Bearer TOKEN" \
    --header="Content-Type: application/json" \
    -O data.json \
    "https://api.example.com/data"
```
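Building on the API example above, many APIs page their results. A short sketch of fetching several pages in a loop (the `page` parameter and endpoint are hypothetical; adjust them to the API you are calling):

```bash
#!/bin/bash
# Fetch pages 1-5 of a hypothetical paginated API endpoint
for page in $(seq 1 5); do
    wget --header="Authorization: Bearer TOKEN" \
        -O "data-page-${page}.json" \
        "https://api.example.com/data?page=${page}"
done
```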
### 5. Batch Image Downloads

```bash
#!/bin/bash
# Download images from a list, one per second
mkdir -p images
while IFS= read -r url; do
    filename=$(basename "$url")
    wget -O "images/$filename" "$url"
    sleep 1  # Be respectful to the server
done < image_urls.txt
```

## Troubleshooting

### Common Error Messages

**"Connection refused"**

```bash
# Solution: check the URL and network connectivity, or use a proxy
wget --retry-connrefused --waitretry=1 --read-timeout=20 --timeout=15 -t 0 URL
```

**"Certificate verification failed"**

```bash
# Temporary workaround (use with caution)
wget --no-check-certificate URL

# Better solution: update the system's CA certificates
sudo apt update && sudo apt install ca-certificates
```

**"403 Forbidden"**

```bash
# Try a different user agent
wget --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1)" URL
```

**"File already exists"**

wget has no overwrite flag; by default it saves a second copy with a numeric suffix (e.g. `file.zip.1`). Name the output file explicitly to overwrite, or skip existing files with `--no-clobber`:

```bash
# Overwrite the existing file
wget -O file.zip URL

# Skip files that already exist
wget -nc URL
```

### Network Issues

**Slow downloads:**

```bash
# Use multiple attempts and adjust timeouts
wget --tries=3 --timeout=30 --read-timeout=60 URL
```

**Unstable connections:**

```bash
# Enable resume and retry indefinitely
wget -c --tries=0 --retry-connrefused URL
```

### Permission Problems

**Cannot write to directory:**

```bash
# Check permissions
ls -la /path/to/directory

# Change to a writable directory
cd ~/Downloads
wget URL
```

### Debug Mode

Enable debug output for troubleshooting:

```bash
wget --debug URL
```

## Best Practices

### 1. Be Respectful to Servers

Add delays between requests:

```bash
wget --wait=2 --random-wait -r URL
```

Limit the download rate:

```bash
wget --limit-rate=100k URL
```

### 2. Use Appropriate User Agents

Don't impersonate browsers unnecessarily; use a descriptive user agent:

```bash
wget --user-agent="MyScript/1.0 (contact@example.com)" URL
```

### 3. Handle Errors Gracefully

In scripts, check exit codes:

```bash
#!/bin/bash
if wget -q URL; then
    echo "Download successful"
else
    echo "Download failed with exit code $?"
    exit 1
fi
```

### 4. Organize Downloads

Create directory structures:

```bash
DATE=$(date +%Y-%m-%d)
mkdir -p downloads/$DATE
wget -P downloads/$DATE URL
```

### 5. Log Activities

Keep download logs:

```bash
wget -o download.log -b URL
```

### 6. Security Considerations

Verify checksums when available:

```bash
wget https://example.com/file.zip
wget https://example.com/file.zip.sha256
sha256sum -c file.zip.sha256
```

Use HTTPS whenever possible:

```bash
# Prefer HTTPS over HTTP
wget https://example.com/file.zip
```

### 7. Configuration Files

Create `~/.wgetrc` for default settings:

```bash
# ~/.wgetrc
timeout = 30
tries = 3
continue = on
user_agent = MyWget/1.0
```

### 8. Monitoring Large Downloads

Use progress indicators:

```bash
wget --progress=bar:force:noscroll URL
```

For scripts, use dot progress:

```bash
wget --progress=dot:giga URL
```

### 9. Bandwidth Management

During business hours, limit bandwidth:

```bash
#!/bin/bash
HOUR=$(date +%H)
if [ $HOUR -ge 9 ] && [ $HOUR -le 17 ]; then
    RATE="--limit-rate=100k"
else
    RATE=""
fi
wget $RATE URL
```

### 10. Error Recovery

Implement retry logic:

```bash
#!/bin/bash
MAX_ATTEMPTS=3
ATTEMPT=1

while [ $ATTEMPT -le $MAX_ATTEMPTS ]; do
    if wget -c URL; then
        echo "Download successful on attempt $ATTEMPT"
        break
    else
        echo "Attempt $ATTEMPT failed"
        ATTEMPT=$((ATTEMPT + 1))
        sleep 5
    fi
done

if [ $ATTEMPT -gt $MAX_ATTEMPTS ]; then
    echo "Download failed after $MAX_ATTEMPTS attempts"
    exit 1
fi
```

## Advanced Tips and Tricks

### 1. Custom Headers

Send custom HTTP headers:

```bash
wget --header="X-API-Key: your-api-key" \
    --header="Accept: application/json" \
    URL
```
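When a header carries a secret, keep the token out of your shell history by reading it from the environment. A minimal sketch, assuming an `API_TOKEN` variable has already been exported:

```bash
#!/bin/bash
# Read the bearer token from the environment instead of hard-coding it
: "${API_TOKEN:?API_TOKEN is not set}"
wget --header="Authorization: Bearer ${API_TOKEN}" \
    --header="Accept: application/json" \
    -O response.json \
    "https://api.example.com/data"
```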
### 2. POST Requests

Send POST data:

```bash
wget --post-data="param1=value1&param2=value2" \
    --header="Content-Type: application/x-www-form-urlencoded" \
    URL
```

### 3. Following Redirects

Control redirect behavior:

```bash
wget --max-redirect=10 URL
```

### 4. Timestamping

Only download if the remote file is newer than the local copy:

```bash
wget -N URL
```

### 5. Spider Mode

Check links without downloading:

```bash
wget --spider URL
```

## Performance Optimization

### 1. Concurrent Downloads

wget downloads one file at a time; use GNU parallel for multiple simultaneous downloads:

```bash
parallel -j 4 wget {} :::: urls.txt
```

### 2. Memory Usage

wget streams downloads directly to disk rather than holding them in memory, so even very large files keep memory usage low; no special tuning is required.

### 3. DNS Caching

Within a single invocation, wget caches DNS lookups by default, so downloading several files from the same domain in one command avoids repeated lookups (disable with `--no-dns-cache` if needed):

```bash
wget URL1 URL2 URL3
```

## Conclusion

wget is an incredibly versatile and powerful tool for downloading files from the internet. From simple single-file downloads to complex website mirroring operations, wget provides the functionality needed for virtually any download scenario. This guide has covered everything from basic usage to advanced techniques, troubleshooting, and best practices.

Key takeaways:

1. **Start simple**: Begin with basic downloads and incorporate advanced options as needed
2. **Be respectful**: Always consider server load and implement appropriate delays and rate limiting
3. **Handle errors**: Implement proper error handling and retry logic in automated scripts
4. **Security first**: Use HTTPS when possible and verify file integrity when checksums are available
5. **Organize efficiently**: Structure your downloads and maintain proper logging for future reference

Whether you're a system administrator automating backups, a developer downloading dependencies, or a researcher gathering data, mastering wget will significantly improve your productivity and reliability when working with web-based resources.

Always respect robots.txt files, terms of service, and server resources when using wget for automated downloads. With great power comes great responsibility, and wget certainly provides great power for file downloading and web scraping tasks.

Continue exploring wget's extensive documentation with `man wget` or `wget --help` to discover even more options and capabilities tailored to your specific use cases.