urlcache

Name

urlcache — Caches local copies of URLs.

Synopsis

Description

The urlcache is designed to make it easy to manage local copies of resource files which reside on the Internet. This can be extremely useful when you want to mirror a subset of data which is available on the Internet, or provide data "as needed".

Take a look at the following two invocations which retrieve the external IP address of the machine via http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi. You'll notice how the information was retrieved via wget only upon the first invocation:

[pkb@localhost ~]$ urlcache -m cat -v -u\
 http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
Created directory: /opt/home/pkb/tmp/nst/urlcache/www.networksecuritytoolkit.org/nst/cgi-bin
2005-05-31 06:05:46: Starting download of: http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
  /usr/bin/wget -S "http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi"
--06:18:46--  http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
           => `ip.cgi'
Resolving www.networksecuritytoolkit.org... 209.126.140.16
Connecting to www.networksecuritytoolkit.org[209.126.140.16]:80... connected.
HTTP request sent, awaiting response...
 1 HTTP/1.1 200 OK
 2 Date: Tue, 31 May 2005 11:18:42 GMT
 3 Server: Apache/1.3.33 Built by www.CQhost.com (Unix) Chili!Soft-ASP/3.6.2 PHP/4.3.11 mod_ssl/2.8.22 OpenSSL/0.9.7e Resin/2.1.9 mod_auth_pam_external/0.1 FrontPage/4.0.4.3 mod_perl/1.29
 4 Connection: close
 5 Content-Type: text/plain

    [ <=>                                 ] 12            --.--K/s

06:18:47 (117.19 KB/s) - `ip.cgi' saved [12]

2005-05-31 06:05:47: Finished download of: http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
***CACHED_FILE: /opt/home/pkb/tmp/nst/urlcache/www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
65.29.66.13
[pkb@localhost ~]$ urlcache -m cat -v -u\
 http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
File in cache: /opt/home/pkb/tmp/nst/urlcache/www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
***CACHED_FILE: /opt/home/pkb/tmp/nst/urlcache/www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
65.29.66.13
[pkb@localhost ~]$
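
The transcripts in this document suggest that a cached file lives under the cache root in a directory named after the host, followed by the path from the URL, and that a URL ending in / is stored as default.root.document (see Example 2). The following is only a sketch of that apparent layout. The function name url_to_cache_path is hypothetical and not part of urlcache, and query strings, ports and other special cases are not handled. Use urlcache -m check -u URL to get the location the script itself reports.

```shell
#!/bin/sh
# Hypothetical helper (NOT part of urlcache): guess the cache location of a URL.
# Assumption: the layout is CACHE_ROOT/host/path, as seen in the transcripts.
url_to_cache_path() {
  ucp_url="$1"
  ucp_root="${2:-${HOME}/tmp/nst/urlcache}"   # default cache root from the docs
  ucp_rest="${ucp_url#*://}"                  # strip the scheme (http://)
  case "${ucp_rest}" in
    */*) ;;                                   # already has a path component
    *) ucp_rest="${ucp_rest}/" ;;             # bare host: treat as root document
  esac
  case "${ucp_rest}" in
    # Assumption: a trailing slash maps to the name shown in Example 2.
    */) ucp_rest="${ucp_rest}default.root.document" ;;
  esac
  printf '%s/%s\n' "${ucp_root}" "${ucp_rest}"
}
```

For example, url_to_cache_path http://www.google.com/ /root/tmp/nst/urlcache prints the same path that Example 2 reports.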

The urlcache has several modes of operation which are specified via -m MODE or --mode MODE. The following modes of operation are available (the default is usage if not specified).

--mode cat --url URL: This first checks whether the requested information is available in the cache. If it isn't, it is retrieved and added to the cache. The contents of the cached file are then written to the standard output. If the request cannot be satisfied, an error is reported and the script returns 1. If the request is satisfied, the script returns 0.
--mode clean: The cache directory and ALL of its contents will be deleted when this mode is specified. This will free up disk space, but will require files to be downloaded again the next time they are needed.
--mode dir: This simply prints the root directory which will be used for caching files which are downloaded. If the directory happens to exist, the script will exit with a return code of 0. If the directory has not yet been created (it is only created as needed), the script will exit with a return code of 1.
--mode file --url URL: This first checks whether the requested information is available in the cache. If it isn't, it is retrieved and added to the cache. The location of the file stored in the cache is then printed to the standard output. If the request cannot be satisfied, an error is reported and the script returns 1. If the request is satisfied, the script returns 0.
--mode check --url URL: This will check to see if the information requested is available in the cache. The location of the file where the URL would be/is stored in the cache is then printed to the standard output. If the file is present in the cache, the script returns 0. If the file is not yet present, then 1 is returned.
--mode list: This lists all of the directories and files within the cache. It returns 0 unless the cache directory does not exist.
--mode rm --url URL: This determines the location in the cache for the URL specified. If the location corresponds to a directory, then the entire contents of that directory will be pruned (deleted) from the cache. If the location corresponds to a single file, then that file will be removed. If the script actually removes something from the cache directory, 0 will be returned, otherwise 1 will be returned.
Note
If the removal of a file leaves an empty directory, then that directory will be removed as well, along with any parent directories which are now empty.
--mode usage: This simply prints the amount of disk space used by the cache via the du command. It always returns 0.
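
Because each mode reports its result through the exit code, a script can branch on urlcache directly. The sketch below relies on the documented behavior of --mode check (the location is printed whether or not the file is cached, and the exit code is 0 only when it is). The function name cache_lookup is hypothetical and not part of urlcache.

```shell
#!/bin/sh
# Hypothetical wrapper (NOT part of urlcache): report whether a URL is cached.
# --mode check prints the cache location in both cases, so the path is
# captured first and the exit code decides which message to print.
cache_lookup() {
  if cl_path="$(urlcache -m check -u "$1")"; then
    echo "hit: ${cl_path}"
  else
    echo "miss: ${cl_path}"
  fi
}
```

A call such as cache_lookup http://www.google.com/index.html never downloads anything, since check makes no attempt to retrieve the file.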

Examples

Here are some examples of making use of urlcache:

Example 1. Using urlcache as a Unix PIPE

[root@probe ~]# urlcache -m cat -u\
 http://www.networksecuritytoolkit.org/nst/log/release-1.2.2.xml.gz |\
 gzip -dc | grep snort 
      <name>airsnort</name>
      <url>http://airsnort.shmoo.com/</url>
      <name>snort</name>
      <url>http://www.snort.org/</url>
      <name>snorter</name>
      <url>http://www.snort.org/external/?url=http://shweps.free.fr/snorter.html</url>
      <name>snort-rules</name>
      <url>http://www.snort.org/dl/rules</url>
      <name>snort</name>
      <url>http://www.snort.org/</url>
      <name>snort-mysql</name>
      <url>http://www.snort.org/</url>

[root@probe ~]#

The following makes sure we have the page associated with http://www.google.com/ and then runs the wc command against it.

Example 2. Using urlcache To Download A File

[root@probe ~]# FILE="$(urlcache -m file -u http://www.google.com/)"
[root@probe ~]# if [ -r "${FILE}" ]; then wc "${FILE}"; fi
  12  101 2077 /root/tmp/nst/urlcache/www.google.com/default.root.document
[root@probe ~]#

The following makes sure we have at least two files from http://www.google.com/ in the cache and one file from another location. It then uses --mode rm --url URL to remove ALL files cached from the http://www.google.com/ domain.

Example 3. Cleaning Files Downloaded From http://www.google.com/

[pkb@localhost ~]$ urlcache -m file -u http://www.google.com/index.html
/opt/home/pkb/tmp/nst/urlcache/www.google.com/index.html
[pkb@localhost ~]$ urlcache -m file -u http://www.google.com/options/index.html
/opt/home/pkb/tmp/nst/urlcache/www.google.com/options/index.html
[pkb@localhost ~]$ urlcache -m file -u http://www.mekwin.com/welcome.html
/opt/home/pkb/tmp/nst/urlcache/www.mekwin.com/welcome.html
[pkb@localhost ~]$ urlcache

Cache directory (/opt/home/pkb/tmp/nst/urlcache) usage:

8.0K    ./www.mekwin.com
20K     ./www.google.com/options
28K     ./www.google.com
40K     .
[pkb@localhost ~]$ urlcache -m rm -u http://www.google.com/
[pkb@localhost ~]$ urlcache

Cache directory (/opt/home/pkb/tmp/nst/urlcache) usage:

8.0K    ./www.mekwin.com
12K     .
[pkb@localhost ~]$

Finally, here is a snippet from a script which downloads border, road and river data from an imaginary GIS server. This demonstrates how one might use urlcache to load a subset of pertinent data from a server containing massive amounts of information.

#
# Set variables for information to get.
#
GIS_ROOT_URL="http://www.imaginary-gis-server.com/map/data";
DATA_STATES="IL IN OH";
DATA_TYPES="border road river";

#
# Make sure we have all data cached and available
#
ALL_OK="true"; # Assume success

for s in ${DATA_STATES}; do
  for t in ${DATA_TYPES}; do
    # Fetch MIF and MDB files
    for e in mif mdb; do
      if ! urlcache -m file -u "${GIS_ROOT_URL}/${s}/${t}.${e}"; then
        ALL_OK="false";
      fi
    done;
  done;
done;
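
The snippet above records failures in ALL_OK but stops before acting on them. One way a script might package the same idea is sketched below: a function that caches every URL it is given and returns non-zero if any of them could not be cached. The function name cache_all and the wording of the message are hypothetical and not part of urlcache. It relies only on the documented exit code of --mode file.

```shell
#!/bin/sh
# Hypothetical helper (NOT part of urlcache): cache every URL given as an
# argument. Returns 0 only if urlcache -m file succeeded for all of them.
cache_all() {
  ca_failed=0
  for ca_url in "$@"; do
    # --mode file prints the cached path; it is discarded here because only
    # the exit code matters for this check.
    if ! urlcache -m file -u "${ca_url}" >/dev/null; then
      echo "could not cache: ${ca_url}" >&2
      ca_failed=1
    fi
  done
  return "${ca_failed}"
}
```

A caller could then write: if ! cache_all "${GIS_ROOT_URL}/IL/border.mif" "${GIS_ROOT_URL}/IL/border.mdb"; then exit 1; fi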

Options

The following command line options are available:

[-m TEXT] | [--mode TEXT]: This option controls what urlcache will do. If you specify usage (the default), it will show the disk usage of the cache. If you specify list, it will list ALL of the files within the cache. If you specify dir, it will report the top level directory for the cache and exit with a return code of 0 if it exists (1 if it does not). If you specify file, it will download the URL specified by --url URL if necessary and then display the path to the file in the cache. If you specify cat, it will cat the contents of the URL specified by --url URL (using cached contents if possible). If you specify rm, it will determine the location in the cache for the file/directory associated with the URL specified by --url URL. If the file/directory exists, it will then be pruned (removed) from the cache. If you specify clean, the entire cache directory will be removed. If you specify check, then the location which the URL will be/is mapped to in the cache is displayed (but no attempt is made to retrieve it).
[-u TEXT] | [--url TEXT]: This option must be specified if you use the --mode cat, --mode check, --mode file or --mode rm option, as it indicates which URL the mode operates on.
[-r [true]|false] | [--refresh [true]|false]: This option may be specified if you use the --mode cat or --mode file option. When this option is specified, we will fetch the file from the server if we don't have it yet, or if the version in the cache is older than the version on the server. NOTE: This feature is unreliable and may result in files that are always downloaded, or never updated.
[-d FILENAME] | [--cache-dir FILENAME]: This allows one to change the root directory used for caching files which are downloaded. If omitted, the default value of $HOME/tmp/nst/urlcache will be used. If this option is used, then the directory specified MUST exist (we only create the default cache directory).
[-h [true]|false] | [--help [true]|false]: When this option is specified, urlcache will display a short one line description of urlcache, followed by a short description of each of the supported command line options. After displaying this information urlcache will terminate.
[-H [true]|false] | [--help-long [true]|false]: This option will attempt to pull up additional urlcache documentation within a text based web browser. You can force which browser we use by setting the environment variable TEXTBROWSER, otherwise we will search for some common ones.
[-v [true]|false] | [--verbose [true]|false]: When you set this option to true, urlcache will produce additional output. This is typically used for diagnostic purposes to help track down when things go wrong.
[--version [true]|false]: If this option is specified, the version number of the script is displayed.
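
Since a directory passed via --cache-dir MUST already exist, a caller that wants a non-default cache location has to create it first. A minimal sketch of that, assuming --cache-dir may be given ahead of the other options (the option order is an assumption). The function name with_cache_dir is hypothetical and not part of urlcache.

```shell
#!/bin/sh
# Hypothetical wrapper (NOT part of urlcache): run urlcache against a custom
# cache directory, creating the directory first because urlcache only creates
# the default one.
with_cache_dir() {
  wcd_dir="$1"
  shift
  mkdir -p "${wcd_dir}" || return 1
  # Remaining arguments (for example: -m file -u URL) are passed through.
  urlcache --cache-dir "${wcd_dir}" "$@"
}
```

For example: with_cache_dir /var/tmp/gis-cache -m usage (the directory name here is only an illustration).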

Files

${HOME}/tmp/nst/urlcache: The default location for the URL cache directory (when not specified on the command line). This directory (and any necessary sub directories) will be created as needed.

Environment

TEXTBROWSER: This controls what text based browser is used to display help information about the script. If not set, we will search your system for available text-based browsers (Ex: elinks, lynx ...).
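
The lookup described above can be approximated as follows. This is not the script's actual search: only elinks and lynx are named in this document, so the candidate list and its order are assumptions, and the function name pick_text_browser is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch (NOT urlcache's real code): choose a text based browser.
# TEXTBROWSER wins when set; otherwise try the two browsers named in the docs.
pick_text_browser() {
  if [ -n "${TEXTBROWSER:-}" ]; then
    echo "${TEXTBROWSER}"
    return 0
  fi
  for ptb_name in elinks lynx; do      # assumed search order
    if command -v "${ptb_name}" >/dev/null 2>&1; then
      echo "${ptb_name}"
      return 0
    fi
  done
  return 1                             # nothing suitable was found
}
```

Setting TEXTBROWSER=lynx in the environment before running urlcache --help-long would therefore be expected to select lynx regardless of what else is installed.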