urlcache — Caches local copies of URLs.
urlcache [-m TEXT | --mode TEXT] [-u TEXT | --url TEXT]
         [-r [true]|false | --refresh [true]|false]
         [-d FILENAME | --cache-dir FILENAME]
         [-h [true]|false | --help [true]|false]
         [-H [true]|false | --help-long [true]|false]
         [-v [true]|false | --verbose [true]|false]
         [--version [true]|false]
The urlcache is designed to make it easy to manage local copies of resource files which reside on the Internet. This can be extremely useful when you want to mirror a subset of data which is available on the Internet, or provide data "as needed".
Take a look at the following two invocations which retrieve the external IP address of the machine via http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi. You'll notice how the information was retrieved via wget only upon the first invocation:
[pkb@localhost ~]$ urlcache -m cat -v -u http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
Created directory: /opt/home/pkb/tmp/nst/urlcache/www.networksecuritytoolkit.org/nst/cgi-bin
2005-05-31 06:05:46: Starting download of: http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
/usr/bin/wget -S "http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi"
--06:18:46-- http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
 => `ip.cgi'
Resolving www.networksecuritytoolkit.org... 209.126.140.16
Connecting to www.networksecuritytoolkit.org[209.126.140.16]:80... connected.
HTTP request sent, awaiting response...
 1 HTTP/1.1 200 OK
 2 Date: Tue, 31 May 2005 11:18:42 GMT
 3 Server: Apache/1.3.33 Built by www.CQhost.com (Unix) Chili!Soft-ASP/3.6.2 PHP/4.3.11 mod_ssl/2.8.22 OpenSSL/0.9.7e Resin/2.1.9 mod_auth_pam_external/0.1 FrontPage/4.0.4.3 mod_perl/1.29
 4 Connection: close
 5 Content-Type: text/plain
 [ <=> ] 12 --.--K/s
06:18:47 (117.19 KB/s) - `ip.cgi' saved [12]
2005-05-31 06:05:47: Finished download of: http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
***CACHED_FILE: /opt/home/pkb/tmp/nst/urlcache/www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
65.29.66.13
[pkb@localhost ~]$ urlcache -m cat -v -u http://www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
File in cache: /opt/home/pkb/tmp/nst/urlcache/www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
***CACHED_FILE: /opt/home/pkb/tmp/nst/urlcache/www.networksecuritytoolkit.org/nst/cgi-bin/ip.cgi
65.29.66.13
[pkb@localhost ~]$
The urlcache has several modes of operation which are specified via -m MODE or --mode MODE. The following modes of operation are available (the default is usage if not specified).
--mode cat --url URL
This will first check to see if the information requested is available in the cache. If it isn't, it will be retrieved and added to the cache. The contents of the cached file will then be written to the standard output. If we fail to satisfy the request, an error is reported and the script returns 1. If we satisfy the request, the script returns 0.
--mode clean
The cache directory and ALL of its contents will be deleted when this mode is specified. This will free up disk space, but will require files to be downloaded again the next time they are needed.
--mode dir
This simply prints the root directory which will be used for caching files which are downloaded. If the directory happens to exist, the script will exit with a return code of 0. If the directory has not yet been created (it is only created as needed), the script will exit with a return code of 1.
--mode file --url URL
This will first check to see if the information requested is available in the cache. If it isn't, it will be retrieved and added to the cache. The location of the file stored in the cache is then printed to the standard output. If we fail to satisfy the request, an error is reported and the script returns 1. If we satisfy the request, the script returns 0.
--mode check --url URL
This will check to see if the information requested is available in the cache. The location of the file where the URL would be/is stored in the cache is then printed to the standard output. If the file is present in the cache, the script returns 0. If the file is not yet present, then 1 is returned.
--mode list
This lists all of the directories and files within the cache. It returns 0 unless the cache directory does not exist.
--mode rm --url URL
This determines the location in the cache for the URL specified. If the location corresponds to a directory, then the entire contents of that directory will be pruned (deleted) from the cache. If the location corresponds to a single file, then that file will be removed. If the script actually removes something from the cache directory, 0 will be returned, otherwise 1 will be returned.
If the removal of a file leaves an empty directory, then that directory will be removed as well, along with any of its parent directories that are now empty.
--mode usage
This simply prints the amount of disk space used by the cache via the du command. It always returns 0.
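The return codes described above lend themselves to shell conditionals. The following is a minimal sketch, assuming urlcache is on the PATH; the function name cache_status is hypothetical and not part of urlcache. It relies only on the documented behavior of --mode check, which prints the cache location and returns 0 when the file is present and 1 when it is not.

```shell
# Hypothetical helper (not part of urlcache): report whether a URL is
# already cached, without ever triggering a download.
cache_status() {
  local url="$1" path

  # --mode check prints where the URL is (or would be) stored and
  # returns 0 only when the file is already present in the cache.
  if path="$(urlcache -m check -u "${url}")"; then
    echo "cached: ${path}"
  else
    echo "not cached: ${path}"
    return 1
  fi
}
```

A caller might run cache_status URL before deciding whether a (possibly slow) --mode file or --mode cat request is worthwhile.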
Here are some examples of making use of urlcache:
Example 1. Using urlcache as a Unix PIPE
[root@probe ~]# urlcache -m cat -u http://www.networksecuritytoolkit.org/nst/log/release-1.2.2.xml.gz | gzip -dc | grep snort
<name>airsnort</name> <url>http://airsnort.shmoo.com/</url>
<name>snort</name> <url>http://www.snort.org/</url>
<name>snorter</name> <url>http://www.snort.org/external/?url=http://shweps.free.fr/snorter.html</url>
<name>snort-rules</name> <url>http://www.snort.org/dl/rules</url>
<name>snort</name> <url>http://www.snort.org/</url>
<name>snort-mysql</name> <url>http://www.snort.org/</url>
[root@probe ~]#
The following makes sure we have the page associated with http://www.google.com/ and then runs the wc command against it.
Example 2. Using urlcache To Download A File
[root@probe ~]# FILE="$(urlcache -m file -u http://www.google.com/)"
[root@probe ~]# if [ -r "${FILE}" ]; then wc "${FILE}"; fi
12 101 2077 /root/tmp/nst/urlcache/www.google.com/default.root.document
[root@probe ~]#
The following makes sure we have at least two files from http://www.google.com/ in the cache and one file from another location. It then uses --mode rm --url URL to specify that we want ALL files cached from the http://www.google.com/ domain to be removed.
Example 3. Cleaning Files Downloaded From http://www.google.com/
[pkb@localhost ~]$ urlcache -m file -u http://www.google.com/index.html
/opt/home/pkb/tmp/nst/urlcache/www.google.com/index.html
[pkb@localhost ~]$ urlcache -m file -u http://www.google.com/options/index.html
/opt/home/pkb/tmp/nst/urlcache/www.google.com/options/index.html
[pkb@localhost ~]$ urlcache -m file -u http://www.mekwin.com/welcome.html
/opt/home/pkb/tmp/nst/urlcache/www.mekwin.com/welcome.html
[pkb@localhost ~]$ urlcache
Cache directory (/opt/home/pkb/tmp/nst/urlcache) usage:
8.0K ./www.mekwin.com
20K ./www.google.com/options
28K ./www.google.com
40K .
[pkb@localhost ~]$ urlcache -m rm -u http://www.google.com/
[pkb@localhost ~]$ urlcache
Cache directory (/opt/home/pkb/tmp/nst/urlcache) usage:
8.0K ./www.mekwin.com
12K .
[pkb@localhost ~]$
Finally, here is a snippet from a script which downloads border, road and river data from an imaginary GIS server. This demonstrates how one might use urlcache to load a subset of pertinent data from a server containing massive amounts of information.
#
# Set variables for information to get.
#
GIS_ROOT_URL="http://www.imaginary-gis-server.com/map/data";
DATA_STATES="IL IN OH";
DATA_TYPES="border road river";

#
# Make sure we have all data cached and available
#
ALL_OK="true";   # Assume success
for s in ${DATA_STATES}; do
  for t in ${DATA_TYPES}; do
    # Fetch MIF and MDB files
    for e in mif mdb; do
      if ! urlcache -m file -u "${GIS_ROOT_URL}/${s}/${t}.${e}"; then
        ALL_OK="false";
      fi
    done;
  done;
done;
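The same pattern can be wrapped in a reusable function. The sketch below is one possible generalization, not part of urlcache: the name cache_all is hypothetical, and it assumes only the documented behavior that --mode file prints the cached location and returns 1 when the request cannot be satisfied.

```shell
# Hypothetical helper (not part of urlcache): make sure every URL given
# as an argument is cached. Returns 0 only if all requests succeeded.
cache_all() {
  local url status=0

  for url in "$@"; do
    # Discard the cache path printed by --mode file; only the exit
    # code matters here.
    if ! urlcache -m file -u "${url}" > /dev/null; then
      echo "failed to cache: ${url}" >&2
      status=1
    fi
  done

  return "${status}"
}
```

Unlike a bare ALL_OK flag, this variant also names each URL that could not be retrieved on the standard error stream, which can help when many files are requested at once.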
The following command line options are available:
[-m TEXT] | [--mode TEXT]
This option controls what urlcache will do. If you specify usage (the default), it will show disk usage information for the cache. If you specify list, it will list ALL of the files within the cache. If you specify dir, it will report the top level directory for the cache and exit with a return code of 0 if it exists. If you specify file, it will download the URL specified by --url URL if necessary and then display the path to the file created in the cache. If you specify cat, it will cat the contents of the URL specified by --url URL (using cached contents if possible). If you specify rm, it will determine the location in the cache for the file/directory associated with the URL specified by --url URL. If the file/directory exists, it will then be pruned (removed) from the cache. If you specify clean, the entire cache directory will be removed. If you specify check, then the location which the URL will be/is mapped to in the cache is displayed (but no attempt is made to retrieve it).
[-u TEXT] | [--url TEXT]
This option must be specified if you use the --mode cat, --mode file, --mode check or --mode rm option, as it indicates which URL urlcache should operate on.
[-r [true]|false] | [--refresh [true]|false]
This option may be specified if you use the --mode cat or --mode file option. When this option is specified, we will fetch the file from the server if we don't have it yet, or if the version in the cache is older than the version on the server. NOTE: This feature is unreliable and may result in files that are always downloaded, or never updated.
[-d FILENAME] | [--cache-dir FILENAME]
This allows one to change the root directory used for caching files which are downloaded. If omitted, the default value of $HOME/tmp/nst/urlcache will be used. If this option is used, then the directory specified MUST exist (we only create the default cache directory).
[-h [true]|false] | [--help [true]|false]
When this option is specified, urlcache will display a short one line description of urlcache, followed by a short description of each of the supported command line options. After displaying this information urlcache will terminate.
[-H [true]|false] | [--help-long [true]|false]
This option will attempt to pull up additional urlcache documentation within a text based web browser. You can force which browser we use by setting the environment variable TEXTBROWSER; otherwise, we will search for some common ones.
[-v [true]|false] | [--verbose [true]|false]
When you set this option to true, urlcache will produce additional output. This is typically used for diagnostic purposes to help track down when things go wrong.
[--version [true]|false]
If this option is specified, the version number of the script is displayed.
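Given the caveat attached to --refresh, one workaround is to drop the cached copy and fetch it again using only the rm and file modes. The following is a minimal sketch of that idea, not a feature of urlcache; the function name force_refresh is hypothetical. Note that if the URL maps to a directory in the cache (for example, a bare site URL), --mode rm prunes that entire directory.

```shell
# Hypothetical helper (not part of urlcache): force a fresh download of
# a URL by removing any cached copy first.
force_refresh() {
  local url="$1"

  # --mode rm returns 1 when nothing was removed; that is not an error
  # here, so the exit code is deliberately ignored.
  urlcache -m rm -u "${url}" > /dev/null || true

  # --mode file now has to retrieve the URL again; it prints the cached
  # location and returns 1 if the download fails.
  urlcache -m file -u "${url}"
}
```

This always downloads, so it trades bandwidth for certainty and is best reserved for files known to change.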
${HOME}/tmp/nst/urlcache
The default location for the URL cache directory (when not specified on the command line). This directory (and any necessary sub directories) will be created as needed.
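Unlike the default location, a directory passed via --cache-dir is not created automatically and must already exist. The sketch below shows one way to guarantee that before invoking urlcache; the wrapper name urlcache_in is hypothetical and not part of urlcache.

```shell
# Hypothetical wrapper (not part of urlcache): run urlcache against a
# custom cache directory, creating that directory first if needed.
urlcache_in() {
  local dir="$1"
  shift

  # urlcache only creates the default cache directory, so make sure a
  # custom one exists before passing it via -d.
  mkdir -p "${dir}" || return 1

  # Hand all remaining arguments (mode, URL, ...) through unchanged.
  urlcache -d "${dir}" "$@"
}
```

For example, urlcache_in /var/tmp/gis-cache -m file -u URL would keep one project's files in a cache of their own (the path shown is purely illustrative).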