Here is a script I use to monitor pages that don't supply feeds for changes. It can check just a subset of a page using regular expression filters. You feed it a YAML file listing the pages you want to monitor and, optionally, the email address to send the results to (useful if you are running it as a cron job). The links file might originally look something like this:
I say "originally" because the script will add some information to this file to keep track of whether or not the page has changed on subsequent invocations. Here is the script:
The script tries to use the Last-Modified and ETag headers if the server supports them, but falls back to an MD5 hash of the page if necessary. To scrape just a portion of the page, specify "start" and "end" regular expressions in the YAML to tell the script where to start and stop scraping.
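To make the change-detection idea concrete, here is a minimal Python sketch of that logic. It is not the script itself: the function names `extract_region` and `has_changed`, the state keys (`etag`, `last_modified`, `md5`), and the plain-dict representation of an entry and of the response headers are all assumptions made for illustration. Fetching the page, reading the YAML, and sending email are left out.

```python
import hashlib
import re


def extract_region(body, start=None, end=None):
    """Return the slice of body from the start match through the end match.

    A pattern that is missing or does not match leaves that side unbounded.
    """
    begin, search_from = 0, 0
    if start:
        m = re.search(start, body)
        if m:
            begin, search_from = m.start(), m.end()
    stop = len(body)
    if end:
        # Look for the end pattern only after the start match.
        m = re.compile(end).search(body, search_from)
        if m:
            stop = m.end()
    return body[begin:stop]


def has_changed(entry, headers, body):
    """Compare a fresh response with the state saved in entry, then update entry.

    entry   -- hypothetical dict for one page (may hold "start"/"end" patterns)
    headers -- response headers as a plain dict (assumed canonical key names)
    body    -- page text
    """
    region = extract_region(body, entry.get("start"), entry.get("end"))
    new_state = {
        "etag": headers.get("ETag"),
        "last_modified": headers.get("Last-Modified"),
        "md5": hashlib.md5(region.encode("utf-8")).hexdigest(),
    }
    # Prefer the server-supplied validators; the MD5 digest is the fallback
    # and is always available, so the loop always assigns `changed`.
    for key in ("etag", "last_modified", "md5"):
        if new_state[key] is not None:
            changed = entry.get(key) != new_state[key]
            break
    entry.update(new_state)  # remember the state for the next invocation
    return changed
```

In this sketch the first check of a page always reports a change, because there is no saved state to compare against. One design point worth noting: ETag and Last-Modified describe the whole page, so a page with "start" and "end" filters may still be reported as changed when only the unfiltered part differs. Comparing the MD5 of the filtered region alone would avoid that.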