Tuesday, November 11, 2008

Ruby Page Monitor

Here is a script I use to monitor pages that don't supply feeds for changes. It can watch just a subset of a page by applying regular expression filters. You feed it a YAML file of the pages you want to monitor and, optionally, the email address to send the results to (handy if you are running it as a cron job). The links file might originally look something like this:

http://www2.hawaii.edu/~lepape/index.html:
http://www.mvleadvocate.com/:
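
If you do drive it from cron, the crontab entry might look something like this (the paths and schedule here are just an example):

0 7 * * * /usr/bin/ruby /home/dburger/bin/pagemon.rb /home/dburger/links.yaml dburger@hawaii.edu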

I say "originally" because the script will add some information to this file to keep track of whether or not the page has changed on subsequent invocations. Here is the script:

#!/usr/bin/ruby -w
%w[rubygems digest/md5 net/smtp pony rest-open-uri yaml].each do |lib|
  require lib
end

abort "Usage is: #{File.basename($0)} links [to_email [from_email]]" if ARGV.length < 1

# Builds the info hash for a url from the response, the previously stored
# info, and the md5 of the (possibly filtered) body.
def build_info(response, old_info, md5)
  new_info = Hash[old_info]
  new_info['status'] = response.status[0]
  new_info['etag'] = response.meta['etag']
  new_info['last_modified'] = response.last_modified
  new_info['md5'] = md5
  new_info['changed'] = case new_info['status']
                        when '304' then false
                        when '200' then !(old_info && old_info['md5'] == md5)
                        else true
                        end
  new_info
end

# Does a possibly filtered read from io: skips lines up to and including
# the first match of start, then collects lines until the first match of
# ender (exclusive).
def filter(io, start, ender)
  return io.read if !start && !ender
  start = /#{start}/ if start
  ender = /#{ender}/ if ender
  ret = ''
  # only skip leading lines when a start pattern was actually given
  if start
    while !io.eof? && io.readline !~ start; end
  end
  while !io.eof? && (line = io.readline) !~ ender
    ret << line
  end
  ret
end

filename, to_email, from_email = *ARGV
results = {}
YAML.load(File.open(filename)).each do |url, info|
  info ||= {}
  headers = {:method => :head, 'accept-encoding' => 'gzip',
             'user-agent' => 'ruby pagemon dburger@hawaii.edu'}
  # send conditional request headers when a prior run stored them
  headers['If-Modified-Since'] = info['last_modified'].to_s if info['last_modified']
  headers['If-None-Match'] = info['etag'] if info['etag']
  md5 = nil
  begin
    response = open(url, headers)
    # fallback check for 200 status, needed especially if filtering
    if response.status[0] == '200'
      headers[:method] = :get
      # need to get actual text and not gzip if filtering
      headers.delete('accept-encoding') if info['start'] || info['end']
      response = open(url, headers)
      md5 = Digest::MD5.hexdigest(filter(response, info['start'], info['end']))
    end
  rescue OpenURI::HTTPError => e
    # 304 Not Modified and other non-2xx statuses land here
    response = e.io
  end
  new_info = build_info(response, info, md5)
  results[url] = new_info
  puts "#{url}: #{response.status[0]}, #{new_info['changed']}"
end

File.open(filename, 'w') { |f| f.puts(YAML::dump(results)) }

message = results.inject('') do |memo, (url, info)|
  memo << (info['changed'] ? "#{url} changed (#{info['status']})\n" : '')
end
if to_email && message.length > 0
  Pony.mail(:to => to_email, :from => from_email || to_email,
            :subject => 'pagemon report', :body => message)
end
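
To see exactly what the filter function extracts, here is a standalone sketch you can run on its own. The HTML and the regular expressions are invented for the example; filter is the same method as in the script:

#!/usr/bin/ruby -w
require 'stringio'

def filter(io, start, ender)
  return io.read if !start && !ender
  start = /#{start}/ if start
  ender = /#{ender}/ if ender
  ret = ''
  # consume lines up to and including the first match of start
  if start
    while !io.eof? && io.readline !~ start; end
  end
  # collect lines until the first match of ender (not included)
  while !io.eof? && (line = io.readline) !~ ender
    ret << line
  end
  ret
end

page = StringIO.new(<<HTML)
<html><body>
<div id="news">
Storm closes campus
</div>
<div id="footer">contact us</div>
</body></html>
HTML

puts filter(page, 'id="news"', '^</div>')   # prints "Storm closes campus"

Only the lines between the two matches feed the MD5, so edits anywhere else on the page don't register as changes.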

The script uses the Last-Modified and ETag headers when the server supports them, and falls back to comparing an MD5 hash of the page when it doesn't. To scrape just a portion of the page, add "start" and "end" regular expressions to a URL's entry in the YAML file to tell the script where to start and stop scraping.
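
An entry that watches only one section of a page might look something like this (the markup and regular expressions here are made up for illustration):

http://www2.hawaii.edu/~lepape/index.html:
  start: <div id="news">
  end: </div>

Because build_info copies the old entry before adding its own fields, the "start" and "end" keys survive the rewrite of the links file between runs.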