http://www2.hawaii.edu/~lepape/index.html:
http://www.mvleadvocate.com/:
I say "originally" because the script will add some information to this file to keep track of whether or not the page has changed on subsequent invocations. Here is the script:
#!/usr/bin/ruby -w
%w[rubygems digest/md5 net/smtp pony rest-open-uri yaml].each do |lib|
  require lib
end

abort "Usage is: #{File.basename($0)} links" if ARGV.length < 1

# Builds the info hash about the result for a url given the response.
def build_info(response, old_info, md5)
  new_info = Hash[old_info]
  new_info['status'] = response.status[0]
  new_info['etag'] = response.meta['etag']
  new_info['last_modified'] = response.last_modified
  new_info['md5'] = md5
  new_info['changed'] = case new_info['status']
    when '304' then false
    when '200' then !(old_info && old_info['md5'] == md5)
    else true
  end
  new_info
end
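
# Does a possibly filtered read from io.
def filter(io, start, ender)
  return io.read if !start && !ender
  start = /#{start}/ if start
  ender = /#{ender}/ if ender
  ret = ''
  # Skip up to and including the first line matching start. Without this
  # guard, a config with only "end" would skip the entire page.
  if start
    while !io.eof? && io.readline !~ start; end
  end
  # Collect lines until one matches ender, or until EOF if no ender is given.
  while !io.eof?
    line = io.readline
    break if ender && line =~ ender
    ret << line
  end
  ret
end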

filename, to_email, from_email = *ARGV
results = {}
YAML.load(File.open(filename)).each do |url, info|
  info ||= {}
  headers = {:method => :head, 'accept-encoding' => 'gzip',
             'user-agent' => 'ruby pagemon dburger@hawaii.edu'}
  headers['If-Modified-Since'] = info['last_modified'].to_s if info['last_modified']
  headers['If-None-Match'] = info['etag'] if info['etag']
  md5 = nil
  begin
    response = open(url, headers)
    # fallback check for 200 status, needed especially if filtering
    if response.status[0] == '200'
      headers[:method] = :get
      # need to get actual text and not gzip if filtering
      headers.delete('accept-encoding') if info['start'] || info['end']
      response = open(url, headers)
      md5 = Digest::MD5.hexdigest(filter(response, info['start'], info['end']))
    end
  rescue OpenURI::HTTPError => e
    response = e.io
  end
  new_info = build_info(response, info, md5)
  results[url] = new_info
  puts "#{url}: #{response.status[0]}, #{new_info['changed']}"
end

File.open(filename, 'w') {|f| f.puts(YAML::dump(results))}

message = results.inject('') do |memo, (url, info)|
  memo << ((info['changed']) ? "#{url} changed (#{info['status']})\n" : '')
end
if to_email && message.length > 0
  Pony.mail(:to => to_email, :from => from_email || to_email,
            :subject => 'pagemon report', :body => message)
end
The script sends conditional requests using the Last-Modified and ETag values when the server supports them, and falls back to comparing an MD5 hash of the page body otherwise. To monitor just a portion of a page, add "start" and "end" regular expressions under that URL's entry in the YAML file; the script then hashes only the lines between the first line matching "start" and the next line matching "end".
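To make the "start" and "end" options concrete, here is a rough, self-contained sketch of a filtered entry and the resulting hash. The URL, both patterns, the sample page, and the helper name section_between are all made up for illustration; the helper is a simplified stand-in for the script's filter, not the script itself, and it reads from a string rather than the network.

```ruby
require 'digest/md5'
require 'stringio'
require 'yaml'

# Hypothetical entry: the URL and both patterns are invented for this example.
CONFIG = <<~YAML
  http://example.com/news.html:
    start: '<div id="content">'
    end: '</div>'
YAML

# Hypothetical page body standing in for a fetched response.
PAGE = <<~HTML
  <html><body>
  <p>Visitor count: 1234</p>
  <div id="content">
  <p>Actual news item</p>
  </div>
  </body></html>
HTML

# Simplified stand-in for the script's filter: returns the lines strictly
# between the first line matching start_pat and the next line matching end_pat.
def section_between(io, start_pat, end_pat)
  return io.read unless start_pat || end_pat
  start_re = start_pat && Regexp.new(start_pat)
  end_re = end_pat && Regexp.new(end_pat)
  kept = +''
  # Consume lines up to and including the first match of the start pattern.
  io.each_line { |line| break if line =~ start_re } if start_re
  # Keep lines until the end pattern matches (or EOF when there is none).
  io.each_line do |line|
    break if end_re && line =~ end_re
    kept << line
  end
  kept
end

YAML.load(CONFIG).each do |url, info|
  body = section_between(StringIO.new(PAGE), info['start'], info['end'])
  puts "#{url}: #{Digest::MD5.hexdigest(body)}"
end
```

With an entry like this, a change to the visitor counter outside the matched region should leave the hash untouched, so only edits to the content block would be reported.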