In one of the experiments I'm running for my research, I have to take a snapshot of a page and serve it locally. Of course, if I just grab the HTML, any relative URLs will break and the locally served page is unlikely to look much like the original. So, I put together a bit of code to make the links absolute. I remember trying to do this a few years ago in Python and having enormous headaches, but this Ruby version was relatively painless. That says more about my skills as a coder than anything about the relative (get it?) merits of Python and Ruby.
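For a sense of what "making a link absolute" means in practice, here is a minimal sketch that uses only Ruby's standard uri library. The sample references are made up for illustration; only the base URL comes from the script in this post.

```ruby
require 'uri'

# The page the snapshot was taken from; relative references resolve against it.
base = URI.parse('http://en.wikipedia.org/wiki/Night')

# Illustrative relative references of the kinds found in real pages.
samples = ['/wiki/Day', 'Twilight', '../w/index.php', '#See_also']

# URI#merge applies the usual relative-reference resolution rules.
resolved = samples.map { |ref| base.merge(ref).to_s }
resolved.each { |u| puts u }
```

Each reference comes back as a full http://en.wikipedia.org/... URL, which is what the locally served copy needs.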
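# uri and net/http are in the standard library; hpricot is a gem.
%w[uri net/http hpricot].each { |lib| require lib }
# Fetch the page and parse it into a document we can search and modify.
url = 'http://en.wikipedia.org/wiki/Night'
response = Net::HTTP.get_response(URI.parse(url))
body = Hpricot.parse(response.body)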
# Elements that can carry a URL, mapped to the attributes that hold it.
absolutisable = {
  'a'          => %w[href],
  'applet'     => %w[codebase],
  'area'       => %w[href],
  'blockquote' => %w[cite],
  'body'       => %w[background],
  'del'        => %w[cite],
  'form'       => %w[action],
  'frame'      => %w[longdesc src],
  'iframe'     => %w[longdesc src],
  'head'       => %w[profile],
  'img'        => %w[longdesc src usemap],
  'input'      => %w[src usemap],
  'ins'        => %w[cite],
  'link'       => %w[href],
  'object'     => %w[classid codebase data usemap],
  'q'          => %w[cite],
  'script'     => %w[src],
}
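# Visit every element listed above and rewrite each of its URL attributes.
(body/absolutisable.keys.join('|')).each do |elem|
  absolutisable[elem.name].each do |attr|
    uri = elem.attributes[attr]
    # Resolve the value against the page URL; skip attributes that are absent.
    elem.raw_attributes[attr] = URI.parse(url).merge(uri).to_s unless uri.nil?
  end
end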
puts body
This code doesn't take into account CSS pulled in with @import, and internal CSS links like url(...) will still break, but I think it accounts for everything else.
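One possible way to close that gap is to run a regular expression over the CSS text and resolve each reference the same way the script resolves attributes. The sketch below is an assumption on my part rather than part of the original script: absolutise_css is a hypothetical helper name, the example.com URLs are placeholders, and a regex is a shortcut that will miss unusual CSS a real parser would handle.

```ruby
require 'uri'

# Hypothetical helper, not part of the original script: make url(...) and
# quoted @import references in a chunk of CSS absolute relative to base_url.
def absolutise_css(css, base_url)
  base = URI.parse(base_url)

  resolve = lambda do |ref|
    begin
      base.merge(ref.strip).to_s
    rescue URI::Error
      ref # leave anything the URI library rejects untouched
    end
  end

  # Handles url(foo.png), url('foo.png') and url("foo.png").
  css = css.gsub(/url\(\s*(['"]?)([^'")]+)\1\s*\)/i) do
    quote, ref = $1, $2
    "url(#{quote}#{resolve.call(ref)}#{quote})"
  end

  # Handles @import "foo.css"; the @import url(...) form is covered above.
  css.gsub(/@import\s+(['"])([^'"]+)\1/i) do
    quote, ref = $1, $2
    "@import #{quote}#{resolve.call(ref)}#{quote}"
  end
end

sheet = "@import 'print.css'; a { background: url('../img/bg.png') }"
fixed = absolutise_css(sheet, 'http://example.com/css/site.css')
puts fixed
```

The same function could be applied to the contents of style elements and to any stylesheets saved alongside the snapshot, using each stylesheet's own URL as the base.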