In one of the experiments I'm running for my research, I have to take a snapshot of a page and serve it locally. Of course, if I just grab the HTML, any relative URLs will break and the locally served page is unlikely to look much like the original. So, I put together a bit of code to make the links absolute. I remember trying to do this a few years ago in Python and having enormous headaches, but this Ruby version was relatively painless. That says more about my skills as a coder than anything about the relative (get it?) merits of Python and Ruby.
%w[uri net/http hpricot].each {|lib| require lib}

url = 'http://en.wikipedia.org/wiki/Night'
response = Net::HTTP.get_response(URI.parse(url))
body = Hpricot.parse(response.body)

# Elements mapped to the attributes on them that can hold a URL
absolutisable = {
  'a' => %w[href],
  'applet' => %w[codebase],
  'area' => %w[href],
  'blockquote' => %w[cite],
  'body' => %w[background],
  'del' => %w[cite],
  'form' => %w[action],
  'frame' => %w[longdesc src],
  'iframe' => %w[longdesc src],
  'head' => %w[profile],
  'img' => %w[longdesc src usemap],
  'input' => %w[src usemap],
  'ins' => %w[cite],
  'link' => %w[href],
  'object' => %w[classid codebase data usemap],
  'q' => %w[cite],
  'script' => %w[src],
}

# Resolve each URL-valued attribute against the address the page came from
(body/"#{absolutisable.keys.join('|')}").each do |elem|
  absolutisable[elem.name].each do |attr|
    uri = elem.attributes[attr]
    elem.raw_attributes[attr] = URI::parse(url).merge(uri).to_s unless uri.nil?
  end
end

puts body
This code doesn't handle stylesheets pulled in with @import, and relative links inside the CSS itself, such as url() values, will still break, but I think it accounts for everything else.
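For the CSS gap, something along these lines might work. It is only a rough sketch, not part of the script above: `absolutise_css` is a name I made up for illustration, it relies on regular expressions rather than a real CSS parser, and it assumes you have already pulled the CSS text out of a style element, a style attribute, or a downloaded stylesheet. It resolves references with the same `URI#merge` call the script uses.

```ruby
require 'uri'

# Hypothetical helper (an assumption, not from the original script):
# makes url(...) values and string-form @import rules in a chunk of CSS
# absolute, resolving them against base_url.
def absolutise_css(css, base_url)
  base = URI.parse(base_url)

  # Resolve one reference; hand back the original text if URI rejects it.
  resolve = lambda do |ref|
    begin
      base.merge(ref.strip).to_s
    rescue URI::Error
      ref
    end
  end

  # Handles url(foo.png), url('foo.png') and url("foo.png").
  # This also covers the @import url(...) form.
  css = css.gsub(/url\(\s*(['"]?)(.*?)\1\s*\)/i) do
    m = Regexp.last_match
    quote, ref = m[1], m[2]
    if ref.start_with?('data:')
      m[0] # embedded data needs no rewriting
    else
      "url(#{quote}#{resolve.call(ref)}#{quote})"
    end
  end

  # Handles @import "foo.css" and @import 'foo.css'.
  css.gsub(/(@import\s+)(['"])(.*?)\2/i) do
    m = Regexp.last_match
    "#{m[1]}#{m[2]}#{resolve.call(m[3])}#{m[2]}"
  end
end

page = 'http://en.wikipedia.org/wiki/Night'
puts absolutise_css('body { background: url(../img/bg.png); }', page)
puts absolutise_css('@import "/style/main.css";', page)
```

Note that for a linked stylesheet the base should be the stylesheet's own URL rather than the page's, since that is what relative references inside it are resolved against. A regex pass like this will also miss odd cases such as url() text inside comments or escaped quotes.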