November 04, 2004
Curious, George: Scraping websites
I've got a VERY rudimentary understanding of Perl but, so far, I've been able to put together my website by using free scripts from sites like this one. Alas, my google-fu doesn't seem to be strong enough to find a script for what I want to do next.
I'm using this script to allow visitors to e-mail their friends about my site. I'd like to modify it so that, instead of e-mailing the actual URL of a page, it submits the URL to tinyurl.com, scrapes the new tiny-fied URL from the resulting webpage, and then e-mails that URL instead. It seems like it ought to be fairly easy to do this, but I can't find existing code that I can easily cut-and-paste, and after spending two days trying to figure out how to do it myself, my tiny brain aches. Is there an easy way to do this? Or is this one of those deceptively complex things that I should just leave to the professionals?
#!/usr/bin/env perl
use strict;
use warnings;

my $url = $ARGV[0] or die "Usage: $0 <url>\n";

# Ask tinyurl.com to shorten the URL and capture the HTML it sends back.
# Note: $url is interpolated straight into a shell command, so escape or
# quote it if it can come from untrusted input.
my $reply = `curl -s 'http://tinyurl.com/create.php?url=$url'`;

# I haven't seen any tinyurls longer than 6 characters, so 7 should be OK:
if ($reply =~ m!(http://tinyurl\.com/[0-9a-z]{3,7})!) {
    print "$1\n";
}
# Put some decent error handling here:
else {
    print "Error!\n";
}
It might be better to use the LWP library instead of curl; this is a bit quick and dirty...
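For what that LWP version might look like, here is a rough sketch using LWP::UserAgent in place of the curl call. It assumes the libwww-perl distribution (which provides LWP::UserAgent) and URI::Escape are installed; the extract_tinyurl helper is just an illustrative name, not part of any standard module.

#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape qw(uri_escape);

# Pull the first tinyurl.com link out of a chunk of HTML;
# returns undef if none is found.
sub extract_tinyurl {
    my ($html) = @_;
    return $html =~ m!(http://tinyurl\.com/[0-9a-z]+)! ? $1 : undef;
}

if (@ARGV) {
    my $url = $ARGV[0];
    my $ua  = LWP::UserAgent->new(timeout => 10);

    # uri_escape keeps the long URL intact as a query-string value,
    # and avoids the shell-quoting worries of the curl version.
    my $response = $ua->get(
        'http://tinyurl.com/create.php?url=' . uri_escape($url)
    );
    die 'Request failed: ' . $response->status_line . "\n"
        unless $response->is_success;

    my $tinyurl = extract_tinyurl($response->decoded_content);
    print defined $tinyurl ? "$tinyurl\n" : "Error: no TinyURL in reply\n";
}

The payoff is that LWP gives you the HTTP status line directly, so the "decent error handling" the comment above asks for falls out naturally instead of requiring you to guess at curl's output.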