This weekend I finally got around to importing all my old use.perl.org blog entries to Fearful Symmetry. To ease the migration, I ended up writing two itsy-bitsy scripts. They’re nothing fancy, but in case they might help someone, here they are.
Harvest the entries
This was easy. For each account, use.perl.org has a journal entries listing page. So the whole operation consisted of grabbing that webpage and mirroring everything on it that looks like a journal entry. Not terribly sophisticated, but for this specific job it’s all we need.
Of the script itself, the most interesting part is LWP::Simple::getstore(). Most people know and use LWP::Simple::get(), but more than a few forget its sibling, which saves the retrieved webpage directly to a file. That’s perfect for harvesting activities like this one.
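One thing worth knowing about getstore() is that it returns the HTTP status code of the response, which can be passed to is_success() (also exported by LWP::Simple). The sketch below is only an illustration: save_page() is a hypothetical helper, not something the harvesting script uses.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): fetch $url into
# $file and report whether the HTTP status was a success.
sub save_page {
    my ( $url, $file ) = @_;

    # Loaded on demand, so nothing is fetched or required at load time.
    require LWP::Simple;

    # getstore() writes the response body to $file and returns the
    # HTTP status code.
    my $status = LWP::Simple::getstore( $url, $file );

    return LWP::Simple::is_success($status);
}

# Usage, kept as a comment so that nothing touches the network on load:
# save_page( 'https://use.perl.org/~Yanick/journal/38951', '38951' )
#     or warn "could not retrieve entry 38951\n";
```

For a one-off migration, a warning on failure is probably enough to know which entries to fetch again.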
    #!/usr/bin/perl

    use 5.10.0;
    use strict;
    use warnings;

    use LWP::Simple;

    my $uid      = '3196';
    my $username = 'Yanick';

    # grab the page listing all the journal entries of the account
    my $main = get( 'https://use.perl.org/journal.pl?op=list&uid=' . $uid )
        or die "couldn't retrieve the journal listing\n";

    # match in scalar context, so that each iteration picks up the next link
    while ( $main =~ m#//use\.perl\.org/~$username/journal/(\d+)#g ) {
        my $entry_id = $1;
        say "retrieving $entry_id...";
        getstore( "https://use.perl.org/~$username/journal/$entry_id", $entry_id );
        sleep 1;    # let's be nice to the server, shall we?
    }
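A side note on looping over links with a /g match: in list context the match returns every capture at once, while in scalar context each evaluation resumes where the previous one stopped, which is the form that suits a while loop. Here is a minimal sketch of the difference. The HTML fragment and the entry id 38902 are made up for the example, and the %seen filter is an optional extra for when a listing links to the same entry more than once.

```perl
use strict;
use warnings;

# Made-up listing fragment; the second entry is linked twice on purpose.
my $html = <<'HTML';
<a href="//use.perl.org/~Yanick/journal/38951">one</a>
<a href="//use.perl.org/~Yanick/journal/38902">two</a>
<a href="//use.perl.org/~Yanick/journal/38902">two again</a>
HTML

my $username = 'Yanick';

# List context: all the captures come back in one go.
my @all = $html =~ m#//use\.perl\.org/~\Q$username\E/journal/(\d+)#g;

# Scalar context: each test of the condition resumes after the last match.
my ( @unique, %seen );
while ( $html =~ m#//use\.perl\.org/~\Q$username\E/journal/(\d+)#g ) {
    push @unique, $1 unless $seen{$1}++;
}

print "all:    @all\n";       # all:    38951 38902 38902
print "unique: @unique\n";    # unique: 38951 38902
```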
Extract the information off the harvested pages
As one might suspect, the harvested use.perl.org pages contain a little bit more than the raw blog entries. Getting to the information we want (the blog entry’s title, creation date, body, etc.) is not hard, but it’s a little onerous to do by hand.
There are a lot of ways to extract information from a webpage, from quick and dirty regular expressions (like I did for the script above) to full-fledged DOM parsing using, say, HTML::Tree. As I’m playing a lot with jQuery these days, I wondered if there was anything Perlish available offering the same type of interface. Guess what? There is: pQuery.
After playing with it a little bit, I’d say that pQuery is not quite as slick and ready for prime-time as its JavaScript forebear. But again, for this small task, it allowed me to do the job.
The resulting script is as straightforward as they come. I used Firebug to find out which HTML elements I wanted, tested the resulting paths with jQuery and, once I was happy with the result, adapted them to pQuery.
    #!/usr/bin/perl

    use 5.10.0;
    use strict;
    use warnings;

    use pQuery;
    use utf8;    #unless you want xml, you can skip utf8'ing the output

    $/ = undef;    # it's slurping time

    my $p = pQuery(<>);

    say "title: ", $p->find('.title h3')->get(1)->innerHTML;

    my ( $month, $day, $year ) =
      $p->find('.journaldate')->html() =~ /(\w{3})\w* 0?(\d+), (\d{4})$/;

    say "date: ", "$day $month $year";

    say "original url: http:"
      . $p->find('.h-inline a')->get(0)->getAttribute('href');

    say "\n";

    utf8::encode( my $entry = $p->find('.intro')->get(0)->innerHTML );
    say $entry;
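The date regex is the only slightly cryptic bit of the script, so here it is on its own. This is a sketch under assumptions: parse_journal_date() is a hypothetical wrapper, and the sample strings are my guess at what the .journaldate element holds (a weekday, a full month name, a possibly zero-padded day, then the year). Only the regex itself comes from the script.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex used in the extraction script.
# Returns "day month year", or nothing if the text doesn't match.
sub parse_journal_date {
    my ($text) = @_;

    # (\w{3})\w*  first three letters of the month, rest of the name skipped
    # 0?(\d+)     day of the month, without its leading zero
    # (\d{4})$    four-digit year at the end of the string
    my ( $month, $day, $year ) = $text =~ /(\w{3})\w* 0?(\d+), (\d{4})$/
        or return;

    return "$day $month $year";
}

# The sample date strings are assumptions about the page format.
print parse_journal_date('Sunday May 10, 2009'), "\n";          # 10 May 2009
print parse_journal_date('Friday September 04, 2009'), "\n";    # 4 Sep 2009
```

The weekday never gets captured because the regex wants a space followed by a digit right after the word it grabs, and only the month name satisfies that.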
It’s harvesting time
With those two scripts ready to go, the harvesting process becomes much less of a chore:
    $ perl files/harvest_entries.pl
    retrieving 38951...
    retrieving 38951...

    $ perl files/extract_entry.pl 38951
    title: Breaking off from the use.perl.org mothership
    date: 10 May 2009
    original url: https://use.perl.org/~Yanick/journal/38951


    <p> For the last couple of months, as a concession between visibility and control, I'd been double-posting my blog entries both here and on my personal blog. But now that my blog is registered on both the <a href="https://perlsphere.net/" rel="nofollow">Perlsphere</a> and <a href="https://ironman.enlightenedperl.org/" rel="nofollow">IronMan</a> aggregators, the need for the second posts here has dwindled. So... I'm going on a limb and tentatively turn off the echoing. See y'all on <a href="https://babyl.dyndns.org/techblog" rel="nofollow">Hacking Thy Fearful Symmetry</a>!</p>
Of course, there is still the grooming of the use.perl.org HTML, and the actual importing to the new blogging engine. But… surely a handful of other scripts can take care of that, right? :-)