Exporting Old use.perl.org Blog Entries

Posted in: Technical Track

This weekend I finally got around to importing all my old use.perl.org blog entries into Fearful Symmetry. To ease the migration, I ended up writing two itsy-bitsy scripts. They’re nothing fancy, but in case they might help someone, here they are.

Harvest the entries

This was easy. For each account, use.perl.org has a page listing its journal entries. So the whole operation consisted of grabbing that page and mirroring everything on it that looks like a journal entry. Not terribly sophisticated, but for this specific job it’s all we need.

Of the script itself, the most interesting part is LWP::Simple::getstore(). Most people know and use LWP::Simple::get(), but more than a few forget its sibling, which saves the retrieved webpage directly to a file — perfect for harvesting activities like this one.

#!/usr/bin/perl

use 5.10.0;
use strict;
use warnings;

use LWP::Simple;

my $uid      = '3196';
my $username = 'Yanick';

my $main = get( 'https://use.perl.org/journal.pl?op=list&uid=' . $uid );

while ( $main =~ m#//use\.perl\.org/~$username/journal/(\d+)#g ) {
    my $entry_id = $1;
    say "retrieving $entry_id...";
    getstore( "https://use.perl.org/~$username/journal/$entry_id", $entry_id );
    sleep 1;    # let's be nice to the server, shall we?
}
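
One small refinement, not in the original script, is to check the HTTP status that the download returns, so a failed fetch doesn’t slip by unnoticed. LWP::Simple’s mirror() is handy here too: it only re-downloads an entry when the local copy is missing or stale, which makes re-runs cheap. A minimal sketch of the idea, assuming the same listing page and filenames as above:

#!/usr/bin/perl

use 5.10.0;
use strict;
use warnings;

use LWP::Simple qw( get mirror );
use HTTP::Status qw( is_success );

my $uid      = '3196';
my $username = 'Yanick';

my $main = get( 'https://use.perl.org/journal.pl?op=list&uid=' . $uid )
    or die "couldn't fetch the journal listing\n";

while ( $main =~ m#//use\.perl\.org/~$username/journal/(\d+)#g ) {
    my $entry_id = $1;
    say "retrieving $entry_id...";

    # mirror() returns the HTTP status code; 304 means our local copy is current
    my $status = mirror( "https://use.perl.org/~$username/journal/$entry_id",
        $entry_id );
    warn "  fetch of $entry_id failed with status $status\n"
        unless is_success($status) or $status == 304;

    sleep 1;    # still being nice to the server
}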

Extract the information from the harvested pages

As one might suspect, the harvested use.perl.org pages contain a little bit more than the raw blog entries. Getting to the information we want — the blog entry’s title, creation date, body, etc. — is not hard, but it’s a little onerous to do by hand.

There are a lot of ways to extract information from a webpage, from quick and dirty regular expressions (as I did for the script above) to full-fledged DOM parsing using, say, HTML::Tree. As I’m playing a lot with jQuery these days, I wondered if there was anything Perlish available offering the same type of interface. Guess what? There is: pQuery.

After playing with it a little bit, I’d say that pQuery is not quite as slick and ready for prime time as its JavaScript forebear. But then again, for this small task, it got the job done.
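
To give a taste of the parallel, here is roughly what the jQuery-style chaining looks like on the pQuery side. This is only a sketch, reusing the .journaldate selector that shows up in the extraction script below:

#!/usr/bin/perl

use 5.10.0;
use strict;
use warnings;

use pQuery;

# slurp one harvested page given on the command line
my $html = do { local $/; <> };

# roughly the pQuery counterpart of jQuery's $('.journaldate').html()
say pQuery($html)->find('.journaldate')->html();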

The resulting script is as straightforward as they come. I used Firebug to find out which HTML elements I wanted, tested the resulting paths with jQuery and, once I was happy with the result, adapted them to pQuery.

#!/usr/bin/perl

use 5.10.0;
use strict;
use warnings;

use pQuery;
use utf8;    # unless you want xml, you can skip utf8'ing the output

$/ = undef;    # it's slurping time

my $p = pQuery(<>);

say "title: ", $p->find('.title h3')->get(1)->innerHTML;

my ( $month, $day, $year ) =
    $p->find('.journaldate')->html() =~ /(\w{3})\w* 0?(\d+), (\d{4})$/;
say "date: ", "$day $month $year";

say "original url: https:"
    . $p->find('.h-inline a')->get(0)->getAttribute('href');

say "\n";

utf8::encode( my $entry = $p->find('.intro')->get(0)->innerHTML );
say $entry;

It’s harvesting time

With those two scripts ready to go, the harvesting process becomes much less of a chore:

$ perl files/harvest_entries.pl
retrieving 38951...
$ perl files/extract_entry.pl 38951
title: Breaking off from the use.perl.org mothership
date: 10 May 2009
original url: https://use.perl.org/~Yanick/journal/38951
<p>
For the last couple of months, as a concession between
visibility and control, I'd been double-posting my blog
entries both here and on my
personal blog.
But now that my blog is registered on both the
<a href="https://perlsphere.net/" rel="nofollow">Perlsphere</a> and
<a href="https://ironman.enlightenedperl.org/" rel="nofollow">IronMan</a> aggregators,
the need for the second posts here has dwindled. So... I'm going
on a limb and tentatively turn off the echoing.
See y'all on <a href="https://babyl.dyndns.org/techblog" rel="nofollow">Hacking Thy Fearful
Symmetry</a>!</p>

Of course, there is still the grooming of the use.perl.org HTML, and the actual importing into the new blogging engine. But… surely a handful of other scripts can take care of that, right? :-)
