Google Sitemaps and the WordPress blog tool

Google have a feature called Sitemaps. A sitemap allows you to provide a list of all your web pages directly to Google, which should make it easier for them to add your website to the search engine.


Google's website includes details on how to create a sitemap using a Python generator script. This is OK if you have a static site, or are happy for the script to use your access log as the basis for the sitemap, but I have created a short Perl script that builds a sitemap from the WordPress entries instead. It is not as comprehensive as the WordPress plugin that has been developed, but it fits in with how I've designed my site. The script creates a URL list file that includes the last modified dates, so that Google knows whether it needs to re-read the pages or not. It also combines this with a static URL text file so that non-WordPress pages can be included (any index pages can be added manually to the static file).
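As a rough illustration of the URL list described above, the generated file might look something like the lines below. The example.com addresses and the date are invented for illustration; the first two lines stand in for entries copied from the static file, and the last line uses the lastmod and priority fields that the script writes for each WordPress entry.

```
http://www.example.com/
http://www.example.com/about.html
http://www.example.com/blog/?p=12 lastmod=2006-03-14 priority=0.7
```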

The Perl code follows:


#!/usr/bin/perl -w

# Get wordpress pages and add to a google sitemap urllist format file
# Also incorporate a static file into the output
# Runs google sitemap script when finished

use strict;
use DBI;
use File::Copy;

my $version = "0.1 devel";

my $outputfile = '/var/www/data/urllist.txt';
my $statics = '/var/www/data/import_staticlist.txt';
# This includes the path to the google provided sitemap-gen program, 
# may need to change if this is different
my $sitemapcmd = '/opt/sitemap_gen-1.3/sitemap_gen.py \
--config=/var/www/data/website_config.xml';

# Hard-code the DB info here, as the WordPress config is in PHP format rather than Perl
my $dbname = 'wordpress';
my $dbuser = 'wordpress';
my $dbpass = 'password';
my $dbhost = 'localhost';
# If prefix is not wp_, then will need to change the following line 
my $dbtable = 'wp_posts';

# Priority to give to all pages (gives all the same)
my $priority = '0.7';

# First copy statics to output file, then we can append
copy ($statics, $outputfile) or die "Error copying $statics to $outputfile: $!";

# Open file to append
open (OUTPUT, '>>', $outputfile) or die "Unable to append to $outputfile: $!";
# Make sure we are on a newline
print OUTPUT "\n";

my $dbh = DBI->connect("DBI:mysql:$dbname:$dbhost",$dbuser,$dbpass)
    or die "unable to connect to $dbname as $dbuser: $DBI::errstr";

my $query = $dbh->prepare("SELECT guid, post_modified, post_status FROM $dbtable");
# A method call is not interpolated inside a string, so append errstr separately
$query->execute or die "Error getting data from DB: " . $dbh->errstr;

my ($url, $date, $poststatus, $null);
while (($url, $date, $poststatus) = $query->fetchrow_array())
{
    # Ignore if not yet published
    if ($poststatus ne "publish" && $poststatus ne "static") {next;}
    # remove time from date
    ($date, $null) = split / /, $date;
    # Otherwise add the details
    print OUTPUT "$url lastmod=$date priority=$priority\n";
}
$query->finish;
$dbh->disconnect;
close OUTPUT;

# Finished creating the file - now run the google program
system ($sitemapcmd) == 0 or die "Error running sitemap generator: $?";

The script can be run manually, or in my case is called from a crontab job:

0 23 * * * /opt/googlesitemap/wpresspages.pl

Fairly basic, but provides a quick and easy way of having the sitemap file updated regularly.
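
If you want to check the line format without touching the database, the part of the loop that strips the time and builds each line can be pulled out into a small function. This is only a sketch: the urllist_line name is my own invention and not part of the script above, and the URL and date are made-up example values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): build one url list
# line from a post URL, its post_modified value and a priority.
sub urllist_line {
    my ($url, $modified, $priority) = @_;
    # post_modified looks like "2006-03-14 21:05:10"; keep only the date part
    my ($date) = split / /, $modified;
    return "$url lastmod=$date priority=$priority";
}

# Example values are invented for illustration
print urllist_line('http://www.example.com/?p=12', '2006-03-14 21:05:10', '0.7'), "\n";
```

Running this prints a single line in the same shape as the ones the main script appends to the URL list file.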