22 October 2009
This Apache log analysis tool is no longer being maintained. This is included here for future reference or to allow anyone to pick up and continue with the project.
This is simple Log Analysis software designed to give the webmaster useful information about
who is visiting their website. The software has only a few pre-requisites that should be met
by most websites already. Once installed it allows viewing of the reports through a web browser.
Whilst designed and tested for a Linux system, it should work on any Apache / Perl / PHP webserver, although
it will need to be installed in different directories on a Windows system.
The software is released under the GPL as free open source software.
Full details can be seen by viewing the:
Download the file in tgz (tar gzipped) format:
The latest Robots Entry can be used to update the line in your loginfo configuration file (loginfo.cfg).
The entry should be copied as a single line.
our @robots = ('Googlebot', 'Yahoo! Slurp', 'Netcraft Web Server Survey', 'Ask Jeeves/Teoma', 'grub', 'msnbot', 'Wget', 'Feedster Crawler', 'BlogSearch', 'Syndic8', 'Cerberian', 'WISEnutbot', 'BlogPulse', 'Technoratibot', 'A2B Location-Based Search Engine', 'BlogsNowBot', 'Blogslive', 'Blogshares', 'UniversalFeedParser', 'ping.blo.gs', 'PageBitesHyperBot', 'PubSub-RSS', 'SurveyBot', 'walhello', 'Mirar', 'OmniExplorer', 'W3C_Validator', 'IconSurf', 'TurnitinBot', 'psbot', 'aipbot', 'StumbleUpon', 'Gigabot', 'LinkWalker', 'rojo.com', 'ConveraCrawler', 'DiamondBot', 'HenryTheMiragoRobot', 'Baiduspider', 'WebFilter Robot', 'SURF', 'topicblogs', 'BecomeBot' );
our @robots = ('Googlebot', 'Yahoo! Slurp', 'Netcraft Web Server Survey', 'Ask Jeeves/Teoma', 'grub', 'msnbot', 'Wget', 'Feedster Crawler', 'BlogSearch', 'Syndic8', 'Cerberian', 'WISEnutbot', 'BlogPulse', 'Technoratibot', 'A2B Location-Based Search Engine', 'BlogsNowBot', 'Blogslive', 'Blogshares', 'UniversalFeedParser', 'ping.blo.gs', 'PageBitesHyperBot', 'PubSub-RSS', 'SurveyBot', 'walhello', 'Mirar', 'OmniExplorer', 'W3C_Validator', 'IconSurf', 'TurnitinBot', 'psbot', 'aipbot', 'StumbleUpon', 'Gigabot', 'LinkWalker', 'rojo.com', 'ConveraCrawler', 'DiamondBot', 'HenryTheMiragoRobot', 'Baiduspider', 'WebFilter Robot', 'SURF' );
our @robots = ('Googlebot', 'Yahoo! Slurp', 'Netcraft Web Server Survey', 'Ask Jeeves/Teoma', 'grub', 'msnbot', 'Wget', 'Feedster Crawler', 'BlogSearch', 'Syndic8', 'Cerberian', 'WISEnutbot', 'BlogPulse', 'Technoratibot', 'A2B Location-Based Search Engine', 'BlogsNowBot', 'Blogslive', 'Blogshares', 'UniversalFeedParser', 'ping.blo.gs', 'PageBitesHyperBot', 'PubSub-RSS', 'SurveyBot', 'walhello', 'Mirar', 'OmniExplorer', 'W3C_Validator', 'IconSurf', 'TurnitinBot', 'psbot' );
The following old versions are no longer being developed. You should move to the latest version.
LogInfo provides a way to analyse the Apache web logs. It focuses on the information that is useful to
a webmaster trying to improve the appeal of their website, identifying the most popular areas of the site
as well as the way visitors are referred to the site.
The problem with much of the log analysis software available is that it is either too complex, or doesn't include
the features needed to analyse script based websites, or in many cases both. My primary aim was that the
program should be quick to get up and running and simple to use. Another important feature is the ability to
handle scripts that use session information without getting an entry for every single URL variation. It was
this feature that made me write this software rather than using httpdstats, which was one of the tools I tried
before writing my own.
This program was created to provide me with some useful information from my own webserver logs.
It has been made public through the GPL and time permitting I will continue to develop this. If you would
rather develop your own program based on my original code then the GPL will allow you to do this as long as
the resulting software is also released through the GPL. If you do follow that route, please rename the program
and let me know to avoid confusion between the original and new software. Alternatively if you would like to
contribute to the development of this program please send an email with any suggestions. For example if
you have some code that could provide better browser detection then please provide me with details. All
submitted code should be provided either free of any copyright or using the GPL / LGPL licenses. If included
in the software it will be issued with the GPL license. Please email firstname.lastname@example.org.
The software is licensed under the GPL. Full details are provided in the text file gpl.txt which
should be distributed with this software.
LogInfo Apachelog Log Analysis Tool
Web: http://www.watkissonline.co.uk
Copyright (C) 2005 Stewart Watkiss

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
Future versions may include:
There is however no expected date for an updated version, which at the time of writing has not been started on.
Please read these installation instructions fully before installing the
software. Ensure that you have read and understood the security implications
of adding the software to cron before you install the software and before creating
a scheduled job.
These instructions are based on a webserver running GNU/Linux. The files may need to be
installed manually on other systems.
The following pre-requisites are needed. They will be installed by default on most systems.
If the Time::ParseDate module is not installed on your system it can be installed as follows:
perl -MCPAN -e shell
install Time::ParseDate
quit
If you get the following error when you try to run the program then you may need to follow these instructions:
Can't locate Time/ParseDate.pm in @INC
Upgrading to version 0.1.1 is achieved by extracting the files into the same directory as the previous install (e.g. /usr/local/loginfo). The PHP file has not been updated so there is no need to copy that across. Existing .cfg files can be used but it is recommended that new configuration files are created from the new sample.cfg which includes new features. You may need to run the chown / chmod commands to set the permissions correctly.
The latest version of the code is available at: www.watkissonline.co.uk. Other sites may allow you to download the software, but you should check it is the most recent version.
There is no automated installer at present. The installation is just a few simple steps that can be tailored to your own needs.
tar -xvzf loginfo-x.x.x.tgz
(replace x.x.x with the version number of the software).
cp -R loginfo-x.x.x/program/* /usr/local/loginfo
If the directory is not /usr/local/loginfo, edit the apachelog.pl file and change the directory name on the first line,
e.g. #!/usr/bin/perl -w -I/usr/local/loginfo
cp loginfo-x.x.x/php/index.php /var/www/html/webstats
cp /usr/local/loginfo/sample.cfg /usr/local/loginfo/loginfo.cfg
chown -R root:root /usr/local/loginfo/*
chmod 500 /usr/local/loginfo/apachelog.pl
chmod 600 /usr/local/loginfo/*.cfg
chmod 400 /usr/local/loginfo/Modules/*
cd /var/www/html/webstats
vi .htaccess
(create the following entries)
AuthUserFile /.htpasswd
AuthGroupFile /dev/null
AuthName "Authorised Users Only"
AuthType Basic
require valid-user
htpasswd -c .htpasswd <username>
(you will be prompted for the user's password)
The main configuration file is normally stored in the same directory as the program file, although it can be stored anywhere on the system. (e.g. in an etc directory). If there is only one on the system then it would normally be called loginfo.cfg. If using virtual hosts it may be better to have multiple configuration files, one for each virtual server in which case the filename would normally include an element of the virtual server name. See Virtual Hosts for more details.
The configuration file is written in Perl format. A syntax error or other mistake in the file may stop the program from running, so care is required when editing it. All entries must end with a semi-colon ; and anything after a hash # character is a comment. All entries are prefixed with our to define them as publicly available.
The following entries are used:
The value is used as the title of the report file. This should normally be set to the name of the website, e.g. www.watkissonline.co.uk. This is particularly important when using virtual hosts to distinguish between the different files. If this is not changed then it will still work, but the report title will not be customised.
The $websitedomain value is used to filter out your own domain from the referer list. This should be set either to the domain or the hostname of the website being analyzed. For example www.watkissonline.co.uk could be added with or without the www part. This will behave slightly differently if you have multiple webservers within your domain. The parameter can be left out, but the main benefit of including it is that the percentage values for the referers will be more accurate.
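To see why filtering your own domain makes the referer percentages more accurate, here is a small illustrative sketch. It is written in Python rather than the Perl that LogInfo uses, and it is not LogInfo's code: the function name external_referers and the simple "domain appears anywhere in the host name" test are assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlparse

def external_referers(referers, website_domain):
    """Count referers by host, skipping blank entries and the site's own domain.

    Assumption for this sketch: a referer belongs to the site if
    website_domain appears anywhere in its host name.
    """
    counts = Counter()
    for referer in referers:
        host = urlparse(referer).hostname  # None for '-' or an empty referer
        if host is None or website_domain in host:
            continue
        counts[host] += 1
    total = sum(counts.values())
    # Map each host to (hits, percentage of external hits)
    return {host: (hits, round(100.0 * hits / total, 1))
            for host, hits in counts.items()}

referers = [
    'http://www.google.com/search?q=linux',
    'http://www.watkissonline.co.uk/index.html',
    '-',
    'http://www.google.com/',
    'http://search.yahoo.com/',
]
result = external_referers(referers, 'watkissonline.co.uk')
for host, (hits, percent) in result.items():
    print(host, hits, percent)
```

With the site's own pages left in, internal links would dominate the list and push the percentages for genuine external referers down.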
This is the directory and filename of the apache access_log file. The default value is the directory used by Mandriva Linux and some other Linux distributions. The log file must be in the combined log format, which is the default on many systems. Refer to the Apache documentation for more details. The access_log file should be rotated on a monthly basis (at the end of each month).
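For reference, a line in the combined log format holds the client host, identity, user, timestamp, request, status, size, referer and user agent. The following Python sketch shows one way to split such a line into fields. It is only an illustration and not the parser used by LogInfo (which is written in Perl); the sample line is invented, and the pattern does not cope with escaped quotes inside a field.

```python
import re

# Apache "combined" format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of fields for one combined-format line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None

# Invented sample line for illustration only
sample = (
    '192.168.1.10 - - [09/Jun/2005:15:31:30 +0100] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/5.0 (X11; Linux i686)"'
)
fields = parse_combined(sample)
print(fields['host'], fields['status'], fields['referer'])
```

A log written in the shorter common format has no referer or user agent fields, so it would not match this pattern, which is why the combined format is required.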
This refers to the apache error_log file. The default value is used by Mandriva Linux and some other Linux distributions. Unlike the accesslog file the program will run if the file doesn't exist, but obviously will be unable to report on error messages. File Not Found (404) messages will be taken from the accesslog file rather than the errorlog file.
This is the filename for the report. It should include the path of the directory. The file will have the year, month and extension (.html or .txt) added by the program. The filename part should reflect the name of the website, especially if using virtual servers, as it will be listed in the menu page. This value also needs to be added to index.php if using the index page.
The ignoreaddress list includes any IP addresses that are to be excluded from the
report. Typically this should include any addresses that the webmaster uses to test
the site, and any servers that may be running automated tests to check that the server
is still running. Values should be contained within single quotes and separated with commas.
The wildcard * can be used to match any address range, e.g. 192.168.1.* matches all addresses from 192.168.1.0 to 192.168.1.255, or 192.168.* would match from 192.168.0.0 to 192.168.255.255.
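The wildcard behaviour described above can be sketched as follows. This is an illustrative Python example, not the Perl code used by LogInfo; it uses Python's fnmatch module as a stand-in for however the program matches addresses internally, and the function name is_ignored and the example list are assumptions.

```python
from fnmatch import fnmatchcase

def is_ignored(address, patterns):
    """True if the address matches any pattern; * stands for any run of characters."""
    return any(fnmatchcase(address, pattern) for pattern in patterns)

# Hypothetical ignore list for illustration
ignore = ['127.0.0.1', '192.168.1.*', '10.0.*']

print(is_ignored('192.168.1.42', ignore))   # inside 192.168.1.*
print(is_ignored('192.168.10.42', ignore))  # different third octet
```

Note that the dot before the * matters: 192.168.1.* does not match 192.168.10.42, whereas a pattern of 192.168.1* would.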
Any files with extensions listed in the ignorefile list will not be included in the statistics. The report will provide a count of the number of hits against each extension, but not down to the individual file level. This is to make the report more relevant. The extensions are not case sensitive, but must appear on the end of a filename prefixed with a dot. The entries must be quoted and comma separated.
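As a rough illustration of the effect, the Python sketch below separates page hits from hits on ignored file types. It is not taken from LogInfo (which is written in Perl); the function name count_hits and the choice to list extensions without the leading dot are assumptions for the example.

```python
from collections import Counter

def count_hits(paths, ignore_extensions):
    """Return (per-page counts, per-extension counts for ignored file types)."""
    # Compare extensions case-insensitively; tolerate entries given with a leading dot
    ignored = {ext.lower().lstrip('.') for ext in ignore_extensions}
    pages = Counter()
    by_extension = Counter()
    for path in paths:
        filename = path.rsplit('/', 1)[-1]
        # The extension is whatever follows the last dot in the file name
        _, dot, ext = filename.rpartition('.')
        if dot and ext.lower() in ignored:
            by_extension[ext.lower()] += 1
        else:
            pages[path] += 1
    return pages, by_extension

paths = ['/index.html', '/images/logo.PNG', '/images/a.png',
         '/style.css', '/index.html']
pages, by_extension = count_hits(paths, ['png', 'css'])
print(dict(pages))
print(dict(by_extension))
```

Images and stylesheets are fetched as a side effect of viewing a page, so counting them individually would bury the pages that visitors actually chose to look at.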
Any lines in the error log matching the ignoreerrors will not be listed in the error report. This will not affect the rest of the report, only the error section. This will match on both file not found errors, and any other kind of errors. It is recommended to include just those files that are not on the server that browsers or search engines look for. Therefore robots.txt and favicon.ico are useful entries. If you do have a favicon.ico and robots.txt file, you may want to remove them from the list to ensure that you see any problems with these.
The html variable can be either 1 or 0. A 1 will create html formatted files with the extension .html, whereas zero will create plain text files with extension .txt. Note the format of the report may change in future versions (or even become xml / xhtml).
If you are using your own webserver to view the report then you may need to set the filechmod value to the required permissions. This should be the octal permissions value used by chmod. A value of 775 should be suitable for most users, although depending upon the user under which you run the program the more restrictive 755 may be sufficient.
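If the octal values are unfamiliar, this short Python sketch shows what 775 and 755 mean in terms of read, write and execute permissions. It only decodes the numbers and has nothing to do with how LogInfo applies them; the function name describe is made up for the example.

```python
import stat

def describe(octal_string):
    """Return the ls-style permission string for a regular file with these octal permissions."""
    mode = int(octal_string, 8)  # e.g. '775' -> 0o775
    return stat.filemode(stat.S_IFREG | mode)

print(describe('775'))  # owner and group can write
print(describe('755'))  # only the owner can write
```

The only difference between the two values is whether members of the file's group are allowed to write to the report.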
The sessionsscripts list allows you to specify scripts that contain session information. Any scripts listed here will have the details after the question mark ? stripped off. Instead of getting a line for every single URL used (typically one per page, per user) you will get a count of the number of times the script has been called. In the WordPress example, instead of every different page being listed separately, the hits are counted together against the script. The safest way to use this is to include the full path (as it appears in the URL), which will prevent it matching other directories with similar names. You can enter the value as a directory to apply to all scripts in that directory, or as an individual script file.
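The idea can be sketched in a few lines of Python. This is an illustration only, not the Perl code in LogInfo: the function name normalise_url is invented, and matching on the start of the path is an assumption (the program itself may match differently, which is why the full path is recommended).

```python
from collections import Counter

def normalise_url(url, session_scripts):
    """Drop everything from the ? onwards if the path starts with a listed script or directory."""
    path, _, _query = url.partition('?')
    if any(path.startswith(prefix) for prefix in session_scripts):
        return path
    return url

# Hypothetical entries: one directory and one individual script
session_scripts = ['/blog/', '/shop/basket.php']

hits = [
    '/blog/index.php?p=12&sid=abc123',
    '/blog/index.php?p=40&sid=def456',
    '/shop/basket.php?session=99',
    '/search.php?q=perl',
]
counts = Counter(normalise_url(url, session_scripts) for url in hits)
for url, number in counts.items():
    print(url, number)
```

The two blog requests collapse into a single entry with a count of two, while the search script, which is not listed, keeps its full URL.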
The following are advanced settings, please ensure you understand the implications before making any manual changes.
By specifying a filename a log will be created with any debug messages. Typically this will include details of any user agent strings not recognised, which can be used to improve the robots listing. It should normally be left commented out or blank so that no log file is created.
Web robots (also known as webbots) can affect the log results. To report on these separately they should be listed in the @robots list. This is done by using the user_agent value given by the web robot. To add additional web robots enter a string that will match against the robot, but not against a normal web browser. The word netscape would not be a good word to include: whilst there is a Netscape search engine (although I believe it uses other engines' bots), it may also conflict with the user agent of a browser that is being used.
Entries should be enclosed in quotes and be comma separated. The default list should work for most websites, although you may want to add some country specific entries. As the list is updated details will be posted on www.watkissonline.co.uk.
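The following Python sketch illustrates why the choice of string matters. It is not LogInfo's code (the program is written in Perl); the function name classify_agent and the case-sensitive substring test are assumptions, and the list is a short extract of the default entries.

```python
def classify_agent(user_agent, robots):
    """Return the first robot string found in the user agent, or None for a normal browser."""
    for name in robots:
        if name in user_agent:
            return name
    return None

robots = ['Googlebot', 'Yahoo! Slurp', 'msnbot', 'Wget']

googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
firefox = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511 Firefox/1.0.4'
netscape = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624 Netscape/7.1'

print(classify_agent(googlebot, robots))
print(classify_agent(firefox, robots))
# Adding 'Netscape' to the list would wrongly count a real browser as a robot
print(classify_agent(netscape, robots + ['Netscape']))
```

Note that a string such as Mozilla would be a poor entry for the same reason, as it appears in the user agent of most browsers.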
After updating the loginfo config file you should also edit the index.php file if you want to use the menu to the reports. This file must be in the same directory as the output files are created. The file will need to be changed if the $outputfile variable has been changed. This is a PHP file so has a different format to the Perl used in the standard configuration file. The most significant difference is that instead of a hash character to signify a comment the PHP file uses two slashes // . There are only two entries that need to be changed which are:
Set this to the name of the $outputfile used in the earlier config file. This should be the filename after the last slash / but without the date and .html / .txt extension. If you have multiple hosts then this is comma separated with quotes around the individual names.
This needs the same value as used in the loginfo config file. This is used to look for files ending with .html and .txt. If you wish to have both text based and html based reports then you will need an extra config file for each and an extra copy of the index.php file.
You can test that the configuration files are correct by running the program manually. On the command line enter the apachelog.pl command followed by the configuration file (full path names may be required). e.g.
/usr/local/loginfo/apachelog.pl /usr/local/loginfo/loginfo.cfg
You should then be able to view the report using your web browser. E.g. to view the reports on the same server use:
The program can be automated by adding it to the crontab file. To work correctly with log rotation it must run to completion before the log rotation scripts start, and close to (but before) the 1st day of the month. In other words it should run on the last day of each month, before log rotation occurs. You may prefer to run it more frequently than that, perhaps on a daily basis so that you can see a partial report for the current month. The sample entry will create a scheduled task to run every day shortly before midnight. It can be run as root, but read the security implications and ensure that you understand how to secure the scripts if using root.
There is a sample file called crontab.sample which can be edited and then loaded into cron. As the user you would like the program to run as, enter the following commands:
crontab -l >> crontab.sample
crontab crontab.sample
The first line will copy the current crontab entries into the sample file to ensure that these are retained when the new crontab file is loaded. Use
man 5 crontab
to see the syntax of the crontab file.
To run, the program must have read access to the Apache logs. The Apache logs are often restricted to root only. To overcome this either the apachelog.pl program needs to be run as root, or the log files need to be changed so that another user can read them. The second option can be complicated, because the change of permissions on the log files would need to be included in the log rotation scripts, which differ across platforms / distributions. For this reason the installation instructions have been written assuming that root will be accessing the file. If you have a good understanding of how the logs and log rotation work on your particular system you could overcome some of the security implications by running the program as a normal user instead of root.
There are some important implications if this program is being run automatically from cron, particularly if running as root. As the program is written in perl anyone that has write access to the program file, the module Date.pm or the configuration file can add a command that will be run under the username of the cron task. It is therefore strongly recommended that only root should have write access to any of these files. This is the reason for the chown / chmod commands needed during the installation.
Whilst some people are happy to make their webstats publicly available you may need to ensure that no personal information is released. In particular if you have CGI scripts then in the event of them issuing an error message it may include information such as the user, their IP address and even their password. For this reason it is strongly recommended that access to the log files is restricted. This can be achieved using .htaccess / .htpasswd, or by setting it up in your httpd config file.
If you do choose to make the statistics publicly available then removing the @errorlog entry should prevent any sensitive information from being published.
If you are running virtual hosts on your system then it may be beneficial to split the logs into separate files. If you aren't configured for virtual hosts or don't know what virtual hosts are, and only have one website running on your server, then you can ignore this.
If using virtual hosts then the logs for each of the different virtual hosts need to be sent to separate files. The easiest way to do this is to add the following lines to each virtual server in the Vhost.conf file.
CustomLog /var/log/httpd/websitename.access_log combined
ErrorLog /var/log/httpd/websitename.error_log
Ensure that websitename is unique for each server. Then create a separate loginfo config file for each virtual server, ensuring that $outputfile is unique across each different virtual host, and that each config file points at the relevant log file. You will also need to update index.php to have multiple entries and update your log rotation scripts accordingly. Additional crontab entries will be needed to call apachelog.pl against each of the config files. These should be entered so that each one completes before the next starts, and all complete before log rotation occurs. A fifteen minute interval between each entry should be sufficient.
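As an illustration of the fifteen minute spacing, the Python sketch below builds one daily crontab line per config file. It is not part of LogInfo: the function name staggered_crontab, the 22:00 start time and the config file names are all assumptions, so adjust them so that the last entry finishes before your own log rotation runs.

```python
def staggered_crontab(config_files, start_hour=22, start_minute=0, interval=15,
                      program='/usr/local/loginfo/apachelog.pl'):
    """Build one daily crontab line per config file, spaced `interval` minutes apart."""
    lines = []
    for index, config in enumerate(config_files):
        total = start_hour * 60 + start_minute + index * interval
        if total >= 24 * 60:
            raise ValueError('entries would run past midnight; start earlier')
        hour, minute = divmod(total, 60)
        # crontab fields: minute hour day-of-month month day-of-week command
        lines.append(f'{minute} {hour} * * * {program} {config}')
    return lines

# Hypothetical config files, one per virtual host
configs = [
    '/usr/local/loginfo/site1.cfg',
    '/usr/local/loginfo/site2.cfg',
    '/usr/local/loginfo/site3.cfg',
]
entries = staggered_crontab(configs)
for line in entries:
    print(line)
```

The printed lines can be pasted into the crontab file in place of the single sample entry.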
Some entries have been removed to make this easier to view.
Version 0.1 alpha
Report Compiled: Thu Jun 9 15:31:30 2005
LogInfo Apachelog Log Analysis Tool
Web: http://www.watkissonline.co.uk
Loginfo comes with ABSOLUTELY NO WARRANTY
This is free software, and you are welcome to redistribute it under certain conditions
See the User Manual and gpl.txt for more information.
|Netcraft Web Server Survey||8|
|Page||Number of Hits||Percentage|
|Referer||Number of Hits||Percentage|
|Browser||Number of Hits||Percentage|
|MSIE 6.0||140||46.1 %|
|Firefox 1||107||35.2 %|
|MSIE 5.01||7||2.3 %|
|MSIE 5.5||4||1.3 %|
|Konqueror 3.4||3||1.0 %|
|Opera 8.0||3||1.0 %|
|MSIE 6.0b||2||0.7 %|
|Firefox 0.9.3||2||0.7 %|
|Opera 7.54 [en]||2||0.7 %|
|MSIE 4.01||1||0.3 %|
|MSIE 5.0||1||0.3 %|
|Firefox 0.8||1||0.3 %|
|Firefox 0.9.1||1||0.3 %|
|IntranetExploder 08.15||1||0.3 %|
|Konqueror 3.0-rc2||1||0.3 %|
|Konqueror 3.0-rc5||1||0.3 %|
|Konqueror 3.3||1||0.3 %|
|Netscape 7.1||1||0.3 %|
|Safari 312||1||0.3 %|
|Safari 412||1||0.3 %|
|OS||Number of Hits||Percentage|
|Windows NT 5.1||139||45.7 %|
|Linux i686||68||22.4 %|
|Windows NT 5.0||39||12.8 %|
|Windows 98||12||3.9 %|
|Linux i686 (x86_64||5||1.6 %|
|Windows NT 4.0||3||1.0 %|
|Linux ppc||2||0.7 %|
|PPC Mac OS X||2||0.7 %|
|Windows NT 5.2||2||0.7 %|
|i686 Linux||2||0.7 %|
|FreeBSD i386||1||0.3 %|
|Linux i586||1||0.3 %|
|Linux x86_64||1||0.3 %|
|PPC Mac OS X Mach-O||1||0.3 %|
|Windows 95||1||0.3 %|
|Windows CE||1||0.3 %|
|Date & Time||Number of Hits||Percentage|
|Time||Number of Hits||Percentage|
|0:00 to 1:00||2||0.7%|
|1:00 to 2:00||1||0.3%|
|2:00 to 3:00||1||0.3%|
|3:00 to 4:00||1||0.3%|
|4:00 to 5:00||1||0.3%|
|5:00 to 6:00||2||0.7%|
|6:00 to 7:00||2||0.7%|
|7:00 to 8:00||1||0.3%|
|8:00 to 9:00||1||0.3%|
|9:00 to 10:00||3||1.0%|
|10:00 to 11:00||1||0.3%|
|11:00 to 12:00||2||0.7%|
|12:00 to 13:00||3||1.0%|
|13:00 to 14:00||2||0.7%|
|14:00 to 15:00||2||0.7%|
|15:00 to 16:00||2||0.7%|
|16:00 to 17:00||7||2.3%|
|17:00 to 18:00||4||1.3%|
|18:00 to 19:00||3||1.0%|
|19:00 to 20:00||1||0.3%|
|20:00 to 21:00||2||0.7%|
|21:00 to 22:00||2||0.7%|
|22:00 to 23:00||3||1.0%|
|23:00 to 24:00||2||0.7%|
|Date & Time||Number of Hits||Percentage|
|01/06/2005 11:00 to 12:00||3||1.0%|
|01/06/2005 12:00 to 13:00||7||2.3%|
|01/06/2005 13:00 to 14:00||5||1.6%|
|01/06/2005 15:00 to 16:00||1||0.3%|
|01/06/2005 16:00 to 17:00||5||1.6%|
|01/06/2005 17:00 to 18:00||2||0.7%|
|01/06/2005 18:00 to 19:00||2||0.7%|
|01/06/2005 19:00 to 20:00||2||0.7%|
|01/06/2005 20:00 to 21:00||1||0.3%|
|01/06/2005 21:00 to 22:00||4||1.3%|
|01/06/2005 22:00 to 23:00||2||0.7%|
|01/06/2005 23:00 to 24:00||1||0.3%|
|02/06/2005 0:00 to 1:00||1||0.3%|
|02/06/2005 1:00 to 2:00||1||0.3%|
|02/06/2005 2:00 to 3:00||1||0.3%|
|02/06/2005 3:00 to 4:00||2||0.7%|
|02/06/2005 7:00 to 8:00||1||0.3%|
|02/06/2005 8:00 to 9:00||2||0.7%|
|02/06/2005 10:00 to 11:00||3||1.0%|
|02/06/2005 11:00 to 12:00||2||0.7%|
|02/06/2005 12:00 to 13:00||1||0.3%|
|02/06/2005 13:00 to 14:00||3||1.0%|
|02/06/2005 14:00 to 15:00||2||0.7%|
|02/06/2005 15:00 to 16:00||3||1.0%|
|02/06/2005 17:00 to 18:00||1||0.3%|
|02/06/2005 18:00 to 19:00||1||0.3%|
|02/06/2005 19:00 to 20:00||1||0.3%|
|02/06/2005 20:00 to 21:00||3||1.0%|
|02/06/2005 21:00 to 22:00||3||1.0%|
|03/06/2005 2:00 to 3:00||1||0.3%|
|03/06/2005 4:00 to 5:00||1||0.3%|
|03/06/2005 5:00 to 6:00||1||0.3%|
|03/06/2005 7:00 to 8:00||2||0.7%|
|09/06/2005 11:00 to 12:00||2||0.7%|
|09/06/2005 12:00 to 13:00||3||1.0%|
|09/06/2005 13:00 to 14:00||2||0.7%|
|09/06/2005 14:00 to 15:00||2||0.7%|
|09/06/2005 15:00 to 16:00||2||0.7%|
|Filename||Number of Errors|
|Error||Number of Errors|
|[client 184.108.40.206] script not found or unable to stat
|script not found or unable to stat: /data/www/cgi-bin/openwebmail