C#: Parsing Apache / IIS log files (NCSA/W3C) Part 1



Webserver log files (e.g. NCSA/W3C) contain all kinds of useful data - the number of visitors to a website, which pages are performing or underperforming, bandwidth consumption, missing pages etc.

There are a number of excellent applications out there that parse these log files and provide the webmaster with all kinds of sexy reports (containing funky graphs/grids about the unfortunate victims/visitors to our websites).

For example (feel free to add to the list):

http://www.weblogexpert.com
http://awstats.sourceforge.net
http://www.smartertools.com/smarterstats/web-analytics-seo-software.aspx

Some of these tools, like SmarterStats (if memory serves me correctly), even feature a simple report builder which empowers a webmaster to build their own reports.

But even that amount of flexibility doesn't guarantee accurate numbers; observe the following extract of an NCSA-format log:

 
crawl-66-249-72-249.googlebot.com - - [11/Apr/2011:00:01:47 +0200] "GET /robots.txt HTTP/1.1" 301 291 "-" "Googlebot-Image/1.0"
41.0.34.4 - - [11/Apr/2011:00:02:21 +0200] "GET /components/com_virtuemart/show_image_in_imgtag.php?filename=resized%2FPaddle_Colt_Gove_4c68c62b3111e_90x90.jpg&newxsize=70&newysize=80&fileout= HTTP/1.1" 200 5129 "-" "Mozilla/4.0 (compatible;)"
41.0.34.4 - - [11/Apr/2011:00:14:15 +0200] "GET /components/com_virtuemart/show_image_in_imgtag.php?filename=resized%2FLooper_Kydex_Bel_4ce4c8c4a74b1_90x90.gif&newxsize=70&newysize=80&fileout= HTTP/1.1" 200 3046 "-" "Mozilla/4.0 (compatible;)"
41.0.34.4 - - [11/Apr/2011:00:14:16 +0200] "GET /components/com_virtuemart/show_image_in_imgtag.php?filename=resized%2FESP_Baton_21__4c52fdff9221e_90x90.gif&newxsize=70&newysize=80&fileout= HTTP/1.1" 200 2041 "-" "Mozilla/4.0 (compatible;)"
41.0.34.4 - - [11/Apr/2011:00:14:37 +0200] "GET /components/com_virtuemart/show_image_in_imgtag.php?filename=resized%2FITP_A3_EOS_4bb9a8780aaef_90x90.jpg&newxsize=70&newysize=80&fileout= HTTP/1.1" 200 5761 "-" "Mozilla/4.0 (compatible;)"
ool-18bc86a4.dyn.optonline.net - - [11/Apr/2011:00:17:28 +0200] "GET /index.php?page=shop.product_details&category_id=3&flypage=tpflypage.tpl&product_id=8&option=com_virtuemart&Itemid=10 HTTP/1.1" 200 7503 "http://www.google.com/url?sa=t&source=web&cd=12&ved=0CCYQFjABOAo&url=http%3A%2F%2Fwww.hailstormza.com%2Findex.php%3Fpage%3Dshop.product_details%26category_id%3D3%26flypage%3Dtpflypage.tpl%26product_id%3D8%26option%3Dcom_virtuemart%26Itemid%3D10&rct=j&q=esp%20baton&ei=bP2UTZvGH6OD0QHkn-XkCw&usg=AFQjCNGur0K_j-lqCC8VZX5Ftz5I-NnjuA" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SearchToolbar 1.2; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; InfoPath.2; .NET CLR 3.0.30729; .NET4.0C; SRS_IT_E8790476BD76555A36A199; SRS_IT_E8790476BD76555A31A090)"
 

In the preceding extract you might have noticed the page "show_image_in_imgtag.php" being called multiple times. The page in question (from VirtueMart - a Joomla shop web application) is used to dynamically output resized images to the browser.

If we're interested in an accurate number of page views (not to be confused with hits), we'll need to exclude that file from our report (since it's not actually a page view) - something we can't expect the reporting application to know (perhaps some apps support manual exclusion of files?).

Which obviously means that the structure of a website can influence the accuracy of our reports. Incidentally, the website this extract was taken from reported over 150,000 page views using AWStats, but as soon as I manually excluded the "show_image_in_imgtag.php" page, views fell to under 10,000 - a rather horrible inflation of the actual stats, don't you think?

Now I am not going to tell everyone to dump their reporting applications and write their own; I am, however, going to give you a few clues on how to manually retrieve and interpret the raw data yourself.

The first thing we're going to have a look at is how to parse a log file.

Note that log files come in various formats, as seen in the following image (from IIS 7.5 Manager):
Image listing log file format options

Instead of writing our own log parser(s) we can use a pre-written one like the "Log Parser 2.2" tool Microsoft created back in 2005 (it can be downloaded from here). The tool supports all of the formats seen in the image above (and even a few others).

In your project add a reference to LogParser.dll - if you installed the above-mentioned tool, it will generally be located within your 32-bit Program Files folder under the "Log Parser 2.2" folder.

Observe the following snippet:

 
using System;
using System.Data;
using MSUtil; // COM interop assembly generated from LogParser.dll
 
// Maps Log Parser's 1-based column type codes (integer, real, string,
// timestamp, null) onto the .NET types used for the DataTable columns;
// null-typed columns are simply stored as plain objects.
public static Type[] types = new Type[]
{
    typeof(Int32),
    typeof(Single),
    typeof(String),
    typeof(DateTime),
    typeof(Object)
};
 
public static DataTable ParseLog<T>(string query) where T : new()
{
    LogQueryClassClass log = new LogQueryClassClass();
 
    // T is the COM input context that tells Log Parser which log format to
    // expect (e.g. W3C or NCSA); the query itself points at the log file(s).
    ILogRecordset recordset = log.Execute(query, new T());
    ILogRecord record = null;
 
    DataTable dt = new DataTable();
    Int32 columnCount = recordset.getColumnCount();
 
    // Build the DataTable schema from the recordset's column metadata.
    for (int i = 0; i < columnCount; i++)
    {
        dt.Columns.Add(recordset.getColumnName(i), types[recordset.getColumnType(i) - 1]);
    }
 
    // Copy every log record into a DataRow.
    for (; !recordset.atEnd(); recordset.moveNext())
    {
        DataRow dr = dt.NewRow();
 
        record = recordset.getRecord();
 
        for (int i = 0; i < columnCount; i++)
        {
            // Null field values have to be stored as DBNull in a DataRow.
            dr[i] = record.getValue(i) ?? (object)DBNull.Value;
        }
        dt.Rows.Add(dr);
    }
 
    recordset.close();
    return dt;
}
 

In order to use the ParseLog generic method (as seen in the preceding snippet), we need to pass the type of log we're attempting to parse and a SQL-style query as parameters.

SQL-style query? Well, that is the true power of the Microsoft Log Parser tool: you're able to write queries against the parsed entries, e.g.

 
// W3C
DataTable dt = ParseLog<COMW3CInputContextClassClass>(@"select * from c:\logs\w3c\*.*");
 
// NCSA
DataTable dt2 = ParseLog<COMIISNCSAInputContextClassClass>(@"select * from C:\logs\ncsa\*.*");
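 
To come back to the "show_image_in_imgtag.php" issue from earlier: because the filtering and aggregation can be done in the query itself, getting a rough page-view count that ignores the image script only takes a WHERE clause and a GROUP BY. The following is just a sketch - it assumes an IIS-style W3C log that exposes a cs-uri-stem field, so for other formats the field names will differ:
 
// Rough sketch: count requests per page while excluding the VirtueMart image
// script. Assumes an IIS-style W3C log exposing a cs-uri-stem field.
DataTable pageViews = ParseLog<COMW3CInputContextClassClass>(
    @"SELECT cs-uri-stem, COUNT(*) AS Hits
      FROM c:\logs\w3c\*.*
      WHERE cs-uri-stem NOT LIKE '%show_image_in_imgtag.php%'
      GROUP BY cs-uri-stem
      ORDER BY Hits DESC");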
 


Awesome, so now we've got our logs all parsed and ready to go, everyone happy?

Well, not exactly, there are a few crucial shortcomings I need to point out...

As soon as we're working with custom fields everything goes a bit haywire, something I only noticed by accident.

The COMIISNCSAInputContextClassClass parser strictly requires the log file to be in NCSA format with an exact set of fields: no custom fields are allowed and lines that don't conform get excluded - which makes sense, since the log doesn't contain any column definitions. (What happened to me was that the administrators of this webserver added two extra fields to the log files, and all of a sudden the parser failed to parse those lines.)

The COMW3CInputContextClassClass parser handles things a bit better in that it actually supports custom fields (also due to the fact that the log file contains column definitions). Unfortunately there is a problem in the way it handles custom fields; observe the following extract:

 
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2011-05-24 11:49:35
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2011-05-24 11:49:35 ::1 POST /UIT/AppWebServices/LookupMruWebService.asmx - 5555 UIT\Administrator ::1 Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+.NET4.0C;+.NET4.0E) 200 0 0 2387
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2011-05-24 13:04:55
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs-host sc-status sc-substatus sc-win32-status time-taken
2011-05-24 13:04:55 ::1 GET /blog/247/default.aspx - 5555 - ::1 Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+.NET4.0C;+.NET4.0E) localhost:5555 401 1 2148074254 1640
 

Notice the #Fields rows: each one defines the columns represented in the rows that follow it, but oddly enough you will find that the last entry in the extract won't be parsed.

The reason is that the COMW3CInputContextClassClass parser takes the first #Fields line in the log file and assumes that it is an accurate representation of all the following rows (disregarding any other #Fields line that might differ, as seen in the above extract).

Personally I think it would have made a lot more sense if the parser returned all W3C logging fields (ones that don't exist in the log returned as nulls) and simply used each #Fields line as a reference for which column a field belongs to - which would mean we wouldn't need to exclude entries that are actually valid.
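
Until that happens, one possible workaround (just a rough sketch, not something the Log Parser API offers) is to pre-split the log ourselves whenever a new #Fields definition appears, write each chunk to a temporary file, and then run each chunk through ParseLog separately:
 
using System;
using System.Collections.Generic;
using System.IO;
 
// Splits a W3C log into temporary files, starting a new file every time a
// #Fields directive is encountered, so that every chunk has a single,
// consistent column definition. Names and paths here are purely illustrative.
public static List<string> SplitByFields(string logPath, string tempFolder)
{
    List<string> chunkFiles = new List<string>();
    List<string> pendingDirectives = new List<string>();
    StreamWriter writer = null;
 
    foreach (string line in File.ReadLines(logPath))
    {
        if (line.StartsWith("#Fields:"))
        {
            // A new field definition starts a new chunk.
            if (writer != null) writer.Close();
 
            string chunkPath = Path.Combine(tempFolder, "chunk_" + chunkFiles.Count + ".log");
            chunkFiles.Add(chunkPath);
            writer = new StreamWriter(chunkPath);
 
            // Flush the #Software/#Version/#Date lines buffered just before
            // this #Fields line, then write the #Fields line itself.
            foreach (string directive in pendingDirectives) writer.WriteLine(directive);
            pendingDirectives.Clear();
            writer.WriteLine(line);
        }
        else if (line.StartsWith("#"))
        {
            // Header directives belong to the chunk that the next #Fields opens.
            pendingDirectives.Add(line);
        }
        else if (writer != null)
        {
            writer.WriteLine(line);
        }
    }
 
    if (writer != null) writer.Close();
    return chunkFiles;
}
 
Each of the resulting chunk files can then be passed to ParseLog<COMW3CInputContextClassClass> on its own and the DataTables merged afterwards - not pretty, but it stops valid entries from silently disappearing.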

In part 3 we'll look at alternative ways of parsing log files, but in the next part of this post we're going to look at how to interpret the parsed log entries - status codes, useragents, bots/spiders etc.





Got burgled, some thoughts regarding technology



Last week (Sunday the 8th of May) one of my friends broke his leg playing soccer, so we decided to go visit him. As we got home at about 10:00 PM I noticed a rather strange-looking stick outside our flat door.

As we opened the door to our flat my wife noticed that the curtain by the door was pulled back and that there were pieces of glass all over the floor; a robber (probably one of the seven dwarfs, my guess is "Grumpy") had broken a window (seen below) and managed to climb through the burglar bars.

Image of the broken window


The robber got away with two laptops, a tablet, external drives and some of my wife's jewelry - all of it worth around R 30 000 (about $ 4 300) - along with all my hard-earned intellectual property (including backups). Incidentally, Jon Skeet also got burgled a few weeks before us (and also lost some data)...

All of this misfortune made me think about revising my backup strategy. It's obviously not very clever to have all your eggs in one basket (fire, theft, hardware failure etc.); we can't afford to have a single point of failure.

This makes a nice case for having your data somewhere on the server in the sky, aka the cloud, but this doesn't mean that you're immune to the odd cloud burst ;) - even Google won't guarantee nor take responsibility for the loss of your data on their servers, as seen in the following extract from their terms of service:


15. LIMITATION OF LIABILITY

15.1 SUBJECT TO OVERALL PROVISION IN PARAGRAPH 14.1 ABOVE, YOU EXPRESSLY UNDERSTAND AND AGREE THAT GOOGLE, ITS SUBSIDIARIES AND AFFILIATES, AND ITS LICENSORS SHALL NOT BE LIABLE TO YOU FOR:

(A) ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL CONSEQUENTIAL OR EXEMPLARY DAMAGES WHICH MAY BE INCURRED BY YOU, HOWEVER CAUSED AND UNDER ANY THEORY OF LIABILITY.. THIS SHALL INCLUDE, BUT NOT BE LIMITED TO, ANY LOSS OF PROFIT (WHETHER INCURRED DIRECTLY OR INDIRECTLY), ANY LOSS OF GOODWILL OR BUSINESS REPUTATION, ANY LOSS OF DATA SUFFERED, COST OF PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES, OR OTHER INTANGIBLE LOSS;



Let's have a quick look at a few possible things to consider (bear in mind I am no expert; these are just a few random thoughts).

  1. Location, Location, Location

    The biggest lesson I learned from this experience is probably about the physical location of my backups, i.e. I needed an offsite backup strategy as well.

    We can get a dedicated server, or use one of the many online services like Rsync.

    (Or keep backups using removable harddrives at your office and home)

  2. Type of data

    I also noticed the amount of time I needed to re-install my software to get everything up and running again - service packs, patches etc. (even now I am nowhere close to having my development environment fully back to normal).

    What a waste of time! Time I could have rather spent developing software.

    Now, instead of simply/solely having backups of my work, it would have been nice if I had an ISO or something that contains everything I need to get up and running again - a simple restore and there you go, sir.

    As for my email, it would have made a lot more sense if I used IMAP instead of POP.

  3. Frequency

    How often do we need to back up our data? Obviously, critical/active data as often as possible.

    Bandwidth is also something to consider when looking at the frequency of offsite/cloud backups (in my country bandwidth is currently quite expensive).

  4. Versioning

    It would be ideal if we could revert to a previous version of our data, in case you or someone else stuffed up :).

  5. Verification

    We need to verify that our backups are indeed running and not failing - and take it seriously, hehehe. We've got Team Foundation Server at work, but I neglected to check everything in due to a COM failure.


Now I am just waiting for my insurance to process my claim, though I am not convinced that my insurance is worth the money I am paying for it - perhaps I should rather invest in something that's not looking for a reason not to pay out ;)



