Using Regular Expressions and XML Classes to Parse Your Log Files

by Roy Osherove
06/09/2003

Most systems these days can generate log files to store data about the activity of the system. What about when you are asked to transform all of that data into usable information? I will show you how to use regular expressions and .NET's XML classes to turn your log files into a dataset to allow you to search, sort, and report on your data.

What We'll Cover

  • Using the Regex class and regular expression capture groups
  • Using the XmlTextWriter class
  • Using the DataSet class

What You'll Need

The Log File

Taking a look at one of the log files on my system, I see the following:

  
  
  25/05/2002 21:49 Search Dozer Anita1
  25/05/2002 21:51 Update Dozer Anita1
  26/05/2002 11:02 Search Manda Gerry2k
  26/05/2002 11:12 Update Manda Gerry2k
  27/05/2002 15:34 Search Anka Anita1
  .
  .
  .
  12/08/2002 10:14 Search Amber Huarez
   

Each line is built of the following columns, delimited by tabs:

  1. Date (day/month/year)
  2. Time (HH:MM)
  3. Action Type
  4. Record Name
  5. User Name
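
To make the column layout concrete before any regular expressions come in, here is a minimal sketch that splits one line on tabs and reads the first two columns as a timestamp. The class and method names (LogLineSketch, SplitColumns, ParseTimestamp) are hypothetical and not part of the original code; the "dd/MM/yyyy HH:mm" format string is an assumption based on the column list above.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class LogLineSketch
{
    // Splits one tab-delimited log line into its columns.
    public static string[] SplitColumns(string line)
    {
        return line.Split('\t');
    }

    // Combines the date and time columns into a DateTime.
    // The day comes first, so an explicit format is safer than
    // relying on the machine's regional settings.
    public static DateTime ParseTimestamp(string date, string time)
    {
        return DateTime.ParseExact(date + " " + time,
                                   "dd/MM/yyyy HH:mm",
                                   CultureInfo.InvariantCulture);
    }
}
```

A plain Split works only as long as every entry sits on one line and no field contains a tab, which is part of why the rest of the article reaches for regular expressions instead.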

The Game Plan

We need to transform this blob of text into something a little more structured. You might be thinking, "Hmm, why not import the file into Access using the tab-delimited wizard?" That solution would be totally OK if we had one file or just a few files. The solution here requires a little more automation. Plus, had the log files been written in a different format (for example, several lines per log entry), we'd be in trouble. What we need here is a structured data format; enter XML.

We can see several benefits from transforming these files into XML. With XML, we can:

  • Import the data into any number of programs including Excel or Access.
  • Directly load a dataset object from this XML, and perform searches on that dataset, in memory.
  • Create reports using XSLT.
  • Pretty much do anything we want with this data, since it is pure XML.

But how do we perform this magical act? By using the .NET Framework to:

  • Parse the log file text.
  • Write XML files from the parsed text.

Parsing the Log File

If you've worked with regular expressions before, you know that using them is one of the fastest ways of parsing text. In the .NET Framework, the main class to be used in this area is the System.Text.RegularExpressions.Regex class. One of the most powerful features of this class is the ability to specify, within the search pattern, named groups that make it easy to parse and retrieve parts of the text.

For example, given the date 17/08/1975 and the regular expression (?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2})), I can write code to retrieve any part of the text in the date by name, like so:

const string pattern = @"(?<day>\d{1,2})/" + 
                       @"(?<month>\d{1,2})/" + 
                       @"(?<year>(?:\d{4}|\d{2}))";

string GivenDate = @"17/08/1975";

Match match = Regex.Match(GivenDate,pattern);

if(match.Success)
{
  Console.WriteLine(string.Format("Day:{0},Month:{1},Year:{2}",
    match.Groups["day"].Value,
    match.Groups["month"].Value,
    match.Groups["year"].Value));
}

This yields:

Day:17,Month:08,Year:1975

Note: If you don't understand the code above, you should refer to the two articles mentioned at the beginning of this article.
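
The same named groups can also feed typed values rather than strings. The following is a minimal sketch, not part of the original article: the DateParts class and its TryParse method are hypothetical names, and the pattern is the one shown above. It also shows what happens when the text does not match.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class DateParts
{
    private const string Pattern =
        @"(?<day>\d{1,2})/(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2}))";

    // Returns false when no date is found; otherwise fills in the
    // three parts, converted from the captured text to integers.
    public static bool TryParse(string text,
                                out int day, out int month, out int year)
    {
        day = 0; month = 0; year = 0;

        Match match = Regex.Match(text, Pattern);
        if (!match.Success)
        {
            return false;
        }

        day = int.Parse(match.Groups["day"].Value);
        month = int.Parse(match.Groups["month"].Value);
        year = int.Parse(match.Groups["year"].Value);
        return true;
    }
}
```

Note that the pattern is not anchored, so it finds a date anywhere inside the input, and that it checks only the shape of the text, not whether the day and month are in range.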

Transforming to XML

Once we have a bunch of data, like the date example before, we can easily transform it to XML. The date from the previous example could be represented in XML like this:

<Date>
  <day>17</day>
  <month>08</month>
  <year>1975</year>
</Date>

Outputting this kind of XML used to be a pretty easy, but pretty error-prone, task. Sure, you could just slap each string into a memory buffer with XML tags, but the number of errors you could get makes this approach pretty untenable. The XmlTextWriter class in the System.Xml namespace rids us of a lot of details here, and very conveniently abstracts away all of the "boilerplate" code you need to write, allowing us to concentrate on the content we wish to write in our XML document.

To show just how easy it is to use this class, here's a class that takes in the Match object from the last example, writes an XML document with this data into a memory buffer, and returns this XML output:


public class XMLUtil
{
  public static string ToXML(Match regexMatch)
  {
    StringBuilder output = new StringBuilder();

    // Write the XML into an in-memory string buffer
    XmlTextWriter writer = 
      new XmlTextWriter(new StringWriter(output));

    // Make the XML more readable
    writer.Formatting = Formatting.Indented; 
 
    // Write the start of a standard XML document
    writer.WriteStartDocument();
    
    // Create the opening node for our date element
    writer.WriteStartElement("Date");
 
    // Write out each date element value as a separate node
    writer.WriteElementString("day", 
                              regexMatch.Groups["day"].Value );
    writer.WriteElementString("month", 
                              regexMatch.Groups["month"].Value );
    writer.WriteElementString("year", 
                              regexMatch.Groups["year"].Value );
                  
    // Close the date and finish the document
    writer.WriteEndElement(); 
    
    // Closes any open elements automatically
    writer.WriteEndDocument(); 
 
    // Close the writer
    writer.Close();
    return output.ToString();     
  }
}

The output looks like this:


<?xml version="1.0" encoding="utf-8"?>
<Date>
  <day>17</day>
  <month>08</month>
  <year>1975</year>
</Date>

As you can see, it's a very easy job to write XML using this class. I first create an in-memory StringBuilder that will house the created XML. I then hand it off to the constructor of a StringWriter, which is used to construct our XmlTextWriter object. I could just as easily have passed in any other System.IO.TextWriter-derived object; thus, I have the flexibility of writing to pretty much anything I want. I then call the WriteStartDocument method, which writes the <?xml version="1.0" ...?> declaration at the beginning of the XML text. (I don't have to call it, though. I can just start writing out elements right away.) I then open a new element that will contain sub-elements, using the WriteStartElement method. Then I write the actual values as child nodes of the open element using the WriteElementString method, passing in the name of the node and the value inside it. To finish, I call WriteEndElement to close the Date element and then WriteEndDocument, which closes any elements that are still open. Had Date been nested inside a larger document, I could call WriteEndElement alone and continue writing more elements.
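
To illustrate the difference between WriteEndElement and WriteEndDocument, here is a minimal sketch that nests several Date elements inside one root. The NestedWriterSketch class, the WriteDays method, and the Dates root element are hypothetical names chosen for this example.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class NestedWriterSketch
{
    // Writes one <Date> element per day value inside a <Dates> root.
    public static string WriteDays(string[] days)
    {
        StringBuilder output = new StringBuilder();
        XmlTextWriter writer = new XmlTextWriter(new StringWriter(output));

        writer.WriteStartDocument();
        writer.WriteStartElement("Dates");

        foreach (string day in days)
        {
            writer.WriteStartElement("Date");
            writer.WriteElementString("day", day);

            // Closes only the current <Date>, so the loop can go on
            // writing siblings inside <Dates>
            writer.WriteEndElement();
        }

        // <Dates> is still open here; WriteEndDocument closes it
        writer.WriteEndDocument();
        writer.Close();
        return output.ToString();
    }
}
```

Formatting is left at its default here, so the elements come out on a single line with no indentation.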

Possible Encoding Problem

Now, if you're trying out the code to produce this XML, you might find a little surprise in the generated XML. The first line might read <?xml version="1.0" encoding="utf-16"?> instead of <?xml version="1.0" encoding="utf-8"?>, because a StringWriter reports UTF-16 as its encoding and the XmlTextWriter copies that into the declaration. As a consequence, you might have some problems reading in the XML file later on. In order to control the encoding with which the XmlTextWriter writes the XML file, you'll need to specify the encoding in the XmlTextWriter's constructor. This also means that it is simpler to just pass a file name to the constructor rather than use an in-memory buffer, which will then be written to a file anyway. Here's the code to initialize the writer with a file name and the proper encoding:


//Create an XML textWriter object instance with a file name
XmlTextWriter xmlFile = 
  new XmlTextWriter(FileName + ".xml",Encoding.UTF8);

This should solve our problem, and since we are writing to a file, we can get rid of the code that writes the StringBuilder into the file.
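
If you do want to keep the XML in memory, a commonly used alternative (not the approach this article takes) is to subclass StringWriter so that it reports UTF-8. The sketch below assumes that approach; Utf8StringWriter and EncodingSketch are hypothetical names, not Framework classes.

```csharp
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper: a StringWriter that claims to be UTF-8.
// The string it builds is unchanged; only the name written into
// the XML declaration differs.
public sealed class Utf8StringWriter : StringWriter
{
    public Utf8StringWriter(StringBuilder sb) : base(sb) { }

    public override Encoding Encoding
    {
        get { return System.Text.Encoding.UTF8; }
    }
}

public static class EncodingSketch
{
    // Writes a tiny document and returns it, using either the plain
    // StringWriter or the UTF-8 variant above.
    public static string WriteDate(bool useUtf8)
    {
        StringBuilder output = new StringBuilder();
        StringWriter target = useUtf8
            ? new Utf8StringWriter(output)
            : new StringWriter(output);

        XmlTextWriter writer = new XmlTextWriter(target);
        writer.WriteStartDocument();
        writer.WriteStartElement("Date");
        writer.WriteElementString("year", "1975");
        writer.WriteEndDocument();
        writer.Close();

        return output.ToString();
    }
}
```

The declaration then says utf-8, but it is still up to you to save the string with a matching encoding when it eventually goes to disk.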

A More Generic Approach

Actually, we can make the writing function much more generic, by automatically going through all of the groups of a given match and writing their names and values as XML. The following bit of code shows how to do this:


// Write out all the groups of a match to XML
Regex reg = new Regex(pattern);
Match regexMatch = reg.Match(inputString);
if(regexMatch.Success)
{
  for (int i=1;i<regexMatch.Groups.Count;i++)
  {
    writer.WriteElementString(reg.GroupNameFromNumber(i),
                              regexMatch.Groups[i].Value);
  }
}

In order to achieve this, we need to have an instance of the Regex class to play with. We have to use this same instance to receive the Match object. Then we can use the Regex instance to retrieve the name of a group, based on its number:


reg.GroupNameFromNumber(i)

Don't ask me why the group name is not a property of the Group class. This means that for this functionality to work, we can't use the static Match() method of the Regex class, which makes things a bit more cumbersome. That's why, for the remainder of the code samples, I'll use the earlier version of the code, although it's less generic. You can then implement this method, if you wish, in your programs.
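
For readers who do want the generic version, here is one way it might look as a complete method. This is a sketch under a few assumptions: it uses Regex.GetGroupNames() rather than GroupNameFromNumber(), it skips groups with numeric names (group "0" is the whole match, and a number is not a valid XML element name), and the GenericGroupWriter class and rootName parameter are hypothetical.

```csharp
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class GenericGroupWriter
{
    // Writes every named group of the first match as a child
    // element of rootName and returns the resulting XML fragment.
    public static string ToXml(string pattern, string input,
                               string rootName)
    {
        Regex reg = new Regex(pattern);
        Match match = reg.Match(input);

        StringBuilder output = new StringBuilder();
        XmlTextWriter writer = new XmlTextWriter(new StringWriter(output));
        writer.WriteStartElement(rootName);

        if (match.Success)
        {
            foreach (string name in reg.GetGroupNames())
            {
                // Skip "0" and any unnamed (numbered) groups
                int ignored;
                if (int.TryParse(name, out ignored))
                {
                    continue;
                }
                writer.WriteElementString(name, match.Groups[name].Value);
            }
        }

        writer.WriteEndElement();
        writer.Close();
        return output.ToString();
    }
}
```

Calling ToXml with the date pattern from earlier, the input 17/08/1975, and "Date" as the root name produces the same day, month, and year elements as the hand-written version, without naming any of the groups in code.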

Putting It All Together

OK. We know how to parse, and we know how to output to XML. Let's try to wrap this up using a class that takes in a single log file and transforms it into an XML file. This class should receive the name of the log file to read, parse it line by line, and generate a [logFileName].xml file:


public class LogConverter
{
  public static void ConvertLogFile(string FileName)
  {
    string Pattern = @"(?<date>(?<day>\d{1,2})/" + 
      @"(?<month>\d{1,2})/(?<year>(?:\d{4}|\d{2}))" + 
      @"(?x))\t(?<time>(?<hour>\d{2}):(?<minutes>\d{2}))\t" + 
      @"(?<action>.*)\t(?<record>.*)\t(?<user>\w*)";
    string line = String.Empty;
            
    // Open the Log file for reading
    TextReader reader = 
      new StreamReader(File.OpenRead(FileName));
 
    // Create an XML textWriter object instance that
    // will write to in-memory String Buffer named 'output'
    StringBuilder output = new StringBuilder();
    XmlTextWriter xmlFile = 
      new XmlTextWriter(new StringWriter(output));
 
    // Initialize the xml writer
    xmlFile.Formatting = Formatting.Indented;
    xmlFile.WriteStartDocument();
    xmlFile.WriteStartElement("Entries");
            
    // Read each line in the file
    while((line = reader.ReadLine())!=null)
    {
      // Try to match the line using regular expressions
      Match parsed = Regex.Match(line, Pattern);
      if (parsed.Success)
      {
        // If we get a regex Match, we pass
        // the XML writer off to a method that will
        // use the Match groups to generate XML data
        // inside our XML document
        WriteAsXML(parsed,xmlFile);
      }
    }

    //Finish off any open elements 
    xmlFile.WriteEndDocument();
    xmlFile.Close();

    // Release the log file
    reader.Close();
                  
    // Write the xml log to a file
    StreamWriter fs =  File.CreateText(FileName + ".xml");
    fs.Write(output.ToString());
    fs.Close();
  }
 
  private static void WriteAsXML(Match regexMatch,
                                 XmlTextWriter writer)
  {
    // Open a new 'Entry' element
    writer.WriteStartElement("Entry");
 
    // Write out each date element value as a separate node
                  
    // Date: Full format, and separated to day,month,year
    writer.WriteElementString("date",
                              regexMatch.Groups["date"].Value );
    writer.WriteElementString("day",
                              regexMatch.Groups["day"].Value );
    writer.WriteElementString("month",
                              regexMatch.Groups["month"].Value );
    writer.WriteElementString("year",
                              regexMatch.Groups["year"].Value );
                  
    // Time: Full format, hours, and minutes
    writer.WriteElementString("time",
                              regexMatch.Groups["time"].Value );
    writer.WriteElementString("hour",
                              regexMatch.Groups["hour"].Value );
    writer.WriteElementString("minutes",
                              regexMatch.Groups["minutes"].Value );
            
    // Action, record, and user
    writer.WriteElementString("action",
                              regexMatch.Groups["action"].Value );
    writer.WriteElementString("record",
                              regexMatch.Groups["record"].Value );
    writer.WriteElementString("user",
                              regexMatch.Groups["user"].Value );
 
    writer.WriteEndElement();
  }
}

This class is pretty straightforward. Here's what's taking place:

  • The class receives a file name to parse.
  • It creates an in-memory XmlTextWriter object and initializes it to the proper settings. It then creates an open Entries element inside of it, into which all of the child Entry elements (one for each line) will be written.
  • It then goes through each line in the log file, and uses the Regex.Match method on that line, using a pattern that matches each sub-group we identified at the beginning of this article.
  • If the match was a success, it passes both the XmlTextWriter instance and the Match object to a separate method, which writes the group names and values into the XML writer instance.
  • After going through all of the lines in the log file, it closes all of the elements in the XML file and writes it all to a file named the same as the original log file, with the addition of ".xml" at the end.

If we now open the generated XML log file, we can see the following:

<?xml version="1.0" encoding="utf-8"?>
<Entries>
  <Entry>
    <date>25/05/2002</date>
.
.
.
.
</Entries>
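
The converter can also write straight to the output file with an explicit encoding, as suggested in the encoding section. The following is a minimal sketch of that variant, not the article's class: LogConverterSketch and its Convert method are hypothetical names, the pattern is simplified to the five top-level columns, and the using statements assume a Framework version in which XmlTextWriter is disposable.

```csharp
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical variant of the converter, for illustration only.
public static class LogConverterSketch
{
    // Simplified pattern: one named group per tab-delimited column
    private static readonly Regex LinePattern = new Regex(
        @"^(?<date>\d{1,2}/\d{1,2}/(?:\d{4}|\d{2}))\t" +
        @"(?<time>\d{2}:\d{2})\t" +
        @"(?<action>[^\t]*)\t(?<record>[^\t]*)\t(?<user>\w*)");

    private static readonly string[] Columns =
        { "date", "time", "action", "record", "user" };

    // Converts logPath to xmlPath and returns the number of entries
    // written. Lines that do not match the pattern are skipped.
    public static int Convert(string logPath, string xmlPath)
    {
        int written = 0;

        // 'using' closes both files even if an exception is thrown
        using (StreamReader reader = new StreamReader(logPath))
        using (XmlTextWriter xml = new XmlTextWriter(xmlPath, Encoding.UTF8))
        {
            xml.Formatting = Formatting.Indented;
            xml.WriteStartDocument();
            xml.WriteStartElement("Entries");

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Match match = LinePattern.Match(line);
                if (!match.Success)
                {
                    continue;
                }

                xml.WriteStartElement("Entry");
                foreach (string name in Columns)
                {
                    xml.WriteElementString(name, match.Groups[name].Value);
                }
                xml.WriteEndElement();
                written++;
            }

            xml.WriteEndDocument();
        }

        return written;
    }
}
```

Because the writer owns the file, the declaration and the bytes on disk agree on UTF-8, and the whole log never has to sit in memory as one string.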

Letting the User Search for Data

Now that we have our data stored as structured XML, we can use it to let the user easily search through it. To do this, we'll use a very easy technique already given to us inside the .NET Framework. We'll use a DataSet object to load our XML data, then we'll select data from the dataset using a filter that can be specified by the user. We can then display the resulting DataRows to the user.

The DataSet class has a ReadXml method, which allows us to pass it a file name and have it automatically load the data into a table structure inside the dataset. For our purposes, we can send in the file name without any additional parameters. What will be generated inside of the dataset's memory will be a table that contains a collection of DataRows, each one holding a set of columns that corresponds to the set of elements we wrote for each log entry -- date, time, hour, action, and so on. Once we have this table in place, we can use the DataTable's Select method to retrieve any number of DataRow objects that match the filter we provide. Here's the code to do this:


private void LoadXMLFile()
{
  // Load the XML file into the dataset
  m_ds.ReadXml(txtFileName.Text);
                
  // Show All log entry Rows at first load
  // by passing in a 'true' filter
  // this is just like specifying 
  // SELECT * FROM ENTRIES WHERE true
  RefreshResults("true");
}

private void RefreshResults(string filter)
{
  try
  {
    // Clear the result list view
    lvResults.Items.Clear();

    // Get the first datatable inside the dataset
    // we know this one contains the data we need
    DataTable table = m_ds.Tables[0];

    // Get the datarows that match the user's filter
    // the filter can be any valid SQL filter
    DataRow[] rows = table.Select(filter);

    foreach(DataRow row in rows)
    {
      // Add an item to the list view
      ListViewItem item = 
        new ListViewItem(row["date"].ToString());
      item.SubItems.Add(row["time"].ToString());
      item.SubItems.Add(row["record"].ToString());
      item.SubItems.Add(row["action"].ToString());
      item.SubItems.Add(row["user"].ToString());
                      
      lvResults.Items.Add(item);
    }
  }
  catch(Exception e)
  {
    // The user might pass invalid filter expressions, 
    // in which case we get an exception notifying
    // the filter parsing error in question
    MessageBox.Show(e.Message);
  }
}

Using this straightforward code, we can let the user load any XML file, and filter its contents based on a SQL-like filter expression. Basically, if you have written SQL code, you can use a WHERE clause to select the specified rows. We receive an array of DataRows, and since we know beforehand the names of the columns for each DataRow (same as the XML elements in our log file), we can just display the values for each column.
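
To see the filtering on its own, without the form and its controls, here is a minimal sketch that loads XML from a string and applies a filter. The FilterSketch class and its Filter method are hypothetical names; the XML is assumed to have the Entries/Entry shape generated earlier.

```csharp
using System.Data;
using System.IO;

// Hypothetical helper class, for illustration only.
public static class FilterSketch
{
    // Loads XML text into a DataSet and returns the rows of the
    // first table that satisfy the filter expression.
    public static DataRow[] Filter(string xml, string filter)
    {
        DataSet ds = new DataSet();

        // ReadXml infers the schema: each <Entry> becomes a row and
        // each child element becomes a column
        ds.ReadXml(new StringReader(xml));

        return ds.Tables[0].Select(filter);
    }
}
```

A filter such as action = 'Search' or action = 'Search' OR action = 'Update' behaves like the body of a WHERE clause. Since the schema is inferred from the XML, every column is treated as text, so comparisons on the date column compare strings rather than calendar dates.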

We could have just as easily looped through all of the available columns and displayed each one's value to the user, without even knowing what kind of data was inside of our DataRow. We could dynamically add columns to our ListView corresponding to the name of each DataColumn in the DataRow, and voila -- you have yourself a more generic searching mechanism for practically any simple XML file.
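
A rough sketch of that generic loop, with plain strings standing in for the ListView so it can run anywhere, might look like this. The GenericRowDump class and its Describe method are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Data;
using System.IO;

// Hypothetical helper class, for illustration only.
public static class GenericRowDump
{
    // Returns one "name=value, name=value" line per row, taking the
    // column names from the table instead of hard-coding them.
    public static List<string> Describe(string xml)
    {
        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(xml));
        DataTable table = ds.Tables[0];

        List<string> lines = new List<string>();
        foreach (DataRow row in table.Rows)
        {
            List<string> cells = new List<string>();
            foreach (DataColumn column in table.Columns)
            {
                cells.Add(column.ColumnName + "=" + row[column]);
            }
            lines.Add(string.Join(", ", cells.ToArray()));
        }
        return lines;
    }
}
```

In the Windows Forms version, the inner loop would add one ListView column per DataColumn and one sub-item per value instead of building strings.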

Summary

We've learned the following:
  • Parsing log files is easy.
  • Writing XML files is easy.
  • Searching XML files is easy.
  • Generating XML log files and searching them should be pretty darn easy!


Roy Osherove has spent the past 5+ years developing data-driven applications for various companies in Israel. He's acquired several MCP titles, written a number of articles on various .NET topics (most of which can be found on his weblog), and loves discovering new things every day.

