WindowsDevCenter.com
oreilly.comSafari Books Online.Conferences.

advertisement


AddThis Social Bookmark Button

Dealing with Group By and Order By in .NET

by Satya Komatineni
01/14/2003

A Short Introduction to DataSets

When you are working with ASP.NET, it is more than likely that you are exposed to DataSets. A DataSet is a collection of tables. A table is an abstract concept representing rows and columns and can represent data not only from relational sources but from other sources as well. Sometimes these tables can be interlinked through foreign key relationships as well, although that is not essential to the current discussion. When you execute a database command, you can receive one or more DataTables that can be embedded in a DataSet.

A Short Introduction to Group By and Order By

A plain SQL SELECT statement will yield a collection of rows where each row has a set of columns. Usually each row is distinct from the other row in the set. But sometimes you have a column that is repeated in multiple rows. If you do an order by on the repeated columns, you will see those rows together. With this clause you can clearly see the repeated columns and the distinct columns. If the columns that are distinct are numbers, then you can use the group by clause on the repeated columns, so that you can do such aggregate functions as SUM, MULT, etc. With a group by you will have a lesser number of rows where each row is again distinct.

Sometimes you cannot do a group by because the non-repeated columns are not amounts that can be aggregated, the necessary function is not available, or the client side needs to take actions on each of these rows. Consider the following example:

Related Reading

ADO.NET in a Nutshell
By Bill Hamilton, Matthew MacDonald

Data example

Columns that can be repeated: supplier, email, date

Columns that are distinct: order number, quantity


SELECT supplier, email, date, 
	order-number, quantity 
FROM ...

sup  date email order qty
**** **** ****  ***** ******
s1   d1   se1   o1    10
s1   d1   se1   o2    20
s1   d2   se1   o3    14
s1   d2   se1   o4    15

As you can see, supplier one on day one shipped two orders (so1,so2) and she has the email of (se1 - repeated).

If the above DataSet were to be bound to a DataGrid, then the user will see the repeated rows. Assume that the user wants to see the following on the screen instead:


Shipping   activity
Supplier   date
********** **********
s1         d1
s1         d2
s2         d2

Supplier1 has some activity on day1 and day2.
Supplier2 has some activity on day2.

If the user were to SELECT (s1, d1) row, then the Web site will send a detailed report about orders for that supplier on that day. One way to do this is not to retrieve the orders in the first place, so that we do not have to eliminate the rows. We can then issue the second SELECT directly, so that we will only get the suppliers and their activity on a given date. When the user selects a row, then we can issue a SELECT at that time to get the other orders. This may not be the right thing to do every time. If the user selects a bunch of rows, then we have to call the SELECT that many times and it might slow down things. Or, a subsequent SELECT means you may get more or less data because the database may have changed in the interim. So it depends. So, assume for a second that it is advantageous to retrieve the rows in a single SELECT.

Now the question becomes how can we narrow the first set of rows to the second set of rows. One can perhaps use the DataSet filters to the advantage. When I have looked into filters, I could not find an obvious solution. Perhaps there is. I will leave that to you to investigate.

The proposed solution will break up output of the first SELECT statement into the following:


row1,s1,d1,se1(hidden) // For supplier 1 on day 1
     o1,10
     o2,20
Row2,s1,d2,se1(hidden) // For supplier 1 on day 2
     o3,14
     o4,15
Row3,s2,d1,se2(hidden) // For supplier 2 on day 1

..so on and so forth

As a result now we will have multiple tables/DataSets

Datatabe 1:
     row1,s1,d1,se1(hidden) // For supplier 1 on day 1
     row2,s1,d2,se1(hidden) // For supplier 1 on day 2
     row3,s2,d1,se2(hidden) // For supplier 2 on day 1

DataSet2
     row1-datatable
          o1,10
          o2,20
     row2-datatable
          o3,14
          o4,15


C# Class To Represent a Split DataSet

Here is how you can devise a c# class to represent these resulting DataSets:


public class SplitDataSet
{
  public DataTable mainDataTable;
  public DataSet childTables;

  public SplitDataSet():this(null,null){}

  public SplitDataSet(DataTable mainDt, DataSet childDataSet)
  {
		mainDataTable = mainDt;
		childTables = childDataSet;
  }
}

mainDataTable represent the top level rows. childTables is a DataSet where each DataTable has a name derived from the row for which these internal rows belong to.

How To Use SplitDataSet


 //obtain a split DataSet using a DataTable
 SplitDataSet sds = 
                  DataSetUtils.splitTableUsingOrderBy(dt,
                                          new PANSplitConfigurator());

 // bind the main table to a data grid
 dataGrid.DataSource = sds.mainDataTable;

 // bind it
 dataGrid.bind();

 // save the split DataSet in the session 
 // for processing individual rows
 Session["sds"] = sds;

Where the static utility function splitTableUsingOrderBy will walk through the output of the main stored procedure (or SQL) and split it, outputting a SplitDataSet object. My assumption is that the DataTable that is passed into this function is obtained through external means, and the code for it is not shown here. The main DataTable from this split DataSet can be used as a data source for a datagrid on the aspx page. And, lastly, you may want to save the split DataSet as a whole in the session. This is because when we have displayed the main table we have access to only the primary rows. When a primary row is selected in the gui, we need to process the child rows for that primary row. These child rows are available in the session for processing.

Processing the Child Tables Based on the Primary Row Selected

When a primary row is selected, you will retrieve a key representing that row. This could be as simple as row1, row2, etc. The child tables in the DataSet are keyed by this name. Here is a code sample how the child rows for a primary row ("row1") are retrieved and processed:


     SplitDataSet sds = (SplitDataSet)Session["sds"];
     DataTable dt = sds.childTables.Tables["row1"];
     // process dt

Logic for Splitting the Rows

The first step is to assume that the select statement will do an order by on the repetitive columns. In the above example it would be:


     Select supplier, date, email, order, quantity
     From tableT
     Order by supplier, date

As you walk through the results, every time (supplier, date) changes, you know that you have started a new child table. You will collect these child rows and create a DataTable, which will be named using a name derived from the primary row. For this function to work, you need to know when a change of row implies a new child DataSet and how should it be named. Obviously this aspect is variable. This is accommodated into an interface which will be implemented by the client calling this function. You will see this more clearly when you examine the code below:

Annotated Source Code for the Split Function


public static SplitDataSet splitTableUsingOrderBy(
                    DataTable dt,          // datatable that is order by
                    ISplitConfigurator sc) // a configurator to tell
                                           // which rows are the same 
                                           // and which are not
{
	// if the incoming DataTable has no rows, return an empty split DataSet
	if (dt.Rows.Count == 0)
		return new SplitDataSet(dt,null);

	// Rows exist
	SplitDataSet sds = new SplitDataSet();
	
	// Create a primary DataTable. A DataTable needs a name.
	// Lets call it "maintable"
	DataTable primaryDataTable = createANewDataTable ("maintable", dt);
	
	// Add an extra column to the primary table
	// to keep the child table name so that you can locate your children
	// when a primary row is selected
	primaryDataTable.Columns.Add(new DataColumn ("childTable",
	                                 System.Type.GetType("System.String")));
	
	// Create a child DataSet to hold the child tables
	DataSet childDataSet = new DataSet("childDataSet");
	
	// current working set variables
	DataTable      curChildDataTable = null;
	DataRow   prevDataRow = null;

	int rownum=0;
	foreach(DataRow cdr in dt.Rows)
	{
		if (prevDataRow == null)
		{
			// very first row
			
			// Create a new child DataTable
			curChildDataTable = createANewDataTable(
                                        sc.getTableName(cdr,rownum), dt);
			
			// add the current row to the child
			DataRow tcdr = curChildDataTable.NewRow();
			copyADataRow(cdr,tcdr);
			curChildDataTable.Rows.Add(tcdr);
			
			// add the row to the primary table
			DataRow pdr = primaryDataTable.NewRow();
			copyADataRow(cdr,pdr);
			pdr["childTable"]=sc.getTableName(cdr,rownum);
			primaryDataTable.Rows.Add(pdr);
		}
		else if (sc.compare(prevDataRow,cdr))
		{
			// rows are the same
			// add the current row
			// add the current row to the child
			DataRow tcdr = curChildDataTable.NewRow();
			copyADataRow(cdr,tcdr);
			curChildDataTable.Rows.Add(tcdr);
		}
		else
		{
			// rows are different
			// Add the current table
			childDataSet.Tables.Add(curChildDataTable);
			
			// Create a new child DataTable
			curChildDataTable = createANewDataTable(
					sc.getTableName(cdr,rownum),
					dt);
			
			// add the current row to the child
			DataRow tcdr = curChildDataTable.NewRow();
			copyADataRow(cdr,tcdr);
			curChildDataTable.Rows.Add(tcdr);
			
			// add the row to the primary table
		     DataRow pdr = primaryDataTable.NewRow();
               copyADataRow(cdr,pdr);
               pdr["childTable"]=sc.getTableName(cdr,rownum);
               primaryDataTable.Rows.Add(pdr);
          }
          // either way
          prevDataRow = cdr;
          rownum++;
     }// exited the for loop
     
     // add the table to the DataSet
     childDataSet.Tables.Add(curChildDataTable);
     
     return new SplitDataSet(primaryDataTable,childDataSet);
} // end of function

Use of ISplitConfigurator

If you have noticed the code above, you see that two subsequent rows are compared using an ISplitConfigurator that is passed in. ISplitConfigurator is also used to figure out the key name for a child table. Here is the ISplitConfigurator interface:


public interface ISplitConfigurator
{
     // Given two data rows tell me if they are same
     bool compare(DataRow dr1, DataRow dr2);
     
     // Given a data row and its row number tell me a 
     // string key I can use for the children of that data row
     string getTableName(DataRow dr1, int rownum);
}

Here is how we can implement our configurator:


public class MyConfigurator : ISPlitConfigurator
{
     // Give two data rows tell me if they are same
     bool compare(DataRow dr1, DataRow dr2)
     {
          // compare suppliers
          string supplier1 = (string)dr1["supplier"];
          string supplier2 = (string)dr1["supplier"];
          if (supplier1 != supplier2) return false;
          
          // compare dates
          ...
          return true;
     }
     
     // Given a data row and its row number tell me a string key I can 
     // use for the children of that data row
     string getTableName(DataRow dr1, int rownum)
     {
          return "row" + Convert.ToString(rownum);
     }
}

Configurator Pattern: Inheritance Vs. Delegation

Let me digress a bit here and talk about what I call a configurator pattern. I apologize if there is already a name for it in the pattern languages, and if there is one, I am afraid I have to leave the reader to research it. When you want to specialize the behavior of an object, you use virtual inheritance thereby allowing the end users to define their own functionality. This is specialization using inheritance. For example, a base class can implement a series of functions that will perform a concrete work (referred to as implementing a protocol) in terms of pure private virtual functions defined in that class. The implication is that the derived classes will provide what happens in these virtual functions. By making these functions private you have ensured that they are only called by the base class and no one else. (Sometimes, this is also called the template and the hook pattern. Template is the protocol, and specialization is the hook).

Sometimes inheritance implies bringing all of the dependent base classes into the picture. The alternative is to define a pure interface for the specific purpose of configuring that behavior. And then let the clients implement that behavior and pass it into the protocol implementor. In the case above, the utility function is the protocol implementor. And the MyConfigurator determines what should happen. By using a factory to load up these configurators on the fly, one can dynamically alter their strategies and end results.

Additional Functions That Are Used In Implementing Split Function

If you have noticed the code a bit more closely, you will see that we have used two more functions:


public static DataTable  
  createANewDataTable(string name, DataTable dt); 

public static void  
  copyADataRow(DataRow srcRow, DataRow targetRow); 

Let me start with creating a new DataTable:


public static DataTable createANewDataTable(string name, DataTable dt)
{
     DataTable nt = new DataTable(name);
     foreach(DataColumn dc in dt.Columns)
     {
          nt.Columns.Add(new DataColumn(dc.ColumnName,dc.DataType));
     }
     return nt;
}

A DataTable has a name, a set of columns, and a bunch of rows that you can add to. You cannot get a data column from one table and add it to another table even though it has the same type and name. You have to create new columns and add them to the table. That is exactly what the code above is for. Abstracting this process of creating a new table is based on an existing table.

The same thing applies to copying data rows from one DataTable to the other. Once a data row is attached to a table, you can't attach it to another table, so the child rows of a primary row are newly created from their older counterparts. The following code demonstrates this process:


public static void copyADataRow(DataRow srcRow, DataRow targetRow)
{
     int i=0;
     foreach(object item in srcRow.ItemArray)
     {
          targetRow[i++] = item;
     }
     return;
}

Summary

You have seen how you can generalize the process of splitting DataTables based on order by clauses. You also have seen a pattern called configurator pattern that you can use in place of inheritance for your advantage. You also have seen the limitations in creating new DataSets using existing ones. Once this utility is available you can use it for such things as pivot tables as well. I am also hoping that the code presented will work as a sample code that you can refer to while working with DataSets, tables, rows, etc. in the .NET world.

I also leave you with a few extra thoughts. The first one is: could we have used XML DataSet support available in the SQL server? Again, I haven't looked into it to discard it outright. You may want to see if that is an alternative. The second regards filters. Could we use filters to display a subset of an existing DataSet in the case above? And, thirdly, can you re-think your database access itself so that you don't have to break the rows up in the client tier, that is, can you reformulate the problem so that you don't have to do this to begin with?

Satya Komatineni is the CTO at Indent, Inc. and the author of Aspire, an open source web development RAD tool for J2EE/XML.


Return to ONDotnet.com