Statsar Statistics Library

v1.0.1 for .NET

Product Guide



More Information...

Valid XHTML


3. Processing Data

3.1 Overview

The Statsar statistics library contains several methods for processing data. This chapter explains how these methods work. Fundamentally the library provides 5 data processing operations:

  1. Filtering - Filtering rows or columns to exclude data.
  2. Transforming - Combining columns to produce new data.
  3. Grouping - Grouping rows to produce summaries.
  4. Splitting – Splitting a data sheet into sub-sheets.
  5. Sorting - Changing the order of rows or columns.

Each of these methods may be applied before statistics calculation (pre-processing of data) or after statistics have been calculated (post-processing of data).

3.2 Filtering Rows

3.2.1 The DataSheet Filter Method

The DataSheet.Filter method accepts an expression which is used to filter rows. This method does not alter the data sheet. Instead a new data sheet with filtered rows is returned. The Statsar statistics library includes a built-in powerful expression engine, which is defined in detail in Appendix B. However, the way in which expressions work are best understood by example.

To illustrate how the Filter method works, consider the following data sheet which shows annual salaries, in thousands of US dollars, for managers in selected US states:

State Salary
Arizona 35
California 49
Florida 60
Louisiana 60
Maryland 60
New Jersey 65
Ohio 50
Texas 71

Table 3.1 - Annual salaries by US state.

Suppose we wish to filter this data to include only states with a salary of at least $60,000 or above. The expression we can use is:

"Salary >= 60"

We can filter the sheet with the following single line of code:

[C#]
// Filter rows to contain only salaries of $60,000 or above.
DataSheet filterSheet = sheet.Filter("Salary >= 60");

The Filter method does not modify the existing data sheet. Instead, the expression is evaluated, and all rows that satisfy the filter are added to a new data sheet, named filterSheet in the above code snippet. The new data sheet is illustrated below:

State Salary
Florida 60
Louisiana 60
Maryland 60
New Jersey 65
Texas 71

Table 3.2 - Salaries of at least $60,000.

The resulting data sheet contains only the states Florida, Louisiana, Maryland, New Jersey and Texas, since these are the only states with an annual salary of at least $60,000 or above.

As a second example, suppose that instead of salaries of at least $60,000 we wanted to find out which states have a salary of between $45,000 and $55,000 inclusive. We could use the following expression to do this:

"Salary >= 45 AND Salary <= 55"

However, the expression syntax also supports the BETWEEN keyword, meaning that we can also use the following more readable expression:

"Salary BETWEEN 45 AND 55"

The following line of code illustrates using this expression:

[C#]
// Filter rows to contain salaries between $45,000 and $55,000.
DataSheet filterSheet = sheet.Filter("Salary BETWEEN 45 AND 55");

The resulting data sheet is illustrated below:

State Salary
California 49
Ohio 50

Table 3.3 - Salaries between $45,000 and $55,000.

The filtered data sheet contains only the states California and Ohio, since these are the only states with a salary of between $45,000 and $55,000. The Filter method and expressions provide a powerful way to filter rows using only a single line of code. For a complete definition of the syntax of expressions, please see Appendix B.

3.2.2 The SheetFilter Class

As well as accepting an expression, the DataSheet.Filter method is overloaded to accept an instance of the SheetFilter class. This class is found in the Simplexar.Statsar.Filters namespace, and requires the following using statement at the top of each code file that uses the class:

[C#]
using Simplexar.Statsar.Filters;

The SheetFilter class combines several data processing operations that represent a filter. This combined filter operates on a data sheet's rows. This class is instantiated and configured, and then passed to the DataSheet.Filter method as a parameter.

The SheetFilter class has three constituents, each of which is optional:

  • The Select method is used to specify what the resulting columns will be. The resulting columns can be the same as the original columns, or can be combinations or transformations of the original columns.
  • The GroupBy method is used to specify which columns should be grouped on. Using this method means that the resulting data sheet will be a summary of the original sheet.
  • The Filter property is used to specify an expression to filter rows. Only rows where the filter is satisfied will be included in the resulting data sheet.

If only the Filter property is set, and the Select and GroupBy methods are not used, then using the SheetFilter class is equivalent to using the DataSheet.Filter overload that accepts an expression.

3.2.3 Transforming Data

The simplest use of the SheetFilter class is to transform data. As a first example, consider the following data sheet which represents student's scores in English and in French:

Name English French
Graham 76 55
John 62 88
Mark 43 92
Andy 26 64
Marcus 55 14

Table 3.4 - Test scores for English and French.

Suppose we wish to create a new data sheet with two columns: the student's name, and the total score. The two expressions we are looking for are:

"Name" and "Total = English + French"

The first expression means that we wish to select the student's name in the resulting sheet. The second expression uses the equality operator to mean that we wish to create a new column named "Total" which is equal to the sum of the English and French scores.

We can use SheetFilter class to produce the transformation, as the following code illustrates:

[C#]
// Configure sheet filter.
SheetFilter filter = new SheetFilter();
filter.Select("Name", "Total = English + French");
 
// Transform data to produce new sheet.
DataSheet transformSheet = sheet.Filter(filter);

On the second line of code above, the select method on the SheetFilter class is used to specify a list of transform columns. These will be the columns that define the transform sheet. This resulting data sheet is illustrated below:

Name Total
Graham 131
John 150
Mark 135
Andy 90
Marcus 69

Table 3.5 - Total scores for each student.

The resulting data sheet contains two columns, the name and the total score, which is made up of the sum of English and French scores.

As a second example of transforming data, suppose that we wish to create a new data sheet with the columns "Name" and "Average" score. However, suppose that we also wish to include only English scores of at least 50. We can use the following two select expressions to transform the data:

"Name" and "Average = (English + French) / 2"

We will also need to use the following filter expression:

"English >= 50"

The first select expression indicates that we wish to select the student's name. The second select expression uses the equality operator to mean that we wish to create a new column named "Average" which is equal to the mean of the English and French scores. Finally, the filter expression allows us to select only rows with a score of at least 50. The SheetFilter class can be used to perform both the column transformation and the row filtering. The code snippet below illustrates this:

[C#]
// Configure sheet filter.
SheetFilter filter = new SheetFilter();
filter.Select("Name", "Average = (English + French) / 2");
filter.Filter = "English >= 50"
 
// Transform data to produce new sheet.
DataSheet transformSheet = sheet.Filter(filter);

The transformed data sheet is illustrated below:

Name Average
Graham 65.5
John 75
Marcus 34.5

Table 3.6 - Average test scores.

The first line of the above code instantiates a new SheetFilter object. The second line specifies the select expressions of Name and Average – these are the expressions that are used to transform the data. The third line of code specifies the row filter, i.e. that we wish to include only rows with an English score of at least 50. Finally, the last line of code above uses the filter method to produce the transform. The resulting data sheet contains only rows for Graham, John and Marcus since these were the only students with English scores of at least 50.

3.2.4 Grouping Data

As well as a list of select columns and a row filter, the SheetFilter class also provides a GroupBy method. This method allows a list of columns to be specified so that the original data can be grouped. When grouping data, rows are combined together from the original data to produce a set of summary rows. To illustrate how grouping works, consider the original data sheet below (i.e. before any filtering is performed):

Name Height Weight Jog Tennis
Robert 68 135 True True
George 67 180 False False
Agatha 63 110 False False
Sandy 60 125 True True
Bill 65 160 True False

Table 3.7 - Sports data.

When performing a grouping operation, the SheetFilter select method can accept a list of aggregate statistic functions. Aggregate functions (such as COUNT, MIN, MAX, MEAN etc.) operate on each group of rows. Expressions may contain aggregate functions – these are the same as the descriptive statistics functions described in Chapter 4.

Given the above data, suppose we are interested in considering two groups of people, those who jog and those who do not jog. We can summarize the data above by using a sheet filter. For each of these two groups, suppose we wish to calculate the number of people in the group, and the average weight. The select expressions we are interested in are:

"Jog", "COUNT(Jog)" and "Mean = MEAN(Weight)"

The group expression we are interested in is

"Jog"

We can use the following code to group the data:

[C#]
// Configure sheet filter.
SheetFilter filter = new SheetFilter();
filter.Select("Jog", "COUNT(JOG)", "Mean = MEAN(Weight)");
filter.GroupBy("Jog");

// Filter data to produce summary.
DataSheet summarySheet = sheet.Filter(filter);

The first line of the code above creates a new SheetFilter object. The second line specifies what columns the summary sheet will have. The third line specifies that we are grouping by the Jog column. The last line creates the summary sheet by applying the filter method, passing in the SheetFilter object.

To understand the second line, let us break it down into its constituent parts. We wish to select 3 columns. These are:

  1. Jog
  2. COUNT(Jog)
  3. Mean = MEAN(Weight)

The first column will be the original Jog column. The second column will be the number of joggers. Since we have not named the output and we are using an aggregate function, the default will be applied (hence the result is named Column2). The third column we wish to select is the mean of the weight column. By using the equality operator we can name the column as "Mean".

The code above results in the following summary data sheet:

Jog Column2 Mean
True 3 140
False 2 180

Table 3.8 - Grouped sports data.

As can be seen, two groups have been created, i.e. the rows have been combined into two groups depending on the value of the group column. In addition, we have used aggregate functions to create summary statistics.

3.3 Filtering Columns

3.3.1 The Column Filter Method

As well as filtering data sheet rows, it is possible to filter column values. There are two ways to filter column values, either by using an expression, or by using the TrimFilter class. Both these methods return an array containing the resulting column values.

To illustrate filtering a column's values using an expression, consider the following data sheet that represents fictional stock prices:

Stock Price
ABP 65.43
XPP 32.23
TFN 12.32
JPW 73.43
MNA 55.32
SBM 44.63

Table 3.9 - Fictional stock prices.

Suppose we wish to return an array containing the prices of stocks of at least 50. The VALUE keyword in an expression is used to indicate the values of the column being filtered. Thus, the expression that we are looking for is:

"VALUE >= 50"

The following C# code may be used to return the values we are interested in:

[C#]
DoubleColumn priceColumn = (DoubleColumn)sheet.Columns["Price"];
double[] values = priceColumn.Filter("VALUE >= 50");

The first line of the code above gets hold of the price column. We need to cast the column to a DoubleColumn since the column contains real numbered values. The second line of code filters the column with an expression that contains the VALUE keyword. The resulting filtered values are returned as an array of double values.

As a second example, suppose we wanted to select the values of the first 3 stocks. The INDEX keyword in an expression is issued to indicate the position of a value within the column. Since we start counting positions with position 0, the expression that we are looking for is:

"INDEX < 3"

The following C# code may be used to return the filtered values

[C#]
DoubleColumn priceColumn = (DoubleColumn)sheet.Columns["Price"];
double[] values = priceColumn.Filter("INDEX < 3");

This will result in an array of double values containing 65.43, 32.23 and 12.32, because these are the prices with positions less than 3, since 0 is if the first position index.

The TrimFilter class allows you to trim a certain proportion of values from a column. For example, to trim 15% of values from both ends of a column, we can use the following code:

[C#]
double[] trimmedValues = column.Filter(new TrimFilter(0.15));

The resulting values will be sorted in ascending order. The TrimFilter constructor also accepts a member of the TrimType enumeration. This may be All, Top or Bottom to filter from both ends, the top end or the bottom end respectively. For example, the following C# code will trim 20% of values from the bottom end of a column:

[C#]
double[] trimmedValues = column.Filter(
    new TrimFilter(TrimType.Bottom, 0.2));

The resulting trimmed values will be sorted in ascending order.

3.3.2 The Column FilterIndices Method

The column Filter method accepts a filter expression using the VALUE or INDEX keywords, and returns an array of filtered values. The FilterIndices method also accepts a filter expression, but instead returns the positions of the filtered values.

As well as filtering columns of real numbered values, it is also possible to filter objects. Consider the following data sheet:

Product Price
Milk 55
NULL NULL
NULL NULL
Bread 82
NULL 95
Cheese NULL

Table 3.10 - Data with NULL values.

Suppose that given the above data, we wished to determine the positions of all valid rows, i.e. the positions of rows without NULL values. The expression that we should use is:

"VALUE IS NOT NULL"

Since we are interested in the position of the filtered values, and not the values themselves, we use the FilterIndices method, as the following C# code illustrates:

[C#]
ObjectColumn productColumn = (ObjectColumn)sheet.Columns["Product"];
int[] indices = productColumn.FilterIndices("VALUE IS NOT NULL");

The first line of the code above gets hold of the product column. We need to cast the column to an ObjectColumn since the column contains strings and not real numbered values. The second line of code filters the column with an expression that contains the VALUE keyword. The resulting filtered positions are returned as an array of integers. For the above example, the resulting array would contain the positions 0, 3 and 5, since these are the positions of the products Milk, Bread and Cheese.

3.4 Splitting Data

The DataSheet.Split method allows you to split a data sheet into two or more sub-sheets. The method is overloaded to accept either a list of column indices, or a list of column names. Consider the following data which will be used to illustrate how splitting works:

Name Gender Score
Michael Male 62
Mary Female 75
Peter Male 82
James Male 77
Sue Female 61

Table 3.11 - Exam results for males and females.

We may want to split this sheet into two sub-sheets. The first sheet should contain male rows, and the second sheet should contain female rows. We can use the DataSheet.Split method to do this, as the following code snippet illustrates:

[C#]
// Split sheet by gender.
DataSheet[] genderSheets = sheet.Split("Gender");

// Access sheets individually.
DataSheet maleSheet = genderSheets[0];
DataSheet femaleSheet = genderSheets[1];

The two resulting data sheets are illustrated below:

Name Gender Score
Michael Male 62
Peter Male 82
James Male 77

Table 3.12 - Male exam results.


Name Gender Score
Mary Female 75
Sue Female 61

Table 3.13 - Female exam results.

The first line of the above code snippet splits the sheet by gender. This returns an array of split sub-sheets. In this case, the array will contain two sheets, one for male scores and the other for female scores. The second and third lines above access the individual sub-sheets for further processing.

3.5 Sorting Data

3.5.1 Sorting Rows

The DataSheet.Sort method may be used to sort a data sheet. This method modifies a data sheet by reordering its rows according to a specified sort criteria. The sort method is overloaded but essentially it accepts a list of columns followed by an optional list of sort orders. If the sort orders are not specified, then the default will be used. The default is specified by the StatsConfiguration class SortOrder property. If not set, the sort order property will default to sorting in ascending order.

To illustrate how sorting works, consider the following data which represents the percentage result threshold for various exam grades:

Grade Result
A 80
B 65
1st 90
D 40
C 50
E 30

Table 3.14 - Unsorted data for exam grade thresholds.

As can be seen, the above data is not sorted. Suppose we wish to sort by grade. We can use the following line of code to do this:

[C#]
// Sort sheet by grade.
sheet.Sort("Grade");

This will result in the following sorted data sheet:

Grade Result
1st 90
A 80
B 65
C 50
D 40
E 30

Table 3.15 - Data after sorting by grade.

In order to sort data in descending order, we can use an overload of the DataSheet.Sort method that also accepts a value from the SortOrder enumeration. The following line of code will sort the data sheet in descending order by result:

[C#]
// Sort sheet by result descending.
sheet.Sort("Result", SortOrder.Descending);

This will result in the following sorted data sheet:

Grade Result
E 30
D 40
C 50
B 65
A 80
1st 90

Table 3.16 - Data after sorting by result descending.

If you wish to reorder a sheet's rows explicitly it is possible to use the RowList Reorder method. This method accepts a list of row indices and reorders the data sheet accordingly. For example, suppose we wished to reorder the above data in acending order, but with the row for grade 1st at the end. We could do this with the following line of code:

[C#]
// Reorder the rows explicitly.
sheet.Rows.Reorder(0, 1, 4, 3, 5, 2);

Note that this method accepts a list of row positions. Since we start with row position 0, the above code will result in the following explicitly sorted data sheet:

Grade Result
A 80
B 65
C 50
D 40
E 30
1st 90

Table 3.17 - Data after explicitly reordering rows.

3.5.2 Sorting Column Values

In order to sort only a single column leaving other columns intact it is possible to use the column Sort method. This method accepts a SortOrder value and sorts only a single column. Similarly, it is possible to use the column reorder method to explicitly reorder a column's values.