3. Processing Data
3.1 Overview
The Statsar statistics library contains several methods for
processing data. This chapter explains how these methods work.
Fundamentally the library provides 5 data processing operations:
- Filtering - Filtering rows or columns to exclude data.
- Transforming - Combining columns to produce new data.
- Grouping - Grouping rows to produce summaries.
- Splitting – Splitting a data sheet into sub-sheets.
- Sorting - Changing the order of rows or columns.
Each of these methods may be applied before statistics calculation
(pre-processing of data) or after statistics have been calculated
(post-processing of data).
3.2 Filtering Rows
3.2.1 The DataSheet Filter Method
The DataSheet.Filter method accepts an expression which is used to
filter rows. This method does not alter the data sheet. Instead a
new data sheet with filtered rows is returned. The Statsar statistics
library includes a built-in powerful expression engine, which is
defined in detail in Appendix B. However, the way in which expressions
work are best understood by example.
To illustrate how the Filter method works, consider the following data sheet
which shows annual salaries, in thousands of US dollars, for managers in
selected US states:
| State |
Salary |
| Arizona |
35 |
| California |
49 |
| Florida |
60 |
| Louisiana |
60 |
| Maryland |
60 |
| New Jersey |
65 |
| Ohio |
50 |
| Texas |
71 |
Table 3.1 - Annual salaries by US state.
Suppose we wish to filter this data to include only states with
a salary of at least $60,000 or above. The expression we can use
is:
"Salary >= 60"
We can filter the sheet with the following single line of code:
[C#]
// Filter rows to contain only salaries of $60,000 or above.
DataSheet filterSheet = sheet.Filter("Salary >= 60");
The Filter method does not modify the existing data sheet. Instead,
the expression is evaluated, and all rows that satisfy the filter
are added to a new data sheet, named filterSheet in the above code
snippet. The new data sheet is illustrated below:
| State |
Salary |
| Florida |
60 |
| Louisiana |
60 |
| Maryland |
60 |
| New Jersey |
65 |
| Texas |
71 |
Table 3.2 - Salaries of at least $60,000.
The resulting data sheet contains only the states Florida, Louisiana,
Maryland, New Jersey and Texas, since these are the only states with an annual
salary of at least $60,000 or above.
As a second example, suppose that instead of salaries of at least $60,000 we
wanted to find out which states have a salary of between $45,000 and $55,000
inclusive. We could use the following expression to do this:
"Salary >= 45 AND Salary <= 55"
However, the expression syntax also supports the BETWEEN keyword, meaning that
we can also use the following more readable expression:
"Salary BETWEEN 45 AND 55"
The following line of code illustrates using this expression:
[C#]
// Filter rows to contain salaries between $45,000 and $55,000.
DataSheet filterSheet = sheet.Filter("Salary BETWEEN 45 AND 55");
The resulting data sheet is illustrated below:
| State |
Salary |
| California |
49 |
| Ohio |
50 |
Table 3.3 - Salaries between $45,000 and $55,000.
The filtered data sheet contains only the states California and
Ohio, since these are the only states with a salary of between
$45,000 and $55,000. The Filter method and expressions provide
a powerful way to filter rows using only a single line of code.
For a complete definition of the syntax of expressions, please
see Appendix B.
3.2.2 The SheetFilter Class
As well as accepting an expression, the DataSheet.Filter method
is overloaded to accept an instance of the SheetFilter class.
This class is found in the Simplexar.Statsar.Filters namespace,
and requires the following using statement at the top of each
code file that uses the class:
[C#]
using Simplexar.Statsar.Filters;
The SheetFilter class combines several data processing operations
that represent a filter. This combined filter operates on a
data sheet's rows. This class is instantiated and configured,
and then passed to the DataSheet.Filter method as a parameter.
The SheetFilter class has three constituents, each of which is optional:
-
The Select method is used to specify what the resulting
columns will be. The resulting columns can be the same as
the original columns, or can be combinations or transformations
of the original columns.
-
The GroupBy method is used to specify which columns should be
grouped on. Using this method means that the resulting data sheet
will be a summary of the original sheet.
-
The Filter property is used to specify an expression to filter
rows. Only rows where the filter is satisfied will be included
in the resulting data sheet.
If only the Filter property is set, and the Select and GroupBy
methods are not used, then using the SheetFilter class is
equivalent to using the DataSheet.Filter overload that accepts
an expression.
3.2.3 Transforming Data
The simplest use of the SheetFilter class is to transform data.
As a first example, consider the following data sheet which represents
student's scores in English and in French:
| Name |
English |
French |
| Graham |
76 |
55 |
| John |
62 |
88 |
| Mark |
43 |
92 |
| Andy |
26 |
64 |
| Marcus |
55 |
14 |
Table 3.4 - Test scores for English and French.
Suppose we wish to create a new data sheet with two columns:
the student's name, and the total score. The two expressions
we are looking for are:
"Name" and "Total = English + French"
The first expression means that we wish to select the student's name in the resulting
sheet. The second expression uses the equality operator to mean that we wish to create
a new column named "Total" which is equal to the sum of the English and French scores.
We can use SheetFilter class to produce the transformation,
as the following code illustrates:
[C#]
// Configure sheet filter.
SheetFilter filter = new SheetFilter();
filter.Select("Name", "Total = English + French");
// Transform data to produce new sheet.
DataSheet transformSheet = sheet.Filter(filter);
On the second line of code above, the select method on the
SheetFilter class is used to specify a list of transform columns.
These will be the columns that define the transform sheet.
This resulting data sheet is illustrated below:
| Name |
Total |
| Graham |
131 |
| John |
150 |
| Mark |
135 |
| Andy |
90 |
| Marcus |
69 |
Table 3.5 - Total scores for each student.
The resulting data sheet contains two columns, the name and the total score,
which is made up of the sum of English and French scores.
As a second example of transforming data, suppose that we wish to
create a new data sheet with the columns "Name" and "Average" score. However,
suppose that we also wish to include only English scores of at least 50.
We can use the following two select expressions to transform the data:
"Name" and "Average = (English + French) / 2"
We will also need to use the following filter expression:
"English >= 50"
The first select expression indicates that we wish to select the
student's name. The second select expression uses the equality
operator to mean that we wish to create a new column named
"Average" which is equal to the mean of the English and French
scores. Finally, the filter expression allows us to select only
rows with a score of at least 50. The SheetFilter class can be
used to perform both the column transformation and the row
filtering. The code snippet below illustrates this:
[C#]
// Configure sheet filter.
SheetFilter filter = new SheetFilter();
filter.Select("Name", "Average = (English + French) / 2");
filter.Filter = "English >= 50"
// Transform data to produce new sheet.
DataSheet transformSheet = sheet.Filter(filter);
The transformed data sheet is illustrated below:
| Name |
Average |
| Graham |
65.5 |
| John |
75 |
| Marcus |
34.5 |
Table 3.6 - Average test scores.
The first line of the above code instantiates a new SheetFilter
object. The second line specifies the select expressions of Name
and Average – these are the expressions that are used to transform
the data. The third line of code specifies the row filter, i.e.
that we wish to include only rows with an English score of at
least 50. Finally, the last line of code above uses the filter
method to produce the transform. The resulting data sheet contains
only rows for Graham, John and Marcus since these were the only
students with English scores of at least 50.
3.2.4 Grouping Data
As well as a list of select columns and a row filter, the
SheetFilter class also provides a GroupBy method. This method
allows a list of columns to be specified so that the original
data can be grouped. When grouping data, rows are combined
together from the original data to produce a set of summary rows.
To illustrate how grouping works, consider the original data sheet
below (i.e. before any filtering is performed):
| Name |
Height |
Weight |
Jog |
Tennis |
| Robert |
68 |
135 |
True |
True |
| George |
67 |
180 |
False |
False |
| Agatha |
63 |
110 |
False |
False |
| Sandy |
60 |
125 |
True |
True |
| Bill |
65 |
160 |
True |
False |
Table 3.7 - Sports data.
When performing a grouping operation, the SheetFilter select method
can accept a list of aggregate statistic functions. Aggregate
functions (such as COUNT, MIN, MAX, MEAN etc.) operate on each
group of rows. Expressions may contain aggregate functions – these
are the same as the descriptive statistics functions described in
Chapter 4.
Given the above data, suppose we are interested in considering
two groups of people, those who jog and those who do not jog.
We can summarize the data above by using a sheet filter. For
each of these two groups, suppose we wish to calculate the number
of people in the group, and the average weight. The select
expressions we are interested in are:
"Jog", "COUNT(Jog)" and "Mean = MEAN(Weight)"
The group expression we are interested in is
"Jog"
We can use the following code to group the data:
[C#]
// Configure sheet filter.
SheetFilter filter = new SheetFilter();
filter.Select("Jog", "COUNT(JOG)", "Mean = MEAN(Weight)");
filter.GroupBy("Jog");
// Filter data to produce summary.
DataSheet summarySheet = sheet.Filter(filter);
The first line of the code above creates a new SheetFilter object.
The second line specifies what columns the summary sheet will have.
The third line specifies that we are grouping by the Jog column.
The last line creates the summary sheet by applying the filter
method, passing in the SheetFilter object.
To understand the second line, let us break it down into its
constituent parts. We wish to select 3 columns. These are:
- Jog
- COUNT(Jog)
- Mean = MEAN(Weight)
The first column will be the original Jog column. The second
column will be the number of joggers. Since we have not named
the output and we are using an aggregate function, the default
will be applied (hence the result is named Column2). The third
column we wish to select is the mean of the weight column. By
using the equality operator we can name the column as "Mean".
The code above results in the following summary data sheet:
| Jog |
Column2 |
Mean |
| True |
3 |
140 |
| False |
2 |
180 |
Table 3.8 - Grouped sports data.
As can be seen, two groups have been created, i.e. the rows have
been combined into two groups depending on the value of the group
column. In addition, we have used aggregate functions to create
summary statistics.
3.3 Filtering Columns
3.3.1 The Column Filter Method
As well as filtering data sheet rows, it is possible to filter
column values. There are two ways to filter column values,
either by using an expression, or by using the TrimFilter class. Both
these methods return an array containing the resulting column
values.
To illustrate filtering a column's values using an expression,
consider the following data sheet that represents fictional stock prices:
| Stock |
Price |
| ABP |
65.43 |
| XPP |
32.23 |
| TFN |
12.32 |
| JPW |
73.43 |
| MNA |
55.32 |
| SBM |
44.63 |
Table 3.9 - Fictional stock prices.
Suppose we wish to return an array containing the prices of
stocks of at least 50. The VALUE keyword in an expression is
used to indicate the values of the column being filtered.
Thus, the expression that we are looking for is:
"VALUE >= 50"
The following C# code may be used to return the values we are interested in:
[C#]
DoubleColumn priceColumn = (DoubleColumn)sheet.Columns["Price"];
double[] values = priceColumn.Filter("VALUE >= 50");
The first line of the code above gets hold of the price column.
We need to cast the column to a DoubleColumn since the column
contains real numbered values. The second line of code filters
the column with an expression that contains the VALUE keyword.
The resulting filtered values are returned as an array of double
values.
As a second example, suppose we wanted to select the values of the
first 3 stocks. The INDEX keyword in an expression is issued to
indicate the position of a value within the column. Since we
start counting positions with position 0, the expression that
we are looking for is:
"INDEX < 3"
The following C# code may be used to return the filtered values
[C#]
DoubleColumn priceColumn = (DoubleColumn)sheet.Columns["Price"];
double[] values = priceColumn.Filter("INDEX < 3");
This will result in an array of double values containing 65.43,
32.23 and 12.32, because these are the prices with positions
less than 3, since 0 is if the first position index.
The TrimFilter class allows you to trim a certain proportion
of values from a column. For example, to trim 15% of values
from both ends of a column, we can use the following code:
[C#]
double[] trimmedValues = column.Filter(new TrimFilter(0.15));
The resulting values will be sorted in ascending order.
The TrimFilter constructor also accepts a member of the
TrimType enumeration. This may be All, Top or Bottom to
filter from both ends, the top end or the bottom end
respectively. For example, the following C# code will
trim 20% of values from the bottom end of a column:
[C#]
double[] trimmedValues = column.Filter(
new TrimFilter(TrimType.Bottom, 0.2));
The resulting trimmed values will be sorted in ascending order.
3.3.2 The Column FilterIndices Method
The column Filter method accepts a filter expression
using the VALUE or INDEX keywords, and returns an array
of filtered values. The FilterIndices method also accepts
a filter expression, but instead returns the positions of
the filtered values.
As well as filtering columns of real numbered values,
it is also possible to filter objects. Consider the
following data sheet:
| Product |
Price |
| Milk |
55 |
| NULL |
NULL |
| NULL |
NULL |
| Bread |
82 |
| NULL |
95 |
| Cheese |
NULL |
Table 3.10 - Data with NULL values.
Suppose that given the above data, we wished to determine
the positions of all valid rows, i.e. the positions of rows
without NULL values. The expression that we should use is:
"VALUE IS NOT NULL"
Since we are interested in the position of the filtered values,
and not the values themselves, we use the FilterIndices method,
as the following C# code illustrates:
[C#]
ObjectColumn productColumn = (ObjectColumn)sheet.Columns["Product"];
int[] indices = productColumn.FilterIndices("VALUE IS NOT NULL");
The first line of the code above gets hold of the product
column. We need to cast the column to an ObjectColumn since
the column contains strings and not real numbered values.
The second line of code filters the column with an expression
that contains the VALUE keyword. The resulting filtered
positions are returned as an array of integers. For the
above example, the resulting array would contain the positions
0, 3 and 5, since these are the positions of the products
Milk, Bread and Cheese.
3.4 Splitting Data
The DataSheet.Split method allows you to split a data sheet
into two or more sub-sheets. The method is overloaded to accept
either a list of column indices, or a list of column names.
Consider the following data which will be used to illustrate
how splitting works:
| Name |
Gender |
Score |
| Michael |
Male |
62 |
| Mary |
Female |
75 |
| Peter |
Male |
82 |
| James |
Male |
77 |
| Sue |
Female |
61 |
Table 3.11 - Exam results for males and females.
We may want to split this sheet into two sub-sheets.
The first sheet should contain male rows, and the second
sheet should contain female rows. We can use the DataSheet.Split
method to do this, as the following code snippet illustrates:
[C#]
// Split sheet by gender.
DataSheet[] genderSheets = sheet.Split("Gender");
// Access sheets individually.
DataSheet maleSheet = genderSheets[0];
DataSheet femaleSheet = genderSheets[1];
The two resulting data sheets are illustrated below:
| Name |
Gender |
Score |
| Michael |
Male |
62 |
| Peter |
Male |
82 |
| James |
Male |
77 |
Table 3.12 - Male exam results.
| Name |
Gender |
Score |
| Mary |
Female |
75 |
| Sue |
Female |
61 |
Table 3.13 - Female exam results.
The first line of the above code snippet splits the sheet
by gender. This returns an array of split sub-sheets. In
this case, the array will contain two sheets, one for male
scores and the other for female scores. The second and third
lines above access the individual sub-sheets for further
processing.
3.5 Sorting Data
3.5.1 Sorting Rows
The DataSheet.Sort method may be used to sort a data sheet.
This method modifies a data sheet by reordering its rows
according to a specified sort criteria. The sort method
is overloaded but essentially it accepts a list of columns
followed by an optional list of sort orders. If the sort
orders are not specified, then the default will be used.
The default is specified by the StatsConfiguration class
SortOrder property. If not set, the sort order property
will default to sorting in ascending order.
To illustrate how sorting works, consider the following
data which represents the percentage result threshold for
various exam grades:
| Grade |
Result |
| A |
80 |
| B |
65 |
| 1st |
90 |
| D |
40 |
| C |
50 |
| E |
30 |
Table 3.14 - Unsorted data for exam grade thresholds.
As can be seen, the above data is not sorted. Suppose we wish
to sort by grade. We can use the following line of code to do
this:
[C#]
// Sort sheet by grade.
sheet.Sort("Grade");
This will result in the following sorted data sheet:
| Grade |
Result |
| 1st |
90 |
| A |
80 |
| B |
65 |
| C |
50 |
| D |
40 |
| E |
30 |
Table 3.15 - Data after sorting by grade.
In order to sort data in descending order, we can use an
overload of the DataSheet.Sort method that also accepts a
value from the SortOrder enumeration. The following line
of code will sort the data sheet in descending order by result:
[C#]
// Sort sheet by result descending.
sheet.Sort("Result", SortOrder.Descending);
This will result in the following sorted data sheet:
| Grade |
Result |
| E |
30 |
| D |
40 |
| C |
50 |
| B |
65 |
| A |
80 |
| 1st |
90 |
Table 3.16 - Data after sorting by result descending.
If you wish to reorder a sheet's rows explicitly it is possible
to use the RowList Reorder method. This method accepts a list of
row indices and reorders the data sheet accordingly. For example,
suppose we wished to reorder the above data in acending order,
but with the row for grade 1st at the end. We could do this with
the following line of code:
[C#]
// Reorder the rows explicitly.
sheet.Rows.Reorder(0, 1, 4, 3, 5, 2);
Note that this method accepts a list of row positions. Since we
start with row position 0, the above code will result in the
following explicitly sorted data sheet:
| Grade |
Result |
| A |
80 |
| B |
65 |
| C |
50 |
| D |
40 |
| E |
30 |
| 1st |
90 |
Table 3.17 - Data after explicitly reordering rows.
3.5.2 Sorting Column Values
In order to sort only a single column leaving other columns
intact it is possible to use the column Sort method. This
method accepts a SortOrder value and sorts only a single column.
Similarly, it is possible to use the column reorder method to
explicitly reorder a column's values.