2. Importing and Exporting Data
2.1 Loading Excel CSV files
The Statsar statistics library is designed to import and export data using data sheets.
A data sheet is a concept that will be familiar to anybody who has
used a spreadsheet. Within Statsar data is organized into tables
consisting of rows, columns and column headers. For example,
the following data sheet holds data representing girls and boys
heights within a school:
| Grade |
Name |
Gender |
Height |
| 3 |
Jason Smith |
Male |
160 |
| 4 |
Mary Turner |
Female |
165 |
| 3 |
Ben Goodham |
Male |
155 |
| 5 |
Jack Inwood |
Male |
172 |
| 5 |
James Carter |
Male |
168 |
| 3 |
Catherine Long |
Female |
140 |
| 2 |
Claire Moody |
Female |
135 |
Table 2.1 - A sample data sheet.
To import data from Microsoft Excel, it is recommended to first save your workbook in CSV format (Comma Seperated Values):
Figure 2.2 - Exporting an Excel Spreadsheet in CSV format.
Excel's default file format (XLS) is specific to a particular version of Excel, whereas CSV files can be read and written by all good statistics packages.
Once you have prepared your input data file, the next step is to
load the file into Statsar. The data sheet class represents a sheet
within Statsar (i.e. state), whereas the calculator class represents
the calculation engine, and is used to perform computations on the sheet
(i.e. actions). Both of these may be found within the Simplexar.Statsar namespace.
The data sheet class is named DataSheet, whilst the calculator class is
named StatsCalculator.
The calculator Load() method is overloaded to support multiple loading options,
which are documented further below. The most straightforward option is to load a
CSV file by name. The following code demonstrates five library methods:
[C#]
// 1. Create a calculator.
StatsCalculator calculator = new StatsCalculator();
// 2. Use the calculator to load the sheet.
DataSheet sheet = calculator.Load(
@"C:\Excel Sheets\SchoolHeights.csv");
// 3. Convert the sheet to a string for display.
Console.WriteLine(sheet.ToString());
// 4. Rename the sheet.
sheet.SheetName = "ImportSheet";
// 5. Copy the sheet.
DataSheet sheet2 = (DataSheet)sheet.Clone();
sheet2.SheetName = "Copy of ImportSheet";
Console.WriteLine(sheet2.ToString());
The code above illustrates how to create a calculator instance and then
load a data sheet from an Excel CSV file (steps 1 and 2). The
DataSheet.ToString() method converts the sheet to a well formatted
string, suitable for display to the console (step 3). Once
loaded, we can manipulate the sheet's data and properties, for
example, we can rename the data sheet (step 4).
In step 5, we create a copy of the sheet using the DataSheet.Clone()
method, and then rename the copy to "Copy of ImportSheet".
Finally, we again write the sheet out to the screen, this time
using the copy of the sheet, leaving the original sheet intact.
2.2 Working with Columns
When data is imported, the StatsCalculator class will create a
data sheet consisting of rows and columns. If your original data
contained column headers, then these will be imported, otherwise
default column names will be used. If no column headers are
specified, then the columns will be named Column1, Column2,
Column3, etc.
Each data sheet column is typed, meaning that it holds a specific
type of data. The two supported column types are double columns and object
columns. Double columns hold real numbers, while object columns hold all
other types of data, e.g. strings, dates, etc. The reason for this
partition is that most Statsar routines are optimized to run as quickly
and efficiently as possible on double columns. On the whole, the library
will ignore object columns, but some support for these types are supported
(e.g. sorting, grouping and filtering).
There are three column classes: ColumnBase, DoubleColumn and ObjectColumn.
The two typed columns (DoubleColumn and ObjectColumn) both derive from
ColumnBase, which provides common column properties and methods.
To illustrate how best to work with columns, consider the following data sheet:
| Name |
Age |
BirthDate |
Height |
| Jack |
23 |
18 Dec 1984 |
162 |
| Mary |
30 |
24 Mar 1978 |
182 |
| Peter |
21 |
14 Jun 1986 |
158 |
Table 2.3 - Ages, birth dates and heights.
This sheet contains columns named Name, Age, BirthDate and Height.
These are of type string, double, DateTime and double respectively.
The corresponding data sheet loaded by the StatsCalculator class
will contain two DoubleColumn columns (Age and Height) and two
ObjectColumn columns (Name and BirthDate).
In order to access these columns we can use a cast, as the following C# code illustrates:
[C#]
// Access columns by column name
ObjectColumn nameColumn = (ObjectColumn)sheet.Columns["Name"];
DoubleColumn ageColumn = (DoubleColumn)sheet.Columns["Age"];
// Access columns by column position
ObjectColumn birthDateColumn = (ObjectColumn)sheet.Columns[2];
DoubleColumn heightColumn = (DoubleColumn)sheet.Columns[3];
The first two lines above access a DoubleColumn and ObjectColumn
respectively by using the Columns property of the DataSheet class,
passing in a column name. The last two lines above access an
ObjectColumn and DoubleColumn by passing in a column number.
The first column number is zero, hence we can access the second
and third column of the sheet (the BirthDate and Height columns)
by passing in the column indices 2 and 3. Once we have got hold
of a column, we can access individual items of data:
[C#]
Console.WriteLine(heightColumn[2]); // Gets the third row
heightColumn[2] = 168; // Sets the third row
Console.WriteLine(heightColumn.Count); // Gets the number of rows
The first line above accesses the third height in the data sheet
(since rows start with row 0), whereas the second line changes
Peter's height from 158 to 168. The last line above uses the
column count property to display the number of items in the height
column, which is 3.
There are three methods which may used to modify data within a column. These are the add, insert and remove methods:
[C#]
heightColumn.Add(192); // Add to end of column.
heightColumn.Insert(1, 175); // Insert at position 1.
heightColumn.Remove(2); // Remove at position 2.
heightColumn.Remove(3); // Remove at position 3.
If we consider the data sheet before the above code was applied,
the height column will contain the original values 162, 182 and
158 for Jack's, Mary's and Peter's heights respectively. The first
line of the above code adds the value 192 to the end of the column.
The column would then contain the values 162, 182, 158 and 192.
The insert method on the second line takes two arguments: the
position to insert at and the value. In this case we insert at
position 1 the value 175. Since we start positions at position
0, this results in column holding the values 162, 175, 182, 158
and 192. On lines three and four, we then use the remove method
to remove the values at position 2 and 3. The final values in the
column are thus 162, 175 and 192.
As mentioned above it is possible to access a column by either name
or position. In order to convert a column name to a column position
number, the Columns collection provides an overloaded IndexOf method.
[C#]
// Convert a single column name to a position.
int position = sheet.Columns.IndexOf("Name");
// Convert multiple column names to positions.
int[] positions = sheet.Columns.IndexOf(
new string[]{ "Age", "BirthDate", "Height" });
// Get the total number of columns.
Console.WriteLine(sheet.Columns.Count);
In the above code, the first line converts the column called
"Name" to a position, which will be zero as this is the position
of the first column. The second line converts multiple column
names to positions. In this case an array of integers is returned,
storing the corresponding column positions. In the above example,
the array would contain the positions 1, 2, and 3 respectively.
The third line above uses the columns count property to display
the total number of the columns in the sheet, which is 4 columns
for the above example.
So far we have been working with values within an individual
column. It is also possible to work with entire columns, as the
following methods illustrate. The Add method allows you to import
data and create a new column at the same time. The method is
overloaded to allow you to create a new column from a list of
values:
[C#]
// Add a new column from a list of integers.
sheet.Columns.Add("Salary", 70, 62, 82);
The Copy method allows you to copy a column by specifying the new
column name, followed by the source column name. The code below
copies the "Height" column and creates a new column called "Height2"
with the same values:
[C#]
// Copy a column using column names.
sheet.Columns.Copy("Height2", "Height");
The Move method moves a column to a new position. This is done by
specifying the column name, followed by the destination column
index. For example, the code below moves the column named "Height"
to between the Name and Age columns, i.e. the column is moved to
position 1:
[C#]
// Moves a column.
sheet.Columns.Move("Height", 1);
The Remove method removes a column by column name or by column
position:
[C#]
// Removes a column by column name.
sheet.Columns.Remove("Name");
// Removes a column by column position.
sheet.Columns.Remove(3);
Finally, the Reorder method accepts a list of integers. The
columns on the sheet are then reordered according to the new
column positions as specified by the input array. If we consider
the original data sheet with columns Name, Age, BirthDate and
Height then the following code will reorder the sheet:
[C#]
// Removes a column by column name.
sheet.Columns.Reorder(0, 3, 2, 1);
The above line of code will reorder the columns according to
positions 0, 3, 2 then 1. This will result in the data sheet
having columns Name, Height, BirthDate and Age.
2.3 Working with Rows
The DataSheet class provides a Rows property that allows
operations to be performed on a row by row basis. To illustrate
these methods, consider the following data sheet:
| Product |
Cost |
Units |
| Cheese |
4.55 |
55 |
| Milk |
0.79 |
25 |
| Bread |
0.54 |
35 |
| Butter |
1.65 |
42 |
Table 2.4 - Food products.
To add a new row, we can use the Add method passing in an array
of object values:
[C#]
sheet.Rows.Add("Tea", 1.32, 55 );
Similar to the add method is the insert method, which takes a
row position. If we consider the original data sheet above, the
following line of code will insert a new row at position 3:
[C#]
sheet.Rows.Insert(3, "Coffee", 2.50, 12 });
This will result in a data sheet holding rows for the products
Cheese, Milk, Bread, Coffee and Butter. The reason for this is
that we start row positions at position 0, and hence position 3
would be the 4th row.
[C#]
Console.WriteLine(sheet.Rows.Count); // Display number of rows
object[] values = sheet.Rows[1]; // Get values in the second row
The first line of the code example above illustrates how to
display the total number of rows in the sheet, using the rows
count property. The second line of the code example above uses
the rows indexer passing in a value of 1. This indexer returns
an object array holding all the values in the row. In this case,
array would contain the values Milk, 0.79 and 25. The reason that
we use an object array is that a row will in general consist of
numbers as well as non-numbers and so we use the generic object
type to wrap these (this is boxing in the .NET languages C# and
VB.NET).
There are three methods that allow rows to be deleted. These are
the Remove and Clear methods on the Rows collection, and the Clear
method on the DataSheet class. The Remove method on the rows collection
is overloaded twice:
[C#]
sheet.Rows.Remove(0); // Remove row 0
sheet.Rows.Remove(1, 2); // Starting at row 1 remove 2 rows
The first overload accepts a row position, and removes only that
row. The second overload accepts a row position and a row count,
and removes the specified number of rows (second argument) starting
at the specified row position (first argument).
There are two Clear methods relevant to rows within the Statsar
library. These are illustrated in the code snippet below:
[C#]
sheet.Rows.Clear(); // Clear all rows preserving column headers
sheet.Clear(); // Clears all rows and columns
The Rows Clear method will clear all rows from the data sheet,
but preserve the columns as empty columns with only headers and
with no values. The DataSheet Clear method will clear both all the
rows and all the columns. The difference between the methods is
that the Rows Clear method clears only the data, preserving the
schema, while the DataSheet Clear method clears both the data and
the sheet's schema.
2.4 Importing Arrays
The StatsCalculator class can create a data sheet from any type
of array. This includes arrays of integers, doubles or strings,
one-dimensional arrays, multidimensional arrays, jagged arrays
and arrays of reflected objects. This section explains what these
different types of arrays are, as well as illustrating how to
load them using the StatsCalculator class.
The simplest array is a one-dimensional array. The following code
snippet illustrates how to load different types of one-dimensional
array. The resulting data sheet will consist of a single column
named "Column1":
[C#]
// Load a one-dimensional integer array.
DataSheet sheet1 = calculator.Load(new int[]{ 3, 5, 7 });
// Load a one-dimensional double array.
DataSheet sheet2 = calculator.Load(new double[]{ 44.5, -23.3, 15.67 });
// Load a one-dimensional string array.
DataSheet sheet3 = calculator.Load(new string[]{ "A", "B", "C" });
The first line of code above loads a one-dimensional integer array,
the second line loads a one-dimensional array of real numbers
(doubles) and the last third line loads a one-dimensional array of
strings.
There are only two types of column within the Statsar library, the
DoubleColumn and the ObjectColumn. The StatsCalculator class will
create a DoubleColumn from the first and second lines above, i.e.
from the integer and double arrays, since these are both numeric
data types. The third line will create a data sheet with an
ObjectColumn, since the string array is non-numeric.
The Load method is also overloaded to accept multidimensional and
jagged arrays. It is worth spending a few moments explaining the
difference between these two types of array. The .NET framework (for
VB.NET and C# developers) supports true multidimensional arrays.
A jagged array is an array of arrays, meaning that we have a
one-dimensional array, where each element is itself an array.
The following code snippet shows how to declare and load a jagged
array:
[C#]
// Create a jagged array.
double[][] jaggedArray = {
new double[]{ 2.1, 6.4 }, new double[]{ 8.8, 1.9 },
new double[]{ 3.7, 5.2 } };
// Load the array.
DataSheet sheet = calculator.Load(jaggedArray);
In the above code, a jagged array of three elements (child arrays)
is declared, where each child array has 2 elements. This type of
array is supported by .NET languages (VB.NET and C#). The resulting
data sheet is illustrated below:
| Column1 |
Column2 |
| 2.1 |
6.4 |
| 8.8 |
1.9 |
| 3.7 |
5.2 |
Table 2.5 - An imported jagged array.
Each child array within the jagged array becomes a new row, and
the number of elements in each child array will become the number
of columns in the data sheet. As such, the resulting data sheet above
has two columns and three rows.
Multidimensional arrays are are supported by .NET languages (VB.NET and C#).
The following code snippet shows how to declare and import a multidimensional array:
[C#]
// Create a multidimensional array.
double[,] multiArray = {{ -1, 3.4 }, { 7.8, 1.2 }, { 2.6, 9.4 }};
// Load the array.
DataSheet sheet = calculator.Load(multiArray);
The resulting data sheet is illustrated below:
| Column1 |
Column2 |
| -1 |
3.4 |
| 7.8 |
1.2 |
| 2.6 |
9.4 |
Table 2.6 - An imported multidimensional array.
2.5 CSV and Tab Delimited Files
Once you have initiated a StatsCalculator instance, it is possible to
load files from disk by using the calculator Load method:
[C#]
StatsCalculator calculator = new StatsCalculator();
DataSheet sheet = calculator.Load(@"C:\Path\MyFile.csv");
This section explains the methods and properties that are used
to read and write CSV and tab delimited files using the StatsCalculator
and DataSheet classes. The calculator load method is overloaded
to accept an optional encoding type:
[C#]
StatsCalculator calculator = new StatsCalculator();
DataSheet unicodeSheet = calculator.Load(
@"C:\Path\MyFile.csv", Encoding.Unicode);
For .NET languages (C# and VB.NET), the different options for file
encoding are specified by using the Encoding class. Supported
encoding methods include Unicode, ASCII and UTF8. The code example
above illustrates how to load a file which has been saved in
Unicode format. If no encoding is specified, then the default file
encoding will be used to load the data sheet.
In order to write a data sheet to a file you can use the DataSheet.Write method.
By default, a CSV file with headers will be written,
suitable for reading by Microsoft Excel and all other CSV
compatible software.
[C#]
sheet.Write(@"C:\Path\MyFile.csv");
It is also possible to specify an encoding type when writing to a
file. For example, to export a data sheet in Unicode format, use
the following write method overload:
[C#]
sheet.Write(
@"C:\Path\MyFile.csv",
FileFormat.Text,
Encoding.Unicode);
This overload accepts three parameters: the filename, the file
format type and the encoding type. The FileFormat enumeration is
used to specify the file format. This can either be text, for CSV
or other delimited files, or else it can be table. The table format
is used to save a file using fixed column widths, which is the
default when writing a data sheet to screen, i.e. when using
Console.WriteLine to display the sheet. In the example above, we
use the text FileFormat (for CSV) and the Unicode encoding type.
The calculator Load method also has three overloads for reading
from a TextReader object or from a stream object. These are general
objects that may be used to read data from sources other than disk,
e.g. from memory streams or other custom stream locations. The
following code snippet illustrates how these methods are defined:
[C#]
public class StatsCalculator
{
// General calculator load methods:
public DataSheet Load(TextReader);
public DataSheet Load(Stream);
public DataSheet Load(Stream, Encoding);
// ...
}
Similarly the DataSheet.Write method is overloaded to also accept
a TextReader or a stream object. This allows you to write a sheet
to any location, e.g. to memory or other custom stream locations.
The following code snippet illustrates how these methods are
defined:
[C#]
public class DataSheet
{
// General DataSheet write methods:
public void Write(TextWriter writer);
public void Write(TextWriter writer, FileFormat);
public void Write(Stream stream);
public void Write(Stream, FileFormat, Encoding);
// ...
}
By default when reading or writing to a file, the first line of
the file is expected to contain column headers. It may be the case
that you wish to read or write a file without column headers. In
order to support this, the StatsConfiguration class can be used
to specify if headers are defined, as the following code
illustrates:
[C#]
// Load a CSV file that contains no column headers.
StatsConfiguration.FileHeaders = false;
DataSheet sheet = calculator.Load(@"C:\Path\MyFile.csv");
There are two other file properties on the StatsConfiguration
class which may be of interest. These are the FileNewLine and
FileSeperator properties. FileNewLine specifies the type of line
delimiter to use, where as FileSeperator specifies the type of
field separator to use. For example, the following code saves a
tab delimited file suitable for the UNIX file system:
[C#]
// Write a Unix-compatible tab delimited file.
StatsConfiguration.FileNewLine = "\n";
StatsConfiguration.FileSeperator = "\t";
sheet.Save("Export.dat");
2.6 Database Connectivity
The calculator Load method may be used to load data from multiple
sources, and returns a DataSheet instance that represents the loaded
data. As well as the load method, the calculator also provides a
DataSource property. Setting the DataSource property is equivalent
to the load method, when used in conjunction with the calculator
sheet property to return the sheet. The following code illustrates
the load method by loading an array into the calculator:
[C#]
// Load an array using the load method.
DataSheet sheet = calculator.Load(new int[]{ 44, -45, 63});
This is equivalent to using the DataSource and sheet properties
as illustrated below:
[C#]
// Load an array using the DataSource property.
calculator.DataSource = new int[]{ 44, -45, 63};
DataSheet sheet = calculator.Sheet;
The DataSource property is provided for convenience for users
familiar with ADO.NET data handling, since .NET classes implement
this property when binding data to visual controls.
To load data from a database, first load a DataTable using
standard ADO.NET classes such as the connection and command
objects. Once a DataTable has been loaded in code, it is possible
to bind the DataTable directly to the calculator to get back a
DataSheet instance. The following C# code illustrates how to bind to a
DataTable:
[C#]
// Get data table from database.
DataTable dataTable = GetDataTable();
// Set data table as calculator data source.
calculator.DataSource = dataTable;
DataSheet sheet = calculator.Sheet;
Console.WriteLine(sheet);
The example code below implements the GetDataTable() method used
by the above code snippet:
[C#]
public DataTable GetDataTable()
{
// Connect and initiate SELECT command.
string connectionString = // ...
SqlConnection connection = new SqlConnection(connectionString);
SqlCommand command = new SqlCommand(
"SELECT * FROM Product", connection);
// Fill data table.
SqlDataAdapter adapter = new SqlDataAdapter(command);
DataTable dataTable = new DataTable();
adapter.Fill(dataTable);
// Dispose managed resources.
adapter.Dispose();
command.Dispose();
connection.Dispose();
return dataTable;
}
The above method first declares a connection string, and then
loads DataTable from an SQL server database. You can use the
sample code above to load data from SQL server by specifying
a connection string and an appropriate SELECT command or stored
procedure. Code similar to the above can be used to return a
DataTable for other databases, e.g. MySQL.
When loading a DataTable by using the calculator load method,
the resulting data sheet will contain the same column names as per
the original database table. In addition, the resulting data sheet
may contain both DoubleColumn and ObjectColumn column types.
Numeric values will be converted to doubles (i.e. to real values)
where as non-numeric data types (dates, strings, etc.) will be
converted to objects.
NULL values are handled depending on the column type. For
DoubleColumn values, database NULLs are converted to NaN
(not a number). For ObjectColumn values, NULLs are converted to
null references for C#, and Nothing for Visual Basic.
After calculations have been performed, you can export a
DataSheet instance to a DataTable by using the ToDataTable() method:
[C#]
// Export a DataSheet to a DataTable.
DataTable dataTable = sheet.ToDataTable();
Once you have exported a DataTable you can then use standard
ADO.NET methods to save the resulting table to a database, e.g.
to SQL server
2.7 Reflection
The StatsCalculator class can also load an array of objects,
by using reflection. Consider a class called product with
properties name, size and cost. The following line of code loads
an array of these types directly into a data sheet:
[C#]
// Load an array of a custom data type. The calculator
// will use reflection to achieve this.
DataSheet sheet = calculator.Load(
new Product[]{
new Product("Milk", 52, 12.8),
new Product("Cheese", 58, 64.8),
new Product("Bread", 62, 66.9) });
The resulting data sheet is illustrated below. The column names
are determined by using reflection on the custom type's properties:
| Name |
Size |
Cost |
| Milk |
52 |
12.8 |
| Cheese |
58 |
64.8 |
| Bread |
62 |
66.9 |
Table 2.7 - Reflected data sheet.
Finally, it is also possible to bind the calculator to any object
that implements IEnumerable. This allows ArrayList or other list objects
to be loaded, as well as lists of custom collections by using reflection.