Tutorial 1: Performing the Basic Workflow in GUI Mode
To perform the steps of this tutorial, you will first need to start HelixTree. You can do this in the usual ways for your operating system, such as double-clicking the program’s icon or entering the program name on the command line of a console window. Either way, HelixTree will be launched in GUI mode. Once HelixTree is running, we can create a new project.
2.4.1 Create a New Project
Projects, in HelixTree, are the work area in which all data analysis takes place (see section 3.1.2 for a complete description of creating a project). They provide a number of benefits. First, since they are named, they allow you to keep separate data analysis activities distinct from one another. Second, projects can be closed and later reopened without losing any of your work. Finally, each project automatically maintains a detailed history of all of the operations performed within it. This is important if you are ever asked to identify the steps that led to a particular analysis result.
New projects are created via the New Project dialog box. This dialog can be launched by simply selecting the File->New Project menu item or, alternatively, by clicking on the Create A New Project button on the main window button bar.
The dialog box allows you to specify the directory location of your project. For this tutorial, we will use the example subdirectory of the HelixTree installation directory as our work area. Click on the Browse button to open the Browse for Folder dialog box, select that example subdirectory, and click the OK button.
As you can see, the dialog requires that you provide a name for your project. We will call our project ‘Discovery’. Enter that name in the Project Name text field of the New Project dialog box.
When you have completed entering all of the necessary information, click on the Create Project button to close the dialog and create the project.
Notice that the client area of the HelixTree main window has changed to show the Project Viewer. The Project Viewer has three panes. For the moment, we will focus on the leftmost pane, the Project Navigator pane, which contains an icon for the project that you just created.
Since our new project is empty, there is not much that we can do that is very interesting. To change that situation, we will import some data.
2.4.2 Import A Data Set
Importing data into a HelixTree project is generally very easy to do. When using your own data sets, you have to be sure that the data types of your data set conform with what HelixTree allows. The details are described in Chapter 4, Importing Your Data Into HelixTree.
There are several different ways to import data depending on the format of the original data set. For this tutorial, we will use the Import Wizard which is launched by selecting the File->Import Data->Import Wizard menu item.
There are a number of buttons and fields displayed in the first panel of the Import Wizard. We are going to import a text file containing comma separated values (CSV) called GSIM.csv.
Click on the File button, which causes a file chooser dialog to appear. Navigate to the example subdirectory of the HelixTree installation directory, where you will find the file GSIM.csv. Click on the file name and then click on the Open button, which will cause the file chooser dialog to close.
You are now back at the first panel of the Import Wizard and you see that both the Name and Format fields are filled in with appropriate information. Click on the Next button.
You will now see the second panel of the Import Wizard which allows you to choose a Row for Column Names. In this example, HelixTree has detected column names. If you did not want to use the detected row as column names, you may specify which row to use. For this tutorial, select No, use detected setting.
You will now see the third panel of the Import Wizard which allows you to choose a label column. A label column is a column of information that is intended to annotate or label each row of the data set. The column of data that you identify as a label column will not be used in any analysis. However, it will be used as a source of labels for various analysis views.
The GSIM.csv data set contains blood pressure information for 1000 individuals. Each individual is uniquely identified by a patient ID which is listed under the PATID field. We will choose this field as the label column by selecting the Select label column radio button, clicking on the PATID string and finally clicking on the Finish button.
At this point, the Import Wizard will be closed. However, two new windows will open: an Import Wizard message box and a Spreadsheet Viewer named Sheet 1.
The Import Wizard message box describes the imported data. For the GSIM.csv file, you should see the following message: "1000 rows imported 2 boolean, 0 integer, 4 real, 3 gene, 1 categorical columns successfully converted". Close the message box by clicking on the OK button.
2.4.3 View Imported Data in a Spreadsheet Viewer
The Spreadsheet Viewer window has its own menu and button bars and otherwise looks like a spreadsheet.
The leftmost column contains the row number. This is followed immediately by the column chosen as the label column, PATID. Note that both of these columns have, by chance, the same values and that neither of them will be involved in any analysis that we perform. By scrolling down the spreadsheet, you will be able to verify that there are 1000 patients in this data set.
By resizing the Spreadsheet Viewer window, you should be able to verify that there are 10 data columns. Most of the columns are intuitively labeled, but a few need some explanation. For instance, the first data column, BP_I, is a blood pressure indicator. It is a boolean data type derived from the information in the second column, BP (blood pressure). For BP values greater than 105, BP_I is set to 1; otherwise, it is set to 0.
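The rule that derives BP_I from BP can be sketched in a few lines of Python. This is only an illustration of the threshold described above; the function name `bp_indicator` and the sample readings are made up for the example and are not part of HelixTree or GSIM.csv.

```python
def bp_indicator(bp, threshold=105):
    """Return 1 when the blood pressure exceeds the threshold, otherwise 0."""
    return 1 if bp > threshold else 0


# Hypothetical readings, chosen to show the behavior around the threshold.
readings = [92.9, 105, 105.1, 108]
indicators = [bp_indicator(bp) for bp in readings]
print(indicators)  # [0, 0, 1, 1]
```

Note that a reading of exactly 105 maps to 0, because the rule uses "greater than" rather than "greater than or equal to".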
The seventh column is the body mass index (BMI). BMI is a measure of weight relative to height that correlates with a person’s percentage of body fat. It is often used as one of several factors that help predict a person’s proneness to disease. The formula for BMI is
$$\mathrm{BMI} = 703 \times \frac{\text{weight}}{\text{height}^2}$$
where the person’s weight is in pounds and their height is in inches.
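As a rough illustration, BMI for a weight in pounds and a height in inches can be computed as follows. The sketch assumes the commonly used conversion factor of 703; the factor actually used to prepare GSIM.csv is not stated in this tutorial, so it is exposed as a parameter. The function name `bmi` is hypothetical.

```python
def bmi(weight_lb, height_in, factor=703.0):
    """Body mass index from pounds and inches.

    The default factor of 703 is the commonly used imperial conversion
    factor and is an assumption here, not a value taken from HelixTree.
    """
    if height_in <= 0:
        raise ValueError("height must be positive")
    return factor * weight_lb / height_in ** 2


# A made-up person: 150 pounds, 65 inches tall.
example = bmi(150, 65)
print(round(example, 1))  # 25.0
```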
There are two columns that have age data. Column four, AGE?, is identical to column three, AGE, except that some of the age values have been replaced with ‘?’. The ‘?’ is used to indicate a missing data value. The AGE? column will be used in those examples where we want to demonstrate how HelixTree handles missing data.
Columns eight through ten list the genotypes for three genetic markers for each of the 1000 patients. By “genotype” we mean which specific alleles appear at a marker location for a specific patient.
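If you prepare your own data outside HelixTree, the ‘?’ convention for missing values can be mimicked with a small parser like the one below. This is a sketch only: the function name `parse_age` is hypothetical, the sample values are made up, and how HelixTree itself treats missing values during analysis is not covered here.

```python
def parse_age(raw, missing_marker="?"):
    """Convert a raw text cell to a float, or None when it is marked missing."""
    text = raw.strip()
    if text == missing_marker:
        return None
    return float(text)


# Made-up cells in the style of the AGE? column.
ages = [parse_age(cell) for cell in ["34", "?", "61"]]
known = [age for age in ages if age is not None]
print(ages)        # [34.0, None, 61.0]
print(len(known))  # 2
```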
NOTE: A “genotype” does not specify “phase” information, by which we mean distinguishing which allele came from, for instance, the father, and which allele came from the mother. Thus, HelixTree always keeps the allele names in sorted order, and you will not see, for instance, occurrences of genotype "1_2" and genotype "2_1", but instead two occurrences of genotype "1_2".
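The effect of keeping allele names in sorted order can be illustrated with the sketch below. It assumes genotypes are written as two allele names joined by an underscore, as in the examples above, and sorts the names as plain strings; the exact ordering rule HelixTree applies is not documented here, and `normalize_genotype` is a hypothetical helper.

```python
from collections import Counter


def normalize_genotype(genotype, separator="_"):
    """Return the genotype with its allele names in sorted (string) order."""
    alleles = genotype.split(separator)
    return separator.join(sorted(alleles))


# Made-up observations: the same unphased genotype written both ways.
observed = ["1_2", "2_1", "1_1"]
counts = Counter(normalize_genotype(g) for g in observed)
print(counts["1_2"])  # 2
```

Because phase is not recorded, "1_2" and "2_1" collapse into a single genotype that is counted twice.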
2.4.4 Identify the Data Columns to be Used in Analysis
Our goal in this tutorial is to build a model that prioritizes the factors that determine blood pressure. We also might want to know which conditions are most likely to lead to high blood pressure. Once we have the model, we should be able to predict a person’s blood pressure given that we know their age, gender, BMI, etc. For purposes of analysis, we need to be able to enable, disable, and show dependency relationships between columns. All of this can be achieved by clicking on the column label and noting the resultant color coding of the data items.
Initially, all the columns of a newly imported data set are enabled so the data should be displayed as black text on a white background. Data columns whose display color is black are considered independent variables. By clicking on a column label one time, the display color of its data will change from black to magenta. This color change indicates that the column is now a dependent variable. By clicking the column a second time, the data’s display color will change from magenta to gray. The gray color identifies a disabled column. By clicking a column label a third time, the data’s display color will return to black identifying the column as an enabled, independent variable. Columns of data can also be sorted by clicking on the column number bar just above the column label. Each successive click will reverse the sort.
For our tutorial, BP_I and AGE? should be disabled (click twice). BP should be the dependent variable (click once). Leave all the other columns as enabled, independent variables.
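The three-state click cycle described above can be modeled as follows. The sketch only mirrors the behavior stated in this section (independent, then dependent, then disabled, then back to independent); the names `STATES`, `COLORS`, and `state_after_clicks` are illustrative and do not correspond to any HelixTree interface.

```python
# Column states in the order a click on the column label steps through them.
STATES = ["independent", "dependent", "disabled"]
COLORS = {"independent": "black", "dependent": "magenta", "disabled": "gray"}


def state_after_clicks(clicks):
    """State of a newly imported column after the given number of clicks."""
    return STATES[clicks % len(STATES)]


# Clicks used in this tutorial; AGE stands in for the untouched columns.
tutorial_clicks = {"BP_I": 2, "BP": 1, "AGE?": 2, "AGE": 0}
for column, clicks in tutorial_clicks.items():
    state = state_after_clicks(clicks)
    print(column, state, COLORS[state])
```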
2.4.5 Perform the Analysis
All analysis is performed in the context of the Spreadsheet Viewer. For the Discovery project, our first analysis effort will be very simple. From the Spreadsheet Viewer menu, select the Analysis->Interactive Tree Analysis menu item. A new window called Tree View 1 is now displayed.
In the client area of this window is a box which represents (and which we will call) the root node. The root node has several variables displayed in it:
- BP, blood pressure, the dependent variable of our data set.
- n, the number of elements of the data set that the node represents.
- μ, the mean blood pressure of the data set.
- s, the standard deviation.
- “N” uniquely identifies the root node. When we later generate other nodes, an “N” in this position followed by one or more digits will uniquely identify each of those other nodes. The digits will represent the branch position in the tree as viewed from left to right.
- The “I” in the lower right corner of the node will display a Node Annotator dialog if clicked on.
Click (right or left) anywhere in the root node to display a context menu. Select the Recursive Split menu item from the context menu. You should now see a complete tree composed of 15 nodes including the original root node. What happened? HelixTree used recursive partitioning to break the original 1000 data items into smaller sets. The data items in the smaller sets have been identified by HelixTree as having similar characteristics.
The root node identifies BP as being the dependent variable. HelixTree looks at all of the independent (and enabled) variables and decides which one of them is the best one to use to split the data set. Look at Tree View 1 and see that the child nodes are named AGE. This means that HelixTree identified AGE as the best of the independent variables to use first to split the original data set.
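To make the idea of recursive partitioning more concrete, here is a minimal sketch of a tree grower for a numeric dependent variable. It is not HelixTree's algorithm: as an assumption for illustration, it picks the split that minimizes the within-group sum of squared errors, whereas HelixTree reports p-values for its splits. The data rows are made up, and the node names follow the "N", "N1", "N2" convention described above.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    center = mean(values)
    return sum((v - center) ** 2 for v in values)


def best_split(rows, target, predictors):
    """Return (score, variable, threshold, left, right) with the lowest
    combined SSE, or None when no predictor has two distinct values."""
    best = None
    for var in predictors:
        distinct = sorted({row[var] for row in rows})
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2
            left = [r for r in rows if r[var] <= threshold]
            right = [r for r in rows if r[var] > threshold]
            score = sse([r[target] for r in left]) + sse([r[target] for r in right])
            if best is None or score < best[0]:
                best = (score, var, threshold, left, right)
    return best


def grow(rows, target, predictors, name="N", max_depth=3, min_size=2):
    """Recursively split rows; children are named by appending 1 or 2."""
    node = {"name": name, "n": len(rows),
            "mean": mean(r[target] for r in rows), "children": []}
    if max_depth == 0 or len(rows) < 2 * min_size:
        return node
    split = best_split(rows, target, predictors)
    if split is None:
        return node
    _, var, threshold, left, right = split
    if len(left) < min_size or len(right) < min_size:
        return node
    node["split"] = (var, threshold)
    node["children"] = [
        grow(left, target, predictors, name + "1", max_depth - 1, min_size),
        grow(right, target, predictors, name + "2", max_depth - 1, min_size),
    ]
    return node


def count_nodes(node):
    """Total number of nodes in the tree rooted at node."""
    return 1 + sum(count_nodes(child) for child in node["children"])


# Made-up rows for illustration only; these are not values from GSIM.csv.
rows = [
    {"AGE": 30, "BMI": 21, "BP": 90}, {"AGE": 35, "BMI": 27, "BP": 94},
    {"AGE": 40, "BMI": 22, "BP": 91}, {"AGE": 45, "BMI": 28, "BP": 95},
    {"AGE": 55, "BMI": 21, "BP": 98}, {"AGE": 60, "BMI": 27, "BP": 104},
    {"AGE": 65, "BMI": 22, "BP": 99}, {"AGE": 70, "BMI": 28, "BP": 105},
]
tree = grow(rows, "BP", ["AGE", "BMI"])
print(tree["split"])      # ('AGE', 50.0)
print(count_nodes(tree))  # 7
```

On this toy data the first split is on AGE, and each age group is then split on BMI, which loosely mirrors how the tutorial tree splits on different variables in different branches.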
Further investigation of the two AGE nodes reveals some interesting facts. There are 482 patients older than 50 (see node N2). They have a mean blood pressure of 98.4. Those younger than 50 (see node N1) have a mean blood pressure of 92.9. The difference in blood pressure between the two groups draws our attention. But we have to ask, “Could the observed blood pressure difference be attributed to random factors in our population of 1000 patients?”
A p-value helps answer this question. It is the probability of seeing a difference at least as large as the one observed if the split variable actually had no effect on the dependent variable. P-values range from zero to one; smaller values give greater confidence that the difference is real, and larger values give less. The p-value for the variable being split on (in this case, AGE) is listed in the parent node (in this case, the root node, BP), along with the adjusted p-value (aP) and the Bonferroni adjusted p-value (bP).
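For readers unfamiliar with the Bonferroni adjustment, the standard form multiplies the raw p-value by the number of tests performed and caps the result at one. The sketch below shows that textbook calculation; how HelixTree counts the number of tests for bP, and how it computes aP, is not described in this tutorial, so the inputs here are purely hypothetical.

```python
def bonferroni(p_value, num_tests):
    """Textbook Bonferroni adjustment: p times the number of tests, capped at 1."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return min(1.0, p_value * num_tests)


# Hypothetical example: a raw p-value of 0.01 after testing 7 candidate variables.
adjusted = bonferroni(0.01, 7)
print(round(adjusted, 2))  # 0.07
```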
Continuing down the tree, observe that for those under the age of 50, their gender is the next major determiner of blood pressure. Notice also that for the women in the group the next factor that divided them into low and high blood pressure categories was whether they smoked or not. However, for the men, BMI was a more critical factor.
Who has the highest blood pressure? Node N2122 has 41 members with an average blood pressure of 108. What characterizes these 41 individuals? They are men over 50 who do not smoke and have a BMI over 23!
2.4.6 Export Analysis Results for Publishing
The tree that we just built can be viewed as a model. We would like to know how good the model is at actually predicting blood pressure. Given that we know a person’s age, gender, BMI, etc., how accurately will the model predict their blood pressure? We can get a sense of this by selecting the File->View Predictions (In Sample) menu item on the Tree View 1 menu.
The Tree Predictions window that appears is a spreadsheet that contains five data columns. The columns of interest are the Actual, Predicted, and Residual columns, all three of which deal with blood pressure, the dependent variable. Residual is the difference between the Actual and Predicted blood pressure. By clicking on the bar above the column name, the column’s data can be sorted in either ascending or descending order. Sort the Residual column. You can see that the maximum residual is about +30 and the minimum residual is about -33.
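The relationship between the three columns can be sketched as follows. The example assumes the residual is Actual minus Predicted (the sign convention is an assumption), uses made-up numbers, and adds a root-mean-square error as one possible summary of overall fit; `residual_summary` is a hypothetical helper.

```python
import math


def residual_summary(actual, predicted):
    """Residuals (actual - predicted) with their minimum, maximum, and RMSE."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return {"residuals": residuals, "min": min(residuals),
            "max": max(residuals), "rmse": rmse}


# Made-up values for two patients.
summary = residual_summary([100, 90], [98, 93])
print(summary["residuals"])  # [2, -3]
print(summary["min"], summary["max"])  # -3 2
```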
For convenience, we are assessing the model’s ability to predict with the very data that was used to build the model. Normally, we would test the model with a second data set that was not used to build the model.
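One common way to obtain such a second data set is to hold back part of the data before building the model. The sketch below shows a simple random holdout split done outside HelixTree; the function name `holdout_split`, the 25% test fraction, and the fixed seed are all illustrative assumptions.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (training_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Row numbers stand in for the 1000 patients.
train, test = holdout_split(range(1000))
print(len(train), len(test))  # 750 250
```

The model would then be built from the training rows only, and its predictions compared with the actual values of the held-back rows.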
Next, we want to publish our results. More accurately, HelixTree allows you to export your results in a form that lends itself to publishing. To get started, select the File->Save As menu item in the Tree Predictions window. The Export Data dialog that appears allows you to choose the name and format of the file that will contain your data.
Click the Browse button. In the file chooser dialog that appears, navigate to the example subdirectory of the HelixTree installation directory, enter tutorial1_results.csv as the name of the file, and select ASCII File - Delimited (*.txt *.csv) as the value of the Save as type field. Click the Save button. Back in the Export Data dialog, review the values shown in the Name and Format fields, then click the Export button. Verify that the tutorial1_results.csv file exists in the example subdirectory. It can be viewed with a text editor or a spreadsheet program such as Excel.
Close the Tree Predictions window by selecting the File->Close menu item on its menu. Close the Tree View 1 window in the same fashion. In the HelixTree main window, notice that the Project Navigator pane now contains a tree structure. Each node of the tree structure has an icon representing one of the objects that you created during this tutorial. By clicking on a node (except the project node and data set node), HelixTree will launch the appropriate viewer and display the object.
We are now at the end of this tutorial. Close the project by selecting the File->Close menu item. If prompted to save the project, click on the Yes button. Close HelixTree by selecting the File->Quit menu item.