‹‹ Back to SVS Home
Manipulating, Filtering and Preparing Data Using the Spreadsheet
6.2 Manipulating, Filtering and Preparing Data Using the Spreadsheet
The spreadsheet is used to set independent and dependent variables, suppress columns and rows, and activate or de-activate rows. We can also sort columns and, if we have a row identifier, we can see where each row’s data came from. For example, in Fig. 6.1 we can see each patient’s ID.
The spreadsheet is also the starting point for several modes of analysis. (See 6.5.)
6.2.1 Dependent or Independent Variable?
There are two types of variables; a dependent variable, such as BP (blood pressure) and independent variables such as Age (Fig. 6.1) . You, the user, pick what variables are dependent or independent. Recursive partitioning creates significant splits on the independent variables based on how they affect the dependent variable. Your first step to running a decision tree is to choose the dependent variable.
For example, you could use BP as an independent variable to predict age. Chapter 8, Prediction Recipes goes through an example of this.
6.2.2 Selecting a Dependent
|
Fig. 6.2 shows BP or blood pressure chosen as our dependent variable. To do so, we click on the button (BP) just under the numeric button for the column (2 in this case). Note the color changes to magenta in this system-specific appearance. The independent columns remain black. The inactive columns are greyed. You can toggle between these three states by clicking the lower column header (column name) buttons.
NOTE: If you clicked and the color doesn’t change, but the data moves around, it is because you clicked the column number button (the topmost button), which sorts the data, instead of the column name button, which changes the column state.
Multiple columns can be made dependent, independent or inactivated by toggling the first column to the desired state, then <Shift> Left Click on a distant column. All columns in between will be set to the state of the first column clicked.
6.2.3 Deactivating Unwanted Columns
|
In this data set, there are two columns representing blood pressure. BP is the actual blood pressure and BP_I is a binary representation of BP with blood pressures above 105 represented as a 1 and below as a 0. Since one of these column of data is not needed, clicking the BP_I button twice (until it is greyed out) suppresses that column.
6.2.4 Sorting Records
|
You can sort any column by clicking on the column header button. In the illustration we have sorted on the Age column. Look closely and you will see a down arrow in the button on the column number header row. The first time you do this the sort will be from low to high value.
If you click the column button a second time it sorts it from high to low value. Note the arrow on the right column points up and the highest value for the variable is at the top.
If you click the Unsort button on the top left, the sorting of the spreadsheet will revert to the original sort order of the dataset.
6.2.5 Plotting Individual Columns
You can plot a real or integer type column against the spreadsheet row labels (or row numbers if the spreadsheet does not contain row labels) by clicking on the column header button to access a drop down menu. From that menu select Plot This Column. A plot should open displaying the column data.
The drop down menu at the lower right corner allows you to select whether to view the actual data values of this column v or the negative logarithm base 10 of the values of this column -log 10v. When the plot is first created, this will default to the -log 10v if HelixTree determines that the column is probably a p-value output, and to the actual values otherwise.
NOTE: To have the data for more than one column displayable using the same plot window, use the spreadsheet menu Analysis->Plot Numeric Columns (6.5.5).
For a detailed description of the plot features, see Chapter 16, P-Value and Spreadsheet Plots.
6.2.6 Activating - Deactivating Row Data
|
Any row can be selected or de-selected (activated or deactivated) by clicking on the grey row selection button at the far left of the row. Since you cannot analyze a missing dependent value, any dependent variable row with missing values will automatically be deactivated. You can pick contiguous rows by first clicking at the first row and then <Shift> left clicking on that last row of the range you want to select-de select.
For example, if you wish to exclude all records of Age 55 or less, you can do this by first sorting the Age column in descending order. Then all the 55 or less values are in contiguous rows. You can then quickly deactivate the contiguous records (rows) with Age less than 55, thus excluding those values from further analysis.
6.2.7 Picking Random Record Sets
If you have many records, you might want to reduce the number of records to process for several reasons (for example, a smaller set of records runs faster that a large one). However, selecting or de-selecting individual records may not be practical and handpicking records might bias the results. Another reason you might want to create a data subset is to create test and validation data sets. This means you can do tree analysis on a random subset, and validate the model by running a tree on the holdout sample. Chapter 8, Prediction Recipes has more details of this procedure. Section 6.4.1.3 details the ways and means of selecting random subsets.