6.2 Manipulating, Filtering and Preparing Data Using the Spreadsheet
The spreadsheet is used to set independent and dependent variables, suppress columns and rows, and activate or de-activate rows. We can also sort columns and, if we have a row identifier, we can see where each row’s data came from. For example, in Fig. 6.1 we can see each compound name.
6.2.1 Dependent or Independent Variable?
The spreadsheet view of the data (Fig. 6.1) allows rapid virtual scrolling through the entire data set. If you right-click on a row label, the single molecule viewer described in the section Opening and Saving Files will open on the given compound. Recursive partitioning creates significant splits on the independent columns based on how they affect the dependent columns. Your first step in running a decision tree is to choose the dependent variable.
6.2.2 Selecting a Dependent
Fig. 6.2 shows Potency chosen as our dependent variable. To do so, we click on the column name button (Potency) just under the numeric button for the column. Note that the color changes to magenta (the exact appearance is system-specific). The independent columns remain black. The inactive columns are greyed. You can toggle between these three states by clicking the lower column header (column name) buttons.
NOTE: If you clicked and the color doesn’t change, but the data moves around, it is because you clicked the column number button (the topmost button), which sorts the data, instead of the column name button, which changes the column state.
Multiple columns can be made dependent, independent or inactive by toggling the first column to the desired state and then <Shift> left-clicking on a distant column. All columns in between will be set to the state of the first column clicked.
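The column-state behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the program's own code: the names ColumnState, toggle and shift_click are hypothetical, and the click order (independent, then dependent, then inactive) is an assumption inferred from the text.

```python
from enum import Enum


class ColumnState(Enum):
    INDEPENDENT = "independent"  # shown in black
    DEPENDENT = "dependent"      # shown in magenta
    INACTIVE = "inactive"        # greyed out


# Assumed click order: independent -> dependent -> inactive -> independent.
_CYCLE = [ColumnState.INDEPENDENT, ColumnState.DEPENDENT, ColumnState.INACTIVE]


def toggle(state):
    """Return the state a column moves to after one click on its name button."""
    return _CYCLE[(_CYCLE.index(state) + 1) % len(_CYCLE)]


def shift_click(states, first, last):
    """Copy the state of column `first` to every column from `first` to `last`.

    `states` is a list of ColumnState values; indices are 0-based.
    A new list is returned and the input is left unchanged.
    """
    lo, hi = sorted((first, last))
    result = list(states)
    for i in range(lo, hi + 1):
        result[i] = states[first]
    return result


# Five independent columns; click column 1 twice, then <Shift> click column 3.
states = [ColumnState.INDEPENDENT] * 5
states[1] = toggle(toggle(states[1]))
states = shift_click(states, 1, 3)
```

After these steps, columns 1 through 3 (0-based) are inactive and columns 0 and 4 are still independent.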
6.2.3 Sorting Records
You can sort any column by clicking on the column number button at the top of the column header. In the illustration we have sorted on the Y column. Look closely and you will see a down arrow in the button on the column number header row. The first time you do this, the sort will be from low to high value.
If you click the column number button a second time, the column is sorted from high to low value. Note that the arrow on the right column points up and the highest value for the variable is at the top.
If you click the Unsort button on the top left, the spreadsheet will revert to the original row order of the dataset.
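One way to think about sorting and unsorting is that the data itself never moves; only the display order of the row indices changes. The sketch below illustrates that idea in Python. It is an assumption about how such a view could work, not a description of the program's internals, and the function name display_order is hypothetical.

```python
def display_order(values, descending=False):
    """Return row indices in the order they would be displayed.

    `values` holds one column's values in original dataset order.
    The original list is not modified, so "unsorting" is simply a
    return to range(len(values)).
    """
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)


y = [3.0, 1.0, 2.0]

first_click = display_order(y)                     # low to high
second_click = display_order(y, descending=True)   # high to low
unsorted = list(range(len(y)))                     # original dataset order
```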
6.2.4 Deactivating Unwanted Columns
If we want to exclude PLLO:C(C)-C(CCCC) from further analysis, we can suppress that column by clicking on the PLLO:C(C)-C(CCCC) button twice (until it is greyed out).
6.2.5 Activating - Deactivating Row Data
Any row can be selected or de-selected (activated or deactivated) by clicking on the grey row selection button at the far left of the row. Since you cannot analyze a missing dependent value, any row with a missing value for the dependent variable will automatically be deactivated. You can pick contiguous rows by clicking on the first row and then <Shift> left-clicking on the last row of the range you want to select or de-select.
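The row-activation rules above can be illustrated with a short Python sketch. The function names are hypothetical, and treating both None and NaN as "missing" is an assumption made for the example; the program's own definition of a missing value may differ.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def initial_row_activity(dependent_values):
    """Rows with a missing dependent value start out deactivated."""
    return [not is_missing(v) for v in dependent_values]


def set_range_active(active, first, last, is_active):
    """Activate or deactivate every row from `first` to `last` (0-based, inclusive)."""
    lo, hi = sorted((first, last))
    result = list(active)
    for i in range(lo, hi + 1):
        result[i] = is_active
    return result


potency = [1.2, None, float("nan"), 0.5, 0.9]
active = initial_row_activity(potency)

# Deactivate a contiguous block, as a click followed by <Shift> click would.
active = set_range_active(active, 3, 4, False)
```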
In Fig. 6.5’s spreadsheet, Potency in column 1 is the dependent variable, PLLO:C(CC)-C(CCCC) in column 4 is an inactive variable, and rows 16-18 are inactive. To view a compound, right-click on the row label of the compound of interest. There is a tool for subset selection described in later sections of this chapter and in the chapter Interactive Tree Analysis.
6.2.6 Picking Random Record Sets
If you have many records, you might want to reduce the number of records to process for several reasons (for example, a smaller set of records runs faster than a large one). However, selecting or de-selecting individual records may not be practical, and handpicking records might bias the results. Another reason to create a data subset is to produce test and validation data sets. This means you can do tree analysis on a random subset, and validate the model by running a tree on the holdout sample. Chapter 8, Prediction Recipes, has more details of this procedure. Section 6.3.2.1 details the ways and means of selecting random subsets.
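The idea of a random training subset plus a holdout sample can be sketched outside the program in plain Python. This is a minimal illustration under stated assumptions, not the procedure the software uses: the function name random_split, the 30% holdout fraction and the fixed seed are all choices made for the example.

```python
import random


def random_split(n_rows, holdout_fraction=0.3, seed=0):
    """Randomly partition row indices 0..n_rows-1 into training and holdout sets.

    Every row lands in exactly one of the two lists. A fixed seed makes
    the split reproducible.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_holdout = round(n_rows * holdout_fraction)
    holdout = sorted(indices[:n_holdout])
    training = sorted(indices[n_holdout:])
    return training, holdout


training_rows, holdout_rows = random_split(10, holdout_fraction=0.3, seed=1)
```

Because the rows are chosen by a shuffle rather than by hand, the selection does not depend on the analyst's judgment, which is the bias the paragraph above warns about.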