There are times when we would like to predict an unknown response. For example, let’s assume we have 500 patients with
complete data. For each patient we collected BP (blood pressure), smoking status, age and BMI. This is our
training data. We have another 500 records, but this set has no BP data. Can we use the rules built on the first 500 records to
estimate the unknown BP in the second 500 records? Making this kind of prediction is a valuable feature of Optimus RP. Here is
an overview of the steps involved:
- Create and save a forest of random trees from the first, complete dataset.
- Apply the rules found in the random trees to the second dataset with the missing BP.
- Generate and view the average tree predictions spreadsheet.
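The three steps above can be sketched in code. The following is only a minimal illustration of the general idea (build many small trees on the complete records, apply them to the records with the unknown response, and average the per-tree predictions). It is not Optimus RP's algorithm: the one-split "stump" trees, the bootstrap sampling, and every function name here are assumptions made for the example.

```python
import random
from statistics import mean

def fit_stump(rows, targets, feature):
    """Fit a one-split regression tree on one feature (hypothetical helper)."""
    values = sorted(set(r[feature] for r in rows))
    overall = mean(targets)
    best = (float("inf"), None, overall, overall)
    # Try a threshold halfway between each pair of neighbouring values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return feature, threshold, left_mean, right_mean

def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    if threshold is None:  # the feature had a single value: no split found
        return left_mean
    return left_mean if row[feature] <= threshold else right_mean

def fit_forest(rows, targets, n_trees=25, seed=0):
    """Step 1: build trees on bootstrap samples of the complete records."""
    rng = random.Random(seed)
    features = list(rows[0].keys())
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx],
                                rng.choice(features)))
    return forest

def predict_forest(forest, row):
    """Steps 2 and 3: apply every tree, then average the predictions."""
    return mean(predict_stump(stump, row) for stump in forest)
```

Here `rows` would be a list of dictionaries of predictor values (for example age, BMI and smoking status) and `targets` the known BP values of the first 500 patients; `predict_forest` is then called once per record whose BP is unknown.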
In the step-by-step example below we use Microsoft Excel to modify an existing dataset so that it contains
unknown values.
Figure 8.6: Excel spreadsheet of the data set.
8.2.1 Modifying a Copy of Data
First we need to create a set of data with missing information. In the folder Optimus RP\example there is a file called
CSIM.csv. Right-click on that file and choose Copy from the pop-up menu.
Next, right-click on the example folder in the left pane of Windows Explorer and choose Paste from the pop-up menu.
This will make a new file called Copy of CSIM.csv. Double-click that file to open it in Excel.
Scroll down to Patient 501 (this will be row 502 in Excel). In column C (contains the BP value) overwrite the existing
value with 0 (zero). At this point your screen should look like the illustration above.
Figure 8.7: Modifying data in the spreadsheet view.
We now need to copy that 0 value to the next 499 records. In the lower right corner of the cell there is a small square. If you
click and drag this down, it will copy the 0 value to the cells below. The illustration above shows the results of a short
drag.
Figure 8.8: Propagating the changes to other cells.
As you continue to drag down, the window will scroll. The further down you go, the faster the scrolling. Eventually you will
reach Patient 1000. Your screen should look like the illustration above. When you release the mouse button, Excel should fill
all rows from 502 through 1001 (Patients 501 to 1000) with zeros.
You need to save the change, so use the menus File->Save. Excel will ask permission to save in the CSV format and will warn
of a potential loss of functionality. This is OK, so answer YES. Now close Excel. It will again ask about saving. Answer NO, as
you have already saved the changes.
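For readers who prefer a script to dragging in Excel, the same modification can be sketched in Python. This is an illustration only: it assumes the CSV has a header row with one column named BP and one record per patient in patient order, and the function name `blank_response` and its parameters are hypothetical.

```python
import csv
import io

def blank_response(csv_text, column="BP", first_unknown=501, placeholder="0"):
    """Overwrite `column` with `placeholder` from record `first_unknown` onward.

    Records are counted from 1 after the header row, so record 501 is
    row 502 of the spreadsheet.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    col = header.index(column)  # raises ValueError if the column is absent
    for patient_number, record in enumerate(records, start=1):
        if patient_number >= first_unknown:
            record[col] = placeholder
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([header] + records)
    return out.getvalue()
```

To use it on the copied file, read the contents of Copy of CSIM.csv into a string, pass the string to `blank_response`, and write the returned text back to the file.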
8.2.2 Converting to a GHD File
We now need to convert that CSV file into a GHD file, Optimus RP’s native format. Do this using the Import Wizard described
in the chapter on Importing Data (section 4.3.1).
8.2.3 Create a Subset With BP Readings
After importing the CSV file, open the newly created spreadsheet and choose the menus Edit->Select subset. In this small
dialog select First N items: on the left and type 500 on the right. When the dialog looks like the illustration below, click the OK
button.
Figure 8.9: Selecting the first 500 "N" items for a subset.
If you were to scroll down to Patient 500 you would see that the second half of the dataset is deselected. These records will
not be used in the analysis at this point.
Figure 8.10: The selections beyond the first 500 are deselected.
We need to set a dependent variable, so click on the BP column header once. We also need to eliminate the BP_I column,
so click on its header twice. Your screen should look something like the illustration below.
Figure 8.11: Setting the dependent variable by clicking on the column header.
Next, select the menus Analysis->Interactive Tree Analysis to create a tree view.
8.2.4 Create Random Trees From Initial Dataset
From the tree view, select the menus Tree->Random Tree Creation to get the above screen. Next, click Go and the
random trees will be built. This may take a minute or so depending on the speed of your computer. Click Close when finished.
The random trees are represented as a Multitree Model, which will appear as a child of the tree node used to create the random
trees. The Multitree Model viewer is shown below.
Figure 8.12: The dialog window that appears when all the trees have been generated.
8.2.5 Invert Data to Create Dataset With Missing BP
Figure 8.13: Inverting the data in the spreadsheet view.
We now need to invert our data. Close the tree view and from the spreadsheet view select the menus Edit->Invert selection.
You are now working with the data that contains no BP.
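Conceptually, the subset selection and its inversion behave like a true/false flag per record. The sketch below illustrates that idea only; it does not describe how Optimus RP stores selections, and both function names are hypothetical.

```python
def select_first_n(total, n):
    """Flag the first n of `total` records as selected (like First N items)."""
    return [i < n for i in range(total)]

def invert_selection(mask):
    """Swap selected and deselected records (like Edit->Invert selection)."""
    return [not selected for selected in mask]

# 1000 records: the first 500 are used for training, then the selection is
# inverted so that only the 500 records without BP remain active.
training = select_first_n(1000, 500)
to_predict = invert_selection(training)
```

After the inversion, record 500 (the last training patient) is deselected and record 501 (the first patient without BP) is selected.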
8.2.6 Applying A Multitree Model To Make Predictions
With the spreadsheet containing the missing BP values active, select the menus Analysis->Apply a Tree Model. This will open
a Multitree selection window. This window contains a replica of the Project Navigator Window in which the
tree models you can apply are highlighted in white. Note that both a single tree and a Multitree are highlighted in
the example below. That is because you can make predictions using either a single tree or a set of random
trees.
Figure 8.14: Selecting the multitree model.
For our example we have selected the Multitree model that we just created, as shown below.
Figure 8.15: Choosing the multitree model that was just created.
After selecting a tree model click the OK button to start the prediction process. This will create an Applied Tree Model
and place it as a child of the spreadsheet node with the missing BP values. Below is the resulting Applied Tree
Model.
Figure 8.16: The tree view showing the model applied to the data.
From the Applied Tree Model select the menu Predictions->View Average Tree Predictions. This spreadsheet is
normally used in conjunction with a train-and-validation workflow, and so it shows the “actual” value of the BP, along with the
predicted value and the “residual”. Because we do not know the actual value (we overwrote it with 0), the Actual and Residual
columns are meaningless. However, the Predicted column shows the predicted value obtained from applying the model built on the
first 500 observations to the final 500 observations. You may then export these results from the spreadsheet to other data
formats if desired.
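The reason the Actual and Residual columns carry no information here can be shown with a small sketch. It assumes the residual is computed as actual minus predicted, which the text does not state; the function names and the BP values are hypothetical.

```python
def prediction_table(actual, predicted):
    """Build Actual / Predicted / Residual rows.

    Assumption: residual = actual - predicted.
    """
    return [
        {"Actual": a, "Predicted": p, "Residual": a - p}
        for a, p in zip(actual, predicted)
    ]

def predicted_only(table):
    """Keep just the column that is meaningful when the response is unknown."""
    return [row["Predicted"] for row in table]

# Hypothetical predictions for two records whose BP was overwritten with 0:
# each residual is simply the negated prediction.
table = prediction_table([0, 0], [120.5, 135.0])
```

With the placeholder 0 in the Actual column, every residual just mirrors the prediction, so only the Predicted column is worth exporting.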
At some future date, we anticipate removing the cumbersome requirement that the training and test data be in the same
spreadsheet. Note that another mechanism for doing this is available right now, namely
generating a C/C++ source code model as described in section 7.6.4. This may be a desirable option for those who are familiar
with the C/C++ programming languages.
Figure 8.17: The spreadsheet view comparing the Actual to Predicted.
The next illustration demonstrates how the relationships for applied models are displayed in the Project Navigator
Window. For convenience, the third column of the Project Navigator Window for an Applied Tree Model will
tell you which tree or Multitree was used to create the Applied Tree Model. Clicking on any of the nodes
involved (the Applied Tree Model, the spreadsheet, or the tree/Multitree used for the prediction) will
highlight all three nodes. The model used for the prediction will be highlighted in yellow, the spreadsheet
that the model was applied to in blue, and the resulting Applied Tree Model in
green.
Figure 8.18: The Project Navigator with the colors showing the tree(s) used to create the applied model.