8.2 Predicting An Unknown Response

There are times when we would like to predict an unknown response. For example, let’s assume we have 500 patients with complete data. The data we collected contains BP (blood pressure), whether they smoked or not, age and BMI. This is our training data. We have another 500 records, but this set has no BP data. Can we use the rules built on the first 500 records to estimate the unknown BP in the second 500 records? This is a valuable feature of Optimus RP. Here is an overview of the steps involved:

  1. Create and save a forest of random trees in the first, complete dataset.
  2. Apply the rules found in the random trees to the second dataset with the missing BP.
  3. Generate and view the average tree predictions spreadsheet.
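The three steps above can be sketched outside the GUI as follows. This is only a minimal illustration, not Optimus RP's algorithm: each "tree" is a one-split regression stump fitted to a bootstrap sample, the patient data is synthetic, and the column names (Age, BMI, Smoker, BP) and all function names are assumptions made for the example.

```python
import random
from statistics import fmean

def fit_stump(rows, target, features, rng):
    """Fit a one-split regression tree on one randomly chosen feature."""
    feature = rng.choice(features)
    values = sorted({r[feature] for r in rows})
    overall = fmean(r[target] for r in rows)
    if len(values) < 2:
        # No split is possible: predict the overall mean on both sides.
        return (feature, values[0], overall, overall)
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r[target] for r in rows if r[feature] <= threshold]
        right = [r[target] for r in rows if r[feature] > threshold]
        left_mean, right_mean = fmean(left), fmean(right)
        # Sum of squared errors of the two leaves.
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, (feature, threshold, left_mean, right_mean))
    return best[1]

def fit_forest(rows, target, features, n_trees=20, seed=0):
    """Step 1: build a forest of random stumps on the complete data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap sample
        forest.append(fit_stump(sample, target, features, rng))
    return forest

def predict(forest, row):
    """Steps 2 and 3: apply every tree's rule, then average the results."""
    return fmean(left if row[f] <= t else right for f, t, left, right in forest)

# Synthetic stand-in for the 1000 patient records (not the real CSIM data).
rng = random.Random(42)
patients = []
for pid in range(1, 1001):
    age = rng.randint(20, 80)
    bmi = rng.uniform(18, 40)
    smoker = rng.randint(0, 1)
    bp = 90 + 0.5 * age + 0.8 * bmi + 8 * smoker + rng.gauss(0, 5)
    patients.append({"Patient": pid, "Age": age, "BMI": bmi,
                     "Smoker": smoker, "BP": bp})

train = patients[:500]                               # complete records
unknown = [dict(p, BP=0) for p in patients[500:]]    # BP replaced by 0

forest = fit_forest(train, "BP", ["Age", "BMI", "Smoker"])
predicted = [predict(forest, row) for row in unknown]
print(unknown[0]["Patient"], round(predicted[0], 1))
```

Note that the placeholder BP of 0 in the second set is never read by the model; only the predictor columns are used when the rules are applied.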

In the step-by-step example below we use Microsoft Excel to modify an existing dataset so that part of it has unknown values.


[Picture]
Figure 8.6: Excel spreadsheet of the data set.

8.2.1 Modifying a Copy of Data

First we need to create a set of data with missing information. In the folder Optimus RP\example there is a file called CSIM.csv. Right click on that file and choose Copy from the pop-up menu.

Next, right click on the example folder in the left portion of Windows Explorer and choose Paste from the pop-up menu. This will make a new file called Copy of CSIM.csv. Double click that file to open it in Excel.

Scroll down to Patient 501 (this will be row 502 in Excel). In column C (contains the BP value) overwrite the existing value with 0 (zero). At this point your screen should look like the illustration above.


[Picture]
Figure 8.7: Modifying data in the spreadsheet view.

We now need to copy that 0 value to the next 499 records. In the lower right corner of the cell there is a small square. If you click and drag this square down, it will copy the 0 value to the cells below. The illustration above shows the results of a short drag.


[Picture]
Figure 8.8: Propagating the changes to other cells.

As you continue to drag down, the window will scroll. The further down you go, the faster it scrolls. Eventually you will reach Patient 1000. Your screen should look like the illustration above. When you release the mouse button, the BP cells for Patients 501 through 1000 (Excel rows 502 through 1001) should all be filled with zeros.

You need to save the change, so use the menus File->Save. Excel will ask permission to save in the CSV format and will warn of a potential loss of functionality. This is OK, so answer YES. Now close Excel. It will again ask about saving. Answer NO, as you have already saved your changes.
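For readers who prefer not to edit the file by hand, the same modification can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, the sample data is made up, and apart from BP the column headers shown are assumptions rather than the actual headers of CSIM.csv.

```python
import csv
import io

def zero_response(src, dst, column="BP", keep_first=500):
    """Copy CSV rows from src to dst, writing 0 into `column`
    for every data row after the first `keep_first` rows."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for index, row in enumerate(reader, start=1):
        if index > keep_first:
            row[column] = "0"   # placeholder for the unknown response
        writer.writerow(row)

# Tiny made-up sample: keep the first 2 readings, zero out the rest.
sample = io.StringIO("Patient,Age,BP\n1,34,120\n2,51,135\n3,47,128\n4,62,141\n")
result = io.StringIO()
zero_response(sample, result, keep_first=2)
print(result.getvalue())
```

To run this against real files, pass file objects opened with newline="" in place of the in-memory buffers and leave keep_first at 500.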

8.2.2 Converting to a GHD File

We now need to convert that CSV file into a GHD file, Optimus RP’s native format. Do this using the Import Wizard described in the chapter on Importing Data (section 4.3.1).

8.2.3 Create a Subset With BP Readings

After importing the CSV file, open the newly created spreadsheet and choose the menus Edit->Select subset. In this small dialog select First N items: on the left and type 500 on the right. When the dialog looks like the illustration below, click the OK button.


[Picture]
Figure 8.9: Selecting the first 500 "N" items for a subset

If you scroll down to Patient 500 you will see that the second half of the dataset (Patients 501 to 1000) is deselected. These rows will not be used in the analysis at this point.


[Picture]
Figure 8.10: The selections beyond the first 500 are deselected.

We need to set a dependent variable, so click on the BP column header once. We also need to exclude the BP_I column, so click on its header twice. Your screen should look something like the illustration below.


[Picture]
Figure 8.11: Setting the dependent variable by clicking on the column header.

Next, select the menus Analysis->Interactive Tree Analysis to create a tree view.

8.2.4 Create Random Trees From Initial Dataset

[Picture]

From the tree view select the menus Tree->Random Tree Creation to open the dialog shown above. Next click Go and the random trees will be built. This may take a minute or so depending on the speed of your computer. Click Close when finished. The random trees are represented as a Multitree Model, which will appear as a child of the tree node used to create the random trees. The Multitree Model viewer is shown below.


[Picture]
Figure 8.12: The dialog window that appears when all the trees have been generated.

8.2.5 Invert Data to Create Dataset With Missing BP


[Picture]
Figure 8.13: Inverting the data in the spreadsheet view.

We now need to invert our selection. Close the tree view and from the spreadsheet view select the menus Edit->Invert selection. The first 500 records are now deselected and the last 500 are selected, so you are working with the data that contains no BP.
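Conceptually, the subset and its inversion behave like a per-row on/off mask over a single spreadsheet. The sketch below illustrates that idea only; the function names are hypothetical and say nothing about how Optimus RP stores selections internally.

```python
def select_first_n(n_rows, n):
    """Return a selection mask with only the first n rows active."""
    return [index < n for index in range(n_rows)]

def invert_selection(mask):
    """Flip every row: active rows become inactive and vice versa."""
    return [not active for active in mask]

training_mask = select_first_n(1000, 500)         # Edit->Select subset
prediction_mask = invert_selection(training_mask)  # Edit->Invert selection

# Patient numbers (1-based) that are active after the inversion.
active_patients = [i + 1 for i, active in enumerate(prediction_mask) if active]
print(active_patients[0], active_patients[-1], len(active_patients))
# prints: 501 1000 500
```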

8.2.6 Applying A Multitree Model To Make Predictions

From the spreadsheet with the missing BP values active, select the menus Analysis->Apply a Tree Model. This will open up a Multitree selection window. This window contains a replica of the Project Navigator Window with the possible tree models you can apply highlighted in white. Note that both a single tree and a Multitree are highlighted in the example below. That is because you can make predictions using either a single tree or a set of random trees.


[Picture]
Figure 8.14: Selecting the multitree model.

For our example we have selected the Multitree model that we just created as shown below.


[Picture]
Figure 8.15: Choosing the multitree model that was just created.

After selecting a tree model click the OK button to start the prediction process. This will create an Applied Tree Model and place it as a child of the spreadsheet node with the missing BP values. Below is the resulting Applied Tree Model.


[Picture]
Figure 8.16: The tree view showing the model applied to the data.

From the Applied Tree Model select the menu Predictions->View Average Tree Predictions. This spreadsheet is normally used in conjunction with a train and validation workflow, and so it shows the “actual” value of the BP, along with the predicted value and the “residual”. Because we do not know what the actual value is, the Actual and Residual columns are meaningless. However, the Predicted column shows the predicted value obtained from applying the model built on the first 500 observations to the final observations. You may then export these results out of the spreadsheet to other data formats if desired.
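The following sketch shows why the Actual and Residual columns carry no information in this workflow. The per-tree predictions are invented numbers, the function name is hypothetical, and the sign convention for the residual (actual minus predicted) is an assumption made for the example.

```python
from statistics import fmean

def average_tree_predictions(actuals, per_tree_predictions):
    """Build Actual / Predicted / Residual rows, where Predicted is the
    mean of the individual tree predictions for that observation."""
    table = []
    for actual, tree_preds in zip(actuals, per_tree_predictions):
        predicted = fmean(tree_preds)
        table.append({"Actual": actual,
                      "Predicted": predicted,
                      "Residual": actual - predicted})  # assumed convention
    return table

# Two observations whose BP was overwritten with the placeholder 0,
# each with three hypothetical tree predictions.
actuals = [0, 0]
per_tree = [[118.0, 122.0, 120.0], [131.0, 135.0, 133.0]]

table = average_tree_predictions(actuals, per_tree)
for row in table:
    print(row)
```

Because every "actual" value is the placeholder 0, each residual is simply the negated prediction and says nothing about model accuracy.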

At some future date, we anticipate removing the cumbersome requirement that the training and test data reside in the same spreadsheet. Note that another mechanism for making predictions is available right now, namely generating a C/C++ source code model as described in section 7.6.4. This may be a desirable option for those who are familiar with the C/C++ programming languages.


[Picture]
Figure 8.17: The spreadsheet view comparing the Actual to Predicted.

The next illustration demonstrates how the relationships for applied models are displayed in the Project Navigator Window. For convenience, the third column of the Project Navigator Window for an Applied Tree Model tells you which tree or Multitree was used to create the Applied Tree Model. Clicking on any of the nodes involved (the Applied Tree Model, the spreadsheet, or the tree/Multitree used for the prediction) highlights all three nodes. The model used for the prediction is highlighted in yellow, the spreadsheet that the model was applied to is highlighted in blue, and the resulting Applied Tree Model is highlighted in green.


[Picture]
Figure 8.18: The Project Navigator with the colors showing the tree(s) used to create the applied model.