Training and Validation Recipe

As we build models from our data, it is valuable to see how they hold up out of sample. That is, we want to build a model on a training portion of the data set and validate it on the holdout (test) portion. A simple procedure for doing this is as follows:

  1. Select a random training subset of the data using the spreadsheet’s subset selection procedure, then go into interactive tree analysis.
  2. Compute a tree on the random subset.
  3. Return to the spreadsheet and invert the selection to obtain the holdout (test) data set.
  4. From the spreadsheet Analysis->Apply a Tree Model menu, choose the model built in step 2.
  5. You can then view the average tree predictions, as well as the RMS error of the predictions that result from applying the tree built on the training set to the holdout set.
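
The five steps above can also be sketched outside the spreadsheet. The following is a minimal illustration only, not the program's own algorithm: it assumes a one-split regression tree (a "stump") as a stand-in for the interactive tree, uses synthetic data, and all function names (split_rows, fit_stump, predict_stump, rms_error) are hypothetical.

```python
import math
import random


def split_rows(n_rows, train_fraction, seed=0):
    """Pick a random training subset; the holdout is the inverted selection."""
    rng = random.Random(seed)
    n_train = int(round(n_rows * train_fraction))
    train = sorted(rng.sample(range(n_rows), n_train))
    chosen = set(train)
    holdout = [i for i in range(n_rows) if i not in chosen]
    return train, holdout


def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing the squared error."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct x value: fall back to a constant prediction.
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(model, x):
    threshold, left_mean, right_mean = model
    return left_mean if x <= threshold else right_mean


def rms_error(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    return math.sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predicted)]))


# Synthetic data (an assumption for illustration, not the GSIM.csv example).
xs = [float(i) for i in range(20)]
ys = [10.0 if x < 10 else 20.0 for x in xs]

# Steps 1-2: random training subset, then build the tree on it.
train, holdout = split_rows(len(xs), train_fraction=0.5, seed=1)
model = fit_stump([xs[i] for i in train], [ys[i] for i in train])

# Steps 3-5: invert the selection, apply the model, summarize the predictions.
actual = [ys[i] for i in holdout]
predicted = [predict_stump(model, xs[i]) for i in holdout]
print("mean prediction:", mean(predicted))
print("holdout RMS error:", rms_error(actual, predicted))
```

The holdout RMS error is the figure to watch: if it is much larger than the error on the training rows, the model is not holding up out of sample.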

We now present a more detailed example that uses a multiple-tree model.

First, import the GSIM.csv file from the examples directory.


[Picture]
Figure 8.1: The spreadsheet’s Edit menu with "Select Row Subset" highlighted.

Use the spreadsheet menu shown above to select a random subset of the data. The dialog settings are shown below.


[Picture]
Figure 8.2: The dialog that allows you to choose the type and size of subset.

Next, click the BP column to make it the dependent variable, and disable the BP_I column by double-clicking it. We will now create a prediction model from the randomly selected data. From the spreadsheet’s Analysis menu, shown below, select Create a Multiple-Tree Model.


[Picture]
Figure 8.3: The Analysis menu for creating a multiple-tree model.

This adds a new Multitree Model to the Project Navigator Window as a child of the spreadsheet. We will use this model later to make predictions on the holdout data and so validate it.

Now return to the spreadsheet and choose Invert selection from the Edit menu. This creates a new spreadsheet in which all rows used to build the prediction model are disabled and all rows not used to build it are highlighted.

We now apply the Multitree Model created above to this new spreadsheet. Select Apply a Tree Model from the new spreadsheet’s Analysis menu. This brings up a dialog that displays a copy of the Project Navigator Window with all potential tree models highlighted. Click the Multitree Model created above, then click OK. The Multitree Model is then used to make predictions for the dependent BP column, and a new Applied Tree Model is added to the Project Navigator Window as a child of the spreadsheet on which we are making predictions. The Applied Tree Model is shown below.


[Picture]
Figure 8.4: The Applied Tree Model window.
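
As a rough illustration of what an average tree prediction means, the sketch below averages the outputs of several trees for one row. It is an assumption-laden toy, not the program's implementation: the three trees, their split values, and the predictor names x1 and x2 are all hypothetical, and nothing here reflects how the Multitree Model actually builds its trees.

```python
def average_tree_prediction(trees, row):
    """Average the predictions that each tree makes for a single row."""
    predictions = [tree(row) for tree in trees]
    return sum(predictions) / len(predictions)


# Three hypothetical trees, each written as a plain function of one row.
def tree_a(row):
    return 120.0 if row["x1"] <= 0.5 else 140.0


def tree_b(row):
    return 125.0 if row["x2"] <= 1.0 else 135.0


def tree_c(row):
    return 129.0  # a tree with no splits predicts a constant


trees = [tree_a, tree_b, tree_c]
holdout_row = {"x1": 0.2, "x2": 2.0}  # hypothetical predictor values

# tree_a gives 120.0, tree_b gives 135.0, tree_c gives 129.0.
average = average_tree_prediction(trees, holdout_row)
print("average tree prediction:", average)
```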

Now we are ready to validate our model by comparing the predictions with the actual data. To do this, select Predictions->View Average Tree Predictions. This option creates a new spreadsheet showing the actual value of the BP column, the value predicted by the model built on the training set, and the difference between the actual and predicted values. The prediction spreadsheet is shown below.


[Picture]
Figure 8.5: The spreadsheet view comparing the actual and predicted values.
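
The same comparison can be reproduced by hand from any list of actual and predicted values. The sketch below is illustrative only: the numbers are made up rather than taken from GSIM.csv, the function name compare is hypothetical, and the difference is assumed to be actual minus predicted (the sign convention used by the program is not stated here).

```python
import math


def compare(actual, predicted):
    """Return (actual, predicted, difference) rows; difference = actual - predicted."""
    return [(a, p, a - p) for a, p in zip(actual, predicted)]


# Made-up holdout values for illustration.
actual = [120.0, 135.0, 128.0, 142.0]
predicted = [118.0, 138.0, 128.0, 139.0]

rows = compare(actual, predicted)
print(f"{'actual':>8} {'predicted':>10} {'difference':>11}")
for a, p, d in rows:
    print(f"{a:8.1f} {p:10.1f} {d:11.1f}")

# Summarize the difference column as a single RMS error.
differences = [d for _, _, d in rows]
rms = math.sqrt(sum(d * d for d in differences) / len(differences))
print("RMS error:", round(rms, 3))
```

Scanning the difference column shows where the model misses, while the RMS error gives one number for comparing models on the same holdout rows.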