8.1 Training and Validation Recipe
As we build models from our data, it is valuable to see how they hold up out of sample. That is, we want to build a model on a
training portion of our data set, and validate the model on the holdout (test) portion of the data set. A simple procedure to do
this is as follows:
- Select a training subset of the data at random using the spreadsheet subset selection procedure and go into
interactive tree analysis.
- Compute a tree on the random subset.
- Return to the spreadsheet and invert the selection to obtain the holdout (test) data set.
- From the spreadsheet's Analysis->Apply a Tree Model menu, choose the tree computed on the training subset.
- You can then view the average tree predictions, as well as the RMS error of the predictions that result from
applying the tree built from the training set onto the holdout set.
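The procedure above can be sketched outside the application in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic, the helper names (`split_rows`, `fit_stump`, `rms_error`) are hypothetical, and a one-split regression stump stands in for the real tree model, which is far more capable.

```python
import math
import random


def split_rows(n_rows, train_fraction, seed=0):
    """Pick a random training subset; the inverted selection is the holdout."""
    rng = random.Random(seed)
    n_train = int(round(n_rows * train_fraction))
    train = set(rng.sample(range(n_rows), n_train))
    holdout = [i for i in range(n_rows) if i not in train]
    return sorted(train), holdout


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stand-in for the real tree model)."""
    overall = sum(ys) / len(ys)
    best = None
    for t in sorted(set(xs))[1:]:  # candidate split thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:  # only one distinct x value: predict the overall mean
        return lambda x: overall
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean


def rms_error(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic data (an assumption for illustration, not the example file).
xs = list(range(40))
ys = [10.0 if x < 20 else 30.0 for x in xs]

# Step 1: random training subset; step 3: inverted selection as holdout.
train_idx, holdout_idx = split_rows(len(xs), train_fraction=0.7, seed=1)

# Step 2: build the model on the training rows only.
tree = fit_stump([xs[i] for i in train_idx], [ys[i] for i in train_idx])

# Steps 4 and 5: apply the model to the holdout rows and measure the error.
holdout_actual = [ys[i] for i in holdout_idx]
holdout_pred = [tree(xs[i]) for i in holdout_idx]
print("holdout RMS error:", rms_error(holdout_actual, holdout_pred))
```

The key point is that the model never sees the holdout rows while it is being built, so the RMS error on those rows estimates how the model behaves out of sample.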
We now present a more detailed example that uses a multiple tree model.
First import the CSIM.csv file from the examples directory.
Figure 8.1: The spreadsheet’s Edit menu with the "Select Row Subset" item highlighted.
Use the spreadsheet menu shown above to select a random subset of the data. The dialog settings are shown
below.
Figure 8.2: The dialog that allows you to choose the type and size of subset.
Next, click the BP column to make it the dependent variable, and disable the BP_I column by double-clicking it. We are now going
to create a prediction model based on the randomly selected data. From the spreadsheet's Analysis menu, shown below, select Create a
Multiple-Tree Model.
Figure 8.3: The Analysis menu for creating a multiple tree model.
This adds a new Multitree Model to the Project Navigator Window as a child of the spreadsheet. We will use this model later
to make predictions on the holdout data and so validate it. Now return to the spreadsheet and, from the Edit
menu, choose Invert selection. This creates a new spreadsheet in which all rows used to build the prediction model are disabled and
all rows that were not part of the prediction model are highlighted. We now apply the Multitree Model created above to this
new spreadsheet. First, select Apply a Tree Model from the new spreadsheet’s Analysis menu. This brings
up a dialog that displays a copy of the Project Navigator Window with all potential tree models highlighted.
Click the Multitree Model created above and click GO. The Multitree Model is then used to make
predictions for the dependent BP column, and a new Applied Tree Model is added to the Project
Navigator Window as a child of the spreadsheet we are making predictions on. The Applied Tree Model is shown
below.
Figure 8.4: The Applied Tree Model window.
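As a rough illustration of what applying a multiple tree model involves, the sketch below averages the predictions of several trees for each holdout row. It assumes that the model's prediction for a row is the plain mean of its trees' predictions; the three rule-based "trees", the row values, and the function name `average_tree_predictions` are all hypothetical.

```python
from statistics import mean


def average_tree_predictions(trees, rows):
    """For each row, average the predictions of every tree in the model."""
    return [mean(tree(row) for tree in trees) for row in rows]


# Three hypothetical trees, each a simple rule on a single predictor value.
trees = [
    lambda x: 110.0 if x < 50 else 140.0,
    lambda x: 115.0 if x < 45 else 135.0,
    lambda x: 120.0 if x < 55 else 145.0,
]

# Hypothetical predictor values for three holdout rows.
holdout_rows = [30, 50, 70]

averaged = average_tree_predictions(trees, holdout_rows)
for row, prediction in zip(holdout_rows, averaged):
    print(row, round(prediction, 2))
```

Averaging tends to smooth out the disagreement between individual trees near their split points, which is visible in the middle row, where the three rules do not agree.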
Now we are ready to validate our model by comparing the predictions with the actual data. To do this, select
Predictions->View Average Tree Predictions from the menu. This option creates a new spreadsheet showing the actual value of the BP
column, the value predicted by the model built on the training set, and the difference between the actual and predicted values. The
prediction spreadsheet is shown below.
Figure 8.5: The spreadsheet view of the actual and prediction comparison.
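The comparison spreadsheet can be mimicked with a short Python sketch. The BP values below are invented for illustration and are not taken from CSIM.csv, the helper name `comparison_rows` is hypothetical, and the difference is assumed to be actual minus predicted.

```python
import math


def comparison_rows(actual, predicted):
    """Build (actual, predicted, difference) rows, with difference = actual - predicted."""
    return [(a, p, a - p) for a, p in zip(actual, predicted)]


# Made-up holdout values for the dependent BP column and its predictions.
actual_bp = [120.0, 135.0, 128.0, 142.0]
predicted_bp = [118.0, 139.0, 128.0, 136.0]

rows = comparison_rows(actual_bp, predicted_bp)

print(f"{'Actual':>8} {'Predicted':>10} {'Difference':>11}")
for a, p, d in rows:
    print(f"{a:8.1f} {p:10.1f} {d:11.1f}")

# Summarize the differences as a single RMS error for the holdout set.
rms = math.sqrt(sum(d * d for _, _, d in rows) / len(rows))
print(f"RMS error: {rms:.3f}")
```

Looking at the per-row differences alongside the RMS summary helps show whether the error comes from a few badly predicted rows or is spread evenly across the holdout set.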