8.2 Getting the Best Prediction Performance
We have done extensive experimentation on how to obtain the best prediction performance from ChemTree across a diverse set of data. The following rules of thumb may be helpful.
Multiple trees give best prediction performance.
More trees are better than fewer. However, most of the performance gain comes in the first 100-200 trees, and you would probably not want to use more than 1000, because the extra CPU time buys diminishing returns. The one case where a very large number of trees is worthwhile is when you want accurate indicators of correlation/interaction effects (10.2). In that case, the more trees the better, especially if you are examining many variables.
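The diminishing returns described above can be illustrated with the standard formula for the variance of an average of equally correlated predictors. This is a generic ensemble-averaging sketch, not a description of ChemTree's internals; the function name `ensemble_variance` and the values chosen for `sigma2` and `rho` are assumptions made purely for illustration.

```python
def ensemble_variance(n_trees, sigma2=1.0, rho=0.3):
    """Variance of the mean of n_trees predictors.

    Assumes every tree's prediction has variance sigma2 and every pair of
    trees has correlation rho. Both defaults are illustrative assumptions,
    not values measured on ChemTree output.
    """
    if n_trees < 1:
        raise ValueError("n_trees must be at least 1")
    # rho * sigma2 is the floor that averaging cannot remove;
    # the remaining term shrinks in proportion to 1 / n_trees.
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


if __name__ == "__main__":
    for n in (1, 10, 100, 200, 1000):
        print(n, round(ensemble_variance(n), 4))
```

Under these assumed values, going from 1 to 200 trees removes almost all of the reducible variance, while going from 200 to 1000 trees changes the result by less than 0.003.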
Over-building trees with a lax p-value threshold gives better prediction performance. Set the p-value threshold to 0.99 if your data sets are not too large. These are not trees you would want to look at individually to understand structure/activity relationships, because they make many non-significant splits. However, when you average over multiple trees, we have consistently seen significant prediction gains from over-building; this is especially true for small datasets of fewer than 1000 compounds. For very large datasets, you might go with a more conservative 0.05 p-value threshold so as not to overwhelm your computer with vast trees.
When you set a lax p-value threshold, set maxsegments=3 so as not to end up with meaningless 10-way splits.
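The two rules of thumb above can be written down as a small helper. This is only a sketch: the function name, the dictionary key `p_value_threshold`, and the `large_dataset_cutoff` argument are hypothetical and are not ChemTree settings. The text does not say where "very large" begins, so the cutoff must be supplied by the caller.

```python
def suggest_tree_options(n_compounds, large_dataset_cutoff):
    """Return suggested tree-building options as a plain dict.

    large_dataset_cutoff is a caller-chosen assumption; the manual gives
    no specific number for what counts as a very large dataset.
    """
    if n_compounds < 1:
        raise ValueError("n_compounds must be at least 1")
    if n_compounds >= large_dataset_cutoff:
        # Conservative threshold so the trees do not grow too large.
        return {"p_value_threshold": 0.05}
    # Lax threshold (over-building), paired with maxsegments=3 to avoid
    # meaningless many-way splits.
    return {"p_value_threshold": 0.99, "maxsegments": 3}


if __name__ == "__main__":
    # The cutoff used here is an arbitrary illustrative value.
    print(suggest_tree_options(800, large_dataset_cutoff=50_000))
```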
Optimal tree building parameters are displayed in the following tree options screen:
Surprisingly, adjusting the minimum elements per child seems to have little or no effect. One might expect that requiring a sufficient quorum of observations at a node would yield better predictions. If anything, prediction performance with multiple trees seems to degrade as the minimum elements per child increases.
For univariate analysis, the augmented atom descriptors work quite well over a broad class of screens, even when predicting melting point, which is a 3D packing problem. If you incorporate your own descriptors, you might try them both alone and in combination with the ChemTree descriptors, using test and holdout samples to see whether there is any performance difference.
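One way to organize such a comparison outside ChemTree is sketched below. It assumes you can export compound identifiers and can supply your own scoring routine; `holdout_split`, `compare_descriptor_sets`, and `score_fn` are hypothetical names introduced for illustration, not ChemTree functions.

```python
import random


def holdout_split(compound_ids, holdout_fraction=0.2, seed=0):
    """Shuffle the ids reproducibly and return (train_ids, holdout_ids)."""
    ids = list(compound_ids)
    random.Random(seed).shuffle(ids)
    n_holdout = int(len(ids) * holdout_fraction)
    return ids[n_holdout:], ids[:n_holdout]


def compare_descriptor_sets(descriptor_sets, score_fn, train_ids, holdout_ids):
    """Score each named descriptor set on the same train/holdout split.

    score_fn is a user-supplied callable (an assumption of this sketch)
    taking (descriptors, train_ids, holdout_ids) and returning a number.
    """
    return {
        name: score_fn(descriptors, train_ids, holdout_ids)
        for name, descriptors in descriptor_sets.items()
    }
```

Using the same split for every descriptor set (own descriptors alone, ChemTree descriptors alone, and both combined) keeps the comparison from being confounded by which compounds happened to land in the holdout sample.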
For large datasets, it is important to reduce the number of descriptors generated in the data import process (4.3.2) by requiring that each descriptor appear in at least a modest number of compounds. We have successfully used a threshold of 100 or more on a 30,000 compound data set. The rarer descriptors tend to be used rarely, and while dropping them may cost some prediction performance, you will lose more by not being able to compute a large number of multiple trees in a reasonable time.
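The frequency filter described above amounts to counting how many compounds contain each descriptor and dropping those below the threshold. The sketch below shows that idea on an in-memory mapping; it is not how the ChemTree import step is implemented, and the function name and data layout are assumptions for illustration.

```python
from collections import Counter


def filter_rare_descriptors(compounds, min_compounds=100):
    """Drop descriptors that occur in fewer than min_compounds compounds.

    compounds: dict mapping a compound id to an iterable of descriptor
    names (a hypothetical layout chosen for this sketch).
    Returns a new dict with the same ids and only the retained descriptors.
    """
    counts = Counter()
    for descriptors in compounds.values():
        # set() ensures a descriptor is counted once per compound.
        counts.update(set(descriptors))
    keep = {name for name, n in counts.items() if n >= min_compounds}
    return {cid: set(descs) & keep for cid, descs in compounds.items()}
```

For example, with `min_compounds=2`, a descriptor that appears in only one compound is removed from every compound's descriptor set, while the compound itself is kept.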