10.2 Using More Than One Dependent Variable


[Picture]
Figure 10.1: This spreadsheet shows three columns set to be dependent variables.

From the spreadsheet, it is possible to select more than one dependent variable, or one dependent categorical variable containing more than two categories, and carry out multivariate tree analysis.

For continuous response, only binary partitioning is possible using 0/1 binary predictors. Continuous responses may not be mixed with binary or categorical responses.

However, for multivariate binary response, where all dependent variables are 0/1 binary, the predictors may be binary, continuous, or categorical.

In addition, a single categorical response containing more than two categories, or several categorical responses, may be used alone or together with binary responses. The predictors may again be binary, continuous, or categorical. This is because categorical responses are first broken down into binary responses before analysis proceeds.
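The breakdown of a categorical response into binary responses can be pictured as follows. This is only an illustrative sketch in Python, assuming one 0/1 indicator column per category; the function name `expand_categorical` and the example category labels are hypothetical, and the exact encoding ChemTree uses internally is not described in this section.

```python
def expand_categorical(values):
    """Expand one categorical response column into 0/1 indicator columns.

    Assumption: one indicator per category, with categories taken in
    sorted order. This is not a documented ChemTree routine.
    """
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical response column with three categories.
response = ["active", "toxic", "inactive", "active"]
indicators = expand_categorical(response)

for name, column in indicators.items():
    print(name, column)
```

Under this assumption, a three-category response becomes three binary responses, which is why it can be analyzed together with other binary dependent variables.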

Multivariate analogs are present for histograms and manual split windows.


[Picture]
Figure 10.2: A tree view showing the multivariate splits information in the nodes.

An example multivariate tree is shown in Fig. 10.2 for the example multivariate data set, cancersubset.ghd. We have selected the results of 3 different cell line screens to analyze simultaneously. If you hover the mouse over one of the 3 glyphs, a tooltip will pop up displaying the response variable name. We see that in the root node all of the means are of a similar order of magnitude. The first split is highly significant, with the 151 compounds on the right more active against all 3 cell lines. Often it is more interesting to see a split where one or two variables are highly different, indicating some kind of preferential activity. This particular screen is not so rich in that respect – most compounds that kill one cancer cell type tend to kill all cell types across the board. However, there are a few exceptions if you look hard enough.

10.2.1 Continuous Multivariate Response

For continuous response, only binary partitioning is possible using 0/1 binary predictors. Continuous responses may not be mixed with binary or categorical responses.

Histograms and manual splits can also be created from trees based on this class of response.

The multivariate tree display shown above operates in much the same way as the univariate tree analysis, except that there is more than one dependent variable. For multivariate display, the means of each of the dependent variables are listed as u1, u2, and so forth. Similarly, the standard deviation, standard error, and node mean squared error are listed separately for each dependent variable.

In addition to the means of the dependent variables, we use a glyph representation to display changes in response. In the glyph representation, the root node shows the means of the responses relative to one another. Subsequent nodes in the tree show the change in each response versus the root response, in terms of standard deviation units.
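The quantity drawn in a non-root glyph can be sketched as below. This is a rough illustration in Python, assuming the change is scaled by the sample standard deviation of each response in the root node; the function name `glyph_values` and the data are hypothetical, and the precise scaling used by the display is not specified here.

```python
from statistics import mean, stdev


def glyph_values(root_rows, node_rows):
    """Shift of each response mean from the root mean, in root SD units.

    Each row holds one compound's values for all dependent variables.
    Assumption: the root-node sample standard deviation is the scale.
    """
    shifts = []
    for root_col, node_col in zip(zip(*root_rows), zip(*node_rows)):
        shifts.append((mean(node_col) - mean(root_col)) / stdev(root_col))
    return shifts


# Hypothetical data: five compounds, two responses each.
root = [(1.0, 30.0), (2.0, 10.0), (3.0, 50.0), (4.0, 20.0), (5.0, 40.0)]
node = [(4.0, 20.0), (5.0, 40.0)]  # compounds that fell into one child node

shifts = glyph_values(root, node)
print(shifts)
```

In this made-up example the first response rises by a little under one standard deviation in the child node, while the second response does not move at all.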

The dotted line marks one standard deviation on either side of zero change in response. In data with more pronounced effects, there may be 3 or 4 dotted lines, marking 3 or 4 standard deviations. For multivariate continuous response, only binary independent descriptors are implemented as predictors. In this case, the dependent variables are modeled as coming from a normal distribution, and the p-values are computed using a Hotelling T² statistic (see Section 17.2.2).
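For orientation, the textbook two-sample Hotelling T² statistic for a binary split can be sketched as follows. This is a minimal Python illustration, assuming a pooled covariance matrix and the standard conversion to an F statistic; the function names are hypothetical, and the exact form ChemTree uses is the one defined in Section 17.2.2, which may differ in detail. The p-value itself (the upper tail of the F distribution with p and n1 + n2 - p - 1 degrees of freedom) is not computed here.

```python
def column_means(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows, means):
    """Sum of outer products of deviations from the mean vector."""
    p = len(means)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, means)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hotelling_t2(left, right):
    """Return (T2, F) for the two groups produced by a 0/1 predictor."""
    n1, n2, p = len(left), len(right), len(left[0])
    m1, m2 = column_means(left), column_means(right)
    s1, s2 = scatter(left, m1), scatter(right, m2)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(p)]
              for i in range(p)]
    diff = [a - b for a, b in zip(m1, m2)]
    w = solve(pooled, diff)  # pooled^-1 * diff without forming the inverse
    t2 = (n1 * n2) / (n1 + n2) * sum(d * x for d, x in zip(diff, w))
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f_stat


# Hypothetical data: two responses per compound, split by a 0/1 descriptor.
left = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]   # descriptor = 0
right = [(5.0, 1.0), (6.0, 3.0), (7.0, 2.0), (8.0, 5.0)]  # descriptor = 1
t2, f_stat = hotelling_t2(left, right)
print(t2, f_stat)
```

With a single dependent variable, this statistic reduces to the square of the ordinary two-sample t statistic, which is one way to see it as the multivariate analog of the univariate test.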

The node representation can be further customized from the options menu to present other statistics on the variables.

10.2.2 Binary and/or Categorical Multivariate Response

In multivariate binary and/or categorical response, all dependent variables actually used for analysis are binary, since categorical responses are broken down into binary responses for the sake of analysis.

A univariate categorical response with more than two categories is treated effectively as a multivariate (binary) response.

All predictor types are supported for multivariate binary and/or categorical response. In the plot above, where four binary responses are profiled, we see continuous, integer and binary splits.

10.2.3 Multivariate Multiple Tree Clustering

Multiple tree clustering is also possible in multiple dimensions, and is fully analogous to the unidimensional case (see Section 9, Random Tree Generation). In the distance matrix plot, the user can choose either to view the compound distances versus one another in multivariate multiple tree space, or to set one or both of the axes to be one of the multivariate responses.

10.2.4 File->Output C Code

The results variable is set up as an array to handle multivariate responses.

10.2.5 Multivariate Compound View

The compound view is changed for multivariate analysis to display the activities for each of the responses as seen below.


[Picture]
Figure 10.3: ChemTree has the capability of visualizing compounds for the multivariate selection.

10.2.6 Visualize Split Data->Multiple Tree Atom Highlighting

For Multiple Tree Atom Highlighting in multivariate mode, the default is to pop up the highlighting for the first dependent variable. However, in the edit menu, you can switch to showing weightings for any one of the other dependent variables.

10.2.7 Multivariate Cherry Picking (Included in Cherry Picking Module)

The multivariate cherry picking module is much the same as the univariate cherry picking module (Section 7.7.4). However, in the multivariate version it is possible to specify a number of constraints on the ranges of the predictions of the dependent variables, as displayed in Fig. 10.4.


[Picture]
Figure 10.4: This is the cherry picking view of the multivariate selection.

NOTE: It is often better to predict the activity of all compounds and rank order the predictions in a spreadsheet, choosing the top of the list, rather than counting on the predictions being accurate enough to use a hard threshold. However, if you are cherry picking hundreds of thousands or millions of compounds, a low-level filter may be appropriate. This functionality may also be useful when you are trying to find compounds that have specificity against one target and have low activity on the other targets.
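The combination of range constraints and rank ordering can be sketched outside the application as follows. This is an illustrative Python sketch only: the function names, response names, compound identifiers, and prediction values are all hypothetical, and it is not the cherry picking module's own implementation. It applies a loose range filter to each predicted response and then rank orders the surviving compounds by one response.

```python
def within_ranges(prediction, ranges):
    """True if every constrained response lies inside its [low, high] range."""
    return all(low <= prediction[name] <= high
               for name, (low, high) in ranges.items())


def cherry_pick(predictions, ranges, rank_by, top_n):
    """Filter by range constraints, then keep the top_n ranked by one response."""
    passing = [(cid, p) for cid, p in predictions.items()
               if within_ranges(p, ranges)]
    passing.sort(key=lambda item: item[1][rank_by], reverse=True)
    return [cid for cid, _ in passing[:top_n]]


# Made-up predictions for one target and two off-target responses.
predictions = {
    "cmpd_A": {"target": 0.92, "off_target_1": 0.10, "off_target_2": 0.05},
    "cmpd_B": {"target": 0.95, "off_target_1": 0.80, "off_target_2": 0.70},
    "cmpd_C": {"target": 0.70, "off_target_1": 0.15, "off_target_2": 0.20},
    "cmpd_D": {"target": 0.30, "off_target_1": 0.05, "off_target_2": 0.02},
}

# Specificity: reasonably active on the target, low activity elsewhere.
ranges = {
    "target": (0.5, 1.0),
    "off_target_1": (0.0, 0.3),
    "off_target_2": (0.0, 0.3),
}

picked = cherry_pick(predictions, ranges, rank_by="target", top_n=2)
print(picked)
```

Keeping the range filter loose and letting the ranking decide the final selection follows the advice in the note: the constraints remove clearly unsuitable compounds, and the top of the ranked list is chosen rather than relying on a hard threshold.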