4.2 Mathematical Considerations

We saw in chapter 2, Welcome to Optimus RP, that we had to identify the dependent, independent and disabled variables of the data set that we wanted to analyze. The columns that we identify as independent variables can be any of the data types listed above. However, our choice for dependent variables is restricted to columns that contain either binary or integer/real data. Spreadsheet rows with missing values in the dependent column will be grayed out and inactivated.

4.2.1 P-Values and Dependent Variables

We also saw that as Optimus RP builds trees, it calculates p-values to assess the "goodness" of each partition. In order to be able to calculate p-values, Optimus RP makes assumptions about the distribution of the dependent variable. Binary dependent variables are assumed to have a binomial distribution and integer/real variables are assumed to have a normal distribution.

If the dependent variables violate this assumption, then it will be necessary to apply some mathematical transformation to map the dependent variable into the appropriate distribution. For instance, if a dependent real variable spans orders of magnitude, it could be brought closer to a normal distribution by applying a log transformation to it. This would need to be done before the data was imported.

4.2.2 Regression and Independent Variables

Optimus RP implements two types of regression: logistic and linear. The type of regression used depends on the data type of the independent variable used in the analysis. If the independent variable is binary then logistic regression is used. If the independent variable is integer/real then linear regression is used.

In the case of linear regression, it may be necessary to apply a monotonic transformation to the independent variable in order to achieve a better linear fit. The transformation will need to be done before the data is imported.

4.2.3 P-Value Algorithms

The algorithm that is used to calculate a p-value depends on the types of the independent and dependent variable used in the split. The following table summarizes the algorithms used.


Independent Variable Data Type Binary Dependent Variable Real/Integer Dependent Variable
Binary Chi Squared T - Test
Real/Integer Chi Squared with Segmenting Adjustment F - Test with Segmenting Adjustment
Categorical Chi Squared with Segmenting Adjustment F - Test with Segmenting Adjustment

The “Segmenting Adjustment” referred to for the real, integer, and categorical independent variable types is a multiplicity adjustment for finding the best way to segment the split. This refers to the same dependent and independent variables, but with the independent variable values divided into segments in different ways, contributing to a type of multiplicity. The segmenting adjustment algorithm depends on the type of split.

(See 17.)

4.2.4 Preparing Data Example

Here is a simple example of preparing data for import into Optimus RP. Suppose a file being imported has Boolean data saved in two complementary (mutually exclusive) columns. Let’s say that the columns are called YES and NO and either one or the other contains an ’X’ indicating the status of some condition.

To reformat the data you would need to create a third column of binary type perhaps named YES-NO. You would then place a 1 in the YES-NO column if the YES column contained an ’X’ and a 0 in the YES-NO column if the NO column contained an ’X’. Once the YES-NO column is properly filled in you might delete the YES and NO columns.

Of course, before you make any changes to a data file, you will want to make a back up copy. Also, if your data file is large, you will want to use some tool to automate the reformatting process. Depending on the type of data file, possible tools include, scripting languages, text editors, etc.

Here is an example using Microsoft Excel. Microsoft Excel works well for data sets composed of less than 65,536 rows and 256 columns. Once you import the data, create the new YES-NO column and fill its data cells with a formula that calculates a 1 or 0 value based on the presence of an ’X’ in the corresponding YES or NO columns. The ’IF’ formula that would do this is illustrated in Fig. 4.1.


[Picture]
Figure 4.1: A Microsoft Excel formula for creating a binary column.

A very common variation of this example employs two columns of data to identify gender. One column might be labeled MALE or M and the other FEMALE or F. The data cells would contain an ’X’ in either the MALE or FEMALE column. To convert this data into a single column of boolean type, we would use the same approach that we took with the YES or NO example above.