D Use of qualitative predictors

An important question is how to deal with qualitative, rather than quantitative, predictors when fitting a regression model. Qualitative predictors, also known as categorical variables or, in R's terminology, factors, are very common, for example in the social sciences. Dealing with them requires some care and a proper understanding of how these variables are represented.

The simplest case is the situation with two levels. A binary variable \(C\) with two levels (for example, a and b) can be represented as \[ D=\left\{\begin{array}{ll} 1,&\text{if }C=b,\\ 0,&\text{if }C=a. \end{array}\right. \] \(D\) is now a dummy variable: it codifies with zeros and ones the two possible levels of the categorical variable. An example of \(C\) could be gender, which has levels male and female. The associated dummy variable is \(D=0\) if the gender is male and \(D=1\) if the gender is female.

The advantage of this dummification is its interpretability in regression models. Since level a corresponds to \(0\), it can be seen as the reference level to which level b is compared. This is the key point in dummification: set one level as the reference and codify the rest as departures from it with ones.

R does the dummification automatically (it translates a categorical variable \(C\) into its dummy version \(D\)) if it detects that a factor variable is present in the regression model.
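To make the reference-level interpretation concrete, here is a minimal sketch in plain Python (used only for illustration; it is not how the dummification is done internally by any statistical software). The data `c` and `y` and the helper `dummy` are hypothetical. The sketch codes a two-level variable with a as the reference and fits \(Y=\beta_0+\beta_1D\) by least squares with the closed-form formulas of simple linear regression.

```python
from statistics import mean


def dummy(values, reference):
    """Code a two-level variable: 0 for the reference level, 1 for the other."""
    levels = set(values)
    if len(levels) != 2 or reference not in levels:
        raise ValueError("expected exactly two levels, one being the reference")
    return [0 if v == reference else 1 for v in values]


# Hypothetical data: a binary variable C and a response Y
c = ["a", "b", "a", "b", "b", "a"]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 3.0]

d = dummy(c, reference="a")

# Closed-form least-squares estimates for Y = beta0 + beta1 * D
d_bar, y_bar = mean(d), mean(y)
s_dy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
s_dd = sum((di - d_bar) ** 2 for di in d)
beta1 = s_dy / s_dd
beta0 = y_bar - beta1 * d_bar

# Group means of the response, for comparison with the estimates
mean_a = mean([yi for ci, yi in zip(c, y) if ci == "a"])
mean_b = mean([yi for ci, yi in zip(c, y) if ci == "b"])

print("intercept:", beta0, "| mean of reference level a:", mean_a)
print("slope:", beta1, "| mean of b minus mean of a:", mean_b - mean_a)
```

The intercept coincides with the mean response in the reference level a, and the coefficient of \(D\) with the difference between the mean responses in b and a. This is exactly the sense in which b is "compared" to the reference.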

Let’s see now the case with more than two levels, for example, a categorical variable \(C\) with levels a, b, and c. If we take a as the reference level, this variable can be represented by two dummy variables: \[ D_1=\left\{\begin{array}{ll}1,&\text{if }C=b,\\0,& \text{if }C\neq b\end{array}\right. \] and \[ D_2=\left\{\begin{array}{ll}1,&\text{if }C=c,\\0,& \text{if }C\neq c.\end{array}\right. \] Then \(C=a\) is represented by \(D_1=D_2=0\), \(C=b\) is represented by \(D_1=1,D_2=0\) and \(C=c\) is represented by \(D_1=0,D_2=1\).

In general, if we have a categorical variable with \(J\) levels, then the number of dummy variables required is \(J-1\). Again, R does the dummification automatically for you if it detects that a factor variable is present in the regression model.
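The \(J-1\) rule can be sketched as follows, again in plain Python and only as an illustration of the coding itself. The helper `dummify` and the sample values are hypothetical; the first entry of `levels` is taken as the reference level, an assumption made here for simplicity.

```python
def dummify(values, levels):
    """Code `values` with len(levels) - 1 dummies; levels[0] is the reference."""
    reference, *others = levels
    rows = []
    for v in values:
        if v not in levels:
            raise ValueError(f"unknown level: {v!r}")
        # One 0/1 indicator per non-reference level; the reference gives all zeros
        rows.append([1 if v == level else 0 for level in others])
    return rows


# Hypothetical sample of a categorical variable C with J = 3 levels
c = ["a", "b", "c", "b", "a"]
design = dummify(c, levels=["a", "b", "c"])

for value, (d1, d2) in zip(c, design):
    print(f"C = {value}: D1 = {d1}, D2 = {d2}")
```

Each row has \(J-1=2\) entries, and the reference level a is the only one coded as \(D_1=D_2=0\). A third dummy for a would be redundant, since it equals \(1-D_1-D_2\).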

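It may happen that one dummy variable, say \(D_1\), is not significant, while other dummy variables, say \(D_2\), are significant. Since each dummy measures a departure from the reference level, this means that level b does not differ significantly from the reference level a, whereas level c does.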

Do not codify a categorical variable as a discrete variable. This constitutes a major methodological failure that will flaw the subsequent statistical analysis.

For example, if you have a categorical variable party with levels partyA, partyB, and partyC, do not encode it as a discrete variable taking the values 1, 2, and 3, respectively. If you do so:

  • You assume implicitly an order in the levels of party, since partyA is closer to partyB than to partyC.
  • You assume implicitly that partyC is three times larger than partyA.
  • The codification is completely arbitrary: why not consider 1, 1.5, and 1.75 instead of 1, 2, and 3?
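The consequences listed above can be checked numerically. The following plain-Python sketch uses hypothetical data in which the mean response is 1, 5, and 2 for partyA, partyB, and partyC, respectively; the helpers `fit_line` and `fitted_by_level` are illustrative names. It fits a simple linear regression on the numeric codes and compares the fitted value for each level with the group means, which are what a regression on the two dummy variables would reproduce.

```python
from statistics import mean


def fit_line(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: two observations per party
levels = ["partyA", "partyB", "partyC"]
party = ["partyA", "partyA", "partyB", "partyB", "partyC", "partyC"]
y = [0.0, 2.0, 4.0, 6.0, 1.0, 3.0]


def fitted_by_level(codes):
    """Fitted value per level when party is replaced by the numeric codes."""
    x = [codes[p] for p in party]
    intercept, slope = fit_line(x, y)
    return {level: intercept + slope * codes[level] for level in levels}


# Two equally arbitrary numeric codifications of the same variable
fit_123 = fitted_by_level({"partyA": 1, "partyB": 2, "partyC": 3})
fit_alt = fitted_by_level({"partyA": 1, "partyB": 1.5, "partyC": 1.75})

# Group means: the fitted values of the regression on the dummy variables
group_means = {
    level: mean([yi for p, yi in zip(party, y) if p == level])
    for level in levels
}

for level in levels:
    print(level, group_means[level], fit_123[level], fit_alt[level])
```

With the codes 1, 2, and 3, the fitted values are forced to be equally spaced and ordered along a line, so they cannot capture that partyB has the largest mean response. Switching to 1, 1.5, and 1.75 changes the fitted values again, even though the data are the same.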

The right way of dealing with categorical variables in regression is to set the variable as a factor and let R do the dummification internally.