C Use of qualitative predictors
An important question when fitting a regression model is how to deal with qualitative, rather than quantitative, predictors. Qualitative predictors, also known as categorical variables or, in R's terminology, factors, are very common, for example in the social sciences. Dealing with them requires some care and a proper understanding of how these variables are represented.
The simplest case is that of two levels. A binary variable \(C\) with two levels (for example, a and b) can be represented as \[ D=\left\{\begin{array}{ll} 1,&\text{if }C=b,\\ 0,&\text{if }C=a. \end{array}\right. \] \(D\) is now a dummy variable: it codifies with zeros and ones the two possible levels of the categorical variable. An example of \(C\) could be gender, which has levels male and female. The associated dummy variable is \(D=0\) if the gender is male and \(D=1\) if the gender is female.
The advantage of this dummification is its interpretability in regression models. Since level a corresponds to \(0\), it can be seen as the reference level to which level b is compared. This is the key point in dummification: set one level as the reference and codify the rest as departures from it with ones.
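To make this interpretation concrete, here is a minimal sketch, assuming a linear model with a response \(Y\), one quantitative predictor \(X\), and the dummy variable \(D\). The symbols \(Y\), \(X\), \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\varepsilon\) are notation introduced here for illustration only: \[ Y=\beta_0+\beta_1X+\beta_2D+\varepsilon=\left\{\begin{array}{ll} \beta_0+\beta_1X+\varepsilon,&\text{if }C=a,\\ (\beta_0+\beta_2)+\beta_1X+\varepsilon,&\text{if }C=b. \end{array}\right. \] Under this assumed model, \(\beta_2\) is the shift in the mean of \(Y\) when moving from the reference level a to level b, with \(X\) held fixed.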
R does the dummification automatically (it translates a categorical variable \(C\) into its dummy version \(D\)) if it detects that a factor variable is present in the regression model.
Let’s see now the case with more than two levels, for example, a categorical variable \(C\) with levels a, b, and c. If we take a as the reference level, this variable can be represented by two dummy variables: \[ D_1=\left\{\begin{array}{ll}1,&\text{if }C=b,\\0,& \text{if }C\neq b\end{array}\right. \] and \[ D_2=\left\{\begin{array}{ll}1,&\text{if }C=c,\\0,& \text{if }C\neq c.\end{array}\right. \] Then \(C=a\) is represented by \(D_1=D_2=0\), \(C=b\) is represented by \(D_1=1,D_2=0\) and \(C=c\) is represented by \(D_1=0,D_2=1\).
In general, if we have a categorical variable with \(J\) levels, then the number of dummy variables required is \(J-1\). Again, R does the dummification automatically for you if it detects that a factor variable is present in the regression model.
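The \(J-1\) encoding described above can be sketched by hand. The snippet below is only an illustration written in plain Python: the function name `dummy_encode`, the column names `D_b` and `D_c`, and the sample data are hypothetical, and it is not the routine that any statistical package uses internally.

```python
def dummy_encode(values, reference=None):
    """Encode a categorical sequence as J - 1 dummy (0/1) columns.

    The reference level gets no column: it is represented by all
    dummies being equal to zero.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: first level in sorted order
    if reference not in levels:
        raise ValueError(f"unknown reference level: {reference!r}")
    # One dummy per non-reference level, equal to 1 exactly at that level
    return {
        f"D_{level}": [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Hypothetical sample with the three levels a, b, and c
C = ["a", "b", "c", "b", "a"]
dummies = dummy_encode(C, reference="a")
for name, column in dummies.items():
    print(name, column)
```

Observations at the reference level a have zeros in both columns, which matches the representation \(D_1=D_2=0\) given earlier.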
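It may happen that one dummy variable, say \(D_1\), is not significant, while other dummy variables, say \(D_2\), are significant. Since each dummy variable measures a departure from the reference level, this means that level b does not differ significantly from the reference level a, whereas level c does.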
Do not codify a categorical variable as a discrete variable. This is a major methodological mistake that will flaw the subsequent statistical analysis.
For example, if you have a categorical variable party with levels partyA, partyB, and partyC, do not encode it as a discrete variable taking the values 1, 2, and 3, respectively. If you do so:
- You implicitly assume an order in the levels of party, since partyA is closer to partyB than to partyC.
- You implicitly assume that partyC is three times larger than partyA.
- The codification is completely arbitrary: why not consider 1, 1.5, and 1.75 instead?
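The effect of the integer encoding can be seen on a small example. The following sketch, in plain Python, uses made-up responses `y` chosen purely for illustration. It fits a simple least-squares line on the codes 1, 2, and 3 and compares the fitted values with the group means. For a model with a single categorical predictor, least squares with dummy variables reproduces the group means, so the dummy coefficients are computed here directly from them.

```python
from statistics import mean

# Made-up data for illustration: two observations per level
party = ["partyA", "partyA", "partyB", "partyB", "partyC", "partyC"]
y = [1.0, 1.0, 5.0, 5.0, 3.0, 3.0]

# Wrong approach: encode the levels as the numbers 1, 2, 3
codes = {"partyA": 1, "partyB": 2, "partyC": 3}
x = [codes[p] for p in party]

# Simple linear regression of y on the integer codes (closed form)
x_bar, y_bar = mean(x), mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar
fitted_integer = {p: intercept + slope * c for p, c in codes.items()}

# Dummy approach with partyA as reference: the fit equals the group means,
# and each coefficient is a departure from the reference level
group_means = {
    p: mean(yi for pi, yi in zip(party, y) if pi == p) for p in codes
}
dummy_coefs = {
    "intercept": group_means["partyA"],
    "partyB": group_means["partyB"] - group_means["partyA"],
    "partyC": group_means["partyC"] - group_means["partyA"],
}

print("integer coding fit:", fitted_integer)
print("group means:       ", group_means)
print("dummy coefficients:", dummy_coefs)
```

In this toy data set the integer encoding forces the fitted values to change by the same amount from one level to the next, so it cannot capture that partyB has the largest mean. The dummy encoding places no such restriction on the levels.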
The right way of dealing with categorical variables in regression is to set the variable as a factor and let R do the dummification internally.