As mentioned, the choice of variables is a crucial decision. The omission of relevant,
important variables can distort the findings. The inclusion of irrelevant variables, on the other hand,
reduces model parsimony and may also mask or replace the effects of more useful variables.
Once an initial set of variables has been listed, some of them may turn out to be less
important, contributing little to the relationship.
For example, in the analysis of promotions, the sale of an item depends on the
discounted price of that item (discount price elasticity) as well as the discounted prices of competing
items (discounted cross-price elasticity). It typically turns out that some competitor items compete
strongly, so their prices are retained as variables in the final model, whereas the prices of other
items have no significant impact and are removed from the model.
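To make the roles of these variables concrete, such a promotional model is often written in log-log form so that the coefficients can be read directly as elasticities. The formulation below is an illustrative sketch; the item labels and the log-log specification are assumptions, not taken from a particular dataset:
$$ \ln(q_A) = b_0 + b_1 \ln(p_A) + b_2 \ln(p_B) + b_3 \ln(p_C) + \dots $$
Here $q_A$ is the sales quantity of item A, $p_A$ its discounted price, and $p_B, p_C, \dots$ the discounted prices of competing items; $b_1$ is the own discount price elasticity and $b_2, b_3, \dots$ are cross-price elasticities. Variable selection then amounts to deciding which competitor prices remain in the model.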
There are a variety of approaches leading to the final selection of variables. Some of
these are discussed here.
The confirmatory approach is one where the analyst specifies the variables.
Typically, one would use a variety of other methods, arrive at some conclusions, and then confirm the
variables to be deployed in the regression.
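As a minimal sketch of the confirmatory approach, the analyst-specified model is simply fitted directly, with no automated search. The example below uses Python with statsmodels; the data and the column names (sales, own_price, comp_price) are purely hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical promotional data, for illustration only.
rng = np.random.default_rng(0)
promo = pd.DataFrame({
    "own_price": rng.uniform(1.0, 5.0, 100),
    "comp_price": rng.uniform(1.0, 5.0, 100),
})
promo["sales"] = 50 - 6 * promo["own_price"] + 3 * promo["comp_price"] + rng.normal(0, 2, 100)

# Confirmatory approach: the analyst specifies the variables up front.
model = smf.ols("sales ~ own_price + comp_price", data=promo).fit()
print(model.summary())  # inspect coefficients, t-values and R-squared
```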
Another method commonly found in statistical packages is called stepwise regression.
It follows this sequence of steps:
1. At the start, the dependent variable ($y$) is regressed on the most highly correlated
predictor variable ($x_1$):
$$ y = b_0+b_1 x_1 $$
2. Next, the predictor ($x_2$) with the highest partial correlation is added to
the model:
$$ y = b_0+b_1 x_1+b_2 x_2 $$
3. After each additional variable is added, the algorithm examines the partial F-value for
the variable(s) already in the model (e.g., $x_1$). If a variable no longer makes a significant contribution,
given the presence of the new variable, it is removed.
4. Steps 2 and 3 are repeated with the remaining independent variables until all have been examined,
and a “final” model emerges; an illustrative code sketch of this procedure follows the list.
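The procedure above can be sketched in a few lines of Python. The version below uses p-values from statsmodels OLS as the entry and removal criterion, with thresholds chosen arbitrarily for illustration; real packages offer several criteria (partial F-tests, AIC, BIC), so treat this as a sketch rather than a reference implementation:

```python
import pandas as pd
import statsmodels.api as sm

def stepwise_select(y, X, p_enter=0.05, p_remove=0.10):
    """Illustrative stepwise selection: X is a DataFrame of candidate predictors."""
    selected = []
    while True:
        changed = False
        # Forward step: add the remaining predictor with the smallest p-value,
        # provided it clears the entry threshold.
        remaining = [c for c in X.columns if c not in selected]
        if remaining:
            pvals = pd.Series(
                {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                 for c in remaining})
            if pvals.min() < p_enter:
                selected.append(pvals.idxmin())
                changed = True
        # Backward step: re-examine variables already in the model and drop the
        # weakest one if it no longer makes a significant contribution.
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            pvals = fit.pvalues.drop("const")
            if pvals.max() > p_remove:
                selected.remove(pvals.idxmax())
                changed = True
        if not changed:
            return selected
```

For instance, `stepwise_select(promo["sales"], promo[["own_price", "comp_price"]])` would return the list of retained predictor names for the hypothetical data shown earlier.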
Sometimes the coefficients of one or two variables in the final model may not be
meaningful; the sign of the relationship may be counterintuitive. Normally this occurs for variables
that contribute relatively little to the relationship, and they can be removed.
Another approach, referred to as forward regression, is similar to the stepwise
method. It adds independent variables progressively as long as they make contributions greater than
some threshold level. However, unlike stepwise estimation, once selected, variables are not deleted at
any subsequent stage.
Yet another approach, backward regression, is essentially forward regression in
reverse. It starts with all the candidate independent variables in the model and sequentially deletes those
that make contributions below some threshold value.
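Both search directions are available off the shelf. As one concrete (and assumed) illustration, scikit-learn's SequentialFeatureSelector can wrap a linear regression and run either a forward or a backward search; note that it scores candidate subsets by cross-validated fit rather than by a significance threshold, but the add-only versus delete-only mechanics are the same:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the promotional data: 8 candidate predictors,
# only 3 of which actually drive the response.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=1)

lin = LinearRegression()

# Forward search: start empty and add variables one at a time;
# once added, a variable is never removed.
forward = SequentialFeatureSelector(
    lin, n_features_to_select=3, direction="forward").fit(X, y)

# Backward search: start with all variables and sequentially delete the weakest.
backward = SequentialFeatureSelector(
    lin, n_features_to_select=3, direction="backward").fit(X, y)

print("forward keeps columns:", forward.get_support(indices=True))
print("backward keeps columns:", backward.get_support(indices=True))
```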
These diverse model selection procedures, which are automated in statistical packages
like R, SAS and SPSS, are useful for variable screening.
In applications such as data mining, where the association between variables is often
unknown and needs to be explored, these techniques help unearth promising relationships.