Regression Analysis — Variable Selection Methods

As mentioned, the choice of variables is a crucial decision. Omitting relevant, important variables can distort the findings. Including irrelevant variables, on the other hand, reduces model parsimony and may mask or replace the effects of more useful variables.

Once an initial set of variables has been listed, some of them may turn out to be less important, contributing little to the relationship.

For example, in the analysis of promotions, the sales of an item depend on the discounted price of that item (discount price elasticity) as well as the discounted prices of competing items (discounted cross-price elasticity). It may turn out that some competing items compete strongly, and their prices are retained as variables in the final model, whereas the prices of other items have no significant impact and are removed from the model.
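
In a log-log specification, which is one common way such promotional response models are formulated (the text does not prescribe a particular functional form), the coefficients can be read directly as elasticities:

$$ \ln q_i = b_0 + b_1 \ln p_i + b_2 \ln p_j $$

Here q_i is the unit sales of item i, p_i its discounted price, and p_j the discounted price of a competing item j; b_1 is then the discount price elasticity and b_2 the discounted cross-price elasticity.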

There are a variety of approaches leading to the final selection of variables. Some of these are discussed here.

The confirmatory approach is one where the analyst specifies the variables. Typically, one would use a variety of other methods, arrive at some conclusions, and then confirm the variables to deploy in the regression.

Another method commonly found in statistical packages is called stepwise regression. It follows this sequence of steps (a code sketch follows the list):

  1. At the start, the dependent variable (y) is regressed on the most highly correlated predictor variable (x1):

     $$ y = b_0+b_1 x_1 $$

  2. Next, the predictor (x2) with the highest partial correlation is added to the model:

     $$ y = b_0+b_1 x_1+b_2 x_2 $$

  3. After each additional variable is added, the algorithm examines the partial F value for the previously entered variable(s) (x1). If a variable no longer makes a significant contribution, given the presence of the newly added variable, it is removed.

  4. Steps 2 and 3 are repeated with the remaining independent variables, till all have been examined and a “final” model emerges.
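
The following is a minimal sketch of this procedure in Python, using ordinary least squares from the statsmodels package. The entry and removal thresholds, the illustrative data, and the use of coefficient p-values in place of the partial F test (equivalent for a single coefficient, since F = t²) are assumptions, not part of the text.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def stepwise_select(X, y, enter_p=0.05, remove_p=0.10):
    """Forward entry with backward removal, driven by coefficient p-values.

    enter_p < remove_p ensures a just-entered variable is not immediately
    dropped again in the same pass.
    """
    selected = []
    while True:
        changed = False
        # Step 2: try to add the remaining predictor with the smallest p-value.
        remaining = [c for c in X.columns if c not in selected]
        if remaining:
            pvals = pd.Series(index=remaining, dtype=float)
            for col in remaining:
                model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
                pvals[col] = model.pvalues[col]
            if pvals.min() < enter_p:
                selected.append(pvals.idxmin())
                changed = True
        # Step 3: drop any earlier entrant that is no longer significant
        # given the presence of the newly added variable.
        if selected:
            model = sm.OLS(y, sm.add_constant(X[selected])).fit()
            coef_p = model.pvalues.drop("const")
            if coef_p.max() > remove_p:
                selected.remove(coef_p.idxmax())
                changed = True
        # Step 4: repeat until no variable is added or removed.
        if not changed:
            return selected

# Illustrative data: y depends on x1 and x2; x3 is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = 2.0 + 1.5 * X["x1"] - 0.8 * X["x2"] + rng.normal(size=200)
print(stepwise_select(X, y))   # typically ['x1', 'x2']
```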

Sometimes the coefficients of one or two variables in the final model may not be meaningful; the direction of the relationship may be nonsensical. Normally this occurs for variables that contribute relatively little to the relationship, and they can be removed.

Another approach, referred to as forward regression, is similar to the stepwise method. It adds independent variables progressively as long as they make contributions greater than some threshold level. However, unlike stepwise estimation, once selected, variables are not deleted at any subsequent stage.

Yet another approach, backward regression, is essentially forward regression in reverse. It starts with all candidate independent variables in the model, and sequentially deletes those that make contributions below some threshold value.
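
For illustration, both directions are available in scikit-learn's SequentialFeatureSelector. Note that it judges contributions by cross-validated fit rather than a significance threshold; the synthetic data and the choice to retain three predictors below are assumptions for the sketch.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: eight candidate predictors, three of which drive y.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)

for direction in ("forward", "backward"):
    selector = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=3,  # stop once three predictors are retained
        direction=direction,     # add progressively, or start full and delete
        cv=5,                    # contributions judged by cross-validated fit
    )
    selector.fit(X, y)
    print(direction, selector.get_support(indices=True))
```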

These diverse model selection procedures, which are automated in statistical packages like R, SAS and SPSS, are useful for variable screening.

In applications such as data mining, where the association between variables is often unknown and needs to be explored, these techniques help unearth promising relationships.

