# Hypothesis Testing

A hypothesis test examines two mutually exclusive claims about a parameter to determine which is best supported by the sample data. The parameter is usually the mean or proportion of some population variable of importance to the marketer.

The null hypothesis (H0) is the status quo or the default position that there is no relationship or no difference. The alternative or research hypothesis (HA) is the opposite of the null. It represents the relationship or difference.

The conclusion of the hypothesis test can be right or wrong. Erroneous conclusions are classified as Type I or Type II.

Type I error or false positive occurs when the null hypothesis is rejected, even though it is actually true. There is no difference between the groups, contrary to the conclusion that a significant difference exists.

Type II error or false negative occurs when the null hypothesis is accepted, though it is actually false. The conclusion that there is no difference is incorrect.

An oft quoted example is of the jury system where the defendant is “innocent until proven guilty” (H0 = “not guilty”, HA = “guilty”). The jury’s decision whether the defendant is not guilty (accept H0), or guilty (reject H0), may be either right or wrong. Convicting the guilty or acquitting the innocent are correct decisions. However, convicting an innocent person is a Type I error, while acquitting a guilty person is a Type II error.

Though one type of error may sometimes be worse than the other, neither is desirable. Researchers and analysts contain the error rates by collecting more data or greater evidence, and by establishing decision norms or standards.

A trade-off, however, is required because adjusting the norm to reduce Type I error increases Type II error, and vice versa. Expressed in terms of the probability of making an error, the standards are summarized in Exhibit 32.18:

#### Exhibit 32.18 Probability of type I error (α), probability of type II error (β), and power (1 – β), the probability of correctly rejecting the null hypothesis.

α: Probability of making a Type I error, also referred to as the significance level. It is usually set at 0.05 or 5%, i.e., when the null hypothesis is true, a Type I error occurs 5% of the time.

β: Probability of making a Type II error.

1 – β: Called power, is the probability of correctly rejecting the null hypothesis.

In other words, power is the probability of correctly concluding that there is a difference, which is usually the very objective of the study.

Power is dependent on three factors:

• Type I error (α) or significance level: Power decreases as the significance level is lowered. The norm for quantitative studies is α = 5%.
• Effect size (Δ): The magnitude of the “signal”, or the amount of difference between the parameters of interest. This is specified in terms of standard deviations, i.e., Δ=1 pertains to a difference of 1 standard deviation.
• Sample size: Power increases with sample size. While very small samples make statistical tests insensitive, very large samples make them overly sensitive. With excessively large samples, even very small effects can be statistically significant, which raises the issue of practical significance vs. statistical significance.
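
The way power depends on these factors can be sketched numerically. The snippet below is a minimal illustration, assuming a one-sided, one-sample z-test with a known standard deviation (a simplifying assumption made here, not a case the text prescribes). The function name `power_one_sided_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(effect_size, n, alpha=0.05):
    """Approximate power of a one-sided, one-sample z-test.

    effect_size is the difference expressed in standard deviations (Δ).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # critical value for the chosen α
    return std_normal.cdf(effect_size * sqrt(n) - z_crit)

# Power for Δ = 0.5 at several sample sizes (α = 0.05) ...
for n in (10, 25, 50):
    print(n, round(power_one_sided_z(0.5, n), 3))

# ... and at a stricter significance level (α = 0.01) with n = 25.
print(round(power_one_sided_z(0.5, 25, alpha=0.01), 3))
```

Under these assumptions, power rises with sample size and effect size, and falls when α is lowered.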

Power is usually set at 0.8 or 80%, which makes β (type II error) equal to 0.2.

Since both α and power (or β) are usually set according to norms, the size of a sample is essentially a function of the effect size, or the detectable difference. This is discussed further in the section Sample Size — Comparative Studies, in Chapter Sampling.
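
To illustrate how sample size follows from effect size once α and power are fixed, here is a rough sketch. It assumes a one-sided, one-sample z-test, for which the required sample size is approximately ((z₁₋α + z₁₋β)/Δ)². The sample size section referenced above may use a different formula (for example, for two groups or two-sided tests), and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_one_sided_z(effect_size, alpha=0.05, power=0.80):
    """Approximate n for a one-sided, one-sample z-test (an assumed design)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)   # quantile for the significance level
    z_beta = std_normal.inv_cdf(power)        # quantile for the desired power
    return ceil(((z_alpha + z_beta) / effect_size) ** 2)

# Smaller detectable differences require much larger samples.
for delta in (0.2, 0.5, 1.0):
    print(delta, sample_size_one_sided_z(delta))
```

Because Δ enters the formula squared, halving the detectable difference roughly quadruples the required sample.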

From the viewpoint of taking decisions, the distinction between statistical significance and practical or market significance must be clearly understood.

Take for example the results of a product test revealing, with statistical significance, that a new formulation is likely to increase a brand’s sales by a million dollars. These findings, however, may not have any practical or market significance if the increase in sales is too small to justify the costs of introducing the new variant.

In another example, pertaining to a retail bank, a number of initiatives targeting high-value customers may have resulted in the reported increase in their customer satisfaction rating from 3.0 to 3.5 on a 5-point scale. This large increase is of importance, suggesting that the initiatives have helped. However, if the p-value for the data is 0.1, the result is not statistically significant at the usual level (α = 0.05): if the initiatives had made no real difference, a change at least this large would still arise from sampling error about 10% of the time.

If the sample size is increased so that the results are statistically significant, that would increase the level of confidence that the difference is “real”, and would justify the continuation of the new initiatives.

Hypothesis tests are classified as one-tailed or two-tailed tests. The one-tailed test specifies the direction of the difference, i.e., the null hypothesis, H0, is expressed as an inequality of the form parameter ≥ some value, or parameter ≤ some value.

For instance, in a before and after advertisement screening test, if the ad is expected to improve consumers’ disposition to try a new brand, then the hypothesis may be phrased as follows:

H0, null hypothesis: D_after ≤ D_before

HA, research hypothesis: D_after > D_before

Where D is the disposition to try the product, expressed as the proportion of respondents claiming they will purchase the brand.

If the direction of the difference is not known, a two-tailed test is applied. For instance, if for the same test, the marketer is interested in knowing whether there is a difference between men and women, in their disposition to buy the brand, the hypothesis becomes:

H0, null hypothesis: D_male = D_female

HA, research hypothesis: D_male ≠ D_female

The standard process for hypothesis testing comprises the following steps:

1. H0, HA: State the null and alternative hypotheses.
2. α: Set the level of significance, i.e. type I error. For most research studies this is set at 5%.
3. Test statistic: Compute the test statistic. Depending on the characteristics of the test this is either the z-score (standard score), the t-value, or the f-ratio.
4. p-value: Obtain the p-value by referencing the test statistic in the relevant distribution table. The normal distribution is used for referencing the z-score, the t distribution for the t-value and the f distribution for the f-ratio.
5. Test: Accept the research hypothesis HA (reject H0) if p-value < α.
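
Steps 4 and 5 can be sketched in a few lines for the case where the test statistic is a z-score. This is a minimal illustration, not a prescribed implementation. The helper names `p_value_from_z` and `decide` are hypothetical, and the z value used is illustrative rather than taken from the text.

```python
from statistics import NormalDist

def p_value_from_z(z, tail):
    """Step 4: convert a z-score to a p-value; tail is 'lower', 'upper' or 'two'."""
    cdf = NormalDist().cdf
    if tail == "lower":
        return cdf(z)
    if tail == "upper":
        return 1 - cdf(z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'lower', 'upper' or 'two'")

def decide(p_value, alpha=0.05):
    """Step 5: reject H0 only when the p-value is below α."""
    return "reject H0" if p_value < alpha else "do not reject H0"

z = 2.5                          # illustrative test statistic, not from the text
p = p_value_from_z(z, "two")     # two-tailed test
print(round(p, 4), decide(p))
```

The `tail` argument mirrors the one-tailed versus two-tailed distinction: a two-tailed p-value counts extreme results in both directions.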

Each of the test statistics is essentially a signal-to-noise ratio, where the signal is the relationship of interest (for instance, the difference in group means), and noise is a measure of variability of groups.

If a measurement scale outcome variable has little variability it will be easier to detect change than if it has a lot of variability (see Exhibit 32.17). So assessments or estimates of variability (i.e. standard deviation) are an important ingredient in sample size estimation.

A z-score (z) indicates how many standard errors the sample mean is from the population mean.

$$z = \frac{\bar x-μ}{s/\sqrt n}$$

Where x̄ is the sample mean, μ is the population mean, s is the standard deviation of the population, n is the sample size, and s/√n is the standard error of the mean, i.e., the standard deviation of the sample mean (refer CLT).

Details of the t-test are provided in the section t-test, and the f-ratio is covered in the section ANOVA.

Note: The Data Analysis add-in in Excel provides an easy-to-use facility to conduct z, t and f hypothesis tests. P-value calculators are also available online, for instance, at this Social Science Statistics web page.

For one-tailed, known mean and standard deviation tests the test statistic to use is the z-score.

Example: A slew of initiatives (increases in excise duty, constraints on pack sizes and restrictions on distribution channels) were introduced in an effort to cut down the consumption of cigarettes. The government’s target was to reduce consumption to less than 120 sticks per smoker, per month.

In a study conducted to assess the success of this initiative, the average consumption of 100 smokers was 28 sticks per week, or 112 sticks per month.

Based on a large scale consumption study conducted at an earlier date, the standard deviation of consumption of cigarettes was estimated as 40 sticks per month.

Based on this information, can we tell whether or not the government achieved its target?

H0: μ ≥120

HA: μ<120

α = 5%

$$z = \frac{\bar x-μ}{s/\sqrt n} = \frac{112-120}{40/10} = -2.0$$

p-value = 0.023 < α = 0.05

#### Exhibit 32.19 If the true mean were 120 sticks per month, the probability that the sample average consumption is 112 or less is 0.023 or 2.3%.

The p-value (refer Exhibit 32.19) reveals that the probability of obtaining a z-score of −2.0 or lower is approximately 0.023. In other words, if mean consumption were still 120 sticks per month, there would be only a 2.3% chance of drawing a sample of 100 smokers whose average is 112 sticks or fewer. The p-value is computed assuming the null hypothesis is true; it is not the probability that the initiatives failed, nor is 1 − 0.023 the probability that they succeeded.

As the p-value is below the significance level of 5%, the null hypothesis is rejected. The data support the conclusion that the government’s initiatives reduced consumption to less than 120 sticks per month.
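
The arithmetic for this example can be checked with Python's standard library. The sketch below uses only the figures given in the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 120    # boundary value of H0 (sticks per month)
x_bar = 112   # sample mean (sticks per month)
s = 40        # standard deviation from the earlier large-scale study
n = 100       # number of smokers sampled

z = (x_bar - mu_0) / (s / sqrt(n))
p_value = NormalDist().cdf(z)    # lower tail, because HA is μ < 120

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

This reproduces z = −2.0 and a p-value of about 0.023.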

Note: A p-value from z-score calculator is provided on this web page.

Two-tailed, known mean and standard deviation tests also use the z-score statistic. (We can use the z-score and the normal distribution, provided the population’s standard deviation is known).

Example: The mean weight of fresh recruits into the army was 65.8 kg last year. For a sample of 200 recruits this year, the mean weight is 66.2 kg. Assuming the population standard deviation is 3.2 kg, at 0.05 significance level, can we conclude that the mean weight has changed since last year?

#### Exhibit 32.20 If the actual mean were 65.8 kg, there is a 7.7% probability that the mean weight of the sampled recruits would be ≥ 66.2 kg or ≤ 65.4 kg.

H0: μ=65.8

HA: μ≠65.8

α = 5%

$$z = \frac{\bar x-μ}{s/\sqrt n} = \frac{66.2-65.8}{3.2/14.14} = 1.77$$

p-value = 0.077 > α = 0.05

The p-value of 0.077, obtained from the normal distribution (Exhibit 32.20) for z = 1.77, is not significant at the given level of 5%.

If the actual mean were 65.8 kg, there is a 7.7% probability that the mean weight of the sampled recruits would be ≥ 66.2 kg or ≤ 65.4 kg. Since this probability is higher than the significance level of 5%, the null hypothesis is not rejected. At the 5% significance level, we cannot conclude that the new recruits differ in weight from those recruited last year.
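
The two-tailed calculation can be sketched in the same way. The figures come from the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 65.8    # last year's mean weight (kg), the value under H0
x_bar = 66.2   # this year's sample mean (kg)
sigma = 3.2    # assumed population standard deviation (kg)
n = 200        # recruits sampled

z = (x_bar - mu_0) / (sigma / sqrt(n))
# Two-tailed: count deviations at least this large in either direction.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

Doubling the one-tail area is what distinguishes this from the one-tailed case: the same z of 1.77 would have a one-tailed p-value of roughly half as much.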

One-tailed, unknown standard deviation: Since the population standard deviation is not known, the test statistic is the t-value. (Refer section t-test for details on the test and the t-value).

Example: For its house brand, a retailer procures packs of detergents from a manufacturer. According to specifications, these packs contain on average 15% of a surfactant. To verify that the proportions are correct, technicians at a contracted lab take samples from 80 packs and examine their ingredient composition. They find that for the sampled packs, the average surfactant composition is 14.7 gm per 100 gm of detergent, and the standard deviation is 1.2 gm. Based on these findings, can we infer that the packs contain less than the specified level of surfactant?

H0: μ ≥ 15 gm per 100 gm of detergent

HA: μ < 15 gm per 100 gm of detergent

α = 5%

$$t = \frac{\bar x-μ_0}{s/\sqrt n} = \frac{14.7-15}{1.2/8.94} = -2.24$$

Degrees of freedom = 79.

p-value = 0.014.

The probability of obtaining a t-value of −2.24 or lower with 79 degrees of freedom is low (0.014 < α = 0.05). More specifically, if the surfactant concentration were actually 15%, the chance that our sample would average ≤ 14.7% is only 0.014 (1.4%). The null hypothesis is rejected; the data suggest that the manufacturer is not meeting the specified concentration of 15%.
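
As a rough check on these figures, the sketch below computes the t-value and approximates the lower-tail p-value by numerically integrating the Student's t density with Simpson's rule. The numerical integration is a stand-in chosen here to avoid external libraries, not the method used by the calculators mentioned in the text, and the helper names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_lower_tail(t, df, steps=2000):
    """Approximate P(T <= t) using Simpson's rule on [0, |t|] and symmetry."""
    a = abs(t)
    h = a / steps                         # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                  # P(0 <= T <= |t|)
    return 0.5 - area if t < 0 else 0.5 + area

mu_0, x_bar, s, n = 15.0, 14.7, 1.2, 80   # figures from the example
t = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1
p_value = t_lower_tail(t, df)             # lower tail, because HA is μ < 15

print(f"t = {t:.2f}, df = {df}, p-value = {p_value:.3f}")
```

The result agrees with the t-value of −2.24 and the p-value of about 0.014 quoted above.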

Note: A p-value from t-value calculator is provided on this web page.

Paired t-test: paired group test with unknown standard deviation. This is essentially the same as a single sample t-test. The paired values are reduced to a single series by computing the difference between the two sets.

Example: In advertising copy tests, purchase intent is gauged through pre-post exposure measurement of respondents’ disposition to buy or try a product. The metric, x, is usually the top-2 box rating (“very likely to buy”, “likely to buy”). The paired values (x_before, x_after) are reduced to a single set of values (y) by computing the difference: y = x_after – x_before. HA: μ > 0, where μ is the mean of y, i.e. exposure to the advert is expected to improve disposition to try the product.

Example: Sequential monadic tests are frequently used for product testing. The respondents try one product and rate it, move to another product and rate it, and then compare the two. A paired t-test may be used to determine whether an improved formulation is rated higher on an attribute. HA: μ > 0, i.e. new product expected to be rated higher.

The remaining steps are the same as those for a one-tailed, single sample t-test.
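
The reduction of paired values to a single series can be sketched as follows. The ratings below are hypothetical, made up purely for illustration, and the sketch stops at the t-value and degrees of freedom; the p-value would then be read from the t distribution as in the single sample case.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical purchase-intent ratings (5-point scale) for eight respondents.
x_before = [3, 4, 2, 5, 3, 4, 3, 2]
x_after = [4, 4, 3, 5, 4, 5, 3, 3]

# Reduce the pairs to one series of differences.
y = [after - before for before, after in zip(x_before, x_after)]

n = len(y)
y_bar = mean(y)                 # mean difference
s_y = stdev(y)                  # sample standard deviation of the differences
t = y_bar / (s_y / sqrt(n))     # tests H0: μ ≤ 0 against HA: μ > 0
df = n - 1

print(f"mean difference = {y_bar:.3f}, t = {t:.2f}, df = {df}")
```

Working on the differences removes respondent-to-respondent variation, which is why the paired design is treated as a single sample test.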

Comparison test for two groups of unknown standard deviation also requires use of the t-test, since the population’s standard deviation is not known.

Example: A study was conducted to examine the consumption of coffee by office workers. The statistics for the men and women sampled in this study are given below.

Men: Sample size nM=440, mean x̄M =46.5 cups per month, standard deviation sM=36.3.

Women: Sample size nW=360, mean x̄W=35.1 cups per month, standard deviation sW=20.6.

There is a difference of 11.4 cups per month in coffee consumption between men and women. Is this difference resulting from sampling error, or do men consume significantly more coffee than women?

H0: μM – μW ≤ 0

HA: μM – μW > 0

α = 0.05

Standard error of the difference between the sample means:

$$σ_{\bar x_M-\bar x_W} = \sqrt {\frac{s_M^2}{n_M} + \frac{s_W^2}{n_W}} = \sqrt {\frac{36.3×36.3}{440} + \frac{20.6×20.6}{360}} = 2.04$$

$$t=\frac{\bar x_M-\bar x_W}{σ_{\bar x_M - \bar x_W}} =\frac {46.5-35.1}{2.04}=5.58$$

$$degrees\;of\; freedom\; (df) = \frac {\left(\frac {s_M^2}{n_M} + \frac {s_W^2}{n_W}\right)^2} {\frac{(s_M^2/n_M)^2}{n_M-1}+\frac {(s_W^2/n_W)^2}{n_W-1}} = 716.8$$

p-value < 0.00001, obtained from the t distribution.

The probability of obtaining a t-value of 5.58 or higher with 717 degrees of freedom is very low (p < 0.00001, far below α = 0.05). More specifically, if women were consuming as much coffee as men, the chance that the sample difference would be 11.4 or more cups per month is less than 0.00001. The null hypothesis is rejected. The data strongly suggest that women consume less coffee than men.
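
The standard error, t-value and degrees of freedom for this example can be recomputed with the short sketch below, using the figures given above. For the p-value it relies on an assumption: with roughly 717 degrees of freedom the t distribution is very close to the standard normal, so the normal upper tail is used as an approximation rather than the exact t distribution.

```python
from math import sqrt
from statistics import NormalDist

n_m, mean_m, s_m = 440, 46.5, 36.3   # men: size, mean, standard deviation
n_w, mean_w, s_w = 360, 35.1, 20.6   # women: size, mean, standard deviation

var_m = s_m ** 2 / n_m               # squared standard error of the men's mean
var_w = s_w ** 2 / n_w               # squared standard error of the women's mean

se_diff = sqrt(var_m + var_w)
t = (mean_m - mean_w) / se_diff
df = (var_m + var_w) ** 2 / (var_m ** 2 / (n_m - 1) + var_w ** 2 / (n_w - 1))

# Assumption: normal approximation to the t distribution for large df.
p_approx = NormalDist().cdf(-t)      # upper-tail area, because HA is μM – μW > 0

print(f"SE = {se_diff:.2f}, t = {t:.2f}, df = {df:.1f}, p ≈ {p_approx:.1e}")
```

The approximate p-value is far below 0.00001, consistent with rejecting the null hypothesis.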
