
## How to Write a Strong Hypothesis | Guide & Examples

Published on 6 May 2022 by Shona McCombes.

A hypothesis is a statement that can be tested by scientific research. If you want to test a relationship between two or more variables, you need to write hypotheses before you start your experiment or data collection.

## Table of contents

- What is a hypothesis?
- Developing a hypothesis (with example)
- Hypothesis examples
- Frequently asked questions about writing hypotheses

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess – it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations, and statistical analysis of data).

## Variables in hypotheses

Hypotheses propose a relationship between two or more variables. An independent variable is something the researcher changes or controls. A dependent variable is something the researcher observes and measures.

For example, in a hypothesis that relates sun exposure to happiness, the independent variable is exposure to the sun – the assumed cause. The dependent variable is the level of happiness – the assumed effect.


## Step 1: Ask a question

Writing a hypothesis begins with a research question that you want to answer. The question should be focused, specific, and researchable within the constraints of your project.

## Step 2: Do some preliminary research

Your initial answer to the question should be based on what is already known about the topic. Look for theories and previous studies to help you form educated assumptions about what your research will find.

At this stage, you might construct a conceptual framework to identify which variables you will study and what you think the relationships are between them. Sometimes, you’ll have to operationalise more complex constructs.

## Step 3: Formulate your hypothesis

Now you should have some idea of what you expect to find. Write your initial answer to the question in a clear, concise sentence.

## Step 4: Refine your hypothesis

You need to make sure your hypothesis is specific and testable. There are various ways of phrasing a hypothesis, but all the terms you use should have clear definitions, and the hypothesis should contain:

- The relevant variables
- The specific group being studied
- The predicted outcome of the experiment or analysis

## Step 5: Phrase your hypothesis in three ways

To identify the variables, you can write a simple prediction in if … then form. The first part of the sentence states the independent variable and the second part states the dependent variable.

In academic research, hypotheses are more commonly phrased in terms of correlations or effects, where you directly state the predicted relationship between variables.

If you are comparing two groups, the hypothesis can state what difference you expect to find between them.

## Step 6: Write a null hypothesis

If your research involves statistical hypothesis testing, you will also have to write a null hypothesis. The null hypothesis is the default position that there is no association between the variables. The null hypothesis is written as \(H_0\), while the alternative hypothesis is \(H_1\) or \(H_a\).

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.


A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation (‘x affects y because …’).

A statistical hypothesis, on the other hand, is a mathematical statement about a population parameter. Statistical hypotheses always come in pairs: the null and alternative hypotheses. In a well-designed study, the statistical hypotheses correspond logically to the research hypothesis.

Cite this Scribbr article


McCombes, S. (2022, May 06). How to Write a Strong Hypothesis | Guide & Examples. Scribbr. Retrieved 21 May 2024, from https://www.scribbr.co.uk/research-methods/hypothesis-writing/


## S.3 Hypothesis Testing

In reviewing hypothesis tests, we start first with the general idea. Then, we keep returning to the basic procedures of hypothesis testing, each time adding a little more detail.

The general idea of hypothesis testing involves:

- Making an initial assumption.
- Collecting evidence (data).
- Based on the available evidence (data), deciding whether to reject or not reject the initial assumption.

Every hypothesis test — regardless of the population parameter involved — requires the above three steps.

## Example S.3.1

Is normal body temperature really 98.6 degrees F?

Consider the population of many, many adults. A researcher hypothesized that the average adult body temperature is lower than the often-advertised 98.6 degrees F. That is, the researcher wants an answer to the question: "Is the average adult body temperature 98.6 degrees? Or is it lower?" To answer his research question, the researcher starts by assuming that the average adult body temperature is 98.6 degrees F.

Then, the researcher goes out and tries to find evidence that refutes his initial assumption. In doing so, he selects a random sample of 130 adults. The average body temperature of the 130 sampled adults is 98.25 degrees.

Then, the researcher uses the data he collected to make a decision about his initial assumption. It is either likely or unlikely that the researcher would collect the evidence he did given his initial assumption that the average adult body temperature is 98.6 degrees:

- If it is likely, then the researcher does not reject his initial assumption that the average adult body temperature is 98.6 degrees. There is not enough evidence to do otherwise.
- If it is unlikely, then one of two things must hold:
  - either the researcher's initial assumption is correct and he experienced a very unusual event;
  - or the researcher's initial assumption is incorrect.

In statistics, we generally don't make claims that require us to believe that a very unusual event happened. That is, in the practice of statistics, if the evidence (data) we collected is unlikely in light of the initial assumption, then we reject our initial assumption.

## Example S.3.2

Criminal trial analogy

One place where you can consistently see the general idea of hypothesis testing in action is in criminal trials held in the United States. Our criminal justice system assumes "the defendant is innocent until proven guilty." That is, our initial assumption is that the defendant is innocent.

In the practice of statistics, we make our initial assumption when we state our two competing hypotheses: the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_A\)). Here, our hypotheses are:

- \(H_0\): Defendant is not guilty (innocent)
- \(H_A\): Defendant is guilty

In statistics, we always assume the null hypothesis is true. That is, the null hypothesis is always our initial assumption.

The prosecution team then collects evidence (such as fingerprints, blood spots, hair samples, carpet fibers, shoe prints, ransom notes, and handwriting samples) in the hope of finding "sufficient evidence" to make the assumption of innocence refutable.

In statistics, the data are the evidence.

The jury then makes a decision based on the available evidence:

- If the jury finds sufficient evidence — beyond a reasonable doubt — to make the assumption of innocence refutable, the jury rejects the null hypothesis and deems the defendant guilty. We behave as if the defendant is guilty.
- If there is insufficient evidence, then the jury does not reject the null hypothesis. We behave as if the defendant is innocent.

In statistics, we always make one of two decisions. We either "reject the null hypothesis" or we "fail to reject the null hypothesis."

## Errors in Hypothesis Testing

Did you notice the use of the phrase "behave as if" in the previous discussion? We "behave as if" the defendant is guilty; we do not "prove" that the defendant is guilty. And, we "behave as if" the defendant is innocent; we do not "prove" that the defendant is innocent.

This is a very important distinction! We make our decision based on evidence, not on 100% guaranteed proof. Again:

- If we reject the null hypothesis, we do not prove that the alternative hypothesis is true.
- If we do not reject the null hypothesis, we do not prove that the null hypothesis is true.

We merely state that there is enough evidence to behave one way or the other. This is always true in statistics! Because of this, whatever the decision, there is always a chance that we made an error.

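Let's review the two types of errors that can be made in criminal trials:

- The jury convicts a defendant who is in fact innocent.
- The jury fails to convict a defendant who is in fact guilty.

Table S.3.2 shows how this corresponds to the two types of errors in hypothesis testing.

Note that, in statistics, we call the two types of errors by two different names: one is called a "Type I error," and the other is called a "Type II error." Here are the formal definitions of the two types of errors:

- Type I error: the null hypothesis is rejected when it is true.
- Type II error: the null hypothesis is not rejected when it is false.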

There is always a chance of making one of these errors. But, a good scientific study will minimize the chance of doing so!

## Making the Decision

Recall that it is either likely or unlikely that we would observe the evidence we did given our initial assumption. If it is likely, we do not reject the null hypothesis. If it is unlikely, then we reject the null hypothesis in favor of the alternative hypothesis. Effectively, then, making the decision reduces to determining "likely" or "unlikely."

In statistics, there are two ways to determine whether the evidence is likely or unlikely given the initial assumption:

- We could take the "critical value approach" (favored in many of the older textbooks).
- Or, we could take the "P-value approach" (what is used most often in research, journal articles, and statistical software).

In the next two sections, we review the procedures behind each of these two approaches. To make our review concrete, let's imagine that \(\mu\) is the average grade point average of all American students who major in mathematics. We first review the critical value approach for conducting each of the following three hypothesis tests about the population mean \(\mu\):

- \(H_0\): \(\mu = 3\) versus \(H_A\): \(\mu > 3\)
- \(H_0\): \(\mu = 3\) versus \(H_A\): \(\mu < 3\)
- \(H_0\): \(\mu = 3\) versus \(H_A\): \(\mu \ne 3\)

## In Practice

- We would want to conduct the first hypothesis test if we were interested in concluding that the average grade point average of the group is more than 3.
- We would want to conduct the second hypothesis test if we were interested in concluding that the average grade point average of the group is less than 3.
- And, we would want to conduct the third hypothesis test if we were only interested in concluding that the average grade point average of the group differs from 3 (without caring whether it is more or less than 3).

Upon completing the review of the critical value approach, we review the P-value approach for conducting each of the above three hypothesis tests about the population mean \(\mu\). The procedures that we review here for both approaches easily extend to hypothesis tests about any other population parameter.
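The decision rule behind the critical value approach can be sketched in Python for the three kinds of alternative (more than 3, less than 3, different from 3). This is only an illustration under a simplifying assumption: the test statistic is treated as a standard normal z statistic, whereas a full treatment for a sample mean would often use a t statistic. The function name `rejects_null` and the example statistic are hypothetical.

```python
from statistics import NormalDist


def rejects_null(z: float, alternative: str, alpha: float = 0.05) -> bool:
    """Critical value approach: is the z statistic inside the rejection region?"""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if alternative == "greater":    # H_A: mu > 3, reject for large z
        return z > std_normal.inv_cdf(1 - alpha)
    if alternative == "less":       # H_A: mu < 3, reject for small z
        return z < std_normal.inv_cdf(alpha)
    if alternative == "two-sided":  # H_A: mu != 3, reject for large |z|
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError(f"unknown alternative: {alternative!r}")


# Hypothetical test statistic, not taken from the text
z = 1.8
for alternative in ("greater", "less", "two-sided"):
    print(alternative, rejects_null(z, alternative))
```

With this made-up statistic, only the "greater" alternative leads to rejection at the 0.05 level, because the two-sided cutoff (about 1.96) is stricter than the one-sided cutoff (about 1.645).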

## 4.4: Hypothesis Testing


David Diez, Christopher Barr, & Mine Çetinkaya-Rundel
OpenIntro Statistics

Is the typical US runner getting faster or slower over time? We consider this question in the context of the Cherry Blossom Run, comparing runners in 2006 and 2012. Technological advances in shoes, training, and diet might suggest runners would be faster in 2012. An opposing viewpoint might say that with the average body mass index on the rise, people tend to run slower. In fact, all of these components might be influencing run time.

In addition to considering run times in this section, we consider a topic near and dear to most students: sleep. A recent study found that college students average about 7 hours of sleep per night. However, researchers at a rural college are interested in showing that their students sleep longer than seven hours on average. We investigate this topic in Section 4.3.4.

## Hypothesis Testing Framework

The average time for all runners who finished the Cherry Blossom Run in 2006 was 93.29 minutes (93 minutes and about 17 seconds). We want to determine if the run10Samp data set provides strong evidence that the participants in 2012 were faster or slower than those runners in 2006, versus the other possibility that there has been no change. [16] We simplify these three options into two competing hypotheses:

- \(H_0\): The average 10 mile run time was the same for 2006 and 2012.
- \(H_A\): The average 10 mile run time for 2012 was different than that of 2006.

We call \(H_0\) the null hypothesis and \(H_A\) the alternative hypothesis.

Null and alternative hypotheses

- The null hypothesis (\(H_0\)) often represents either a skeptical perspective or a claim to be tested.
- The alternative hypothesis (\(H_A\)) represents an alternative claim under consideration and is often represented by a range of possible parameter values.

[16] While we could answer this question by examining the entire population data (run10), we only consider the sample data (run10Samp), which is more realistic since we rarely have access to population data.

The null hypothesis often represents a skeptical position or a perspective of no difference. The alternative hypothesis often represents a new perspective, such as the possibility that there has been a change.

Hypothesis testing framework

The skeptic will not reject the null hypothesis (\(H_0\)), unless the evidence in favor of the alternative hypothesis (\(H_A\)) is so strong that she rejects \(H_0\) in favor of \(H_A\).

The hypothesis testing framework is a very general tool, and we often use it without a second thought. If a person makes a somewhat unbelievable claim, we are initially skeptical. However, if there is sufficient evidence that supports the claim, we set aside our skepticism and reject the null hypothesis in favor of the alternative. The hallmarks of hypothesis testing are also found in the US court system.

Exercise \(\PageIndex{1}\)

A US court considers two possible claims about a defendant: she is either innocent or guilty. If we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative?

Jurors examine the evidence to see whether it convincingly shows a defendant is guilty. Even if the jurors leave unconvinced of guilt beyond a reasonable doubt, this does not mean they believe the defendant is innocent. This is also the case with hypothesis testing: even if we fail to reject the null hypothesis, we typically do not accept the null hypothesis as true. Failing to find strong evidence for the alternative hypothesis is not equivalent to accepting the null hypothesis.

In the example with the Cherry Blossom Run, the null hypothesis represents no difference in the average time from 2006 to 2012. The alternative hypothesis represents something new or more interesting: there was a difference, either an increase or a decrease. These hypotheses can be described in mathematical notation using \(\mu_{12}\) as the average run time for 2012:

- \(H_0\): \(\mu_{12} = 93.29\)
- \(H_A\): \(\mu_{12} \ne 93.29\)

where 93.29 minutes (93 minutes and about 17 seconds) is the average 10 mile time for all runners in the 2006 Cherry Blossom Run. Using this mathematical notation, the hypotheses can now be evaluated using statistical tools. We call 93.29 the null value since it represents the value of the parameter if the null hypothesis is true. We will use the run10Samp data set to evaluate the hypothesis test.

## Testing Hypotheses using Confidence Intervals

We can start the evaluation of the hypothesis setup by comparing 2006 and 2012 run times using a point estimate from the 2012 sample: \(\bar {x}_{12} = 95.61\) minutes. This estimate suggests the average time is actually longer than the 2006 time, 93.29 minutes. However, to evaluate whether this provides strong evidence that there has been a change, we must consider the uncertainty associated with \(\bar {x}_{12}\).

We learned in Section 4.1 that there is fluctuation from one sample to another, and it is very unlikely that the sample mean will be exactly equal to our parameter; we should not expect \(\bar {x}_{12}\) to exactly equal \(\mu_{12}\). Given that \(\bar {x}_{12} = 95.61\), it might still be possible that the population average in 2012 has remained unchanged from 2006. The difference between \(\bar {x}_{12}\) and 93.29 could be due to sampling variation, i.e. the variability associated with the point estimate when we take a random sample.

In Section 4.2, confidence intervals were introduced as a way to find a range of plausible values for the population mean. Based on run10Samp, a 95% confidence interval for the 2012 population mean, \(\mu_{12}\), was calculated as

\[(92.45, 98.77)\]

Because the 2006 mean, 93.29, falls in the range of plausible values, we cannot say the null hypothesis is implausible. That is, we failed to reject the null hypothesis, \(H_0\).

Double negatives can sometimes be used in statistics

In many statistical explanations, we use double negatives. For instance, we might say that the null hypothesis is not implausible or we failed to reject the null hypothesis. Double negatives are used to communicate that while we are not rejecting a position, we are also not saying it is correct.

Example \(\PageIndex{1}\)

Next consider whether there is strong evidence that the average age of runners has changed from 2006 to 2012 in the Cherry Blossom Run. In 2006, the average age was 36.13 years, and in the 2012 run10Samp data set, the average was 35.05 years with a standard deviation of 8.97 years for 100 runners.

First, set up the hypotheses:

- \(H_0\): The average age of runners has not changed from 2006 to 2012, \(\mu_{age} = 36.13\).
- \(H_A\): The average age of runners has changed from 2006 to 2012, \(\mu_{age} \ne 36.13\).

We have previously verified conditions for this data set. The normal model may be applied to \(\bar {y}\) and the estimate of SE should be very accurate. Using the sample mean and standard error, we can construct a 95% confidence interval for \(\mu_{age}\) to determine if there is sufficient evidence to reject \(H_0\):

\[\bar{y} \pm 1.96 \times \dfrac {s}{\sqrt {100}} \rightarrow 35.05 \pm 1.96 \times 0.90 \rightarrow (33.29, 36.81)\]

This confidence interval contains the null value, 36.13. Because 36.13 is not implausible, we cannot reject the null hypothesis. We have not found strong evidence that the average age is different than 36.13 years.
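The interval check in this example can be reproduced in a few lines of Python. This is a minimal sketch: the function names `confidence_interval` and `null_value_is_plausible` are illustrative assumptions rather than anything defined in the text, and the multiplier 1.96 assumes the normal model applies, as stated above.

```python
from math import sqrt


def confidence_interval(mean, sd, n, z_star=1.96):
    """Return (lower, upper) bounds of mean +/- z_star * SE, with SE = sd / sqrt(n)."""
    se = sd / sqrt(n)
    return mean - z_star * se, mean + z_star * se


def null_value_is_plausible(null_value, interval):
    """True when the null value lies inside the interval, so H0 is not rejected."""
    low, high = interval
    return low <= null_value <= high


# Runner ages from the 2012 sample: mean 35.05, sd 8.97, n = 100
age_interval = confidence_interval(mean=35.05, sd=8.97, n=100)
print(tuple(round(bound, 2) for bound in age_interval))  # (33.29, 36.81)
print(null_value_is_plausible(36.13, age_interval))      # True: fail to reject H0
```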

Exercise 4.22

Colleges frequently provide estimates of student expenses such as housing. A consultant hired by a community college claimed that the average student housing expense was $650 per month. What are the null and alternative hypotheses to test whether this claim is accurate?

- \(H_0\): The average cost is $650 per month, \(\mu = \$650\).
- \(H_A\): The average cost is different than $650 per month, \(\mu \ne \$650\).

Exercise 4.23

The community college decides to collect data to evaluate the $650 per month claim. They take a random sample of 75 students at their school and obtain the data represented in Figure 4.11. Can we apply the normal model to the sample mean?

Applying the normal model requires that certain conditions are met. Because the data are a simple random sample and the sample (presumably) represents no more than 10% of all students at the college, the observations are independent. The sample size is also sufficiently large (n = 75) and the data exhibit only moderate skew. Thus, the normal model may be applied to the sample mean.

If the court makes a Type 1 Error, this means the defendant is innocent (\(H_0\) true) but wrongly convicted. A Type 2 Error means the court failed to reject \(H_0\) (i.e. failed to convict the person) when she was in fact guilty (\(H_A\) true).

Example \(\PageIndex{2}\)

The sample mean for student housing is $611.63 and the sample standard deviation is $132.85. Construct a 95% confidence interval for the population mean and evaluate the hypotheses of Exercise 4.22.

The standard error associated with the mean may be estimated using the sample standard deviation divided by the square root of the sample size. Recall that n = 75 students were sampled.

\[ SE = \dfrac {s}{\sqrt {n}} = \dfrac {132.85}{\sqrt {75}} = 15.34\]

You showed in Exercise 4.23 that the normal model may be applied to the sample mean. This ensures a 95% confidence interval may be accurately constructed:

\[\bar {x} \pm z^{\star} SE \rightarrow 611.63 \pm 1.96 \times 15.34 \rightarrow (581.56, 641.70)\]

Because the null value $650 is not in the confidence interval, a true mean of $650 is implausible and we reject the null hypothesis. The data provide statistically significant evidence that the actual average housing expense is less than $650 per month.

## Decision Errors

Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes wrongly convicted and the guilty sometimes walk free. Similarly, we can make a wrong decision in statistical hypothesis tests. However, the difference is that we have the tools necessary to quantify how often we make such errors.

There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a statement about which one might be true, but we might choose incorrectly. There are four possible scenarios in a hypothesis test, which are summarized in Table 4.12.

A Type 1 Error is rejecting the null hypothesis when \(H_0\) is actually true. A Type 2 Error is failing to reject the null hypothesis when the alternative is actually true.

Exercise 4.25

In a US court, the defendant is either innocent (\(H_0\)) or guilty (\(H_A\)). What does a Type 1 Error represent in this context? What does a Type 2 Error represent? Table 4.12 may be useful.

Exercise 4.26

How could we reduce the Type 1 Error rate in US courts? What influence would this have on the Type 2 Error rate?

To lower the Type 1 Error rate, we might raise our standard for conviction from "beyond a reasonable doubt" to "beyond a conceivable doubt" so fewer people would be wrongly convicted. However, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type 2 Errors.

Exercise 4.27

How could we reduce the Type 2 Error rate in US courts? What influence would this have on the Type 1 Error rate?

To lower the Type 2 Error rate, we want to convict more guilty people. We could lower the standards for conviction from "beyond a reasonable doubt" to "beyond a little doubt". Lowering the bar for guilt will also result in more wrongful convictions, raising the Type 1 Error rate.

A skeptic would have no reason to believe that sleep patterns at this school are different than the sleep patterns at another school.

Exercises 4.25-4.27 provide an important lesson:

If we reduce how often we make one type of error, we generally make more of the other type.
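The trade-off can be made concrete with a small calculation. The sketch below assumes a one-sided z-test of a mean with a known standard error and computes the Type 2 Error rate for several Type 1 Error rates. All numbers (null mean 7.0, true mean 7.3, standard error 0.17) are hypothetical values chosen for illustration, and `type_2_error_rate` is an illustrative name, not something defined in the text.

```python
from statistics import NormalDist


def type_2_error_rate(alpha, true_mean, null_mean, se):
    """Chance of failing to reject H0: mu = null_mean (against mu > null_mean)
    when the sample mean is actually normal with mean true_mean and sd se."""
    # Reject H0 when the sample mean exceeds this cutoff
    cutoff = null_mean + NormalDist().inv_cdf(1 - alpha) * se
    # A Type 2 Error happens when the sample mean falls at or below the cutoff
    return NormalDist(mu=true_mean, sigma=se).cdf(cutoff)


# Hypothetical scenario: the alternative is true, with true mean 7.3
for alpha in (0.10, 0.05, 0.01):
    beta = type_2_error_rate(alpha, true_mean=7.3, null_mean=7.0, se=0.17)
    print(f"Type 1 Error rate = {alpha:.2f}  ->  Type 2 Error rate = {beta:.3f}")
```

Under these assumptions, each reduction in the Type 1 Error rate produces a larger Type 2 Error rate, which is the pattern the lesson describes.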

Hypothesis testing is built around rejecting or failing to reject the null hypothesis. That is, we do not reject \(H_0\) unless we have strong evidence. But what precisely does strong evidence mean? As a general rule of thumb, for those cases where the null hypothesis is actually true, we do not want to incorrectly reject \(H_0\) more than 5% of the time. This corresponds to a significance level of 0.05. We often write the significance level using \(\alpha\) (the Greek letter alpha): \(\alpha = 0.05\). We discuss the appropriateness of different significance levels in Section 4.3.6.

If we use a 95% confidence interval to test a hypothesis where the null hypothesis is true, we will make an error whenever the point estimate is at least 1.96 standard errors away from the population parameter. This happens about 5% of the time (2.5% in each tail). Similarly, using a 99% confidence interval to evaluate a hypothesis is equivalent to a significance level of \(\alpha = 0.01\).
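A quick simulation can illustrate the 5% figure. The sketch below repeatedly draws samples from a population where the null hypothesis is true, builds a 95% confidence interval from each sample, and counts how often the interval misses the true mean. The population values (mean 100, standard deviation 15), the sample size, and the function name are all hypothetical choices for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulated_type_1_rate(null_mean=100.0, sd=15.0, n=50, trials=2000, seed=1):
    """Fraction of samples whose 95% interval excludes the true mean (H0 is true)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(null_mean, sd) for _ in range(n)]
        se = stdev(sample) / sqrt(n)          # estimated standard error
        x_bar = mean(sample)
        low, high = x_bar - 1.96 * se, x_bar + 1.96 * se
        if not (low <= null_mean <= high):    # interval misses the null value
            rejections += 1
    return rejections / trials


print(f"Simulated Type 1 Error rate: {simulated_type_1_rate():.3f}")
```

The printed rate should land near 0.05. It will not match exactly, both because of simulation noise and because the standard error is estimated from each sample.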

A confidence interval is, in one sense, simplistic in the world of hypothesis tests. Consider the following two scenarios:

- The null value (the parameter value under the null hypothesis) is in the 95% confidence interval but just barely, so we would not reject \(H_0\). However, we might like to somehow say, quantitatively, that it was a close decision.
- The null value is very far outside of the interval, so we reject \(H_0\). However, we want to communicate that, not only did we reject the null hypothesis, but it wasn't even close. Such a case is depicted in Figure 4.13.

In Section 4.3.4, we introduce a tool called the p-value that will be helpful in these cases. The p-value method also extends to hypothesis tests where confidence intervals cannot be easily constructed or applied.

## Formal Testing using p-Values

The p-value is a way of quantifying the strength of the evidence against the null hypothesis and in favor of the alternative. Formally the p-value is a conditional probability.

definition: p-value

The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true. We typically use a summary statistic of the data, in this chapter the sample mean, to help compute the p-value and evaluate the hypotheses.

A poll by the National Sleep Foundation found that college students average about 7 hours of sleep per night. Researchers at a rural school are interested in showing that students at their school sleep longer than seven hours on average, and they would like to demonstrate this using a sample of students. What would be an appropriate skeptical position for this research?

We can set up the null hypothesis for this test as a skeptical perspective: the students at this school average 7 hours of sleep per night. The alternative hypothesis takes a new form reflecting the interests of the research: the students average more than 7 hours of sleep. We can write these hypotheses as

- \(H_0\): \(\mu = 7\).
- \(H_A\): \(\mu > 7\).

Using \(\mu > 7\) as the alternative is an example of a one-sided hypothesis test. In this investigation, there is no apparent interest in learning whether the mean is less than 7 hours. This is entirely based on the interests of the researchers. Had they been only interested in the opposite case (showing that their students were actually averaging fewer than seven hours of sleep but not interested in showing more than 7 hours), then our setup would have set the alternative as \(\mu < 7\). Earlier we encountered a two-sided hypothesis where we looked for any clear difference, greater than or less than the null value.

(For the sample described below, the standard error can be estimated from the sample standard deviation and the sample size: \(SE_{\bar {x}} = \dfrac {s_x}{\sqrt {n}} = \dfrac {1.75}{\sqrt {110}} = 0.17\).)

Always use a two-sided test unless it was made clear prior to data collection that the test should be one-sided. Switching a two-sided test to a one-sided test after observing the data is dangerous because it can inflate the Type 1 Error rate.

TIP: One-sided and two-sided tests

If the researchers are only interested in showing an increase or a decrease, but not both, use a one-sided test. If the researchers would be interested in any difference from the null value - an increase or decrease - then the test should be two-sided.
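To see why the choice between a one-sided and a two-sided test matters, the following sketch computes the p-value for the same z statistic under each alternative. It assumes the test statistic follows a standard normal distribution under the null hypothesis. The function name `p_value` and the example statistic 1.8 are hypothetical.

```python
from statistics import NormalDist


def p_value(z: float, alternative: str) -> float:
    """Tail area under the standard normal curve for the given alternative."""
    std_normal = NormalDist()
    if alternative == "greater":    # only large values favor H_A
        return 1 - std_normal.cdf(z)
    if alternative == "less":       # only small values favor H_A
        return std_normal.cdf(z)
    if alternative == "two-sided":  # both tails favor H_A
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")


z = 1.8  # hypothetical test statistic
print(f"one-sided: {p_value(z, 'greater'):.4f}")
print(f"two-sided: {p_value(z, 'two-sided'):.4f}")
```

For this made-up statistic the one-sided p-value falls below 0.05 while the two-sided p-value does not, which is why switching to a one-sided test after seeing the data can inflate the Type 1 Error rate.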

TIP: Always write the null hypothesis as an equality

We will find it most useful if we always list the null hypothesis as an equality (e.g. \(\mu = 7\)) while the alternative always uses an inequality (e.g. \(\mu \ne 7\), \(\mu > 7\), or \(\mu < 7\)).

The researchers at the rural school conducted a simple random sample of n = 110 students on campus. They found that these students averaged 7.42 hours of sleep and the standard deviation of the amount of sleep for the students was 1.75 hours. A histogram of the sample is shown in Figure 4.14.

Before we can use a normal model for the sample mean or compute the standard error of the sample mean, we must verify conditions. (1) Because this is a simple random sample from less than 10% of the student body, the observations are independent. (2) The sample size in the sleep study is sufficiently large since it is greater than 30. (3) The data show moderate skew in Figure 4.14 and the presence of a couple of outliers. This skew and the outliers (which are not too extreme) are acceptable for a sample size of n = 110. With these conditions verified, the normal model can be safely applied to \(\bar {x}\) and the estimated standard error will be very accurate.

What is the standard deviation associated with \(\bar {x}\)? That is, estimate the standard error of \(\bar {x}\).

The hypothesis test will be evaluated using a significance level of \(\alpha = 0.05\). We want to consider the data under the scenario that the null hypothesis is true. In this case, the sample mean is from a distribution that is nearly normal and has mean 7 and standard deviation of about 0.17. Such a distribution is shown in Figure 4.15.

The shaded tail in Figure 4.15 represents the chance of observing such a large mean, conditional on the null hypothesis being true. That is, the shaded tail represents the p-value. We shade all means larger than our sample mean, \(\bar {x} = 7.42\), because they are more favorable to the alternative hypothesis than the observed mean.

We compute the p-value by finding the tail area of this normal distribution, which we learned to do in Section 3.1. First compute the Z score of the sample mean, \(\bar {x} = 7.42\):

\[Z = \dfrac {\bar {x} - \text {null value}}{SE_{\bar {x}}} = \dfrac {7.42 - 7}{0.17} = 2.47\]

Using the normal probability table, the lower unshaded area is found to be 0.993. Thus the shaded area is 1 - 0.993 = 0.007. If the null hypothesis is true, the probability of observing such a large sample mean for a sample of 110 students is only 0.007. That is, if the null hypothesis is true, we would not often see such a large mean.

We evaluate the hypotheses by comparing the p-value to the significance level. Because the p-value is less than the significance level (p-value \(= 0.007 < 0.05 = \alpha\)), we reject the null hypothesis. What we observed is so unusual with respect to the null hypothesis that it casts serious doubt on \(H_0\) and provides strong evidence favoring \(H_A\).
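The sleep study calculation can be checked with Python's standard library. This is a sketch rather than a definitive procedure: it rounds the standard error to two decimals to match the hand calculation in the text, and it uses the exact normal tail area instead of a probability table. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Sample summary from the sleep study
n, sample_mean, sample_sd = 110, 7.42, 1.75
null_value, alpha = 7.0, 0.05

se = round(sample_sd / sqrt(n), 2)   # rounded to 0.17, as in the text
z = (sample_mean - null_value) / se  # Z score of the sample mean
p = 1 - NormalDist().cdf(z)          # upper tail area for H_A: mu > 7

print(f"SE = {se}, Z = {z:.2f}, p-value = {p:.3f}")  # SE = 0.17, Z = 2.47, p-value = 0.007
print("reject H0" if p < alpha else "fail to reject H0")  # reject H0
```

Without rounding the standard error, the Z score comes out slightly larger and the p-value slightly smaller, but the conclusion at \(\alpha = 0.05\) is the same.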

p-value as a tool in hypothesis testing

The p-value quantifies how strongly the data favor \(H_A\) over \(H_0\). A small p-value (usually < 0.05) corresponds to sufficient evidence to reject \(H_0\) in favor of \(H_A\).

TIP: It is useful to first draw a picture to find the p-value

It is useful to draw a picture of the distribution of \(\bar {x}\) as though \(H_0\) was true (i.e. \(\mu\) equals the null value), and shade the region (or regions) of sample means that are at least as favorable to the alternative hypothesis. These shaded regions represent the p-value.

The ideas below review the process of evaluating hypothesis tests with p-values:

- The null hypothesis represents a skeptic's position or a position of no difference. We reject this position only if the evidence strongly favors H A .
- A small p-value means that if the null hypothesis is true, there is a low probability of seeing a point estimate at least as extreme as the one we saw. We interpret this as strong evidence in favor of the alternative.
- We reject the null hypothesis if the p-value is smaller than the significance level, \(\alpha\), which is usually 0.05. Otherwise, we fail to reject H 0 .
- We should always state the conclusion of the hypothesis test in plain language so non-statisticians can also understand the results.

The p-value is constructed in such a way that we can directly compare it to the significance level ( \(\alpha\)) to determine whether or not to reject H 0 . This method ensures that the Type 1 Error rate does not exceed the significance level standard.

If the null hypothesis is true, how often should the p-value be less than 0.05?

About 5% of the time. If the null hypothesis is true, then the data only has a 5% chance of being in the 5% of data most favorable to H A .
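This claim can be checked by simulation. The sketch below is illustrative only: it assumes the sleep-study null distribution (mean 7, standard error 0.17), a one-sided upper-tail test, and an arbitrary seed and trial count chosen for repeatability.

```python
import random
from statistics import NormalDist

rng = random.Random(0)            # fixed seed (arbitrary) so the run is repeatable
std_normal = NormalDist()
null_mean, standard_error = 7.0, 0.17
alpha = 0.05
n_trials = 20_000                 # number of simulated studies (assumption)

rejections = 0
for _ in range(n_trials):
    # Draw a sample mean from its distribution when H0 is true.
    x_bar = rng.gauss(null_mean, standard_error)
    z = (x_bar - null_mean) / standard_error
    p_value = 1.0 - std_normal.cdf(z)   # one-sided, upper tail
    if p_value < alpha:
        rejections += 1

rejection_rate = rejections / n_trials
print(f"Share of p-values below {alpha}: {rejection_rate:.3f}")
```

The printed share should land close to 0.05, with small run-to-run variation if the seed is changed.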

Exercise 4.31

Suppose we had used a significance level of 0.01 in the sleep study. Would the evidence have been strong enough to reject the null hypothesis? (The p-value was 0.007.) What if the significance level was \(\alpha = 0.001\)? 27

27 We reject the null hypothesis whenever p-value < \(\alpha\). Thus, we would still reject the null hypothesis if \(\alpha = 0.01\) but not if the significance level had been \(\alpha = 0.001\).
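The decision rule in this answer is a single comparison. A minimal sketch follows, in which the helper name `decide` is hypothetical:

```python
def decide(p_value: float, alpha: float) -> str:
    """Return the test decision for a p-value and a significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

p_value = 0.007  # sleep-study p-value from the text
decisions = {alpha: decide(p_value, alpha) for alpha in (0.05, 0.01, 0.001)}

for alpha, decision in decisions.items():
    print(f"alpha = {alpha}: {decision}")
```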

Exercise 4.32

Ebay might be interested in showing that buyers on its site tend to pay less than they would for the corresponding new item on Amazon. We'll research this topic for one particular product: a video game called Mario Kart for the Nintendo Wii. During early October 2009, Amazon sold this game for $46.99. Set up an appropriate (one-sided!) hypothesis test to check the claim that Ebay buyers pay less during auctions at this same time. 28

28 The skeptic would say the average is the same on Ebay, and we are interested in showing the average price is lower.

Exercise 4.33

During early October, 2009, 52 Ebay auctions were recorded for Mario Kart.29 The total prices for the auctions are presented using a histogram in Figure 4.17, and we may like to apply the normal model to the sample mean. Check the three conditions required for applying the normal model: (1) independence, (2) at least 30 observations, and (3) the data are not strongly skewed. 30

30 (1) The independence condition is unclear. We will make the assumption that the observations are independent, which we should report with any final results. (2) The sample size is sufficiently large: \(n = 52 \ge 30\). (3) The data distribution is not strongly skewed; it is approximately symmetric.

H 0 : The average auction price on Ebay is equal to (or more than) the price on Amazon. We write only the equality in the statistical notation: \(\mu_{ebay} = 46.99\).

H A : The average price on Ebay is less than the price on Amazon, \(\mu _{ebay} < 46.99\).

29 These data were collected by OpenIntro staff.

Example 4.34

The average sale price of the 52 Ebay auctions for Wii Mario Kart was $44.17 with a standard deviation of $4.15. Does this provide sufficient evidence to reject the null hypothesis in Exercise 4.32? Use a significance level of \(\alpha = 0.01\).

The hypotheses were set up and the conditions were checked in Exercises 4.32 and 4.33. The next step is to find the standard error of the sample mean and produce a sketch to help find the p-value.

Because the alternative hypothesis says we are looking for a smaller mean, we shade the lower tail. We find this shaded area by using the Z score and normal probability table: \(Z = \dfrac {44.17 - 46.99}{0.5755} = -4.90\), which has area less than 0.0002. The area is so small we cannot really see it on the picture. This lower tail area corresponds to the p-value.

Because the p-value is so small (specifically, smaller than \(\alpha = 0.01\)), this provides sufficiently strong evidence to reject the null hypothesis in favor of the alternative. The data provide statistically significant evidence that the average price on Ebay is lower than Amazon's asking price.
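The standard error, Z score, and lower-tail area for this example can be reproduced as follows. This is a sketch that assumes the normal model checked in Exercise 4.33 and uses `statistics.NormalDist` in place of the probability table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

amazon_price = 46.99   # null value for the mean Ebay price
sample_mean = 44.17    # mean of the 52 auction prices
sample_sd = 4.15       # standard deviation of the auction prices
n = 52

# Standard error of the sample mean: s / sqrt(n).
standard_error = sample_sd / sqrt(n)
z_score = (sample_mean - amazon_price) / standard_error

# The alternative is "less than", so the p-value is the lower-tail area.
p_value = NormalDist().cdf(z_score)

print(f"SE = {standard_error:.4f}, Z = {z_score:.2f}, p-value = {p_value:.1e}")
print("Reject H0 at alpha = 0.01:", p_value < 0.01)
```

The computed tail area is far below the 0.0002 bound read from the table, so the conclusion is unchanged.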

## Two-sided hypothesis testing with p-values

We now consider how to compute a p-value for a two-sided test. In one-sided tests, we shade the single tail in the direction of the alternative hypothesis. For example, when the alternative had the form \(\mu\) > 7, then the p-value was represented by the upper tail (Figure 4.16). When the alternative was \(\mu\) < 46.99, the p-value was the lower tail (Exercise 4.32). In a two-sided test, we shade two tails since evidence in either direction is favorable to H A .

Exercise 4.35 Earlier we talked about a research group investigating whether the students at their school slept longer than 7 hours each night. Let's consider a second group of researchers who want to evaluate whether the students at their college differ from the norm of 7 hours. Write the null and alternative hypotheses for this investigation. 31

Example 4.36 The second college randomly samples 72 students and finds a mean of \(\bar {x} = 6.83\) hours and a standard deviation of s = 1.8 hours. Does this provide strong evidence against H 0 in Exercise 4.35? Use a significance level of \(\alpha = 0.05\).

First, we must verify assumptions. (1) A simple random sample of less than 10% of the student body means the observations are independent. (2) The sample size is 72, which is greater than 30. (3) Based on the earlier distribution and what we already know about college student sleep habits, the distribution is probably not strongly skewed.

Next we can compute the standard error \((SE_{\bar {x}} = \dfrac {s}{\sqrt {n}} = 0.21)\) of the estimate and create a picture to represent the p-value, shown in Figure 4.18. Both tails are shaded.

31 Because the researchers are interested in any difference, they should use a two-sided setup: H 0 : \(\mu\) = 7, H A : \(\mu \ne 7.\)

An estimate of 7.17 or more provides at least as strong of evidence against the null hypothesis and in favor of the alternative as the observed estimate, \(\bar {x} = 6.83\).

We can calculate the tail areas by first finding the lower tail corresponding to \(\bar {x}\):

\[Z = \dfrac {6.83 - 7.00}{0.21} = -0.81 \xrightarrow {table} \text {left tail} = 0.2090\]

Because the normal model is symmetric, the right tail will have the same area as the left tail. The p-value is found as the sum of the two shaded tails:

\[ \text {p-value} = \text {left tail} + \text {right tail} = 2 \times \text {(left tail)} = 0.4180\]

This p-value is relatively large (larger than \(\alpha = 0.05\)), so we should not reject H 0 . That is, if H 0 is true, it would not be very unusual to see a sample mean this far from 7 hours simply due to sampling variation. Thus, we do not have sufficient evidence to conclude that the mean is different than 7 hours.
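The two-sided calculation can be sketched numerically as well. The code below keeps the standard error unrounded, so its p-value (about 0.42) differs slightly in the third decimal from the table-based 0.4180, which uses the rounded values 0.21 and -0.81; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

null_value = 7.0     # hours of sleep under H0
sample_mean = 6.83
sample_sd = 1.8
n = 72

standard_error = sample_sd / sqrt(n)
z_score = (sample_mean - null_value) / standard_error

# Two-sided p-value: double the tail area beyond |Z|, using symmetry.
one_tail = NormalDist().cdf(-abs(z_score))
p_value = 2.0 * one_tail

print(f"SE = {standard_error:.2f}, Z = {z_score:.2f}, p-value = {p_value:.3f}")
print("Reject H0 at alpha = 0.05:", p_value < 0.05)
```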

Example 4.37 It is never okay to change two-sided tests to one-sided tests after observing the data. In this example we explore the consequences of ignoring this advice. Using \(\alpha = 0.05\), we show that freely switching from two-sided tests to one-sided tests will cause us to make twice as many Type 1 Errors as intended.

Suppose the sample mean was larger than the null value, \(\mu_0\) (e.g. \(\mu_0\) would represent 7 if H 0 : \(\mu\) = 7). Then if we can flip to a one-sided test, we would use H A : \(\mu > \mu_0\). Now if we obtain any observation with a Z score greater than 1.65, we would reject H 0 . If the null hypothesis is true, we incorrectly reject the null hypothesis about 5% of the time when the sample mean is above the null value, as shown in Figure 4.19.

Suppose the sample mean was smaller than the null value. Then if we change to a one-sided test, we would use H A : \(\mu < \mu_0\). If \(\bar {x}\) had a Z score smaller than -1.65, we would reject H 0 . If the null hypothesis is true, then we would observe such a case about 5% of the time.

By examining these two scenarios, we can determine that we will make a Type 1 Error 5% + 5% = 10% of the time if we are allowed to swap to the "best" one-sided test for the data. This is twice the error rate we prescribed with our significance level: \(\alpha = 0.05\) (!).
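The doubling of the error rate can be confirmed with a short calculation. This sketch assumes the test statistic is standard normal under H 0 and that the analyst always picks the one-sided test pointing toward the observed sample mean:

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

# One-sided cutoff: the Z value with upper-tail area alpha (about 1.645).
z_cutoff = std_normal.inv_cdf(1.0 - alpha)

# Choosing the tail after seeing the data means H0 is rejected
# whenever Z lands in EITHER tail beyond the one-sided cutoff.
upper_tail = 1.0 - std_normal.cdf(z_cutoff)
lower_tail = std_normal.cdf(-z_cutoff)
actual_error_rate = upper_tail + lower_tail

print(f"cutoff = {z_cutoff:.3f}, actual Type 1 Error rate = {actual_error_rate:.2f}")
```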

Caution: One-sided hypotheses are allowed only before seeing data

After observing data, it is tempting to turn a two-sided test into a one-sided test. Avoid this temptation. Hypotheses must be set up before observing the data. If they are not, the test must be two-sided.

## Choosing a Significance Level

Choosing a significance level for a test is important in many contexts, and the traditional level is 0.05. However, it is often helpful to adjust the significance level based on the application. We may select a level that is smaller or larger than 0.05 depending on the consequences of any conclusions reached from the test.

- If making a Type 1 Error is dangerous or especially costly, we should choose a small significance level (e.g. 0.01). Under this scenario we want to be very cautious about rejecting the null hypothesis, so we demand very strong evidence favoring H A before we would reject H 0 .
- If a Type 2 Error is relatively more dangerous or much more costly than a Type 1 Error, then we should choose a higher significance level (e.g. 0.10). Here we want to be cautious about failing to reject H 0 when the null is actually false. We will discuss this particular case in greater detail in Section 4.6.

Significance levels should reflect consequences of errors

The significance level selected for a test should reflect the consequences associated with Type 1 and Type 2 Errors.

Example 4.38

A car manufacturer is considering a higher quality but more expensive supplier for window parts in its vehicles. They sample a number of parts from their current supplier and also parts from the new supplier. They decide that if the high quality parts will last more than 12% longer, it makes financial sense to switch to this more expensive supplier. Is there good reason to modify the significance level in such a hypothesis test?

The null hypothesis is that the more expensive parts last no more than 12% longer while the alternative is that they do last more than 12% longer. This decision is just one of the many regular factors that have a marginal impact on the car and company. A significance level of 0.05 seems reasonable since neither a Type 1 nor a Type 2 error should be dangerous or (relatively) much more expensive.

Example 4.39

The same car manufacturer is considering a slightly more expensive supplier for parts related to safety, not windows. If the durability of these safety components is shown to be better than the current supplier, they will switch manufacturers. Is there good reason to modify the significance level in such an evaluation?

The null hypothesis would be that the suppliers' parts are equally reliable. Because safety is involved, the car company should be eager to switch to the slightly more expensive manufacturer (reject H 0 ) even if the evidence of increased safety is only moderately strong. A slightly larger significance level, such as \(\alpha = 0.10\), might be appropriate.

Exercise 4.40

A part inside of a machine is very expensive to replace. However, the machine usually functions properly even if this part is broken, so the part is replaced only if we are extremely certain it is broken based on a series of measurements. Identify appropriate hypotheses for this test (in plain language) and suggest an appropriate significance level. 32

8.1: The Elements of Hypothesis Testing


## Learning Objectives

- To understand the logical framework of tests of hypotheses.
- To learn basic terminology connected with hypothesis testing.
- To learn fundamental facts about hypothesis testing.

## Types of Hypotheses

A hypothesis about the value of a population parameter is an assertion about its value. As in the introductory example we will be concerned with testing the truth of two competing hypotheses, only one of which can be true.

## Definition: null hypothesis and alternative hypothesis

- The null hypothesis , denoted \(H_0\), is the statement about the population parameter that is assumed to be true unless there is convincing evidence to the contrary.
- The alternative hypothesis , denoted \(H_a\), is a statement about the population parameter that is contradictory to the null hypothesis, and is accepted as true only if there is convincing evidence in favor of it.

## Definition: statistical procedure

Hypothesis testing is a statistical procedure in which a choice is made between a null hypothesis and an alternative hypothesis based on information in a sample.

The end result of a hypotheses testing procedure is a choice of one of the following two possible conclusions:

- Reject \(H_0\) (and therefore accept \(H_a\)), or
- Fail to reject \(H_0\) (and therefore fail to accept \(H_a\)).

The null hypothesis typically represents the status quo, or what has historically been true. In the example of the respirators, we would believe the claim of the manufacturer unless there is reason not to do so, so the null hypothesis is \(H_0:\mu =75\). The alternative hypothesis in the example is the contradictory statement \(H_a:\mu <75\). The null hypothesis will always be an assertion containing an equals sign, but depending on the situation the alternative hypothesis can have any one of three forms: with the symbol \(<\), as in the example just discussed, with the symbol \(>\), or with the symbol \(\neq\). The following two examples illustrate the latter two cases.

## Example \(\PageIndex{1}\)

A publisher of college textbooks claims that the average price of all hardbound college textbooks is \(\$127.50\). A student group believes that the actual mean is higher and wishes to test their belief. State the relevant null and alternative hypotheses.

The default option is to accept the publisher’s claim unless there is compelling evidence to the contrary. Thus the null hypothesis is \(H_0:\mu =127.50\). Since the student group thinks that the average textbook price is greater than the publisher’s figure, the alternative hypothesis in this situation is \(H_a:\mu >127.50\).

## Example \(\PageIndex{2}\)

The recipe for a bakery item is designed to result in a product that contains \(8\) grams of fat per serving. The quality control department samples the product periodically to ensure that the production process is working as designed. State the relevant null and alternative hypotheses.

The default option is to assume that the product contains the amount of fat it was formulated to contain unless there is compelling evidence to the contrary. Thus the null hypothesis is \(H_0:\mu =8.0\). Since to contain either more fat than desired or to contain less fat than desired are both an indication of a faulty production process, the alternative hypothesis in this situation is that the mean is different from \(8.0\), so \(H_a:\mu \neq 8.0\).

In Example \(\PageIndex{1}\), the textbook example, it might seem more natural that the publisher’s claim be that the average price is at most \(\$127.50\), not exactly \(\$127.50\). If the claim were made this way, then the null hypothesis would be \(H_0:\mu \leq 127.50\), and the value \(\$127.50\) given in the example would be the one that is least favorable to the publisher’s claim, the null hypothesis. It is always true that if the null hypothesis is retained for its least favorable value, then it is retained for every other value.

Thus in order to make the null and alternative hypotheses easy for the student to distinguish, in every example and problem in this text we will always present one of the two competing claims about the value of a parameter with an equality. The claim expressed with an equality is the null hypothesis. This is the same as always stating the null hypothesis in the least favorable light. So in the introductory example about the respirators, we stated the manufacturer’s claim as “the average is \(75\) minutes” instead of the perhaps more natural “the average is at least \(75\) minutes,” essentially reducing the presentation of the null hypothesis to its worst case.

The first step in hypothesis testing is to identify the null and alternative hypotheses.

## The Logic of Hypothesis Testing

Although we will study hypothesis testing in situations other than for a single population mean (for example, for a population proportion instead of a mean or in comparing the means of two different populations), in this section the discussion will always be given in terms of a single population mean \(\mu\).

The null hypothesis always has the form \(H_0:\mu =\mu _0\) for a specific number \(\mu _0\) (in the respirator example \(\mu _0=75\), in the textbook example \(\mu _0=127.50\), and in the baked goods example \(\mu _0=8.0\)). Since the null hypothesis is accepted unless there is strong evidence to the contrary, the test procedure is based on the initial assumption that \(H_0\) is true. This point is so important that we will repeat it in a display:

The test procedure is based on the initial assumption that \(H_0\) is true.

The criterion for judging between \(H_0\) and \(H_a\) based on the sample data is: if the value of \(\overline{X}\) would be highly unlikely to occur if \(H_0\) were true, but favors the truth of \(H_a\), then we reject \(H_0\) in favor of \(H_a\). Otherwise we do not reject \(H_0\).

Supposing for now that \(\overline{X}\) follows a normal distribution, when the null hypothesis is true the density function for the sample mean \(\overline{X}\) must be as in Figure \(\PageIndex{1}\): a bell curve centered at \(\mu _0\). Thus if \(H_0\) is true then \(\overline{X}\) is likely to take a value near \(\mu _0\) and is unlikely to take values far away. Our decision procedure therefore reduces simply to:

- if \(H_a\) has the form \(H_a:\mu <\mu _0\) then reject \(H_0\) if \(\bar{x}\) is far to the left of \(\mu _0\);
- if \(H_a\) has the form \(H_a:\mu >\mu _0\) then reject \(H_0\) if \(\bar{x}\) is far to the right of \(\mu _0\);
- if \(H_a\) has the form \(H_a:\mu \neq \mu _0\) then reject \(H_0\) if \(\bar{x}\) is far away from \(\mu _0\) in either direction.

Think of the respirator example, for which the null hypothesis is \(H_0:\mu =75\), the claim that the average time air is delivered for all respirators is \(75\) minutes. If the sample mean is \(75\) or greater then we certainly would not reject \(H_0\) (since there is no issue with an emergency respirator delivering air even longer than claimed).

If the sample mean is slightly less than \(75\) then we would logically attribute the difference to sampling error and also not reject \(H_0\) either.

Values of the sample mean that are smaller and smaller are less and less likely to come from a population for which the population mean is \(75\). Thus if the sample mean is far less than \(75\), say around \(60\) minutes or less, then we would certainly reject \(H_0\), because we know that it is highly unlikely that the average of a sample would be so low if the population mean were \(75\). This is the rare event criterion for rejection: what we actually observed \((\overline{X}<60)\) would be so rare an event if \(\mu =75\) were true that we regard it as much more likely that the alternative hypothesis \(\mu <75\) holds.

In summary, to decide between \(H_0\) and \(H_a\) in this example we would select a “rejection region” of values sufficiently far to the left of \(75\), based on the rare event criterion, and reject \(H_0\) if the sample mean \(\overline{X}\) lies in the rejection region, but not reject \(H_0\) if it does not.

## The Rejection Region

Each different form of the alternative hypothesis \(H_a\) has its own kind of rejection region:

- if (as in the respirator example) \(H_a\) has the form \(H_a:\mu <\mu _0\), we reject \(H_0\) if \(\bar{x}\) is far to the left of \(\mu _0\), that is, to the left of some number \(C\), so the rejection region has the form of an interval \((-\infty ,C]\);
- if (as in the textbook example) \(H_a\) has the form \(H_a:\mu >\mu _0\), we reject \(H_0\) if \(\bar{x}\) is far to the right of \(\mu _0\), that is, to the right of some number \(C\), so the rejection region has the form of an interval \([C,\infty )\);
- if (as in the baked goods example) \(H_a\) has the form \(H_a:\mu \neq \mu _0\), we reject \(H_0\) if \(\bar{x}\) is far away from \(\mu _0\) in either direction, that is, either to the left of some number \(C\) or to the right of some other number \(C'\), so the rejection region has the form of the union of two intervals \((-\infty ,C]\cup [C',\infty )\).

The key issue in our line of reasoning is the question of how to determine the number \(C\) or numbers \(C\) and \(C'\), called the critical value or critical values of the statistic, that determine the rejection region.

## Definition: critical values

The critical value or critical values of a test of hypotheses are the number or numbers that determine the rejection region.

Suppose the rejection region is a single interval, so we need to select a single number \(C\). Here is the procedure for doing so. We select a small probability, denoted \(\alpha\), say \(1\%\), which we take as our definition of “rare event:” an event is “rare” if its probability of occurrence is less than \(\alpha\). (In all the examples and problems in this text the value of \(\alpha\) will be given already.) The probability that \(\overline{X}\) takes a value in an interval is the area under its density curve and above that interval, so as shown in Figure \(\PageIndex{2}\) (drawn under the assumption that \(H_0\) is true, so that the curve centers at \(\mu _0\)) the critical value \(C\) is the value of \(\overline{X}\) that cuts off a tail area \(\alpha\) in the probability density curve of \(\overline{X}\). When the rejection region is in two pieces, that is, composed of two intervals, the total area above both of them must be \(\alpha\), so the area above each one is \(\alpha /2\), as also shown in Figure \(\PageIndex{2}\).

The number \(\alpha\) is the total area of a tail or a pair of tails.
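As a compact summary of the construction just described, the critical values can be written out explicitly. This sketch assumes, as elsewhere in this section, that \(\overline{X}\) is normally distributed with known population standard deviation \(\sigma\) and sample size \(n\); the notation \(z_c\) (an assumption here, consistent with \(z_{0.05}\) in Example \(\PageIndex{3}\)) denotes the value that cuts off a right tail of area \(c\) under the standard normal curve.

\[\begin{align} H_a:\mu <\mu _0 &: \quad C=\mu _0-z_{\alpha }\dfrac{\sigma }{\sqrt{n}} \nonumber \\[5pt] H_a:\mu >\mu _0 &: \quad C=\mu _0+z_{\alpha }\dfrac{\sigma }{\sqrt{n}} \nonumber \\[5pt] H_a:\mu \neq \mu _0 &: \quad C=\mu _0-z_{\alpha /2}\dfrac{\sigma }{\sqrt{n}}, \quad C'=\mu _0+z_{\alpha /2}\dfrac{\sigma }{\sqrt{n}} \nonumber \end{align} \nonumber \]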

## Example \(\PageIndex{3}\)

In the context of Example \(\PageIndex{2}\), suppose that it is known that the population is normally distributed with standard deviation \(\sigma =0.15\) gram, and suppose that the test of hypotheses \(H_0:\mu =8.0\) versus \(H_a:\mu \neq 8.0\) will be performed with a sample of size \(5\). Construct the rejection region for the test for the choice \(\alpha =0.10\). Explain the decision procedure and interpret it.

If \(H_0\) is true then the sample mean \(\overline{X}\) is normally distributed with mean and standard deviation

\[\begin{align} \mu _{\overline{X}} &=\mu \nonumber \\[5pt] &=8.0 \nonumber \end{align} \nonumber \]

\[\begin{align} \sigma _{\overline{X}}&=\dfrac{\sigma}{\sqrt{n}} \nonumber \\[5pt] &= \dfrac{0.15}{\sqrt{5}} \nonumber\\[5pt] &=0.067 \nonumber \end{align} \nonumber \]

Since \(H_a\) contains the \(\neq\) symbol the rejection region will be in two pieces, each one corresponding to a tail of area \(\alpha /2=0.10/2=0.05\). From Figure 7.1.6, \(z_{0.05}=1.645\), so \(C\) and \(C'\) are \(1.645\) standard deviations of \(\overline{X}\) to the left and right of its mean \(8.0\):

\[C=8.0-(1.645)(0.067) = 7.89 \; \; \text{and}\; \; C'=8.0 + (1.645)(0.067) = 8.11 \nonumber \]

The result is shown in Figure \(\PageIndex{3}\), drawn for \(\alpha = 0.10\).

The decision procedure is: take a sample of size \(5\) and compute the sample mean \(\bar{x}\). If \(\bar{x}\) is either \(7.89\) grams or less or \(8.11\) grams or more then reject the hypothesis that the average amount of fat in all servings of the product is \(8.0\) grams in favor of the alternative that it is different from \(8.0\) grams. Otherwise do not reject the hypothesis that the average amount is \(8.0\) grams.

The reasoning is that if the true average amount of fat per serving were \(8.0\) grams then there would be less than a \(10\%\) chance that a sample of size \(5\) would produce a mean of either \(7.89\) grams or less or \(8.11\) grams or more. Hence if that happened it would be more likely that the value \(8.0\) is incorrect (always assuming that the population standard deviation is \(0.15\) gram).
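The critical values in this example can be reproduced numerically. The following is a minimal sketch, assuming `statistics.NormalDist` in place of the table lookup for \(z_{0.05}\); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 8.0      # hypothesized mean fat content in grams
sigma = 0.15    # known population standard deviation in grams
n = 5           # sample size
alpha = 0.10    # level of significance

# Standard deviation of the sample mean.
sd_of_mean = sigma / sqrt(n)

# Two-tailed test: each tail has area alpha / 2.
z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)

lower_c = mu_0 - z_crit * sd_of_mean   # C
upper_c = mu_0 + z_crit * sd_of_mean   # C'

print(f"z = {z_crit:.3f}, C = {lower_c:.2f}, C' = {upper_c:.2f}")
```

Rounded to two decimals, the output matches the rejection region \((-\infty ,7.89]\cup [8.11,\infty )\) found above.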

Because the rejection regions are computed based on areas in tails of distributions, as shown in Figure \(\PageIndex{2}\), hypothesis tests are classified according to the form of the alternative hypothesis in the following way.

## Definitions: Test classifications

- If \(H_a\) has the form \(\mu \neq \mu _0\) the test is called a two-tailed test .
- If \(H_a\) has the form \(\mu < \mu _0\) the test is called a left-tailed test .
- If \(H_a\) has the form \(\mu > \mu _0\)the test is called a right-tailed test .

Each of the last two forms is also called a one-tailed test .

## Two Types of Errors

The format of the testing procedure in general terms is to take a sample and use the information it contains to come to a decision about the two hypotheses. As stated before our decision will always be either

- reject the null hypothesis \(H_0\) in favor of the alternative \(H_a\) presented, or
- do not reject the null hypothesis \(H_0\) in favor of the alternative \(H_a\) presented.

There are four possible outcomes of a hypothesis testing procedure, as shown in the following table:

As the table shows, there are two ways to be right and two ways to be wrong. Typically to reject \(H_0\) when it is actually true is a more serious error than to fail to reject it when it is false, so the former error is labeled “ Type I ” and the latter error “ Type II ”.

## Definition: Type I and Type II errors

In a test of hypotheses:

- A Type I error is the decision to reject \(H_0\) when it is in fact true.
- A Type II error is the decision not to reject \(H_0\) when it is in fact not true.

Unless we perform a census we do not have certain knowledge, so we do not know whether our decision matches the true state of nature or if we have made an error. We reject \(H_0\) if what we observe would be a “rare” event if \(H_0\) were true. But rare events are not impossible: they occur with probability \(\alpha\). Thus when \(H_0\) is true, a rare event will be observed in the proportion \(\alpha\) of repeated similar tests, and \(H_0\) will be erroneously rejected in those tests. Thus \(\alpha\) is the probability that in following the testing procedure to decide between \(H_0\) and \(H_a\) we will make a Type I error.

## Definition: level of significance

The number \(\alpha\) that is used to determine the rejection region is called the level of significance of the test. It is the probability that the test procedure will result in a Type I error .

The probability of making a Type II error is too complicated to discuss in a beginning text, so we will say no more about it than this: for a fixed sample size, choosing \(\alpha\) smaller in order to reduce the chance of making a Type I error has the effect of increasing the chance of making a Type II error . The only way to simultaneously reduce the chances of making either kind of error is to increase the sample size.

## Standardizing the Test Statistic

Hypothesis testing will be considered in a number of contexts, and great unification as well as simplification results when the relevant sample statistic is standardized by subtracting its mean from it and then dividing by its standard deviation. The resulting statistic is called a standardized test statistic . In every situation treated in this and the following two chapters the standardized test statistic will have either the standard normal distribution or Student’s \(t\)-distribution.

## Definition: hypothesis test

A standardized test statistic for a hypothesis test is the statistic that is formed by subtracting from the statistic of interest its mean and dividing by its standard deviation.

For example, reviewing Example \(\PageIndex{3}\), if instead of working with the sample mean \(\overline{X}\) we instead work with the test statistic

\[\frac{\overline{X}-8.0}{0.067} \nonumber \]

then the distribution involved is standard normal and the critical values are just \(\pm z_{0.05}\). The extra work that was done to find that \(C=7.89\) and \(C'=8.11\) is eliminated. In every hypothesis test in this book the standardized test statistic will be governed by either the standard normal distribution or Student’s \(t\)-distribution. Information about rejection regions is summarized in the following tables:

Every instance of hypothesis testing discussed in this and the following two chapters will have a rejection region like one of the six forms tabulated in the tables above.

No matter what the context a test of hypotheses can always be performed by applying the following systematic procedure, which will be illustrated in the examples in the succeeding sections.

## Systematic Hypothesis Testing Procedure: Critical Value Approach

- Identify the null and alternative hypotheses.
- Identify the relevant test statistic and its distribution.
- Compute from the data the value of the test statistic.
- Construct the rejection region.
- Compare the value computed in Step 3 to the rejection region constructed in Step 4 and make a decision. Formulate the decision in the context of the problem, if applicable.
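The five steps above can be sketched in code for the simplest setting. This is an illustration only: it assumes a test about a single mean from a normal population with known \(\sigma\), so that the standardized test statistic is standard normal. The function name `critical_value_test` and the two sample means passed to it are hypothetical, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist

def critical_value_test(x_bar, mu_0, sigma, n, alpha, tail):
    """Z test for one mean with known sigma.

    tail is 'left', 'right', or 'two' (Step 1: the form of H_a).
    Returns the standardized test statistic and the reject decision.
    """
    std_normal = NormalDist()                  # Step 2: Z is standard normal
    z = (x_bar - mu_0) / (sigma / sqrt(n))     # Step 3: compute the statistic
    # Steps 4 and 5: build the rejection region and compare.
    if tail == "left":
        reject = z <= -std_normal.inv_cdf(1.0 - alpha)
    elif tail == "right":
        reject = z >= std_normal.inv_cdf(1.0 - alpha)
    elif tail == "two":
        reject = abs(z) >= std_normal.inv_cdf(1.0 - alpha / 2.0)
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, reject

# Baked goods setting (mu_0 = 8.0, sigma = 0.15, n = 5, alpha = 0.10)
# with two made-up sample means used purely for illustration.
z_far, reject_far = critical_value_test(8.20, 8.0, 0.15, 5, 0.10, "two")
z_near, reject_near = critical_value_test(8.05, 8.0, 0.15, 5, 0.10, "two")

print(f"x-bar = 8.20: z = {z_far:.2f}, reject H0: {reject_far}")
print(f"x-bar = 8.05: z = {z_near:.2f}, reject H0: {reject_near}")
```

A sample mean of 8.20 lies beyond \(C'=8.11\), so the sketch rejects \(H_0\); a sample mean of 8.05 lies between 7.89 and 8.11, so it does not.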

The procedure that we have outlined in this section is called the “Critical Value Approach” to hypothesis testing to distinguish it from an alternative but equivalent approach that will be introduced at the end of Section 8.3.

## Key Takeaway

- A test of hypotheses is a statistical process for deciding between two competing assertions about a population parameter.
- The testing procedure is formalized in a five-step procedure.

Hypothesis tests

Formal hypothesis testing is perhaps the most prominent and widely-employed form of statistical analysis. It is sometimes seen as the most rigorous and definitive part of a statistical analysis, but it is also the source of many statistical controversies. The currently-prevalent approach to hypothesis testing dates to developments that took place between 1925 and 1940, especially the work of Ronald Fisher , Jerzy Neyman , and Egon Pearson .

In recent years, many prominent statisticians have argued that less emphasis should be placed on the formal hypothesis testing approaches developed in the early twentieth century, with a correspondingly greater emphasis on other forms of uncertainty analysis. Our goal here is to give an overview of some of the well-established and widely-used approaches for hypothesis testing. We will also provide some perspectives on how these tools can be effectively used, and discuss their limitations. We will also discuss some new approaches to hypothesis testing that may eventually come to be as prominent as these classical approaches.

A falsifiable hypothesis is a statement, or hypothesis, that can be contradicted with evidence. In empirical (data-driven) research, this evidence will always be obtained through the data. In statistical hypothesis testing, the hypothesis that we formally test is called the null hypothesis . The alternative hypothesis is a second hypothesis that is our proposed explanation for what happens if the null hypothesis is wrong.

## Test statistics

The key element of a statistical hypothesis test is the test statistic , which (like any statistic) is a function of the data. A test statistic takes our entire dataset, and reduces it to one number. This one number ideally should contain all the information in the data that is relevant for assessing the two hypotheses of interest, and exclude any aspects of the data that are irrelevant for assessing the two hypotheses. The test statistic measures evidence against the null hypothesis. Most test statistics are constructed so that a value of zero represents the lowest possible level of evidence against the null hypothesis. Test statistic values that deviate from zero represent greater levels of evidence against the null hypothesis. The larger the magnitude of the test statistic, the stronger the evidence against the null hypothesis.

A major theme of statistical research is to devise effective ways to construct test statistics. Many useful ways to do this have been devised, and there is no single approach that is always the best. In this introductory course, we will focus on tests that start with an estimate of a quantity that is relevant for assessing the hypotheses, then proceed by standardizing this estimate by dividing it by its standard error. This approach is sometimes referred to as “Wald testing”, after Abraham Wald.

## Testing the equality of two proportions

As a basic example, let’s consider risk perception related to COVID-19. As you will see below, hypothesis testing can appear at first to be a fairly elaborate exercise. Using this example, we describe each aspect of this exercise in detail below.

## The data and research question

The data shown below are simulated but are designed to reflect actual surveys conducted in the United States in March of 2020. Participants were asked whether they perceive that they have a substantial risk of dying if they are infected with the novel coronavirus. The number of people giving each response, stratified by age, is shown below (only two age groups are shown):

| Age group | High risk | Not high risk |
|---|---|---|
| Younger | 25 | 202 |
| Older | 30 | 124 |

Each subject’s response is binary: they either perceive themselves to be at high risk, or not to be at high risk. When working with this type of data, we are usually interested in the proportion of people who provide each response within each stratum (age group). These are conditional proportions, conditioning on the age group. The numerical values of the conditional proportions are given below:

| Age group | High risk | Not high risk |
|---|---|---|
| Younger | 0.110 | 0.890 |
| Older | 0.195 | 0.805 |

There are four conditional proportions in the table above: the proportion of younger people who perceive themselves to be at high risk, 0.110=25/(25+202); the proportion of younger people who do not perceive themselves to be at high risk, 0.890=202/(25+202); the proportion of older people who perceive themselves to be at high risk, 0.195=30/(30+124); and the proportion of older people who do not perceive themselves to be at high risk, 0.805=124/(30+124).
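As a small illustration, the four conditional proportions can be recomputed from the counts quoted in the text. This is only a sketch; the names `counts` and `conditional_proportions` are chosen here for illustration and do not come from the original analysis.

```python
# Counts quoted in the text: (perceive high risk, do not perceive high risk).
counts = {
    "younger": (25, 202),
    "older": (30, 124),
}


def conditional_proportions(high, not_high):
    """Return the proportion of each response within a single age group."""
    total = high + not_high
    return high / total, not_high / total


# Proportions are conditional on age group, so each pair sums to one.
proportions = {
    group: conditional_proportions(high, not_high)
    for group, (high, not_high) in counts.items()
}

for group, (p_high, p_not) in proportions.items():
    print(f"{group}: high risk = {p_high:.3f}, not high risk = {p_not:.3f}")
```

Because each row is divided by its own total, the two proportions within an age group always sum to one, which is what makes them conditional on age group.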

The trend in the data is that younger people perceive themselves to be at lower risk of dying than older people, by a difference of 0.195-0.110=0.085 (in terms of proportions). But is this trend only present in this sample, or is it generalizable to a broader population (say the entire US population)? That is the goal of conducting a statistical hypothesis test in this setting.

## The population structure

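Corresponding to our data above is the unobserved population structure, which we can denote as follows:

| Age group | High risk | Not high risk |
|---|---|---|
| Younger | \(p\) | \(1-p\) |
| Older | \(q\) | \(1-q\) |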

The symbols \(p\) and \(q\) in the table above are population parameters. These are quantities that we do not know, and wish to assess using the data. In this case, our null hypothesis can be expressed as the statement \(p = q\). We can estimate \(p\) using the sample proportion \(\hat{p} = 0.110\), and similarly estimate \(q\) using \(\hat{q} = 0.195\). However, these estimates do not immediately provide us with a way of expressing the evidence relating to the hypothesis that \(p=q\). This is provided by the test statistic.

## A test statistic

As noted above, a test statistic is a reduction of the data to one number that captures all of the relevant information for assessing the hypotheses. A natural first choice for a test statistic here would be the difference in sample proportions between the two age groups, which is 0.195 - 0.110 = 0.085. There is a difference of 0.085 between the perceived risks of death in the younger and older age groups.

The difference in rates (0.085) does not on its own make a good test statistic, although it is a good start toward obtaining one. The reason for this is that the evidence underlying this difference in rates depends also on the absolute rates (0.110 and 0.195), and on the sample sizes (227 and 154). If we only know that the difference in rates is 0.085, this is not sufficient to evaluate the hypothesis in a statistical manner. A given difference in rates is much stronger evidence if it is obtained from a larger sample. If we have a difference of 0.085 with a very large sample, say one million people, then we should be almost certain that the true rates differ (i.e. the data are highly incompatible with the hypothesis that \(p=q\)). If we have the same difference in rates of 0.085, but with a small sample, say 50 people per age group, then there would be almost no evidence for a true difference in the rates (i.e. the data are compatible with the hypothesis \(p=q\)).
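The dependence on sample size can be made concrete with a short sketch. It anticipates the standardization described in the paragraphs that follow (dividing the difference by its estimated standard error). The function name `z_for_difference` is illustrative, and splitting the one million people evenly into two groups of 500,000 is an assumption made here.

```python
import math


def z_for_difference(p1, p2, n1, n2):
    """Difference in proportions divided by its estimated standard error."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p2 - p1) / se


# Same observed rates (0.110 and 0.195), very different sample sizes.
z_small = z_for_difference(0.110, 0.195, 50, 50)
z_large = z_for_difference(0.110, 0.195, 500_000, 500_000)

print(f"50 per group:      z = {z_small:.2f}")
print(f"500,000 per group: z = {z_large:.1f}")
```

The difference of 0.085 is identical in both calls. Only the standard error changes, and with it the strength of the evidence.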

To address this issue, we need to consider the uncertainty in the estimated rate difference, which is 0.085. Recall that the estimated rate difference is obtained from the sample and therefore is almost certain to deviate somewhat from the true rate difference in the population (which is unknown). Recall from our study of standard errors that the standard error for an estimated proportion is \(\sqrt{p(1-p)/n}\) , where \(p\) is the outcome probability (here the outcome is that a person perceives a high risk of dying), and \(n\) is the sample size.

In the present analysis, we are comparing two proportions, so we have two standard errors. The estimated standard error for the younger people is \(\sqrt{0.11\cdot 0.89/227} \approx 0.021\) . The estimated standard error for the older people is \(\sqrt{0.195\cdot 0.805/154} \approx 0.032\) . Note that both standard errors are estimated, rather than exact, because we are plugging in estimates of the rates (0.11 and 0.195). Also note that the standard error for the rate among older people is greater than that for younger people. This is because the sample size for older people is smaller, and also because the estimated rate for older people is closer to 1/2.

In our previous discussion of standard errors, we saw how standard errors for independent quantities \(A\) and \(B\) can be used to obtain the standard error for the difference \(A-B\) . Applying that result here, we see that the standard error for the estimated difference in rates 0.195-0.11=0.085 is \(\sqrt{0.021^2 + 0.032^2} \approx 0.038\) .

The final step in constructing our test statistic is to construct a Z-score from the estimated difference in rates. As with all Z-scores, we proceed by taking the estimated difference in rates, and then divide it by its standard error. Thus, we get a test statistic value of \(0.085 / 0.038 \approx 2.24\) .
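The construction above can be collected into one small function. This is a sketch that follows the steps in the text (plug-in standard errors, combined under independence, then a Z-score). The name `two_proportion_z` is illustrative. Because the code does not round intermediate values, its Z-score differs slightly from the hand-rounded 2.24.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Wald Z-score for the difference between two sample proportions.

    Each standard error is estimated by plugging the sample proportion
    into sqrt(p * (1 - p) / n). The two groups are assumed independent.
    """
    p1, p2 = x1 / n1, x2 / n2
    se1 = math.sqrt(p1 * (1 - p1) / n1)
    se2 = math.sqrt(p2 * (1 - p2) / n2)
    se_diff = math.sqrt(se1**2 + se2**2)
    return (p2 - p1) / se_diff, se1, se2, se_diff


# Younger group: 25 of 227 perceive high risk; older group: 30 of 154.
z, se_young, se_old, se_diff = two_proportion_z(25, 227, 30, 154)

print(f"SE (younger)    = {se_young:.3f}")
print(f"SE (older)      = {se_old:.3f}")
print(f"SE (difference) = {se_diff:.3f}")
print(f"Z-score         = {z:.2f}")
```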

A test statistic value of 2.24 is not very close to zero, so there is some evidence against the null hypothesis. But the strength of this evidence remains unclear. Thus, we must consider how to calibrate this evidence in a way that makes it more interpretable.

## Calibrating the evidence in the test statistic

By the central limit theorem (CLT), a Z-score approximately follows a normal distribution. When the null hypothesis holds, the Z-score approximately follows the standard normal distribution (recall that a standard normal distribution is a normal distribution with expected value equal to 0 and variance equal to 1). If the null hypothesis does not hold, then the test statistic continues to approximately follow a normal distribution, but it is not the standard normal distribution.

A test statistic of zero represents the least possible evidence against the null hypothesis. Here, we will obtain a test statistic of zero when the two proportions being compared are identical, i.e. exactly the same proportions of younger and older people perceive a substantial risk of dying from a disease. Even if the test statistic is exactly zero, this does not guarantee that the null hypothesis is true. However it is the least amount of evidence that the data can present against the null hypothesis.

In a hypothesis testing setting using normally distributed Z-scores, as is the case here (due to the CLT), the standard normal distribution is the reference distribution for our test statistic. If the Z-score falls in the center of the reference distribution, there is no evidence against the null hypothesis. If the Z-score falls into either tail of the reference distribution, then there is evidence against the null hypothesis, and the further into the tails of the reference distribution the Z-score falls, the greater the evidence.

The most conventional way to quantify the evidence in our test statistic is through a probability called the p-value. The p-value has a somewhat complex definition that many people find difficult to grasp. It is the probability of observing as much or more evidence against the null hypothesis as we actually observe, calculated when the null hypothesis is assumed to be true. We will discuss some ways to think about this more intuitively below.

For our purposes, “evidence against the null hypothesis” is reflected in how far into the tails of the reference distribution the Z-score (test statistic) falls. We observed a test statistic of 2.24 in our COVID risk perception analysis. Recall that due to the “empirical rule”, 95% of the time, a draw from a standard normal distribution falls between -2 and 2. Thus, the p-value must be less than 0.05, since 2.24 falls outside this interval. The p-value can be calculated using a computer; in this case it is approximately 0.025.
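One way the computer calculation might look is sketched below using only Python's standard library. The function name `two_sided_p_value` is illustrative. It counts evidence in both tails of the standard normal reference distribution, matching the description above.

```python
import math


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))


p_observed = two_sided_p_value(2.24)   # test statistic from the text
p_at_two = two_sided_p_value(2.0)      # boundary used by the empirical rule

print(f"p-value for z = 2.24: {p_observed:.3f}")
print(f"p-value for z = 2.00: {p_at_two:.3f}")
```

The second call shows why the empirical rule is only approximate: a Z-score of exactly 2 corresponds to a p-value slightly below 0.05.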

As stated above, the p-value tells us how likely it would be for us to obtain as much evidence against the null hypothesis as we observed in our actual data analysis, if we were certain that the null hypothesis were true. When the null hypothesis holds, any evidence against the null hypothesis is spurious. Thus, we will want to see stronger evidence against the null from our actual analysis than we would see if we knew that the null hypothesis were true. A smaller p-value therefore reflects more evidence against the null hypothesis than a larger p-value.

By convention, p-values of 0.05 or smaller are considered to represent sufficiently strong evidence against the null hypothesis to make a finding “statistically significant”. This threshold of 0.05 was chosen arbitrarily 100 years ago, and there is no objective reason for it. In recent years, people have argued that either a lesser or a greater p-value threshold should be used. But largely due to convention, the practice of deeming p-values smaller than 0.05 to be statistically significant continues.

## Summary of this example

Here is a restatement of the above discussion, using slightly different language. In our analysis of COVID risk perceptions, we found a difference in proportions of 0.085 between younger and older subjects, with younger people perceiving a lower risk of dying. This is a difference based on the sample of data that we observed, but what we really want to know is whether there is a difference in COVID risk perception in the population (say, all US adults).

Suppose that in fact there is no difference in risk perception between younger and older people. For instance, suppose that in the population, 15% of people believe that they have a substantial risk of dying should they become infected with the novel coronavirus, regardless of their age. Even though the rates are equal in this imaginary population (both being 15%), the rates in our sample would typically not be equal. Around 3% of the time (approximately 0.025, or 2.5%), if the rates are actually equal in the population, we would see a test statistic that is 2.24 or larger in magnitude. Since 3% represents a fairly rare event, we can conclude that our observed data are not very compatible with the null hypothesis. We can also say that there is statistically significant evidence against the null hypothesis, and that we have “rejected” the null hypothesis at the 3% level.

In this data analysis, as in any data analysis, we cannot confirm definitively that the alternative hypothesis is true. But based on our data and the analysis performed above, we can claim that there is substantial evidence against the null hypothesis, using standard criteria for what is considered to be “substantial evidence”.

## Comparison of means

A very common setting where hypothesis testing is used arises when we wish to compare the means of a quantitative measurement obtained for two populations. Imagine, for example, that we have two ways of manufacturing a battery, and we wish to assess which approach yields batteries that are longer-lasting in actual use. To do this, suppose we obtain data that tells us the number of charge cycles that were completed in 200 batteries of type A, and in 300 batteries of type B. For the test developed below to be meaningful, the data must be independent and identically distributed samples.

The raw data for this study consists of 500 numbers, but it turns out that the most relevant information from the data is contained in the sample means and sample standard deviations computed within each battery type. Note that this is a huge reduction in complexity, since we started with 500 measurements and are able to summarize this down to just four numbers.

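Suppose the summary statistics are as follows, where \(\bar{x}\), \(\hat{\sigma}_x\), and \(n\) denote the sample mean, sample standard deviation, and sample size, respectively.

| Battery type | \(\bar{x}\) | \(\hat{\sigma}_x\) | \(n\) |
|---|---|---|---|
| A | 420 | 70 | 200 |
| B | 403 | 90 | 300 |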

The simplest measure comparing the two manufacturing approaches is the difference 420 - 403 = 17. That is, batteries of type A tend to have 17 more charge cycles compared to batteries of type B. This difference is present in our sample, but is it also true that the entire population of type A batteries has more charge cycles than the entire population of type B batteries? That is the goal of conducting a hypothesis test.

The next step in the present analysis is to divide the mean difference, which is 17, by its standard error. As we have seen, the standard error of the mean, or SEM, is \(\sigma/\sqrt{n}\), where \(\sigma\) is the standard deviation and \(n\) is the sample size. Since \(\sigma\) is almost never known, we plug in its estimate \(\hat{\sigma}\). For the type A batteries, the estimated SEM is thus \(70/\sqrt{200} \approx 4.95\), and for the type B batteries the estimated SEM is \(90/\sqrt{300} \approx 5.2\).

Since we are comparing two estimated means that are obtained from independent samples, we can combine the two standard errors to obtain the standard error of the difference, \(\sqrt{4.95^2 + 5.2^2} \approx 7.18\). We can now obtain our test statistic \(17/7.18 \approx 2.37\).

The test statistic can be calibrated against a standard normal reference distribution. The probability of observing a standard normal value that is greater in magnitude than 2.37 is 0.018 (this can be obtained from a computer). This is the p-value, and since it is smaller than the conventional threshold of 0.05, we can claim that there is a statistically significant difference between the average number of charge cycles for the two types of batteries, with the A batteries having more charge cycles on average.
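The battery comparison can be reproduced from the four summary statistics alone. The following is a minimal sketch using only the standard library, with the summary values quoted in the text. The function name `two_sample_z` is illustrative.

```python
import math


def two_sample_z(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two independent samples Z-test computed from summary statistics."""
    sem_a = sd_a / math.sqrt(n_a)                # standard error of mean A
    sem_b = sd_b / math.sqrt(n_b)                # standard error of mean B
    se_diff = math.sqrt(sem_a**2 + sem_b**2)     # SE of the difference
    z = (mean_a - mean_b) / se_diff
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided normal p-value
    return z, p, sem_a, sem_b, se_diff


z, p, sem_a, sem_b, se_diff = two_sample_z(420, 70, 200, 403, 90, 300)

print(f"SEM A = {sem_a:.2f}, SEM B = {sem_b:.2f}, SE diff = {se_diff:.2f}")
print(f"z = {z:.2f}, p-value = {p:.3f}")
```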

The analysis illustrated here is called a two independent samples Z-test, or just a two sample Z-test. It may be the most commonly employed of all statistical tests. It is also common to see the very similar two sample t-test, which is different only in that it uses the Student t distribution rather than the normal (Gaussian) distribution to calculate the p-values. In fact, there are quite a few minor variations on this testing framework, including “one sided” and “two sided” tests, and tests based on different ways of pooling the variance. Due to the CLT, if the sample size is modestly large (which is the case here), the results of all of these tests will be almost identical. For simplicity, we only cover the Z-test in this course.

## Assessment of a correlation

The tests for comparing proportions and means presented above are quite similar in many ways. To provide one more example of a hypothesis test that is somewhat different, we consider a test for a correlation coefficient.

Recall that the sample correlation coefficient \(\hat{r}\) is used to assess the relationship, or association, between two quantities X and Y that are measured on the same units. For example, we may ask whether two biomarkers, serum creatinine and D-dimer, are correlated with each other. These biomarkers are both commonly used in medical settings and are obtained using blood tests. D-dimer is used to assess whether a person has blood clots, and serum creatinine is used to measure kidney performance.

Suppose we are interested in whether there is a correlation in the population between D-dimer and serum creatinine. The population correlation coefficient between these two quantities can be denoted \(r\). Our null hypothesis is \(r=0\). Suppose that we observe a sample correlation coefficient of \(\hat{r}=0.15\), using an independent and identically distributed sample of pairs \((x, y)\), where \(x\) is a D-dimer measurement and \(y\) is a serum creatinine measurement. Are these data consistent with the null hypothesis?

As above, we proceed by constructing a test statistic by taking the estimated statistic and dividing it by its standard error. The approximate standard error for \(\hat{r}\) is \(1/\sqrt{n}\) , where \(n\) is the sample size. The test statistic is therefore \(\sqrt{n}\cdot \hat{r} \approx 1.48\) .

We now calibrate this test statistic by comparing it to a standard normal reference distribution. Recall from the empirical rule that 5% of the time, a standard normal value falls outside the interval (-2, 2). Therefore, if the test statistic is smaller than 2 in magnitude, as is the case here, its p-value is greater than 0.05. Thus, in this case we know that the p-value will exceed 0.05 without calculating it, and therefore there is no basis for claiming that D-dimer and serum creatinine levels are correlated in this population.
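The correlation test can be sketched in a few lines. The text does not state the sample size, so `n = 97` below is a hypothetical value, chosen only because it gives a test statistic close to the quoted 1.48. The function name `correlation_z_test` is also illustrative.

```python
import math


def correlation_z_test(r_hat, n):
    """Test of r = 0 using the approximate standard error 1 / sqrt(n)."""
    z = math.sqrt(n) * r_hat
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return z, p


# n = 97 is an assumption; the sample size is not given in the text.
z, p = correlation_z_test(r_hat=0.15, n=97)

print(f"z = {z:.2f}, p-value = {p:.3f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

Under this assumed sample size the p-value exceeds 0.05, in line with the empirical-rule argument above.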

## Sampling properties of p-values

A p-value is the most common way of calibrating evidence. Smaller p-values indicate stronger evidence against a null hypothesis. By convention, if the p-value is smaller than some threshold, usually 0.05, we reject the null hypothesis and declare a finding to be “statistically significant”. How can we understand more deeply what this means? One major concern should be obtaining a small p-value when the null hypothesis is true. If the null hypothesis is true, then it is incorrect to reject it. If we reject the null hypothesis, we are making a false claim. This can never be prevented with complete certainty, but we would like to have a very clear understanding of how likely it is to reject the null hypothesis when the null hypothesis is in fact true.

P-values have the special property that when the null hypothesis is true, the probability of observing a p-value smaller than 0.05 is 0.05 (5%). In fact, the probability of observing a p-value smaller than \(t\) is equal to \(t\), for any threshold \(t\). For example, the probability of observing a p-value smaller than 0.1, when the null hypothesis is true, is 10%.

This fact gives a more concrete understanding of how strong the evidence is for a particular p-value. If we always reject the null hypothesis when the p-value is 0.1 or smaller, then over the long run we will reject the null hypothesis 10% of the time when the null hypothesis is true. If we always reject the null hypothesis when the p-value is 0.05 or smaller, then over the long run we will reject the null hypothesis 5% of the time when the null hypothesis is true.
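This long-run behavior can be checked with a small simulation. The sketch below repeatedly draws two samples from the same normal distribution (so the null hypothesis is true by construction), runs a two sample Z-test each time, and records how often the p-value falls below a threshold. The number of simulations, group size, and seed are arbitrary choices made here, and the rejection rates will only be approximately equal to the thresholds.

```python
import math
import random
import statistics


def null_p_values(n_sims=2000, n_per_group=50, seed=0):
    """P-values from two sample Z-tests when both groups share one distribution."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # Estimated standard error of the difference in sample means.
        se = math.sqrt(statistics.variance(a) / n_per_group
                       + statistics.variance(b) / n_per_group)
        z = (statistics.mean(a) - statistics.mean(b)) / se
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return p_values


def fraction_below(p_values, threshold):
    """Share of simulated p-values that fall below the threshold."""
    return sum(p < threshold for p in p_values) / len(p_values)


p_values = null_p_values()
for t in (0.05, 0.10):
    print(f"threshold {t:.2f}: rejected in {fraction_below(p_values, t):.3f} of runs")
```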

The approach to hypothesis testing discussed above largely follows the framework developed by RA Fisher around 1925. Note that although we mentioned the alternative hypothesis above, we never actually used it. A more elaborate approach to hypothesis testing was developed somewhat later by Egon Pearson and Jerzy Neyman. The “Neyman-Pearson” approach to hypothesis testing is even more formal than Fisher’s approach, and is most suited to highly planned research efforts in which the study is carefully designed, then executed. While ideally all research projects should be carried out this way, in reality we often conduct research using data that are already available, rather than using data that are specifically collected to address the research question.

Neyman-Pearson hypothesis testing involves specifying an alternative hypothesis that we anticipate encountering. Usually this alternative hypothesis represents a realistic guess about what we might find once the data are collected. In each of the three examples above, imagine that the data are not yet collected, and we are asked to specify an alternative hypothesis. We may arrive at the following:

In comparing risk perceptions for COVID, we may anticipate that older people will perceive a 30% risk of dying, and younger people will anticipate a 5% risk of dying.

In comparing the number of charge cycles for two types of batteries, we may anticipate that battery type A will have on average 500 charge cycles, and battery type B will have on average 400 charge cycles.

In assessing the correlation between D-dimer and serum creatinine levels, we may anticipate a correlation of 0.3.

Note that none of the numbers stated here are data-driven – they are specified before any data are collected, so they do not match the results from the data, which were collected only later. These alternative hypotheses are all essentially speculations, based perhaps on related data or theoretical considerations.

There are several benefits of specifying an explicit alternative hypothesis, as done here, even though it is not strictly necessary and can be avoided entirely by adopting Fisher’s approach to hypothesis testing. One benefit of specifying an alternative hypothesis is that we can use it to assess the power of our planned study, which can in turn inform the design of the study, in particular the sample size. The power is the probability of rejecting the null hypothesis when the alternative hypothesis is true. That is, it is the probability of discovering something real. The power should be contrasted with the level of a hypothesis test, which is the probability of rejecting the null hypothesis when the null hypothesis is true. That is, the level is the probability of “discovering” something that is not real.

To calculate the power, recall that for many of the test statistics that we are considering here, the test statistic has the form \(\hat{\theta}/{\rm SE}(\hat{\theta})\), where \(\hat{\theta}\) is an estimate. For example, \(\hat{\theta}\) may be the sample correlation coefficient between D-dimer and serum creatinine levels, in which case the test statistic is \(\sqrt{n}\hat{r}\). As stated above, the power is the probability of rejecting the null hypothesis when the alternative hypothesis is true. Suppose we decide to reject the null hypothesis when the test statistic is greater than 2, which is approximately equivalent to rejecting the null hypothesis when the p-value is less than 0.05. The following calculation tells us how to obtain the power in this setting:

\[
P(\sqrt{n}\hat{r} > 2) = P\left(\sqrt{n}(\hat{r} - r) > 2 - \sqrt{n}r\right).
\]

Under the alternative hypothesis, \(\sqrt{n}(\hat{r} - r)\) approximately follows a standard normal distribution. Therefore, if \(r\) and \(n\) are given, we can easily use the computer to obtain the probability of observing a value greater than \(2 - \sqrt{n}r\) . This gives us the power of the test. For example, if we anticipate \(r=0.3\) and plan to collect data for \(n=100\) observations, the power is 0.84. This is generally considered to be good power – if the true value of \(r\) is in fact 0.3, we would reject the null hypothesis 84% of the time.
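The power calculation described above can be sketched as follows, using the normal approximation and the rejection threshold of 2 from the text. The function names are illustrative. The second call, with `n = 25`, is an added example to show how power drops with a smaller sample.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * math.erfc(-x / math.sqrt(2))


def correlation_test_power(r, n, threshold=2.0):
    """Approximate P(sqrt(n) * r_hat > threshold) when the true correlation is r.

    Relies on sqrt(n) * (r_hat - r) being approximately standard normal.
    """
    return 1.0 - normal_cdf(threshold - math.sqrt(n) * r)


power_100 = correlation_test_power(r=0.3, n=100)
power_25 = correlation_test_power(r=0.3, n=25)   # smaller, hypothetical study

print(f"n = 100: power = {power_100:.2f}")
print(f"n = 25:  power = {power_25:.2f}")
```

Evaluating the function over a range of `n` is one simple way to choose a sample size that reaches a target power for an anticipated value of `r`.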

A study usually has poor power because its sample size is too small. Poorly powered studies can be very misleading, but since large sample sizes are expensive to collect, a lot of research is conducted using sample sizes that yield moderate or even low power. If a study has low power, it is unlikely to reject the null hypothesis even when the alternative hypothesis is true, but it remains possible to reject the null hypothesis when the null hypothesis is true (usually this probability is 5%). Therefore, when a poorly powered study does reject the null hypothesis, there is a real possibility that the rejection is incorrect.

v.19(7); 2019 Jul

Hypothesis tests

- Hypothesis tests are used to assess whether a difference between two samples represents a real difference between the populations from which the samples were taken.
- A null hypothesis of ‘no difference’ is taken as a starting point, and we calculate the probability that both sets of data came from the same population. This probability is expressed as a p-value.
- When the null hypothesis is false, p-values tend to be small. When the null hypothesis is true, any p-value is equally likely.

## Learning objectives

By reading this article, you should be able to:

- Explain why hypothesis testing is used.
- Use a table to determine which hypothesis test should be used for a particular situation.
- Interpret a p-value.

A hypothesis test is a procedure used in statistics to assess whether a particular viewpoint is likely to be true. Hypothesis tests follow a strict protocol, and they generate a ‘p-value’, on the basis of which a decision is made about the truth of the hypothesis under investigation. The routine statistical ‘tests’ used in research (t-tests, χ² tests, Mann–Whitney tests, etc.) are all hypothesis tests, and in spite of their differences they are all used in essentially the same way. But why do we use them at all?

Comparing the heights of two individuals is easy: we can measure their height in a standardised way and compare them. When we want to compare the heights of two small well-defined groups (for example two groups of children), we need to use a summary statistic that we can calculate for each group. Such summaries (means, medians, etc.) form the basis of descriptive statistics, and are well described elsewhere. 1 However, a problem arises when we try to compare very large groups or populations: it may be impractical or even impossible to take a measurement from everyone in the population, and by the time you do so, the population itself will have changed. A similar problem arises when we try to describe the effects of drugs—for example by how much on average does a particular vasopressor increase MAP?

To solve this problem, we use random samples to estimate values for populations. By convention, the values we calculate from samples are referred to as statistics and denoted by Latin letters (\(\bar{x}\) for sample mean; SD for sample standard deviation) while the unknown population values are called parameters, and denoted by Greek letters (μ for population mean, σ for population standard deviation).

Inferential statistics describes the methods we use to estimate population parameters from random samples; how we can quantify the level of inaccuracy in a sample statistic; and how we can go on to use these estimates to compare populations.

## Sampling error

There are many reasons why a sample may give an inaccurate picture of the population it represents: it may be biased, it may not be big enough, and it may not be truly random. However, even if we have been careful to avoid these pitfalls, there is an inherent difference between the sample and the population at large. To illustrate this, let us imagine that the actual average height of males in London is 174 cm. If I were to sample 100 male Londoners and take a mean of their heights, I would be very unlikely to get exactly 174 cm. Furthermore, if somebody else were to perform the same exercise, it would be unlikely that they would get the same answer as I did. The sample mean is different each time it is taken, and the way it differs from the actual mean of the population is described by the standard error of the mean (standard error, or SEM ). The standard error is larger if there is a lot of variation in the population, and becomes smaller as the sample size increases. It is calculated thus:

\[
\mathrm{SEM} = \frac{\mathrm{SD}}{\sqrt{n}}
\]

where SD is the sample standard deviation, and n is the sample size.

As errors are normally distributed, we can use this to estimate a 95% confidence interval on our sample mean as follows:

\[
\bar{x} \pm (1.96 \times \mathrm{SEM})
\]

We can interpret this as meaning ‘We are 95% confident that the actual mean is within this range.’

Some confusion arises at this point between the SD and the standard error. The SD is a measure of variation in the sample. The range \(\bar{x} \pm (1.96 \times \mathrm{SD})\) will normally contain 95% of all your data. It can be used to illustrate the spread of the data and shows what values are likely. In contrast, standard error tells you about the precision of the mean and is used to calculate confidence intervals.
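The distinction between the SD and the standard error can be seen numerically. The sketch below computes both for a small sample of heights. The ten values are made up purely for illustration, and the function name `mean_with_ci` is hypothetical.

```python
import math
import statistics


def mean_with_ci(sample, z=1.96):
    """Return mean, SD, SEM, and an approximate 95% confidence interval."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)      # sample standard deviation
    sem = sd / math.sqrt(n)            # standard error of the mean
    return mean, sd, sem, (mean - z * sem, mean + z * sem)


# Made-up heights in cm, used only to illustrate the calculation.
heights_cm = [168.0, 171.5, 174.0, 176.5, 180.0, 172.0, 169.5, 178.0, 175.0, 173.5]

mean, sd, sem, (ci_low, ci_high) = mean_with_ci(heights_cm)

print(f"mean = {mean:.1f} cm, SD = {sd:.2f} cm, SEM = {sem:.2f} cm")
print(f"95% CI for the mean: {ci_low:.1f} to {ci_high:.1f} cm")
print(f"range covering most data: {mean - 1.96 * sd:.1f} to {mean + 1.96 * sd:.1f} cm")
```

The confidence interval for the mean is narrower than the range describing the spread of the data by a factor of \(\sqrt{n}\).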

One straightforward way to compare two samples is to use confidence intervals. If we calculate the mean height of two groups and find that the 95% confidence intervals do not overlap, this can be taken as evidence of a difference between the two means. This method of statistical inference is reasonably intuitive and can be used in many situations. 2 Many journals, however, prefer to report inferential statistics using p-values.

## Inference testing using a null hypothesis

In 1925, the British statistician R.A. Fisher described a technique for comparing groups using a null hypothesis, a method which has dominated statistical comparison ever since. The technique itself is rather straightforward, but often gets lost in the mechanics of how it is done. To illustrate, imagine we want to compare the HR of two different groups of people. We take a random sample from each group, which we call our data. Then:

- (i) Assume that both samples came from the same group. This is our ‘null hypothesis’.
- (ii) Calculate the probability that an experiment would give us these data, assuming that the null hypothesis is true. We express this probability as a p-value, a number between 0 and 1, where 0 is ‘impossible’ and 1 is ‘certain’.
- (iii) If the probability of the data is low, we reject the null hypothesis and conclude that there must be a difference between the two groups.

Formally, we can define a p-value as ‘the probability of finding the observed result or a more extreme result, if the null hypothesis were true.’ Standard practice is to set a cut-off at p < 0.05 (this cut-off is termed the alpha value). If the null hypothesis were true, a result such as this would only occur 5% of the time or less; this in turn would indicate that the null hypothesis itself is unlikely. Fisher described the process as follows: ‘Set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.’ 3 This probably remains the most succinct description of the procedure.

A question which often arises at this point is ‘Why do we use a null hypothesis?’ The simple answer is that it is easy: we can readily describe what we would expect of our data under a null hypothesis, we know how data would behave, and we can readily work out the probability of getting the result that we did. It therefore makes a very simple starting point for our probability assessment. All probabilities require a set of starting conditions, in much the same way that measuring the distance to London needs a starting point. The null hypothesis can be thought of as an easy place to put the start of your ruler.

If a null hypothesis is rejected, an alternate hypothesis must be adopted in its place. The null and alternate hypotheses must be mutually exclusive, but must also between them describe all situations. If a null hypothesis is ‘no difference exists’ then the alternate should be simply ‘a difference exists’.

## Hypothesis testing in practice

The components of a hypothesis test can be readily described using the acronym GOST: identify the Groups you wish to compare; define the Outcome to be measured; collect and Summarise the data; then evaluate the likelihood of the null hypothesis, using a Test statistic.

When considering groups, think first about how many. Is there just one group being compared against an audit standard, or are you comparing one group with another? Some studies may wish to compare more than two groups. Another situation may involve a single group measured at different points in time, for example before or after a particular treatment. In this situation each participant is compared with themselves, and this is often referred to as a ‘paired’ or a ‘repeated measures’ design. It is possible to combine these types of groups—for example a researcher may measure arterial BP on a number of different occasions in five different groups of patients. Such studies can be difficult, both to analyse and interpret.

In other studies we may want to see how a continuous variable (such as age or height) affects the outcomes. These techniques involve regression analysis, and are beyond the scope of this article.

The outcome measures are the data being collected. This may be a continuous measure, such as temperature or BMI, or it may be a categorical measure, such as ASA status or surgical specialty. Often, inexperienced researchers will strive to collect lots of outcome measures in an attempt to find something that differs between the groups of interest; if this is done, a ‘primary outcome measure’ should be identified before the research begins. In addition, the results of any hypothesis tests will need to be corrected for multiple measures.
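The article does not say which correction to apply for multiple measures. One common and simple choice is the Bonferroni correction, which divides the alpha cut-off by the number of tests. The sketch below assumes that method; the p-values and the function name `bonferroni_significant` are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values stay significant when alpha is split across all tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values for three outcome measures from the same study.
flags = bonferroni_significant([0.01, 0.04, 0.03])
print(flags)
```

With three outcomes the per-test threshold drops to about 0.017, so results that would pass an uncorrected 0.05 cut-off may no longer count as significant.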

The summary and the test statistic will be defined by the type of data that have been collected. The test statistic is calculated, then transformed into a p-value using tables or software. It is worth looking at two common tests in a little more detail: the χ² test and the t-test.

## Categorical data: the χ² test

The χ² test of independence is a test for comparing categorical outcomes in two or more groups. For example, a number of trials have compared surgical site infections in patients who have been given different concentrations of oxygen perioperatively. In the PROXI trial, 685 patients received oxygen 80%, and 701 patients received oxygen 30%. 4 In the 80% group there were 131 infections, while in the 30% group there were 141 infections. In this study, the groups were oxygen 80% and oxygen 30%, and the outcome measure was the presence of a surgical site infection.

The summary is a table (Table 1), and the hypothesis test compares this table (the ‘observed’ table) with the table that would be expected if the proportion of infections in each group was the same (the ‘expected’ table). The test statistic is χ², from which a p-value is calculated. In this instance the p-value is 0.64, which means that a difference at least as large as this would occur 64% of the time if the null hypothesis were true. We thus have no evidence to reject the null hypothesis; the observed difference is readily explained by sampling variation rather than by an inherent difference between the two groups.

## Table 1

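Summary of the results of the PROXI trial. Figures are numbers of patients.

| | Oxygen 80% | Oxygen 30% |
|---|---|---|
| Surgical site infection | 131 | 141 |
| No surgical site infection | 554 | 560 |
| Total | 685 | 701 |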
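The comparison of observed and expected tables can be sketched in a few lines. This is a minimal illustration using only the counts quoted in the text; the function name `chi_square_2x2` is hypothetical, and the sketch assumes a Pearson χ² statistic without continuity correction (the article does not say which variant was used).

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the infection proportion were the same in both groups.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Rows: oxygen 80%, oxygen 30%; columns: infection, no infection.
chi2, p_value = chi_square_2x2(131, 685 - 131, 141, 701 - 141)
print(f"chi2 = {chi2:.3f}, p = {p_value:.2f}")
```

Under these assumptions the p-value rounds to the 0.64 quoted in the text. The closed-form `erfc` expression only applies to a 2 × 2 table; larger tables need a general χ² distribution function from a statistics library.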

## Continuous data: the t-test

The t-test is a statistical method for comparing means, and is one of the most widely used hypothesis tests. Imagine a study where we try to see if there is a difference in the onset time of a new neuromuscular blocking agent compared with suxamethonium. We could enlist 100 volunteers, give them a general anaesthetic, and randomise 50 of them to receive the new drug and 50 of them to receive suxamethonium. We then time how long it takes (in seconds) to have ideal intubation conditions, as measured by a quantitative nerve stimulator. Our data are therefore a list of times. In this case, the groups are ‘new drug’ and suxamethonium, and the outcome is time, measured in seconds. This can be summarised by using means; the hypothesis test will compare the means of the two groups, using a p-value calculated from a ‘t statistic’. Hopefully it is becoming obvious at this point that the test statistic is usually identified by a letter, and this letter is often cited in the name of the test.

The t-test comes in a number of guises, depending on the comparison being made. A single sample can be compared with a standard (Is the BMI of school leavers in this town different from the national average?); two samples can be compared with each other, as in the example above; or the same study subjects can be measured at two different times. The latter case is referred to as a paired t-test, because each participant provides a pair of measurements, as in a pre- and post-intervention study.
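As a rough sketch of the two-sample case, the code below computes a t statistic and its degrees of freedom. It assumes the classic Student's form with a pooled variance, which the article does not specify. The onset times are invented for illustration and the function name `pooled_t_statistic` is hypothetical. Converting the statistic to a p-value is left to tables or software, as described earlier.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Student's two-sample t statistic (equal variances assumed) and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df

# Hypothetical onset times in seconds (invented for illustration, not trial data).
new_drug = [62, 58, 65, 70, 61, 66]
suxamethonium = [55, 52, 60, 57, 54, 58]

t_stat, df = pooled_t_statistic(new_drug, suxamethonium)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

A paired design would instead take the difference within each participant and test whether the mean of those differences is zero.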

A large number of methods for testing hypotheses exist; the commonest ones and their uses are described in Table 2. In each case, the test can be described by detailing the groups being compared (Table 2, columns), the outcome measures (rows), the summary, and the test statistic. The decision to use a particular test or method should be made during the planning stages of a trial or experiment. At this stage, an estimate needs to be made of how many test subjects will be needed. Such calculations are described in detail elsewhere. 5

## Table 2

The principal types of hypothesis test. Tests comparing more than two samples can indicate that one group differs from the others, but will not identify which. Subsequent ‘post hoc’ testing is required if a difference is found.

## Controversies surrounding hypothesis testing

Although hypothesis tests have been the basis of modern science since the middle of the 20th century, they have been plagued by misconceptions from the outset; this has led to what has been described as a crisis in science in the last few years: some journals have gone so far as to ban p-values outright. 6 This is not because of any flaw in the concept of a p-value, but because of a lack of understanding of what p-values mean.

Possibly the most pervasive misunderstanding is the belief that the p-value is the chance that the null hypothesis is true, or that the p-value represents the frequency with which you will be wrong if you reject the null hypothesis (i.e. claim to have found a difference). This interpretation has frequently made it into the literature, and is a very easy trap to fall into when discussing hypothesis tests. To avoid this, it is important to remember that the p-value is telling us something about our sample, not about the null hypothesis. Put in simple terms, we would like to know the probability that the null hypothesis is true, given our data. The p-value tells us the probability of getting these data if the null hypothesis were true, which is not the same thing. This fallacy is referred to as ‘flipping the conditional’; the probability of an outcome under certain conditions is not the same as the probability of those conditions given that the outcome has happened.

A useful example is to imagine a magic trick in which you select a card from a normal deck of 52 cards, and the performer reveals your chosen card in a surprising manner. If the performer were relying purely on chance, this would only happen on average once in every 52 attempts. On the basis of this, we conclude that it is unlikely that the magician is simply relying on chance. Although simple, we have just performed an entire hypothesis test. We have declared a null hypothesis (the performer was relying on chance); we have even calculated a p-value (1 in 52, ≈ 0.02); and on the basis of this low p-value we have rejected our null hypothesis. We would, however, be wrong to suggest that there is a probability of 0.02 that the performer is relying on chance; that is not what our figure of 0.02 is telling us.

To explore this further we can create two populations, and watch what happens when we use simulation to take repeated samples to compare these populations. Computers allow us to do this repeatedly, and to see what p-values are generated (see Supplementary online material). 7 Fig 1 illustrates the results of 100,000 simulated t-tests, generated in two sets of circumstances. In Fig 1a, we have a situation in which there is a difference between the two populations. The p-values cluster below the 0.05 cut-off, although there is a small proportion with p > 0.05. Interestingly, the proportion of comparisons where p < 0.05 is 0.8 or 80%, which is the power of the study (the sample size was specifically calculated to give a power of 80%).

Fig 1. The p-values generated when 100,000 t-tests are used to compare two samples taken from defined populations. (a) The populations have a difference and the p-values are mostly significant. (b) The samples were taken from the same population (i.e. the null hypothesis is true) and the p-values are distributed uniformly.

Figure 1b depicts the situation where repeated samples are taken from the same parent population (i.e. the null hypothesis is true). Somewhat surprisingly, all p-values occur with equal frequency, with p < 0.05 occurring exactly 5% of the time. Thus, when the null hypothesis is true, a type I error will occur with a frequency equal to the alpha significance cut-off.
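A small simulation can reproduce both behaviours. The sketch below is not the author's supplementary code. To stay within the Python standard library it swaps the t-test for a two-sided z-test with a known standard deviation, and it runs 5,000 tests per scenario rather than 100,000. The group size of 16 and the difference of one standard deviation are assumptions chosen to give roughly 80% power for that z-test.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulated_p_values(true_difference, n_per_group=16, sd=1.0, n_tests=5000, seed=1):
    """Two-sided z-test p-values from repeated sampling of two normal populations."""
    rng = random.Random(seed)
    standard_normal = NormalDist()
    standard_error = sd * sqrt(2 / n_per_group)
    p_values = []
    for _ in range(n_tests):
        group_a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(true_difference, sd) for _ in range(n_per_group)]
        z = (sum(group_b) / n_per_group - sum(group_a) / n_per_group) / standard_error
        p_values.append(2 * (1 - standard_normal.cdf(abs(z))))
    return p_values

def fraction_significant(p_values, alpha=0.05):
    """Proportion of simulated tests with p below the alpha cut-off."""
    return sum(p < alpha for p in p_values) / len(p_values)

with_difference = fraction_significant(simulated_p_values(true_difference=1.0))
null_true = fraction_significant(simulated_p_values(true_difference=0.0))

print(f"p < 0.05 when a difference exists: {with_difference:.3f}")
print(f"p < 0.05 when the null is true:    {null_true:.3f}")
```

The first proportion estimates the power of the design, and the second estimates the type I error rate, which should sit close to the alpha cut-off.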

Figure 1 highlights the underlying problem: when presented with a p-value < 0.05, is it possible, with no further information, to determine whether you are looking at something from Fig 1a or Fig 1b?

Finally, it cannot be stressed enough that although hypothesis testing identifies whether or not a difference is likely, it is up to us as clinicians to decide whether or not a statistically significant difference is also significant clinically.

## Hypothesis testing: what next?

As mentioned above, some have suggested moving away from p-values, but it is not entirely clear what we should use instead. Some sources have advocated focussing more on effect size; however, without a measure of significance we have merely returned to our original problem: how do we know that our difference is not just a result of sampling variation?

One solution is to use Bayesian statistics. Up until very recently, these techniques have been considered both too difficult and not sufficiently rigorous. However, recent advances in computing have led to the development of Bayesian equivalents of a number of standard hypothesis tests. 8 These generate a ‘Bayes Factor’ (BF), which tells us how much more (or less) likely the alternate hypothesis is after our experiment. A BF of 1.0 indicates that the likelihood of the alternate hypothesis has not changed. A BF of 10 indicates that the alternate hypothesis is 10 times more likely than we originally thought. A number of classifications for BF exist; a BF greater than 10 can be considered ‘strong evidence’, while a BF greater than 100 can be classed as ‘decisive’.
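The ‘times more likely’ wording is most precise when read in terms of odds: the posterior odds of the alternate hypothesis equal the prior odds multiplied by the BF. The sketch below applies that relationship. The 50% prior and the function name `update_probability` are assumptions for illustration only.

```python
def update_probability(prior_probability, bayes_factor):
    """Posterior probability of the alternate hypothesis: posterior odds = prior odds x BF."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# Hypothetical prior: the alternate hypothesis is judged 50% likely before the study.
for bf in (1, 10, 100):
    print(f"BF = {bf:>3}: posterior probability = {update_probability(0.5, bf):.3f}")
```

A BF of 1 leaves the prior untouched, while larger values shift the probability towards the alternate hypothesis by an amount that depends on where the prior started.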

Figures such as the BF can be quoted in conjunction with the traditional p-value, but it remains to be seen whether they will become mainstream.

## Declaration of interest

The author declares that they have no conflict of interest.


Are you looking to take your research project to the next level? Have you heard of the power of a hypothesis but need to figure out how to formulate one that will unlock potential discoveries? We can help!

This blog post explores what makes a well-crafted and powerful hypothesis, from identifying a research question to developing supporting evidence.

By learning how to craft a compelling hypothesis, you'll have greater success in every step of your research project.

## What are hypotheses, and why are they important?

A hypothesis is an educated guess or a proposition, based on limited evidence, that serves as a starting point for further investigation. It provides a framework for research and allows researchers to refine their ideas, collect data, and draw conclusions. Hypotheses are essential to the process because they enable us to organize our thoughts and test theories properly.

Hypotheses are used in many fields, from medicine to psychology to economics. In each area, developing hypotheses based on observations enables researchers to make predictions about their data and guides them toward finding meaningful results.

For example, in medicine, hypotheses can be used to predict which treatments will be most effective for particular conditions or which drugs may have adverse effects when taken together. This allows doctors to make better decisions when caring for patients.

In psychology, hypotheses are often used in experiments to determine whether certain variables influence behavior or mental processes. By testing different combinations of variables, psychologists can identify patterns and understand why people behave the way they do.

In economics, hypotheses provide economists with a framework for analyzing the relationship between economic variables such as wages and consumer spending habits. By understanding these relationships, economists can better understand how economic forces affect the economy.

Overall, hypotheses play an essential role in helping scientists develop new ideas and draw meaningful conclusions from the collected data. Without taking the step to create hypotheses, it would be difficult for researchers to make sense of the vast amounts of information available today and use it effectively in their investigations.

## How to determine an effective research question to form your hypothesis

When conducting research, having a compelling research question is critical. Properly formulating this question will allow the researcher to develop their hypothesis. A research question provides a clear and focused goal for your research study and also gives direction on how to get there. A compelling research question should be specific, answerable in the context of your field of study, significant, novel (not already answered by previous studies), and timely – that is, relevant to current events or trends.

Before determining the best research question, you must first understand your topic. Think about the area of knowledge that interests you most and narrow it down to a single theme or concept within this topic. Focus on what interests you most within this theme, and make sure there is room for further exploration and analysis. Once you have chosen a specific topic and narrowed down your focus, you can begin formulating questions related to your project.

To ensure relevance and impact to your field of study, choose questions that address essential issues in the literature or suggest solutions to existing problems. Avoid overly broad topics with unclear objectives; instead, opt for focused questions to enable targeted data collection and analysis with concrete results.

Additionally, consider time frames when formulating questions. If the issue has been discussed extensively in the past but has not been revisited recently, then it's likely not worthy of a new investigation.

Once you have developed some potential questions related to your topic, review them carefully and decide which question best captures the essence of what you want to learn through researching this topic.

Ask yourself:

- Is this question answerable?
- Does it fit within my field of study?
- Is it significant enough?
- Would its findings be novel?

If so, then congratulations! You have identified a compelling research question.

## Tips for writing a well-crafted hypothesis

Once you have formulated your research question, you can develop the formal hypothesis. When composing a hypothesis, it's essential to think carefully about the question you are trying to answer.

A solid hypothesis should be testable, meaning that it can be verified or disproved through research. It should also be specific and focused on one issue at a time. Here are some tips for writing one:

- Consider the goal of your research: Think about what it is that you want to learn or determine from your experiment and make sure that your hypothesis reflects this goal.
- Create an educated guess as to why something is happening: Your hypothesis should explain why something is occurring based on what evidence you already have and direct further investigation into the matter. For example, if you hypothesize that increased carbon dioxide levels in the atmosphere will lead to global warming, your research should focus on examining this relationship further.
- Define any variables or parameters involved in the experiment: This includes things like temperature or chemical composition that could potentially affect the outcome of any experiments done in pursuit of testing your hypothesis.
- Use clear and precise language: Make sure your hypothesis is written with clear and precise language so that anyone reading it can understand exactly what you are attempting to investigate or explain. Avoid complex words and keep sentences short whenever possible.

Following these simple tips will help ensure that your hypothesis is well-crafted and ready for testing!

## Examples of evidence that can support your hypothesis

When it comes to developing a hypothesis, supporting evidence is essential for making sure it holds up. This evidence strengthens the argument being made by providing facts and logical reasoning that support the hypothesis.

Examples of evidence that can be used to back up a hypothesis include using data from experiments, case studies, and other research projects. Data from experiments can provide insight into how certain variables interact to form a particular outcome.

Case studies may offer greater depth in understanding a specific phenomenon's cause and effect; research projects may yield results that confirm or refute existing theories on a subject.

In addition to these traditional forms of evidence, personal experiences or observations can also help to support a hypothesis. For example, if someone's daily commute has been consistently faster since they changed routes, they could use their personal experience to argue that making this change resulted in shorter commutes.

Similarly, suppose someone has witnessed how two variables consistently coincide (e.g., when one goes up, the other goes down). In that case, this could be used to support the notion that there is some correlation between these two aspects.

Overall, evidence to support your hypothesis is crucial for ensuring its validity and credibility. While conducting experiments or researching may seem like time-consuming processes, having solid supporting evidence will make it much easier to defend your ideas convincingly when challenged.

Therefore, it is crucial to take the time necessary to gather credible sources of information to provide the most substantial possible backing for your hypotheses.

## Understanding the potential of hypotheses and how they can help your research project progress

The power of research lies in the ability to develop and test hypotheses. A hypothesis is a statement or an idea that can be tested to determine its validity.

Essentially, it is a form of educated guesswork that helps researchers form conclusions about their data. By developing a hypothesis for a research project, you are effectively setting up the framework for further exploration.

When developing a hypothesis, you must consider both the expected outcomes and possible alternative explanations. This will help you focus on testing the possible results without getting sidetracked by irrelevant information. Once you have established a concrete hypothesis, it can then be used as a basis for further research and experimentation.

The process of testing hypotheses is an integral part of the scientific method and can help researchers build confidence in their findings and conclusions. Through careful observation and experimentation, researchers can compare their results against what they initially hypothesized, allowing them to draw more accurate conclusions about their data. As such, hypotheses play an essential role in helping researchers connect the dots between different pieces of evidence and form meaningful conclusions.

Overall, understanding how hypotheses can be used in research projects helps researchers progress towards meaningful insights from their data. By setting up expectations ahead of time and then testing them against real-world conditions, researchers can gain valuable insights that could potentially change the way we understand our world – now that's something worth exploring!

## Final thoughts

A hypothesis is a proposed explanation for an observable phenomenon. It's important to note that hypotheses are not the same thing as theories: a theory is a much broader and well-established frame of reference that explains multiple phenomena.

Generally, scientists form a research question and then narrow it down to a testable hypothesis. After making observations and conducting experiments to gather data, researchers can use evidence to support or reject the hypothesis.

By following these steps to formulate a solid hypothesis, you will be on your way to developing a successful research project. Happy researching!


## Hypothesis Testing - Writing, Examples and Steps

An empirical study begins with writing a hypothesis. If there is no hypothesis, we will not be able to test any cause and effect relationship. Therefore, it’s important to write a hypothesis that can be tested and can offer some great insights into a situation.

We’ve been using the word “hypothesis” quite frequently in previous econometrics articles. In fact, we have represented a hypothesis statistically, developed econometrics models and calculated the extent to which an independent variable affects a dependent variable. However, we haven’t formally defined it. So, here we go:

In the simplest words, a hypothesis:

- is an assumption or a supposition,
- which can be tested and analyzed
- on the basis of one’s observation
- of a set of random variables

And a statistical hypothesis is an assumption about a situation or a population that can be represented and tested via any or a combination of statistical methods.

Therefore, the main elements of a hypothesis include:

- a guess, a theory or a statement
- observation (study of something)
- population (data to be gathered by observing population)

However, a meticulously thought-out and refined hypothesis is not a guess.

## How to Write a Hypothesis?

You know what a hypothesis is, what purpose it serves, and how it is to be tested. The entire study or experiment revolves around a hypothesis. So, a slight mistake in writing a hypothesis could result in wasted time, money and effort.

While testing a hypothesis is a complex procedure, writing a hypothesis is the trickiest part. Needless to say, you need to be extremely careful when writing a hypothesis that you’re going to test. It means thinking about the right question: a question that can be tested and whose results can enhance your understanding or meet your objectives.

Remember that there is no single tried and tested method of writing a hypothesis. You can see a generic relationship between two variables and then can refine it. Here is an example:

“Males and females handle employee issues differently”.

In this statement, we wrote a generic hypothesis. It is not measurable.

“Females handle employee issues better than males”.

The second statement provides a direction, as in who does better. When you compare two things, it means a situation is measurable.

“If females are assigned the task of handling employee issues, then they will do a better job than males because females have a higher emotional quotient”.

The third statement, as you can see, offers specific details. The difference in the level of emotional quotient of males and females sets the scene. It is measurable and quantifiable.

Therefore, a well written hypothesis should be:

- Quantifiable

## Hypothesis Testing

Hypothesis testing refers to a formal process of investigating a supposition or statement to accept or reject it. The econometricians examine a random sample from the population. If it is consistent with the hypothesis, it is accepted. Otherwise it is rejected.

## Types of Hypothesis

There are two types of hypothesis – Null and Alternative.

A hypothesis test concludes whether to reject the null hypothesis and accept the alternative hypothesis or to fail to reject the null hypothesis. The decision is based on the value of X and R.

Points to be noted:

## Decision Errors in Hypothesis Testing

Before we jump onto the process of hypothesis testing, let’s learn about the errors that can result from it. The errors are divided into two categories:

- Type I error is when a researcher rejects a null hypothesis that is actually true, that is, reports an effect that is not present.
- The probability of committing a Type I error is called alpha and is denoted by α.
- This probability is called the significance level.
- Also known as the error of the first kind.
- It is a focus of skepticism.
- Type II error is when a researcher fails to detect an effect that is present.
- The probability of committing a Type II error is called beta and is denoted by β.
- The probability of not committing a Type II error is called the power of the test (1 – β).
- Also known as the error of the second kind.

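| Decision | When the null hypothesis is true | When the alternative hypothesis is true |
|---|---|---|
| Reject the null hypothesis | Type I error (probability α) | Correct decision (probability 1 – β) |
| Fail to reject the null hypothesis | Correct decision (probability 1 – α) | Type II error (probability β) |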

## Steps in Hypothesis Testing

Econometricians follow a formal process to test a hypothesis and determine whether it is to be rejected. The steps include:

1. State the null and alternative hypotheses. Remember that these are mutually exclusive: if one hypothesis states a fact, the other must reject it.
2. Consider the statistical assumptions, such as independence of observations from each other, normality of observations, random errors and the probability distribution of random errors, randomization during sampling, etc.
3. Decide which test is to be carried out to test the hypothesis and, at the same time, how the sample data will be used to test the null hypothesis.
4. Examine the sample data. This is when we find scores: mean values, normal distribution, t distribution, z score, etc.
5. Decide either to reject the null hypothesis in favor of the alternative hypothesis or not to reject the null hypothesis.

## Accepting or Rejecting Null Hypothesis

This is an extension of the last step, interpreting results, in the process of hypothesis testing. A null hypothesis is accepted or rejected on the basis of the P value and the region of acceptance.

P value – it is a function of the observed sample results. A threshold value is chosen before the test is conducted and is called the significance level, which is represented as α. If the calculated value of P ≤ α, it suggests an inconsistency between the observed data and the assumption that the null hypothesis is true. This suggests that the null hypothesis must be rejected. However, this doesn’t mean that the alternative hypothesis can be accepted as true. If the null hypothesis is in fact true, rejecting it is a Type I error.

Example: You roll a pair of dice once, assuming that they are fair.

The null hypothesis is that the dice are fair. You’ve assumed a significance level (α) of 0.04.

Now you roll the dice and observe that both show 6. The p value will be 1/36, or 1/(6*6), assuming that under the null hypothesis every outcome is equally likely. The p value comes out to be 0.028, which is less than the assumed value of α. On this basis the null hypothesis is rejected. It suggests that the assumption that the dice are fair is not correct.
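The arithmetic of the dice example can be checked in a few lines. This is a minimal sketch that uses only the numbers given in the example.

```python
from fractions import Fraction

# Null hypothesis: both dice are fair, so each of the 36 ordered outcomes is equally likely.
p_value = Fraction(1, 6) * Fraction(1, 6)  # probability of a double six
alpha = 0.04                               # significance level from the example

print(f"p = {float(p_value):.3f}, reject null: {p_value <= alpha}")
```

Using `Fraction` keeps the probability exact until it is formatted for display.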

Region of Acceptance – It is the range of values that leads you to accept the null hypothesis. When you collect and observe sample data, you compute a test statistic. If its value falls within the specific range, the null hypothesis is accepted.

Example: You might hypothesize that the average weight of the students in a school is 30 kg. To test this hypothesis, you collect a random sample and compute the sample mean. If the sample mean falls close to the hypothesized mean, say between 29 and 31, you accept the null hypothesis. The region of acceptance, therefore, is the range from 29 to 31. The values falling outside this region will fall in the region of rejection.

## Hypothesis Tests – One-Tailed and Two-Tailed

The region of acceptance or rejection can be directional or non-directional. On this basis, we decide whether to perform a one-tailed or two-tailed test to accept or reject the hypothesis.

## One-Tailed Test

When the region of rejection falls entirely in one tail of the sampling distribution, a one-tailed test is conducted. In other words, a test of a statistical hypothesis is one-tailed when the values that lead to rejection lie on only one side of the sampling distribution.

Example: A null hypothesis says that the marriageable age of a person is greater than or equal to 24. Then, the alternative hypothesis would be that the marriageable age is less than 24. The region of rejection, in this case, would be on the left hand side of the sampling distribution, which is the set of numbers less than 24.

## Two-Tailed Test

When the region of rejection falls on both sides of the sampling distribution, it’s a two-tailed test.

Example: The null hypothesis says that the marriageable age of a person is equal to 24. Then, the alternative hypothesis would be that the marriageable age is less than or greater than 24. The region of rejection, in this case, would be on both sides of the sampling distribution, which are two sets of numbers – one greater than 24 and the other less than 24.
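The practical difference between the two kinds of test shows up in the p-value. The sketch below assumes a test statistic that follows a standard normal distribution. The value z = -1.8 and the function name `tail_p_values` are hypothetical.

```python
from statistics import NormalDist

def tail_p_values(z):
    """Left-tailed and two-tailed p-values for a standard normal test statistic z."""
    standard_normal = NormalDist()
    left_tailed = standard_normal.cdf(z)
    two_tailed = 2 * (1 - standard_normal.cdf(abs(z)))
    return left_tailed, two_tailed

# Hypothetical test statistic: the sample mean lies 1.8 standard errors below 24.
left_p, two_p = tail_p_values(-1.8)
print(f"one-tailed p = {left_p:.3f}, two-tailed p = {two_p:.3f}")
```

At a 5% significance level, this hypothetical result would fall in the region of rejection for the one-tailed test but not for the two-tailed test, because the two-tailed test splits the rejection region across both sides of the distribution.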

## Sample Problem

An election commission supposes that at least 80% of the 1,000,000 voters will turn up to vote in upcoming elections. A survey of 100 randomly sampled voters finds that only 71 percent will turn up. How do we find the region of acceptance, assuming a significance level of 0.05 or 5%?

Formulate hypotheses

Null Hypothesis: At least 80% of the voters will turn up to vote.

H0: P ≥ 0.80

Alternative Hypothesis: Less than 80% will turn up to vote.

Ha: P < 0.80

Data Sampling

The sample of the population is taken randomly.

Formulating an Analysis Plan

The proportion of sample voters who say that they will turn up to vote is 71% or 0.71. This is also the test statistic.

Investigating the Data

Under the null hypothesis, the mean of the sampling distribution is taken to be 0.80, the hypothesized proportion of voters who will turn up to vote.

Standard deviation (σ) = √ [ {P*(1-P)/n} * {(N-n)/(N-1)} ]

P = test value specified in null hypothesis

n = sample size

N = population size

σ = √ [ {(0.80 * 0.20)/100} * {(1,000,000 – 100)/(1,000,000 – 1)} ]

σ = √ [0.0016 * 0.9999] = √0.0015998 ≈ √0.0016 = 0.04

Finding the lower and upper limits of region of acceptance

The upper limit will be equal to 100% or 1 since this is the highest proportion of the population.

The lower limit (LL) is the value for which P(X’ ≤ LL) = α = 0.05, where X’ is the sample proportion.

If we put these values (mean 0.80, standard deviation 0.04) in a statistical normal distribution calculator, LL comes out to be 0.734.

This means that the region of acceptance lies between 0.734 and 1.

Accepting or Rejecting the Hypothesis

The survey suggested that 71% of voters will turn up to vote. But the region of acceptance is between 0.734 and 1. This means that 0.71 falls outside the region of acceptance, in the region of rejection on the left-hand side. Therefore, we reject the null hypothesis that at least 80% of the voters will turn out to vote in the upcoming elections.
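The whole sample problem can be reproduced with the Python standard library, with `statistics.NormalDist` standing in for the normal distribution calculator. All figures come from the problem statement; the variable names are chosen for this sketch.

```python
from math import sqrt
from statistics import NormalDist

p_null = 0.80             # proportion stated in the null hypothesis
n = 100                   # sample size
N = 1_000_000             # population size
alpha = 0.05              # significance level
sample_proportion = 0.71  # test statistic from the survey

# Standard deviation of the sample proportion, with the finite population correction.
sigma = sqrt((p_null * (1 - p_null) / n) * ((N - n) / (N - 1)))

# Lower limit of the region of acceptance: the value LL with P(X' <= LL) = alpha.
lower_limit = NormalDist(mu=p_null, sigma=sigma).inv_cdf(alpha)
reject_null = sample_proportion < lower_limit

print(f"sigma = {sigma:.3f}, LL = {lower_limit:.3f}, reject null: {reject_null}")
```

Because the population is so much larger than the sample, the finite population correction is almost exactly 1 and barely changes the result.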

How to Write a Great Hypothesis

Hypothesis Definition, Format, Examples, and Tips

Kendra Cherry, MS, is a psychosocial rehabilitation specialist, psychology educator, and author of the "Everything Psychology Book."

Amy Morin, LCSW, is a psychotherapist and international bestselling author. Her books, including "13 Things Mentally Strong People Don't Do," have been translated into more than 40 languages. Her TEDx talk, "The Secret of Becoming Mentally Strong," is one of the most viewed talks of all time.

- The Scientific Method
- Hypothesis Format
- Falsifiability of a Hypothesis
- Operationalization
- Hypothesis Types
- Hypotheses Examples
- Collecting Data

A hypothesis is a tentative statement about the relationship between two or more variables. It is a specific, testable prediction about what you expect to happen in a study. It is a preliminary answer to your question that helps guide the research process.

Consider a study designed to examine the relationship between sleep deprivation and test performance. The hypothesis might be: "This study is designed to assess the hypothesis that sleep-deprived people will perform worse on a test than individuals who are not sleep-deprived."

## At a Glance

A hypothesis is crucial to scientific research because it offers a clear direction for what the researchers are looking to find. This allows them to design experiments to test their predictions and add to our scientific knowledge about the world. This article explores how a hypothesis is used in psychology research, how to write a good hypothesis, and the different types of hypotheses you might use.

## The Hypothesis in the Scientific Method

In the scientific method, whether it involves research in psychology, biology, or some other area, a hypothesis represents what the researchers think will happen in an experiment. The scientific method involves the following steps:

- Forming a question
- Performing background research
- Creating a hypothesis
- Designing an experiment
- Collecting data
- Analyzing the results
- Drawing conclusions
- Communicating the results

The hypothesis is a prediction, but it involves more than a guess. Most of the time, the hypothesis begins with a question which is then explored through background research. At this point, researchers then begin to develop a testable hypothesis.

Unless you are creating an exploratory study, your hypothesis should always explain what you expect to happen.

In a study exploring the effects of a particular drug, the hypothesis might be that researchers expect the drug to have some type of effect on the symptoms of a specific illness. In psychology, the hypothesis might focus on how a certain aspect of the environment might influence a particular behavior.

Remember, a hypothesis does not have to be correct. While the hypothesis predicts what the researchers expect to see, the goal of the research is to determine whether this prediction is right or wrong. When conducting an experiment, researchers might explore numerous factors to determine which ones might contribute to the ultimate outcome.

In many cases, researchers may find that the results of an experiment do not support the original hypothesis. When writing up these results, the researchers might suggest other options that should be explored in future studies.

Researchers often draw a hypothesis from a specific theory or build on previous research. For example, prior research has shown that stress can impact the immune system. So a researcher might hypothesize: "People with high stress levels will be more likely to contract a common cold after being exposed to the virus than people who have low stress levels."

In other instances, researchers might look at commonly held beliefs or folk wisdom. "Birds of a feather flock together" is one example of a folk adage that a psychologist might try to investigate. The researcher might pose a specific hypothesis that "People tend to select romantic partners who are similar to them in interests and educational level."

## Elements of a Good Hypothesis

So how do you write a good hypothesis? When trying to come up with a hypothesis for your research or experiments, ask yourself the following questions:

- Is your hypothesis based on your research on a topic?
- Can your hypothesis be tested?
- Does your hypothesis include independent and dependent variables?

Before you come up with a specific hypothesis, spend some time doing background research. Once you have completed a literature review, start thinking about potential questions you still have. Pay attention to the discussion section in the journal articles you read. Many authors will suggest questions that still need to be explored.

## How to Formulate a Good Hypothesis

To form a hypothesis, you should take these steps:

- Collect as many observations about a topic or problem as you can.
- Evaluate these observations and look for possible causes of the problem.
- Create a list of possible explanations that you might want to explore.
- After you have developed some possible hypotheses, think of ways that you could confirm or disprove each hypothesis through experimentation. This is known as falsifiability.

In the scientific method, falsifiability is an important part of any valid hypothesis. To test a claim scientifically, it must be possible for the claim to be proven false.

Students sometimes take falsifiability to mean that a claim is false, which is not the case. Falsifiability means that if a claim were false, it would be possible to demonstrate that it is false.

One of the hallmarks of pseudoscience is that it makes claims that cannot be refuted or proven false.

## The Importance of Operational Definitions

A variable is a factor or element that can be changed and manipulated in ways that are observable and measurable. However, the researcher must also define how the variable will be manipulated and measured in the study.

Operational definitions are specific definitions for all relevant factors in a study. This process helps make vague or ambiguous concepts detailed and measurable.

For example, a researcher might operationally define the variable "test anxiety" as the results of a self-report measure of anxiety experienced during an exam. A "study habits" variable might be defined by the amount of studying that actually occurs as measured by time.

These precise descriptions are important because many things can be measured in various ways. Clearly defining these variables and how they are measured helps ensure that other researchers can replicate your results.
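To make the idea concrete, here is a minimal sketch of the two operational definitions mentioned above, written as explicit measurement rules. The record layout, the 1-to-5 rating scale, and the function names are assumptions invented for this illustration; they do not come from any particular study.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Participant:
    # Hypothetical record layout, invented for this illustration.
    study_minutes: List[int]   # length of each logged study session, in minutes
    anxiety_items: List[int]   # self-report ratings during the exam, 1 (low) to 5 (high)


def study_time_minutes(p: Participant) -> int:
    """'Study habits', operationalized as total time actually spent studying."""
    return sum(p.study_minutes)


def anxiety_score(p: Participant) -> float:
    """'Test anxiety', operationalized as the mean self-report rating."""
    return mean(p.anxiety_items)


participant = Participant(study_minutes=[30, 45, 60], anxiety_items=[4, 3, 5, 4])
print(study_time_minutes(participant))
print(anxiety_score(participant))
```

Writing the definition down as a rule like this leaves no ambiguity about how each variable is measured, which is what allows another researcher to apply the same measurement.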

## Replicability

One of the basic principles of any type of scientific research is that the results must be replicable.

Replication means repeating an experiment in the same way to produce the same results. By clearly detailing the specifics of how the variables were measured and manipulated, other researchers can better understand the results and repeat the study if needed.

Some variables are more difficult than others to define. For example, how would you operationally define a variable such as aggression? For obvious ethical reasons, researchers cannot create a situation in which a person behaves aggressively toward others.

To measure this variable, the researcher must devise a measurement that assesses aggressive behavior without harming others. The researcher might utilize a simulated task to measure aggressiveness in this situation.

## Hypothesis Checklist

- Does your hypothesis focus on something that you can actually test?
- Does your hypothesis include both an independent and dependent variable?
- Can you manipulate the variables?
- Can your hypothesis be tested without violating ethical standards?

The hypothesis you use will depend on what you are investigating and hoping to find. Some of the main types of hypotheses that you might use include:

- Simple hypothesis: This type of hypothesis suggests there is a relationship between one independent variable and one dependent variable.
- Complex hypothesis: This type suggests a relationship between three or more variables, for example, two independent variables and one or more dependent variables.
- Null hypothesis: This hypothesis suggests no relationship exists between two or more variables.
- Alternative hypothesis: This hypothesis states the opposite of the null hypothesis.
- Statistical hypothesis: This hypothesis uses statistical analysis to evaluate a representative population sample and then generalizes the findings to the larger group.
- Logical hypothesis: This hypothesis assumes a relationship between variables without collecting data or evidence.

A hypothesis often follows a basic format of "If {this happens} then {this will happen}." One way to structure your hypothesis is to describe what will happen to the dependent variable if you change the independent variable.

The basic format might be: "If {these changes are made to a certain independent variable}, then we will observe {a change in a specific dependent variable}."

## A few examples of simple hypotheses:

- "Students who eat breakfast will perform better on a math exam than students who do not eat breakfast."
- "Students who experience test anxiety before an English exam will get lower scores than students who do not experience test anxiety."
- "Motorists who talk on the phone while driving will be more likely to make errors on a driving course than those who do not talk on the phone."
- "Children who receive a new reading intervention will have higher reading scores than children who do not receive the intervention."

## Examples of a complex hypothesis include:

- "People with high-sugar diets and sedentary activity levels are more likely to develop depression."
- "Younger people who are regularly exposed to green, outdoor areas have better subjective well-being than older adults who have limited exposure to green spaces."

## Examples of a null hypothesis include:

- "There is no difference in anxiety levels between people who take St. John's wort supplements and those who do not."
- "There is no difference in scores on a memory recall task between children and adults."
- "There is no difference in aggression levels between children who play first-person shooter games and those who do not."

## Examples of an alternative hypothesis:

- "People who take St. John's wort supplements will have less anxiety than those who do not."
- "Adults will perform better on a memory task than children."
- "Children who play first-person shooter games will show higher levels of aggression than children who do not."
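As a rough illustration of how a null hypothesis and a directional alternative are weighed against data, the sketch below runs an exact one-sided permutation test on the memory-task example. The scores are hypothetical numbers invented for this illustration, and the permutation test is just one reasonable choice of method, not one prescribed by the article.

```python
from itertools import combinations

# Hypothetical memory-recall scores, invented for this illustration.
adults = [14, 15, 17, 18, 20]
children = [9, 10, 11, 12, 13]


def one_sided_permutation_p(group_a, group_b):
    """Exact p-value for the alternative 'group_a scores higher than group_b'.

    Under the null hypothesis of no difference, every way of splitting the
    pooled scores into two groups of these sizes is equally likely. With the
    group sizes fixed, the difference in means is at least as large as the
    observed one exactly when the sum of the first group is at least as large
    as the observed sum.
    """
    pooled = group_a + group_b
    observed_sum = sum(group_a)
    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), len(group_a)):
        n_splits += 1
        if sum(pooled[i] for i in idx) >= observed_sum:
            as_extreme += 1
    return as_extreme / n_splits


alpha = 0.05
p_value = one_sided_permutation_p(adults, children)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p = {p_value:.4f}: {decision}")
```

With these made-up scores, every adult outscores every child, so only one of the 252 possible splits is as extreme as the observed one and the null hypothesis of no difference is rejected at the 5% level. With real data the outcome could just as easily be a failure to reject.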

## Collecting Data on Your Hypothesis

Once a researcher has formed a testable hypothesis, the next step is to select a research design and start collecting data. The research method depends largely on exactly what they are studying. There are two basic types of research methods: descriptive research and experimental research.

## Descriptive Research Methods

Descriptive research methods such as case studies, naturalistic observations, and surveys are often used when conducting an experiment is difficult or impossible. These methods are best used to describe different aspects of a behavior or psychological phenomenon.

Once a researcher has collected data using descriptive methods, a correlational study can examine how the variables are related. This research method might be used to investigate a hypothesis that is difficult to test experimentally.
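A correlational analysis of this kind often comes down to computing a correlation coefficient. The sketch below calculates Pearson's r by hand for two hypothetical variables; the data and variable names are invented for this illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical survey data, invented for this illustration.
hours_sleep = [4, 5, 6, 7, 8]
exam_scores = [55, 60, 70, 72, 85]

r = pearson_r(hours_sleep, exam_scores)
print(f"r = {r:.2f}")
```

A value of r close to 1 or -1 indicates a strong linear relationship and a value near 0 indicates a weak one. On its own, the coefficient describes how the variables move together; it does not show that one causes the other.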

## Experimental Research Methods

Experimental methods are used to demonstrate causal relationships between variables. In an experiment, the researcher systematically manipulates a variable of interest (known as the independent variable) and measures the effect on another variable (known as the dependent variable).

Unlike correlational studies, which can only be used to determine if there is a relationship between two variables, experimental methods can be used to determine the actual nature of the relationship—whether changes in one variable actually cause another to change.

The hypothesis is a critical part of any scientific exploration. It represents what researchers expect to find in a study or experiment. In situations where the hypothesis is unsupported by the research, the research still has value. Such research helps us better understand how different aspects of the natural world relate to one another. It also helps us develop new hypotheses that can then be tested in the future.

By Kendra Cherry, MSEd
