So far we have seen how to build a linear regression model using the whole dataset. Built that way, there is no way to tell how the model will perform on new data. The preferred practice is therefore to split the dataset into an 80:20 sample (training:test), build the model on the 80% sample, and then use it to predict the dependent variable on the test data.
Doing it this way, we will have the model-predicted values for the 20% (test) data as well as the actuals (from the original dataset). By calculating accuracy measures (like min-max accuracy) and error rates (MAPE or MSE), we can find out the prediction accuracy of the model. Now, let's see how to actually do this.
Step 1: create the training (80%) and test (20%) samples from the original data. Step 2: develop the model on the training data and use it to predict the distance on the test data. Step 3: review diagnostic measures.
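A minimal R sketch of these three steps, assuming the built-in cars dataset (speed vs. stopping distance) used in this example:

```r
# Step 1: create training (80%) and test (20%) samples
set.seed(100)                                        # for reproducibility
trainRows <- sample(1:nrow(cars), 0.8 * nrow(cars))  # row indices for training
trainData <- cars[trainRows, ]
testData  <- cars[-trainRows, ]

# Step 2: build the model on training data and predict distance on test data
lmMod    <- lm(dist ~ speed, data = trainData)       # fit on the 80% sample
distPred <- predict(lmMod, testData)                 # predictions for the 20% sample

# Step 3: review diagnostic measures
summary(lmMod)   # model and predictor p-values, R-Sq, Adj R-Sq
```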
From the model summary, the model p-value and the predictor's p-value are less than the significance level, so we know we have a statistically significant model. Also, the R-Sq and Adj R-Sq are comparable to those of the original model built on the full data.
A simple correlation between the actuals and predicted values can be used as a form of accuracy measure. A higher correlation accuracy implies that the actuals and predicted values move in the same direction, i.e. when the actual values increase the predicted values also increase, and vice versa.
Now let's calculate the Min-Max accuracy and MAPE: $$\text{MinMaxAccuracy} = \text{mean}\left(\frac{\min(\text{actuals},\ \text{predicteds})}{\max(\text{actuals},\ \text{predicteds})}\right)$$
$$\text{MeanAbsolutePercentageError (MAPE)} = \text{mean}\left(\frac{\left|\text{predicteds} - \text{actuals}\right|}{\text{actuals}}\right)$$
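A sketch of these accuracy measures in R, assuming the actuals and predicteds produced by the train/test split above:

```r
actuals_preds <- data.frame(actuals = testData$dist, predicteds = distPred)

# correlation accuracy
cor(actuals_preds$actuals, actuals_preds$predicteds)

# Min-Max accuracy: mean of the row-wise min/max ratio
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))

# MAPE: mean absolute percentage error
mape <- mean(abs(actuals_preds$predicteds - actuals_preds$actuals) / actuals_preds$actuals)
```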
Suppose the model predicts satisfactorily on the 20% split (test data). Is that enough to believe that your model will perform equally well all the time? It is important to rigorously test the model's performance as much as possible. One way is to ensure that the model equation performs well when it is built on a different subset of the training data and used to predict the remaining data.
How do we do this? Split your data into 'k' mutually exclusive random sample portions. Keeping each portion as test data, we build the model on the remaining (k-1) portions and calculate the mean squared error of the predictions. This is done for each of the 'k' random sample portions. Finally, the average of these mean squared errors (over the 'k' portions) is computed. We can use this metric to compare different linear models.
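A hand-rolled sketch of k-fold cross-validation in base R; the tutorial's own cross-validation plots come from a dedicated package, so this sketch only computes the average mean squared error, again assuming the cars data:

```r
k <- 5
set.seed(100)
folds <- sample(rep(1:k, length.out = nrow(cars)))      # assign each row to a fold

cv_mse <- sapply(1:k, function(i) {
  fit  <- lm(dist ~ speed, data = cars[folds != i, ])   # build on the k-1 portions
  pred <- predict(fit, cars[folds == i, ])              # predict the held-out portion
  mean((cars$dist[folds == i] - pred)^2)                # MSE for this fold
})
mean(cv_mse)   # average MSE across the k folds
```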
By doing this, we need to check two things: (1) that the model's prediction accuracy does not vary too much for any one particular sample, and (2) that the lines of best fit do not vary too much with respect to slope and level. In other words, they should be parallel and as close to each other as possible. You can find a more detailed explanation for interpreting the cross-validation charts when you learn about advanced linear model building.
In the plot below, are the dashed lines parallel? Are the small and big symbols not over-dispersed for any one particular color?
We have covered the basic concepts of linear regression. Besides these, you need to understand that linear regression is based on certain underlying assumptions that must be taken care of, especially when working with multiple predictors (Xs). Once you are familiar with those, the advanced regression models will show you around the various special cases where a different form of regression would be more suitable.
The " general linear F-test " involves three basic steps, namely:
As you can see by the wording of the third step, the null hypothesis always pertains to the reduced model, while the alternative hypothesis always pertains to the full model.
The easiest way to learn about the general linear F-test is to first go back to what we know, namely the simple linear regression model. Once we understand the general linear F-test for the simple case, we then see that it can be easily extended to the multiple case. We take that approach here.
The " full model ", which is also sometimes referred to as the " unrestricted model ," is the model thought to be most appropriate for the data. For simple linear regression, the full model is:
\[y_i=(\beta_0+\beta_1x_{i1})+\epsilon_i\]
Here's a plot of a hypothesized full model for a set of data that we worked with previously in this course (student heights and grade point averages):
And, here's another plot of a hypothesized full model that we previously encountered (state latitudes and skin cancer mortalities):
In each plot, the solid line represents what the hypothesized population regression line might look like for the full model. The question we have to answer in each case is "does the full model describe the data well?" Here, we might think that the full model does well in summarizing the trend in the second plot but not the first.
The reduced model
The " reduced model ," which is sometimes also referred to as the " restricted model ," is the model described by the null hypothesis H 0 . For simple linear regression, a common null hypothesis is H 0 : β 1 = 0. In this case, the reduced model is obtained by "zeroing-out" the slope β 1 that appears in the full model. That is, the reduced model is:
\[y_i=\beta_0+\epsilon_i\]
This reduced model suggests that each response \(y_i\) is a function only of some overall mean, \(\beta_0\), and some error \(\epsilon_i\).
Let's take another look at the plot of student grade point average against height, but this time with a line representing what the hypothesized population regression line might look like for the reduced model:
Not bad — there (fortunately?!) doesn't appear to be a relationship between height and grade point average. And, it appears as if the reduced model might be appropriate in describing the lack of a relationship between heights and grade point averages. How does the reduced model do for the skin cancer mortality example?
It doesn't appear as if the reduced model would do a very good job of summarizing the trend in the population.
How do we decide if the reduced model or the full model does a better job of describing the trend in the data when it can't be determined by simply looking at a plot? What we need to do is to quantify how much error remains after fitting each of the two models to our data. That is, we take the general linear F-test approach: fit each of the two models and compare their error sums of squares.
Recall that, in general, the error sum of squares is obtained by summing the squared distances between the observed and fitted (estimated) responses:
\[\sum(\text{observed } - \text{ fitted})^2\]
Therefore, since \(y_i\) is the observed response and \(\hat{y}_i\) is the fitted response for the full model :
\[SSE(F)=\sum(y_i-\hat{y}_i)^2\]
And, since \(y_i\) is the observed response and \(\bar{y}\) is the fitted response for the reduced model :
\[SSE(R)=\sum(y_i-\bar{y})^2\]
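In R these two quantities are just sums of squared deviations; a sketch, assuming a generic response vector y and predictor vector x (placeholder names, not from the lesson's data files):

```r
full  <- lm(y ~ x)                  # full model with intercept and slope
sse_f <- sum(residuals(full)^2)     # SSE(F) = sum of (y_i - yhat_i)^2
sse_r <- sum((y - mean(y))^2)       # SSE(R) = sum of (y_i - ybar)^2 (intercept-only model)
```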
Let's get a better feel for the general linear F-test approach by applying it to two different datasets. First, let's look at the heightgpa data. The following plot of grade point averages against heights contains two estimated regression lines: the solid line is the estimated line for the full model, and the dashed line is the estimated line for the reduced model:
As you can see, the estimated lines are almost identical. Calculating the error sum of squares for each model, we obtain:
\[SSE(F)=\sum(y_i-\hat{y}_i)^2=9.7055\]
\[SSE(R)=\sum(y_i-\bar{y})^2=9.7331\]
The two quantities are almost identical. Adding height to the reduced model to obtain the full model reduces the amount of error by only 0.0276 (from 9.7331 to 9.7055). That is, adding height to the model does very little in reducing the variability in grade point averages. In this case, there appears to be no advantage in using the larger full model over the simpler reduced model.
Look what happens when we fit the full and reduced models to the skin cancer mortality and latitude dataset :
Here, there is quite a big difference in the estimated equation for the reduced model (solid line) and the estimated equation for the full model (dashed line). The error sums of squares quantify the substantial difference in the two estimated equations:
\[SSE(F)=\sum(y_i-\hat{y}_i)^2=17173\]
\[SSE(R)=\sum(y_i-\bar{y})^2=53637\]
Adding latitude to the reduced model to obtain the full model reduces the amount of error by 36464 (from 53637 to 17173). That is, adding latitude to the model substantially reduces the variability in skin cancer mortality. In this case, there appears to be a big advantage in using the larger full model over the simpler reduced model.
Where are we going with this general linear F-test approach? In short, we compare SSE(R) with SSE(F): if the two are close, the reduced model is adequate; if SSE(F) is much smaller, the extra terms in the full model are worth keeping.
How different does SSE(R) have to be from SSE(F) in order to justify using the larger full model? The general linear F-statistic:
\[F^*=\left( \frac{SSE(R)-SSE(F)}{df_R-df_F}\right)\div\left( \frac{SSE(F)}{df_F}\right)\]
helps answer this question. The F-statistic intuitively makes sense: it is a function of SSE(R) − SSE(F), the difference in the error between the two models. The degrees of freedom, denoted \(df_R\) and \(df_F\), are those associated with the reduced and full model error sums of squares, respectively.
We use the general linear F-statistic to decide whether or not to reject the null hypothesis (the reduced model) in favor of the alternative hypothesis (the full model).
In general, we reject H 0 if F * is large — or equivalently if its associated P -value is small.
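In R, the same comparison can be carried out by fitting the reduced and full models and handing both to anova(), which reports F* and its P-value; a sketch with generic variable names y and x (placeholders, not the lesson's data):

```r
reduced <- lm(y ~ 1)    # reduced model: intercept only
full    <- lm(y ~ x)    # full model: intercept plus slope
anova(reduced, full)    # F* = [(SSE(R)-SSE(F))/(df_R-df_F)] / [SSE(F)/df_F], with P-value
```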
For simple linear regression, it turns out that the general linear F-test is just the same ANOVA F-test that we learned before. As noted earlier for the simple linear regression case, the full model is:

\[y_i=(\beta_0+\beta_1x_{i1})+\epsilon_i\]

and the reduced model is:

\[y_i=\beta_0+\epsilon_i\]

Therefore, the appropriate null and alternative hypotheses are \(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\).
The degrees of freedom associated with the error sum of squares for the reduced model is n -1, and:
\[SSE(R)=\sum(y_i-\bar{y})^2=SSTO\]
The degrees of freedom associated with the error sum of squares for the full model is n -2, and:
\[SSE(F)=\sum(y_i-\hat{y}_i)^2=SSE\]
Now, we can see how the general linear F -statistic just reduces algebraically to the ANOVA F -test that we know:
\[F^*=\left( \frac{SSE(R)-SSE(F)}{df_R-df_F}\right)\div\left( \frac{SSE(F)}{df_F}\right)=\left( \frac{SSTO-SSE}{(n-1)-(n-2)}\right)\div\left( \frac{SSE}{n-2}\right)=\frac{MSR}{MSE}\]
That is, the general linear F -statistic reduces to the ANOVA F -statistic:
\[F^*=\frac{MSR}{MSE}\]
For the student height and grade point average example:
\[F^*=\frac{MSR}{MSE}=\frac{0.0276/1}{9.7055/33}=\frac{0.0276}{0.2941}=0.094\]
For the skin cancer mortality example:
\[F^*=\frac{MSR}{MSE}=\frac{36464/1}{17173/47}=\frac{36464}{365.4}=99.8\]
The P -value is calculated as usual. The P -value answers the question: "what is the probability that we’d get an F* statistic as large as we did, if the null hypothesis were true?" The P -value is determined by comparing F * to an F distribution with 1 numerator degree of freedom and n -2 denominator degrees of freedom. For the student height and grade point average example, the P -value is 0.761 (so we fail to reject H 0 and we favor the reduced model), while for the skin cancer mortality example, the P -value is 0.000 (so we reject H 0 and we favor the full model).
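These P-values can be reproduced from the F distribution; a quick cross-check in R using the numbers above:

```r
pf(0.094, df1 = 1, df2 = 33, lower.tail = FALSE)   # about 0.761, height/GPA example
pf(99.8,  df1 = 1, df2 = 47, lower.tail = FALSE)   # essentially 0, skin cancer example
```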
Does alcoholism have an effect on muscle strength? Some researchers (Urbano-Marquez et al., 1989) who were interested in answering this question collected the following data (alcoholarm.txt) on a sample of 50 alcoholic men:
The full model is the model that would summarize a linear relationship between alcohol consumption and arm strength. The reduced model, on the other hand, is the model that claims there is no relationship between alcohol consumption and arm strength.
Upon fitting the reduced model to the data, we obtain:
\[SSE(R)=\sum(y_i-\bar{y})^2=1224.32\]
Note that the reduced model does not appear to summarize the trend in the data very well.
Upon fitting the full model to the data, we obtain:
\[SSE(F)=\sum(y_i-\hat{y}_i)^2=720.27\]
The full model appears to describe the trend in the data better than the reduced model.
The good news is that in the simple linear regression case, we don't have to bother with calculating the general linear F -statistic. Statistical software does it for us in the ANOVA table:
As you can see, the output reports both SSE(F), the amount of error associated with the full model, and SSE(R), the amount of error associated with the reduced model. The F-statistic is:
\[F^*=\frac{MSR}{MSE}=\frac{504.04/1}{720.27/48}=\frac{504.04}{15.006}=33.59\]
and its associated P -value is < 0.001 (so we reject H 0 and we favor the full model). We can conclude that there is a statistically significant linear association between lifetime alcohol consumption and arm strength.
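A sketch of how that ANOVA table could be produced in R, assuming the alcoholarm.txt file has columns named alcohol and strength (hypothetical names; the actual column names may differ):

```r
alcoholarm <- read.table("alcoholarm.txt", header = TRUE)   # assumed file layout
fit <- lm(strength ~ alcohol, data = alcoholarm)
anova(fit)      # reports SSE(F), the regression SS = SSE(R) - SSE(F), MSR, MSE, F*, P-value
summary(fit)    # slope estimate and its t-test
```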
Linear regression is a powerful statistical tool used to model the relationship between a dependent variable and one or more independent variables. However, the validity and reliability of linear regression analysis hinge on several key assumptions. If these assumptions are violated, the results of the analysis can be misleading or even invalid. In this comprehensive guide, we will delve into the essential assumptions of linear regression, explore how to check them, and provide practical solutions for addressing potential violations.
Linear regression is a cornerstone of statistical modeling, widely employed in various fields, from economics and finance to social sciences and engineering. Its simplicity and interpretability make it a popular choice for understanding the relationships between variables. However, like any statistical method, linear regression relies on a set of assumptions to ensure the accuracy and meaningfulness of its results.
When these assumptions are met, linear regression provides unbiased and efficient estimates of the model parameters. However, when these assumptions are violated, the results can be biased, inefficient, or even completely invalid. Therefore, it’s crucial to understand these assumptions, assess whether they hold in your data, and take appropriate corrective measures if they don’t.
Before we dive into the specifics of checking and addressing assumption violations, let's first outline the key assumptions underlying linear regression: linearity of the relationship, independence of the errors, homoscedasticity (constant error variance), normality of the errors, and no multicollinearity among the independent variables.
Now that we understand the key assumptions, let’s explore how to check whether they hold in your data.
The linearity assumption states that the relationship between the independent variables and the dependent variable is linear. This means that a straight line can adequately represent the relationship.
How to check:
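The check itself did not survive extraction; a common way to look at it, sketched in R for a hypothetical fitted model named fit on a hypothetical data frame mydata, is a residuals-versus-fitted plot, which should show no systematic curvature if the relationship is linear:

```r
fit <- lm(y ~ x1 + x2, data = mydata)   # hypothetical model and data
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)                  # systematic curvature around this line suggests non-linearity
```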
What to do if the linearity assumption fails:
The independence assumption states that the errors (residuals) are independent of each other. This means that the error in one observation should not be related to the error in another observation.
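One common diagnostic (a sketch, not this guide's own recipe) is the Durbin-Watson test for serial correlation in the residuals, available for example in the lmtest package, applied to the same hypothetical fit:

```r
library(lmtest)
dwtest(fit)   # Durbin-Watson statistic near 2 suggests little autocorrelation in the residuals
```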
What to do if the independence assumption fails:
The homoscedasticity assumption states that the variance of the errors is constant across all levels of the independent variables. This means that the spread of the residuals should be roughly the same across the range of fitted values.
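A visual check and a formal test, sketched with the same hypothetical fit (the Breusch-Pagan test here is one common option, not necessarily the one the guide had in mind):

```r
plot(fitted(fit), abs(resid(fit)),
     xlab = "Fitted values", ylab = "|Residuals|")   # funnel shapes suggest heteroscedasticity
library(lmtest)
bptest(fit)                                          # Breusch-Pagan test for non-constant variance
```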
What to do if the homoscedasticity assumption fails:
The normality assumption states that the errors are normally distributed. This means that if you were to plot a histogram of the residuals, it should roughly resemble a bell-shaped curve.
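Sketch of the usual graphical and formal checks on the residuals of the same hypothetical fit:

```r
qqnorm(resid(fit)); qqline(resid(fit))   # points close to the line suggest approximate normality
hist(resid(fit), breaks = 20)            # should look roughly bell-shaped
shapiro.test(resid(fit))                 # formal test; very sensitive in large samples
```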
What to do if the normality assumption fails:
The no multicollinearity assumption states that the independent variables are not highly correlated with each other. Multicollinearity can make it difficult to interpret the individual effects of the independent variables and can lead to unstable estimates.
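A common check (a sketch, assuming the same hypothetical fit and data) is the variance inflation factor, for example from the car package, alongside the pairwise correlations of the predictors:

```r
library(car)
vif(fit)                          # VIFs well above roughly 5-10 flag problematic multicollinearity
cor(mydata[, c("x1", "x2")])      # pairwise correlations among the predictors
```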
This lesson describes how to conduct a hypothesis test to determine whether there is a significant linear relationship between an independent variable X and a dependent variable Y .
The test focuses on the slope of the regression line
Y = Β0 + Β1X
where Β0 is a constant, Β1 is the slope (also called the regression coefficient), X is the value of the independent variable, and Y is the value of the dependent variable.
If we find that the slope of the regression line is significantly different from zero, we will conclude that there is a significant relationship between the independent and dependent variables.
The approach described in this lesson is valid whenever the standard requirements for simple linear regression are met.
The test procedure consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results.
If there is a significant linear relationship between the independent variable X and the dependent variable Y , the slope will not equal zero.
H0: Β1 = 0
Ha: Β1 ≠ 0
The null hypothesis states that the slope is equal to zero, and the alternative hypothesis states that the slope is not equal to zero.
The analysis plan describes how to use sample data to accept or reject the null hypothesis. The plan should specify the following elements.
Using sample data, find the standard error of the slope, the slope of the regression line, the degrees of freedom, the test statistic, and the P-value associated with the test statistic. The approach described in this section is illustrated in the sample problem at the end of this lesson.
Predictor | Coef | SE Coef | T | P
---|---|---|---|---
Constant | 76 | 30 | 2.53 | 0.01
X | 35 | 20 | 1.75 | 0.04
SE = s_b1 = sqrt [ Σ(y_i − ŷ_i)² / (n − 2) ] / sqrt [ Σ(x_i − x̄)² ]
t = b_1 / SE
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level , and rejecting the null hypothesis when the P-value is less than the significance level.
The local utility company surveys 101 randomly selected customers. For each survey participant, the company collects the following: annual electric bill (in dollars) and home size (in square feet). Output from a regression analysis appears below.
Annual bill = 0.55 * Home size + 15

Predictor | Coef | SE Coef | T | P
---|---|---|---|---
Constant | 15 | 3 | 5.0 | 0.00
Home size | 0.55 | 0.24 | 2.29 | 0.01
Is there a significant linear relationship between annual bill and home size? Use a 0.05 level of significance.
The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:
H0: The slope of the regression line is equal to zero.
Ha: The slope of the regression line is not equal to zero.
We get the slope (b1) and the standard error (SE) from the regression output.
b1 = 0.55, SE = 0.24
We compute the degrees of freedom and the t statistic, using the following equations.
DF = n - 2 = 101 - 2 = 99
t = b 1 /SE = 0.55/0.24 = 2.29
where DF is the degrees of freedom, n is the number of observations in the sample, b 1 is the slope of the regression line, and SE is the standard error of the slope.
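As a cross-check (a sketch; the lesson itself reads the result from a table or software, and whether the reported P is one- or two-tailed may differ), the two-sided P-value can be computed from the t distribution in R:

```r
t_stat <- 0.55 / 0.24            # b1 / SE, about 2.29
2 * pt(-abs(t_stat), df = 99)    # two-sided P-value, roughly 0.02, below the 0.05 level
```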
Paravee Maneejuk
Center of Excellence in Econometrics, Faculty of Economics, Chiang Mai University, Chiang Mai, Thailand
The discussion on the use and misuse of p-values in 2016 by the American Statistical Association was a timely assertion that statistical concepts should be properly used in science. Some researchers, especially economists, who adopt significance testing and p-values to report their results, may feel confused by the statement, leading to misinterpretations of it. In this study, we aim to re-examine the accuracy of the p-value and introduce an alternative way of testing the hypothesis. We conduct a simulation study to investigate the reliability of the p-value. Apart from investigating the performance of the p-value, we also introduce some existing approaches, Minimum Bayes Factors and belief functions, for replacing the p-value. Results from the simulation study confirm that the p-value is unreliable in some cases and that our proposed approaches are useful substitute tools for statistical inference. Moreover, our results show that the plausibility approach is more accurate for making decisions about the null hypothesis than the traditionally used p-values when the null hypothesis is true. However, the MBFs of Edwards et al. [Bayesian statistical inference for psychological research. Psychol. Rev. 70(3) (1963), pp. 193–242], Vovk [A logic of probability, with application to the foundations of statistics. J. Royal Statistical Soc. Series B (Methodological) 55 (1993), pp. 317–351] and Sellke et al. [Calibration of p values for testing precise null hypotheses. Am. Stat. 55(1) (2001), pp. 62–71] provide more reliable results compared to all other methods when the null hypothesis is false.
Our work has been inspired by a statement on statistical significance and p-values issued by the American Statistical Association (ASA) in 2016 [ 23 ]. They stated that the p-value does not provide a good measure of evidence regarding a model or hypothesis, whose validity or significance should not be based only on whether its p-value passes a specific threshold, for example 0.10, 0.05, or 0.01. This statement indicates that in many scientific disciplines, the use of p-values to make decisions on tests of hypotheses may have led to a large number of wrong discoveries. Some researchers, especially economists, who adopt significance testing and p-values to report their results, may feel confused by the statement, leading to misinterpretations of it. Econometric models and statistical tests have been used intensively by economists for interpreting causal effects, model selection, and forecasting. The question is how to test and make an inference without p-values.
The discussion of this issue is not quite new. Critiques started with Berkson [ 2 ], Rozeboom [ 16 ], and Cohen [ 3 ], and a review of the studies attempting to ban p-values can be found in Kline [ 13 ]. The motivation for banning p-values is a concern with the logic that underlies significance testing and the p-value. One of the most prominent problems is that many researchers misunderstand the p-value as the probability of the null hypothesis. Indeed, a p-value does not have any meaning in this regard [ 14 ] and [ 7 ]. The misconceptions about the interpretation of the p-value are explained in the work of Assaf and Tsionas [ 1 ]. They provided a simple explanation of the problem of making an inference from a p-value: for example, if the p-value is less than 0.05, we have enough evidence to reject the null hypothesis and accept the claim. By this convention in the regression framework, we must reject the null hypothesis ( H 0 : β = 0 ). While this is fine, the interpretation can be misleading, as the p-value is only the probability of the observed results, regardless of the value of β. Intuitively, we make the same interpretation about β = 0 for the whole range of p-values below 0.05. This indicates that the p-value provides only indirect evidence about the null hypothesis, as the parameter is treated the same for all p-values less than 0.05. In addition, it is known that the p-value might overstate the evidence against the null [ 7 , 12 , 21 , 1 ]. Another problem with the p-value is its strong dependence on the sample size: a smaller sample size tends to yield a higher p-value and vice versa, see Rouder et al. [ 17 ]. Thus, if we do not have a large enough sample size, the interpretation might be wrong, as it is difficult to obtain an accurate test result, especially when the null hypothesis must be accepted (the null hypothesis is true). It is also dangerous to rely only on binary decisions (e.g. whether to reject or accept the null hypothesis). It is this extreme binary view that, in our opinion, has caused various problems for decision making. Stern [ 21 ] mentioned that a non-significant p-value indicates that the data could easily be observed under the null hypothesis. However, the data could also be observed under a range of alternative hypotheses. Thus, it is overconfident to make the decision based on this binary approach, and it may contribute to a misunderstanding of the test results.
Obviously, the p -value is currently misinterpreted, overtrusted, and misused in the research reports. Hence, this puts us in a difficult situation of testing against a null hypothesis. However, our discussion should not be taken as a statement for researchers and practitioners to completely avoid p -values. Rather, we should investigate some misconceptions about the p -values and find alternative methods that have better statistical interpretations and properties. Fortunately, there is an alternative approach to p -value. We refer to such an approach as the Bayesian counterpart, the Bayes factor method and the plausibility method. Our suggested methods are similar to some of the suggestions in the American Statistician in 2019 [ 24 ]. They have further discussed the problems of p -values and suggested the new guidelines for supplementing or replacing p -values. In this article, they suggested second-generation p -values, confidence intervals, false-positive risk and Bayes factor methods for replacing the conventional p -values.
We can start discussing the Bayes factor, which has been widely accepted as a valuable alternative to the p -value approach in these recent days (see [ 21 , 15 , 10 , 1 ]). Page and Satake [ 15 ] revealed that there are two main differences between p -value and Bayes factors. First, the calculation of p -value involves both observed data and ‘more extreme’ hypothetical outcomes, whereas the calculation of the Bayes factor can be obtained from observed data alone. Note that, in a Bayesian approach, the information from the observed data is normally combined with the priors for the parameter of interest. This is the point that sparks much of the debate regarding the Bayes methods, because the selection of prior may have much impact on the posterior distribution and conclusions. However, the calculation of the Bayes factor can be obtained from observed data alone by assuming the uniform prior on the parameter of interest. Second, a p -value is computed in relation to only the null hypothesis, whereas the Bayes factor considers both the null and alternative hypotheses. Many researchers confirmed that the Bayes factor is more suitable to address the problem of comparing hypotheses as it provides a natural framework for integrating a variety of sources of information about a quantity of interest. In other words, the statistical test based on this method relies on the combination of the information from the observed data and the prior information. Generally speaking, the prior information from the researcher is combined with the likelihood of the data to obtain the posterior distribution for constructing the Bayes factor. This posterior distribution explicitly addresses the information about the values of parameters, which are most plausible. Bayes factor becomes a measure of evidence regarding the two conflicting hypotheses and therefore, it can investigate whether the parameter of interest is equal to a specific value or not, say ( H 0 : β = b ) against ( H 1 : β ≠ b ) , in the regression context. Thus, in practice, the Bayes factor can be computed by the ratio of the posterior distribution of the null hypothesis and alternative hypothesis. Held and Ott [ 9 ] confirmed that Bayes factor can be considered as an alternative to the p -value for testing hypotheses and for quantifying the degree to which observed data support or conflict with a hypothesis (this approach is discussed further below). The additional information can be obtained from Stern [ 21 ]. However, in this study, we focus on the evidence against a simple null hypothesis provided by the Minimum Bayes factor (MBF) approach. In other words, this approach transforms p -value to a Bayes factor for making a new interpretation from the observed data (see [ 9 , 10 ]). In this context, MBF is usually oriented as p- values in that the smaller values provide stronger evidence against the null hypothesis [ 8 ].
Another approach considered in this study is the plausibility-based belief function (plausibility method) as proposed by Kanjanatarakul, Denoeux, and Sriboonchitta [ 11 ]. This method is an extension of the MBF concept. While the MBF focuses on transforming the p-value (or t-statistic), the plausibility-based belief function transforms β into a plausibility (similar to a p-value). The discussion of this method as an alternative to the p-value for testing hypotheses is quite limited. Thus, we attempt to fill this gap in the literature and suggest this method as another alternative for testing hypotheses. The method allows us to obtain the plausibility of each parameter value in the range under consideration. For example, if we want to know whether β = b, b = [ −3, 3 ], we can find the plausibilities P l ( β = −3 ), …, P l ( β = 3 ). Thus, if we want to test ( H 0 : β = b ), we measure the plausibility P l ( β = b ), and if P l ( β = b ) is less than 0.05, we reject this null hypothesis. We can see that the p-value and the plausibility seem to provide similar information. However, Kanjanatarakul, Denoeux, and Sriboonchitta [ 11 ] noted that the two measures have completely different interpretations. The p-value is a probability computed under the hypothesis ( H 0 : β = b ), based on the assumption of repeated sampling, and it takes into account values of the t-statistic larger than the critical values corresponding to p-values of 0.10, 0.05, and 0.01. In contrast, the assertion P l ( β = b ) = α indicates that there is a parameter value β = b whose likelihood, conditional on the data, is α times the maximum likelihood. The closer the value of α is to zero, the less plausible the value β = b. In this way we can obtain P l ( β = b ) for all possible values of b, which means that the plausibility depends directly on the value of β. For more explanation of the Bayesian approach and the belief functions, refer to Shafer [ 20 ].
In this paper, we further explore these two methods as alternatives to the p-value by showing that, under the same hypothesis, they provide direct probability statements about the parameter of interest and more accurate and reliable results for inferential decision making. We conduct several experiments and illustrate the practical application of the methods using a simple regression model, which is widely employed in many research works. The background of the frequentist p-value, the Bayes factor, and the plausibility method (likelihood-based belief functions) is provided in Section 2, followed by the experiment and real application studies in Sections 3 and 4, respectively. Finally, the conclusion is provided in Section 5.
2.1. The p-value
Recall that the p-value is the probability of obtaining a test statistic equal to or more extreme than the observed result under the assumption that the null hypothesis is true [ 21 ]. More precisely, it is a quantitative measure of the discrepancy between the data and the point null hypothesis [ 7 ]. The basic decision rule is that if the p-value of the sample data is less than a specific threshold or significance level, 0.01, 0.05, or 0.10, the result is said to be statistically significant and the null hypothesis is rejected. In this study, the simple linear regression model is considered, which can be written as

y_i = β_0 + β x_i + ε_i,

where y_i is the dependent variable, x_i is the independent variable, and ε_i is the error term, assumed to be independent and identically distributed (iid) normal. The statistical test in this study is therefore based on this normality assumption. To investigate the impact of the variable x_i on y_i, one needs to examine whether β equals zero. Hence the hypothesis in this problem can be set as H 0 : β = 0 against the alternative hypothesis H a : β ≠ 0. Letting t denote the t-statistic, under the traditional p-value method the p-value is calculated as

p = 2 ( 1 − Φ ( | t̂ | ) ),

where t̂ is the observed t-statistic for testing H 0 : β = 0, computed by t̂ = β̂ / sqrt( var( β̂ ) ), and Φ ( ⋅ ) denotes the cumulative standard normal distribution function. Inference regarding H 0 is then based on the p-value. For a better understanding of the misuse and misinterpretation of the p-value, let us provide a simulation example.
As an example, we simulate a situation where ( H 0 : β 1 = 0 ) is true and ( H 0 : β 2 = 0 ) is false. Thus, we consider a linear regression model with two independent variables. The parameter β 0 is omitted, as we are only considering the effect of the independent variables on the dependent variable; the model becomes

y_i = β_1 x_{1i} + β_2 x_{2i} + ε_i,

where β 1 = 0 and β 2 = 1, and ε i is assumed to have a normal distribution with mean zero and variance σ 2, ε i ∼ iid N ( 0 , σ 2 ). We simulate 1,000 data sets with sample sizes N = 20, 100, and 500 using the specification of Eq. 4 and estimate the model on each data set to obtain the p-values of β 1 and β 2. In other words, the parameter β 1 is assumed to have no effect on y i, while β 2 has a statistically significant effect. The results are illustrated in Figure 1.
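A condensed R sketch of this simulation; the paper's exact code is not given, so the seed and the way the p-values are collected here are illustrative only:

```r
set.seed(1)
sim_pvals <- function(N, reps = 1000) {
  replicate(reps, {
    x1 <- rnorm(N); x2 <- rnorm(N)
    y  <- 0 * x1 + 1 * x2 + rnorm(N)                 # beta1 = 0 (H0 true), beta2 = 1 (H0 false)
    coefs <- summary(lm(y ~ x1 + x2 - 1))$coefficients   # no intercept, as in the paper
    coefs[, "Pr(>|t|)"]                              # p-values for beta1 and beta2
  })
}
p20 <- sim_pvals(20)    # 2 x 1000 matrix: row 1 = p-values for beta1, row 2 = for beta2
hist(p20[1, ])          # roughly uniform when H0 is true
hist(p20[2, ])          # concentrated near 0 when H0 is false
```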
Figure 1. Performance of p-values when H0 is true (upper panel) and H0 is false (lower panel).
Let us consider the lower panel of Figure 1. It is obvious that, when the null hypothesis ( H 0 : β 2 = 0 ) is false and must be rejected, the p-values are almost all below the 0.01 significance level, except for some cases at sample size N = 20. The bulk of the p-values is well below 0.01, which enables us to make the correct inference, namely to reject the null hypothesis. There are, however, still cases in which the p-value falls above the 0.01 criterion. This indicates that researchers and practitioners still have a chance of making a wrong interpretation (a type II error), especially when a small sample size is used.
We then turn our attention to the parameter β 1, for which we know that ( H 0 : β 1 = 0 ) must be accepted. Across all 1,000 simulated data sets, the p-values are spread over the whole interval, roughly following a uniform distribution. This indicates that there is a non-negligible chance that the null hypothesis is rejected. Therefore, in this example, we can conclude that decision making based on p-values will be, more or less, arbitrary, and the resulting conclusions imprecise.
Furthermore, we also observe that the p-values in this simulation exhibit no dependence on the sample size when the null hypothesis is true and must be accepted (upper panel, β 1), but a strong dependence on the sample size when the null hypothesis is false and must be rejected (lower panel, β 2). This illustrates that the probability of rejecting ( H 0 : β 2 = 0 ) when ( H 1 : β 2 ≠ 0 ) is true depends on N, whereas the probability of rejecting the null hypothesis when ( H 0 : β 1 = 0 ) is true (type I error) does not depend on N.
As one of the most powerful tools for statistical testing, the Bayes factor is widely accepted as a valuable alternative to the p-value approach. Stern [ 21 ] mentioned that the 'Bayes factor has a significant advantage over the p-value as it can address the likelihood of the observed data for each hypothesis and thereby treat the two symmetrically'. That is, it is more realistic, as it provides inferences and compares hypotheses for the given data we analyze. In this study, we focus on the Bayes factor and consider a point null hypothesis H 0 : β = 0 together with prior probabilities Pr ( H 0 ) = π and Pr ( H 1 ) = 1 − Pr ( H 0 ). From the Bayesian point of view, it is possible to compute the probability of a hypothesis conditionally on observed data in terms of the posterior; an appropriate statistic for comparing hypotheses is the posterior odds,

Pr ( H 0 | y , x ) / Pr ( H 1 | y , x ) = [ Pr ( y , x | H 0 ) / Pr ( y , x | H 1 ) ] × [ Pr ( H 0 ) / Pr ( H 1 ) ],
in which the ratio Pr ( y , x | H 0 ) / Pr ( y , x | H 1 ) = B F 01 is called the Bayes factor of H 0 relative to H 1 . This Bayes factor can be computed formally as
where Pr ( H 0 | y , x ) = ∫ H 0 f ( y , x | β ) π ( β ) d β / f ( y , x ) . f ( y , x | β ) is the density function (likelihood function) and π ( β ) is prior density of β . Pr ( H 0 | y , x ) and Pr ( H 1 | y , x ) are the posterior distribution under the null and the alternative hypothesis, respectively. If the prior probability Pr ( H 0 ) = Pr ( H 1 ) , Bayes factor becomes the likelihood ratio of H 0 relative to H 1 . This ratio is termed the Bayes factor [ 12 ]. Thus, the Bayes factor provides a direct quantitative measure of whether the data have increased or decreased the odds of H 0 . As we know, the Bayesian approach consists of the likelihood of the data and the prior distribution of the parameter. The problem is what the appropriate prior is. In Eq. 6, the prior probabilities Pr ( H 0 ) and Pr ( H 1 ) are included, but when no information about prior distributions are available, the approach of Minimum Bayes Factor (MBF) can be used. Since min B F 01 ≤ 1 the Bayes factors lie in the same range as p -values. Another way to compute this MBF is introduced by Edwards, Lindman, and Savage [ 5 ]. They mentioned that the Bayes factor (Eq. 6) is based on the observed data, and they also suggested that this MBF can be computed easily by treating p -value ( p ) as that observed data, thus
The other option for obtaining the MBF is to back-transform p to the underlying test statistic t , which was used to calculate p -value (see, Eq. 3). Therefore, MBF conditional on t -statistic can be computed by
under the assumption that t = t ( p ) is one-to-one transformation. If this transformation does not hold, the p-based Bayes factor (Eq. 7) is preferred, since it is directly computed by the p -value. In this study, we focus on the evidence against a simple null hypothesis provided by MBF, as it is easy to compare with a p -value. To compute the MBF, we have to minimize the test-based Bayes factor based on Eq. 7 or Eq. 8. Thus, the MBF can be computed by
where f ( ⋅ | β ⌢ , H 1 ) is the maximum density function or likelihood of the optimal β ⌢ . The minimum Bayes factor is the smallest possible Bayes factor that can be obtained for a p -value in a certain class of distributions considered under the alternative [ 9 ].
Several methods for computing the MBF have been proposed since the pioneering work of Edwards et al . [ 5 ]. In this study, we also mention four methods as the tools for computing MBF with an emphasis on two-sided p -based and test-based Bayes factor. The formulas of these methods are provided in Table 1 .
Table 1. Formulas for computing the minimum Bayes factor, as proposed by Edwards et al. [ 5 ], Goodman [ 6 ], Vovk [ 22 ] and Sellke et al. [ 19 ], and Sellke et al. [ 19 ] (formulas not reproduced here).
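Purely as an illustration, and treating the exact correspondence to the rows of Table 1 as an assumption, two MBF calibrations commonly cited in this literature can be computed directly from a p-value or a test statistic:

```r
# Calibration often attributed to Vovk / Sellke et al.: MBF = -e * p * log(p), for p < 1/e
mbf_p <- function(p) ifelse(p < exp(-1), -exp(1) * p * log(p), 1)

# Test-statistic based calibration: MBF = exp(-t^2 / 2)
mbf_t <- function(t) exp(-t^2 / 2)

mbf_p(0.05)    # about 0.41: only weak-to-moderate evidence against H0
mbf_t(1.96)    # about 0.15
```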
However, the interpretation of the MBF is still different from that of the p-value approach. The transformation of a p-value to an MBF is called calibration (but it is not just a change of scale, like converting degrees Fahrenheit to degrees Celsius). By considering the MBF, we are in a different conceptual framework. The categorization of the Bayes factor is provided in Table 2 [ 9 ].
Minimum Bayes factor | Interpretation
---|---
1–1/3 | Weak evidence for H1
1/3–1/10 | Moderate evidence for H1
1/10–1/30 | Substantial evidence for H1
1/30–1/100 | Strong evidence for H1
1/100–1/300 | Very strong evidence for H1
< 1/300 | Decisive evidence for H1
Now, let us consider H 0 : β = 0 under this method. As mentioned before, we can measure the plausibility of β = 0 using the belief function. Denoeux [ 4 ] showed that the belief function B e l y Θ on Θ can be built from the likelihood function. Thus, we can use the normal likelihood to quantify the plausibility of H 0 : β = 0, whose value lies between zero and one, the same range as that of the p-value. The plausibility of H 0 is given by the contour function

P l y Θ ( H ) = sup_{β ∈ H} p l y ( β ),

for any hypothesis H ⊆ Θ, where p l y ( β ) is the relative likelihood L ( β ) / L ( β̂ ) and β̂ is the parameter value that maximizes the likelihood function. Clearly, the plausibility of H 0 lies in the interval [ 0, 1 ]. Under the normality assumption, β̂ is the maximum likelihood estimate.
When we take the derivative of the log likelihood with respect to β , we obtain
Consequently, we can estimate the P l y Θ ( H 0 : β = 0 ) as
This method can be viewed as an extension of the minimum Bayes factor approach, as the plausibility is computed as a ratio of relative likelihoods. However, it transforms the value of β itself rather than the p-value or t-statistic, so the plausibility is computed directly for any β. It can therefore be viewed as an alternative to the p-value.
Let us consider the same Example 1. We illustrate the calculation of P l y Θ ( H 0 : β = 0 ). In this example, we simulate one data set with a sample size N = 50. To generate these data, we set the seed of the R software's random number generator to 1, set.seed(1); the estimated results are provided in Table 3 and Figure 2.
Figure 2. Marginal contour functions for the parameters β 1 and β 2 (based on one simulated data set, N = 50). The vertical red line marks P l y Θ ( H 0 : β j = 0 ), j = 1, 2.
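A sketch of how the plausibility of H0: β = 0 could be computed under the paper's normality assumption, taking the contour function to be the Gaussian relative likelihood exp(-t^2/2); this closed form is an approximation assumed here, not the paper's exact expression:

```r
# relative likelihood of beta = 0, given the estimate and its standard error
plaus_beta0 <- function(beta_hat, se) exp(-0.5 * (beta_hat / se)^2)

# For the values reported in Table 3 below:
plaus_beta0(-0.0203, 0.1238)   # about 0.986, matching Pl(H0: beta1 = 0)
plaus_beta0( 1.0045, 0.1324)   # essentially 0, matching Pl(H0: beta2 = 0)
```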
Table 3. Estimation results for one simulated data set (N = 50).

Parameter | True value | Estimate | Std. error | Abs. t-statistic | p-value | Plausibility
---|---|---|---|---|---|---
β1 | 0 | −0.0203 | 0.1238 | 0.1640 | 0.8700 | 0.9860
β2 | 1 | 1.0045 | 0.1324 | 7.5860 | 0.0000 | 0.0000
Table 3 provides the parameters estimated by maximum likelihood (MLE), together with their standard errors, absolute t-statistics, p-values, and plausibilities P l y Θ ( H 0 : β j = 0 ), j = 1, 2. We can observe that the p-values and the plausibilities provide similar results: both the p-value and the plausibility of H 0 : β 2 = 0 are zero, indicating that β 2 is significantly different from zero. Likewise, both methods give the same interpretation that the parameter β 1 is insignificant, as both the p-value and the plausibility are higher than 0.01, 0.05, and 0.10. It is interesting, however, that the p-value for H 0 : β 1 = 0 and the plausibility differ in degree: P l y Θ ( H 0 : β 1 = 0 ) is 0.9860 while the p-value is 0.8700. In other words, the p-value quantifies the support for H 0 : β 1 = 0 at 0.8700 / 0.9860 ≈ 0.88 times the level indicated by the plausibility-based belief function; if the plausibility is taken as the reference, the p-value understates the support for the null. The comparison of the plausibility-based belief function and the p-value is further discussed in Section 3.
Several alternative methods for making an inference have been introduced in this study. If we need a statistical test, which one is preferable, and how do we compare the different approaches? To answer these questions, an experimental study is conducted in this section using simulated data. For comparison, we consider cases where, after the tests, we can find out the truth; that is, we simulate data for which the correct answer to the statistical test is already known.
To further illustrate, we consider an experiment that compares the p-value, Bayes factor, and plausibility approaches directly in the linear regression context. We start with the following data generating process,

y_i = β_1 x_{1i} + β_2 x_{2i} + ε_i,

where β 1 = 3 and β 2 = 0, so that only x 1 i has a significant effect on y i. ε i, x 1 i, and x 2 i are generated from a normal distribution with mean zero and variance one. Six different sample sizes are simulated: N = 10, 20, 50, 100, 200, and 500, with 1,000 data sets simulated for each sample size. Simulations were generated using fixed random seeds to simplify replication. To compare the performance of the methods, this study uses the percentage of incorrect inferences as the measure. For the p-value, we use the conventional statistical inference, in which a p-value equal to or lower than a threshold of 0.10, 0.05, or 0.01 leads to rejection of the null hypothesis. Likewise, the plausibility-based belief function is interpreted in the same way as the p-value. For the minimum Bayes factor approach, on the other hand, the interpretation differs from the first two methods, as we make the decision based on the MBF following the labeled intervals of Held and Ott [ 9 ] presented in Table 2. Our interest is in how often these methods return a non-significant outcome when the null is false and a significant outcome when the null is true. The results of the method comparison are provided in the following figures and tables.
The p-value and the plausibility results are reported in Tables 4 and 5, respectively. These two approaches have a similar interpretation, in which values less than 0.10, 0.05, or 0.01 are said to be significant. The MBF results, reported in Tables 6–9, have a different interpretation.
Table 4. Percentage of data sets in which H0 is rejected (p-value approach).

p-value approach | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500
---|---|---|---|---|---|---
H0: β1 = 0 (false) | | | | | |
p-value < 0.01 | 99.7 | 100 | 100 | 100 | 100 | 100
p-value < 0.05 | 99.9 | 100 | 100 | 100 | 100 | 100
p-value < 0.10 | 100 | 100 | 100 | 100 | 100 | 100
H0: β2 = 0 (true) | | | | | |
p-value < 0.01 | 2.1 | 1.9 | 2.2 | 1.1 | 1.3 | 1.2
p-value < 0.05 | 8.7 | 7.7 | 5.6 | 5.2 | 4.5 | 5.7
p-value < 0.10 | 13.5 | 13 | 10.2 | 9.5 | 10.3 | 10.3
Table 5. Percentage of data sets in which H0 is rejected (plausibility approach).

Plausibility approach | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500
---|---|---|---|---|---|---
H0: β1 = 0 (false) | | | | | |
plausibility < 0.01 | 100 | 100 | 100 | 100 | 100 | 100
plausibility < 0.05 | 100 | 100 | 100 | 100 | 100 | 100
plausibility < 0.10 | 100 | 100 | 100 | 100 | 100 | 100
H0: β2 = 0 (true) | | | | | |
plausibility < 0.01 | 5.1 | 1.7 | 1.1 | 0.5 | 0.4 | 0.3
plausibility < 0.05 | 11.5 | 5.3 | 3.4 | 1.9 | 1.8 | 2.3
plausibility < 0.10 | 16.2 | 8.9 | 5.1 | 3.4 | 3 | 3.8
Table 6. Percentage of data sets in each MBF evidence category.

Minimum Bayes factor | Interpretation | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500
---|---|---|---|---|---|---|---
H0: β1 = 0 (false) | | | | | | |
1–1/3 | Weak evidence for H1 | 0 | 0 | 0 | 0 | 0 | 0
1/3–1/10 | Moderate evidence for H1 | 0.1 | 0 | 0 | 0 | 0 | 0
1/10–1/30 | Substantial evidence for H1 | 0.2 | 0 | 0 | 0 | 0 | 0
1/30–1/100 | Strong evidence for H1 | 0.5 | 0 | 0 | 0 | 0 | 0
1/100–1/300 | Very strong evidence for H1 | 0.9 | 0 | 0 | 0 | 0 | 0
< 1/300 | Decisive evidence for H1 | 98.3 | 100 | 100 | 100 | 100 | 100
H0: β2 = 0 (true) | | | | | | |
1–1/3 | Weak evidence for H1 | 81.7 | 82.7 | 85.8 | 86.7 | 85.8 | 84.8
1/3–1/10 | Moderate evidence for H1 | 13.3 | 12 | 11.3 | 10.4 | 11.3 | 11.4
1/10–1/30 | Substantial evidence for H1 | 2.9 | 3.5 | 1.6 | 1.9 | 1.6 | 2.9
1/30–1/100 | Strong evidence for H1 | 1 | 1.4 | 1 | 0.6 | 1 | 0.6
1/100–1/300 | Very strong evidence for H1 | 0.3 | 0.2 | 0.2 | 0.3 | 0.2 | 0.1
< 1/300 | Decisive evidence for H1 | 0.8 | 0.2 | 0.1 | 0.1 | 0.1 | 0.2
Table 7. Percentage of data sets in each MBF evidence category.

Minimum Bayes factor | Interpretation | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500
---|---|---|---|---|---|---|---
H0: β1 = 0 (false) | | | | | | |
1–1/3 | Weak evidence for H1 | 0 | 0 | 0 | 0 | 0 | 0
1/3–1/10 | Moderate evidence for H1 | 0.6 | 0 | 0 | 0 | 0 | 0
1/10–1/30 | Substantial evidence for H1 | 0.5 | 0 | 0 | 0 | 0 | 0
1/30–1/100 | Strong evidence for H1 | 0.8 | 0 | 0 | 0 | 0 | 0
1/100–1/300 | Very strong evidence for H1 | 1.0 | 0 | 0 | 0 | 0 | 0
< 1/300 | Decisive evidence for H1 | 97.1 | 100 | 100 | 100 | 100 | 100
H0: β2 = 0 (true) | | | | | | |
1–1/3 | Weak evidence for H1 | 95.2 | 95 | 95.6 | 97.1 | 97.2 | 96.7
1/3–1/10 | Moderate evidence for H1 | 2.9 | 4.0 | 3 | 2.1 | 1.7 | 2.7
1/10–1/30 | Substantial evidence for H1 | 1 | 0.7 | 1.2 | 0.4 | 0.8 | 0.4
1/30–1/100 | Strong evidence for H1 | 0.4 | 0.1 | 0.2 | 0.4 | 0.3 | 0.1
1/100–1/300 | Very strong evidence for H1 | 0.1 | 0.1 | 0 | 0 | 0 | 0.1
< 1/300 | Decisive evidence for H1 | 0.4 | 0.1 | 0 | 0 | 0 | 0
Table 8. Percentage of data sets in each MBF evidence category.

Minimum Bayes factor | Interpretation | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500
---|---|---|---|---|---|---|---
H0: β1 = 0 (false) | | | | | | |
1–1/3 | Weak evidence for H1 | 0.1 | 0 | 0 | 0 | 0 | 0
1/3–1/10 | Moderate evidence for H1 | 0.3 | 0 | 0 | 0 | 0 | 0
1/10–1/30 | Substantial evidence for H1 | 0.5 | 0 | 0 | 0 | 0 | 0
1/30–1/100 | Strong evidence for H1 | 0.9 | 0 | 0 | 0 | 0 | 0
1/100–1/300 | Very strong evidence for H1 | 1.0 | 0 | 0 | 0 | 0 | 0
< 1/300 | Decisive evidence for H1 | 97.2 | 100 | 100 | 100 | 100 | 100
H0: β2 = 0 (true) | | | | | | |
1–1/3 | Weak evidence for H1 | 94.5 | 94 | 95.1 | 96.6 | 96.7 | 95.4
1/3–1/10 | Moderate evidence for H1 | 3.5 | 4.4 | 3.1 | 2.6 | 2.0 | 3.8
1/10–1/30 | Substantial evidence for H1 | 1.1 | 1.3 | 1.6 | 0.4 | 1.0 | 0.6
1/30–1/100 | Strong evidence for H1 | 0.3 | 0.1 | 0.2 | 0.4 | 0.3 | 0.1
1/100–1/300 | Very strong evidence for H1 | 0.1 | 0.1 | 0 | 0 | 0 | 0.1
< 1/300 | Decisive evidence for H1 | 0.5 | 0.1 | 0 | 0 | 0 | 0
Table 9. Percentage of data sets in each MBF evidence category.

Minimum Bayes factor | Interpretation | N = 10 | N = 20 | N = 50 | N = 100 | N = 200 | N = 500
---|---|---|---|---|---|---|---
H0: β1 = 0 (false) | | | | | | |
1–1/3 | Weak evidence for H1 | 0 | 0 | 0 | 0 | 0 | 0
1/3–1/10 | Moderate evidence for H1 | 0.1 | 0 | 0 | 0 | 0 | 0
1/10–1/30 | Substantial evidence for H1 | 0 | 0 | 0 | 0 | 0 | 0
1/30–1/100 | Strong evidence for H1 | 0.7 | 0 | 0 | 0 | 0 | 0
1/100–1/300 | Very strong evidence for H1 | 0.5 | 0 | 0 | 0 | 0 | 0
< 1/300 | Decisive evidence for H1 | 98.7 | 100 | 100 | 100 | 100 | 100
H0: β2 = 0 (true) | | | | | | |
1–1/3 | Weak evidence for H1 | 82.6 | 83.4 | 86.5 | 87.4 | 86.4 | 85.9
1/3–1/10 | Moderate evidence for H1 | 11.6 | 10.5 | 8.6 | 9.2 | 10.2 | 9.5
1/10–1/30 | Substantial evidence for H1 | 3.4 | 4.1 | 2.4 | 1.9 | 2.0 | 2.9
1/30–1/100 | Strong evidence for H1 | 0.8 | 1.4 | 1.6 | 1.0 | 0.6 | 1.3
1/100–1/300 | Very strong evidence for H1 | 0.8 | 0.3 | 0.7 | 0.1 | 0.6 | 0.2
< 1/300 | Decisive evidence for H1 | 0.8 | 0.3 | 0.2 | 0.4 | 0.2 | 0.2
We provide the percentage of incorrect inferences to compare the three approaches. It is, however, not entirely straightforward to compare them, as their significance criteria differ. Hence, in this experiment, decisive evidence for H 1 : β ≠ 0 is considered an acceptable decision favoring the alternative hypothesis, while weak, moderate, substantial, strong, and very strong evidence are considered acceptable decisions favoring the null hypothesis H 0 : β = 0. Likewise, the p-value and plausibility approaches use cut-offs of 0.10, 0.05, and 0.01 to make decisions about the null hypothesis: if the p-value or plausibility is less than the cut-off, the null hypothesis is rejected; otherwise it is accepted.
We begin our experiment with the case in which H 1 : β 1 ≠ 0 must be accepted, as we set β 1 = 3 in the data generating process (Eq. (16)). As we can observe in Tables 4–9, the plausibility calibration produces the lowest rate of incorrect inferences compared with the p-value and the four MBFs when N = 10; at the 0.10 criterion, the percentage of incorrect inferences of both the plausibility and p-value methods is 0%. If we consider the more restrictive decision points, the 0.05 and 0.01 criteria, the incorrect inferences of the plausibility method remain 0%, thus favoring H 1 : β 1 ≠ 0, whereas the p-value approach fails to reject the null hypothesis in 0.1% and 0.3% of the data sets at the 0.05 and 0.01 criteria, respectively. This indicates that there is a small chance that the p-value is misleading, and it highlights the high reliability of the plausibility test when H 0 : β 1 = 0 must be rejected. In the case of the MBFs, among the four methods, we find that the MBF of Sellke et al. [ 19 ] performs well in this experiment, as its percentage of incorrect inferences is only 1.3%. Yet this rate is still higher than those of the p-value and plausibility methods. For samples of N > 10, all three approaches provide the same interpretation, as the percentage of incorrect inferences is 0%.
Next, we consider the case in which the null hypothesis H 0 : β 2 = 0 is true and must be accepted. As can be seen in Tables 4–9, the results here are more heterogeneous. To give a clearer picture, we summarize the percentage of incorrect inferences in Figure 3, where different lines indicate different methods. The results show that there is variability in the evidence from the tests. First, the right panel of Figure 3 shows that the percentage of incorrect inferences of all methods decreases as the sample size increases to N = 100 and remains roughly constant thereafter. Second, the minimum Bayes factor of Edwards et al. [ 5 ] produces the lowest rate of incorrect inferences.
Figure 3. Summary of the percentage of incorrect inferences.
For a closer look at the behavior of the minimum Bayes factor of Edwards et al. [ 5 ] in Table 7, the percentage of data sets in which this method finds moderate, substantial, strong, very strong, or decisive evidence is always less than or equal to 5%. Meanwhile, for the p-value approach ( Table 4 ), the percentage of incorrect inferences ranges between 9.5% and 13.5%. This indicates that the p-value states the amount of evidence against H 0 : β 2 = 0 as approximately 2.3–3.7 times (computed as the percentage of incorrect inferences of the p-value divided by that of the MBF) as much as the MBF of Edwards et al. [ 5 ] does. In other words, the p-value exaggerates statistical significance by a factor of roughly 2 to 4 relative to the MBF. Therefore, we can confidently argue that the conclusion derived from p-values is less accurate as a measure of the strength of evidence against H 0 : β 2 = 0.
Although the plausibility approach does not produce the lowest rate of incorrect inferences in this case, its rate is still lower than that of the p-value approach for all sample sizes except N = 10. Table 4 also reveals that the percentage of incorrect inferences from the p-value (wrongly rejecting H 0 : β 2 = 0) decreases as the sample size grows. Overall, this indicates that the plausibility approach is more accurate for making decisions about the null hypothesis than the traditionally used p-value thresholds. Therefore, from an empirical or applied point of view, we could consider this alternative a useful tool for researchers to avoid false discovery claims.
In addition, we also plot boxplots displaying the full range of variation of the p-values, plausibilities, and MBFs, obtained from the same simulation results as Tables 4–9, in Figures 4–6, respectively. In all panels, the y-axis shows the probability values obtained from the different methods and sample sizes.
Figure 4. The full range of variation (from min to max) of p-values.
Figure 5. The full range of variation (from min to max) of plausibility (PL).
Figure 6. The full range of variation (from min to max) of MBFs.
Considering the case that the null hypothesis H 0 : β 1 = 0 must be rejected (the true β 1 is 3). In other words, there is strong evidence favoring the alternative hypothesis. As shown in the left panel of Figures Figures4 4 – 6 , there is a small variation in the probability values for all methods. When the sample size is greater than 10, all methods show the evidence supporting the alternative hypothesis. However, in the case of small sample size, say N = 10, there is a number of times that our testing methods lead to misinterpretation. Among 1,000 simulated datasets, we can see that p -value favors H 1 : β 1 ≠ 0 one time when using the 0.05 criterion while the plausibility gives no evidence of supporting the alternative hypothesis. For the four MBFs, similar results are shown. The variation of MBFs is also similar to those of the p -value and plausibility approaches, except for N = 10. This indicates that the power of any test depends on the sample size. If the sample size is large enough, the test will be more reliable, especially when the null hypothesis H 0 : β 1 = 0 must be rejected. However, there is no evidence supporting the reliability of the test for the case of the null hypothesis that must be accepted.
Using the decisive-evidence criterion for H1: β1 ≠ 0 (MBF01 < 1/300), the number of times that the MBFs produce values above this threshold (i.e., fail to reach decisive evidence) is relatively high compared with the p-value and plausibility approaches (Figure 6). These results indicate that the hypothesis test may be misinterpreted when the number of observations is small. However, if we use the weak-evidence criterion for H1: β1 ≠ 0 (1/3 < MBF01 ≤ 1), there is little evidence that any MBF method falls within this range, except the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19], and even these have only a small chance of providing the wrong interpretation. This result corresponds to the results reported in Tables 7 and 8.
Furthermore, the variation of the p-values, plausibility values, and MBFs in the right panels of Figures 4–6 provides another view of the test of β2. Recall that in this case the null hypothesis is correct (β2 = 0) and must be accepted. The variation here is relatively high compared with the case in which the null hypothesis is incorrect (as reflected by the greater heights of the boxes). Under this test, the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19], shown in panels (b) and (c) of Figure 6, display small variability, as their boxes are short. The medians of these two MBFs exceed 0.9, indicating weak evidence for H1. However, some outliers lie below 1/300 (MBF01 < 1/300), indicating a small chance that the MBF favors decisive evidence for H1. Consider next the MBFs of Goodman [6] and Sellke et al. [19]: across all sample sizes, their median MBFs are around 0.8 and 0.9, respectively, so these two methods also tend to indicate only weak evidence for the alternative hypothesis. Moreover, the low values of these two MBFs are not confined to a few outliers, implying a higher chance of favoring decisive evidence for H1. Therefore, we conclude that the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19] perform well and provide more accurate testing for the case in which the null hypothesis must be accepted. This result corresponds to the results reported in Tables 7 and 8.
In a nutshell, these simulation results provide evidence of the strong performance of the plausibility approach when the null hypothesis is correct and must be accepted, while the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19] provide more reliable results than all other methods when the null hypothesis is incorrect. Yet there is no evidence of 100% correct inferences in either case, which indicates that decision-making based on these approaches will be, to some extent, arbitrary.
Finally, we compare the MBFs, the plausibility, and the p-value using a real application on the impact of economic variables on the energy price in Spain. We use a dataset from the R package 'MSwM' [18] covering the price of energy in Spain ($P_t$) and other economic variables, namely the oil price ($O_t$), the gas price ($G_t$), the coal price ($C_t$), the Dollar–Euro exchange rate ($Ex_t$), the Ibex 35 index divided by one thousand ($Ibex_t$), and the daily demand of energy ($D_t$). The data were collected from the Spanish Market Operator of Energy (OMEL), the Bank of Spain, and the U.S. Energy Information Administration, covering January 1, 2002 to October 31, 2008. We consider the following linear regression model:

$$P_t = \beta_0 + \beta_1 O_t + \beta_2 G_t + \beta_3 C_t + \beta_4 Ex_t + \beta_5 Ibex_t + \beta_6 D_t + \varepsilon_t$$
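A sketch of how this regression can be reproduced from the MSwM package is given below. The dataset name (energy) and the column names (Price, Oil, Gas, Coal, EurDol, Ibex35, Demand) are taken from our reading of the package documentation and should be verified; the Vovk–Sellke bound is shown as one illustrative MBF, not the paper's exact calculation.

```r
# Sketch using the 'energy' data from the MSwM package (column names assumed from
# the package documentation: Price, Oil, Gas, Coal, EurDol, Ibex35, Demand).
library(MSwM)
data(energy)

fit   <- lm(Price ~ Oil + Gas + Coal + EurDol + Ibex35 + Demand, data = energy)
coefs <- summary(fit)$coefficients
p     <- coefs[, "Pr(>|t|)"]

# Illustrative minimum-Bayes-factor bound (Vovk-Sellke); the paper's MBFs may differ.
mbf_vs <- ifelse(p < exp(-1), -exp(1) * p * log(p), 1)
round(cbind(estimate = coefs[, "Estimate"], p_value = p, mbf_vovk_sellke = mbf_vs), 4)
```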
The results of the three statistical approaches for this application are provided in Table 10, where we list each covariate's coefficient together with the p-value, the plausibility, and the four types of Minimum Bayes factor.
Variable | Coefficient | p-value | Plausibility | Goodman | Edwards et al. | Vovk and Sellke et al. | Sellke et al.
---|---|---|---|---|---|---|---
Intercept | −9.1253 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Oil price ($O_t$) | 0.0284 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Gas price ($G_t$) | 0.0430 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Coal price ($C_t$) | −0.0021 | 0.2800 | 0.0000 | 0.5575 | 0.9936 | 0.9686 | 0.6423
Exchange rate ($Ex_t$) | 6.0403 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Ibex 35 ($Ibex_t$) | −0.1590 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Daily demand ($D_t$) | 0.0089 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
Considering the tests in Table 10, we find strong evidence that six out of the seven coefficients favor the alternative hypothesis. All three approaches provide the same interpretation for these coefficients: the p-value is less than 0.01, which corresponds to decisive evidence for the alternative hypothesis under all four MBF methods. However, there is a contradictory result for coefficient β3, where each method leads to a different interpretation. The p-value indicates an insignificant effect of the coal price on the energy price in Spain, whereas the plausibility gives a significant result, and all four MBF methods categorize the evidence as only weak evidence for the alternative hypothesis. These results correspond to our simulation experiment in Section 3, which shows that the rate of incorrect inferences is relatively high when the null hypothesis is incorrect compared with the case when the null is correct. The application results suggest that researchers need to be careful when interpreting a statistical result and that different approaches should be used to cross-check one another.
In this paper, we highlight some of the misconceptions about the p-value, illustrate its performance using simulated experiments, and introduce two alternatives to the p-value, namely the plausibility and the Minimum Bayes factor (MBF), for assessing the evidence against a simple null hypothesis in the linear regression context. The MBF is an alternative to the p-value offered by the Bayesian approach; it relies solely on the observed sample to provide direct probability statements about the parameters of interest. The plausibility approach can be viewed as an extension of the Minimum Bayes factor approach in that the plausibility is computed as a ratio of relative likelihoods; however, it transforms the value of the parameter itself rather than the p-value or t-statistic. Thus, the plausibility can be computed directly for any parameter.
The values of the MBF and the plausibility lie in the same range as the p-value, which facilitates comparison. While the plausibility is interpreted in the same way as the p-value, with 0.1, 0.05, and 0.01 as the cut-offs or decision criteria, the MBF is interpreted following the labeled intervals of Goodman [6]: an MBF between 1 and 1/3 is considered weak evidence for H1, 1/3–1/10 moderate evidence, 1/10–1/30 substantial evidence, 1/30–1/100 strong evidence, 1/100–1/300 very strong evidence, and below 1/300 decisive evidence. To compare these three approaches, we conduct a simulation study of the incorrect inferences produced by each approach.
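For readers who want to apply this labeling in practice, the small helper below maps an MBF value onto the categories listed above; it is a direct transcription of those intervals.

```r
# Map an MBF value to the evidence categories described above.
mbf_label <- function(mbf) {
  cut(mbf,
      breaks = c(0, 1/300, 1/100, 1/30, 1/10, 1/3, 1),
      labels = c("decisive", "very strong", "strong",
                 "substantial", "moderate", "weak"),
      include.lowest = TRUE)
}

mbf_label(c(0.9936, 0.5575, 0.002))   # weak, weak, very strong
```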
Our results show that the plausibility approach is more accurate for making decisions about the null hypothesis than the traditionally used p-value when the null hypothesis is true and must be accepted, whereas the MBFs of Edwards et al. [5] and of Vovk [22] and Sellke et al. [19] provide more reliable results than all other methods when the null hypothesis is false and must be rejected. Even so, there is no evidence of 100% correct inferences in the latter case, which indicates that decision-making based on these approaches will be, to some degree, arbitrary when the null hypothesis is incorrect. As we mention in the introduction, it is dangerous to rely on binary decisions alone. Hence, decisions in favor of either hypothesis should take into account the whole categorization of the MBF in order to avoid such overly strong inference. In addition, these alternatives can be considered useful tools for researchers who wish to avoid false discovery claims based on the p-value.
Nevertheless, our discussion should not be taken as a recommendation that researchers and practitioners avoid the p-value entirely. Rather, we should examine the misconceptions surrounding the p-value and look for alternative methods with better statistical interpretations and properties. Finally, we note that research involves much more than the statistical interpretation stage, and researchers should interpret their results carefully. Instead of banning or rejecting the p-value outright, we suggest considering all of these statistical tests in order to achieve reliable results. Furthermore, non-statistical evidence, such as theory and real-world evidence, should also be brought to bear on decision making. This will help us obtain more reliable results.
The authors would like to thank the four anonymous reviewers, the editor, and Prof. Hung T. Nguyen for his helpful comments and suggestions. The financial support of this work is provided by Center of Excellence in Econometrics, Chiang Mai University.
No potential conflict of interest was reported by the author(s).
I have some data that is highly correlated. If I run a linear regression I get a regression line with a slope close to one (= 0.93). What I'd like to do is test if this slope is significantly different from 1.0. My expectation is that it is not. In other words, I'd like to change the null hypothesis of the linear regression from a slope of zero to a slope of one. Is this a sensible approach? I'd also really appreciate it if you could include some R code in your answer so I could implement this method (or a better one you suggest!). Thanks.
Your hypothesis can be expressed as $R\beta=r$, where $\beta$ is the vector of regression coefficients, $R$ is the restriction matrix, and $r$ is the vector of restrictions. If our model is
$$y=\beta_0+\beta_1x+u$$
then for the hypothesis $\beta_1=1$, $R=[0,1]$ and $r=1$.
For these types of hypotheses you can use the linearHypothesis function from the package car:
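The code from the original answer is not reproduced here; a minimal sketch of the idea, with simulated placeholder data (dat, x, and y are illustrative names), might look like this:

```r
# Test H0: slope = 1 using car::linearHypothesis, on placeholder data.
library(car)

set.seed(42)
dat <- data.frame(x = rnorm(100))
dat$y <- 0.2 + 0.95 * dat$x + rnorm(100, sd = 0.1)   # slope close to 1, as in the question

model <- lm(y ~ x, data = dat)
linearHypothesis(model, "x = 1")                      # character form of the restriction
# equivalently: linearHypothesis(model, hypothesis.matrix = c(0, 1), rhs = 1)
```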
It seems you're still trying to reject a null hypothesis. There are loads of problems with that, not the least of which is that it's possible that you don't have enough power to see that you're different from 1. It sounds like you don't care that the slope is 0.07 different from 1. But what if you can't really tell? What if you're actually estimating a slope that varies wildly and may actually be quite far from 1 with something like a confidence interval of ±0.4. Your best tactic here is not changing the null hypothesis but actually speaking reasonably about an interval estimate. If you apply the command confint() to your model you can get a 95% confidence interval around your slope. Then you can use this to discuss the slope you did get. If 1 is within the confidence interval you can state that it is within the range of values you believe likely to contain the true value. But more importantly you can also state what that range of values is.
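A minimal sketch of that interval-estimate approach, again with simulated placeholder data:

```r
# 95% confidence interval for the slope; check whether 1 lies inside it.
set.seed(42)
dat <- data.frame(x = rnorm(100))
dat$y <- 0.2 + 0.95 * dat$x + rnorm(100, sd = 0.1)

model <- lm(y ~ x, data = dat)
confint(model, parm = "x", level = 0.95)   # does the interval contain 1?
```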
The point of testing is that you want to reject your null hypothesis, not confirm it. The fact that there is no significant difference, is in no way a proof of the absence of a significant difference. For that, you'll have to define what effect size you deem reasonable to reject the null.
Testing whether your slope is significantly different from 1 is not that difficult: you just test whether the difference $slope - 1$ differs significantly from zero. By hand, this would be something like the following:
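The by-hand code from the original answer is not shown above; a sketch under placeholder data might be:

```r
# By-hand t-test of H0: slope = 1 (placeholder data).
set.seed(42)
x <- rnorm(100)
y <- 0.2 + 0.95 * x + rnorm(100, sd = 0.1)

fit     <- lm(y ~ x)
slope   <- coef(summary(fit))["x", "Estimate"]
seslope <- coef(summary(fit))["x", "Std. Error"]
df      <- fit$df.residual

t_stat <- (slope - 1) / seslope
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
c(slope = slope, t = t_stat, p = p_val)
```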
Now you should be aware that the effect size at which a difference becomes significant is roughly the critical t value times the standard error of the slope (qt(0.975, df) * seslope), provided that we have a decent estimator of the standard error of the slope. Hence, if you decide that a significant difference should only be detected from 0.1 onward, you can calculate the necessary degrees of freedom as follows:
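The original snippet is likewise missing; one way to carry out this search, assuming a fixed placeholder value for seslope, is:

```r
# Smallest df for which the detectable difference qt(0.975, df) * seslope drops below 0.1.
seslope <- 0.03                                  # assumed standard error of the slope
df_grid <- 1:100
detectable <- qt(0.975, df_grid) * seslope       # minimal detectable difference at each df
min(df_grid[detectable <= 0.1])
```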
Mind you, this is pretty dependent on the estimate of seslope. To get a better estimate of seslope, you could resample your data. A naive way would be:
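A sketch of such a naive resampling step (placeholder data; the slope standard error is re-estimated from bootstrap resamples):

```r
# Naive bootstrap of the slope standard error (placeholder data).
set.seed(42)
x <- rnorm(100)
y <- 0.2 + 0.95 * x + rnorm(100, sd = 0.1)
dat <- data.frame(x, y)

n_boot <- 2000
slopes <- replicate(n_boot, {
  idx <- sample(nrow(dat), replace = TRUE)
  coef(lm(y ~ x, data = dat[idx, ]))["x"]
})
seslope2 <- sd(slopes)    # resampled estimate of the slope standard error
seslope2
```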
Putting seslope2 into the same calculation returns the corresponding degrees of freedom. All this will tell you that your dataset will return a significant result sooner than you deem necessary, and that you only need 7 degrees of freedom (in this case 9 observations) if you want to be sure that "non-significant" means what you want it to mean.
You simply cannot make probability or likelihood statements about the parameter using a confidence interval; that is a Bayesian paradigm.
What John is saying is confusing because there is an equivalence between CIs and p-values, so at the 5% level, saying that your CI includes 1 is equivalent to saying that the p-value is greater than 0.05.
linearHypothesis allows you to test restrictions different from the standard beta=0
David McKenzie
One of the shortest posts I wrote for the blog was on a joint test of orthogonality when testing for balance between treatment and control groups. Given a set of k covariates X1, X2, X3, …., Xk, this involves running the regression:
Treatment = a + b1X1+b2X2+b3X3+…+bkXk + u
And then testing the joint hypothesis b1=b2=b3=…=bk=0. This could be done by running the equation as a linear regression and using an F-test, or running it as a probit and using a chi-squared test. If the experiment is stratified, you might want to do this conditioning on randomization strata , especially if the probability of assignment to treatment varies across strata , and if the experiment is clustered, then the standard errors should be clustered. There are questions about whether it is desirable at all to do such tests when you know for sure the experiment was correctly randomized, but let’s assume you want to do such a test, perhaps to show the sample is still balanced after attrition, or that a randomization done in the field was done correctly.
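To fix ideas, here is a minimal sketch of this omnibus regression and F-test in R, with placeholder data and conventional (non-robust) standard errors; the over-rejection problem discussed below concerns the heteroskedasticity-robust versions of this test.

```r
# Joint orthogonality check: regress treatment on baseline covariates, F-test all slopes = 0.
set.seed(123)
n <- 500
X <- as.data.frame(matrix(rnorm(n * 10), n, 10))   # 10 placeholder baseline covariates
names(X) <- paste0("x", 1:10)
X$treatment <- rbinom(n, 1, 0.5)                   # randomly assigned treatment

full <- lm(treatment ~ ., data = X)
null <- lm(treatment ~ 1, data = X)
anova(null, full)                                  # F-test of H0: b1 = ... = bk = 0
summary(full)$fstatistic                           # same overall F statistic
```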
One of the folk wisdoms is that researchers sometimes are surprised to find this test rejecting the null hypothesis of joint orthogonality, especially when they have a lot of variables in their balance table, or when they have multiple treatments and estimate a multinomial logit. A new paper by Jason Kerwin, Nada Rostom and Olivier Sterck shows this via simulations, and offers a solution.
Joint orthogonality tests based on standard robust standard errors over-reject the null, especially when k is large relative to n
Kerwin et al. look at both joint orthogonality tests, as well as the practice of doing pairwise t-tests (or group F-tests with multiple treatments) and doing some sort of “vote counting” where e.g. researchers look to see whether more than 10 percent of the tests reject the null at the 10% level. They run simulations for two data generating processes they specify (one using individual level randomization, and one clustered), and with data from two published experiments (one with k=33 and n=698 and individual level randomization, and one with k=10 and clustered randomization with 1016 units in 148 clusters).
They find that standard joint orthogonality tests with “robust” standard errors (HC1, HC2, or HC3) over-reject the null in their simulations:
· When n=500 and k=50, in one data generating process the test rejects the null at the 10% level approximately 50% of the time! That is, in half the cases researchers would conclude that a truly randomized experiment resulted in imbalance between treatment and control.
· Things look a lot better if n is large relative to k. With n=5000, size is around the correct 10% even for k=50 or 60; when k=10, size looks pretty good for n=500 or more.
· The issue is not surprisingly worse in clustered experiments, where the effective degrees of freedom are lower.
What is the problem?
The problem is that standard Eicker-White robust standard error asymptotics do not hold when the number of covariates are large relative to the sample size . Cattaneo et al. (2018) provide discussion and proofs, and suggest that the HC3 estimator can be conservative and used for inference – although Kerwin et al. still find overrejection using HC3 in their simulations. In addition to the number of covariates, leverage matters a lot – and having a lot of covariates and small sample can increase leverage.
So what are the solutions?
The solution Kerwin et al. propose is to use omnibus tests with randomization inference instead of regression standard errors. They show this gives the correct size in their simulations, works with clustering, and also works with multiple treatments. They show this makes a difference in practice to the published papers they relook at: in one, the F-test p-value from HC1 clustered standard errors is p=0.088, whereas it would be 0.278 using RI standard errors; and similarly a regression clustered standard error p-value of 0.068 becomes 0.186 using RI standard errors – so using randomization inference makes the published papers claim of balanced randomization more credible (for once a methods paper that strengthens existing results!).
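To make the idea concrete, here is a rough sketch of a randomization-inference version of the omnibus test for a simple individually randomized design; in practice the re-randomization step should mirror the actual assignment mechanism, including strata and clusters, rather than a plain permutation.

```r
# Randomization-inference p-value for the joint orthogonality F statistic (placeholder data).
set.seed(123)
n <- 500
X <- as.data.frame(matrix(rnorm(n * 10), n, 10))
names(X) <- paste0("x", 1:10)
treatment <- rbinom(n, 1, 0.5)

f_stat <- function(tr) summary(lm(tr ~ ., data = X))$fstatistic["value"]

obs_F  <- f_stat(treatment)
perm_F <- replicate(999, f_stat(sample(treatment)))   # re-randomize assignment
mean(c(perm_F, obs_F) >= obs_F)                       # RI p-value
```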
My other suggestion is for researchers to also think carefully about how many variables they are putting in their balance tables in the first place. We are most concerned about imbalances in variables that will be highly correlated with outcomes of interest – but also often like to use this balance table/Table 1 to provide some summary statistics that help provide context and details of the sample. The latter is a reason for more controls, but keeping to 10-20 controls rather than 30-50 seems plenty to me in most cases – and also will help with journals having restrictions on how many rows your tables can have. Pre-registering which variables will go into this test then helps guard against selective reporting. There are also some parallels to the use of methods such as pdslasso to choose controls – I have a new working paper coming out soon on using this method with field experiments, and one of the lessons there is putting in too many variables can result in a higher chance of not selecting the ones that matter.
Another practical note
Another practical note with these tests is that it is common to have a few missing values for some baseline covariates – e.g., age might be missing for 3 cases, gender for one, education for a few others, etc. This does not present such a problem for pairwise t-tests (where you are then testing that treatment and control are balanced for the subsample that has data on a particular variable). But for a joint orthogonality F-test, the regression would then only be estimated for the subsample with no missing data, which could be a lot smaller than n. Researchers then need to think about dummying out the missing values before running this test – but this can result in a whole lot more (often highly correlated) covariates in the form of dummy variables for those missing values; a sketch of this step is shown below. This is another reason to be judicious about which variables go into the omnibus test and to focus on a subset of variables without many missing values.
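For example, one simple (assumed, not prescribed) way to implement the dummying-out step before running the omnibus regression:

```r
# Dummy out missing baseline covariates before the omnibus balance regression.
dummy_out <- function(df) {
  for (v in names(df)) {
    if (anyNA(df[[v]])) {
      df[[paste0(v, "_missing")]] <- as.integer(is.na(df[[v]]))  # missingness indicator
      df[[v]][is.na(df[[v]])] <- 0                               # fill NA with a constant
    }
  }
  df
}

# Example with placeholder data:
baseline <- data.frame(age = c(25, NA, 40), female = c(1, 0, NA), educ = c(12, 16, 9))
dummy_out(baseline)
```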
Lead Economist, Development Research Group, World Bank
The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, leading to potentially biased estimates. We propose the use of random forest analysis and lasso regression as alternative methods to select auxiliary variables, particularly in situations in which the missing data pattern is nonlinear or otherwise complex (i.e., interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods ( t -tests, Little’s MCAR test, logistic regressions), in terms of both selecting auxiliary variables and the performance of said auxiliary variables when incorporated in an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improvement of practical methods for handling missing data in statistical analyses.
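A rough sketch of the general idea (not the authors' simulation code): build an indicator for missingness on the analysis variable, then use a random forest's variable importance or a lasso-penalized logistic regression to flag candidate auxiliary variables. The packages used (randomForest, glmnet) are standard CRAN tools cited in the reference list, but the settings and data below are assumptions.

```r
# Sketch: flag candidate auxiliary variables for a missingness indicator.
library(randomForest)
library(glmnet)

set.seed(1)
n <- 500
aux <- as.data.frame(matrix(rnorm(n * 5), n, 5)); names(aux) <- paste0("a", 1:5)
y <- 0.5 * aux$a1 + rnorm(n)
y[runif(n) < plogis(-1 + 1.5 * aux$a2^2)] <- NA     # nonlinear (convex) missingness in a2
miss <- factor(is.na(y))

# Random forest variable importance for predicting missingness
rf <- randomForest(x = aux, y = miss, importance = TRUE)
importance(rf)[, "MeanDecreaseAccuracy"]

# Lasso-penalized logistic regression (nonzero coefficients = selected variables)
cv <- cv.glmnet(as.matrix(aux), miss, family = "binomial")
coef(cv, s = "lambda.min")
```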
Data availability.
Reiterating the open practices statement above, all simulation files and worked example code are available on an Open Science Framework Repository at: https://osf.io/q84ts/ .
We note that although extensions of both FIML and multiple imputation have been developed to handle MNAR missing data, we refer throughout the paper to the more widely known and used MAR-based versions of these methods—e.g., invoking FIML estimation under missing data by setting arguments missing = “FIML” and fixed.x = “TRUE” in the lavaan package in R, as in the simulation reported later in the paper.
Note that the goal of satisfying the MAR assumption is aspirational but unverifiable in practice: in real datasets, researchers can never be certain that (a) they have identified true causes, as opposed to correlates, of missing data; (b) they have identified all such causes of missingness and all are measured and available in the dataset; and (c) missing values are not additionally caused by participants’ unseen scores on the variables in question, resulting in an analysis satisfying the MNAR mechanism. In other words, researchers can never be certain that the MAR assumption is (fully) met; rather, researchers can only render MAR more plausible by searching for and including useful auxiliary variables in analysis. In practice, researchers can never distinguish between MAR and MNAR mechanisms, as doing so would require access to participants’ unseen (missing) scores on all variables with missing data.
Our collective experience collaborating with and providing statistical consultation for numerous substantive and applied researchers has led us to the firm conviction that successful convergence of complex multiple imputation models is by no means a foregone conclusion, especially when models incorporate complexities such as those listed above. The definition of “successful convergence” for multiple imputation is crucial to this conclusion. While on the user end one may achieve successful results with no warning message in most software packages, investigation of recommended imputation diagnostics might demonstrate untrustworthy performance (see, e.g., Enders, 2022 ; Hayes & Enders, 2023 ).
Unless the researcher has decisive reasons to believe that the data are MCAR, such as when missing data are caused by a lab computer periodically crashing in a haphazard manner unrelated to participants’ characteristics or when the researcher has used a planned missing data design to purposefully inject MCAR missing data.
Alternatively, the researcher might include all of the substantive model variables as well, which would allow the researcher to assess whether the candidate auxiliary variables \({a}_{1}\), \({a}_{2}\), and \({a}_{3}\) predict missing data above and beyond the variable(s) in the substantive model (i.e., x, smoking attitudes, in the hypothetical example).
Admittedly, this poses no shortcoming when assessing the types of inherently parabolic convex missing mechanisms under specific consideration in the present study, but may hinder generalizations to other, thornier, less orthodox functional forms of the relationship between auxiliary variables and missing data indicators.
Note that this implies that the permutation importance test was conducted using marginal rather than partial variable importance, as described by Strobl et al. (2020). Based on pilot simulations, this procedure performed substantially better than partial variable importance measures. Because our goal here was not a detailed comparison of these options, however, we do not discuss partial importance measures further.
Note that we also ran a set of analyses that included no auxiliary variables and that estimated the model using listwise deletion rather than FIML, using argument missing = “listwise” in lavaan. Because the results of these listwise analyses were identical to those of the “no auxiliary variable” FIML analyses, we opted to conserve space by omitting them from our presentation here.
This can be said of the interactive mechanism here because it was designed to mimic the effects of a convex functional form, despite missing data rates depending on the values of two, rather than just one, auxiliary variables.
Arbuckle, J. N. (1996). Full information estimation in the presence of incomplete data. In Advanced structural equation modeling. (pp. 243–277). Lawrence Erlbaum Associates. Inc.
Berk, R. A. (2009). Statistical learning from a regression perspective . Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24 (2), 123–140. https://doi.org/10.1007/BF00058655
Breiman, L. (2001). Random Forests. Machine Learning, 45 (1), 5–32. https://doi.org/10.1023/A:1010933404324
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees . Wadsworth.
Cohen, J., Cohen, P., Aiken, L. S., & West, S. G. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd Ed.). Lawrence Erlbaum Associates, Inc.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6 (4), 330–351. https://doi.org/10.1037/1082-989X.6.4.330
Debeer, D., Hothorn, T., & Strobl, C. (2021). permimp: Conditional Permutation Importance (R package version 1.0–2). https://CRAN.R-project.org/package=permimp
Debeer, D., & Strobl, C. (2020). Conditional permutation importance revisited. BMC Bioinformatics, 21 (1), 307. https://doi.org/10.1186/s12859-020-03622-2
Dixon, W. J. (1988). BMDP statistical software . University of California Press.
Enders, C. K. (2021). Applied missing data analysis (2nd ed.). Manuscript in press at Guilford Press.
Enders, C. K. (2022). Applied missing data analysis (2nd Ed.). The Guilford Press.
Enders, C. K. (2023). Fitting structural equation models with missing data. In Handbook of structural equation modeling (2nd Ed., pp. 223–240). The Guilford Press.
Enders, C. K., Du, H., & Keller, B. T. (2020). A model-based imputation procedure for multilevel regression models with random coefficients, interaction effects, and nonlinear terms. Psychological Methods, 25 (1), 88–112. https://doi.org/10.1037/met0000228
Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 10 (1), 80–100. https://doi.org/10.1207/S15328007SEM1001_4
Grund, S., Lüdtke, O., & Robitzsch, A. (2021). Multiple imputation of missing data in multilevel models with the R package mdmb: A flexible sequential modeling approach. Behavior Research Methods . https://doi.org/10.3758/s13428-020-01530-0
Hapfelmeier, A., Hothorn, T., Ulm, K., & Strobl, C. (2014). A new variable importance measure for random forests with missing data. Statistics and Computing, 24 (1), 21–34. https://doi.org/10.1007/s11222-012-9349-1
Hapfelmeier, A., & Ulm, K. (2013). A new variable selection approach using Random Forests. Computational Statistics & Data Analysis, 60 , 50–69. https://doi.org/10.1016/J.CSDA.2012.09.020
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning . Springer-Verlag.
Hayes, T., & Enders, C. K. (2023). Maximum Likelihood and Multiple Imputation Missing Data Handling: How They Work, and How to Make Them Work in Practice. In H. Cooper, A. Panter, D. Rindskopf, K. , Sher, M. Coutanche, & L. McMullen (Eds.), APA Handbook of Research Methods in Psychology (2nd Ed.). American Psychological Association.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12 (1), 55. https://doi.org/10.2307/1267351
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15 (3), 651–674. https://doi.org/10.1198/106186006X133933
IBM Corp. (2022). IBM SPSS Statistics for Macintosh, Version 29.0 . IBM Corp.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning with applications in R (2nd Ed.). Springer.
Jamshidian, M., & Jalal, S. (2010). Tests of Homoscedasticity, Normality, and Missing Completely at Random for Incomplete Multivariate Data. Psychometrika, 75 (4), 649–674. https://doi.org/10.1007/s11336-010-9175-3
Jamshidian, M., Jalal, S., & Jansen, C. (2014). MissMech : An R Package for Testing Homoscedasticity, Multivariate Normality, and Missing Completely at Random (MCAR). Journal of Statistical Software , 56 (6), 1–31. https://doi.org/10.18637/jss.v056.i06
Jeliĉić, H., Phelps, E., & Lerner, R. M. (2009). Use of missing data methods in longitudinal studies: The persistence of bad practices in developmental psychology. Developmental Psychology, 45 (4), 1195–1199. https://doi.org/10.1037/a0015665
Kim, K. H., & Bentler, P. M. (2002). Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika, 67 (4), 609–624. https://doi.org/10.1007/BF02295134
Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software , 36 (11), 1–13. https://doi.org/10.18637/jss.v036.i11
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2 (3), 12–22.
Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83 (404), 1198–1202. https://doi.org/10.1080/01621459.1988.10478722
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). Wiley. https://doi.org/10.1002/9781119013563
Muthén, B., Kaplan, D., & Hollis, M. (1987). On structural equation modeling with data that are not missing completely at random. Psychometrika, 52 (3), 431–462. https://doi.org/10.1007/BF02294365
Nicholson, J. S., Deboeck, P. R., & Howard, W. (2017). Attrition in developmental psychology: A review of modern missing data reporting and practices. International Journal of Behavioral Development, 41 (1), 143–153. https://doi.org/10.1177/0165025415618275
Park, T., & Lee, S.-Y. (1997). A test of missing completely at random for longitudinal data with missing observations. Statistics in Medicine, 16 (16), 1859–1871. https://doi.org/10.1002/(SICI)1097-0258(19970830)16:16%3c1859::AID-SIM593%3e3.0.CO;2-3
R Core Team. (2022). R: A language and environment for statistical computing . R Foundation for Statistical Computing. http://r-project.org/
Raghunathan, T. E. (2004). What Do We Do with Missing Data? Some Options for Analysis of Incomplete Data. Annual Review of Public Health, 25 (1), 99–117. https://doi.org/10.1146/annurev.publhealth.25.102802.124410
Raykov, T., & Marcoulides, G. A. (2014). Identifying Useful Auxiliary Variables for Incomplete Data Analyses. Educational and Psychological Measurement, 74 (3), 537–550. https://doi.org/10.1177/0013164413511326
Raykov, T., & West, B. T. (2016). On enhancing plausibility of the missing at random assumption in incomplete data analyses via evaluation of response-auxiliary variable correlations. Structural Equation Modeling, 23 (1), 45–53. https://doi.org/10.1080/10705511.2014.937848
Rosseel, Y. (2012). lavaan : An R package for structural equation modeling. Journal of Statistical Software , 48 (2), 1–36. https://doi.org/10.18637/jss.v048.i02
Rothacher, Y., & Strobl, C. (2023a). Identifying Informative Predictor Variables with Random Forests. Journal of Educational and Behavioral Statistics, Advance Online Publication. https://doi.org/10.3102/10769986231193327
Rothacher, Y., & Strobl, C. (2023b). Identifying Informative Predictor Variables With Random Forests. Journal of Educational and Behavioral Statistics . https://doi.org/10.3102/10769986231193327
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63 (3), 581–592. https://doi.org/10.2307/2335739
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys . Wiley.
Savalei, V., & Bentler, P. M. (2009). A two-stage approach to missing data: Theory and application to auxiliary variables. Structural Equation Modeling, 16 (3), 477–497. https://doi.org/10.1080/10705510903008238
Schafer, J. L. (1997). Analysis of incomplete multivariate data . Chapman & Hall.
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9 (1), 307. https://doi.org/10.1186/1471-2105-9-307
Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8 (1), 25. https://doi.org/10.1186/1471-2105-8-25
Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14 (4), 323–348. https://doi.org/10.1037/a0016973
Tay, J. K., Narasimhan, B., & Hastie, T. (2023). Elastic Net Regularization Paths for All Generalized Linear Models. Journal of Statistical Software , 106 (1), 1–31. https://doi.org/10.18637/jss.v106.i01
Thoemmes, F., & Rose, N. (2014). A Cautious Note on Auxiliary Variables That Can Increase Bias in Missing Data Problems. Multivariate Behavioral Research, 49 (5), 443–459. https://doi.org/10.1080/00273171.2014.931799
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://www.jstor.org/stable/2346178
van Ginkel, J. R., Linting, M., Rippe, R. C. A., & van der Voort, A. (2020). Rebutting existing misconceptions about multiple imputation as a method for handling missing data. Journal of Personality Assessment, 102 (3), 297–308. https://doi.org/10.1080/00223891.2018.1530680
Woods, A. D., Gerasimova, D., Dusen, B. Van, Nissen, J., Bainter, S., Uzdavines, A., Davis-Kean, P., Halvorson, M. A., King, K., Logan, J., Xu, M., Vasilev, M. R., Clay, J. M., Moreau, D., Joyal-Desmarais, K., Cruz, R. A., Brown, D., Schmidt, K., & Elsherif, M. (2023). Best Practices for Addressing Missing Data through Multiple Imputation . PsyArXiv. https://doi.org/10.31234/OSF.IO/UAEZH
Yuan, K.-H., Jamshidian, M., & Kano, Y. (2018). Missing Data Mechanisms and Homogeneity of Means and Variances-Covariances. Psychometrika, 83 (2), 425–442. https://doi.org/10.1007/s11336-018-9609-x
Zhang, Q., & Wang, L. (2017). Moderation analysis with missing data in the predictors. Psychological Methods, 22 (4), 649–666. https://doi.org/10.1037/met0000104
No funding was used to support this research.
Authors and affiliations.
Department of Psychology, Florida International University, 11200 SW 8 Street, Miami, FL, DM 381B, USA
Timothy Hayes
Department of Psychology, Oklahoma State University, Stillwater, OK, USA
Amanda N. Baraldi
Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
Stefany Coxe
Correspondence to Timothy Hayes .
Conflicts of interest.
The authors have no conflicts of interest to disclose.
Not applicable for the simulated data used in the paper (no human subjects participated in this theoretical, simulation research).
All simulation files and worked example code are available on an Open Science Framework Repository at: https://osf.io/q84ts/ .
Publisher's note.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Hayes, T., Baraldi, A.N. & Coxe, S. Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little’s MCAR test). Behav Res (2024). https://doi.org/10.3758/s13428-024-02494-1
Accepted : 19 June 2024
Published : 09 September 2024
DOI : https://doi.org/10.3758/s13428-024-02494-1
Simple linear regression uses the following null and alternative hypotheses: H0: β1 = 0 and HA: β1 ≠ 0. The null hypothesis states that the coefficient β1 is equal to zero; in other words, there is no statistically significant relationship between the predictor variable, x, and the response variable, y.
The null hypothesis of a two-tailed test states that there is not a linear relationship between \(x\) and \(y\). The alternative hypothesis of a two-tailed test states that there is a significant linear relationship between \(x\) and \(y\). Either a t-test or an F-test may be used to see if the slope is significantly different from zero.
For simple linear regression, the chief null hypothesis is H0: β1 = 0, and the corresponding alternative hypothesis is H1: β1 ≠ 0. If this null hypothesis is true, then, from E(Y) = β0 + β1x, we can see that the population mean of Y is β0 for every x value, which tells us that x has no effect on Y.
The following examples show how to decide to reject or fail to reject the null hypothesis in both simple linear regression and multiple linear regression models. Example 1: Simple Linear Regression. Suppose a professor would like to use the number of hours studied to predict the exam score that students will receive in his class. He collects ...
For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct: a hypothesis test that all of the slope parameters are 0; a hypothesis test that a subset (more than one, but not all) of the slope parameters are 0; and a hypothesis test that a single slope parameter is 0.
Formally, our "null model" corresponds to the fairly trivial "regression" model in which we include 0 predictors and only the intercept term $b_0$, that is, $H_0: Y_i = b_0 + \epsilon_i$. If our regression model has K predictors, the "alternative model" is described using the usual formula for a multiple regression model: $H_1: Y_i = b_0 + \left(\sum_{k=1}^{K} b_k X_{ik}\right) + \epsilon_i$.
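In R, this comparison of the intercept-only null model against the full model is an ordinary F-test; a minimal sketch with simulated data:

```r
# Compare the intercept-only null model to the full model with an F-test (placeholder data).
set.seed(7)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 0.8 * d$x1 + rnorm(50)

null_model <- lm(y ~ 1, data = d)
full_model <- lm(y ~ x1 + x2, data = d)
anova(null_model, full_model)      # F-test of H0: all slopes are zero
```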
In regression, as described partially in the other two answers, the null model is the null hypothesis that all the regression parameters are 0. So you can interpret this as saying that under the null hypothesis, there is no trend and the best estimate/predictor of a new observation is the mean, which is 0 in the case of no intercept.
The null hypothesis is rejected if the test statistic falls outside the acceptance region. How the acceptance region is determined depends not only on the desired size of the test, but also on whether the test is two-tailed (the parameter could be either smaller or larger than the hypothesized value; we do not exclude either of the two possibilities) or one-tailed (only one of the two, i.e., either smaller or larger, is possible).
Simple Linear Regression ANOVA Hypothesis Test Example: Rainfall and sales of sunglasses We will now describe a hypothesis test to determine if the regression model is meaningful; in other words, does the value of \(X\) in any way help predict the expected value of \(Y\)?
Interpreting the hypothesis test: If we reject the null hypothesis, can we assume there is an exact linear relationship? No. A quadratic relationship may be a better fit, for example. This test assumes the simple linear regression model is correct, which precludes a quadratic relationship. If we don't reject the null hypothesis, ...
The reduced model is the model that the null hypothesis describes. Because the null hypothesis sets each of the slope parameters in the full model equal to 0, the reduced model is: ... For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are: Hypothesis test for testing ...
Here are key steps of doing hypothesis tests with linear regression models: Formulate null and alternate hypotheses: The first step of hypothesis testing is to formulate the null and alternate hypotheses. The null hypothesis (H0) is a statement that represents the state of the real world where the truth about something needs to be justified.
Whenever we perform linear regression, we want to know if there is a statistically significant relationship between the predictor variable and the response variable. We test for significance by performing a t-test for the regression slope, using the following null and alternative hypotheses: H0: β1 = 0 (the slope is equal to zero) and HA: β1 ≠ 0 (the slope is not equal to zero).
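In R, this slope t-test is reported directly in the coefficient table of summary(); a minimal sketch with simulated data:

```r
# The t-test of H0: beta1 = 0 is reported in the coefficient table (placeholder data).
set.seed(7)
hours <- runif(30, 0, 10)
score <- 60 + 3 * hours + rnorm(30, sd = 5)
summary(lm(score ~ hours))$coefficients   # "t value" and "Pr(>|t|)" for the slope
```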
When your sample contains sufficient evidence, you can reject the null and conclude that the effect is statistically significant. Statisticians often denote the null hypothesis as H0 and the alternative hypothesis as HA. Null Hypothesis H0: No effect exists in the population. Alternative Hypothesis HA: The effect exists in the population. In every study or experiment, researchers assess an effect or relationship.
The null hypothesis (H0) answers "No, there's no effect in the population.". The alternative hypothesis (Ha) answers "Yes, there is an effect in the population.". The null and alternative are always claims about the population. That's because the goal of hypothesis testing is to make inferences about a population based on a sample.
As in simple linear regression, under the null hypothesis $t_0 = \hat{\beta}_j / \widehat{se}(\hat{\beta}_j) \sim t_{n-p-1}$. We reject $H_0$ if $|t_0| > t_{n-p-1,\,1-\alpha/2}$. This is a partial test because $\hat{\beta}_j$ depends on all of the other predictors $x_i$, $i \neq j$, that are in the model. Thus, this is a test of the contribution of $x_j$ given the other predictors in the model.
I am confused about the null hypothesis for linear regression. If a variable in a linear model has p < 0.05 (when R prints out stars), I would say the variable is a statistically significant part of the model.
Null and alternative hypothesis. When there is a p-value, there is a null and an alternative hypothesis associated with it. In linear regression, the null hypothesis is that the coefficients associated with the variables are equal to zero.
The "reduced model," which is sometimes also referred to as the "restricted model," is the model described by the null hypothesis H 0. For simple linear regression, a common null hypothesis is H 0: β 1 = 0. In this case, the reduced model is obtained by "zeroing-out" the slope β 1 that appears in the full model. That is, the reduced model is:
Linear regression is a cornerstone of statistical modeling, widely employed in various fields, from economics and finance to social sciences and engineering. ... The null hypothesis is homoscedasticity. If the p-value is significant (typically less than 0.05), it suggests heteroscedasticity.
Hypothesis Test for Regression Slope. This lesson describes how to conduct a hypothesis test to determine whether there is a significant linear relationship between an independent variable X and a dependent variable Y.. The test focuses on the slope of the regression line Y = Β 0 + Β 1 X. where Β 0 is a constant, Β 1 is the slope (also called the regression coefficient), X is the value of ...
For simple linear regression, the null hypothesis for the ANOVA is that the regression model (fit line) is identical to a simpler model (horizontal line). In other words, the null hypothesis is that the slope is actually zero. Also note that the term "linearity" is not really defined in the question or the answers, and can be misleading.
They provided a simple explanation of the problem of making an inference from the p-value: for example, if the p-value is less than 0.05, we have enough evidence to reject the null hypothesis and accept the claim. By this convention, in the regression framework we must reject the null hypothesis (H0: β = 0).
A regression analysis between sales (y in $1000) and advertising (x in dollars) resulted in the following equation: ŷ = 30,000 + 4x. The above equation implies that an increase of $1 in advertising is associated with an increase of $4,000 in sales. A standard normal distribution is one that has zero mean and variance equal to 1. Regression analysis is a statistical procedure for developing a mathematical ...
A standard way for testing for balance between treatment and control groups is to regress a treatment indicator on a set of covariates, and then use an F-test to test the null hypothesis of joint orthogonality. However, a new paper shows that this test can over-reject the null substantially when sample sizes are small or the number of covariates large. Randomization inference approaches can be ...
Under a linear missing data mechanism (second column in Table 8) known in the literature to affect variable means (and mean differences), Little's test performed as expected, correctly rejecting the null hypothesis of MCAR missing data with greater statistical power as both the missing data rate and sample size increased.