Dummy Variables

  • First Online: 04 July 2019

  • Miguel Ángel Canela 4 ,
  • Inés Alegre 4 &
  • Alberto Ibarra 5  

In this chapter, we explain how to introduce categorical variables in a regression analysis, coding the categories with dummy variables. This is needed in most of the applications of regression analysis, since the samples on which we collect our data are typically partitioned into groups. In the example, we use a dummy variable to code gender, which allows us to include the comparison between genders in the analysis in an easy way.

Author information: authors and affiliations

IESE Business School, Barcelona, Spain

Miguel Ángel Canela & Inés Alegre

IPADE Business School, Mexico City, Mexico

Alberto Ibarra

Corresponding author

Correspondence to Miguel Ángel Canela .

© 2019 Springer Nature Switzerland AG

About this chapter

Canela, M.Á., Alegre, I., Ibarra, A. (2019). Dummy Variables. In: Quantitative Methods for Management. Springer, Cham. https://doi.org/10.1007/978-3-030-17554-2_6

Publisher Name : Springer, Cham

Print ISBN : 978-3-030-17553-5

Online ISBN : 978-3-030-17554-2

eBook Packages : Business and Management, Business and Management (R0)


Title: Dummy variables and their interactions in regression analysis: examples from research on body mass index

Abstract: This paper is especially written for students and demonstrates the correct use of nominal and ordinal scaled variables in regression analysis by means of so-called dummy variables. We start out with examples of body mass index (BMI) differences between males and females, and between low, middle, and high educated people. We extend our examples with several explanatory (dummy) variables and the interactions between dummy variables. Readers learn how to use dummy variables and their interactions and how to interpret the statistical results. We included data, SPSS syntax, and additional information on a website ( this http URL ) that goes with this text. No mathematical knowledge is required.
Comments: 7448 words, 3 figures, 7 tables
Subjects: Applications (stat.AP)

MGHIHP HE-802, Spring 2021

Chapter 17 Dummy Variables and Interactions in Regression Analysis

This chapter is not part of the course HE802 in spring 2021.

Over the last few weeks, we used simple and then multiple regression analysis to analyze the linear relationships between a continuous numeric dependent variable and one or more independent variables. This week we will continue to build on our knowledge of regression analysis by adding two key capabilities to our toolbox: dummy variables to look at differences between groups and interaction terms to see if the relationship between a dependent and independent variable is different for different groups.

Additionally, we have oral exam #2 coming up very soon. Oral exam #2 will focus on the materials from Chapters 6–8. In the exam, you will be asked to run an OLS linear regression model, interpret its results, and run selected diagnostic tests to determine if we can trust the results of the model you made.

Everybody should please schedule a time with me (by email) when you would like to take your Oral Exam #2. It should be sometime in the March 16–27 range.

This week, our goals are to…

Add dummy variables and interaction terms to our OLS linear regressions and interpret them.

Visualize regression results, including visualizations with dummy and interaction variables.

17.1 Tips, Tricks, and Answers From Last Week

As always, reading this section is optional, but it is based on questions I received from members of the class over the last week and it may contain some useful information.

17.1.1 Extenuating Circumstances

Some of you have asked if you can modify the speed and schedule with which you complete the course, especially given recent extenuating circumstances caused by the spread of coronavirus.

The short answer is yes, in most cases. The Institute has mechanisms in place to accommodate your schedule and other obligations in situations like these. However, this needs to be discussed individually on a case-by-case basis. Please do not hesitate to e-mail or call me to discuss your specific situation. I will do everything I can to accommodate your request. Thank you.

17.1.2 Tables in RMarkdown

Separate from the tables that you make using the table() command from within code chunks in RMarkdown, you can also make tables (containing text and/or numbers) in RMarkdown to help organize your writing.

If you would like to learn about how to add tables to your own RMarkdown file, please copy and adapt the code from the following resource:

Note that the most basic way to make a table is to simply type it out like this:

You can copy and paste the table above into your own RMarkdown file and then modify it to your needs.

When you “knit” your RMarkdown file into a PDF, HTML, or Word document using the Knit menu, this table that we inputted as plain text with hyphens and vertical bars will be converted into a nice-looking table. Here is what it will look like:

| Name | Age | Occupation |
|------|-----|------------|
| Gedke | 23 | Chemist |
| Yada | 54 | Slug Tamer |
| Bif | 34 | Puppeteer |

I also find it handy to use an online tool to help me make tables in RMarkdown, so that I don’t have to manually type out all of the hyphens and vertical bars you saw above. Here is the tool I usually use:

  • Markdown Tables Generator

17.2 Final Project Details and Requirements

17.2.1 Description

With just over a month remaining in our time together in this course, I would like to share specific expectations and requirements for the final project in this class. The final project is not meant to be even close to a full quantitative research study. Instead, you can think of the final project as a take-home final exam. Another way to think about the project is that you will be writing an extended methods section and a condensed results section of an empirical research article.

The final project is due on April 25, 2020.

17.2.2 Project Goals

The goals of this final project are to…

Present and interpret the results of one quantitative test or model (such as ANOVA, t-test, linear or logistic regression) that answers a clear and specific research question.

Run, interpret, and appropriately respond to all required diagnostic tests for the quantitative test or model and present the results of all tests.

17.2.3 Project Requirements

Here are the items you must present and tasks you must complete:

Write a clear research question (RQ) that can be solved using regression analysis techniques. This research question should be a single sentence with a question mark at the end.

To answer your RQ, various concepts will have to be first measured, recorded in a dataset as variables, and then related to each other quantitatively. Identify a dataset that you will use to answer your research question. Clearly describe the dataset, including: a) population from which the data sample was drawn, b) unit of observation, c) all variables that you will use in your analysis and the unit of measurement of each variable, d) background information about the data.

Given the structure of the data and the RQ of interest, explain which type of quantitative test is most appropriate to answer your RQ and why. Also identify at least one other type of quantitative test that could also be used and explain why you instead chose the test that you did.

Present basic descriptive statistics that are relevant to your RQ. You should include at least one table and at least one figure/chart.

Show the code and results of one quantitative test or model that answers your RQ.

Run and present the results of all diagnostic tests that pertain to the type of test or model you ran. Ideally, your model will pass all of the tests. If your diagnostics show that your model specification violates any of the assumptions of your chosen test, you might be able to fix the problem and run the test again. Please describe all efforts to fix such problems. If you are not able to solve all such problems, it is okay. The key is that you explain what you find and how you went about your methods.

Interpret the results of your test/model that are relevant to your RQ.

Briefly explain any limitations in your analysis.

Include all R code and results in your final submission.

Present all writing in well-written English.

Present everything in an aesthetically pleasing manner. It is recommended that you use an RMarkdown document, but this is not required.

17.2.4 Grading Rubric

The final project will be graded according to the rubric below. Each criterion is worth a maximum score of two points unless otherwise noted.

| Criterion | Score = 0 | Score = 1 | Score = 2 |
|-----------|-----------|-----------|-----------|
| Clear RQ | Unclear, more than one sentence, not a question. | Confusingly presented but understandable. | Clear, simply written, single sentence ending with a question mark. |
| Population and sample | Relationship between sample and population is unclear, details about population are omitted. | Minor omissions, but overall description of the population is understandable. | It is very clear what the population is and how many observations from this population were sampled and then included in the dataset used in the project. |
| Unit of observation | The meaning of each row in the data is not understandable from what is written. | Reader can figure out based on context, but a clear explanation is missing. | It is very clear what each row of the data means/represents. This is explicitly stated with no ambiguity or confusion. |
| Variables used | The variables used in the analysis are not addressed. | Some variables are mentioned but not all. How each variable is measured is not clear. | Dependent variable and all independent variables are described in one sentence each. Unit of measure (and any relevant explanation of how a variable is coded in the data) is given for each variable. |
| Background on data | It is not understandable where the data came from and from what context. | Few details are given about the data. | Clear explanation of where the data came from, when it was collected, who collected it, etc. |
| Choose test/model | No explanation of why the presented quantitative test/model was chosen. No comparison to another test/model. Incorrect selection of model type. | An explanation may be there but it might be incorrect, or a comparison to another test/model is missing. | Logical explanation of the way the data is structured and how the selected test/model is best suited to that data structure. Clear explanation of why at least one alternative test/model was not used. |
| Descriptive statistics | No or very few statistics presented. Statistics for irrelevant variables or information are presented. | Descriptive statistics do not cover all variables and observations relevant to the RQ. Only one of two required charts is included. | Descriptive statistics are presented for all variables relevant to the RQ and used in the selected test/model. One well-made figure is presented. One well-made table is presented. |
| Test/model result | Code and/or summary is not shown for test/model. Code does not accomplish the type of test/model that was supposed to be used. | Only partial work or result is shown. Type of test/model is unclear. | Correct test/model result is shown along with appropriate R code to execute it. |
| Test/model assumption 1 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
| Test/model assumption 2 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
| Test/model assumption 3 | Assumption not considered. | Assumption is mentioned but incorrectly interpreted. | Assumption is tested correctly and interpreted correctly. |
| Interpret results | Many irrelevant details are given. Research question is not clearly answered. | Research question is answered but interpretation of results is not exactly correct. | Succinct interpretation of the portion of the test/model output that pertains to the RQ. |
| Limitations | Limitations are not addressed or are completely incorrect given the test/model used. | Limitations are partially addressed. | Multiple plausible limitations to the analysis and the conclusions we can draw from it are addressed. |
| R code included | No R code is included. | Only partial R code for the results presented is given. | R code is included (displayed in final document) for all results that were generated using R. |
| Writing quality (+) | Sentences and paragraphs are not formatted according to convention. Full sentences are not used much or well. | Minor grammar and/or spelling errors occur throughout, but the main points are understandable. | Writing is clear and succinct. It is easy to read quickly and understand the analysis and the results. No grammar or spelling errors. |
| Aesthetics | Project is presented in a confusing manner. Order and flow of requested items is not logical. Unnecessary fonts, symbols, and formatting layout appear. | Minor blemishes and errors are visible in the submitted project. | Order of all content is clear and logical. Sections and sub-sections are logically and clearly marked. The write-up is easy to read. |

Items marked with a (+) in the table above will carry more weight than just two points. All other items have a maximum score of two points.

Your grade on the project will be the number of points achieved divided by the total number of points possible.

If you are not satisfied with your grade on this project, you do have the option of taking an INCOMPLETE grade for the course. Then, you will improve and re-submit your project in the weeks that follow the end of the course. I will re-grade the project and then put your improved final grade for the course into the grading system.

17.3 Dummy Variables

17.3.1 Definition

“Dummy” variables are variables that just have two values. Here are some examples:

| Variable | Levels |
|----------|--------|
| gender | female, male |
| experimental group | treatment, control |
| completed training | did complete training, did not complete training |
| citizenship | native, foreign |
| test result | pass, fail |

How does this look within a dataset? Maybe you would have a variable in your dataset called gender and it would be coded as 1 for females and 0 for males.

A dummy variable can also be called a binary variable, dichotomous variable, yes/no variable, two-level categorical variable, etc. All of these terms mean the same thing.
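To make the 1/0 coding concrete, here is a minimal sketch in R. The data are a made-up toy vector (not from any dataset used in this chapter), and the names `training` and `completed` are hypothetical, chosen only for illustration.

```r
# Hypothetical toy data: whether each of five people completed a training
training <- c("did complete", "did not complete", "did complete",
              "did complete", "did not complete")

# Recode to a numeric dummy: 1 = completed, 0 = did not complete
completed <- ifelse(training == "did complete", 1, 0)

# Cross-tabulate the original and recoded versions to check the recode
table(training, completed)

# The mean of a 0/1 dummy is the proportion of observations coded 1
mean(completed)
```

Which level gets the 1 is a choice; it only changes the direction in which a later comparison is read.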

But what about when you have three or more qualitative (non-numeric) categories that you want to include in your regression as a single variable?

Here are some examples:

| Variable | Levels |
|----------|--------|
| Race | black, white, other |
| Favorite ice cream flavor | chocolate, vanilla, strawberry, other |
| Type of car owned | Gas, electric, hybrid, do not have a car |

Please read the following resources related to this situation:

  • Dummy Coding: The how and why

17.3.2 Example

I’m going to run through a fake example now, to illustrate how dummy variables are used in regression. First I’m adding a race variable to the GSSvocab data that you have seen before:

Distribution of the new race variable:

Here’s an OLS linear regression:

There are three dummy variables in this regression model: gendermale , raceblack , and raceother .
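The code for these steps is not reproduced here, so the following is only a sketch under stated assumptions: it uses simulated stand-in data (not the GSSvocab values), a made-up data frame `d`, and invented coefficients for generating `vocab`. It is meant to show how R names the dummy variables it creates from factor variables, not to reproduce the numbers discussed below.

```r
set.seed(1)  # make the simulated data reproducible
n <- 300

# Simulated stand-in data; these are NOT the GSSvocab values
d <- data.frame(
  age    = sample(18:89, n, replace = TRUE),
  educ   = sample(8:20, n, replace = TRUE),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  race   = factor(sample(c("white", "black", "other"), n, replace = TRUE),
                  levels = c("white", "black", "other"))  # white listed first
)
d$vocab <- 1 + 0.3 * d$educ + 0.01 * d$age + rnorm(n)

# Distribution of the simulated race variable
table(d$race)

# R builds the dummy variables itself when a factor appears in the formula
fit <- lm(vocab ~ age + educ + gender + race, data = d)
names(coef(fit))
```

The first level of each factor (here `female` and `white`) is the one R leaves out, which is why the coefficient names end in `male`, `black`, and `other`.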

Let’s interpret these one at a time.

gendermale is a dummy variable that the computer created for us based on the factor (categorical) variable gender , which just has two levels (male and female). It added the word “male” to the variable name to tell us that it coded male as 1 and female as 0.

The coefficient of gendermale is interpreted like this: Males are predicted to score 0.1465 lower on the vocabulary test than females, holding all other independent variables constant. We compare the level of the variable that corresponds to 1 (males) with the level that corresponds to 0 (females).

raceblack and raceother are two dummy variables that were generated for us by the computer from the categorical factor variable race . Let’s walk through exactly how this happened. race has three levels: black, white, and other. We want to see if race is associated with our dependent variable ( vocab ), but race is not a number that we can just add into the regression as a numeric variable like we can for age or years of education. All we can do is predict the differences in the dependent variable for each of these three groups.

But the computer only understands numbers. So what do we do? Well, we convert the categorical variable race into numeric dummy variables (which only take on the values/levels of 1 and 0). We can make three dummy variables:

  • raceblack – Coded as 1 for anyone in the survey data who is black and 0 for anyone who is not black (meaning they are white or other).
  • raceother – Coded as 1 for anyone in the survey data who is non-white and non-black and 0 for anyone who is white or black.
  • racewhite – Coded as 1 for anyone in the survey data who is white and 0 for anyone who is not white (meaning they are either black or other).

Now we have three separate numeric variables that, acting together, replace our need for the categorical race variable that we started with. Note that the computer does this automatically for us when we run the regression, because it knows everything I just told you.

Then we include two of these three variables in the regression as numeric variables. That’s what the computer did for us automatically. It left one out. Why did it leave one out? Well, for the gender variable, which is also a categorical factor variable, we could make one dummy variable called gendermale and another called genderfemale, if we really wanted to. That would be perfectly legitimate.

Consider what would happen if we included both gendermale and genderfemale as independent variables in the regression. It wouldn’t be useful for us to include both of these because they contain the same information. Both the gendermale and the genderfemale variables tell us whether a person in the survey is male or female, even though they are coded differently. There is no added benefit to having both of these variables. We just need one of them to understand the predicted difference in the dependent variable between males and females.

Going back to the three-category race variable, but keeping the above explanation about the two-category gender variable in mind, let’s think about why we only need to include two of the three race dummy variables in our regression model. Remember that we now have three dummy variables that we have added to our data, and we know the values of these three variables for every single person in the dataset.

Below are three people in our dataset. We’ll pretend their names are Ophelia who is other, Winnie who is white, and Belinda who is black. You can see their race as it is coded in the race variable, and then you can see how each of these people is coded for each of the three new dummy variables.

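| name | race | raceblack | raceother | racewhite |
|------|------|-----------|-----------|-----------|
| Ophelia | other | 0 | 1 | 0 |
| Winnie | white | 0 | 0 | 1 |
| Belinda | black | 1 | 0 | 0 |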

With these three dummy variables, we no longer need the original race variable. Let’s look at the table again, but only with the new dummy variables:

| name | raceblack | raceother | racewhite |
|------|-----------|-----------|-----------|
| Ophelia | 0 | 1 | 0 |
| Winnie | 0 | 0 | 1 |
| Belinda | 1 | 0 | 0 |

Using only these dummy variables, we can see very easily that Ophelia is other, because she is coded as 1 for raceother , 0 for raceblack , and 0 for racewhite . We don’t need the original race variable anymore to identify each person’s race. And these dummy variables are purely numeric, so the computer can now incorporate everyone’s race into the regression. We solved that problem.

But now have a look at the version of the table below, in which I have eliminated the racewhite variable:

| name | raceblack | raceother |
|------|-----------|-----------|
| Ophelia | 0 | 1 |
| Winnie | 0 | 0 |
| Belinda | 1 | 0 |

Let’s go through our three surveyed people based on this new table with just two variables:

  • Ophelia: Coded as 1 for raceother and 0 for raceblack , so we know she is other.
  • Winnie: Coded as 0 for raceother and 0 for raceblack , so we know that she is white, because she’s coded as 0 for the other two dummy variables.
  • Belinda: Coded as 0 for raceother and 1 for raceblack , so we know she is black.

Even though we took out the racewhite variable, we still were able to figure out each person’s race, even Winnie’s! Including all three dummy variables for a three-category categorical variable is too much information!!! The computer doesn’t need it. (For every person, the three dummy variables add up to exactly 1, so once we know two of them, the third is already determined.) The category whose dummy variable we leave out is called the reference category or reference level. If our categorical variable has 3 categories, we need 2 dummy variables. If our categorical variable has 6 categories, we need 5 dummy variables. The rule is:

\[\text{number of dummy variables} = \text{number of categories} - 1\]
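The three-person table and the "categories minus one" rule can be checked with a short sketch in R. The data frame `people` is a hypothetical reconstruction of the Ophelia/Winnie/Belinda example, and listing `white` first in `levels` is an assumption made so that white becomes the reference category, as in the text.

```r
people <- data.frame(
  name = c("Ophelia", "Winnie", "Belinda"),
  race = factor(c("other", "white", "black"),
                levels = c("white", "black", "other"))
)

# All three dummies (no intercept column): one column per category
all_three <- model.matrix(~ 0 + race, data = people)
all_three

# What lm() would actually use: an intercept plus (categories - 1) dummies
with_reference <- model.matrix(~ race, data = people)
with_reference

# The three dummies always add up to 1, i.e. to the intercept column,
# so the third dummy carries no information beyond the other two
rowSums(all_three)
```

`model.matrix()` is the base R function that turns a formula with factors into the numeric columns a regression uses, so it is a convenient way to see the dummy coding without fitting a model.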

That was a pretty lengthy explanation. Hopefully this makes sense. Ask questions if not!

Let’s return finally to the regression we ran. The coefficient for raceblack is 0.0149 and the coefficient for raceother is -0.0196. Here’s how we interpret these results:

  • People coded as black are predicted to have a vocabulary score that is 0.0149 higher than people coded as white, controlling for all of the other independent variables.
  • People coded as other are predicted to have a vocabulary score that is 0.0196 lower than people coded as white, controlling for all of the other independent variables.

racewhite was left out as the reference category. Because the computer only understands numbers, it makes each comparison using the 0/1 codes. The coefficient for raceother compares all people for whom raceother = 1 (like Ophelia) to those for whom raceother = 0 and raceblack = 0 (like Winnie). The coefficient for raceblack compares all people for whom raceblack = 1 (like Belinda) to all people for whom raceother = 0 and raceblack = 0 (like Winnie once again).

The regression only allows us to compare each category with the reference category. All non-white people can only be compared with those who are white. We could have left out a different variable as the reference category and included racewhite . In that case, all of our results would be compared to the selected reference category. But the overall results would be the same. Look:

I changed the reference category to black, and now you’ll see that raceblack is left out of the regression and racewhite is included. The coefficient for racewhite is now -0.0149, exactly the same magnitude as the previous raceblack coefficient but with the opposite sign! Despite changing the reference category, we got the exact same result: white people are predicted to have a vocabulary score that is 0.0149 lower than that of black people (holding constant all of the other independent variables). So it ultimately didn’t matter too much which reference category we chose. The coefficients for the non-race variables are the same. Only the intercept and the race dummy coefficients change, because they are now measured relative to black people instead of white people. The Multiple R-squared is exactly the same.
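Here is a minimal sketch of changing the reference category with base R's `relevel()`. It uses simulated data (not GSSvocab), so the numbers will not match the ones quoted above. The data frames `d` and `d2` and the simulated `vocab` are assumptions for illustration only.

```r
set.seed(2)
n <- 300

# Simulated stand-in data; NOT the GSSvocab values
d <- data.frame(
  educ = sample(8:20, n, replace = TRUE),
  race = factor(sample(c("white", "black", "other"), n, replace = TRUE),
                levels = c("white", "black", "other"))
)
d$vocab <- 1 + 0.3 * d$educ + rnorm(n)

# Model 1: white is the reference category (first factor level)
fit_white <- lm(vocab ~ educ + race, data = d)

# Model 2: same data, but black is made the reference category
d2 <- d
d2$race <- relevel(d2$race, ref = "black")
fit_black <- lm(vocab ~ educ + race, data = d2)

coef(fit_white)  # contains raceblack and raceother
coef(fit_black)  # contains racewhite and raceother

# Same fit overall: identical R-squared in both models
summary(fit_white)$r.squared
summary(fit_black)$r.squared
```

In this sketch the `racewhite` coefficient in the second model is the negative of the `raceblack` coefficient in the first, and the `educ` coefficient does not change, which mirrors the pattern described above.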

Hopefully all of this made sense. I couldn’t find an explanation of more-than-two category categorical variables that I liked, so I wrote all of this. Please ask for clarifications as you see fit!

Keep in mind that I generated the above race data randomly just to illustrate how to use dummy variables in regressions, so this is obviously not a true finding. Another reminder of this is that the coefficients for the race dummy variables in the result were not statistically significant. So in any case, we don’t have any evidence that the fake race assignments are associated with vocabulary score in the population at large.

17.3.3 Optional Resources

The following resources might help reinforce your understanding of dummy variables. It is not required for you to read/consume these:

  • Working With Dummy Variables
  • Section “14.1 Dummy Variables” of Quantitative Research Methods for Political Science…
  • Dummy Variables in Regression
  • 2:38 and after in “Eviews 7: How to interpret dummy variables and the dummy variable trap explained part 1” – Note that in the regression output shown here, C means “Constant,” which is the same thing as an intercept. The video should start at 2:38 automatically if you open it through this link.

Some of the resources above explain how to interpret a regression coefficient for a dummy variable when there are only two categories (such as male and female) that you need to capture in your regression.

17.4 Interactions in Regression

17.4.1 Definition and Overview

Previously, we learned about the concept of an interaction when doing ANOVA tests. Now, we will look at how to incorporate the very same concept into a linear regression model.

Please have a look at the following resources. If you find the concept of an interaction to be intuitive already, you can quickly skim through these resources rather than reading word-for-word.

  • Exploring interactions with continuous predictors in regression models
  • Section “14.2 Interaction Effects” of Quantitative Research Methods for Political Science…
  • Interpreting Interactions in Regression

17.4.2 Example

Now we will run through a very short example, which you can easily run in R on your own computer!

Please copy the code below into R and run it. You will need to modify this code as part of your assignment this week.
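The original code block is not reproduced here. The sketch below is a reconstruction, assuming the model interacts `Petal.Width` with `Species` in R's built-in `iris` data. That assumption is based on the object name `fitiris`, the `interact_plot()` call, and the regression equation that appear later in this section. The plotting lines are left commented out because they need the add-on `interactions` package.

```r
# iris ships with base R, so no download is needed
fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
summary(fitiris)

# Plotting the interaction needs the add-on 'interactions' package:
# install.packages("interactions")
# library(interactions)
# interact_plot(fitiris, pred = Petal.Width, modx = Species)
```

In an R formula, `Petal.Width * Species` is shorthand for both main effects plus their interaction, which is why the output has separate `Petal.Width:Species...` rows.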

If you want to see the same thing but with the individual data points displayed as well (which isn’t always a good idea), run this:

interact_plot(fitiris, pred = Petal.Width, modx = Species, plot.points = TRUE)

Basically, when we interact a continuous variable with a categorical variable, we are asking the regression model to tell us if there is a different slope for the relationship between that continuous variable and the dependent variable for the different levels of the categorical variable. This is extremely useful.

Here’s how this corresponds to our borrowed example above:

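| dependent variable | continuous independent variable | categorical independent variable |
|--------------------|---------------------------------|----------------------------------|
| petal length | petal width | species |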

In the regression, we interacted petal width and species. We got the following regression equation:

\[\begin{eqnarray} Petal.Length_{predicted} &=& 0.55Petal.Width + 0.45Speciesversicolor \\ && + \text{ } 2.91Speciesvirginica + 1.32Petal.Width*Speciesversicolor \\ && + \text{ } 0.10Petal.Width*Speciesvirginica + 1.33 \end{eqnarray}\]

The categorical variable Species has three levels: setosa, versicolor, and virginica.

setosa is the reference category. You’ll notice that it’s missing from the regression output, and that’s why. The computer created dummy variables for the other two species: Speciesversicolor and Speciesvirginica .

Let’s look now at the predicted relationship between the independent variable Petal.Width and the dependent variable Petal.Length . To do this, we’ll pull out all of the terms that include Petal.Width from the right side of the equation:

\[0.55Petal.Width + 1.32Petal.Width*Speciesversicolor + 0.10Petal.Width*Speciesvirginica\]

Using this, we can figure out the predicted relationship between Petal.Length and Petal.Width for plants that fall into each of the three levels of Species :

setosa – For all setosa plants, Speciesversicolor is coded as 0 and Speciesvirginica is also coded as 0. We plug in 0 for each of these in the expression above: \(0.55Petal.Width + 1.32Petal.Width*0 + 0.10Petal.Width*0\) which is equal to \(0.55Petal.Width\) . 0.55 is the final coefficient. For setosa plants, a one-unit increase in Petal.Width is associated with a 0.55 unit increase in Petal.Length .

versicolor – For all versicolor plants, Speciesversicolor is coded as 1 and Speciesvirginica is coded as 0. We plug these into the expression above: \(0.55Petal.Width + 1.32Petal.Width*1 + 0.10Petal.Width*0\) which is equal to \((0.55+1.32)Petal.Width = 1.87Petal.Width\) . 1.87 is the final coefficient. For versicolor plants, a one-unit increase in Petal.Width is associated with a 1.87 unit increase in Petal.Length .

virginica – For all virginica plants, Speciesversicolor is coded as 0 and Speciesvirginica is coded as 1. We plug these into the expression above: \(0.55Petal.Width + 1.32Petal.Width*0 + 0.10Petal.Width*1\) which is equal to \((0.55+0.10)Petal.Width = 0.65Petal.Width\) . 0.65 is the final coefficient. For virginica plants, a one-unit increase in Petal.Width is associated with a 0.65 unit increase in Petal.Length .
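The same slope arithmetic can be done directly from a fitted model. This is a sketch that assumes the model is `Petal.Length ~ Petal.Width * Species` on the built-in `iris` data. The vector names `slopes` and `separate` are made up for this illustration.

```r
fitiris <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
b <- coef(fitiris)

# Slope for each species = base slope + that species' interaction coefficient
slopes <- c(
  setosa     = unname(b["Petal.Width"]),
  versicolor = unname(b["Petal.Width"] + b["Petal.Width:Speciesversicolor"]),
  virginica  = unname(b["Petal.Width"] + b["Petal.Width:Speciesvirginica"])
)
round(slopes, 2)

# Cross-check: fitting each species separately gives the same slopes
separate <- sapply(split(iris, iris$Species), function(g) {
  unname(coef(lm(Petal.Length ~ Petal.Width, data = g))["Petal.Width"])
})
all.equal(slopes, separate)
```

The cross-check works because a model with a full interaction between a continuous variable and a categorical variable lets each group have its own intercept and its own slope, just as separate regressions would.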

There are more examples in the resources linked above and you’ll also be practicing this in this week’s assignment.

17.5 Assignment

In this week’s assignment, you will revisit some of your work from last week as we add dummy variables and interaction terms into our linear regression models.

Like last week, load the GSSvocab dataset from the car package. Once again, run the exact same regression you ran last week, which used the variables age, gender, educ, and vocab.

17.5.1 Dummy Variables, Part 1

Right now in the dataset, gender is coded as a factor variable:

factor is what R calls a categorical variable.

And how many levels does this categorical variable have?

It has 2 levels, and those levels are female and male . So this particular categorical variable is also a dummy variable.
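The commands for inspecting a factor are not reproduced above, so here is a small sketch on a hypothetical toy factor. The vector `gender` below is made up and merely stands in for the `gender` column of GSSvocab. The functions `class()`, `levels()`, and `nlevels()` are base R.

```r
# Hypothetical toy factor standing in for GSSvocab$gender
gender <- factor(c("female", "male", "male", "female", "female"))

class(gender)    # what type of variable is it?
levels(gender)   # which categories does it have?
nlevels(gender)  # how many categories?
```

By default R orders factor levels alphabetically, so `female` comes first and would be treated as the reference category in a regression.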

Task 1 : Recode the gender variable. Make a new variable called female for which females are coded as 1 and males as 0 .

Task 2 : Create a two-way table to show that your recode was successful.

Task 3 : Use the class() command (demonstrated above) to figure out what type of variable your new female variable is. It should be numeric.

Task 4 : Re-run the same linear regression (with age , gender , educ , and vocab ), but replace gender with the new female variable that you just made. Is the regression result the same as the one you got last week? It should be the same.

What just happened? Last week, R converted the gender variable to a dummy variable for you automatically. So you don’t actually need to do this recoding process every time you use a dummy variable. But it’s important for you to know that the computer is treating females as 1 and males as 0 nevertheless (or sometimes vice versa, but it’s always using 0’s and 1’s).
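
To see what this automatic conversion looks like, here is a small sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from GSSvocab). Base R's `model.matrix()` displays the columns that `lm()` builds from a formula, including the 0/1 column created from a factor.

```r
# Toy data (hypothetical values, not from GSSvocab).
toy <- data.frame(
  gender = factor(c("female", "male", "male", "female")),
  score  = c(6, 5, 7, 8)
)

# model.matrix() shows the design matrix that lm() would use for this formula.
X <- model.matrix(~ gender, data = toy)
print(X)

# The dummy column is named "gendermale": 1 for male rows, 0 for female rows,
# because "female" is the first (reference) level in alphabetical order.
```

In this sketch R's default treatment coding makes the first factor level the reference group, which is why the direction of the 0/1 coding can be the reverse of a variable you recode by hand.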

Task 5 : Now that you know more this week than last week about dummy variables, interpret the coefficient for the female variable in your regression output.

17.5.2 Dummy Variables, Part 2

Now consider this new research question, still using the GSSvocab dataset: Do native-born people have different vocabulary abilities than non-native-born people, controlling for age, gender, and education?

Task 6 : What is the null hypothesis for this research question?

Task 7 : What is the alternate hypothesis for this research question?

Task 8 : Run a new regression to answer this new research question. Show the results of this regression.

Task 9 : Write out the full regression equation based on this output.

Task 10 : What is the predicted vocabulary score for someone with the following characteristics? Please show the entire calculation.

  • gender = male
  • education = 8
  • nativeBorn = yes

Task 11 : What is the predicted vocabulary score for someone with the same characteristics above, except that they are female? The difference should be equal to the coefficient of the dummy variable for gender! That’s the whole point! Please show the entire calculation.

Task 12 : What is the answer to the research question? Make sure your answer includes an interpretation of the coefficient for the nativeBorn variable, as well as that coefficient’s standard error, t-value, and p-value.

17.5.3 Interactions

Now we’ll turn to another research question: Is the relationship between education and vocabulary different for native-born and non-native-born people, when controlling for age and gender? In other words, is there an interaction between educ and nativeBorn , when controlling for age and gender ?

This page is likely to help you complete the next few tasks. You should also refer to the code with the iris data that is earlier in this chapter.

Task 13 : Modify your previous code and run a new regression that includes the interaction in this new research question. Show your regression table in your submission.

Task 14 : Write out the full regression equation based on this output.

Task 15 : Use the interact_plot() function to visualize the results.

Task 16 : What is the answer to the new research question about the interaction? Make sure you look to see which coefficients are statistically significant and then interpret the results accordingly.

17.5.4 Logistical Tasks

Task 17 : Please submit any feedback or questions you have as part of your assignment.

Task 18 : Please e-mail me to schedule a time when you would like to take your Oral Exam #2. It should be sometime in the March 16–27, 2020 range.

Task 19 : Please submit your assignment to the D2L dropbox as always.

This due date was added on April 1, 2020. ↩︎

There are no exceptions to this requirement. ↩︎

As stated before, those of you who do not have data of your own that you would like to analyze can have a discussion with me and I can provide you with a research question and a dataset in which to study it. ↩︎

In reality, you will likely run many tests/models on your own to arrive at the one that fits your RQ and data the best. But you do not need to show all of this work in your final submission. If you do wish to show all of this additional work, you can include it in an appendix to your assignment, but this is not required. ↩︎

Meaning that it is programmed to behave as if it knows. ↩︎

In this variable gendermale , all males in the data would be coded as 1 and all females would be coded as 0. ↩︎

In this variable genderfemale , all females in the data would be coded as 1 and all males would be coded as 0, which is the exact opposite of how we would code the gendermale variable. ↩︎

Source: Exploring interactions with continuous predictors in regression models ↩︎

A term is anything in between the plus signs. In the equation \(a = 2b + rudolph + 43\) , \(2b\) , \(rudolph\) , and \(43\) are all terms on the right side of the equation. ↩︎

Just copy and paste your code from last week. Don’t type it again! ↩︎

Research Methods Knowledge Base

Dummy Variables

A dummy variable is a numerical variable used in regression analysis to represent subgroups of the sample in your study. In research design, a dummy variable is often used to distinguish different treatment groups. In the simplest case, we would use a 0,1 dummy variable where a person is given a value of 0 if they are in the control group or a 1 if they are in the treated group. Dummy variables are useful because they enable us to use a single regression equation to represent multiple groups. This means that we don’t need to write out separate equation models for each subgroup. The dummy variables act like ‘switches’ that turn various parameters on and off in an equation. Another advantage of a 0,1 dummy-coded variable is that even though it is a nominal-level variable you can treat it statistically like an interval-level variable (if this made no sense to you, you probably should refresh your memory on levels of measurement). For instance, if you take an average of a 0,1 variable, the result is the proportion of 1s in the distribution.

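\(y_i = \beta_0 + \beta_1 Z_i + e_i\)

where:

\(y_i\) is the outcome score of the \(i\)th unit,

\(\beta_0\) is the coefficient for the intercept,

\(\beta_1\) is the coefficient for the slope,

\(Z_i\) is the dummy variable:

  • 1 if the \(i\)th unit is in the treatment group;
  • 0 if the \(i\)th unit is in the control group;

\(e_i\) is the residual for the \(i\)th unit.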

To illustrate dummy variables, consider the simple regression model for a posttest-only two-group randomized experiment. This model is essentially the same as conducting a t-test on the posttest means for two groups or conducting a one-way Analysis of Variance (ANOVA). The key term in the model is \(\beta_1\), the estimate of the difference between the groups. To see how dummy variables work, we’ll use this simple model to show you how to use them to pull out the separate sub-equations for each subgroup. Then we’ll show how you estimate the difference between the subgroups by subtracting their respective equations. You’ll see that we can pack an enormous amount of information into a single equation using dummy variables. All I want to show you here is that \(\beta_1\) is the difference between the treatment and control groups.

To see this, the first step is to compute what the equation would be for each of our two groups separately. For the control group, \(Z = 0\). When we substitute that into the equation, and recognize that by assumption the error term averages to 0, we find that the predicted value for the control group is \(\beta_0\), the intercept. Now, to figure out the treatment group line, we substitute the value of 1 for \(Z\), again recognizing that by assumption the error term averages to 0. The equation for the treatment group indicates that the treatment group value is the sum of the two beta values, \(\beta_0 + \beta_1\).

Now, we’re ready to move on to the second step – computing the difference between the groups. How do we determine that? Well, the difference must be the difference between the equations for the two groups that we worked out above. In other words, to find the difference between the groups we just find the difference between the equations for the two groups! It should be obvious from the figure that the difference is \(\beta_1\). Think about what this means. The difference between the groups is \(\beta_1\). OK, one more time just for the sheer heck of it. The difference between the groups in this model is \(\beta_1\)!
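
The two steps can also be written out line by line. This is a sketch that assumes the model has the form \(y_i = \beta_0 + \beta_1 Z_i + e_i\), which is what the symbol definitions above imply; the notation \(\hat{y}_C\) and \(\hat{y}_T\) for the predicted control and treatment values is introduced here for illustration.

\[
\begin{aligned}
\text{control } (Z_i = 0):\quad & \hat{y}_C = \beta_0 + \beta_1 \cdot 0 = \beta_0 \\
\text{treatment } (Z_i = 1):\quad & \hat{y}_T = \beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1 \\
\text{difference}:\quad & \hat{y}_T - \hat{y}_C = (\beta_0 + \beta_1) - \beta_0 = \beta_1
\end{aligned}
\]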

Whenever you have a regression model with dummy variables, you can always see how the variables are being used to represent multiple subgroup equations by following the two steps described above:

  • create separate equations for each subgroup by substituting the dummy values
  • find the difference between groups by finding the difference between their equations
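
These two steps can be checked numerically. The sketch below uses simulated data (the group sizes, means, spread, and seed are made up for illustration) and base R's `lm()`. It is not an analysis from the source, only a way to see that the intercept matches the control-group mean and the dummy's coefficient matches the difference between the group means.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated posttest scores (made-up numbers): z = 0 control, z = 1 treatment.
z <- rep(c(0, 1), each = 50)
y <- 50 + 5 * z + rnorm(100, sd = 10)

fit <- lm(y ~ z)
b <- coef(fit)

control_mean   <- mean(y[z == 0])
treatment_mean <- mean(y[z == 1])

# Step 1: substituting z = 0 leaves only the intercept (the control-group mean).
cat("intercept:", b[["(Intercept)"]], " control mean:", control_mean, "\n")

# Step 2: subtracting the two group equations leaves the coefficient on z.
cat("coefficient on z:", b[["z"]],
    " difference in means:", treatment_mean - control_mean, "\n")

# The mean of a 0/1 dummy is the proportion of 1s in the sample.
print(mean(z))
```

With a single 0/1 predictor, least squares fits each group's mean exactly, so both comparisons agree up to floating-point error.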

How robust is linear regression with dummy variables ?

12 Pages Posted: 26 Aug 2022

Eric Blankmeyer

Texas State University

Date Written: July 15, 2022

Researchers in the social sciences make extensive use of linear regression models in which the dependent variable is continuous-valued while the explanatory variables are a combination of continuous-valued regressors and dummy variables. The dummies partition the sample into groups, some of which may contain only a few observations. Such groups may easily include enough outliers to break down the parameter estimates. This paper discusses the problem at an intuitive level and cites sources for the key theorems establishing bounds on the breakdown point in models with dummy variables.

Keywords: linear regression, breakdown point, dummy variables, fixed effects

JEL Classification: C01

COMMENTS

  1. PDF A Smart Guide to Dummy Variables: Four Applications and a Macro

    creation of dummy variables and improve productivity. 1. Introduction to Dummy Variables Dummy variables are independent variables which take the value of either 0 or 1. Just as a "dummy" is a stand-in for a real person, in quantitative analysis, a dummy variable is a numeric stand-in for a qualitative fact or a logical proposition.

  2. Dummy variables and their interactions in regression analysis: examples

    Dummy variables and their interactions in regression analysis: examples from research on body mass index Manfred Te Grotenhuis Paula Thijs The authors are affiliated to Radboud University, the Netherlands. Further information can be found on the website that goes with this paper [total word count 7452] Abstract

  3. Regression Using Dummy Variables

    Dummy Variables (also often called binary variables or dichotomous variables) are variables that have only two possible values. For example, yes/no responses to a question, whether a car has a tow hitch or not, or whether a person is unemployed or not. We usually assign the value 1 or 0 to these dummy variables.

  4. (PDF) Dummy variables and their interactions in regression analysis

    This paper which is especially written for students, demonstrates the correct use of nominal and ordinal scaled variables in regression analysis by means of so-called dummy variables.

  5. What Are Dummy Variables and How to Use Them in a Regression Model

    A 7-variable subset of the Automobiles data set. (Source: UC Irvine) The above 7-variable version can be downloaded from here. In the above data set, the aspiration variable is of type Standard or Turbo. Our regression goal is to estimate the effect of aspiration on vehicle price. To that end, we will introduce a dummy variable to encode aspiration as follows:

  6. Interpreting dummy variables and their interaction effects in strategy

    Abstract. Dummy variables have been employed frequently in strategy research to capture the influence of categorical variables. However, misinterpretation of results may arise, especially when interaction effects between dummy variables and other explanatory variables are involved in a regression. We discuss two approaches of entering dummy ...

  7. On Dummy Variable Regression Analysis:

    Efforts are also made (1) to give illustrations and examples of problems to which this type of multiple-regression analysis might be applied productively; (2) to show how "dummy variable" regression analysis is both similar to and different from other multivariate techniques in terms of the analytical procedures and the kinds of interpretations ...

  8. PDF Regression Analysis with Dummy Variables

    There are many variables in social science research, such as gender, ethnicity, and marital status, that are inherently categorical. It turns out that categorical variables can be used as independent variables in ... In this case, the dummy variable for gender explains 60,0 percent of the variance in income. ...

  9. PDF Dummy Variables 6

    To include the groups in the regression equation, we use dummy variables. A dummy variable or, more briefly, a dummy, is a variable taking values 0 and 1. In regression analysis, dummies are used to code groups. In this chapter, we explain how to code groups with dummies, and how to interpret the coefficients of those dummies in a regression ...

  10. [1511.05728] Dummy variables and their interactions in regression

    This paper is especially written for students and demonstrates the correct use of nominal and ordinal scaled variables in regression analysis by means of so-called dummy variables. We start out with examples of body mass index (BMI) differences between males and females, and between low, middle, and high educated people. We extend our examples with several explanatory (dummy) variables and the ...

  11. PDF Dummy-Variable Regression

    Likewise, we cannot calculate unique least-squares estimates for the model because the set of three dummy variables is perfectly collinear; for example, as is apparent from the table in Equation 7.5, D3 = 1 − D1 − D2. In general, then, for a polytomous factor with m categories, we need to code m − 1 dummy regressors.

  12. Interpreting dummy variables and their interaction effects in strategy

    3 Vermeulen and Barkema's (2001) study is one of the few examples that use the partition approach. Under the column 'Survival Analysis 2' in their Table 2, the multiplicative terms between \(Z_{it}\) (i.e. number of preceding greenfields or number of preceding acquisitions) and two dummy variables, namely greenfield and acquisition, together partition the effect of \(Z_{it}\) for greenfield ...

  13. (PDF) Interpreting Dummy Variables and Their Interaction Effects in

    Interpreting dummy variables and their interaction effects in strategy research. Paul S. L. Yip, Nanyang Technological University, Singapore. Eric W. K. Tsang, Wayne State University, USA. Abstract ...

  14. PDF Use of Dummy Variables in Regression Analysis

    1. The number of dummy variables necessary to represent a single attribute variable is equal to the number of levels (categories) in that variable minus one. 2. For a given attribute variable, none of the dummy variables constructed can be redundant. That is, one dummy variable can not be a constant multiple or a simple linear relation of another.

  15. How to Use Dummy Variables in Regression Analysis

    To use gender as a predictor variable in a regression model, we must convert it into a dummy variable. Since it is currently a categorical variable that can take on two different values ("Male" or "Female"), we only need to create k-1 = 2-1 = 1 dummy variable. To create this dummy variable, we can choose one of the values ("Male" or ...

  16. Chapter 17 Dummy Variables and Interactions in Regression Analysis

    The following resources might help reinforce your understanding of dummy variables. It is not required for you to read/consume these: Working With Dummy Variables; Section "14.1 Dummy Variables" of Quantitative Research Methods for Political Science… Dummy Variables in Regression

  17. PDF Lecture 13 Use and Interpretation of Dummy Variables

    Use and Interpretation of Dummy Variables Dummy variables - where the variable takes only one of two values - are useful tools in econometrics, since often interested in variables that are qualitative rather than quantitative In practice this means interested in variables that split the sample into two distinct groups in the following way

  18. Dummy Variables

    Dummy Variables. A dummy variable is a numerical variable used in regression analysis to represent subgroups of the sample in your study. In research design, a dummy variable is often used to distinguish different treatment groups. In the simplest case, we would use a 0,1 dummy variable where a person is given a value of 0 if they are in the ...

  19. Dummy variables vs. category-wise models

    Empirical research frequently involves regression analysis with binary categorical variables, which are traditionally handled through dummy explanatory variables. This paper argues that separate category-wise models may provide a more logical and comprehensive tool for analysing data with binary categories. Exploring different aspects of both ...

  20. How robust is linear regression with dummy variables

    Abstract. Researchers in the social sciences make extensive use of linear regression models in which the dependent variable is continuous-valued while the explanatory variables are a combination of continuous-valued regressors and dummy variables. The dummies partition the sample into groups, some of which may contain only a few observations.

  21. (PDF) Interpreting Dummy Variables in Semi-Logarithmic Regression

    Care must be taken when interpreting the coefficients of dummy variables in semi-logarithmic regression models. Existing results in the literature provide the best unbiased estimator of the ...

  22. Application of Dummy Variables in Multiple Regression Analysis

    The method of least square is typically used to estimate the regression coefficients in a multiple linear model. The method of least square chooses the β's in the equation (1) so that the ...

  23. Dummy Variables Research Papers

    Impact of the Political Environment on the Stock Market: An Analysis using Dummy Variables and the GARCH Models. It is believed that the political environment of a country affects its stock market. With the help of the dummy variables and GARCH models, it is concluded that political environment does affect the stock market in India.

  24. Examining co-offending and re-offending across crime categories using

    Research on co-offending has become increasingly popular across the last two decades of criminological research. In this paper, we focus on three key variables and their relationship with co-offend... The dummy variable "group.crime" is a magnitude of one if the current crime event is a group or co-offending event ...