# INTRODUCTION TO REGRESSION IN R AND IN RBRUL
# Daniel Ezra Johnson
# SSS 5
# 30 July 2014

# THIS FILE CAN BE DOWNLOADED FROM:
# http://www.danielezrajohnson.com/sss5_workshop.R

# INTRODUCTION

# You should save this file on your computer.
# Then make sure you re-open it in R.
# This file is an example of a (heavily annotated) R script.
# Instead of manually typing in R commands, you can also highlight lines in a script,
# and press Command-Enter (Mac) or Ctrl-R (Windows) to execute those commands.
# Many people arrange the desktop so that the script window is on one side of the
# screen, and the R console on the other.
# This makes it easy to work in both windows and go back and forth.

# All the lines beginning with # are comments.
# This means that they have no effect when a section of text is run in R.
# In fact, R will ignore whatever follows a # even within a single line.
# I will use such in-line comments for less important notes.

# Try typing "2 + 2" into the R Console, followed by Return/Enter.
# Now highlight the "2 + 2" below, in the script, and press Command-Enter/Ctrl-R.
# The same thing should happen in both cases.

2 + 2

# Did you get this output?
# [1] 4      # the [1] can be ignored here

# What about the following:

2 + 2 * 3

# Did you get 4 * 3 = 12?
# No - the result is 8.
# As you can see from this result, R follows the standard "order of operations".
# Unless you put parentheses, multiplication will be carried out before addition.

(2 + 2) * 3

# Now you should get the answer 12.
# We often have to use parentheses in R.
# The R GUI uses colored parentheses to help you not lose track of them.

# WHAT IS R AND WHAT IS RBRUL?

# R is a programming language or "environment" for statistics and data analysis.
# With R, we can perform practically any mathematical or statistical calculation.
# We can use hundreds of built-in functions and also define our own functions.
# We can obtain statistics, including regression coefficients and p-values, easily.
# In this workshop, we will explore regression using R proper, and also with Rbrul.
# Rbrul is a front-end interface to some of the regression functions in R.
# Rbrul may be easier and more intuitive - but R is much more powerful and flexible.

# INSTALLING PACKAGES AND RBRUL

# Execute the following commands to install the necessary packages.
# If you already did this, you don't have to do it again (but it doesn't hurt!).
# Be careful with capitalization. R is very sensitive to capitalization.

install.packages("ggplot2")
install.packages("lme4")
install.packages("lmerTest")

# Once the packages are installed, you must load them with the library() command.
# Unlike install.packages(), you don't need quotation marks for library().

library(ggplot2)
library(lme4)
library(lmerTest)

# We install Rbrul with the command source(), which loads an external program.
# Assuming you have an Internet connection, run the following command:

source("http://www.danielezrajohnson.com/Rbrul.R")

# If the source() command returns an error message, try it again.
# If you don't have access to the Internet right now, it won't work.
# However, if you downloaded the file Rbrul.R previously, you can substitute
# the local path to the saved file, for example: source("~/Desktop/Rbrul.R").
# In general, you should source Rbrul from the Web, to get the latest update.
# A window will open saying "This window has opened to help Google Analytics
# track the spread of Rbrul around the globe. You may close it at any time."
# You may close it at any time.
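# As a quick optional check (a small added sketch, not part of the original handout),
# you can confirm that the packages loaded and see which versions you have.
# packageVersion() and sessionInfo() are standard base-R functions.

packageVersion("lme4")    # prints the installed version of lme4
packageVersion("ggplot2") # prints the installed version of ggplot2
sessionInfo()             # lists the attached packages, R version, and platform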
# THREE GOALS OF REGRESSION

# Traditionally, there are three reasons why we build regression models:
# One is to quantitatively ESTIMATE the effects of predictors on a response.
# Predictor means independent variable; response means dependent variable.
# A second reason is to TEST HYPOTHESES (for example, significance testing).
# A third reason is to use observed data to PREDICT new or unseen data.
# This workshop will focus on estimation and simple hypothesis testing.
# We will ask how big an effect a predictor has on a response (estimation).
# And we'll ask if the effect could be the result of chance (hypothesis testing).

# WHAT ARE MIXED MODELS?

# Mixed models or mixed-effects models are a fairly new type of regression model.
# They're useful if data has a grouped structure (e.g. tokens grouped by speaker).
# I believe that many sociolinguistic situations require mixed-effects modeling.
# This makes it hard to properly illustrate "ordinary" regression with real data!

# ORDINARY (FIXED-EFFECTS) REGRESSION IN R

# We can create a pretend data set with a single command:

toy <- data.frame(
    x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    y = c(1, 3, 2, 5, 4, 7, 4, 6, 5, 7))

# Note that we could have also expressed x as "0:9" instead of "c(0, 1, ..., 9)".

# We can plot these points with qplot(), a function in the ggplot2 package.

qplot(x = x, y = y, data = toy)

# When you open a graphing window, it may need to be resized to display correctly.

# Here, our response (y) is numeric, and we will perform a linear regression.
# Linear regression finds the straight line that is the best fit for these points.
# Linear regression minimizes the "sum of squared errors".
# It minimizes the squared vertical distances from the points to the line.

# We can add this best-fit line to the plot with a geom_line() layer.

qplot(x = x, y = y, data = toy) + geom_line(stat = "smooth", method = "lm")

# Any straight line can be described by two numbers: its slope and its intercept.
# The slope of the line is the amount it goes up as you move one unit to the right.
# (If the line goes down as you move to the right, the slope will be negative.)
# The intercept is the value of y at x = 0 (where the line "intercepts" the y-axis).
# Just by eyeballing the line on the plot, we would say that its intercept is 2.
# And we would eyeball that its slope is a little bit more than 0.5.
# Together, these two parameters make up a simple rough model of the data.

# R fits linear regression models - finds the best line - with the lm() command.
# We feed a formula of the form "y ~ x" to lm(), and tell it where the data is.

lm(y ~ x, data = toy)

# The relevant part of this output is the "Coefficients", which are as follows:
# (Intercept)            x
#      2.0000       0.5333

# We see that we were correct in our eyeballing. The intercept is exactly 2.
# The slope is 0.5333, meaning that y goes up 0.5333 for every increase of 1 in x.
# This is estimation of the parameters of a model obtained by linear regression.

# A hypothesis we might want to test is the question of statistical significance.
# We might want to know if the slope, 0.5333, is significantly different from zero.
# More precisely, we want to know the proportion of the time that a slope at least
# that large would be seen, on average, if we just randomly threw 10 points at a wall.

# Above, we just printed the lm() model directly to the screen.
# We can also store the model and give it a name.

m1 <- lm(y ~ x, toy)

# Then we can print it out, or work with it as needed.
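# For instance (a small sketch going slightly beyond the original script),
# summary() prints the coefficients together with their standard errors, t-values,
# and p-values, and confint() gives confidence intervals. Both are base-R functions.

summary(m1) # the p-value for the x slope here matches the F-test computed below
confint(m1) # 95% confidence intervals for the intercept and the slope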
# Test the following commands:

m1          # prints the parameters of the model
coef(m1)    # prints just the coefficients of the model
coef(m1)[1] # prints the intercept (the first coefficient)
coef(m1)[2] # prints the slope (the second coefficient)

# To test the null hypothesis that the true slope is zero, we can fit a null model.
# This is a model with an intercept (the 1), but no slope. The slope is fixed at 0.

m0 <- lm(y ~ 1, toy)
m0

# _________________________________________________________________________________
# EXERCISE 1
# In the model m0, what is the value of the intercept?
# _________________________________________________________________________________

# Note that m1, with an intercept and a slope for x, could also have been fit
# with the formula y ~ 1 + x. The formula y ~ x is short for y ~ 1 + x. You could
# also tell the program to omit the intercept, forcing the line to pass through
# the origin (0, 0). You would do this with the formula y ~ 0 + x.

# To test the significance of the slope in m1, we need to compare two models:
# m1 and m0. m1 has an intercept and a slope, while m0 only has an intercept.
# As it is a more complex model, m1 will ALWAYS fit the data better than m0.
# But that extra complexity comes at a cost. In general, we prefer simpler models.
# We do a test to decide if the better fit is worth the added complexity. This
# can be called a "significance test" - testing the "statistical significance"
# of the slope. It is also called a "hypothesis test" - testing the hypothesis
# that the underlying slope is zero. If we reject this, then we have an effect!
# In the case of ordinary linear regression, this is done with an F-test.
# More generally, two models are compared with a likelihood-ratio chi-squared test.
# Note: this is also how GoldVarb compares models in its stepwise routines.
# If you haven't heard of GoldVarb, consider yourself lucky.

# We use anova() to compare the models, perform an F-test and obtain a p-value.
# Despite the name, this command does not necessarily carry out "an ANOVA".
# ANOVA is specifically for comparing discrete groups, which we're not doing here.

anova(m0, m1) # note that the order of the two models does not matter

# In the anova(m0, m1) output, focus on where it says "Pr(>F)":
#   Pr(>F)
# 0.005163

# This is our p-value. It is well below the usual .05 significance threshold.
# Therefore, we tend to believe that the observed trend is real and not just noise.

# This toy example had a single predictor, x.
# Naturally, regression becomes more useful when there is more than one predictor.
# Then, we estimate several "slopes" simultaneously (this is harder to visualize).
# We might have a formula y ~ x1 + x2, or y ~ x1 * x2 (indicating an interaction).
# Note: this is called "multiple regression", not "multivariate regression".
# Multivariate regression is when there is more than one dependent variable.
# (A short simulated example of multiple regression follows the exercise below.)

# _________________________________________________________________________________
# EXERCISE 2
# If the y values were chosen at random, any trend would have to be due to chance.
# Repeat the following commands by highlighting and using Command-Enter/Ctrl-R.
# Keep an eye on the p-values.

x <- 1:10         # x = 1, 2, ..., 10
y <- sample(1:10) # y = 1, 2, ..., 10 in random order
m1 <- lm(y ~ x)
m0 <- lm(y ~ 1)
anova(m0, m1)

# How often is the p-value below .05?
# With a true slope of zero, the p-value should fall below .05 .... 5% of the time.
# Try it 100 times! How many times did you get a "false positive" (Type I error)?
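# To illustrate the multiple-regression formulas mentioned above, here is a small
# sketch with simulated data (an addition, not part of the original workshop).
# The predictors x1 and x2 and their "true" slopes are made up for illustration.

set.seed(1)                            # for reproducibility
x1 <- rnorm(50)                        # first simulated predictor
x2 <- rnorm(50)                        # second simulated predictor
y2 <- 1 + 2 * x1 - 1 * x2 + rnorm(50)  # response built from known slopes, plus noise
m.add <- lm(y2 ~ x1 + x2)              # additive model: one slope per predictor
m.int <- lm(y2 ~ x1 * x2)              # the same, plus the x1:x2 interaction term
coef(m.add)                            # estimates should land near 1, 2, and -1
anova(m.add, m.int)                    # F-test: is the interaction worth keeping?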
# _________________________________________________________________________________

# ORDINARY (FIXED-EFFECTS) REGRESSION IN RBRUL

# In R, we work by entering commands (or sets of commands) at the > prompt.
# Rbrul is based on text menus; it interacts with the user by asking questions.
# You can cut and paste multiple lines into Rbrul at once, but it's not a great idea.
# You also can't go backwards if you make a mistake.
# However, you can quit Rbrul (either from a menu or with the Stop button) and
# go back in without starting over. Or you can start over with the "reset" option.

# We will attempt to carry out the same regression that we did above, in Rbrul.
# First, note that Rbrul always uses the object named "datafile" to hold its data.
# We can load any file with Rbrul, but since we already have our "toy" data frame,
# we'll use the assignment operator (a left-pointing arrow) to create the "datafile".

datafile <- toy

# Note that you can use = instead of <-, but in R circles, this is a grave faux pas.

# We start Rbrul with the R command rbrul().

rbrul()

# After a moment and some messages, you should see the data structure and main menu.
# If you don't, it usually means that required packages have not loaded.
# On office computers, you may need to run R "as administrator" to install packages.

## Current data structure:
## x (numeric with 10 values): 0 1 2 3 4 ...
## y (numeric with 7 values): 1 3 2 5 4 ...
## Total tokens: 10

## MAIN MENU
## 1-load/save data   2-adjust data
## 4-crosstabs   5-modeling   6-plotting
## 8-restore data   9-reset   0-exit

# To fit a regression model, we type "5" for "modeling", and press Return/Enter.

5

# We then see the modeling menu:

## MODELING MENU
## 1-choose variables   2-one-level (recommended)
## 3-step-up   4-step-down   5-step-up/step-down
## 6-plotting   8-settings   9-main menu   0-exit
## 10-chi-square test

# The first thing we have to do is choose our variables.
# For example, we have to tell Rbrul which variable is the response.
# We enter "1" to choose variables.

1

## Choose response (dependent variable) by number (1-x 2-y).

# To choose y as the response variable, we type "2" followed by Return/Enter.

2

## Type of response? (1-continuous Enter-binary)

# Here we type "1" plus Return/Enter, because y is a numeric/continuous variable.
# When the response is a real number, it is called linear regression.
# (If the response was binary, we would be doing logistic regression. See below.)

1

## Choose predictors (independent variables) by number (1-x).

# Here we only have one choice: we type "1" (plus Return/Enter) to select x.

1

## Are any predictors continuous? (1-x Enter-none)

# Indeed, our predictor x is a numeric/continuous predictor, so we enter "1".

1

# Since there are no other predictors, Rbrul returns us to the modeling menu,
# after displaying the "Current variables": a continuous response y, and a fixed,
# continuous predictor x. Here, "fixed" means a fixed effect, an ordinary predictor.
# The alternative to fixed effects is random effects, which we will use later on.

# There is only one predictor being considered for our model: x.
# Therefore, it doesn't matter if we choose "step-up", "step-down", or "one-level".
# Stepwise modeling can be dangerous, so we will get in the "one-level" habit,
# entering "2" for a one-level analysis.

2

## ONE-LEVEL ANALYSIS WITH x (0.00516)
##
## $x
## continuous  coef
##         +1 0.533
##
## $misc
## deviance    AIC df intercept grand mean R2.fixed R2.random R2.total
##   12.933 36.951  2         2        4.4     coming soon!
# The above output is Rbrul's summary of the model with response y and predictor x.
# We see the p-value for x at the top of the summary: 0.00516.
# It is the same as what we obtained before, just rounded to 3 significant digits.
# We see the slope, 0.533, under the heading "coef" in the "x" section.
# We see the intercept, 2, under the heading "intercept" in the "misc" section.
# We also see other information, like the "grand" (overall) mean.
# We should also see the R-squared value, a measure of the fit of the model.
# Right now, there is a problem with the calculations; it says "coming soon!"

# MIXED-EFFECTS REGRESSION IN RBRUL

# We will now run a regression on a real linguistic data set.
# The response variable is post-vocalic or coda /r/; there are 1597 tokens of it.
# 40 speakers in Gretna, Scotland, read a word list containing 40 or so coda /r/s.
# The variable was coded narrowly, but we focus on zero vs. any other realization.
# This is a binary dependent variable, so we will be running a logistic regression.

# Predictors are modeled as affecting the LOG-ODDS of the response in a linear way.
# Log-odds are a way of taking the 0-to-1 probability scale, and stretching it out,
# so that it covers all numbers from negative infinity to positive infinity.
# Log-odds equals the natural logarithm of the odds, or log(p / (1 - p)).
# A log-odds of 0 corresponds to 0.5 probability.
# A log-odds of +1 corresponds to a probability of 0.73; -1 corresponds to 0.27.
# A log-odds of +2 corresponds to a probability of 0.88; -2 corresponds to 0.12.
# With coda /r/, a log-odds of 0 corresponds to 50% /r/ pronunciation.
# But regression mainly deals with effects, that is, changes in log-odds.
# On the log-odds scale, the difference between 50% and 70% /r/ is 0.85 log-odds.
# A similar change of 0.85 would get us from 70% to 85% /r/, from 85% to 93% /r/,
# or from 93% to 97%.
# We might say that a fixed change in log-odds "counts for less" the closer we are
# to 100% (or to 0%). Alternatively, we could say that a change from 5% to 10% is
# simply larger than a change from 40% to 50%. Both interpretations can make sense.
# (A short sketch of converting between probabilities and log-odds in R appears
# below, just after we set up the model.)

# To load the Gretna datafile, first enter "9" to go to the "Main Menu".

# If you have already saved the Gretna file on your computer:

1             # load/save data
[Enter]       # do not save current data
c             # commas separate the columns
[choose file] # navigate to the file and choose it with the wizard

# If you have not saved the Gretna file on your computer, and have Internet access:

0 # to exit Rbrul

datafile <- read.csv("http://www.danielezrajohnson.com/gretna.csv")

rbrul() # to re-enter Rbrul

# We should now be at the main menu in Rbrul.
# The "Current data structure" shows:

## word (factor with 58 values): air barred bars beard car ...
## set (factor with 12 values): SQUARE START NEAR NORTH FORCE ...
## position (factor with 2 values): coda-final coda-internal
## r4 (factor with 4 values): approximant zero tap trill
## r2 (integer with 2 values): 1 0
## speaker (factor with 40 values): G15F G16F G17F G17F2 G18F ...
## age (integer with 24 values): 15 16 17 18 19 ...
## gender (factor with 2 values): female male
## agegroup (factor with 2 values): young old
## class (factor with 2 values): working middle
## Total tokens: 1597

# Our response variable will be r2, which is coded 0 for zero /r/ and 1 otherwise.
# We will do a regression to estimate the size of three "fixed effects":
# set - how does the preceding vowel affect coda /r/?
# age - how does speaker age affect coda /r/?
# gender - how does speaker gender affect coda /r/?

# We may not be specifically interested in individual words and speakers, today,
# but we must take them into account when the data is grouped by word and speaker.
# That is, the data has multiple observations of each word and of each speaker.
# Word is nested within set (each word is always in the same lexical set).
# Speaker is nested within gender and age (the data was collected all at one time).
# This matters because if there is a lot of by-speaker or by-word variation,
# what look like significant predictor effects can plausibly arise just by chance.
# We control for this by using two "random effects" for speaker and word.
# Random effects usually have many levels, and nest within predictors of interest.

# From the Rbrul main menu, enter the following commands (manually):

5       # modeling menu
1       # choose variables
5       # choose r2 as the response
[Enter] # r2 is a binary variable
2       # we will model what favors coda /r/, not what disfavors it
7       # choose predictor age
8       # choose predictor gender
2       # choose predictor set
1       # choose predictor word
6       # choose predictor speaker
[Enter] # finish choosing predictors

# Rbrul should now be asking "Are any predictors continuous?" Age is continuous.
# As a rule, it's better to leave naturally-numeric predictors like age as
# continuous, rather than (for example) making several discrete age groups.

7       # mark age as continuous
[Enter] # finish (the other predictors are categorical)

# Rbrul now asks if we want to consider "a pairwise interaction of fixed effects".
# There is sometimes confusion about what an interaction means, in statistics.
# Suppose our older speakers are mostly male and our younger speakers mostly female.
# This is not an interaction! It should be called an association or a lack of
# independence. The term "non-orthogonality" has also been used in sociolinguistics.
# An interaction is when two or more predictors affect the response in a way
# that is not independent. For example, if the effect of gender on /r/ was greater
# for older speakers, that would be an interaction between gender and age. We will
# not be examining possible interactions right now; feel free to experiment later!

[Enter] # finish (don't consider any interactions)

# Rbrul now asks if there are any random intercepts. We select speaker and word.
# This model will allow individual speakers and words to vary in their overall
# rates of /r/ realization, which is sort of like giving each word and speaker
# its own intercept. A random intercept is one type of random effect. A more
# advanced type of random effect, which we will skip over today, but is often
# necessary to consider, is called a random slope. An example would be if some
# words showed more /r/ with males and other words showed more /r/ with females.
# Such things have been known to happen, but we will not consider this here.

1       # mark word as random intercept
6       # mark speaker as random intercept
[Enter] # finish (the other predictors are fixed effects)

# Rbrul now asks about random slopes corresponding to each random intercept.

[Enter] # no by-word random slopes
[Enter] # no by-speaker random slopes

# Rbrul now takes us back to the modeling menu.
# We will choose "2" for a one-level model, considering all predictors.
# A one-level model is like a step-down model except it only does the first step.

2       # choose one-level model
[Enter]

# After a minute... a long output results, including estimates for word and speaker.
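# Before digging into that output, here is a small added sketch (not part of the
# original script) for converting between probabilities and log-odds in R.
# plogis() and qlogis() are base-R functions; the written-out formulas match them.

qlogis(0.5)           # probability 0.5  ->  log-odds 0
qlogis(0.7)           # probability 0.7  ->  log-odds 0.847 (about 0.85)
log(0.7 / (1 - 0.7))  # the same thing, written out as log(p / (1 - p))
plogis(1)             # log-odds +1  ->  probability 0.731
exp(2) / (1 + exp(2)) # log-odds +2  ->  probability 0.881, written out by hand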
# Note: mixed-effects models can take a long time to fit - sometimes very long!
# They can also have estimation problems that do not affect fixed-effects models.
# Currently, Rbrul suppresses some of the warnings that the actual model-fitting
# functions can produce. Therefore, you may want to try fitting the same models
# in R (see the next section) to be sure that you are not being led astray.
# (In the "Settings" menu, you can switch off the long random-effect display.)

# Scroll back to the top, where we see coefficients for age, gender, and set.

## ONE-LEVEL ANALYSIS WITH word [random, not tested] and speaker [random, not
## tested] and set (7.1e-05) + age (0.000392) + gender (0.734)
##
## $gender
## factor logodds tokens 1/1+0 centered factor weight
##   male   0.131    795 0.630                  0.533
## female  -0.131    802 0.587                  0.467
##
## $set
## factor logodds tokens 1/1+0 centered factor weight
##    FIR   1.109    119 0.748                  0.752
##    OUR   0.859      7 0.857                  0.702
##  NURSE   0.597    225 0.689                  0.645
## SQUARE   0.413     73 0.671                  0.602
##  NORTH   0.238    125 0.672                  0.559
##  HEARD   0.000    152 0.625                  0.5
##  FORCE  -0.235    180 0.617                  0.441
##   NEAR  -0.245     86 0.616                  0.439
##  START  -0.439    155 0.606                  0.392
##   CURE  -0.482    112 0.562                  0.382
##   FIRE  -0.598     12 0.667                  0.355
## LETTER  -1.218    351 0.470                  0.228
##
## $age
## continuous logodds
##         +1   0.059

# First look at the top line. It says we have word and speaker as random effects.
# These random effect terms are not tested for significance.
# Then it gives (lexical) set, with a p-value of 7.1e-05, which means 0.000071.
# Lexical set, then, is a highly significant predictor of coda /r/.
# Age is given a p-value of 0.000392, which is also highly significant.
# Gender is not significant: p = 0.734. The observed difference could be chance.

# Turning to the estimates, we see that the first column of numbers is in log-odds.
# Log-odds is the standard unit for reporting the results of logistic regressions.
# We just remember that a given change in log-odds has its biggest effect near 50%,
# and that 0% and 100% are infinitely far away in log-odds terms.

# The predictor age is given a coefficient of 0.059.
# This means that for every year older a speaker is, the model says their log-odds
# of producing coda /r/ increases by 0.059 - so the odds of /r/ go up by 6.1%.

exp(0.059) # 1.061

# The predictor "set" has 12 coefficients, one for each lexical set.
# However, it is important to understand that these coefficients add up to zero.
# Each one represents the difference from the mean for that lexical set.
# We see that the FIR set most favors coda /r/, with a coefficient of 1.109.
# We see that the LETTER set least favors coda /r/, with a coefficient of -1.218.

# For factors, Rbrul also gives the number of tokens, the raw proportion of /r/
# ("1/1+0"), and the VARBRUL (GoldVarb) factor weight for each level.
# Note how positive log-odds correspond to factor weights over 0.5,
# negative log-odds correspond to factor weights under 0.5,
# and zero log-odds (for the HEARD lexical set) corresponds to 0.5 exactly.

# At the bottom of the output, we see the results for the random effects.
# First we will discuss word. Here are the top five and bottom five words.

## $word (random)
## random intercept tokens 1/1+0 centered factor weight
##    for     0.381     39 0.769                  0.594
##   Kerr     0.375     39 0.744                  0.593
##    fur     0.373     39 0.795                  0.592
## moored     0.324     32 0.656                  0.581
##    car     0.322     40 0.675                  0.58
##    ...       ...    ...   ...                    ...
##     Kirsty  -0.346     38 0.658                  0.414
##   daughter  -0.384     33 0.333                  0.405
##     turtle  -0.472     40 0.575                  0.384
## Kirriemuir  -0.484     40 0.425                  0.381
##     person  -0.64      40 0.450                  0.345

# Remember that we already have a predictor for lexical set. After taking set
# into account, the word effects show which individual words favored or disfavored
# coda /r/. They show the deviations of each word from the prediction for their set.
# The word that behaved the most differently from its set's average was "person".
# Under set, we see that overall, the HEARD lexical set contained 62.5% coda /r/.
# But under word, we note that "person" contained /r/ only 45% of the time.
# So "person" receives a negative random effect coefficient: -0.64.
# In the other direction, "for" had 77% /r/ while the NORTH set as a whole had 67%.
# So "for" receives a positive random effect coefficient: 0.381.

# Immediately above the individual words, there is a kind of summary row.

## $word (random)
##         intercept tokens 1/1+0 centered factor weight
## std dev     0.392   1597 0.609                    ...

# Based on this sample data, the population of all words is estimated to vary
# with a standard deviation of 0.392. (A normal distribution is assumed.)

# Under speaker, we see a much higher number in the "std dev" position: 2.336.

## $speaker (random)
##         intercept tokens 1/1+0 centered factor weight
## std dev     2.336   1597 0.609                    ...

# This tells us that subjects are much more variable, or idiosyncratic, than words.
# In fact, this is quite a large subject standard deviation.
# This suggests that we may need to find other between-subject predictors,
# besides age and gender, that help condition the use of coda /r/.
# In fact, we have left social class out of the above model. Including it would
# likely reduce the "left-over" subject standard deviation.

# _________________________________________________________________________________
# EXERCISE 3
# Using the same Gretna data set, fit a model with:
# r2 as the binary response (1 as the application value)
# age, set, and position as fixed effects
# no fixed-effect interactions
# speaker and word as random intercepts
# no random slopes
# Does the "position" variable (e.g. "car" vs. "card") come out as significant?
# If so, which favors coda /r/ more, words like "car" or words like "card"?
# Which speaker favors coda /r/ the most, taking their age into account?
# Who favors it least for their age?
# _________________________________________________________________________________

# MIXED-EFFECTS REGRESSION IN R

# Let's enter "0" to exit Rbrul and return to R.
# We'll then perform the same regressions outside Rbrul that we did inside Rbrul.
# First, let's make a copy of Rbrul's datafile, and call it "gretna".

gretna <- datafile

# If we never used Rbrul, we might load the gretna data with a command like this:

gretna <- read.csv("http://www.danielezrajohnson.com/gretna.csv")

# As above, we want to perform a regression with r2 as the response.
# We want to estimate the fixed effects of age, gender, and lexical set.
# We want to include random intercepts for speaker and word.
# We'll use the glmer() function, from the lme4 package, to do this regression.

# Observe how we build the following regression formula:
# r2 ~ age + gender + set + (1|speaker) + (1|word)
# The response goes at the left, followed by "~".
# The predictors come after "~". We list the variable names, connected by "+".
# A random intercept for z is specified by "(1|z)".
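# Before fitting anything, it can be worth a quick look at the raw data.
# This is a small added sketch (not in the original script); the column names come
# from the data structure shown above, and all the functions used are base R.

str(gretna)                               # check that the columns read in as expected
mean(gretna$r2)                           # overall proportion of non-zero coda /r/
tapply(gretna$r2, gretna$gender, mean)    # raw /r/ proportion by gender
sort(tapply(gretna$r2, gretna$set, mean)) # raw /r/ proportion by lexical set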
# Rbrul, by default, uses sum contrasts, meaning that factors are expressed as
# deviations from the mean.

# Now we'll actually fit the model as "m2". After the formula, we could enter
# "data = gretna", but just putting the object name "gretna" is sufficient.
# The argument "family = binomial" sets glmer() up for a logistic, not linear,
# regression. (For linear regression, we can put "family = gaussian" or omit it.)
# The "control" argument tells the glmer() function to ignore all the warnings
# it might be producing, as well as to perform up to a million optimization steps.
# If you left off the "control = ... " you would have to troubleshoot things.

m2 <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

# This fits the model and stores it in an object called m2. To summarize m2,
# with p-values provided by lmerTest (somewhat controversially), we use summary().

summary(m2)

# Notice that R does not automatically print the details of the random effects
# (words and speakers). You can access these with a function called ranef().
# But we'll focus just on the fixed effects section of the output, at the top.

# Fixed effects:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -2.42513    0.92666  -2.617 0.008869 **
# age           0.05851    0.01545   3.787 0.000152 ***
# gendermale    0.26210    0.76801   0.341 0.732899
# setFIR        1.59092    0.52211   3.047 0.002311 **
# setFIRE      -0.11665    0.83314  -0.140 0.888650
# setFORCE      0.24630    0.43302   0.569 0.569488
# setHEARD      0.48198    0.45939   1.049 0.294094
# setLETTER    -0.73635    0.39725  -1.854 0.063798 .
# setNEAR       0.23679    0.51614   0.459 0.646397
# setNORTH      0.71978    0.48012   1.499 0.133825
# setNURSE      1.07863    0.43774   2.464 0.013737 *
# setOUR        1.34082    1.26878   1.057 0.290615
# setSQUARE     0.89413    0.57327   1.560 0.118828
# setSTART      0.04209    0.44513   0.095 0.924661
# ---
# Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

# You should see some similarities between this and the Rbrul output.
# The estimate for age is the same, and has the same interpretation.
# (Of course, this output includes way too many decimal places.)
# The estimate for gender is twice as large as it was in the Rbrul model.
# This is because R uses "treatment contrasts" (by default).
# Rbrul uses "sum contrasts" (by default).
# So while Rbrul tells us males' and females' difference from the mean,
# R tells us males' difference from females. The female group, set to 0, is absent.
# ("female" is the default baseline group because it comes first alphabetically.)
# We see the same thing at work when we look at the lexical sets.
# FIR still has the highest coefficient (1.59) and LETTER the lowest (-0.74).
# However, in the R model, these coefficients all represent the difference from
# the (absent, baseline) set, CURE, which comes first in alphabetical order.
# In situations like this, where there is no logical baseline level, we may prefer
# sum contrasts.
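# As a side note (an addition, not in the original script): if you wanted to stay
# with treatment contrasts but use a different baseline level, the base-R function
# relevel() changes the reference level. "HEARD" is just an example level here.

gretna2 <- gretna                                  # work on a copy, leaving gretna untouched
gretna2$set <- relevel(gretna2$set, ref = "HEARD") # HEARD becomes the baseline level
# (Refit the model with gretna2, and the set coefficients become differences from HEARD.)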
# To switch to sum contrasts, try entering the following and fitting the model again:

options(contrasts = c("contr.sum", "contr.poly"))

m3 <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

summary(m3)

# Unfortunately, in this case you have to figure out which numbered level is which.
# And you have to fill in the coefficient for the missing level of each factor,
# using the "zero-sum principle". In this sense, Rbrul's output is more friendly.

# To return to treatment contrasts, we would type:

options(contrasts = c("contr.treatment", "contr.poly"))

# In the glmer() outputs, we see some p-values. These are called Wald tests.
# They are useful because there is one for each parameter (level of a factor),
# but they are less reliable than the term-wise p-values we get from anova().
# Also, hypothesis testing with anova() is not affected by the choice of contrasts.

m.full <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

m.noage <- glmer(r2 ~ gender + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

m.nogender <- glmer(r2 ~ age + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

m.noset <- glmer(r2 ~ age + gender + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

# Just now, we fit four models. One has all three predictors (the full model).
# Each of the other three is lacking one of the terms that we wish to test.

anova(m.full, m.noage)    # this tests the significance of age: p = 0.0003922
anova(m.full, m.nogender) # this tests the significance of gender: p = 0.7342
anova(m.full, m.noset)    # this tests the significance of set: p = 7.099e-05

# A useful alternative to doing these tests separately involves the drop1() function.

drop1(m.full, test = "Chisq")

#        Df    AIC    LRT   Pr(Chi)
# <none>    1353.6
# age     1 1364.2 12.569 0.0003922 ***
# gender  1 1351.7  0.115 0.7341985
# set    11 1369.9 38.250 7.099e-05 ***
# ---
# Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

# Fortunately, these p-values are the same as the ones we obtained in Rbrul.
# This is no coincidence! Behind the scenes, Rbrul runs commands like these ones.

# We saw that summary() applied to a glmer() model does not give the details of the
# random effects. It gives the overall standard deviations for speakers and words:

summary(m.full)

# ...
# Random effects:
#  Groups  Name        Variance Std.Dev.
#  word    (Intercept) 0.1538   0.3922
#  speaker (Intercept) 5.4582   2.3363
# ...

# Rounded, these same standard deviations formed the rightmost column of the
# Rbrul output. If we want to see the individual random effects, we use ranef().
ranef(m.full)

# Now we see a log-odds pseudo-coefficient for each individual word and speaker.
# (The true random effect parameters ARE the std. dev.'s for word and speaker.)

# $word
#         (Intercept)
# air     0.153680473
# barred -0.345179443
# bars    0.038372690
# ...
# $speaker
#      (Intercept)
# G15F -1.39409020
# G16F  1.98312077
# G17F -3.30176041
# ...

# You will notice that the speakers and words are in alphabetical order here,
# while Rbrul places them in order of their intercept. Depending on the purpose,
# the Rbrul interface may be easier to work with. Still, it is good to know how to
# accomplish the same tasks in R proper.

# _________________________________________________________________________________
# EXERCISE 4
# Load the Orkney data file (http://www.danielezrajohnson.com/orkney.csv).
# If you have a local copy and there is no Internet, load the local copy.
# Call this object "orkney". Fill in the blank to complete the read.csv() command.

orkney <- read.csv(_____________)

# We want to see if the GOOSE vowels in Orkney are fronting over time.
# That is, we want to see if there is a positive slope of F2.norm ~ year of birth.
# Make a subset of the data as follows.

goose <- c("boot", "food", "fuel", "mule", "mute", "pool", "refute", "stew",
    "stewed", "sure", "through", "too", "tune")

orkney.goose <- subset(orkney, word %in% goose)

# Do a mixed-effects regression in R, using orkney.goose as the data frame.
# The response is called F2.norm. This is a continuous, NORMALIZED variable.
# The predictor is called year. This is also a continuous variable.
# We want to include random intercepts for speaker and word.
# This is a linear regression, not logistic, so use lmer(), not glmer().
# You can omit the argument "family = binomial", or use "family = gaussian".
# You shouldn't need the "control" arguments this time.

m.orkney <- lmer(_______________)
summary(m.orkney)

# Is the age (year of birth) effect statistically significant?
# To answer this, fit another model without the fixed effect of "year".
# Then compare the two models using anova().

# What is the size of the age (year of birth) effect?
# To answer this, inspect the "year" coefficient in the model with "year".
# Remember that the units will be Z-scores per year.
# _________________________________________________________________________________

# This is the end of the regression workshop. Thank you for participating!
# If you have any questions, feel free to email me at danielezrajohnson@gmail.com.
# I do make it a priority to respond to (and eventually solve!) all Rbrul problems.

# SOLUTIONS

# _________________________________________________________________________________
# SOLUTION 1
# In the model m0, what is the value of the intercept?

coef(m0)[1] # the intercept is 4.4

# _________________________________________________________________________________

# _________________________________________________________________________________
# SOLUTION 2
# This solution contains some more advanced R code, but it's not necessary.
# You could also do the "100-times test" manually, repeating Command-Enter/Ctrl-R.

p <- vector()         # set up a vector to hold the p-values
for (i in 1:1000)     # set up to loop 1000 times
{
    x <- 1:10         # x = 1, 2, ..., 10
    y <- sample(1:10) # y = 1, 2, ..., 10 in random order
    m1 <- lm(y ~ x)
    m0 <- lm(y ~ 1)
    p[i] <- anova(m0, m1)[6][2,1] # captures the p-value from the anova() object
}

# How often is the p-value below .05?
# It should be below .05, on average, 5% of the time.
mean(p < .05) # 0.049
# _________________________________________________________________________________

# _________________________________________________________________________________
# SOLUTION 3
# Using the same Gretna data set, fit a model with:
# r2 as the binary response (1 as the application value)
# age, set, and position as fixed effects
# speaker and word as random effects

1       # choose variables
[Enter] # keep r2 as the response
7       # choose age as a predictor
2       # choose set as a predictor
3       # choose position as a predictor
1       # choose word as a predictor
6       # choose speaker as a predictor
[Enter] # finish choosing predictors
7       # mark age as continuous
[Enter] # finish
[Enter] # no fixed-effects interactions
1       # mark word as random intercept
6       # mark speaker as random intercept
[Enter] # finish
[Enter] # no by-word random slopes
[Enter] # no by-speaker random slopes
2       # one-level model

# Does the position variable (e.g. "car" vs. "card") come out as significant?
# The predictor for position in the coda gets a p-value of 0.734.
# Not significant!

# If so, which favors coda /r/ more, words like "car" or words like "card"?
# Since the predictor is not significant, we shouldn't really answer this!

# Which speaker favors coda /r/ the most for their age?
# Speaker G19F2. She produces 90% coda /r/.
# Her age is 19, so her expected log-odds of /r/ is -1.812 + 19 * 0.059 = -0.691.
# Using the logistic formula, this is exp(-0.691) / (1 + exp(-0.691)) = 33.4% /r/.

# Who favors it least for their age?
# Speaker G69F. She produces 0% coda /r/.
# Her age is 69, so her expected log-odds of /r/ is -1.812 + 69 * 0.059 = 2.259.
# This corresponds to exp(2.259) / (1 + exp(2.259)) = 90.5% /r/.
# _________________________________________________________________________________

# _________________________________________________________________________________
# SOLUTION 4
# Load the Orkney data file (http://www.danielezrajohnson.com/orkney.csv).
# If you have a local copy and there is no Internet, load the local copy.
# Call this object orkney.

orkney <- read.csv("http://www.danielezrajohnson.com/orkney.csv")

# We want to see if the GOOSE vowels in Orkney are fronting over time.
# That is, we want to see if there is a positive slope of F2.norm ~ year of birth.
# Make a subset of the data as follows.

goose <- c("boot", "food", "fuel", "mule", "mute", "pool", "refute", "stew",
    "stewed", "sure", "through", "too", "tune")

orkney.goose <- subset(orkney, word %in% goose)

# Do a mixed-effects regression in R, using orkney.goose as the data frame.
# The response is called F2.norm. This is a continuous, NORMALIZED variable.
# The predictor is called year. This is also a continuous variable.
# We want to include random intercepts for speaker and word.
# This is a linear regression (not logistic); use "family = gaussian" or omit it.

model.full <- lmer(F2.norm ~ year + (1|speaker) + (1|word), orkney.goose)

# Is the age (year of birth) effect statistically significant?
# To answer this, we fit another model without the fixed effect of "year".

model.noyear <- lmer(F2.norm ~ (1|speaker) + (1|word), orkney.goose)

# Then compare the two models using anova().

anova(model.full, model.noyear) # p = 0.06801

# What is the size of the age (year of birth) effect?
# To answer this, inspect the "year" coefficient in model.full.

summary(model.full) # returns the entire model summary

# We see a coefficient of 0.0064 for "year" in the "Fixed effects".
# Remember that the units will be Z-scores (normalized units) per year.
fixef(model.full) # returns the fixed effect coefficients from the model

# We have an increase of 0.0064 Z-scores for each year increase in date of birth.
# _________________________________________________________________________________
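# To get a sense of scale (a small added illustration using the coefficient above):
# assuming the year effect is about 0.0064, a speaker born 50 years later would be
# predicted to have a GOOSE F2 roughly 0.0064 * 50 = 0.32 Z-scores higher.

fixef(model.full)["year"] * 50 # predicted F2.norm difference across 50 years of birth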