# INTRODUCTION TO REGRESSION IN R AND IN RBRUL
# Daniel Ezra Johnson
# SSS 5
# 30 July 2014

# THIS FILE CAN BE DOWNLOADED FROM:
# http://www.danielezrajohnson.com/sss5_workshop.R

# INTRODUCTION

# You should save this file on your computer.
# Then make sure you re-open it in R.
# This file is an example of a (heavily annotated) R script.
# Instead of manually typing in R commands, you can also highlight lines in a script,
# and press Command-Enter (Mac) or Ctrl-R (Windows) to execute those commands.
# Many people arrange the desktop so that the script window is on one side of the
# screen, and the R console on the other.
# This makes it easy to work in both windows and go back and forth.

# All the lines beginning with # are comments.
# This means that they have no effect when a section of text is run in R.
# In fact, R will ignore whatever follows a # even within a single line.
# I will use such in-line comments for less important notes.

# Try typing "2 + 2" into the R Console, followed by Return/Enter.
# Now highlight the "2 + 2" below, in the script, and press Command-Enter/Ctrl-R.
# The same thing should happen in both cases.

2 + 2

# Did you get this output?
# [1] 4      # the [1] can be ignored here

# What about the following:

2 + 2 * 3

# Did you get 4 * 3 = 12?
# No - the result is 8.
# As you can see from this result, R follows the standard "order of operations".
# Unless you put parentheses, multiplication will be carried out before addition.

(2 + 2) * 3

# Now you should get the answer 12.
# We often have to use parentheses in R.
# The R GUI uses colored parentheses to help you not lose track of them.

# WHAT IS R AND WHAT IS RBRUL?

# R is a programming language or "environment" for statistics and data analysis.
# With R, we can perform practically any mathematical or statistical calculation.
# We can use hundreds of built-in functions and also define our own functions.
# We can obtain statistics, including regression coefficients and p-values, easily.
# In this workshop, we will explore regression using R proper, and also with Rbrul.
# Rbrul is a front-end interface to some of the regression functions in R.
# Rbrul may be easier and more intuitive - but R is much more powerful and flexible.

# INSTALLING PACKAGES AND RBRUL

# Execute the following commands to install the necessary packages.
# If you already did this, you don't have to do it again (but it doesn't hurt!).
# Be careful with capitalization. R is very sensitive to capitalization.

install.packages("ggplot2")
install.packages("lme4")
install.packages("lmerTest")

# Once the packages are installed, you must load them with the library() command.
# Unlike install.packages(), you don't need quotation marks for library().

library(ggplot2)
library(lme4)
library(lmerTest)

# We install Rbrul with the command source(), which loads an external program.
# Assuming you have an Internet connection, run the following command:

source("http://www.danielezrajohnson.com/Rbrul.R")

# If the source() command returns an error message, try it again.
# If you don't have access to the Internet right now, it won't work.
# However, if you downloaded the file Rbrul.R previously, you can substitute
# the local path to the saved file, for example: source("~/Desktop/Rbrul.R").
# In general, you should source Rbrul from the Web, to get the latest update.
# A window will open saying "This window has opened to help Google Analytics
# track the spread of Rbrul around the globe. You may close it at any time."
# You may close it at any time.
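# As a quick optional check (a small added sketch, not part of the original handout),
# you can confirm that the packages loaded and see which versions you have.
# packageVersion() and sessionInfo() are standard base-R functions.

packageVersion("lme4")    # prints the installed version of lme4
packageVersion("ggplot2") # prints the installed version of ggplot2
sessionInfo()             # lists the attached packages, R version, and platform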
# THREE GOALS OF REGRESSION

# Traditionally, there are three reasons why we build regression models:
# One is to quantitatively ESTIMATE the effects of predictors on a response.
# Predictor means independent variable; response means dependent variable.
# A second reason is to TEST HYPOTHESES (for example, significance testing).
# A third reason is to use observed data to PREDICT new or unseen data.
# This workshop will focus on estimation and simple hypothesis testing.
# We will ask how big an effect a predictor has on a response (estimation).
# And we'll ask if the effect could be the result of chance (hypothesis testing).

# WHAT ARE MIXED MODELS?

# Mixed models or mixed-effects models are a fairly new type of regression model.
# They're useful if data has a grouped structure (e.g. tokens grouped by speaker).
# I believe that many sociolinguistic situations require mixed-effects modeling.
# This makes it hard to properly illustrate "ordinary" regression with real data!

# ORDINARY (FIXED-EFFECTS) REGRESSION IN R

# We can create a pretend data set with a single command:

toy <- data.frame(
    x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    y = c(1, 3, 2, 5, 4, 7, 4, 6, 5, 7))

# Note that we could have also expressed x as "0:9" instead of "c(0, 1, ..., 9)".

# We can plot these points with qplot(), a function in the ggplot2 package.

qplot(x = x, y = y, data = toy)

# When you open a graphing window, it may need to be resized to display correctly.

# Here, our response (y) is numeric, and we will perform a linear regression.
# Linear regression finds the straight line that is the best fit for these points.
# Linear regression minimizes the "sum of squared errors".
# It minimizes the squared vertical distances from the points to the line.

# We can add this best-fit line to the plot with a geom_line() layer.

qplot(x = x, y = y, data = toy) + geom_line(stat = "smooth", method = "lm")

# Any straight line can be described by two numbers: its slope and its intercept.
# The slope of the line is the amount it goes up as you move one unit to the right.
# (If the line goes down as you move to the right, the slope will be negative.)
# The intercept is the value of y at x = 0 (where the line "intercepts" the y-axis).
# Just by eyeballing the line on the plot, we would say that its intercept is 2.
# And we would eyeball that its slope is a little bit more than 0.5.
# Together, these two parameters make up a simple rough model of the data.

# R fits linear regression models - finds the best line - with the lm() command.
# We feed a formula of the form "y ~ x" to lm(), and tell it where the data is.

lm(y ~ x, data = toy)

# The relevant part of this output is the "Coefficients", which are as follows:
# (Intercept)            x
#      2.0000       0.5333

# We see that we were correct in our eyeballing. The intercept is exactly 2.
# The slope is 0.5333, meaning that y goes up 0.5333 for every increase of 1 in x.
# This is estimation of the parameters of a model obtained by linear regression.

# A hypothesis we might want to test is the question of statistical significance.
# We might want to know if the slope, 0.5333, is significantly different from zero.
# More precisely, we want to know the proportion of the time that a slope at least
# that large would be seen, on average, if we just randomly threw 10 points at a wall.

# Above, we just printed the lm() model directly to the screen.
# We can also store the model and give it a name.

m1 <- lm(y ~ x, toy)

# Then we can print it out, or work with it as needed.
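# For instance (a small sketch going slightly beyond the original script),
# summary() prints the coefficients together with their standard errors, t-values,
# and p-values, and confint() gives confidence intervals. Both are base-R functions.

summary(m1) # the p-value for the x slope here matches the F-test computed below
confint(m1) # 95% confidence intervals for the intercept and the slope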
# Test the following commands:

m1          # prints the parameters of the model
coef(m1)    # prints just the coefficients of the model
coef(m1)[1] # prints the intercept (the first coefficient)
coef(m1)[2] # prints the slope (the second coefficient)

# To test the null hypothesis that the true slope is zero, we can fit a null model.
# This is a model with an intercept (the 1), but no slope. The slope is fixed at 0.

m0 <- lm(y ~ 1, toy)
m0

# _________________________________________________________________________________
# EXERCISE 1
# In the model m0, what is the value of the intercept?
# _________________________________________________________________________________

# Note that m1, with an intercept and a slope for x, could also have been fit
# with the formula y ~ 1 + x. The formula y ~ x is short for y ~ 1 + x. You could
# also tell the program to omit the intercept, forcing the line to pass through
# the origin (0, 0). You would do this with the formula y ~ 0 + x.

# To test the significance of the slope in m1, we need to compare two models:
# m1 and m0. m1 has an intercept and a slope, while m0 only has an intercept.
# As it is a more complex model, m1 will ALWAYS fit the data better than m0.
# But that extra complexity comes at a cost. In general, we prefer simpler models.
# We do a test to decide if the better fit is worth the added complexity. This
# can be called a "significance test" - testing the "statistical significance"
# of the slope. It is also called a "hypothesis test" - testing the hypothesis
# that the underlying slope is zero. If we reject this, then we have an effect!
# In the case of ordinary linear regression, this is done with an F-test.
# More generally, two models are compared with a likelihood-ratio chi-squared test.
# Note: this is also how GoldVarb compares models in its stepwise routines.
# If you haven't heard of GoldVarb, consider yourself lucky.

# We use anova() to compare the models, perform an F-test and obtain a p-value.
# Despite the name, this command does not necessarily carry out "an ANOVA".
# ANOVA is specifically for comparing discrete groups, which we're not doing here.

anova(m0, m1) # note that the order of the two models does not matter

# In the anova(m0, m1) output, focus on where it says "Pr(>F)":
#   Pr(>F)
# 0.005163

# This is our p-value. It is well below the usual .05 significance threshold.
# Therefore, we tend to believe that the observed trend is real and not just noise.

# This toy example had a single predictor, x.
# Naturally, regression becomes more useful when there is more than one predictor.
# Then, we estimate several "slopes" simultaneously (this is harder to visualize).
# We might have a formula y ~ x1 + x2, or y ~ x1 * x2 (indicating an interaction).
# Note: this is called "multiple regression", not "multivariate regression".
# Multivariate regression is when there is more than one dependent variable.
# (A short simulated example of multiple regression follows the exercise below.)

# _________________________________________________________________________________
# EXERCISE 2
# If the y values were chosen at random, any trend would have to be due to chance.
# Repeat the following commands by highlighting and using Command-Enter/Ctrl-R.
# Keep an eye on the p-values.

x <- 1:10         # x = 1, 2, ..., 10
y <- sample(1:10) # y = 1, 2, ..., 10 in random order
m1 <- lm(y ~ x)
m0 <- lm(y ~ 1)
anova(m0, m1)

# How often is the p-value below .05?
# With a true slope of zero, the p-value should fall below .05 .... 5% of the time.
# Try it 100 times! How many times did you get a "false positive" (Type I error)?
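# To illustrate the multiple-regression formulas mentioned above, here is a small
# sketch with simulated data (an addition, not part of the original workshop).
# The predictors x1 and x2 and their "true" slopes are made up for illustration.

set.seed(1)                            # for reproducibility
x1 <- rnorm(50)                        # first simulated predictor
x2 <- rnorm(50)                        # second simulated predictor
y2 <- 1 + 2 * x1 - 1 * x2 + rnorm(50)  # response built from known slopes, plus noise
m.add <- lm(y2 ~ x1 + x2)              # additive model: one slope per predictor
m.int <- lm(y2 ~ x1 * x2)              # the same, plus the x1:x2 interaction term
coef(m.add)                            # estimates should land near 1, 2, and -1
anova(m.add, m.int)                    # F-test: is the interaction worth keeping?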
# _________________________________________________________________________________

# ORDINARY (FIXED-EFFECTS) REGRESSION IN RBRUL

# In R, we work by entering commands (or sets of commands) at the > prompt.
# Rbrul is based on text menus; it interacts with the user by asking questions.
# You can cut and paste multiple lines into Rbrul at once, but it's not a great idea.
# You also can't go backwards if you make a mistake.
# However, you can quit Rbrul (either from a menu or with the Stop button) and
# go back in without starting over. Or you can start over with the "reset" option.

# We will attempt to carry out the same regression that we did above, in Rbrul.
# First, note that Rbrul always uses the object named "datafile" to hold its data.
# We can load any file with Rbrul, but since we already have our "toy" data frame,
# we'll use the assignment operator (a left-pointing arrow) to create the "datafile".

datafile <- toy

# Note that you can use = instead of <-, but in R circles, this is a grave faux pas.

# We start Rbrul with the R command rbrul().

rbrul()

# After a moment and some messages, you should see the data structure and main menu.
# If you don't, it usually means that required packages have not loaded.
# On office computers, you may need to run R "as administrator" to install packages.

## Current data structure:
## x (numeric with 10 values): 0 1 2 3 4 ...
## y (numeric with 7 values): 1 3 2 5 4 ...
## Total tokens: 10

## MAIN MENU
## 1-load/save data   2-adjust data
## 4-crosstabs   5-modeling   6-plotting
## 8-restore data   9-reset   0-exit

# To fit a regression model, we type "5" for "modeling", and press Return/Enter.

5

# We then see the modeling menu:

## MODELING MENU
## 1-choose variables   2-one-level (recommended)
## 3-step-up   4-step-down   5-step-up/step-down
## 6-plotting   8-settings   9-main menu   0-exit
## 10-chi-square test

# The first thing we have to do is choose our variables.
# For example, we have to tell Rbrul which variable is the response.
# We enter "1" to choose variables.

1

## Choose response (dependent variable) by number (1-x 2-y).

# To choose y as the response variable, we type "2" followed by Return/Enter.

2

## Type of response? (1-continuous Enter-binary)

# Here we type "1" plus Return/Enter, because y is a numeric/continuous variable.
# When the response is a real number, it is called linear regression.
# (If the response was binary, we would be doing logistic regression. See below.)

1

## Choose predictors (independent variables) by number (1-x).

# Here we only have one choice: we type "1" (plus Return/Enter) to select x.

1

## Are any predictors continuous? (1-x Enter-none)

# Indeed, our predictor x is a numeric/continuous predictor, so we enter "1".

1

# Since there are no other predictors, Rbrul returns us to the modeling menu,
# after displaying the "Current variables": a continuous response y, and a fixed,
# continuous predictor x. Here, "fixed" means a fixed effect, an ordinary predictor.
# The alternative to fixed effects is random effects, which we will use later on.

# There is only one predictor being considered for our model: x.
# Therefore, it doesn't matter if we choose "step-up", "step-down", or "one-level".
# Stepwise modeling can be dangerous, so we will get in the "one-level" habit,
# entering "2" for a one-level analysis.

2

## ONE-LEVEL ANALYSIS WITH x (0.00516)
##
## $x
## continuous  coef
##         +1 0.533
##
## $misc
## deviance    AIC df intercept grand mean R2.fixed R2.random R2.total
##   12.933 36.951  2         2        4.4     coming soon!
# The above output is Rbrul's summary of the model with response y and predictor x.
# We see the p-value for x at the top of the summary: 0.00516.
# It is the same as what we obtained before, just rounded to 3 significant digits.
# We see the slope, 0.533, under the heading "coef" in the "x" section.
# We see the intercept, 2, under the heading "intercept" in the "misc" section.
# We also see other information, like the "grand" (overall) mean.
# We should also see the R-squared value, a measure of the fit of the model.
# Right now, there is a problem with the calculations; it says "coming soon!"

# MIXED-EFFECTS REGRESSION IN RBRUL

# We will now run a regression on a real linguistic data set.
# The response variable is post-vocalic or coda /r/; there are 1597 tokens of it.
# 40 speakers in Gretna, Scotland, read a word list containing 40 or so coda /r/s.
# The variable was coded narrowly, but we focus on zero vs. any other realization.
# This is a binary dependent variable, so we will be running a logistic regression.

# Predictors are modeled as affecting the LOG-ODDS of the response in a linear way.
# Log-odds are a way of taking the 0-to-1 probability scale, and stretching it out,
# so that it covers all numbers from negative infinity to positive infinity.
# Log-odds equals the natural logarithm of the odds, or log(p / (1 - p)).
# A log-odds of 0 corresponds to 0.5 probability.
# A log-odds of +1 corresponds to a probability of 0.73; -1 corresponds to 0.27.
# A log-odds of +2 corresponds to a probability of 0.88; -2 corresponds to 0.12.
# With coda /r/, a log-odds of 0 corresponds to 50% /r/ pronunciation.
# But regression mainly deals with effects, that is, changes in log-odds.
# On the log-odds scale, the difference between 50% and 70% /r/ is 0.85 log-odds.
# A similar change of 0.85 would get us from 70% to 85% /r/, from 85% to 93% /r/,
# or from 93% to 97%.
# We might say that a fixed change in log-odds "counts for less" the closer we are
# to 100% (or to 0%). Alternatively, we could say that a change from 5% to 10% is
# simply larger than a change from 40% to 50%. Both interpretations can make sense.
# (A short sketch of converting between probabilities and log-odds in R appears
# below, just after we set up the model.)

# To load the Gretna datafile, first enter "9" to go to the "Main Menu".

# If you have already saved the Gretna file on your computer:

1             # load/save data
[Enter]       # do not save current data
c             # commas separate the columns
[choose file] # navigate to the file and choose it with the wizard

# If you have not saved the Gretna file on your computer, and have Internet access:

0 # to exit Rbrul

datafile <- read.csv("http://www.danielezrajohnson.com/gretna.csv")

rbrul() # to re-enter Rbrul

# We should now be at the main menu in Rbrul.
# The "Current data structure" shows:

## word (factor with 58 values): air barred bars beard car ...
## set (factor with 12 values): SQUARE START NEAR NORTH FORCE ...
## position (factor with 2 values): coda-final coda-internal
## r4 (factor with 4 values): approximant zero tap trill
## r2 (integer with 2 values): 1 0
## speaker (factor with 40 values): G15F G16F G17F G17F2 G18F ...
## age (integer with 24 values): 15 16 17 18 19 ...
## gender (factor with 2 values): female male
## agegroup (factor with 2 values): young old
## class (factor with 2 values): working middle
## Total tokens: 1597

# Our response variable will be r2, which is coded 0 for zero /r/ and 1 otherwise.
# We will do a regression to estimate the size of three "fixed effects":
# set - how does the preceding vowel affect coda /r/?
# age - how does speaker age affect coda /r/?
# gender - how does speaker gender affect coda /r/?

# We may not be specifically interested in individual words and speakers, today,
# but we must take them into account when the data is grouped by word and speaker.
# That is, the data has multiple observations of each word and of each speaker.
# Word is nested within set (each word is always in the same lexical set).
# Speaker is nested within gender and age (the data was collected all at one time).
# This matters because if there is a lot of by-speaker or by-word variation,
# what look like significant predictor effects can plausibly arise just by chance.
# We control for this by using two "random effects" for speaker and word.
# Random effects usually have many levels, and nest within predictors of interest.

# From the Rbrul main menu, enter the following commands (manually):

5       # modeling menu
1       # choose variables
5       # choose r2 as the response
[Enter] # r2 is a binary variable
2       # we will model what favors coda /r/, not what disfavors it
7       # choose predictor age
8       # choose predictor gender
2       # choose predictor set
1       # choose predictor word
6       # choose predictor speaker
[Enter] # finish choosing predictors

# Rbrul should now be asking "Are any predictors continuous?" Age is continuous.
# As a rule, it's better to leave naturally-numeric predictors like age as
# continuous, rather than (for example) making several discrete age groups.

7       # mark age as continuous
[Enter] # finish (the other predictors are categorical)

# Rbrul now asks if we want to consider "a pairwise interaction of fixed effects".
# There is sometimes confusion about what an interaction means, in statistics.
# Suppose our older speakers are mostly male and our younger speakers mostly female.
# This is not an interaction! It should be called an association or a lack of
# independence. The term "non-orthogonality" has also been used in sociolinguistics.
# An interaction is when two or more predictors affect the response in a way
# that is not independent. For example, if the effect of gender on /r/ was greater
# for older speakers, that would be an interaction between gender and age. We will
# not be examining possible interactions right now; feel free to experiment later!

[Enter] # finish (don't consider any interactions)

# Rbrul now asks if there are any random intercepts. We select speaker and word.
# This model will allow individual speakers and words to vary in their overall
# rates of /r/ realization, which is sort of like giving each word and speaker
# its own intercept. A random intercept is one type of random effect. A more
# advanced type of random effect, which we will skip over today, but is often
# necessary to consider, is called a random slope. An example would be if some
# words showed more /r/ with males and other words showed more /r/ with females.
# Such things have been known to happen, but we will not consider this here.

1       # mark word as random intercept
6       # mark speaker as random intercept
[Enter] # finish (the other predictors are fixed effects)

# Rbrul now asks about random slopes corresponding to each random intercept.

[Enter] # no by-word random slopes
[Enter] # no by-speaker random slopes

# Rbrul now takes us back to the modeling menu.
# We will choose "2" for a one-level model, considering all predictors.
# A one-level model is like a step-down model except it only does the first step.

2       # choose one-level model
[Enter]

# After a minute... a long output results, including estimates for word and speaker.
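# Before digging into that output, here is a small added sketch (not part of the
# original script) for converting between probabilities and log-odds in R.
# plogis() and qlogis() are base-R functions; the written-out formulas match them.

qlogis(0.5)           # probability 0.5  ->  log-odds 0
qlogis(0.7)           # probability 0.7  ->  log-odds 0.847 (about 0.85)
log(0.7 / (1 - 0.7))  # the same thing, written out as log(p / (1 - p))
plogis(1)             # log-odds +1  ->  probability 0.731
exp(2) / (1 + exp(2)) # log-odds +2  ->  probability 0.881, written out by hand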
# Note: mixed-effects models can take a long time to fit - sometimes very long!
# They can also have estimation problems that do not affect fixed-effects models.
# Currently, Rbrul suppresses some of the warnings that the actual model-fitting
# functions can produce. Therefore, you may want to try fitting the same models
# in R (see the next section) to be sure that you are not being led astray.
# (In the "Settings" menu, you can switch off the long random-effect display.)

# Scroll back to the top, where we see coefficients for age, gender, and set.

## ONE-LEVEL ANALYSIS WITH word [random, not tested] and speaker [random, not
## tested] and set (7.1e-05) + age (0.000392) + gender (0.734)
##
## $gender
## factor logodds tokens 1/1+0 centered factor weight
##   male   0.131    795 0.630                  0.533
## female  -0.131    802 0.587                  0.467
##
## $set
## factor logodds tokens 1/1+0 centered factor weight
##    FIR   1.109    119 0.748                  0.752
##    OUR   0.859      7 0.857                  0.702
##  NURSE   0.597    225 0.689                  0.645
## SQUARE   0.413     73 0.671                  0.602
##  NORTH   0.238    125 0.672                  0.559
##  HEARD   0.000    152 0.625                  0.5
##  FORCE  -0.235    180 0.617                  0.441
##   NEAR  -0.245     86 0.616                  0.439
##  START  -0.439    155 0.606                  0.392
##   CURE  -0.482    112 0.562                  0.382
##   FIRE  -0.598     12 0.667                  0.355
## LETTER  -1.218    351 0.470                  0.228
##
## $age
## continuous logodds
##         +1   0.059

# First look at the top line. It says we have word and speaker as random effects.
# These random effect terms are not tested for significance.
# Then it gives (lexical) set, with a p-value of 7.1e-05, which means 0.000071.
# Lexical set, then, is a highly significant predictor of coda /r/.
# Age is given a p-value of 0.000392, which is also highly significant.
# Gender is not significant: p = 0.734. The observed difference could be chance.

# Turning to the estimates, we see that the first column of numbers is in log-odds.
# Log-odds is the standard unit for reporting the results of logistic regressions.
# We just remember that a given change in log-odds has its biggest effect near 50%,
# and that 0% and 100% are infinitely far away in log-odds terms.

# The predictor age is given a coefficient of 0.059.
# This means that for every year older a speaker is, the model says their log-odds
# of producing coda /r/ increases by 0.059 - so the odds of /r/ go up by 6.1%.

exp(0.059) # 1.061

# The predictor "set" has 12 coefficients, one for each lexical set.
# However, it is important to understand that these coefficients add up to zero.
# Each one represents the difference from the mean for that lexical set.
# We see that the FIR set most favors coda /r/, with a coefficient of 1.109.
# We see that the LETTER set least favors coda /r/, with a coefficient of -1.218.

# For factors, Rbrul also gives the number of tokens, the raw proportion of /r/
# ("1/1+0"), and the VARBRUL (GoldVarb) factor weight for each level.
# Note how positive log-odds correspond to factor weights over 0.5,
# negative log-odds correspond to factor weights under 0.5,
# and zero log-odds (for the HEARD lexical set) corresponds to 0.5 exactly.

# At the bottom of the output, we see the results for the random effects.
# First we will discuss word. Here are the top five and bottom five words.

## $word (random)
## random intercept tokens 1/1+0 centered factor weight
##    for     0.381     39 0.769                  0.594
##   Kerr     0.375     39 0.744                  0.593
##    fur     0.373     39 0.795                  0.592
## moored     0.324     32 0.656                  0.581
##    car     0.322     40 0.675                  0.58
##    ...       ...    ...   ...                    ...
##     Kirsty  -0.346     38 0.658                  0.414
##   daughter  -0.384     33 0.333                  0.405
##     turtle  -0.472     40 0.575                  0.384
## Kirriemuir  -0.484     40 0.425                  0.381
##     person  -0.64      40 0.450                  0.345

# Remember that we already have a predictor for lexical set. After taking set
# into account, the word effects show which individual words favored or disfavored
# coda /r/. They show the deviations of each word from the prediction for their set.
# The word that behaved the most differently from its set's average was "person".
# Under set, we see that overall, the HEARD lexical set contained 62.5% coda /r/.
# But under word, we note that "person" contained /r/ only 45% of the time.
# So "person" receives a negative random effect coefficient: -0.64.
# In the other direction, "for" had 77% /r/ while the NORTH set as a whole had 67%.
# So "for" receives a positive random effect coefficient: 0.381.

# Immediately above the individual words, there is a kind of summary row.

## $word (random)
##         intercept tokens 1/1+0 centered factor weight
## std dev     0.392   1597 0.609                    ...

# Based on this sample data, the population of all words is estimated to vary
# with a standard deviation of 0.392. (A normal distribution is assumed.)

# Under speaker, we see a much higher number in the "std dev" position: 2.336.

## $speaker (random)
##         intercept tokens 1/1+0 centered factor weight
## std dev     2.336   1597 0.609                    ...

# This tells us that subjects are much more variable, or idiosyncratic, than words.
# In fact, this is quite a large subject standard deviation.
# This suggests that we may need to find other between-subject predictors,
# besides age and gender, that help condition the use of coda /r/.
# In fact, we have left social class out of the above model. Including it would
# likely reduce the "left-over" subject standard deviation.

# _________________________________________________________________________________
# EXERCISE 3
# Using the same Gretna data set, fit a model with:
# r2 as the binary response (1 as the application value)
# age, set, and position as fixed effects
# no fixed-effect interactions
# speaker and word as random intercepts
# no random slopes
# Does the "position" variable (e.g. "car" vs. "card") come out as significant?
# If so, which favors coda /r/ more, words like "car" or words like "card"?
# Which speaker favors coda /r/ the most, taking their age into account?
# Who favors it least for their age?
# _________________________________________________________________________________

# MIXED-EFFECTS REGRESSION IN R

# Let's enter "0" to exit Rbrul and return to R.
# We'll then perform the same regressions outside Rbrul that we did inside Rbrul.
# First, let's make a copy of Rbrul's datafile, and call it "gretna".

gretna <- datafile

# If we never used Rbrul, we might load the gretna data with a command like this:

gretna <- read.csv("http://www.danielezrajohnson.com/gretna.csv")

# As above, we want to perform a regression with r2 as the response.
# We want to estimate the fixed effects of age, gender, and lexical set.
# We want to include random intercepts for speaker and word.
# We'll use the glmer() function, from the lme4 package, to do this regression.

# Observe how we build the following regression formula:
# r2 ~ age + gender + set + (1|speaker) + (1|word)
# The response goes at the left, followed by "~".
# The predictors come after "~". We list the variable names, connected by "+".
# A random intercept for z is specified by "(1|z)".
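# Before fitting anything, it can be worth a quick look at the raw data.
# This is a small added sketch (not in the original script); the column names come
# from the data structure shown above, and all the functions used are base R.

str(gretna)                               # check that the columns read in as expected
mean(gretna$r2)                           # overall proportion of non-zero coda /r/
tapply(gretna$r2, gretna$gender, mean)    # raw /r/ proportion by gender
sort(tapply(gretna$r2, gretna$set, mean)) # raw /r/ proportion by lexical set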
# Rbrul, by default, uses sum contrasts, meaning that factors are expressed as
# deviations from the mean.

# Now we'll actually fit the model as "m2". After the formula, we could enter
# "data = gretna", but just putting the object name "gretna" is sufficient.
# The argument "family = binomial" sets glmer() up for a logistic, not linear,
# regression. (For linear regression, we can put "family = gaussian" or omit it.)
# The "control" argument tells the glmer() function to ignore all the warnings
# it might be producing, as well as to perform up to a million optimization steps.
# If you left off the "control = ... " you would have to troubleshoot things.

m2 <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

# This fits the model and stores it in an object called m2. To summarize m2,
# with p-values provided by lmerTest (somewhat controversially), we use summary().

summary(m2)

# Notice that R does not automatically print the details of the random effects
# (words and speakers). You can access these with a function called ranef().
# But we'll focus just on the fixed effects section of the output, at the top.

# Fixed effects:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept)  -2.42513    0.92666  -2.617 0.008869 **
# age           0.05851    0.01545   3.787 0.000152 ***
# gendermale    0.26210    0.76801   0.341 0.732899
# setFIR        1.59092    0.52211   3.047 0.002311 **
# setFIRE      -0.11665    0.83314  -0.140 0.888650
# setFORCE      0.24630    0.43302   0.569 0.569488
# setHEARD      0.48198    0.45939   1.049 0.294094
# setLETTER    -0.73635    0.39725  -1.854 0.063798 .
# setNEAR       0.23679    0.51614   0.459 0.646397
# setNORTH      0.71978    0.48012   1.499 0.133825
# setNURSE      1.07863    0.43774   2.464 0.013737 *
# setOUR        1.34082    1.26878   1.057 0.290615
# setSQUARE     0.89413    0.57327   1.560 0.118828
# setSTART      0.04209    0.44513   0.095 0.924661
# ---
# Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

# You should see some similarities between this and the Rbrul output.
# The estimate for age is the same, and has the same interpretation.
# (Of course, this output includes way too many decimal places.)
# The estimate for gender is twice as large as it was in the Rbrul model.
# This is because R uses "treatment contrasts" (by default).
# Rbrul uses "sum contrasts" (by default).
# So while Rbrul tells us males' and females' difference from the mean,
# R tells us males' difference from females. The female group, set to 0, is absent.
# ("female" is the default baseline group because it comes first alphabetically.)
# We see the same thing at work when we look at the lexical sets.
# FIR still has the highest coefficient (1.59) and LETTER the lowest (-0.74).
# However, in the R model, these coefficients all represent the difference from
# the (absent, baseline) set, CURE, which comes first in alphabetical order.
# In situations like this, where there is no logical baseline level, we may prefer
# sum contrasts.
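# As a side note (an addition, not in the original script): if you wanted to stay
# with treatment contrasts but use a different baseline level, the base-R function
# relevel() changes the reference level. "HEARD" is just an example level here.

gretna2 <- gretna                                  # work on a copy, leaving gretna untouched
gretna2$set <- relevel(gretna2$set, ref = "HEARD") # HEARD becomes the baseline level
# (Refit the model with gretna2, and the set coefficients become differences from HEARD.)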
# To switch to sum contrasts, try entering the following and fitting the model again:

options(contrasts = c("contr.sum", "contr.poly"))

m3 <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

summary(m3)

# Unfortunately, in this case you have to figure out which numbered level is which.
# And you have to fill in the coefficient for the missing level of each factor,
# using the "zero-sum principle". In this sense, Rbrul's output is more friendly.

# To return to treatment contrasts, we would type:

options(contrasts = c("contr.treatment", "contr.poly"))

# In the glmer() outputs, we see some p-values. These are called Wald tests.
# They are useful because there is one for each parameter (level of a factor),
# but they are less reliable than the term-wise p-values we get from anova().
# Also, hypothesis testing with anova() is not affected by the choice of contrasts.

m.full <- glmer(r2 ~ age + gender + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

m.noage <- glmer(r2 ~ gender + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

m.nogender <- glmer(r2 ~ age + set + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

m.noset <- glmer(r2 ~ age + gender + (1|speaker) + (1|word), gretna,
    family = binomial,
    control = glmerControl(optimizer = "bobyqa", check.conv.grad = "ignore",
        check.conv.singular = "ignore", check.conv.hess = "ignore",
        check.scaleX = "ignore", optCtrl = list(maxfun = 1000000)))

# Just now, we fit four models. One has all three predictors (the full model).
# Each of the other three is lacking one of the terms that we wish to test.

anova(m.full, m.noage)    # this tests the significance of age: p = 0.0003922
anova(m.full, m.nogender) # this tests the significance of gender: p = 0.7342
anova(m.full, m.noset)    # this tests the significance of set: p = 7.099e-05

# A useful alternative to doing these tests separately involves the drop1() function.

drop1(m.full, test = "Chisq")

#        Df    AIC    LRT   Pr(Chi)
# <none>    1353.6
# age     1 1364.2 12.569 0.0003922 ***
# gender  1 1351.7  0.115 0.7341985
# set    11 1369.9 38.250 7.099e-05 ***
# ---
# Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1

# Fortunately, these p-values are the same as the ones we obtained in Rbrul.
# This is no coincidence! Behind the scenes, Rbrul runs commands like these ones.

# We saw that summary() applied to a glmer() model does not give the details of the
# random effects. It gives the overall standard deviations for speakers and words:

summary(m.full)

# ...
# Random effects:
#  Groups  Name        Variance Std.Dev.
#  word    (Intercept) 0.1538   0.3922
#  speaker (Intercept) 5.4582   2.3363
# ...

# Rounded, these same standard deviations formed the rightmost column of the
# Rbrul output. If we want to see the individual random effects, we use ranef().
ranef(m.full)

# Now we see a log-odds pseudo-coefficient for each individual word and speaker.
# (The true random effect parameters ARE the std. dev.'s for word and speaker.)

# $word
#         (Intercept)
# air     0.153680473
# barred -0.345179443
# bars    0.038372690
# ...
# $speaker
#      (Intercept)
# G15F -1.39409020
# G16F  1.98312077
# G17F -3.30176041
# ...

# You will notice that the speakers and words are in alphabetical order here,
# while Rbrul places them in order of their intercept. Depending on the purpose,
# the Rbrul interface may be easier to work with. Still, it is good to know how to
# accomplish the same tasks in R proper.

# _________________________________________________________________________________
# EXERCISE 4
# Load the Orkney data file (http://www.danielezrajohnson.com/orkney.csv).
# If you have a local copy and there is no Internet, load the local copy.
# Call this object "orkney". Fill in the blank to complete the read.csv() command.

orkney <- read.csv(_____________)

# We want to see if the GOOSE vowels in Orkney are fronting over time.
# That is, we want to see if there is a positive slope of F2.norm ~ year of birth.
# Make a subset of the data as follows.

goose <- c("boot", "food", "fuel", "mule", "mute", "pool", "refute", "stew",
    "stewed", "sure", "through", "too", "tune")

orkney.goose <- subset(orkney, word %in% goose)

# Do a mixed-effects regression in R, using orkney.goose as the data frame.
# The response is called F2.norm. This is a continuous, NORMALIZED variable.
# The predictor is called year. This is also a continuous variable.
# We want to include random intercepts for speaker and word.
# This is a linear regression, not logistic, so use lmer(), not glmer().
# You can omit the argument "family = binomial", or use "family = gaussian".
# You shouldn't need the "control" arguments this time.

m.orkney <- lmer(_______________)
summary(m.orkney)

# Is the age (year of birth) effect statistically significant?
# To answer this, fit another model without the fixed effect of "year".
# Then compare the two models using anova().

# What is the size of the age (year of birth) effect?
# To answer this, inspect the "year" coefficient in the model with "year".
# Remember that the units will be Z-scores per year.
# _________________________________________________________________________________

# This is the end of the regression workshop. Thank you for participating!
# If you have any questions, feel free to email me at danielezrajohnson@gmail.com.
# I do make it a priority to respond to (and eventually solve!) all Rbrul problems.

# SOLUTIONS

# _________________________________________________________________________________
# SOLUTION 1
# In the model m0, what is the value of the intercept?

coef(m0)[1] # the intercept is 4.4

# _________________________________________________________________________________

# _________________________________________________________________________________
# SOLUTION 2
# This solution contains some more advanced R code, but it's not necessary.
# You could also do the "100-times test" manually, repeating Command-Enter/Ctrl-R.

p <- vector()         # set up a vector to hold the p-values
for (i in 1:1000)     # set up to loop 1000 times
{
    x <- 1:10         # x = 1, 2, ..., 10
    y <- sample(1:10) # y = 1, 2, ..., 10 in random order
    m1 <- lm(y ~ x)
    m0 <- lm(y ~ 1)
    p[i] <- anova(m0, m1)[6][2,1] # captures the p-value from the anova() object
}

# How often is the p-value below .05?
# It should be below .05, on average, 5% of the time.
mean(p < .05) # 0.049
# _________________________________________________________________________________

# _________________________________________________________________________________
# SOLUTION 3
# Using the same Gretna data set, fit a model with:
# r2 as the binary response (1 as the application value)
# age, set, and position as fixed effects
# speaker and word as random effects

1       # choose variables
[Enter] # keep r2 as the response
7       # choose age as a predictor
2       # choose set as a predictor
3       # choose position as a predictor
1       # choose word as a predictor
6       # choose speaker as a predictor
[Enter] # finish choosing predictors
7       # mark age as continuous
[Enter] # finish
[Enter] # no fixed-effects interactions
1       # mark word as random intercept
6       # mark speaker as random intercept
[Enter] # finish
[Enter] # no by-word random slopes
[Enter] # no by-speaker random slopes
2       # one-level model

# Does the position variable (e.g. "car" vs. "card") come out as significant?
# The predictor for position in the coda gets a p-value of 0.734.
# Not significant!

# If so, which favors coda /r/ more, words like "car" or words like "card"?
# Since the predictor is not significant, we shouldn't really answer this!

# Which speaker favors coda /r/ the most for their age?
# Speaker G19F2. She produces 90% coda /r/.
# Her age is 19, so her expected log-odds of /r/ is -1.812 + 19 * 0.059 = -0.691.
# Using the logistic formula, this is exp(-0.691) / (1 + exp(-0.691)) = 33.4% /r/.

# Who favors it least for their age?
# Speaker G69F. She produces 0% coda /r/.
# Her age is 69, so her expected log-odds of /r/ is -1.812 + 69 * 0.059 = 2.259.
# This corresponds to exp(2.259) / (1 + exp(2.259)) = 90.5% /r/.
# _________________________________________________________________________________

# _________________________________________________________________________________
# SOLUTION 4
# Load the Orkney data file (http://www.danielezrajohnson.com/orkney.csv).
# If you have a local copy and there is no Internet, load the local copy.
# Call this object orkney.

orkney <- read.csv("http://www.danielezrajohnson.com/orkney.csv")

# We want to see if the GOOSE vowels in Orkney are fronting over time.
# That is, we want to see if there is a positive slope of F2.norm ~ year of birth.
# Make a subset of the data as follows.

goose <- c("boot", "food", "fuel", "mule", "mute", "pool", "refute", "stew",
    "stewed", "sure", "through", "too", "tune")

orkney.goose <- subset(orkney, word %in% goose)

# Do a mixed-effects regression in R, using orkney.goose as the data frame.
# The response is called F2.norm. This is a continuous, NORMALIZED variable.
# The predictor is called year. This is also a continuous variable.
# We want to include random intercepts for speaker and word.
# This is a linear regression (not logistic); use "family = gaussian" or omit it.

model.full <- lmer(F2.norm ~ year + (1|speaker) + (1|word), orkney.goose)

# Is the age (year of birth) effect statistically significant?
# To answer this, we fit another model without the fixed effect of "year".

model.noyear <- lmer(F2.norm ~ (1|speaker) + (1|word), orkney.goose)

# Then compare the two models using anova().

anova(model.full, model.noyear) # p = 0.06801

# What is the size of the age (year of birth) effect?
# To answer this, inspect the "year" coefficient in model.full.

summary(model.full) # returns the entire model summary

# We see a coefficient of 0.0064 for "year" in the "Fixed effects".
# Remember that the units will be Z-scores (normalized units) per year.
fixef(model.full) # returns the fixed effect coefficients from the model

# We have an increase of 0.0064 Z-scores for each year increase in date of birth.
# _________________________________________________________________________________
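# To get a sense of scale (a small added illustration using the coefficient above):
# assuming the year effect is about 0.0064, a speaker born 50 years later would be
# predicted to have a GOOSE F2 roughly 0.0064 * 50 = 0.32 Z-scores higher.

fixef(model.full)["year"] * 50 # predicted F2.norm difference across 50 years of birth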