
EPSY6210_20100322

Page history last edited by Starr Hoffman 14 years, 1 month ago

EPSY 6210, 03.22.2010

(last time, finished going thru handout and output file on health/doctor visits, etc.)

SCHEDULE ADJUSTMENTS

April 5 - logistic regression (guest lecturer)
April 12 - propensity scores (guest lecturer)
following class: will cover original topic from April 5

Review

Case One Regression: simple linear regression with one predictor
Case Two Regression: multiple regression with two predictors that are orthogonal (unrelated)

boils down to the relationship between y and y-hat (R-squared); y-hat is the linear combination of all your predictors
regression and prediction are all about how well the regression line (y-hat) predicts the scores
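
the link between R-squared and y-hat can be checked numerically. this is a minimal sketch, not the class's SPSS procedure: the data are simulated (hypothetical), and the names x1, x2, y are placeholders. it fits a two-predictor regression and compares R-squared with the squared Pearson r between y and y-hat.

```python
import numpy as np

# Hypothetical data: two predictors and an outcome with some noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

# Fit y-hat = b0 + b1*x1 + b2*x2 by ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

# R-squared as the proportion of variance in y that is explained.
ss_residual = np.sum((y - y_hat) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_residual / ss_total

# Pearson r between observed scores (y) and predicted scores (y-hat).
r_y_yhat = np.corrcoef(y, y_hat)[0, 1]

print(f"R-squared           = {r_squared:.4f}")
print(f"r(y, y-hat) squared = {r_y_yhat ** 2:.4f}")
```

the two printed values agree (up to rounding) for any OLS model that includes an intercept.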

Case Three Regression: two or more predictors that are related on some level

beta-weights now are NO LONGER equal to Pearson-r's
must also look at structure coefficients; they help us decipher the various contributions of the predictors
you really need BOTH beta weights and structure coefficients -- they are both meaningful and important

Case Four Regression

one or more predictors that are correlated with the dependent variable (and possibly with each other), plus one or more predictors that are NOT correlated with the dependent variable BUT are correlated with the other predictors (suppressor variables). the suppressors help the other predictors do a better job of explaining variance (predicting).

suppression is a good thing!

it helps you explain more variance
it does hold back and control something, but that is what helps you explain more variance
gives you a bigger R-squared

your Pearson's r between your y and your x-suppressor = 0

(NOT correlated with the dependent variable, y)

what is the structure coefficient for the x-suppressor? = 0

(because if it's not correlated with y, it can't be correlated with y-hat--structure coefficients involve predictors and thus y-hat)

beta weight for the x-suppressor? = NOT 0

if a variable receives a beta weight of .4 and a structure coefficient of 0, it's a suppressor variable and is getting some credit in the regression (even though it's not correlated with y or y-hat).
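
the pattern above (r with y = 0, structure coefficient = 0, beta weight NOT 0) can be reproduced with made-up data. this is a sketch under stated assumptions: the scores are simulated, the suppressor is forced to be exactly uncorrelated with y in this sample, and the names t, v, x1, xs are hypothetical.

```python
import numpy as np

def zscore(a):
    """Standardize to mean 0, SD 1 so regression weights are beta weights."""
    return (a - a.mean()) / a.std(ddof=1)

rng = np.random.default_rng(1)
n = 500
t = rng.normal(size=n)                 # what actually drives the outcome
y = zscore(t + 0.5 * rng.normal(size=n))

v = rng.normal(size=n)                 # irrelevant variance
v = v - v.mean()
v = v - (v @ y) / (y @ y) * y          # force r(v, y) = 0 in this sample

x1 = zscore(t + v)                     # predictor contaminated by v
xs = zscore(v)                         # perfect suppressor: measures only v

# Standardized regression (no intercept needed, everything is z-scored).
X = np.column_stack([x1, xs])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

r_xy = [np.corrcoef(X[:, j], y)[0, 1] for j in range(2)]
structure = [np.corrcoef(X[:, j], y_hat)[0, 1] for j in range(2)]

for j, name in enumerate(["x1", "xs"]):
    print(f"{name}: r with y = {r_xy[j]: .3f}, "
          f"structure coefficient = {structure[j]: .3f}, beta = {beta[j]: .3f}")
```

xs prints a correlation and structure coefficient of essentially 0 but a clearly non-zero (negative) beta weight, which is the suppressor signature described above.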

Horst first discovered suppression around the WWII era

working with military to predict pilot training program success
high-stakes prediction: expensive planes, life-and-death
want screening mechanisms that will give information about who will succeed in the program

(don't want to waste resources on those who won't succeed)

important characteristics in aptitude tests: visual spatial ability, numerical ability, mechanical ability

those three predictors were often related to each other (Case 3 Regression)
paper tests require verbal ability (by their nature) -- it's not predictive of pilot training success, but it's predictive of how well you can perform on these aptitude tests
if you include verbal ability in the regression model, you get a larger r-squared
if part of the variance in the aptitude test ability scores is due to verbal ability, it is construct-irrelevant variance (measurement error)
the more measurement error you have, the lower your r-squared would be

if your three tests were completely measurement error, your r-squared = 0

if we measure verbal ability and add it to the model, it suppresses (holds down) the construct-irrelevant error in the other variables so that they are better predictors. (you suppress the irrelevant variance.)

you don't usually look for suppression as an expectation; it's something you discover as you go along.

statistics isn't about prophesying, but about trial and error until things make sense.

predictor   beta   r-s (structure coefficient)   squared structure coefficient
x1          .30    .10                           .01
x2          .58    .48                           .23
x3          .22    .41                           .17

x2 & x3 are the best at predicting the dependent variable
x2 is the best predictor (highest beta weight)
x3 is the next best, because it has the next-highest structure coefficient--it contributes a fair amount to the explanation of y-hat (predicted scores). remember that a structure coefficient is the correlation between the predictor and y-hat, NOT between the predictor and the dependent variable (y)

how much does it explain? square it to get to Area World -- .17 is the squared structure coefficient; x3 explains 17% of the variance of y-hat
x2 and x3 can each explain about one-fifth of the y-hat area (.23 and .17)
x1 is not a good predictor; it is the suppressor variable; the structure coefficient is NOT 0, so it is NOT a perfect suppressor variable (real-world), but it IS LOW; but its beta-weight is disproportionately HIGH compared to its structure coefficient (even more apparent when that is squared)
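
the "square it" step for the three predictors can be written out as a quick check. the beta weights and structure coefficients are the ones from the notes; the ratio of |beta| to the squared structure coefficient is only an illustrative yardstick (an assumption of this sketch, not a rule from class).

```python
# (beta weight, structure coefficient) for each predictor, from the notes.
rows = {
    "x1": (0.30, 0.10),
    "x2": (0.58, 0.48),
    "x3": (0.22, 0.41),
}

summary = {}
for name, (beta, rs) in rows.items():
    rs_squared = rs ** 2  # share of y-hat variance this predictor can explain
    summary[name] = {
        "beta": beta,
        "rs": rs,
        "rs_squared": rs_squared,
        # Illustrative only: how large beta is relative to the explained share.
        "ratio": abs(beta) / rs_squared,
    }
    print(f"{name}: beta = {beta:.2f}, r-s = {rs:.2f}, "
          f"r-s squared = {rs_squared:.2f}, "
          f"|beta| / r-s squared = {summary[name]['ratio']:.1f}")

# The predictor whose beta is most out of proportion to what it explains.
candidate = max(summary, key=lambda name: summary[name]["ratio"])
print("most suppressor-like:", candidate)
```

x1 stands out because its beta weight is large while its squared structure coefficient is about .01.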

to identify suppressors:

don't just look for a large beta-weight and a "0" structure coefficient
look for a large beta weight and a small structure coefficient (square it to see how small it "really" is)
you have to make judgments and justify them--there ARE no strict cut-off rules here.
to test it--run the model with the variable in, and with the variable left out, and check the r-squared value each way. if the r-squared is disproportionately HIGHER with the variable IN, it's probably a suppressor.
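
the in/out test can be sketched with simulated data loosely modeled on the pilot example. everything here is hypothetical (the variable names, the noise levels, the sample size); it only shows the mechanics of comparing r-squared with and without the candidate suppressor.

```python
import numpy as np

def r_squared(predictors, y):
    """R-squared from an OLS fit of y on the given predictors (with intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(2)
n = 1000
ability = rng.normal(size=n)                      # what the test should measure
verbal = rng.normal(size=n)                       # construct-irrelevant
success = ability + 0.5 * rng.normal(size=n)      # training outcome (y)
aptitude_test = ability + verbal                  # paper test, contaminated
verbal_test = verbal + 0.2 * rng.normal(size=n)   # candidate suppressor

r2_without = r_squared([aptitude_test], success)
r2_with = r_squared([aptitude_test, verbal_test], success)
r_verbal_success = np.corrcoef(verbal_test, success)[0, 1]
gain = r2_with - r2_without

print(f"r-squared without verbal test: {r2_without:.3f}")
print(f"r-squared with verbal test:    {r2_with:.3f}")
print(f"gain in r-squared:             {gain:.3f}")
print(f"verbal test's own r-squared with y: {r_verbal_success ** 2:.3f}")
```

the gain in r-squared is far larger than the verbal test's own squared correlation with y, which is what "disproportionately higher" looks like in practice.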

<<< BREAK >>>

effect size statistics

variance-accounted-for statistics (how much variance have you explained?)

r-squared
eta-squared
R-squared (multiple-r-squared)

standardized mean differences

Cohen's d
Glass's Delta
Hedges' g

the two families of effect sizes are related; some statistics can be transformed from one family into the other.
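
as a rough illustration of moving between the two families, the sketch below computes the three standardized mean differences for two tiny made-up groups and then converts Cohen's d into r. the data are hypothetical, and the Hedges' g line uses the common approximate small-sample correction, 1 - 3/(4df - 1).

```python
import numpy as np

# Hypothetical scores for two groups.
treatment = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
control = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
n1, n2 = len(treatment), len(control)
N = n1 + n2
mean_diff = treatment.mean() - control.mean()

# Cohen's d: mean difference over the pooled standard deviation.
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1)
                     + (n2 - 1) * control.var(ddof=1)) / (N - 2))
cohens_d = mean_diff / pooled_sd

# Glass's Delta: mean difference over the control group's SD only.
glass_delta = mean_diff / control.std(ddof=1)

# Hedges' g: d with an approximate small-sample bias correction.
hedges_g = cohens_d * (1 - 3 / (4 * (N - 2) - 1))

# Convert d to r (point-biserial), then square for variance accounted for.
r_from_d = cohens_d / np.sqrt(cohens_d ** 2 + N * (N - 2) / (n1 * n2))

# Check: correlate group membership (1/0) with the scores directly.
scores = np.concatenate([treatment, control])
group = np.concatenate([np.ones(n1), np.zeros(n2)])
r_direct = np.corrcoef(group, scores)[0, 1]

print(f"d = {cohens_d:.3f}, Delta = {glass_delta:.3f}, g = {hedges_g:.3f}")
print(f"r from d = {r_from_d:.3f}, r computed directly = {r_direct:.3f}")
print(f"r-squared = {r_direct ** 2:.3f}")
```

the r obtained from d matches the r computed directly, so the same group difference can be reported as a standardized mean difference or as variance accounted for.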

we haven't talked about corrected effect sizes. so far we have talked about un-corrected effect sizes.

r-squared versus adjusted r-squared (to the right of r-squared in the SPSS Output file).

uncorrected = they are all biased. the uncorrected statistics are probably over-estimates of what you would get in a future sample (or the population).

generalizability matters; we want our stats to be replicable in future studies.

why might our r-squared for our particular sample be higher than it might be in a future study?

sampling error: it's difficult to randomly select a sample; you may not have a good representation of the population (may or may not be able to trust your results)

OLS analyses, OLS regression = Ordinary Least Squares

"least squares" = fit your regression line to the least amount of residual (error)

the line with the smallest sum of squared distances (errors) from the points = the least squares line (OLS)
differences in the y scores and in the x scores can both be due to sampling error (error unique to this sample)
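
"least squares" can be seen directly by comparing the fitted line with nearby lines. a minimal sketch with five made-up points (hypothetical data): the fitted slope and intercept give a smaller sum of squared errors than any of the nudged alternatives.

```python
import numpy as np

# Hypothetical (x, y) scores.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_squared_errors(slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    residuals = y - (slope * x + intercept)
    return np.sum(residuals ** 2)

# Least squares fit of a straight line; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
best = sum_squared_errors(slope, intercept)

# Nudge the line a little in every direction and recompute the error.
nudged = [
    sum_squared_errors(slope + ds, intercept + di)
    for ds in (-0.1, 0.0, 0.1)
    for di in (-0.2, 0.0, 0.2)
    if (ds, di) != (0.0, 0.0)
]

print(f"fitted line: y-hat = {slope:.3f} * x + {intercept:.3f}")
print(f"sum of squared errors at the fitted line: {best:.4f}")
print(f"smallest sum of squared errors among nudged lines: {min(nudged):.4f}")
```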

sampling error has a direct impact on where the regression line is drawn

the regression line is impacted by sampling error.
the sampling error in your sample is unique to that sample; the regression line will capitalize on ALL the variance in x and y, regardless of whether it comes from sampling error or true variance, to maximize the effect observed.
therefore the r-squared is biased.
therefore we need to find variance based on sampling error, and correct for it = adjusted r-squared.

there are different kinds of corrections that you can make to r-squared.

there are two levels of sampling error:

the current sample
and for the future sample

what affects sampling error? (all these corrections assume a truly random sample)

n (size of the sample); the larger the sample, the lower the sampling error; as n goes down, we need to correct more
number of variables (k) you've measured; the more you measure, the greater the chance of error
theoretical population effect; population r-squared that includes future groups to be measured; the effect you would get if you measured everyone in the population. as the theoretical population effect is bigger, there is less sampling error.

does total GRE score predict grad-level GPA? -- for all students in the US. assume that the prediction is perfect (r-squared = 1).

what might the scatterplot look like? all points would be on the regression line. (population effect would be perfect).
sample 20 people from that population: what r-squared would you expect? 1; because the r-squared of the population is PERFECT, then inherently any sample from that population is also PERFECT.
(doesn't even matter how you choose your sample, in all cases it would be 1.)

what if the population r was 0? you can't know what r you would get in a sample from that no-effect population.
as the population effect goes down, the sampling error goes up.
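
the GRE/GPA thought experiment can be simulated. this is a sketch with invented populations (the numbers are not real GRE or GPA data): one population where the relationship is perfect and one where it is zero, with repeated samples of 20 drawn from each.

```python
import numpy as np

rng = np.random.default_rng(3)
population_size = 100_000

# Hypothetical populations (standardized scores, not real data).
gre = rng.normal(size=population_size)
gpa_perfect = 2.0 * gre + 1.0                      # population r-squared = 1
gpa_unrelated = rng.normal(size=population_size)   # population r is about 0

def sample_r_squared(x, y, n, draws):
    """Draw repeated random samples of size n and return each sample's r-squared."""
    results = []
    for _ in range(draws):
        idx = rng.choice(len(x), size=n, replace=False)
        r = np.corrcoef(x[idx], y[idx])[0, 1]
        results.append(r ** 2)
    return np.array(results)

r2_perfect = sample_r_squared(gre, gpa_perfect, n=20, draws=500)
r2_null = sample_r_squared(gre, gpa_unrelated, n=20, draws=500)

print(f"perfect population: sample r-squared ranges "
      f"{r2_perfect.min():.3f} to {r2_perfect.max():.3f}")
print(f"no-effect population: sample r-squared ranges "
      f"{r2_null.min():.3f} to {r2_null.max():.3f}, mean {r2_null.mean():.3f}")
```

every sample from the perfect population gives an r-squared of 1, while samples from the no-effect population bounce around and average above 0, which is the upward bias the corrections try to remove.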

there are many forms of corrections; we're going to look at just one.

this is a theoretical correction, therefore there are different ways to produce a correction formula;

some may be more or less accurate based on the situation.

the Ezekiel correction (or Wherry correction) is what we're looking at. (this is what SPSS uses.)

this attempts to account for all three things that affect sampling error.
from r-squared to the adjusted r-squared = shrinkage.
the amount of shrinkage suggests the amount of sampling error.
if there is a large amount of shrinkage, there are huge problems with your sample--much sampling error.
your r-squared for your sample is your best estimate of the theoretical population effect (population r-squared).
typically not the most accurate correction method.
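
the formula usually given for this correction is adjusted r-squared = 1 - (1 - r-squared)(n - 1)/(n - k - 1). the sketch below assumes that form (the notes do not write the formula out) and uses made-up values of r-squared, n, and k to show how each one changes the shrinkage.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted r-squared, assuming the form 1 - (1 - R^2)(n - 1)/(n - k - 1).

    r_squared: uncorrected sample r-squared
    n: sample size
    k: number of predictors
    """
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

def shrinkage(r_squared, n, k):
    """How much r-squared drops after the correction."""
    return r_squared - adjusted_r_squared(r_squared, n, k)

# Hypothetical scenarios: vary one factor at a time from the baseline.
scenarios = [
    ("baseline (r2=.50, n=20, k=3)", 0.50, 20, 3),
    ("larger n (n=200)", 0.50, 200, 3),
    ("more predictors (k=6)", 0.50, 20, 6),
    ("larger effect (r2=.90)", 0.90, 20, 3),
]
for label, r2, n, k in scenarios:
    print(f"{label}: adjusted = {adjusted_r_squared(r2, n, k):.3f}, "
          f"shrinkage = {shrinkage(r2, n, k):.3f}")
```

shrinkage gets smaller as n goes up or as the effect gets bigger, and larger as k goes up, matching the three things that affect sampling error.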

(using data from first homework to calculate shrinkage.)

if your adjusted r-squared is very different from your r-squared, it can change your interpretation of the data--go with the adjusted r-squared value. BUT BE CLEAR that that is NOT the effect that you observed in your study, but that it ACCOUNTS for SAMPLING ERROR. (report BOTH r-squared values and discuss the amount of shrinkage; you might even theorize about the source of the sampling error, if you have clues and if it's relevant to the study.)
(in the first homework, number of variables and n were held constant; so the theoretical population effect is the only thing that's being corrected for here.)