EPSY 6210, 03.22.2010
(last time, finished going thru handout and output file on health/doctor visits, etc.)
SCHEDULE ADJUSTMENTS
- April 5 - logistic regression (guest lecturer)
- April 12 - propensity scores (guest lecturer)
- following class: will cover original topic from April 5
Review
- Case One Regression: simple linear regression with one predictor
- Case Two Regression: multiple regression with two predictors that are orthogonal (unrelated)
- boils down to the relationship between R-squared and y-hat (y-hat = the linear combination of all your predictors)
- regression and prediction is all about how well the regression line (y-hat) predicts the scores
- Case Three Regression: two or more predictors that are related on some level
- beta-weights now are NO LONGER equal to Pearson-r's
- must also look at structure coefficients; they help us decipher the various contributions of the predictors
- you really need BOTH beta weights and structure coefficients -- they are both meaningful and important
Case Four Regression
one or more predictors that are correlated with the dependent variable (and may be correlated with each other as well), plus one or more predictors that are NOT correlated with the dependent variable BUT are correlated with the other predictors (suppressor variables). the suppressors help the other predictors do a better job of explaining variance (predicting).
suppression = a good thing!
- it helps you explain more variance
- it does hold back and control something, but that is what helps you explain more variance
- gives you a bigger R-squared
your Pearson's r between your y and your x-suppressor = 0
- (NOT correlated with the dependent variable, y)
what is the structure coefficient for the x-suppressor? = 0
- (because if it's not correlated with y, it can't be correlated with y-hat--structure coefficients involve predictors and thus y-hat)
beta weight for the x-suppressor? = NOT 0
- if a variable receives a beta weight of .4 and a structure coefficient of 0, it's a suppressor variable and is getting some credit in the regression (even though it's not correlated with y or y-hat).
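a minimal sketch of the perfect-suppressor pattern above, worked from correlations alone. the three correlations are hypothetical values chosen for illustration (not from the lecture), and the function name is made up here. it uses the standard two-predictor formulas for standardized beta weights, and the structure coefficient as r(x, y) divided by multiple R.

```python
import math

def two_predictor_regression(r_y1, r_y2, r_12):
    """Standardized regression with two predictors, computed from correlations."""
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    multiple_r = math.sqrt(r_squared)
    # Structure coefficient = correlation of a predictor with y-hat = r_xy / R.
    structure = (r_y1 / multiple_r, r_y2 / multiple_r)
    return {"beta": (beta1, beta2), "r_squared": r_squared, "structure": structure}

# Hypothetical correlations: x1 predicts y, x2 (the suppressor) does not,
# but x2 is correlated with x1.
result = two_predictor_regression(r_y1=0.5, r_y2=0.0, r_12=0.5)

print("beta weights:          ", result["beta"])
print("structure coefficients:", result["structure"])
print("R-squared with both:   ", result["r_squared"])
print("r-squared, x1 alone:   ", 0.5 ** 2)
```

under these assumed correlations the suppressor has a structure coefficient of 0 but a non-zero (negative) beta weight, and R-squared with both predictors is larger than the r-squared for x1 alone.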
Horst first discovered this suppression stuff around the WWII era
- working with military to predict pilot training program success
- high-stakes prediction: expensive planes, life-and-death
- want screening mechanisms that will give information about who will succeed in the program
- (don't want to waste resources on those who won't succeed)
- important characteristics in aptitude tests: visual spatial ability, numerical ability, mechanical ability
- those three predictors were often related to each other (Case 3 Regression)
- paper tests require verbal ability (by their nature) -- it's not predictive of pilot training success, but it's predictive of how well you can perform on these aptitude tests
- if you include verbal ability in the regression model, you get a larger r-squared
- if part of the variance in the aptitude test ability scores is due to verbal ability, it is construct-irrelevant variance (measurement error)
- the more measurement error you have, the lower your r-squared would be
- if your three tests were completely measurement error, your r-squared = 0
- if we measure verbal ability and add it to the model, it suppresses (holds down) the construct-irrelevant error in the other variables so that they are better predictors. (you suppress the irrelevant variance.)
you don't usually look for suppression as an expectation; it's something you discover as you go along.
statistics isn't about prophesying, but about trial and error until things make sense.
predictor   beta   r-s (structure coefficient)   squared structure coefficient
x1          .30    .10                           .01
x2          .58    .48                           .23
x3          .22    .41                           .17
- x2 & x3 are the best at predicting the dependent variable
- x2 is the best predictor (highest beta weight)
- x3 is the next best, because it has the next-highest structure coefficient--it contributes a fair amount to the explanation of y-hat (predicted scores). the structure coefficient is NOT a correlation between the predictor and the dependent variable (y); it is the correlation between the predictor and y-hat
- how much does it explain? square it to get to Area World -- .17 is the squared structure coefficient; 17% of the variance of y-hat
- x2 and x3 can each explain about one-fifth of the y-hat area (.23 and .17)
- x1 is not a good predictor; it is the suppressor variable; the structure coefficient is NOT 0, so it is NOT a perfect suppressor variable (real-world), but it IS LOW; but its beta-weight is disproportionately HIGH compared to its structure coefficient (even more apparent when that is squared)
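the squaring step can be checked directly against the table values. this is only a sketch: the beta-to-structure ratio is an illustrative way to see the "disproportionately high" beta weight for x1, not a rule from the lecture (there are no strict cut-offs).

```python
# Values copied from the table in these notes.
predictors = {
    "x1": {"beta": 0.30, "r_s": 0.10},
    "x2": {"beta": 0.58, "r_s": 0.48},
    "x3": {"beta": 0.22, "r_s": 0.41},
}

for name, values in predictors.items():
    # Squaring moves the structure coefficient into "Area World":
    # the proportion of y-hat variance this predictor can explain.
    values["r_s_squared"] = round(values["r_s"] ** 2, 2)
    # Illustrative only: how large beta is relative to the structure coefficient.
    values["beta_to_r_s"] = values["beta"] / values["r_s"]
    print(name, values["r_s_squared"], round(values["beta_to_r_s"], 2))
```

x1 stands out because its beta weight is about three times its structure coefficient, while for x2 and x3 the two numbers are much closer.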
to identify suppressors:
- don't just look for a large beta-weight and a "0" structure coefficient
- look for a large beta weight and a small structure coefficient (square it to see how small it "really" is)
- you have to make judgments and justify them--there ARE no strict cut-off rules here.
- to test it--run the model with the variable in, and with the variable left out, and check the r-squared value each way. if the r-squared is disproportionately HIGHER with the variable IN, it's probably a suppressor.
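the in/out check can be sketched with simulated data loosely modeled on the pilot-test story. everything here is an assumption made for illustration: the variable names, the sample size, the seed, and the data-generating model (test score = ability + verbal; training success = ability + noise). R-squared for the two-predictor model is computed from the sample correlations.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R-squared for a two-predictor regression, from correlations."""
    beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
    beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
    return beta1 * r_y1 + beta2 * r_y2

rng = random.Random(6210)  # arbitrary seed, for repeatability only
n = 5000
ability = [rng.gauss(0, 1) for _ in range(n)]
verbal = [rng.gauss(0, 1) for _ in range(n)]
# The paper test picks up verbal ability as construct-irrelevant variance.
test_score = [a + v for a, v in zip(ability, verbal)]
# Training success depends on ability, not on verbal ability.
success = [a + rng.gauss(0, 1) for a in ability]

r_y_test = pearson_r(success, test_score)
r_y_verbal = pearson_r(success, verbal)
r_test_verbal = pearson_r(test_score, verbal)

r2_without = r_y_test ** 2  # model with the test score only
r2_with = r_squared_two_predictors(r_y_test, r_y_verbal, r_test_verbal)

print("r(success, verbal):      ", round(r_y_verbal, 3))
print("R-squared without verbal:", round(r2_without, 3))
print("R-squared with verbal:   ", round(r2_with, 3))
```

in this setup verbal ability is close to uncorrelated with success, yet adding it raises R-squared noticeably, which is the pattern to look for when a variable is suspected of being a suppressor.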
<<< BREAK >>>
effect size statistics
- variance-accounted-for statistics (how much variance have you explained?)
- r-squared
- eta-squared
- R-squared (multiple-r-squared)
- standardized mean differences
- Cohen's d
- Glass's Delta
- Hedges' g
the two families are related; some of these statistics can be transformed from one family into the other.
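a sketch of the standardized mean differences and of one conversion into the variance-accounted-for family. the formulas are the common textbook definitions, not something given in lecture, and naming conventions vary across sources: here d uses the pooled SD, Delta uses the control-group SD, and g applies a small-sample correction to d. the d-to-r conversion is a commonly used approximation for equal group sizes. the scores are made up.

```python
import math
import statistics

def cohens_d(group1, group2):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    return (statistics.mean(group1) - statistics.mean(group2)) / math.sqrt(pooled_var)

def glass_delta(treatment, control):
    """Mean difference divided by the control group's standard deviation."""
    return (statistics.mean(treatment) - statistics.mean(control)) / statistics.stdev(control)

def hedges_g(group1, group2):
    """Cohen's d with a small-sample bias correction."""
    df = len(group1) + len(group2) - 2
    return cohens_d(group1, group2) * (1 - 3 / (4 * df - 1))

def d_to_r(d):
    """Approximate conversion from d to r, assuming equal group sizes."""
    return d / math.sqrt(d ** 2 + 4)

# Made-up scores for two groups of five.
treatment = [2, 4, 6, 8, 10]
control = [3, 4, 5, 6, 7]

d = cohens_d(treatment, control)
delta = glass_delta(treatment, control)
g = hedges_g(treatment, control)
r = d_to_r(d)

print("Cohen's d:    ", round(d, 3))
print("Glass's Delta:", round(delta, 3))
print("Hedges' g:    ", round(g, 3))
print("r from d:     ", round(r, 3), " r-squared:", round(r ** 2, 3))
```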
we haven't talked about corrected effect sizes. so far we have talked about un-corrected effect sizes.
- r-squared versus adjusted r-squared (to the right of r-squared in the SPSS Output file).
uncorrected = they are all biased. the uncorrected statistics are probably over-estimates of what you would get in a future sample (or the population).
generalizability matters; we want our stats to be replicable in future studies.
why might our r-squared for our particular sample be higher than it might be in a future study?
- sampling error: it's difficult to randomly select a sample; you may not have a good representation of the population (may or may not be able to trust your results)
OLS analyses, OLS regression = Ordinary Least Squares
- "least squares" = fit your regression line to the least amount of residual (error)
- smallest sum of all squared distances (errors) to the line = that process is least squares error scores (OLS)
- differences in both the y scores and in the x scores can both be due to sampling error (error unique to this sample, sampling error)
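a small sketch of what "least squares" means for one predictor: the fitted line has the smallest sum of squared residuals, so nudging the intercept or slope can only make that sum bigger. the five data points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up scores.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

intercept, slope = fit_line(xs, ys)
best_sse = sum_squared_errors(xs, ys, intercept, slope)
print("intercept:", round(intercept, 3), "slope:", round(slope, 3),
      "SSE:", round(best_sse, 3))

# Any other line through these points has a larger sum of squared errors.
for d_int, d_slope in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    other = sum_squared_errors(xs, ys, intercept + d_int, slope + d_slope)
    print("shifted line SSE:", round(other, 3))
```

because the line is fit to whatever scores are in the sample, any sampling error in those scores moves the line too.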
sampling error has a direct impact on where the regression line is drawn
- the regression line is impacted by sampling error.
- the sampling error in your sample is unique to that sample; the regression line will capitalize on ALL the variance in x and y, regardless of whether it's from sampling error or true variance, to maximize the effect observed.
- therefore the r-squared is biased.
- therefore we need to find variance based on sampling error, and correct for it = adjusted r-squared.
there are different kinds of corrections that you can make to r-squared.
there are two levels of sampling error:
- the current sample
- and for the future sample
what affects sampling error? (all these corrections assume a truly random sample)
- n (size of the sample); the larger the sample, the lower the sampling error; as n goes down, we need to correct more
- number of variables (k) you've measured; the more you measure, the greater the chance of error
- theoretical population effect; population r-squared that includes future groups to be measured; the effect you would get if you measured everyone in the population. as the theoretical population effect is bigger, there is less sampling error.
- does total GRE score predict grad-level GPA? -- for all students in the US. assume that it's perfect, r-squared (1).
- what might the scatterplot look like? all points would be on the regression line. (population effect would be perfect).
- sample 20 people from that population: what r-squared would you expect? 1; because the r-squared of the population is PERFECT, then inherently any sample from that population is also PERFECT.
- (doesn't even matter how you choose your sample, in all cases it would be 1.)
- what if the population r was 0? you can't know what r-squared a sample from that population of no effect would give.
- as the population effect goes down, the sampling error goes up.
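the two extremes above can be simulated. this is a sketch under assumptions made for illustration: normally distributed scores, 1000 random samples of 20, an arbitrary seed, and a "perfect" population where y is an exact linear function of x.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def sample_r_squared(rng, n, perfect):
    """Draw one random sample of size n and return its r-squared."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    if perfect:
        ys = [2 * x + 3 for x in xs]            # population r-squared = 1
    else:
        ys = [rng.gauss(0, 1) for _ in range(n)]  # population r-squared = 0
    return pearson_r(xs, ys) ** 2

rng = random.Random(2010)  # arbitrary seed, for repeatability only
perfect_pop = [sample_r_squared(rng, 20, True) for _ in range(1000)]
null_pop = [sample_r_squared(rng, 20, False) for _ in range(1000)]

perfect_spread = max(perfect_pop) - min(perfect_pop)
null_spread = max(null_pop) - min(null_pop)
null_mean = sum(null_pop) / len(null_pop)

print("perfect population, spread of sample r-squared:", perfect_spread)
print("zero-effect population, spread of sample r-squared:", round(null_spread, 3))
print("zero-effect population, mean sample r-squared:", round(null_mean, 3))
```

every sample from the perfect population gives an r-squared of 1, while samples from the zero-effect population bounce around and average above 0, which is the upward bias that the corrections are meant to address.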
there are many forms of corrections; we're going to look at just one.
this is a theoretical correction, therefore there are different ways to produce a correction formula;
some may be more or less accurate based on the situation.
the Ezekiel correction (or Wherry correction) is what we're looking at. (this is what SPSS uses.)
- this attempts to account for all three things that affect sampling error.
- from r-squared to the adjusted r-squared = shrinkage.
- the amount of shrinkage suggests the amount of sampling error.
- if there is a large amount of shrinkage, there are huge problems with your sample--much sampling error.
- your r-squared for your sample is your best estimate of the theoretical population effect (population r-squared).
- typically not the most accurate correction method.
(using data from first homework to calculate shrinkage.)
- if your adjusted-r-squared is largely different, it can change your interpretation of the data--go with the adjusted-r-squared value. BUT BE CLEAR that that is NOT the effect that you observed in your study, but that it ACCOUNTS for SAMPLING ERROR. (report BOTH r-squared values and discuss shrinkage amount; you might even theorize what the sampling error is, if you have clues and if it's relevant to the study.)
- (in the first homework, number of variables and n were held constant; so the theoretical population effect is the only thing that's being corrected for here.)
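a sketch of the shrinkage calculation, using the commonly cited form of the Ezekiel formula: adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - k - 1). the formula was not written out in these notes, so treat it as an assumption to check against the SPSS output; the example values of R-squared, n, and k are made up.

```python
def adjusted_r_squared(r_squared, n, k):
    """Ezekiel-style adjusted R-squared for n cases and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

def shrinkage(r_squared, n, k):
    """How much R-squared drops after the correction."""
    return r_squared - adjusted_r_squared(r_squared, n, k)

# Made-up examples showing the three things that drive the correction.
small_n = shrinkage(0.50, n=20, k=3)       # small sample
large_n = shrinkage(0.50, n=200, k=3)      # larger sample, less shrinkage
more_k = shrinkage(0.50, n=20, k=6)        # more predictors, more shrinkage
big_effect = shrinkage(0.90, n=20, k=3)    # bigger effect, less shrinkage

print("n=20,  k=3, R2=.50:", round(small_n, 4))
print("n=200, k=3, R2=.50:", round(large_n, 4))
print("n=20,  k=6, R2=.50:", round(more_k, 4))
print("n=20,  k=3, R2=.90:", round(big_effect, 4))
```

with n and k held constant, as in the first homework, the only thing that changes the amount of shrinkage is the size of the observed R-squared.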
MIDTERM