This problem concerns the interpretation of regression equations with
categorical explanatory variables, where the slopes on the non-categorical
variables do not depend on the category. Such a model assumes that the
hyperplanes for different categories are parallel, so the regression
coefficients of the binary dummy variables can be used to determine the
signed distances between the hyperplanes for different categories.
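In symbols, the parallel-hyperplane model can be sketched as follows. The notation is introduced here only for illustration and is an assumption, not part of the problem: y is the response, x_1, ..., x_p are the non-categorical variables, k_0 is the baseline category, and gamma_k is the dummy coefficient of category k.

```latex
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
      + \sum_{k \neq k_0} \gamma_k \, \mathbf{1}\{\text{category}_i = k\}
      + \varepsilon_i ,
\qquad \gamma_{k_0} = 0 .
% Signed distance, in the direction of the response, of the hyperplane
% for category k relative to the hyperplane for category l:
d_{k,l} = \gamma_k - \gamma_l .
```

Because the slopes beta_j are shared, this difference is the same at every value of the non-categorical variables.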
Context of data set: American post-secondary schools in 2014-2015,
where the annual in-state tuition is less than $20,000.
The response variable is the total number of applicants (in thousands).
There were many explanatory variables in the complete data set,
but only a few are included here. Region is a categorical variable
for different parts of the United States. There is some merging
of categories compared with the original data set. For your multiple
regression output, check that your estimates are interpretable before
you submit answers.
For your subset of the university applicant data set, the response variable is:
applicants, in thousands
(for the homework questions below, transform the response with the
natural logarithm).
applicants=c(2.953, 20.677, 10.111, 60.543, 26.496, 25.438, 7.408, 31.941, 16.958, 31.28, 9.679, 86.537, 66.515, 3.542, 32.19, 38.785, 10.991, 33.736, 20.756, 18.42, 21.873, 25.884, 73.782, 7.075, 31.021, 14.887, 5.111, 5.002, 12.92, 14.116, 49.776, 31.611, 28.518, 15.61, 19.814, 20.918, 16.689, 4.582, 14.944, 24.988, 30.629, 10.332, 44.76, 20.934, 18.107, 8.754, 36.788, 31.332, 16.125, 66.813, 10.039, 5.345, 5.465, 20.175, 11.552, 14.223, 20.923, 13.758, 14.933, 36.362, 36.101, 33.211, 21.359, 5.713, 35.822, 11.258, 50.299, 7.101, 20.443, 5.017, 10.217, 40.727, 73.448, 10.245, 25.194)
The explanatory variables are:
(i) per.admit (percentage admitted)
per.admit=c(55, 72, 52, 40, 62, 57, 87, 60, 83, 50, 63, 19, 37, 89, 71, 40, 81, 51, 52, 80, 66, 33, 16, 57, 29, 82, 76, 84, 70, 78, 32, 55, 44, 64, 51, 56, 76, 85, 66, 53, 53, 63, 45, 74, 60, 95, 53, 28, 84, 36, 42, 93, 83, 53, 83, 77, 77, 75, 76, 76, 58, 66, 75, 83, 59, 83, 50, 70, 52, 49, 60, 56, 33, 61, 68)
(ii) num.enroll (enrollment, in thousands)
num.enroll=c(5.695, 29.203, 9.233, 34.508, 37.485, 42.598, 28.515, 48.378, 25.912, 26.541, 20.517, 41.845, 30.051, 10.061, 61.642, 51.313, 31.515, 36.047, 21.857, 15.117, 35.158, 23.109, 37.565, 10.725, 23.732, 28.686, 7.099, 16.936, 30.69, 28.886, 43.625, 44.784, 16.695, 27.238, 20.611, 35.197, 35.313, 15.071, 30.297, 41.938, 28.617, 11.314, 51.147, 29.217, 20.655, 14.534, 58.322, 29.135, 29.477, 23.051, 12.602, 13.952, 15.805, 49.61, 11.286, 27.511, 16.571, 30.848, 28.628, 46.416, 21.498, 50.081, 24.096, 14.747, 45.14, 13.183, 47.04, 15.829, 33.989, 10.241, 13.979, 17.866, 30.709, 39.74, 22.68)
(iii) stfacratio (student/faculty ratio)
stfacratio=c(16, 18, 18, 18, 14, 18, 22, 16, 17, 16, 21, 16, 19, 17, 20, 17, 17, 21, 16, 19, 22, 18, 17, 18, 15, 15, 13, 17, 19, 19, 12, 18, 20, 19, 15, 18, 18, 19, 24, 25, 14, 11, 17, 18, 15, 19, 18, 13, 21, 17, 23, 18, 20, 26, 16, 18, 17, 17, 23, 17, 19, 17, 19, 19, 19, 20, 17, 17, 16, 15, 19, 18, 19, 26, 15)
(iv) avg.grant (average grant for financial aid, in thousands)
avg.grant=c(7.173, 11.848, 7.215, 15.528, 7.327, 8.036, 4.965, 10.726, 8.901, 10.265, 5.421, 17.423, 17.09, 4.412, 8.731, 8.727, 6.726, 10.736, 8.684, 11.579, 6.414, 8.798, 16.425, 7.745, 16.449, 10.526, 10.316, 5.896, 7.959, 7.035, 14.671, 13.821, 6.838, 6.372, 8.076, 8.227, 6.18, 6.003, 5.343, 5.976, 10.461, 9.924, 7.511, 5.715, 9.097, 5.969, 8.435, 13.447, 6.229, 16.958, 5.678, 5.036, 5.826, 5.055, 7.719, 7.609, 9.507, 7.736, 7.118, 8.834, 16.227, 9.747, 6.451, 4.89, 11.818, 7.746, 6.541, 8.591, 9.322, 6.974, 7.879, 17.287, 16.638, 6.234, 8.788)
(v) grad.rate (graduation rate; possibly the rate within 4 or 5 years)
grad.rate=c(35, 58, 12, 81, 66, 82, 41, 79, 68, 82, 50, 92, 86, 45, 80, 79, 59, 67, 82, 78, 62, 79, 91, 43, 93, 67, 66, 34, 38, 61, 91, 81, 79, 53, 54, 79, 62, 47, 41, 57, 79, 69, 73, 63, 57, 50, 82, 89, 52, 80, 45, 49, 37, 49, 59, 58, 63, 56, 28, 75, 66, 79, 67, 53, 84, 43, 86, 44, 71, 40, 61, 74, 86, 40, 80)
(vi) region (the 5 categories are FarWest, GLNE for GreatLakesNewEngland,
Mid for Middle/Central longitude, Southeast, West)
region=c('Mid', 'Southeast', 'West', 'FarWest', 'Mid', 'GLNE', 'FarWest', 'Mid', 'Southeast', 'GLNE', 'Southeast', 'FarWest', 'FarWest', 'Mid', 'West', 'West', 'West', 'Southeast', 'Southeast', 'GLNE', 'West', 'Southeast', 'FarWest', 'Southeast', 'Southeast', 'FarWest', 'GLNE', 'GLNE', 'GLNE', 'FarWest', 'GLNE', 'FarWest', 'Mid', 'Southeast', 'GLNE', 'Southeast', 'GLNE', 'West', 'Southeast', 'Southeast', 'Mid', 'Mid', 'Mid', 'GLNE', 'GLNE', 'Mid', 'GLNE', 'Southeast', 'GLNE', 'FarWest', 'Southeast', 'West', 'Southeast', 'Southeast', 'GLNE', 'Southeast', 'GLNE', 'Southeast', 'West', 'GLNE', 'FarWest', 'GLNE', 'FarWest', 'Mid', 'GLNE', 'GLNE', 'Mid', 'West', 'Southeast', 'Southeast', 'Mid', 'FarWest', 'FarWest', 'West', 'Mid')
You are to fit a multiple regression model with the response variable
log(applicants), the natural logarithm of "applicants", and the
6 explanatory variables given above.
After you have copied the above R vectors into your R session,
you can get a dataframe with
univ = data.frame(applicants, per.admit, num.enroll, stfacratio, avg.grant, grad.rate, region)
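A minimal sketch of the kind of lm() call involved, shown on a small simulated data frame rather than on univ. The data frame `toy` and its columns `y`, `x`, and `g` are made up for illustration; the point is only how the log transform and a character-valued categorical variable enter the formula.

```r
# Simulated stand-in for univ: a positive response, one numeric
# predictor and a 3-level categorical predictor (all names hypothetical).
set.seed(1)
n <- 60
toy <- data.frame(
  x = runif(n, 5, 50),
  g = sample(c("A", "B", "C"), n, replace = TRUE)
)
toy$y <- exp(1 + 0.03 * toy$x + 0.4 * (toy$g == "B") + rnorm(n, sd = 0.3))

# The log transform can be applied inside the formula.
fit <- lm(log(y) ~ x + g, data = toy)

# lm() treats the character column as a factor; the first level in
# alphabetical order is the baseline and gets no dummy coefficient.
baseline   <- fit$xlevels$g[1]
coef_names <- names(coef(fit))
print(baseline)
print(coef_names)
```

With the region labels used in this problem, alphabetical ordering would make FarWest the default baseline unless the factor is releveled.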
Please use 3 decimal places for the answers below which are not
integer-valued.
For the regression being requested, you should find most or all of the
coefficients for per.admit, num.enroll, stfacratio, avg.grant, grad.rate to be statistically
significant. Some of the regions might be significantly different from
others but not all pairs of regions are significantly different from
each other.
To answer parts (a) and (b) below, two separate regressions could
be done (with 2 different regions as the baseline categories). If
you want to challenge yourself to answer both from a single call to
lm(), you need the cov.unscaled component of the summary of the lm
object (multiplied by the squared residual SD, it gives the estimated
covariance matrix of the coefficients).
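The single-fit approach can be sketched on simulated data as follows. The data frame `toy`, its columns, and the level names A, B, C are hypothetical; the sketch assumes A is the baseline and compares level C with level B, then cross-checks against a second fit that uses B as the baseline.

```r
# Simulated data with a 3-level factor (all names are made up).
set.seed(2)
n <- 80
toy <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$y <- 1 + 0.5 * toy$x + 0.3 * (toy$g == "B") - 0.2 * (toy$g == "C") +
  rnorm(n, sd = 0.5)

fit <- lm(y ~ x + g, data = toy)   # baseline is level "A"
sm  <- summary(fit)

# Estimated covariance matrix of the coefficients:
# squared residual SD times the unscaled covariance (X'X)^{-1}.
V <- sm$sigma^2 * sm$cov.unscaled

# Contrast vector picking out (coefficient of gC) - (coefficient of gB),
# i.e. the signed distance of the C hyperplane relative to B.
a <- setNames(rep(0, length(coef(fit))), names(coef(fit)))
a["gC"] <-  1
a["gB"] <- -1

est <- sum(a * coef(fit))
se  <- sqrt(drop(t(a) %*% V %*% a))

# Cross-check: refit with B as the baseline and read off the g2C row.
toy$g2 <- relevel(toy$g, ref = "B")
chk <- summary(lm(y ~ x + g2, data = toy))$coefficients["g2C", ]
print(c(est = est, se = se))
print(chk[c("Estimate", "Std. Error")])
```

The two routes should agree up to rounding; when one of the two regions is itself the baseline, the estimate and SE can be read directly from the coefficient table.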
Part a)
The estimate of the signed distance of the hyperplane for
region Southeast relative to FarWest is ____
and its SE is ____
Part b)
The estimate of the signed distance of the hyperplane for
region Southeast relative to GLNE is ____
and its SE is ____
Part c)
What is the adjusted R-squared?
Part d)
What is the residual SD (reported as "Residual standard error" in R)?
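Both quantities in parts (c) and (d) are stored in the summary of an lm object. A short sketch on simulated data (the data frame `toy` is made up for illustration):

```r
# Simulated data, only to have a fitted model to inspect.
set.seed(3)
toy <- data.frame(x = rnorm(40))
toy$y <- 2 + toy$x + rnorm(40)

fit <- lm(y ~ x, data = toy)
sm  <- summary(fit)

adj_r2 <- sm$adj.r.squared   # adjusted R-squared
res_sd <- sm$sigma           # printed as "Residual standard error"

# sigma is sqrt(residual sum of squares / residual degrees of freedom).
res_sd_manual <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

print(round(c(adj_r2 = adj_r2, res_sd = res_sd), 3))
```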
Part e)
If interaction of region and num.enroll (i.e., the term region:num.enroll)
were added to the lm statement,
how many betas would be in the regression equation?
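One way to check a count like this is to fit both models and count the coefficients. The sketch below uses a simulated data frame with a single numeric variable and a 3-level factor (all names hypothetical), so its counts differ from those of the homework model.

```r
# Simulated data: one numeric predictor, one 3-level factor.
set.seed(4)
n <- 90
toy <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
toy$y <- rnorm(n)

fit_main <- lm(y ~ x + g, data = toy)
fit_int  <- lm(y ~ x + g + g:x, data = toy)

# Main-effects model: intercept + 1 slope + 2 dummies.
n_main <- length(coef(fit_main))
# The interaction adds one slope adjustment per non-baseline level.
n_int  <- length(coef(fit_int))

print(c(main = n_main, with_interaction = n_int))
print(names(coef(fit_int)))
```

In general, adding the interaction of a factor with one numeric variable contributes (number of levels - 1) extra coefficients.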
You can earn partial credit on this problem.