In this question, we will
formulate a measure to quantify the level of association between
two categorical variables. Such a measure is used in a statistical
test called the chi-square test, which assesses whether there is an association
between two categorical variables. This question also motivates
the concept of independence and connects it back to what we
have learnt in the course.
Let's revisit the example we have looked at in the course. How is
diet type (high cholesterol diet versus low cholesterol diet) related to the
risk of coronary heart disease? Data for 23 individuals:
From the table we find that the probability of having heart disease is and the
probability of having high cholesterol diet is . Similarly, we can find the probability
of not having heart disease and the probability of having low cholesterol diet.
Part a
If there is no association between the two variables (i.e., the two are independent),
the probability of having both heart disease
and a high cholesterol diet is: [Round to four decimal places].
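The independence rule used in this part can be sketched in a few lines of Python. The counts below are hypothetical, chosen only for illustration; they are not the data from this question, and the variable names are assumptions as well.

```python
# Hypothetical 2x2 table of counts (NOT the data from this question).
# Rows: heart disease (yes, no); columns: diet (high, low cholesterol).
observed = [[12, 8],
            [6, 14]]

n = sum(sum(row) for row in observed)  # total number of individuals

# Marginal probabilities come from the row and column totals.
p_disease = sum(observed[0]) / n                      # (12 + 8) / 40
p_high_diet = (observed[0][0] + observed[1][0]) / n   # (12 + 6) / 40

# Under independence, the joint probability is the product of the marginals.
p_joint_if_independent = p_disease * p_high_diet

print(round(p_joint_if_independent, 4))  # 0.225
```

Note that this product is what independence would predict; it does not have to equal the proportion actually observed in the cell (here 12/40 = 0.3).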
Part b
If the two variables are independent, we should expect the
number of individuals with heart disease and a high cholesterol diet to be the
probability in Part a multiplied by the 23 individuals, which is: [Round to two decimal places].
Part c
Repeating Part b for cells (ii), (iii), and (iv) of the table, we find that the expected
numbers of individuals are 4.52, 6.52, and 3.48, respectively.
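The same calculation can be done for every cell at once. The sketch below assumes a hypothetical table of counts (not the data from this question), and `expected_counts` is an illustrative helper name. It relies on the fact that multiplying the two marginal probabilities by the total is the same as row total times column total divided by the total.

```python
def expected_counts(observed):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical counts, for illustration only.
observed = [[12, 8],
            [6, 14]]

expected = expected_counts(observed)
print(expected)  # [[9.0, 11.0], [9.0, 11.0]]
```

A useful check is that the expected counts always add up to the total number of individuals, and that each row and column of expected counts has the same total as the corresponding row and column of observed counts.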
The following measure (called the chi-square test statistic):

\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]

quantifies the level of association between two categorical variables. The symbol \( \sum \) means a sum.
"Observed" here refers to the observed counts on the table, while
"Expected" refers to the counts we would expect
if the two variables were independent.
The sum is taken across all the cells (i) to (iv) on the table.
If there is no association, the observed counts should not differ very
much from the expected counts, which results in a relatively small
value of \( \chi^2 \). A large value indicates disagreement
between the expected and observed counts, which suggests that the assumption of independence does not hold and
that the two variables are likely to be associated.
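As a rough illustration of how the statistic behaves, the sketch below sums (observed - expected)^2 / expected over all cells of a table. The counts are hypothetical and are not the data from this question; `chi_square_statistic` is an illustrative name, not a library function.

```python
def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            expected = r * c / n  # expected count under independence
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts, for illustration only.
associated = [[12, 8],
              [6, 14]]
# A table whose observed counts match the expected counts exactly.
no_association = [[10, 10],
                  [20, 20]]

print(round(chi_square_statistic(associated), 2))      # 3.64
print(round(chi_square_statistic(no_association), 2))  # 0.0
```

The second table gives exactly zero because every observed count equals its expected count; the further the observed counts move from the expected ones, the larger the statistic becomes.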
Compute \( \chi^2 \). [Round to two decimal places].
Of course, how large a value counts as large is another question, and it is beyond the scope of this course.