## Overview
- What is the chi-square test? How does it work?
- Learn about the different types of Chi-Square tests and where and when you should apply them
- We will also implement a chi-square test in R in this article
## Introduction
Let's start with a case study. I want you to think of your favorite restaurant right now. Let's say you can predict a certain number of people arriving for lunch five days a week. At the end of the week, you observe that the expected footfall was different from the actual footfall. Sounds like a prime statistics problem? That's the idea!
So, how will you check whether the difference between the observed and the expected footfall values is statistically significant? Remember, this is a categorical variable, "Days of the week", with 5 categories [Monday, Tuesday, Wednesday, Thursday, Friday].
We can always opt for z-tests, t-tests or ANOVA when we're dealing with continuous variables. But the situation becomes tricky when working with categorical features (as most data scientists will attest to!). I've found the chi-square test to be quite helpful in my own projects. So let's dive into the article to understand all about the chi-square test: what it is, how it works and how we can implement it in R.
## Table of Contents
- What are Categorical Variables?
- What is a Chi-Square Test and Why Do We Use It?
- Assumptions of the Chi-Square Test
- Types of Chi-Square Tests (With implementation in R)
- Chi-Square Goodness of Fit Test
- Chi-Square Test of Association between Two Variables
## What are Categorical Variables?
I'm sure you've encountered categorical variables before, even if you might not have intuitively recognized them. They can be tricky to deal with in the data science world, so let's first define them. A categorical variable is one whose values fall into a limited set of distinct groups, or categories, rather than being measured on a continuous scale.
For example, the categorical variable "Movie Genre" in a list of movies could contain the categories "Action", "Fantasy", "Comedy", "Romance", etc. There are broadly two types of categorical variables:
- **Nominal Variable**: A nominal variable has two or more categories with no natural ordering. For example, Marital Status (Single, Married, Divorced); Gender (Male, Female, Transgender), etc.
- **Ordinal Variable**: A variable whose categories can be placed in an order. For example, Customer Satisfaction (Excellent, Very Good, Good, Average, Bad), and so on.
When the data we want to analyze contains this type of variable, we turn to the chi-square test, denoted by χ², to test our hypothesis.
## What is a Chi-Square Test and Why Do We Use It?
Let's learn the use of chi-square with an intuitive example. A research scholar is interested in the relationship between the placement of students in the statistics department of a reputed University and their C.G.P.A (their final assessment score). He obtains the placement records of the past five years from the placement cell database (at random). He records how many students who got placed fell into each of the following C.G.P.A. categories: 9-10, 8-9, 7-8, 6-7, and below 6.
If there is no relationship between the placement rate and the C.G.P.A., then the placed students should be equally spread across the different C.G.P.A. categories (i.e. there should be similar numbers of placed students in each category). However, if students with a C.G.P.A of more than 8 are more likely to get placed, then there would be a larger number of placed students in the higher C.G.P.A. categories than in the lower ones. In this case, the data collected would make up the observed frequencies.
Here enters the chi-square test!
We are almost at the implementation of chi-square tests, but there's one more thing we need to learn before we get there.
## Assumptions of the Chi-Square Test
Just like any other statistical test, the chi-square test comes with a few assumptions of its own:
- The χ² test assumes that the data for the study is obtained through random selection, i.e. the observations are randomly picked from the population
- The categories are mutually exclusive, i.e. each subject fits in only one category. For example, in our restaurant case above, a person who had lunch on Monday can't be counted in the Tuesday category
- The data should be in the form of frequencies or counts of a particular category and not in percentages
- The data should not consist of paired samples or groups; in other words, the observations should be independent of each other
- The chi-square test should not be used when more than 20% of the expected frequencies have a value of less than 5. To tackle this problem, either combine categories (only where it makes sense to do so) or obtain more data
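
The last assumption is easy to check before running a test. Below is a minimal sketch in R; the category names, counts and proportions are hypothetical and only serve to illustrate the check, they are not taken from any dataset in this article.

```r
# Hypothetical observed counts and claimed proportions (illustrative only)
observed <- c(A = 13, B = 8, C = 4, D = 2, E = 3)
claimed  <- c(A = 0.4, B = 0.3, C = 0.1, D = 0.1, E = 0.1)

# Expected frequency of each category under the claimed proportions
expected <- sum(observed) * claimed

# Fraction of categories whose expected frequency is below 5
share_small <- mean(expected < 5)
assumption_ok <- share_small <= 0.20

# One possible remedy: merge the sparse categories C, D and E into "Other"
observed_merged <- c(A = 13, B = 8, Other = 4 + 2 + 3)
claimed_merged  <- c(A = 0.4, B = 0.3, Other = 0.3)
expected_merged <- sum(observed_merged) * claimed_merged
share_small_merged <- mean(expected_merged < 5)
assumption_ok_merged <- share_small_merged <= 0.20

print(share_small)
print(share_small_merged)
```

Here three of the five expected frequencies fall below 5, so the check fails until the sparse categories are merged. Whether merging is meaningful depends on the problem, as noted in the assumption above.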
## Types of Chi-Square Tests (By manual calculations and with implementation in R)
## Chi-Square Goodness of Fit Test
This is a non-parametric test. We typically use it to find whether the observed values of a given event differ significantly from the expected values. In this case, we have categorical data for one independent variable, and we want to check whether the distribution of the data is similar to or different from the expected distribution.
Let's consider the above example where the research scholar was interested in the relationship between the placement of students in the statistics department of a reputed University and their C.G.P.A. In this case, the independent variable is C.G.P.A with the categories 9-10, 8-9, 7-8, 6-7, and below 6.
The statistical question here is whether or not the observed frequencies of placed students are equally distributed across the different C.G.P.A categories (so that our theoretical frequency distribution contains the same number of students in each of the C.G.P.A categories). We will arrange this data in a contingency table which consists of both the observed and expected values as below:

| C.G.P.A | 9-10 | 8-9 | 7-8 | 6-7 | Below 6 | Total |
|---|---|---|---|---|---|---|
| Observed frequency of placed students | 30 | 35 | 20 | 10 | 5 | 100 |
| Expected frequency of placed students | 20 | 20 | 20 | 20 | 20 | 100 |

After constructing the contingency table, the next task is to compute the value of the chi-square statistic. The formula for chi-square is given as:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

where,
- **χ²** = Chi-Square value
- **Oᵢ** = Observed frequency
- **Eᵢ** = Expected frequency
Let us look at the step-by-step approach to calculate the chi-square value:
**Step 1**: Subtract each expected frequency from the related observed frequency. For example, for the C.G.P.A category 9-10, it will be 30-20 = 10. Apply a similar operation for all the categories.
**Step 2**: Square each value obtained in step 1, i.e. (O-E)². For example, for the C.G.P.A category 9-10, the value obtained in step 1 is 10. It becomes 100 on squaring. Apply a similar operation for all the categories.
**Step 3**: Divide all the values obtained in step 2 by the related expected frequencies, i.e. (O-E)²/E. For example, for the C.G.P.A category 9-10, the value obtained in step 2 is 100. On dividing it by the related expected frequency, which is 20, it becomes 5. Apply a similar operation for all the categories.
**Step 4**: Add all the values obtained in step 3 to get the chi-square value. In this case, the chi-square value comes out to be **32.5**.
**Step 5**: Once we have calculated the chi-square value, the next task is to compare it with the critical chi-square value. We can find this in a chi-square table against the degrees of freedom (number of categories - 1) and the level of significance.
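
The five steps can also be traced in a few lines of R. This is a sketch that uses the observed counts from the placement example; the variable names are my own, and the 0.05 level of significance is an assumption. The critical value is obtained with `qchisq()` rather than read from a printed table.

```r
# Observed counts of placed students per C.G.P.A category (from the table above)
observed <- c(30, 35, 20, 10, 5)

# Equal expected frequency in every category: 100 / 5 = 20
expected <- rep(sum(observed) / length(observed), length(observed))

differences   <- observed - expected    # Step 1: O - E
squared       <- differences^2          # Step 2: (O - E)^2
contributions <- squared / expected     # Step 3: (O - E)^2 / E
chi_square    <- sum(contributions)     # Step 4: chi-square value, 32.5

# Step 5: critical value for (number of categories - 1) degrees of freedom
deg_freedom <- length(observed) - 1
critical    <- qchisq(0.95, df = deg_freedom)   # about 9.49 at the 0.05 level

print(chi_square)
print(critical)

# Cross-check with the built-in test (equal proportions are its default)
builtin <- chisq.test(observed)
print(builtin$statistic)
```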
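In this case, our obtained value of 32.5 is much larger than the critical value of 9.49 (4 degrees of freedom at the 0.05 level of significance). We therefore reject the null hypothesis that the placed students are equally distributed across the C.G.P.A categories. Let's further solidify our understanding by performing the chi-square test in R.
## The Chi-Square Goodness of Fit Test in R
Let's implement the chi-square goodness of fit test in R. Time to fire up RStudio!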
Let's understand the problem statement before we dive into R. An organization claims that the experience of the employees of different departments is distributed in the following categories:
- 11-20 Years = 20%
- 21-40 Years = 17%
- 6-10 Years = 41% and
- Up to 5 Years = 22%
A random sample of 1470 employees is collected. Does this random sample provide evidence against the organization's claim?
**Null hypothesis:** The true proportions of the experience of the employees of different departments are distributed in the following categories: 11-20 Years = 20%, 21-40 Years = 17%, 6-10 Years = 41% and up to 5 Years = 22%
**Alternative hypothesis:** The distribution of experience of the employees of different departments differs from what the organization states
Let's begin!
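
A goodness of fit test like this can be run with `chisq.test()` from base R by passing the observed counts as `x` and the claimed proportions as `p`. The sketch below is only an illustration: the counts are hypothetical placeholders that add up to 1470, not the values from the article's dataset, so substitute the counts from your own data.

```r
# Claimed proportions from the problem statement
claimed <- c("11-20 Years"   = 0.20,
             "21-40 Years"   = 0.17,
             "6-10 Years"    = 0.41,
             "Up to 5 Years" = 0.22)

# HYPOTHETICAL observed counts (placeholders, not the article's data)
observed <- c("11-20 Years"   = 340,
              "21-40 Years"   = 207,
              "6-10 Years"    = 607,
              "Up to 5 Years" = 316)

# Goodness of fit test: do the observed counts match the claimed proportions?
result <- chisq.test(x = observed, p = claimed)

print(result$expected)   # expected counts under the organization's claim
print(result)            # chi-square statistic, degrees of freedom, p-value

# Decision at the 0.05 level of significance
reject_null <- result$p.value < 0.05
print(reject_null)
```

With per-employee data, the observed counts would typically come from `table()` applied to the experience column of the data frame.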
The p-value here is less than 0.05. Therefore, we will reject our null hypothesis. Hence, the sample provides evidence against the organization's claim about how experience is distributed.
## Chi-Square Test for Association/Independence
The second type of chi-square test is the test for association (or independence) between two categorical variables. Let's take another example to understand this. A teacher wants to know whether the outcome of a mathematics test is related to the gender of the person taking the test. In other words, she wants to know if males show a different pattern of pass/fail rates than females. So, here are two categorical variables: Gender (Male and Female) and mathematics test outcome (Pass or Fail). Let us now look at the contingency table:

| | Boys | Girls | Total |
|---|---|---|---|
| Pass | 17 | 20 | 37 |
| Fail | 8 | 5 | 13 |
| Total | 25 | 25 | 50 |

By looking at the above contingency table, we can see that the girls have a comparatively higher pass rate than the boys. However, to test whether this observed difference is significant or not, we will carry out the chi-square test. The steps to calculate the chi-square value are as follows:
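**Step 1**: Calculate the expected frequency for each cell as (row total × column total) / grand total. For the first cell, the expected frequency would be (37*25)/50 = 18.5. Now, write them next to the observed frequencies in brackets:

| | Boys | Girls | Total |
|---|---|---|---|
| Pass | 17 (18.5) | 20 (18.5) | 37 |
| Fail | 8 (6.5) | 5 (6.5) | 13 |
| Total | 25 | 25 | 50 |

**Step 2**: Calculate (O-E)²/E for each cell. For example, for the first cell, ((17-18.5)^2)/18.5 = 0.1216.
**Step 3**: Add the values of all the cells to get the chi-square value: 0.1216 + 0.1216 + 0.3462 + 0.3462 = 0.9356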
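
The same calculation can be sketched in R using the counts from the contingency table. The variable names are my own and the 0.05 level of significance is an assumption. Note that `chisq.test()` applies Yates' continuity correction to 2x2 tables by default, so `correct = FALSE` is needed to match the manual calculation.

```r
# Observed counts from the contingency table (rows: outcome, columns: gender)
observed <- matrix(c(17, 20,
                      8,  5),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(outcome = c("Pass", "Fail"),
                                   gender  = c("Boys", "Girls")))

# Expected frequency of each cell: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square value: sum of (O - E)^2 / E over all cells
chi_square <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1), and the critical value
deg_freedom <- (nrow(observed) - 1) * (ncol(observed) - 1)
critical    <- qchisq(0.95, df = deg_freedom)

print(expected)
print(chi_square)
print(critical)

# Built-in test without the continuity correction, for comparison
builtin <- chisq.test(observed, correct = FALSE)
print(builtin)
```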
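The next task is to compare it with the critical chi-square value from a chi-square table. The calculated chi-square value of about 0.94 is smaller than the critical value of 3.84 (1 degree of freedom at the 0.05 level of significance), so we fail to reject the null hypothesis: the data does not show a significant association between gender and the test outcome. Let's further solidify our understanding by performing the chi-square test in R.
## Test for Independence in R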
The Human Resources department of an organization wants to check whether the age and experience of the employees are dependent on each other. For this purpose, a random sample of 1470 employees is collected with their age and experience.
Let's begin!
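
A test for independence takes a two-way table of counts. The sketch below is an illustration only: the age groups, the experience groups and every count are hypothetical placeholders that add up to 1470, not the article's dataset.

```r
# HYPOTHETICAL two-way table of counts (placeholders, not the article's data)
counts <- matrix(c(250, 120,  20,   5,
                   100, 350, 180,  40,
                    20, 130, 150, 105),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(
                   age        = c("Under 30", "30-45", "Over 45"),
                   experience = c("Up to 5 Years", "6-10 Years",
                                  "11-20 Years", "21-40 Years")))

# With per-employee data, a table like this could be built with table()
# on the two relevant columns of the data frame.

# Chi-square test of independence between age group and experience group
result <- chisq.test(counts)

print(result$expected)   # expected counts if the two variables were independent
print(result)            # chi-square statistic, degrees of freedom, p-value

# Decision at the 0.05 level of significance
reject_null <- result$p.value < 0.05
print(reject_null)
```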
The p-value here is less than 0.05. Therefore, we will reject our null hypothesis. We can conclude that age and experience are dependent on each other, which fits the intuition that as experience increases, age also increases (and vice versa).
## End Notes
In this article, we learned how to test for significant differences in data that contains categorical measures with the help of chi-square tests. We covered where the chi-square test is used, the assumptions involved in carrying out the test, and how to conduct different types of chi-square tests both manually and in R.
If you are new to statistics, want to cover your basics, and also want to get a start in data science, I recommend taking the Introduction to Data Science course. It gives you a comprehensive overview of both descriptive and inferential statistics before diving into data science techniques.
Did you find this article useful? Can you think of any other applications of the chi-square test? Let me know in the comments section below and we can come up with more ideas!