How to use chi-squared to test for Hardy-Weinberg equilibrium

This post demonstrates the use of chi-squared to test for Hardy-Weinberg equilibrium. A question on a recent (February 2020) AP Biology practice test required this calculation. The question is a secure item, so the exact question will not be discussed here. A previous post on this blog explains how to test for evolution using the null hypothesis and chi-squared.

Phenotypes and genotypes for examples

For our examples, we'll use the fictional species featured in many of the evolution simulations. The population demonstrates incomplete dominance for color. There are two alleles: red (R) and blue (B). Heterozygotes (RB) have a purple phenotype.

Chi-squared equation
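For reference, the standard form of the chi-squared statistic is sketched below. Here o is the observed count in a category, e is the expected count in the same category, and the sum runs over all categories (in this post, the three phenotypes).

$$
\chi^2 = \sum \frac{(o - e)^2}{e}
$$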

Chi-squared is a statistical test used to determine whether observed data (o) differ significantly from expected data (e). A population is at Hardy-Weinberg equilibrium for a gene if five conditions are met: random mating, no mutation, no gene flow, no natural selection, and large population size. Under these conditions, the allele frequencies for a population are expected to remain constant (at equilibrium) over time. The H-W equations estimate genotype and allele frequencies for a population that is at equilibrium. They may not accurately predict the frequencies if the population is not at equilibrium (for example, if selection is occurring). However, even when an evolutionary force is present, a population may still match the expected H-W data.

Hardy-Weinberg equations
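The two Hardy-Weinberg equations in their usual form are given below. The assignment of symbols is an assumption made for this post's examples: p stands for the frequency of the red (R) allele and q for the frequency of the blue (B) allele, so p² is the expected frequency of red (RR) individuals, 2pq of purple (RB) individuals, and q² of blue (BB) individuals.

$$
p + q = 1
\qquad\qquad
p^2 + 2pq + q^2 = 1
$$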

In the case of a trait showing incomplete dominance, the heterozygotes are phenotypically distinct from both types of homozygous individuals, which allows the genotype and allele frequencies to be calculated directly (without the H-W equations). This direct calculation can be compared to values based on H-W calculations to determine if the population is at H-W equilibrium.


For the first example, we'll use a simple data set (not generated by a simulation). In this case, there are 50 total individuals in the population: 10 are red, 10 are purple, and 30 are blue. These are the observed values for the chi-squared analysis.

Sample data

First, we need to find the allele frequencies. The population has a total of 50 individuals, and each individual has two alleles, so there are 100 alleles in the population. Each red (RR) individual has two copies of the R allele and each purple (RB) individual has one copy, so there are (2 × 10) + 10 = 30 R alleles in the population. Based on this, the R allele frequency is 30/100 = 0.3 and the B allele frequency is 1 − 0.3 = 0.7 (work shown below).

Allele frequency calculations
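The allele count described above can be sketched in a few lines of Python. The function name `allele_frequencies` and its parameter names are hypothetical, chosen only for this illustration; the counts are the sample data from the text.

```python
def allele_frequencies(n_rr, n_rb, n_bb):
    """Return (freq_R, freq_B) from counts of RR, RB, and BB individuals."""
    total_alleles = 2 * (n_rr + n_rb + n_bb)  # every individual carries two alleles
    r_alleles = 2 * n_rr + n_rb               # RR contributes two R alleles, RB one
    freq_r = r_alleles / total_alleles
    return freq_r, 1 - freq_r


# Sample data from the text: 10 red (RR), 10 purple (RB), 30 blue (BB)
freq_r, freq_b = allele_frequencies(10, 10, 30)
print(round(freq_r, 3), round(freq_b, 3))
```

Counting alleles directly like this only works because the heterozygotes can be told apart from both homozygotes.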

Next, we have to find the expected frequencies for each genotype, based on H-W equations. The work is shown below.

Expected frequency calculations
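Written out with the allele frequencies found above (0.3 for R and 0.7 for B, labeled p and q here as an assumption of notation), the expected genotype frequencies work out as follows. The three values sum to 1, which is a quick check on the arithmetic.

$$
\begin{aligned}
\text{red (RR):}\quad & p^2 = (0.3)^2 = 0.09 \\
\text{purple (RB):}\quad & 2pq = 2(0.3)(0.7) = 0.42 \\
\text{blue (BB):}\quad & q^2 = (0.7)^2 = 0.49
\end{aligned}
$$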

Once we have the frequencies for each genotype, we can then find the expected numbers by multiplying the frequencies by the total number of individuals (50).

Expected value calculations
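As a rough sketch of this step, the expected counts can be computed from the R allele frequency and the population size. The helper name `expected_counts` is hypothetical and used only for illustration.

```python
def expected_counts(freq_r, total):
    """Expected numbers of each phenotype under Hardy-Weinberg equilibrium."""
    freq_b = 1 - freq_r
    return {
        "red": freq_r ** 2 * total,              # p^2 * N
        "purple": 2 * freq_r * freq_b * total,   # 2pq * N
        "blue": freq_b ** 2 * total,             # q^2 * N
    }


# Sample data: R allele frequency 0.3, 50 individuals
expected = expected_counts(0.3, 50)
for phenotype, count in expected.items():
    print(phenotype, round(count, 2))
```

For the sample data this gives about 4.5 red, 21 purple, and 24.5 blue individuals. Expected values do not need to be whole numbers.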

Now that we have both observed and expected values, we can plug them into the chi-squared equation.

Chi-squared calculation
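The calculation can be checked with a short Python sketch. The function name `chi_squared` is an illustrative choice; the observed and expected values are those from the sample data. Without rounding the intermediate terms, the sum comes to roughly 13.72, so the 13.71 quoted in this post most likely reflects rounding each term before adding.

```python
def chi_squared(observed, expected):
    """Sum of (o - e)^2 / e over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Order: red, purple, blue
observed = [10, 10, 30]
expected = [4.5, 21.0, 24.5]

chi2 = chi_squared(observed, expected)
print(round(chi2, 2))
```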

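The resulting chi-squared value is 13.71. For a significance level of 0.05 and 2 degrees of freedom (see this post for a more involved discussion of how to use chi-squared results), the critical value is 5.99. The chi-squared value for this sample (13.71) is greater than 5.99, so we reject the hypothesis that the observed and expected values are equivalent. This suggests that the population is not at H-W equilibrium. (Two degrees of freedom follows the categories-minus-one rule. Many statistics texts use 1 degree of freedom for a Hardy-Weinberg test because the allele frequency is estimated from the same data; that lowers the critical value to 3.84 and does not change the conclusion for any example in this post.)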
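The critical value of 5.99 normally comes from a chi-squared table, but for the special case of 2 degrees of freedom it can be computed directly, because the chi-squared distribution with 2 degrees of freedom has the simple tail probability exp(−x/2). The sketch below relies on that fact and only applies to 2 degrees of freedom; the function names are hypothetical.

```python
import math


def critical_value_df2(alpha):
    """Critical chi-squared value for 2 degrees of freedom at significance level alpha."""
    return -2 * math.log(alpha)


def p_value_df2(chi2):
    """Probability of a chi-squared value at least this large (2 degrees of freedom only)."""
    return math.exp(-chi2 / 2)


critical = critical_value_df2(0.05)
p_value = p_value_df2(13.71)
print(round(critical, 2))   # matches the table value of 5.99
print(p_value < 0.05)       # True means the null hypothesis is rejected
```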

For the next examples, we'll use data generated by the population genetics simulation. See this blog post for an explanation of the simulation. In this run, there is no selection against any of the phenotypes and there is no mutation chance. The population size is set to 500.

Simulation results

At the end of the simulation run, the red allele frequency is 0.541 and the blue allele frequency is 0.459. The frequencies for the phenotypes are 0.278 for red, 0.526 for purple, and 0.196 for blue.

Simulation frequencies

We can use the phenotype frequencies and the total population number (500) to find the number of individuals for each phenotype. These numbers are the observed values for the chi-squared calculation.

Calculating observed values based on frequencies
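A minimal sketch of this conversion, using the phenotype frequencies quoted above and the population size of 500. The variable names are illustrative, and rounding to whole individuals is an assumption made here because the reported frequencies are themselves rounded.

```python
# Phenotype frequencies reported at the end of the simulation run
phenotype_freqs = {"red": 0.278, "purple": 0.526, "blue": 0.196}
population_size = 500

# Observed counts = frequency * population size, rounded to whole individuals
observed = {
    phenotype: round(freq * population_size)
    for phenotype, freq in phenotype_freqs.items()
}
print(observed)
```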

The next step is to find the expected values. If the population is at H-W equilibrium, the phenotype values calculated from the allele frequencies will be close to the observed phenotype values. The expected frequency of the red individuals based on the H-W equation is the frequency of the red allele (0.541) squared. Then multiply by 500 to get the expected value. The work for the expected values is shown below.

Calculation of expected values
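The same steps can be written as a short script. This is a sketch that uses the allele frequencies reported in the text (0.541 for red and 0.459 for blue); the variable names are illustrative only.

```python
freq_r = 0.541          # red allele frequency from the simulation
freq_b = 0.459          # blue allele frequency from the simulation
population_size = 500

expected = {
    "red": freq_r ** 2 * population_size,             # p^2 * N
    "purple": 2 * freq_r * freq_b * population_size,  # 2pq * N
    "blue": freq_b ** 2 * population_size,            # q^2 * N
}
for phenotype, count in expected.items():
    print(phenotype, round(count, 2))
```

With these inputs the expected values are about 146.34 red, 248.32 purple, and 105.34 blue.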

At this point, the chi-squared value can be determined by plugging the observed and expected values into the chi-squared equation. In this case, the chi-squared value is 1.4 (work is shown below).

Chi-squared calculation
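The whole test for this run can be sketched as a single function. The name `hardy_weinberg_chi_squared` is hypothetical, and the inputs are the rounded frequencies quoted in the text. Using those rounded inputs, the sketch gives a value of about 1.7 rather than 1.4; the gap presumably comes from rounding or from values that appear only in the original figure. Either value is well below the critical value of 5.99.

```python
def hardy_weinberg_chi_squared(observed, freq_r):
    """Chi-squared statistic comparing observed (red, purple, blue) counts
    with Hardy-Weinberg expectations for the given R allele frequency."""
    total = sum(observed)
    freq_b = 1 - freq_r
    expected = (
        freq_r ** 2 * total,           # red (RR)
        2 * freq_r * freq_b * total,   # purple (RB)
        freq_b ** 2 * total,           # blue (BB)
    )
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CRITICAL_VALUE = 5.99  # significance level 0.05, 2 degrees of freedom, as used in this post

chi2 = hardy_weinberg_chi_squared((139, 263, 98), 0.541)
print(round(chi2, 2), chi2 < CRITICAL_VALUE)
```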

Our value of 1.4 is smaller than the critical value (5.99), so we cannot reject the hypothesis that the observed and expected values are equivalent. This means that the final population distribution is consistent with the H-W equations.


The final example again uses the population genetics simulation to generate data. This time it is set to test a heterozygote advantage situation. The survival chance is 50% for red (RR) individuals, 100% for purple (RB) individuals, and 0% for blue (BB) individuals. The other variables are the same as in the previous example.

Simulation results, heterozygote advantage

Again, we have to start by finding the observed numbers based on the phenotype frequencies, and the expected numbers based on the allele frequencies.

Here is the chi-squared calculation for this example:

Chi-squared calculation

The chi-squared value is 97.15. This is far greater than the critical value (5.99), so the hypothesis that the observed and expected values are equivalent is rejected. This indicates that the population is not at H-W equilibrium.


For the two examples using simulated data, we used the values from the final recorded generation, but other generations could be used as well. Using the initial generation can give some interesting results, as it reflects the initial change resulting from the parameters chosen for that run.
