Genotyping quality control

Clean the demo genotype data before population structure and association testing

Real genotype data is never perfect. Some samples have too many missing calls, some variants are poorly measured, sex labels can be wrong, and related individuals can break the independence assumption used by standard GWAS tests. Quality control (QC) removes or flags these problems before any association analysis.

This page is the canonical QC tutorial for the website. The source scripts live in sections/01B_genotyping_qc/scripts/, but learners run the copied scripts from their tutorial working folder:

~/gwas_tutorial/
├── demo_data/
├── scripts/01B_genotyping_qc/
└── results/qc/
TipBefore you start

Finish Before you start first. You should have:

  • demo_data/pdac_demo.bed
  • demo_data/pdac_demo.bim
  • demo_data/pdac_demo.fam
  • demo_data/phenotype.txt
  • working plink2, plink, and Rscript commands

All commands below are run from the project root, for example ~/gwas_tutorial.

NoteAbout the example figures and tables

The commands write your local results to results/qc/. The figures and tables displayed on this web page are example outputs from this repository folder:

sections/01B_genotyping_qc/results/

Your own numbers may differ if you change thresholds, use a different random seed, or run the workflow on another dataset.

What QC Does

The demo QC pipeline is intentionally stepwise. Each script runs one check, writes output to results/qc/, and prints the next command to run.

Raw demo genotypes: demo_data/pdac_demo.bed/.bim/.fam
    |
    | 01 Initial QC statistics
    | 02 Sample call rate
    | 03 Sex check
    | 04 Heterozygosity outliers
    | 05 Variant call rate
    | 06 Hardy-Weinberg equilibrium
    | 07 Relatedness
    | 08 Minor allele frequency
    | 09 QC summary report
    v
Clean genotypes: results/qc/pdac_demo_08_filt.bed/.bim/.fam
NoteWhy one step at a time?

For teaching, we do not hide the QC decisions inside one long command. Running one step at a time lets you see how many samples or variants each filter removes. That is how you catch mistakes early.

Run the QC Pipeline

01 - Initial QC Statistics

Why. Before removing anything, measure the dataset. This step creates baseline allele frequency, heterozygosity, sample missingness, and variant missingness summaries.

bash scripts/01B_genotyping_qc/01_initial_qc_stats.sh

Main outputs:

results/qc/pdac_demo_01_qc.afreq
results/qc/pdac_demo_01_qc.het
results/qc/pdac_demo_01_qc.smiss
results/qc/pdac_demo_01_qc.vmiss
results/qc/pdac_demo_01_qc_variant_density_by_chromosome.tsv
results/qc/pdac_demo_01_qc_variant_density_by_chromosome.png
results/qc/pdac_demo_01_qc_sample_missingness_histogram.png
results/qc/pdac_demo_01_qc_variant_missingness_histogram.png
results/qc/pdac_demo_01_qc_allele_frequency_distribution.png
results/qc/pdac_demo_01_qc_heterozygote_rate_distribution.png
results/qc/pdac_demo_01_qc_sample_missingness_vs_heterozygosity.png

How to read the outputs.

  • *.smiss: one row per sample; F_MISS is the proportion of missing genotypes.
  • *.vmiss: one row per variant; F_MISS is the proportion of samples missing that variant.
  • *.afreq: allele frequencies; useful for checking rare and common variant content.
  • *.het: per-sample heterozygosity and inbreeding coefficient information.

Example Step 01 plots:

Bar plot of variant density by chromosome.

This plot checks whether variants are spread across the genome in a plausible way. Large missing chromosomes or strong chromosome-specific gaps would suggest a data preparation problem.

Histogram of sample missingness.

Histogram of variant missingness.

Most samples and variants should have low missingness. The sample call-rate filter in Step 02 and variant call-rate filter in Step 05 act on these distributions.

Histogram of alternate allele frequencies.

This plot shows how many variants are rare, low-frequency, or common. In a rare-cancer study we avoid an aggressive MAF threshold because low-frequency variants may still be informative.

Distribution of heterozygote rates.

Scatter plot of sample missingness against heterozygosity rate.

The scatter plot is a quick contamination check: samples with both unusual missingness and unusual heterozygosity deserve attention before downstream analysis.

02 - Sample Call Rate

Why. Samples with too many missing genotypes often reflect poor DNA quality, lab failure, or platform problems. In a rare cancer GWAS, every sample is valuable, but very low-quality samples can harm the entire analysis.

bash scripts/01B_genotyping_qc/02_sample_callrate.sh

This applies:

--mind 0.02

--mind 0.02 removes samples with more than 2% missing genotypes. In the demo run this removed 25 samples, leaving 1,436 samples.

Main outputs:

results/qc/pdac_demo_02_samples_callrate.mindrem.id
results/qc/pdac_demo_02_samples_keep.txt
results/qc/pdac_demo_02_filt.bed
results/qc/pdac_demo_02_filt.bim
results/qc/pdac_demo_02_filt.fam
TipWhat to inspect

Open pdac_demo_02_samples_callrate.mindrem.id if you want to see which samples failed. In real data, also check whether removed samples are disproportionately cases, controls, or one contributing cohort.

03 - Sex Check

Why. The genetic data on chromosome X should broadly agree with the recorded sex. A mismatch can indicate a sample swap, data entry error, or contamination.

bash scripts/01B_genotyping_qc/03_sex_check.sh

The script writes:

results/qc/pdac_demo_03_sexcheck.sexcheck
results/qc/pdac_demo_03_sexcheck_discordant.txt
results/qc/pdac_demo_03_filt.bed
results/qc/pdac_demo_03_filt.bim
results/qc/pdac_demo_03_filt.fam

In the demo run, this step removed 4 samples.

NoteWhy this matters in a tutorial

Sex check is not only a biology check. It is also a sample identity check. A sex mismatch may be the first visible sign that labels, sample IDs, or genotype files were mixed.

04 - Heterozygosity Outliers

Why. Extremely high or low heterozygosity can signal contamination, inbreeding, ancestry outliers, or technical artifacts. This step uses autosomal heterozygosity and removes samples outside a standard deviation-based threshold.

bash scripts/01B_genotyping_qc/04_heterozygosity.sh

Main outputs:

results/qc/pdac_demo_04_het.het
results/qc/pdac_demo_04_het_outliers.txt
results/qc/pdac_demo_04_het_outliers.pdf
results/qc/pdac_demo_04_filt.bed
results/qc/pdac_demo_04_filt.bim
results/qc/pdac_demo_04_filt.fam

In the demo run, this step removed 2 samples.

The baseline heterozygosity distribution from Step 01 is shown again here because it is the plot that motivates this filter:

Distribution of heterozygote rates used to motivate heterozygosity outlier filtering.

TipFor rare cancer studies

Do not remove a case just because it is visually interesting. Remove it because it clearly fails a documented QC rule. In the final report, keep the threshold and number removed.

05 - Variant Call Rate

Why. Variants missing in many samples are unreliable. Missingness can reflect poor probe performance, batch effects, or genotype calling problems.

bash scripts/01B_genotyping_qc/05_variant_callrate.sh

This applies:

--geno 0.05

--geno 0.05 removes variants missing in more than 5% of samples. In the demo run this removed 8,480 variants, leaving 421,520 variants.

Main outputs:

results/qc/pdac_demo_05_filt.bed
results/qc/pdac_demo_05_filt.bim
results/qc/pdac_demo_05_filt.fam

The Step 01 variant missingness plot shows why this filter exists:

Histogram of per-variant missingness.

06 - Hardy-Weinberg Equilibrium

Why. Strong Hardy-Weinberg equilibrium (HWE) deviations in controls often indicate genotyping errors. In a case/control GWAS, controls are the safer group for this filter because true disease variants may deviate in cases.

bash scripts/01B_genotyping_qc/06_hardy_weinberg.sh

This uses demo_data/phenotype.txt to identify controls, then removes variants with control-only HWE p-value below 1e-6.

Main outputs:

results/qc/pdac_demo_06_controls.txt
results/qc/pdac_demo_06_hwe.hardy
results/qc/pdac_demo_06_hwe_pvalue_distribution.png
results/qc/pdac_demo_06_hwe_exclude.txt
results/qc/pdac_demo_06_filt.bed
results/qc/pdac_demo_06_filt.bim
results/qc/pdac_demo_06_filt.fam

In the demo run this step removed 14,573 variants, leaving 406,947 variants.

Histogram of minus log10 Hardy-Weinberg p-values.

WarningDo HWE in controls

For disease GWAS, avoid filtering on HWE calculated in all samples if cases are included. A real disease association can create deviation in cases. This tutorial therefore performs the HWE filter using controls.

07 - Relatedness

Why. Standard GWAS regression assumes unrelated individuals. Related pairs can inflate test statistics if they are not removed or modelled.

bash scripts/01B_genotyping_qc/07_relatedness.sh

This step uses the phenotype file to protect rare cancer cases during pruning:

File Required columns Coding
demo_data/phenotype.txt FID IID PHENO 1 = control, 2 = case

The first two columns must match the family ID and individual ID in the .fam file. A header line is allowed.

KING, PI_HAT, and Thresholds

This step uses LD-pruned variants and KING kinship to find duplicate/twin or first-degree related samples. It also runs the older PLINK 1.9 --genome command to report PI_HAT, which many GWAS QC tutorials use for identity-by-descent checks.

PI_HAT and KING kinship are related but not the same scale. In ideal outbred pedigrees, PI_HAT is roughly twice KING kinship, but real data will not match perfectly. This tutorial reports both and uses KING for the final pruning list.

Pair type Degree Approx. PI_HAT / relationship coefficient Approx. KING kinship
Unrelated - 0 0
Parent-offspring 1st 0.5 0.25
Full siblings 1st 0.5 0.25
Half-siblings 2nd 0.25 0.125
Grandparent-grandchild 2nd 0.25 0.125
Uncle/aunt-niece/nephew 2nd 0.25 0.125
First cousins 3rd 0.125 0.0625
Identical twins or duplicate DNA sample - 1 0.5

The 0.1875 KING cutoff sits between the expected first-degree value (0.25) and second-degree value (0.125):

(0.25 + 0.125) / 2 = 0.1875

That means pairs above 0.1875 are treated as duplicate/twin or first-degree-or-closer for pruning. Some tutorials use the nearby geometric-mean cutoff instead:

sqrt(0.25 * 0.125) = 0.177

Both cutoffs try to separate first-degree-or-closer pairs from more distant relatives. Here we use 0.1875 because it is common in GWAS QC scripts and easy to explain.

Rare-Cancer Pruning Rule

Because this is a rare-cancer tutorial, the final pruning list is phenotype-aware:

  • If a related pair above 0.1875 contains one case and one control, keep the case and remove the control.
  • If both samples have the same phenotype, remove one sample to break the relationship.
  • If a sample appears in many related pairs, removing that sample may resolve multiple pairs at once.

Main outputs:

results/qc/pdac_demo_07_related_pairs.kin0
results/qc/pdac_demo_07_king_cutoff.king.cutoff.in.id
results/qc/pdac_demo_07_king_cutoff.king.cutoff.out.id
results/qc/pdac_demo_07_ibd_pi_hat.genome
results/qc/pdac_demo_07_ibd_pi_hat_annotated.tsv
results/qc/pdac_demo_07_ibd_pi_hat_summary.tsv
results/qc/pdac_demo_07_related_pairs_annotated.tsv
results/qc/pdac_demo_07_relatedness_summary.tsv
results/qc/pdac_demo_07_king_vs_pihat_comparison.tsv
results/qc/pdac_demo_07_king_vs_pihat_summary.tsv
results/qc/pdac_demo_07_pruning_components.tsv
results/qc/pdac_demo_07_relatedness_pruning_diagnostics.tsv
results/qc/pdac_demo_07_removal_decisions.tsv
results/qc/pdac_demo_07_removal_summary.tsv
results/qc/pdac_demo_07_removal_phenotype_counts.tsv
results/qc/pdac_demo_07_removal_list.txt
results/qc/pdac_demo_07_filt.bed
results/qc/pdac_demo_07_filt.bim
results/qc/pdac_demo_07_filt.fam

Example relatedness summary:

Relationship category Pairs Unique samples
Duplicate or twin 0 0
First degree 284 448
Second degree 4 8
Third degree 10 20
No third-degree-or-closer pair NA 964
Total samples before relatedness filter NA 1,430

Example pruning diagnostics:

Metric Value
Samples before relatedness filter 1,430
KING pruning pairs > 0.1875 284
Unique samples in KING pruning pairs 448
Relatedness components 164
Largest component samples 6
PLINK2 --king-cutoff removed samples 167
Phenotype-aware removed samples 191
Phenotype-aware removed percent 13.4
Controls removed 154
Cases removed 37

Example KING vs PI_HAT summary:

Metric Value
Total pairs in either report 55,337
Pairs reported by both 298
Same category 294
Different category, check values 4
Reported by KING only 0
Reported by PI_HAT only 55,039
Pruning pairs 284
Pruning pairs with PI_HAT 284
Mean absolute PI_HAT - 2 * KING 0.025632
Maximum absolute PI_HAT - 2 * KING 0.126513
WarningReview large relatedness loss

In the demo run, phenotype-aware pruning removed 191 samples. That is a substantial sample loss. If a real cohort loses many samples at this step, inspect pdac_demo_07_pruning_components.tsv, pdac_demo_07_relatedness_pruning_diagnostics.tsv, and pdac_demo_07_removal_decisions.tsv before continuing.

08 - Minor Allele Frequency

Why. Very rare variants can be unstable in small studies. For PDAC we keep the threshold low (--maf 0.01) because rare and low-frequency variants may be important in rare cancers.

bash scripts/01B_genotyping_qc/08_maf_filter.sh

The cleaned dataset is:

results/qc/pdac_demo_08_filt.bed
results/qc/pdac_demo_08_filt.bim
results/qc/pdac_demo_08_filt.fam
results/qc/pdac_demo_08_afreq.afreq

In the demo run this step removed 5,038 variants, leaving 401,909 variants and 1,239 samples.

NoteWhy not use MAF 5%?

Many standard GWAS workflows use --maf 0.05. That is often reasonable for common-variant GWAS, but rare cancer studies have fewer cases and may be especially interested in low-frequency effects. This tutorial uses --maf 0.01 as a teaching example of a less aggressive threshold.

09 - QC Summary Report

Why. A reproducible GWAS needs a record of every threshold and how many samples or variants were removed at each step. This final step creates one combined PDF report with the count table, QC plots, relatedness diagnostics, and interpretation notes.

bash scripts/01B_genotyping_qc/09_qc_summary.sh

Main outputs:

results/qc/pdac_demo_09_qc_counts.tsv
results/qc/pdac_demo_09_qc_summary.txt
results/qc/pdac_demo_09_qc_report.pdf

The generated report combines the same checks shown on this page into one review file:

  • executive summary and index
  • final sample and variant count table
  • sample and variant retention plots
  • initial QC distributions
  • HWE, heterozygosity, and allele-frequency checks
  • relatedness and phenotype-aware pruning diagnostics
  • method notes for downstream interpretation

Methods Notes and Interpretation

The report also summarizes the QC rules in a methods-friendly table. This is useful when you later write the Methods section of a manuscript or need to explain exactly why a sample or variant was removed.

Level Check Threshold Interpretation
Sample Sample call rate --mind 0.02 Remove samples with more than 2% missing genotypes.
Sample Sex check X chromosome F statistic Detect sex discordance, swaps, or metadata errors.
Sample Heterozygosity Autosomal F +/- 3 SD Remove strong contamination or heterozygosity outliers.
Sample Relatedness KING kinship > 0.1875 Remove duplicate/twin or first-degree-or-closer related samples.
Sample Relatedness pruning strategy Phenotype-aware Prefer removing controls over cases in related case-control pairs.
Variant Variant call rate --geno 0.05 Remove variants missing in more than 5% of samples.
Variant Hardy-Weinberg equilibrium --hwe 1e-6 in controls Remove variants with strong control-only HWE deviation.
Variant Minor allele frequency --maf 0.01 Keep variants with MAF at least 1%; less aggressive for rare cancer.

The main interpretation points are:

  • Rare cancer setting: cases are precious, so sample-level exclusions should be documented.
  • Relatedness: KING is used for final pruning, while PI_HAT is reported for comparison with older QC workflows.
  • Phenotype-aware pruning: when possible, related case-control pairs are resolved by removing the control.
  • Large relatedness loss: inspect relatedness components and removal decisions before continuing.
  • Final QC-pass data: the Step 08 dataset is the input for ancestry checks, PCA, and association testing.

Example Report Preview

The example report is a PDF. Some preview browsers automatically download PDFs if they are embedded directly in the page, so the report is shown here as a launch panel instead.

NoteExample Step 09 QC report

File: sections/01B_genotyping_qc/results/pdac_demo_09_qc_report.pdf

The report includes the final count table, retention plots, initial QC distributions, HWE and MAF checks, relatedness diagnostics, and method notes.

Open report in a new tab
TipWhat should you do with the report?

Keep the report with your analysis outputs. It is a compact record of what happened to the data before PCA and association testing, and it gives you the numbers needed for a Methods section.

The demo run used in this page ended with:

Stage Samples Variants Samples lost Variants lost
Raw data 1,461 430,000 0 0
Initial stats 1,461 430,000 0 0
Sample call rate 1,436 430,000 25 0
Sex check 1,432 430,000 4 0
Heterozygosity 1,430 430,000 2 0
Variant call rate 1,430 421,520 0 8,480
Hardy-Weinberg 1,430 406,947 0 14,573
Relatedness 1,239 406,947 191 0
MAF filter 1,239 401,909 0 5,038

Final retention in the demo run was 84.8% of samples and 93.5% of variants.

PDAC-Specific Notes

Pancreatic cancer GWAS is a rare-cancer setting. That changes how we think about QC:

  • Removing a case costs power, so sample-level filters should be justified and documented.
  • Relatedness pruning should preserve cases when a related case-control pair can be resolved by removing the control.
  • Overly aggressive MAF filtering can discard low-frequency signal.
  • Population stratification is especially important in multi-consortium analyses.
  • If many related samples are real family structure, consider relatedness-aware association methods downstream.

The final QC-pass dataset is the starting point for population stratification and PCA.

References

Back to top