How to carry out a Genome-Wide Association Study
A step-by-step annotated guide using pancreatic cancer as a case study
This is the companion website to the methodology manuscript developed within the TRANSPAN COST Action (CA21116), Working Group 1, targeted at XXXX Journal. It pairs annotated PLINK 2 and R code with a fully open, reproducible demo dataset, and is written for graduate students and clinicians without a bioinformatics background.
What is a GWAS?
A genome-wide association study (GWAS) tests genetic variants across the entire genome, typically single-nucleotide polymorphisms (SNPs), for statistical association with a trait. The aim is to identify genetic factors that contribute to complex traits and diseases.
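As a minimal sketch of what "testing for association" can mean for a binary trait such as case/control status, one common formulation fits a separate logistic regression for each variant. This is an illustrative assumption, not necessarily the exact model used later in this guide, and the symbols are introduced here only for the example: y_i is the case status of individual i, g_ij is the number of copies (0, 1 or 2) of the tested allele that individual i carries at variant j, x_i is a vector of covariates, and M is the number of variants tested. The display assumes the amsmath package.

```latex
% Per-variant logistic regression (additive genetic model)
\[
  \operatorname{logit} \Pr(y_i = 1 \mid g_{ij}, \mathbf{x}_i)
    = \alpha_j + \beta_j \, g_{ij} + \boldsymbol{\gamma}_j^{\top} \mathbf{x}_i ,
  \qquad g_{ij} \in \{0, 1, 2\}
\]
% One null hypothesis per variant
\[
  H_0 : \beta_j = 0 , \qquad j = 1, \dots, M
\]
% Conventional genome-wide threshold: Bonferroni correction of 0.05
% for roughly one million independent tests
\[
  \frac{0.05}{10^{6}} = 5 \times 10^{-8}
\]
```

Under this model, exp(β_j) is the odds ratio per additional copy of the tested allele. Because the test is repeated for every variant, a much stricter threshold than 0.05 is needed; p < 5 × 10⁻⁸ is the widely used convention for genome-wide significance.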
Why pancreatic cancer?
Pancreatic ductal adenocarcinoma (PDAC) is the deliberate case study throughout this guide: a relatively rare, multifactorial cancer with a poorly characterised genetic component. Single-centre studies are underpowered, so international consortia (PanScan, PanC4, PANDoRA) are the rule rather than the exception. Teaching GWAS under these realistically difficult conditions makes the methodological decisions more transparent than a large, well-powered textbook example would.
Should we add dbGaP accession numbers and say something like “This guide uses a simulated demo dataset; real data can be accessed through dbGaP under accession number XXXXXX”?
How to use this guide
Every analysis step is shown as a real command you can run yourself on the open demo dataset, so you can reproduce the entire workflow from raw genotypes to final results.
👉 Start here: Before you start — setup and your first PLINK command
Target groups
This tutorial is designed for two main groups:
- Beginners who have little or no previous experience with GWAS, bioinformatics, or command-line analysis.
- Intermediate and advanced users who already have some experience with bioinformatics or GWAS and want a transparent, reproducible teaching workflow.
You do not need to know how to program. Every command is given in full, with an explanation of what it does and what the output should look like. If you can copy, paste and press Enter, you can follow this guide.
The structure is intentionally step by step. Some tasks are split across several scripts even where a single script would do, because the goal is to make each decision visible and easier to learn.
For the same reason, this tutorial is not designed as a high-performance computing workflow or as a template for large-scale parallel processing. If you are working with large datasets, use an appropriate HPC environment and distribute jobs according to your institution’s computing guidelines.