Before you start: setup

Get your computer ready to run the GWAS pipeline

This page gets your computer ready and walks you through your first GWAS QC command. By the end you will have:

  • Set up a portable project folder
  • Downloaded the demo dataset
  • Installed the required tools (currently PLINK2, PLINK1.9, METAL, REGENIE, and R)
  • Run your first test script that confirms everything works

Note: What you will need
  • A computer running Windows (with WSL2), macOS, or Linux
  • 30-45 minutes to complete all setup steps
  • Internet access to download tools and data
  • Optional: admin rights to install software (or ask your IT team for help)

You do not need any programming experience. Every command is provided in full with alternatives and troubleshooting information.

A note on the command line

Most steps in this guide are typed into a SHELL (also called a “command line” or “console”) — a window where you type a command, press Enter, and the program runs.

  • Windows: You must use the WSL2 terminal (see Step 0). WSL2 gives you Linux inside Windows.
  • macOS: Open the Terminal app (in Applications → Utilities).
  • Linux: Open your Terminal.
  • Inside RStudio: there is a Terminal tab next to the Console tab — you can use that too.

Let’s practice using the terminal before we dive into the pipeline.

Throughout the guide, lines you type into the terminal look like the block below this text. Try this small practice block. It creates a tiny throwaway folder and file, looks at the file in a few different ways, then removes both.

Please click the copy button on the right side of the block to copy all the commands, paste them into your terminal, and press Enter to run them. Don’t worry if you don’t understand every command right now — just get a feel for how the terminal works. To paste into the terminal, you can usually right-click on the terminal, or use Ctrl + Shift + V on Windows/Linux, or Cmd + V on macOS.

echo "Welcome to the GWAS tutorial."
echo "1) Where am I?"
pwd

echo "2) What is in this folder?"
ls

echo "3) Make a practice folder."
mkdir -p gwas_practice

echo "4) Make a tiny practice file."
echo "GWAS practice line 1: samples" > gwas_practice/hello.txt
echo "GWAS practice line 2: variants" >> gwas_practice/hello.txt
echo "GWAS practice line 3: quality control" >> gwas_practice/hello.txt

echo "5) Print the whole file with cat."
cat gwas_practice/hello.txt

echo "6) Print the first two lines with head."
head -n 2 gwas_practice/hello.txt

echo "7) Print the last two lines with tail."
tail -n 2 gwas_practice/hello.txt

echo "8) Show the file with less."
less -F -X gwas_practice/hello.txt

echo "9) Remove the practice file and folder."
rm gwas_practice/hello.txt
rmdir gwas_practice

echo "Done. You just used pwd, ls, mkdir, cat, head, tail, less, rm, and rmdir."

The commands you just ran do the following:

  • echo prints text to the terminal. You can also use it to write text into files with > (create or overwrite) and >> (append to the end of the file).
  • pwd prints your current folder.
  • ls lists files in the current folder.
  • mkdir creates folders. The -p argument allows you to create parent folders if they don’t exist.
  • cat prints a whole file.
  • head and tail show the beginning and end of a file.
  • less opens a file for reading. Press q to exit reading mode, and use the up and down arrow keys to scroll. Here we use less -F -X, so this tiny file is shown and the command returns automatically.
  • rm removes a file.
  • rmdir removes an empty folder.

In this exercise, rm removes only the practice file you just created, and rmdir removes the empty practice folder. This is a safe way to practice removing files without accidentally deleting important data. There is also #rm -rf, which removes folders and files recursively without asking for confirmation. The # prefix indicates this is a commented-out command. Be very careful with this command: it permanently deletes data.

If you want to learn more about these commands, you can check their manual pages by typing man <command> (for example, man ls) in the terminal.
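The difference between > and >> is easiest to see by trying it. The block below is a small illustration that works in a throwaway folder made with mktemp; the names scratch_dir and notes.txt are made up for this example and are not used elsewhere in the guide.

```shell
# Work in a throwaway folder so nothing important is touched.
# (scratch_dir and notes.txt are made-up names for this illustration.)
scratch_dir="$(mktemp -d)"

echo "first"  > "$scratch_dir/notes.txt"    # > creates the file
echo "second" >> "$scratch_dir/notes.txt"   # >> adds a second line at the end
echo "Number of lines after appending:"
wc -l < "$scratch_dir/notes.txt"

echo "third" > "$scratch_dir/notes.txt"     # > again: the old content is replaced
final_content="$(cat "$scratch_dir/notes.txt")"
echo "Content after overwriting: $final_content"

# Remove the throwaway folder and the file inside it.
rm -r "$scratch_dir"
```

After the last echo, the file holds only the line third, because > starts the file from scratch each time.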

If you have any issues with these commands, don’t worry. This is just a practice block to get you familiar with the terminal. Basic Bash knowledge is not required, but it will help you feel more comfortable as you move through the pipeline. There are many online materials available; here is one recommendation.

If you do not know how to open the terminal, just follow the next instructions (Step 0) carefully and ask for help if you get stuck.


Step 0 — Prepare your operating system

Before you begin, set up your OS environment.

WSL2 (Windows Subsystem for Linux) gives you a Linux environment inside Windows. The GWAS pipeline uses bash scripts, so WSL2 is required.

Part A: Install WSL2

  1. Open PowerShell as Administrator (right-click and select “Run as administrator”)

  2. Run:

    wsl --install
  3. Restart your computer when prompted

Part B: Open WSL2 Terminal

  1. After restart, open Windows Terminal (search for it in the Start menu)
  2. Click the dropdown arrow and select Ubuntu
  3. You should see a bash prompt like user@computer:~$

Part C: Install Basic Tools

In the WSL2 bash terminal, run:

sudo apt-get update
sudo apt-get install git curl wget unzip

WSL2 is now ready. From here on, all commands use this bash terminal.

For help troubleshooting WSL2, see WSL Setup.
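If you are unsure whether a terminal window is really running inside WSL, one quick check is to look at the kernel version string. The sketch below assumes that the WSL kernel reports the word microsoft in /proc/version, which is the case on the WSL releases we are aware of; is_wsl is a hypothetical helper name, not part of the tutorial scripts.

```shell
# Report whether a kernel version string looks like WSL.
# The file argument defaults to /proc/version; passing another file is only
# useful for trying out the function.
is_wsl() {
  local version_file="${1:-/proc/version}"
  [ -r "$version_file" ] && grep -qi "microsoft" "$version_file"
}

if is_wsl; then
  echo "This terminal is running inside WSL."
else
  echo "This terminal does not look like WSL (native Linux or macOS)."
fi
```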

macOS and Linux: your system already has bash. No setup is needed for Step 0.


Step 1 — Create your project folder

Choose any location on your computer and create a folder named gwas_tutorial. This will be your working folder for the entire pipeline.

DIR="$HOME/gwas_tutorial"
mkdir -p "$DIR"
cd "$DIR"
Tip

You can place this folder anywhere: Desktop, Documents, custom path — it doesn’t matter. Everything is portable because we use relative paths. DIR is a variable that holds the path to your project folder. $HOME means “home directory”, so this creates the folder in your home directory. You can also create it on other drives or directories by changing the DIR variable, for example DIR="/your/desired/directory".
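The commands in this guide always write "$DIR" inside double quotes. The sketch below shows why: with quotes, a folder path that contains a space is still treated as one path. It uses a throwaway location and a separate variable, DEMO_DIR, so your real DIR is left alone; both names are made up for this illustration.

```shell
# Hypothetical example: a project path that contains a space.
# base_dir is a throwaway location created only for this illustration.
base_dir="$(mktemp -d)"
DEMO_DIR="$base_dir/my projects/gwas_tutorial"

mkdir -p "$DEMO_DIR"      # quoted: one folder path, space included
cd "$DEMO_DIR"
echo "Now in: $(pwd)"
cd - > /dev/null          # return to the folder you were in before
```

Without the quotes, the shell would split the path at the space and treat it as two separate arguments.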


Step 2 — Create the directory structure

Create the folder structure that keeps scripts, data, and tools organized.

First, check that the project folder was created successfully:

# Report whether the project folder from Step 1 exists
if [ -d "$DIR" ]; then
  printf "\n\n%s\n" "✓ Project folder created successfully: $DIR"
  printf "\n\n%s\n" "[NEXT] Stay in this folder for the next commands."
  printf "\n\n%s\n\n" "Scripts, data, tools, and results will be stored here."
else
  printf "\n\n%s\n" "✗ Failed to create project folder: $DIR"
  printf "\n\n%s\n" "[Go to previous step] and make sure the command to create the folder ran successfully."
  # No exit here: in an interactive terminal, exit would close the window.
fi

If the folder was created successfully, you should see a message like:

✓ Project folder created successfully: /home/user/gwas_tutorial
[NEXT] Stay in this folder for the next commands.
Scripts, data, tools, and results will be stored here.

If you see an error message instead, please go back to the previous step and make sure the command to create the folder ran successfully.

Next, create the subfolders for scripts, data, tools, and results. You can do this with one command or create them manually.

Download and run the setup script:

# Download the automated setup script
# (-f makes curl stop with an error instead of saving an error page such as a 404)
curl -fL https://raw.githubusercontent.com/mgentiluomo/how-to-gwas-pdac/main/scripts/dev/init_project.sh -o init_project.sh
# Make it executable
chmod +x init_project.sh
# Run it
bash init_project.sh

This creates all folders automatically.

If curl is not available on your system, use wget instead:

wget https://raw.githubusercontent.com/mgentiluomo/how-to-gwas-pdac/main/scripts/dev/init_project.sh -O init_project.sh
chmod +x init_project.sh
bash init_project.sh
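If you are not sure which download tool your system has, a small helper can pick one for you. This is only a sketch: pick_downloader and fetch are hypothetical names, not functions from the tutorial scripts, and the flags mirror the curl and wget commands shown above, plus curl's -f so that an HTTP error is not saved as if it were the script.

```shell
# Print the name of the first available download tool, or fail if none exists.
pick_downloader() {
  if command -v curl > /dev/null 2>&1; then
    echo "curl"
  elif command -v wget > /dev/null 2>&1; then
    echo "wget"
  else
    echo "Neither curl nor wget is installed." >&2
    return 1
  fi
}

# Download a URL to a file with whichever tool is available.
# fetch is a hypothetical helper name, not part of the tutorial scripts.
fetch() {
  local url="$1" out="$2" tool
  tool="$(pick_downloader)" || return 1
  case "$tool" in
    curl) curl -fL "$url" -o "$out" ;;   # -f: fail on HTTP errors such as 404
    wget) wget "$url" -O "$out" ;;
  esac
}

echo "Download tool on this system: $(pick_downloader 2> /dev/null || echo none)"
```

With these two functions defined, a download would look like fetch followed by the URL and the output file name, for example the init_project.sh URL above and init_project.sh.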

Create folders manually:

# Create all needed folders
mkdir -p \
    scripts \
    scripts/dev \
    demo_data \
    tools/bin \
    data_processed \
    results/{qc,pop_structure,imputation,association,finemapping,meta_analysis}
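Whichever route you used, you can confirm that the folders exist before moving on. The sketch below checks the same folder names as the manual mkdir command; check_folders is a hypothetical helper, and we are assuming (not confirming) that the automated script creates the same set.

```shell
# Check the folders from the manual mkdir command and report any that are missing.
# check_folders is a hypothetical helper, not one of the tutorial scripts.
check_folders() {
  local root="${1:-.}" missing=0 folder
  for folder in scripts scripts/dev demo_data tools/bin data_processed \
      results/qc results/pop_structure results/imputation \
      results/association results/finemapping results/meta_analysis; do
    if [ -d "$root/$folder" ]; then
      echo "ok       $folder"
    else
      echo "MISSING  $folder"
      missing=1
    fi
  done
  return "$missing"
}

# Check the project folder if DIR is set, otherwise the current folder.
check_folders "${DIR:-$PWD}" || echo "Create the missing folders before continuing."
```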

After Step 3, your folder will look like this:

gwas_tutorial/
├── scripts/
│   ├── dev/                         # Utility scripts
│   │   ├── download_demo_data.sh
│   │   ├── tools_setup.sh
│   │   ├── test.sh
│   │   ├── init_project.sh
│   │   ├── section_manifest.txt
│   │   └── script_manifest.txt
│   ├── 01B_genotyping_qc/           # QC scripts
│   │   ├── 01_initial_qc_stats.sh
│   │   ├── 02_sample_callrate.sh
│   │   └── ...09_qc_summary.sh
│   ├── 02_population_stratification/
│   ├── 03_imputation/
│   └── ...other sections...
├── demo_data/                       # Demo dataset (Step 4)
├── tools/bin/                       # Tools (eg PLINK, Step 5)
├── data_processed/                  # Processed output files
└── results/                         # Results organized by workflow
    └── qc/                          # QC output files
Tip: Setup helper pages

For a compact project-folder overview, see Setup Project Structure. If you are on Windows/WSL2, keep WSL Setup nearby for path and terminal troubleshooting.


Step 3 — Download the GWAS scripts

Clone the GitHub repo to get all the GWAS pipeline scripts organized by section:

# Clone the repo
git clone https://github.com/mgentiluomo/how-to-gwas-pdac.git

# Copy utility/dev scripts
mkdir -p scripts/dev
find how-to-gwas-pdac/scripts/dev -maxdepth 1 -type f -exec cp {} scripts/dev/ \;

# Start fresh manifest files for the setup test
: > scripts/dev/section_manifest.txt
: > scripts/dev/script_manifest.txt

# Copy section scripts directly into section folders
for section_dir in how-to-gwas-pdac/sections/*/; do
  section=$(basename "$section_dir")
  if [ -d "$section_dir/scripts" ]; then
    mkdir -p "scripts/$section"
    echo "scripts/$section" >> scripts/dev/section_manifest.txt
    find "$section_dir/scripts" -maxdepth 1 -type f -exec cp {} "scripts/$section/" \;
    find "scripts/$section" -maxdepth 1 -type f | sort >> scripts/dev/script_manifest.txt
  fi
done

# Add utility/dev scripts to the script manifest
find scripts/dev -maxdepth 1 -type f -name "*.sh" | sort >> scripts/dev/script_manifest.txt
sort -u scripts/dev/section_manifest.txt -o scripts/dev/section_manifest.txt
sort -u scripts/dev/script_manifest.txt -o scripts/dev/script_manifest.txt

# Optional cleanup: delete only the temporary cloned copy after scripts are copied
rm -r how-to-gwas-pdac
Tip

If you see rm: remove write-protected regular file 'how-to-gwas-pdac/.git/objects/pack/pack-XXX.rev'?, type y and press Enter to confirm deletion. This is normal because the cloned repo is write-protected.

Verify the scripts downloaded:

# Check which section folders were copied
cat scripts/dev/section_manifest.txt

# Check the currently available QC scripts
ls scripts/01B_genotyping_qc/

At this stage, the section list should include scripts/01B_genotyping_qc, and you should see 01_initial_qc_stats.sh through 09_qc_summary.sh.

And utility scripts:

ls scripts/dev/

You should see: download_demo_data.sh, tools_setup.sh, test.sh, init_project.sh, plus section_manifest.txt and script_manifest.txt.
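Instead of comparing the listings by eye, you can let the shell confirm that every path in the manifest exists. This is a minimal sketch, assuming one relative path per line (which is how the commands above write the manifest); check_manifest is a hypothetical helper and may differ from what test.sh does internally.

```shell
# Check that every path listed in a manifest file exists.
# check_manifest is a hypothetical helper; it expects one path per line,
# relative to the folder you run it from.
check_manifest() {
  local manifest="$1" missing=0 path
  while IFS= read -r path; do
    [ -n "$path" ] || continue          # skip blank lines
    if [ ! -e "$path" ]; then
      echo "MISSING  $path"
      missing=1
    fi
  done < "$manifest"
  return "$missing"
}

if [ -f scripts/dev/script_manifest.txt ]; then
  if check_manifest scripts/dev/script_manifest.txt; then
    echo "All listed scripts are present."
  fi
else
  echo "No manifest found; run this from your project folder after Step 3."
fi
```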


Step 4 — Download the demo dataset

Download the 7 demo dataset files (~164 MB total) to your demo_data/ folder.

Run the download script from your project:

bash scripts/dev/download_demo_data.sh

This downloads all 7 files automatically and verifies them. It uses curl, which is already available on most macOS and Linux systems.

Download files individually. If your system has curl but not wget, use the automated script instead.

cd ~/gwas_tutorial  # Make sure you're in your project folder

# Download all 7 files
wget https://github.com/mgentiluomo/how-to-gwas-pdac/releases/download/v0.1-data/pdac_demo.bed -O demo_data/pdac_demo.bed
wget https://github.com/mgentiluomo/how-to-gwas-pdac/releases/download/v0.1-data/pdac_demo.bim -O demo_data/pdac_demo.bim
wget https://github.com/mgentiluomo/how-to-gwas-pdac/releases/download/v0.1-data/pdac_demo.fam -O demo_data/pdac_demo.fam
wget https://github.com/mgentiluomo/how-to-gwas-pdac/releases/download/v0.1-data/phenotype.txt -O demo_data/phenotype.txt
wget https://github.com/mgentiluomo/how-to-gwas-pdac/releases/download/v0.1-data/covariates.txt -O demo_data/covariates.txt
wget https://github.com/mgentiluomo/how-to-gwas-pdac/releases/download/v0.1-data/survival.txt -O demo_data/survival.txt
wget https://github.com/mgentiluomo/how-to-gwas-pdac/releases/download/v0.1-data/sample_ancestry.tsv -O demo_data/sample_ancestry.tsv

Verify the files downloaded:

ls -lh demo_data/

You should see all 7 files.

Tip: File Integrity Check

The download script automatically verifies the SHA256 hash of each file to ensure they were downloaded correctly. You should see messages like:

✓ OK: pdac_demo.bed
✓ OK: pdac_demo.bim
...
✓ All files verified successfully! (7/7)

If any file fails verification, the script will tell you which one and suggest re-downloading it. This built-in verification ensures your data is safe and ready for analysis.
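To see what a hash check does, here is a minimal sketch of the idea. It is not the download script's actual code: sha256_of and verify_file are hypothetical helpers, and the demonstration runs on a throwaway file because the published hashes of the demo files are not listed on this page.

```shell
# Print the SHA256 hash of a file, using whichever tool is installed.
# sha256_of is a hypothetical helper name.
sha256_of() {
  if command -v sha256sum > /dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'      # common on Linux and WSL
  else
    shasum -a 256 "$1" | awk '{print $1}'  # available on macOS
  fi
}

# Compare a file against an expected hash and report the result.
verify_file() {
  local file="$1" expected="$2" actual
  actual="$(sha256_of "$file")"
  if [ "$actual" = "$expected" ]; then
    echo "✓ OK: $(basename "$file")"
  else
    echo "✗ MISMATCH: $(basename "$file")"
    return 1
  fi
}

# Self-contained demonstration on a throwaway file (not the real demo data).
sample="$(mktemp)"
echo "example content" > "$sample"
recorded_hash="$(sha256_of "$sample")"   # stands in for a published checksum
verify_file "$sample" "$recorded_hash"
```

If even one byte of the file changes after the hash was recorded, the comparison fails, which is how a truncated or corrupted download is caught.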


Step 5 — Install dependencies

Install the required tools for the current tutorial sections. At this stage that means PLINK2, PLINK1.9, METAL, REGENIE, and R.

Before starting the install, check that you are in the project folder and that the basic download tools are visible:

cd "$HOME/gwas_tutorial"
pwd
ls scripts/dev/tools_setup.sh

command -v curl || echo "curl is missing"
command -v wget || echo "wget is missing"
command -v git || echo "git is missing"
command -v unzip || echo "unzip is missing"
command -v tar || echo "tar is missing"
command -v bzip2 || echo "bzip2 is missing"

R --version || echo "R is not installed yet"
curl -I https://github.com || echo "Internet connection check failed"

If the internet check fails inside WSL but works in Windows, see WSL Setup for managed-network and proxy troubleshooting.

Run the setup script from your project:

bash scripts/dev/tools_setup.sh

This script will:

  • Detect your OS and CPU architecture automatically (Linux x86_64/i686, macOS Intel/Apple Silicon)
  • Check whether R is available and try to install it if it is missing
  • Download the correct PLINK2 and PLINK1.9 binaries from official sources
  • Download the official precompiled METAL binary for Linux/WSL or macOS
  • Install micromamba into tools/micromamba/
  • Install REGENIE into a micromamba environment named regenie_env
  • Extract and organize command-line tools in tools/plink2/, tools/plink1.9/, tools/metal/, tools/micromamba/, and tools/regenie/
  • Create symlinks in tools/bin/ for easy access
  • Update your PATH so you can run plink2, plink, metal, and regenie from anywhere
  • Write scripts/dev/tool_manifest.tsv, which lists the tools checked by test.sh
  • Check for basic helper tools such as wget, unzip, tar, and bzip2

When finished, you’ll see:

Detected: OS=Linux, Architecture=x86_64_avx2
  ✓ R found
  ✓ PLINK2: PLINK v2.00a5 (64-bit build)
  ✓ PLINK1.9: PLINK v1.90b7.2
  ✓ METAL
  ✓ REGENIE: regenie v...

REGENIE is installed with project-local micromamba, following the official REGENIE conda install route. The environment files are stored under tools/micromamba-root/, so users do not need an existing conda installation.

The automated setup is recommended. Use this manual route only if you cannot use the setup script or need to inspect each tool installation step.

1. Install R

R is used later for QC plots and summary tables.

On WSL/Linux:

sudo apt-get update
sudo apt-get install -y r-base r-base-dev
R --version

On macOS with Homebrew:

brew install r
R --version

2. Download PLINK2

Visit: https://www.cog-genomics.org/plink/2.0/

Choose the build for your system:

  • Linux (Intel, AVX2): plink2_linux_avx2_*.zip
  • Linux (AMD, AVX2): plink2_linux_amd_avx2_*.zip
  • Linux (64-bit, no AVX2): plink2_linux_x86_64_*.zip
  • macOS (Intel, AVX2): plink2_mac_avx2_*.zip
  • macOS (Intel, no AVX2): plink2_mac_*.zip
  • macOS (Apple Silicon): plink2_mac_arm64_*.zip
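If you do not know whether your processor supports AVX2, the following sketch may help on Linux and WSL. It assumes the CPU flags are listed in /proc/cpuinfo, which is standard on Linux but does not exist on macOS; has_avx2 is a hypothetical helper name.

```shell
# Report whether the CPU flags in a cpuinfo-style file include AVX2.
# The file argument defaults to /proc/cpuinfo, which exists on Linux and WSL
# but not on macOS.
has_avx2() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  [ -r "$cpuinfo" ] && grep -qiw "avx2" "$cpuinfo"
}

echo "Operating system: $(uname -s)"
echo "CPU architecture: $(uname -m)"
if has_avx2; then
  echo "AVX2: supported (an AVX2 build should work)"
else
  echo "AVX2: not detected here (on macOS this check does not apply)"
fi
```

On Linux, uname -m printing x86_64 together with an AVX2 result points to one of the AVX2 builds; when in doubt, the build without AVX2 is the more conservative choice.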

Extract and move to your tools folder:

mkdir -p tools/plink2
cd tools/plink2
# Download your platform's ZIP file here
unzip plink2_*.zip
cd ../..

3. Download PLINK1.9

Visit: https://www.cog-genomics.org/plink/1.9/

Choose the build for your system:

  • Linux (64-bit): plink_linux_x86_64_*.zip
  • macOS: plink_mac_*.zip

Extract and move to your tools folder:

mkdir -p tools/plink1.9
cd tools/plink1.9
# Download your platform's ZIP file here
unzip plink_*.zip
cd ../..

4. Download METAL

Visit: https://csg.sph.umich.edu/abecasis/Metal/download/

Choose the precompiled binary for your system:

  • Linux/WSL: Linux-metal.tar.gz
  • macOS: Darwin-metal.tar.gz

Extract it in your tools folder:

mkdir -p tools/metal
cd tools/metal
# Download the Linux/WSL or macOS archive here
tar -xzf *-metal.tar.gz
find . -type f \( -name "metal" -o -name "METAL" \) -exec chmod +x {} \;
cd ../..

5. Install micromamba and REGENIE

The automated setup installs micromamba for you. If you are doing this manually, first download micromamba.

For Linux/WSL x86_64:

mkdir -p tools/micromamba
cd tools/micromamba
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
cd ../..

For macOS Apple Silicon:

mkdir -p tools/micromamba
cd tools/micromamba
curl -Ls https://micro.mamba.pm/api/micromamba/osx-arm64/latest | tar -xvj bin/micromamba
cd ../..

For macOS Intel, replace osx-arm64 with osx-64.
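If you are unsure which of the three labels applies to your machine, uname can tell you. The sketch below maps the values printed by uname -s and uname -m to the labels used in the URLs above; micromamba_platform is a hypothetical helper, and it only knows the three combinations this guide mentions.

```shell
# Map operating system and CPU architecture names (as printed by uname)
# to the platform labels used in the download URLs above.
micromamba_platform() {
  local os="$1" arch="$2"
  case "$os/$arch" in
    Linux/x86_64)  echo "linux-64" ;;
    Darwin/arm64)  echo "osx-arm64" ;;
    Darwin/x86_64) echo "osx-64" ;;
    *) echo "No label known in this guide for $os/$arch" >&2; return 1 ;;
  esac
}

if platform="$(micromamba_platform "$(uname -s)" "$(uname -m)")"; then
  echo "Use this URL: https://micro.mamba.pm/api/micromamba/$platform/latest"
fi
```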

Then create the REGENIE environment:

export MAMBA_ROOT_PREFIX="$(pwd)/tools/micromamba-root"
tools/micromamba/bin/micromamba create -y -n regenie_env -c conda-forge -c bioconda regenie

Then create a small wrapper so REGENIE works like the other tutorial tools:

mkdir -p tools/regenie
cat > tools/regenie/regenie <<'EOF'
#!/usr/bin/env bash
export MAMBA_ROOT_PREFIX="$(cd "$(dirname "${BASH_SOURCE[0]}")/../micromamba-root" && pwd)"
MICROMAMBA="$(cd "$(dirname "${BASH_SOURCE[0]}")/../micromamba/bin" && pwd)/micromamba"
exec "$MICROMAMBA" run -n regenie_env regenie "$@"
EOF
chmod +x tools/regenie/regenie
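The manual steps above leave each tool in its own folder, while the PATH step that follows points at tools/bin/. The automated script fills that folder with symlinks; for the manual route, something like the sketch below could do the same. The binary locations are assumptions about where the archives unpack (plink2 and plink at the top of their folders, metal somewhere under tools/metal/), and link_tools is a hypothetical helper, so adjust the paths if your layout differs.

```shell
# Link manually installed tools into <project>/tools/bin.
# link_tools is a hypothetical helper. The binary locations below are
# assumptions about where the archives unpack; adjust them if yours differ.
link_tools() {
  local root name src
  root="$(cd "$1" && pwd)" || return 1     # absolute path, so links stay valid
  mkdir -p "$root/tools/bin"
  for name in plink2 plink metal regenie; do
    case "$name" in
      plink2)  src="$root/tools/plink2/plink2" ;;
      plink)   src="$root/tools/plink1.9/plink" ;;
      regenie) src="$root/tools/regenie/regenie" ;;
      metal)   src="$(find "$root/tools/metal" -type f \( -name metal -o -name METAL \) 2> /dev/null | head -n 1)" ;;
    esac
    if [ -n "$src" ] && [ -x "$src" ]; then
      ln -sf "$src" "$root/tools/bin/$name"
      echo "linked   $name -> $src"
    else
      echo "skipped  $name (no executable found)"
    fi
  done
}

# Usage, from inside the project folder:
#   link_tools "$PWD"
```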

6. Update PATH

Add the tools folder to your PATH:

export PATH="$(cd tools/bin && pwd):$PATH"

To make this permanent on Linux/WSL, add it to ~/.bashrc:

echo 'export PATH="$HOME/gwas_tutorial/tools/bin:$PATH"' >> ~/.bashrc

On macOS, the default shell is usually zsh, so use ~/.zshrc instead:

echo 'export PATH="$HOME/gwas_tutorial/tools/bin:$PATH"' >> ~/.zshrc

Then reload your shell. Use the file you edited:

source ~/.bashrc

For macOS zsh:

source ~/.zshrc
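Running the echo ... >> command more than once adds the same line to your startup file each time. That is harmless, but the file gets cluttered. The sketch below appends a line only when it is not already there; add_line_once is a hypothetical helper, and the demonstration writes to a throwaway file rather than your real ~/.bashrc or ~/.zshrc.

```shell
# Append a line to a file only if that exact line is not already present.
# add_line_once is a hypothetical helper name.
add_line_once() {
  local file="$1" line="$2"
  touch "$file"
  if grep -qxF -- "$line" "$file"; then
    echo "Already present in $file"
  else
    printf '%s\n' "$line" >> "$file"
    echo "Added to $file"
  fi
}

# Demonstration on a throwaway file instead of a real startup file.
demo_rc="$(mktemp)"
path_line='export PATH="$HOME/gwas_tutorial/tools/bin:$PATH"'
add_line_once "$demo_rc" "$path_line"
add_line_once "$demo_rc" "$path_line"   # second call changes nothing
```

To use it for real, pass ~/.bashrc (Linux/WSL) or ~/.zshrc (macOS) as the first argument instead of the throwaway file.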

The tool list used by the setup test is saved here:

cat scripts/dev/tool_manifest.tsv

You can also check the project-local tool links directly. These commands use ./tools/bin/..., so they work even if your PATH has not reloaded yet:

ls -l tools/bin/
./tools/bin/plink2 --version
./tools/bin/plink --version
test -x ./tools/bin/metal && echo "METAL executable found: ./tools/bin/metal"
./tools/bin/regenie --version
R --version

Most installation problems are caused by the shell not being in the right folder, internet/proxy access, an unfinished apt-get process, or R not being available yet. Start with the message printed in your terminal and match it to one of the cases below.

Error or symptom | What it usually means | What to try
scripts/dev/tools_setup.sh: No such file or directory | You are not in the project folder, or Step 3 did not copy the scripts | Run cd "$HOME/gwas_tutorial" and ls scripts/dev/
curl, wget, or micromamba cannot connect | WSL or your terminal cannot reach the internet | Test curl -I https://github.com; WSL users on managed networks should check WSL Setup
Could not get lock /var/lib/apt/lists/lock | Another apt-get process is running | Wait, or inspect the process with ps -fp <PID>, using the PID shown in the error
Unable to locate package r-base | Package lists are stale, universe is disabled, or the Ubuntu release is unusual | See the R commands in the Manual tab; for persistent problems, use an Ubuntu LTS WSL release
404: command not found after running a downloaded script | A GitHub 404 page was saved instead of a script | Delete the file and repeat the current download command from this guide
plink, plink2, metal, or regenie: command not found | Tools were not installed or tools/bin is not on PATH | Run bash scripts/dev/tools_setup.sh again, then check ls tools/bin/
REGENIE or micromamba fails during environment creation | Usually a network, proxy, or conda-channel access problem | Confirm curl -I https://github.com works and retry bash scripts/dev/tools_setup.sh
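For the 404 case, you can look at a downloaded file before running it. The sketch below checks whether the first line is a shebang (a line starting with #!). It assumes the tutorial scripts begin with such a line, which this guide does not state, so treat a negative result as a hint to inspect the file rather than as proof; looks_like_script is a hypothetical helper name.

```shell
# Return success if the first line of a file starts with "#!".
# looks_like_script is a hypothetical helper; it assumes the downloaded
# scripts start with a shebang line, which this guide does not state.
looks_like_script() {
  local first_line=""
  IFS= read -r first_line < "$1" || true
  case "$first_line" in
    '#!'*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Demonstration on two throwaway files: a script and a saved error page.
good_file="$(mktemp)"
bad_file="$(mktemp)"
printf '#!/usr/bin/env bash\necho hello\n' > "$good_file"
printf '404: Not Found\n' > "$bad_file"

for candidate in "$good_file" "$bad_file"; do
  if looks_like_script "$candidate"; then
    echo "Looks like a script: $candidate"
  else
    echo "Does not look like a script, download it again: $candidate"
  fi
done
```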

If automatic R installation fails because Ubuntu cannot find r-base, enable the universe repository and refresh the package list:

sudo apt-get install -y software-properties-common
sudo add-apt-repository -y universe
sudo apt-get update
sudo apt-get install -y r-base r-base-dev

Then run the tool setup again:

bash scripts/dev/tools_setup.sh

Step 6 — Test tools and scripts

Run the test script to verify everything is installed and working:

bash scripts/dev/test.sh

This script will:

  1. Check the project folders
  2. Check the demo dataset files
  3. Check that every copied script listed in scripts/dev/script_manifest.txt exists
  4. Check every required tool listed in scripts/dev/tool_manifest.tsv
  5. Print the next command to run

When it finishes, you should see output like:

✓ Initial test done! Folders, scripts, data, and required tools are working.
Next command to run the QC pipeline:
  bash scripts/01B_genotyping_qc/01_initial_qc_stats.sh

If you see this message, everything is working! You can now go to Quality Control (Section 1B) and run your first QC command.


Troubleshooting

Git is not installed on your system.

Solution (Windows WSL2):

sudo apt-get update
sudo apt-get install git

Solution (macOS):

brew install git

Solution (Linux):

sudo apt-get install git

wget is not installed.

Solution (Windows WSL2 / Linux):

sudo apt-get install wget

Solution (macOS):

brew install wget

One of the installed tools is not found in your PATH.

Solution 1: Run the setup script again

The automated script should have configured PATH:

bash scripts/dev/tools_setup.sh

Solution 2: Manually add to PATH

If the script completed but PATH isn’t set, manually add it:

# Check if tools exist
ls tools/bin/plink
ls tools/bin/plink2
ls tools/bin/metal
ls tools/bin/regenie

# Add to current session
export PATH="$(cd tools/bin && pwd):$PATH"

# Verify
plink --version
plink2 --version
command -v metal && echo "metal is on PATH"
regenie --version

To make this permanent on Linux/WSL:

echo 'export PATH="$HOME/gwas_tutorial/tools/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

On macOS zsh, use ~/.zshrc instead:

echo 'export PATH="$HOME/gwas_tutorial/tools/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc

Solution 3: Use full path

If PATH setup is tricky, use the full path directly:

./tools/bin/plink2 --version
./tools/bin/plink --version
test -x ./tools/bin/metal && echo "metal is available"
./tools/bin/regenie --version

PLINK cannot find the demo dataset.

Solution: Make sure all demo files were downloaded:

ls demo_data/pdac_demo.*

You should see .bed, .bim, and .fam files. If any are missing, re-run Step 4:

bash scripts/dev/download_demo_data.sh
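A slightly more thorough check confirms that all three files exist and are not empty, and prints the size of the dataset. In the PLINK binary format, the .fam file has one line per sample and the .bim file has one line per variant, so counting lines gives both numbers. check_fileset is a hypothetical helper name, not part of the tutorial scripts.

```shell
# Check that a PLINK binary fileset (.bed, .bim, .fam) is complete and
# print how many samples and variants it describes.
# check_fileset is a hypothetical helper name.
check_fileset() {
  local prefix="$1" ext
  for ext in bed bim fam; do
    if [ ! -s "$prefix.$ext" ]; then
      echo "Missing or empty: $prefix.$ext"
      return 1
    fi
  done
  # The .fam file has one line per sample; the .bim file has one line per variant.
  echo "Samples:  $(wc -l < "$prefix.fam" | tr -d ' ')"
  echo "Variants: $(wc -l < "$prefix.bim" | tr -d ' ')"
}

check_fileset demo_data/pdac_demo || echo "Re-run the download step, then try again."
```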

The script file doesn’t have execute permissions.

Solution:

find scripts -type f -name "*.sh" -exec chmod +x {} \;
bash scripts/01B_genotyping_qc/01_initial_qc_stats.sh

You’re in the wrong directory or files are missing.

Solution:

# Make sure you're in your project folder
cd ~/gwas_tutorial

# Check files exist
ls scripts/01B_genotyping_qc/
ls demo_data/

# Then run the first QC script
bash scripts/01B_genotyping_qc/01_initial_qc_stats.sh


What’s next

You now have a working GWAS pipeline setup and have confirmed that the tools run successfully. Continue with QC and the later steps:

  • Quality control (Section 1B) — cleaning the raw genotypes before any analysis.
  • Population stratification (Section 2) — detecting and correcting for ancestry.

(These sections will be linked here as they are added.)
