Example data preparation

Table of contents

  1. Rationale
  2. Requirements
  3. Data download
    1. Environment settings
    2. Infinium Global Screening Array sites, to simulate SNP-array sites from whole-genome sequencing data
    3. Download genotype files
    4. Merge chromosome level files into a genome-wide file
    5. Re-format .fam files
    6. Download sample information
    7. Chromosome X specific processing

Rationale

Since large-scale population-based biobanks are available under restricted access only, we based our tutorials on publicly available data from the 1,000 Genome project. In the following steps, we will download the genotype data and pre-process it to simulate SNP-array data typically available in large-scale population-based biobanks.

All these steps are directly available on our github in folder pipeline/step0_download_genotype/.


Requirements

In addition of using C++ software such as THORIN, SHAPEIT5 and IMPUTE5, our pipeline also uses standard tools for genomic analysis and R packages that you may need to install by yourself. For this part, we will need bcftools, plink1.9 and plink2.

sudo apt install -y bcftools pink1.9 plink2

Data download

Environment settings

threads=16
ODIR=data
mkdir -p ${ODIR}

Infinium Global Screening Array sites, to simulate SNP-array sites from whole-genome sequencing data

wget https://emea.support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/global-screening-array-24/infinium-global-screening-array-24-v1-0-c2-annotated.zip -O ${ODIR}/gsa.zip
unzip data/gsa.zip -d ${ODIR}/ && rm ${ODIR}/gsa.zip
awk 'NR > 1 {print "chr"$2"\t"$3}' ${ODIR}/GSA-24v1-0_C2.hg38.annotated.txt > ${ODIR}/GSA.variant_positions.txt
rm ${ODIR}/GSA-24v1-0_C2.hg38.annotated.txt

Download genotype files

DIR="http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased"
PRE="CCDG_14151_B01_GRM_WGS_2020-08-05"
SUF="filtered.shapeit2-duohmm-phased.vcf.gz"
ODIR=data/genotype
mkdir -p ${ODIR}/vcf

for CHR in {1..22} X; do
	
	if [ ${CHR} == "X" ]; then
	        SUF="filtered.eagle2-phased.v2.vcf.gz"
	fi



	if [ ! -f "${ODIR}/vcf/${PRE}_chr${CHR}.gsa.${SUF}.csi" ]; then
	
	        # download chromosome-wide vcf.gz file
	        wget ${DIR}/${PRE}_chr${CHR}.${SUF} -O ${ODIR}/vcf/${PRE}_chr${CHR}.${SUF}
	        wget ${DIR}/${PRE}_chr${CHR}.${SUF}.tbi -O ${ODIR}/vcf/${PRE}_chr${CHR}.${SUF}.tbi
	
	        # subset to GSA sites
	        bcftools view -T data/GSA.variant_positions.txt ${ODIR}/vcf/${PRE}_chr${CHR}.${SUF} -Oz -o ${ODIR}/vcf/${PRE}_chr${CHR}.gsa.${SUF} --threads ${threads}
	        bcftools index -f ${ODIR}/vcf/${PRE}_chr${CHR}.gsa.${SUF} --threads ${threads}
	        rm ${ODIR}/vcf/${PRE}_chr${CHR}.${SUF}*

	fi



	if [ ! -f "${ODIR}/plink/KGP.chr${CHR}.gsa.bed" ]; then
	
	        # convert to plink file (to use with the relatedness inference software KING later)
	        mkdir -p ${ODIR}/plink
	        plink2 --vcf ${ODIR}/vcf/${PRE}_chr${CHR}.gsa.${SUF} --make-bed --out ${ODIR}/plink/KGP.chr${CHR}.gsa

	        mv ${ODIR}/vcf/${PRE}_chr${CHR}.gsa.${SUF} ${ODIR}/vcf/KGP.chr${CHR}.gsa.vcf.gz
	        mv ${ODIR}/vcf/${PRE}_chr${CHR}.gsa.${SUF}.csi ${ODIR}/vcf/KGP.chr${CHR}.gsa.vcf.gz.csi
	fi
done

Merge chromosome level files into a genome-wide file

printf "%s\n" ${ODIR}/genotype/plink/KGP.chr{1..22}.gsa > merge_list.txt
plink1.9 --merge-list merge_list.txt --make-bed --out data/genotype/plink/KGP.merged_chromosomes 
rm ${ODIR}/genotype/plink/KGP.chr*.gsa.*
rm merge_list.txt

Re-format .fam files

We modify the fam file so that they don’t all have the same family ID == 0, and set FID=IID.

awk '{print $2" "$2" "$3" "$4" "$5" "$6}' data/genotype/plink/KGP.merged_chromosomes.fam > data/genotype/plink/KGP.merged_chromosomes.v2.fam
mv data/genotype/plink/KGP.merged_chromosomes.v2.fam data/genotype/plink/KGP.merged_chromosomes.fam

Download sample information

For the clustering of close relatives, we will need age and sex of participants. Here, the age is not available for the 1,000 GP participants.

wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/1kGP.3202_samples.pedigree_info.txt -O data/1kGP.3202_samples.pedigree_info.txt

Chromosome X specific processing

We apply additional Quality-Control steps to the chromosome X. This is due to the fact that because of for example genotyping errors, males can have heterozygous sites. We therefore force homozygosity for males.


BIN=diploidize_static # available from https://github.com/odelaneau/otools
SEX=data/1kGP.3202_samples.pedigree_info.txt
VCF=data/genotype/vcf/KGP.chrX.gsa.vcf.gz
threads=8

## Get male individuals
awk '$4 == 2 {print $1}' data/1kGP.3202_samples.pedigree_info.txt > data/KGP.males.txt

## QC male individuals --> transform mixed diploid and haploid call into fully diploid calls.

IN=data/genotype/vcf/KGP.chrX.gsa.vcf.gz
OUT_M=data/genotype/vcf/KGP.chrX.gsa.males.bcf
bcftools view -S data/KGP.males.txt -Ob -o ${OUT_M} ${IN} --threads ${threads}
bcftools index ${OUT_M} --threads ${threads}


OUT_F=data/genotype/vcf/KGP.chrX.gsa.females.bcf
bcftools view -S ^data/KGP.males.txt -Ob -o ${OUT_F} ${IN} --threads ${threads}
bcftools index ${OUT_F} --threads ${threads}

OUT_D=data/genotype/vcf/KGP.chrX.gsa.males.diploidized.bcf
./${BIN} --input ${OUT_M} --output ${OUT_D}
bcftools index ${OUT_D} --threads ${threads}

OUT=data/genotype/vcf/KGP.chrX.gsa.diploidized.bcf
bcftools merge ${OUT_D} ${OUT_F} -Ob -o ${OUT}
bcftools index ${OUT} --threads ${threads}

rm ${OUT_D}* && rm ${OUT_M}* && rm ${OUT_F}*


Back to top

Copyright © 2022-2025 Robin Hofmeister, Theo Cavinato and Olivier Delaneau | All Rights Reserved | THORIN executables and source code are distributed under the MIT license.