Inputs and outputs

Description
Reference individual(s)
Unrelated individuals
Genotype input data
Outputs

Description

This page describes the main input and output files that the THORIN tool uses to map identity-by-descent (IBD) sharing between a focal individual and its reference individual(s).

Reference individual(s)

The THORIN tool maps IBD between a focal individual and reference individual(s). The reference individual(s) for each focal individual are listed in a separate .group file. This is a tab-separated file in which each line corresponds to one focal individual and its reference individuals. The first column gives the focal individual ID. Each additional column defines one reference group in the form GID=ID1;ID2;..., where the element before the = sign is the group ID (GID) and the elements after it are the IDs of the reference individual(s) in that group, separated by semicolons. The GID is the label reported in the output file. To keep track of individual reference IDs, set the GID to the reference individual ID (particularly useful with a single reference individual per group, as in line 1 of the example file below). When grouping several reference individuals, the GID can be any group label, such as GID_1, GID_2, etc.

An example .group file is shown below. The header line is included here for readability only; the actual file has no header line:

Focal individual ID	Reference 1	Reference 2
sample_1	sample_2=sample_2	sample_3=sample_3
sample_4	GID_1=sample_5;sample_6	GID_2=sample_7
sample_8	GID_1=sample_9;sample_10	GID_2=sample_11;sample_12

This file lists three focal individuals (sample_1, sample_4 and sample_8), each with a different group structure. For sample_1, we map IBD against single reference individuals and keep track of their IDs in the output file by using each reference individual ID as the GID. For sample_4 and sample_8, we cluster the reference individuals into two groups and give each group its own label in the GID field.
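The format described above can be checked before running the tool. The snippet below is a minimal sketch, not part of THORIN: it recreates the three example lines in a temporary file (the name example.group is an assumption), then uses awk to split every additional column into its GID and reference IDs and to report columns that do not match the GID=ID1;ID2 pattern.

```shell
# Recreate the example .group file (tab-separated, no header line).
tmp=$(mktemp -d)
GROUP="$tmp/example.group"   # hypothetical file name
printf 'sample_1\tsample_2=sample_2\tsample_3=sample_3\n' >  "$GROUP"
printf 'sample_4\tGID_1=sample_5;sample_6\tGID_2=sample_7\n' >> "$GROUP"
printf 'sample_8\tGID_1=sample_9;sample_10\tGID_2=sample_11;sample_12\n' >> "$GROUP"

# Print one "focal <TAB> GID <TAB> reference" line per reference individual.
# Malformed lines or columns are reported on stderr and skipped.
triples=$(awk -F'\t' '
  NF < 2 { print "line " NR ": no reference column" > "/dev/stderr"; next }
  {
    for (i = 2; i <= NF; i++) {
      n = split($i, kv, "=")
      if (n != 2 || kv[1] == "" || kv[2] == "") {
        print "line " NR ": malformed column " $i > "/dev/stderr"
        continue
      }
      m = split(kv[2], ids, ";")
      for (j = 1; j <= m; j++) print $1 "\t" kv[1] "\t" ids[j]
    }
  }' "$GROUP")
printf '%s\n' "$triples"
```

Each printed line makes the grouping explicit, which helps to spot a reference individual assigned to the wrong group or a column that uses the wrong separator.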

Unrelated individuals

For the model to work correctly, we also need to provide a set of individuals unrelated to the focal individual, so that the model can compare the probability of sharing IBD with the reference individuals to the probability of sharing IBD with unrelated individuals. These individuals must be unrelated to all the focal individuals listed in the .group file. Assuming that we have a file listing the unrelated individuals, with one ID per line, we can build the corresponding genotype file using:

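CHR=20
IN=data_chr${CHR}.bcf
OUT=data_chr${CHR}.unrelated_samples.bcf
SAMP=unrelated_samples.txt  # one sample ID per line

# Keep only the unrelated samples, write a compressed BCF, then index it
bcftools view -S "${SAMP}" -Ob -o "${OUT}" "${IN}" && bcftools index "${OUT}"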

Genotype input data

The input data must be an indexed .vcf.gz or .bcf file containing at least all the focal individuals and reference individuals listed in your .group file. To speed up computation on large datasets, you can subset your cohort file to include only those individuals using:

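CHR=20
IN=data_chr${CHR}.bcf
OUT=data_chr${CHR}.focal_and_reference_samples.bcf
SAMP=focal_and_reference_samples.txt  # one sample ID per line

# Keep only the focal and reference samples, write a compressed BCF, then index it
bcftools view -S "${SAMP}" -Ob -o "${OUT}" "${IN}" && bcftools index "${OUT}"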
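The sample list used above can be derived from the .group file itself. The following is a minimal sketch under the assumption that the .group file follows the format described earlier; the file example.group and its two lines are recreated here only for illustration. It collects the focal ID and every reference ID, removes duplicates, and writes one ID per line.

```shell
# Toy .group file (hypothetical name), two lines taken from the example above.
tmp=$(mktemp -d)
GROUP="$tmp/example.group"
SAMP="$tmp/focal_and_reference_samples.txt"
printf 'sample_1\tsample_2=sample_2\tsample_3=sample_3\n' >  "$GROUP"
printf 'sample_4\tGID_1=sample_5;sample_6\tGID_2=sample_7\n' >> "$GROUP"

# Column 1 is the focal ID; in every other column, drop the "GID=" prefix
# and split the remaining reference IDs on ";".
awk -F'\t' '{
  print $1
  for (i = 2; i <= NF; i++) {
    sub(/^[^=]*=/, "", $i)
    n = split($i, ids, ";")
    for (j = 1; j <= n; j++) print ids[j]
  }
}' "$GROUP" | sort -u > "$SAMP"

cat "$SAMP"
```

The resulting file can be passed to bcftools view -S. Note that bcftools stops with an error if a listed sample is absent from the input file unless --force-samples is given, so a failure at that step usually points to an ID mismatch between the .group file and the genotype data.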

Outputs

The THORIN tool can produce several types of output.

1. IBD per variant site

This is the basic output of the THORIN tool, also present in v1.0.0. For each variant site and each haplotype of the focal individual, it reports the probability of sharing IBD with (i) each of the reference individuals or groups listed in the .group file, and (ii) the set of unrelated individuals.

An example output file is shown below:

#CHROM	POS	IDX	CM	sample_1_sample_2_0	sample_1_sample_3_0	sample_1_HOLE_0	sample_1_sample_2_1	sample_1_sample_3_1	sample_1_HOLE_1	sample_4_GID_1_0	sample_4_GID_2_0	sample_4_HOLE_0	sample_4_GID_1_1	sample_4_GID_2_1	sample_4_HOLE_1
chr20	1	0	0.0	0.0	1.0	0.0	0.0	0.0	1.0	0.98	0.0	0.02	0.0	1.0	0.0
chr20	12	1	0.1	0.0	0.99	0.01	0.0	0.0	1.0	1.0	0.0	0.0	0.0	0.99	0.01

In the output, columns correspond to:

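CHROM : chromosome ID
POS : genomic position, as given in the input .vcf.gz or .bcf file
IDX : index of the variant, from 0 to N - 1, where N is the number of variant sites
CM : genetic position in centimorgans
additional columns : IBD probabilities, with one column per focal individual, GID and haplotype, as defined by the .group file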

The headers of the additional columns are formatted so that the first element is the focal individual ID, the second element is the GID (as specified in the .group file, or HOLE for unrelated individuals), and the third element is the haplotype number. For example:

sample_1_sample_2_0 : IBD probability between the focal individual sample_1 and the reference individual sample_2 on haplotype 0.
sample_1_HOLE_0 : IBD probability between the focal individual sample_1 and the set of unrelated individuals on haplotype 0. If this value is high, there is no IBD sharing with any of the reference individuals.
sample_4_GID_1_1 : IBD probability between the focal individual sample_4 and the reference individuals grouped in GID_1 on haplotype 1.

At each variant site, the IBD probabilities for a given haplotype of a focal individual sum to 1 across all its reference groups and HOLE. A high IBD probability with HOLE (i.e., unrelated individuals) therefore means that there is no IBD sharing with any of the specified reference individuals.
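The sum-to-1 property gives a quick sanity check on an output file. The sketch below is illustrative only: it recreates the sample_1 columns of the example output in a temporary file (the name example.ibd.txt is an assumption), selects the columns whose header starts with the focal ID, and sums them per haplotype. It assumes a tab-separated file and a haplotype number that is the single last character of the header (0 or 1).

```shell
# Toy output file restricted to the sample_1 columns of the example above.
tmp=$(mktemp -d)
OUT="$tmp/example.ibd.txt"   # hypothetical file name
printf '#CHROM\tPOS\tIDX\tCM\tsample_1_sample_2_0\tsample_1_sample_3_0\tsample_1_HOLE_0\tsample_1_sample_2_1\tsample_1_sample_3_1\tsample_1_HOLE_1\n' > "$OUT"
printf 'chr20\t1\t0\t0.0\t0.0\t1.0\t0.0\t0.0\t0.0\t1.0\n' >> "$OUT"
printf 'chr20\t12\t1\t0.1\t0.0\t0.99\t0.01\t0.0\t0.0\t1.0\n' >> "$OUT"

FOCAL=sample_1
# For each site, print POS, the sum over haplotype 0 columns, the sum over
# haplotype 1 columns, and "ok" when both sums are within 1e-6 of 1.
sums=$(awk -F'\t' -v focal="$FOCAL" '
  NR == 1 {
    for (i = 5; i <= NF; i++)
      if (index($i, focal "_") == 1) hap[i] = substr($i, length($i))
    next
  }
  {
    s0 = 0; s1 = 0
    for (i in hap) { if (hap[i] == "0") s0 += $i; else s1 += $i }
    d0 = s0 - 1; if (d0 < 0) d0 = -d0
    d1 = s1 - 1; if (d1 < 0) d1 = -d1
    flag = (d0 < 1e-6 && d1 < 1e-6) ? "ok" : "CHECK"
    print $2 "\t" s0 "\t" s1 "\t" flag
  }' "$OUT")
printf '%s\n' "$sums"
```

Matching on the prefix "sample_1_" is a simplification; it would also select columns of a focal individual whose ID itself begins with "sample_1_", so adapt the selection if your IDs overlap in that way.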

2. IBD per variant site, variant call format

In version 1.2.0, we provide an option that writes the output file directly in .vcf.gz or .bcf format. This output can be indexed and manipulated using standard tools such as bcftools. We find this format more convenient for extracting a specific focal individual and reference individual pair, or a given genomic region.

To enable it, simply give the output file name a .vcf.gz or .bcf extension.

3. IBD segments

In version 1.2.0, we provide an option to aggregate the IBD probabilities of consecutive variants into IBD segments. This is achieved using the option --ibd. It typically works well when the focal individual has one or two groups of reference individuals (GID1 and GID2 in what follows), but has not been tested with more. This corresponds to the setting of two groups of surrogate parents, one maternal and one paternal, as typically used for inferring the parental origin of haplotypes.

The output file has the following format:

CHR : chromosome ID
start : start position of the IBD segment
end : end position of the IBD segment
Prob : class of IBD
length_CM : length of the IBD segment in centimorgans
target : focal individual ID

The Prob column contains the class of the IBD segment identified:

A : haplotype 0 is shared with GID1 and haplotype 1 is shared with GID2.
B : haplotype 0 is shared with GID2 and haplotype 1 is shared with GID1.
C : neither haplotype is in IBD with GID1 or GID2.
D : both haplotypes are in IBD with the same reference individual group.
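A segment file in this format can be summarized per focal individual and class. The snippet below is a sketch under stated assumptions: it supposes that the file is tab-separated, has a header line, and lists the columns in the order given above (CHR, start, end, Prob, length_CM, target). The file name and all values are made up for illustration and are not THORIN output.

```shell
# Hypothetical segment file; every value below is invented for illustration.
tmp=$(mktemp -d)
SEG="$tmp/example.ibd_segments.txt"
printf 'CHR\tstart\tend\tProb\tlength_CM\ttarget\n' >  "$SEG"
printf 'chr20\t1\t5000\tA\t2.5\tsample_4\n'         >> "$SEG"
printf 'chr20\t5001\t9000\tB\t1.5\tsample_4\n'      >> "$SEG"
printf 'chr20\t9001\t20000\tA\t4.0\tsample_4\n'     >> "$SEG"
printf 'chr20\t1\t20000\tC\t8.0\tsample_8\n'        >> "$SEG"

# Total length in centimorgans per (target, class) pair, skipping the header.
summary=$(awk -F'\t' '
  NR == 1 { next }
  { total[$6 "\t" $4] += $5 }
  END { for (k in total) printf "%s\t%.2f\n", k, total[k] }
' "$SEG" | sort)
printf '%s\n' "$summary"
```

A large total for class C indicates that most of the chromosome of that focal individual is not covered by either reference group.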

4. IBD scaffold

In version 1.2.0, we provide an option that uses the IBD segments to directly output a scaffold file for intra- and inter-chromosomal phasing. Given the four classes of Prob described in the previous section, the scaffolding step relies on the fact that haplotype segments can be reordered according to their class. For example, consider a focal individual with two haplotype segments of class A and one haplotype segment of class B on the same chromosome. The class B segment can be reordered in the scaffold file by flipping the phase of the variants within that segment, so that these variants are on the same phase as those in the class A segments. The same principle applies to inter-chromosomal phasing, where all haplotype segments across all chromosomes are reordered to class A in the scaffold file.

This is provided by the option --scaffold.
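The phase-flipping principle can be illustrated on toy data. The sketch below is not THORIN's implementation and does not reproduce its scaffold file format: it only shows the reordering idea, using two hypothetical tab-separated files (segments with start, end and class; variants with position and phased genotype). Genotypes falling in a class B segment have their two alleles swapped, and all other genotypes are left unchanged.

```shell
tmp=$(mktemp -d)
# Hypothetical segments (start, end, class) and phased genotypes (POS, GT).
printf '1\t100\tA\n101\t200\tB\n201\t300\tA\n'        > "$tmp/segments.txt"
printf '50\t0|1\n150\t0|1\n160\t1|0\n250\t1|0\n'      > "$tmp/variants.txt"

# First file: load the segments. Second file: swap the alleles of every
# genotype located in a class B segment so that it matches the class A phase.
scaffold=$(awk -F'\t' '
  NR == FNR { s[NR] = $1 + 0; e[NR] = $2 + 0; c[NR] = $3; n = NR; next }
  {
    pos = $1 + 0
    gt = $2
    for (i = 1; i <= n; i++) {
      if (pos >= s[i] && pos <= e[i] && c[i] == "B") {
        split($2, a, "|")
        gt = a[2] "|" a[1]
      }
    }
    print $1 "\t" gt
  }' "$tmp/segments.txt" "$tmp/variants.txt")
printf '%s\n' "$scaffold"
```

In this toy example only the variants at positions 150 and 160 are flipped, since they are the only ones inside the class B segment. How classes C and D are handled is not covered by this sketch.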
