Usage
=====
You should always enter the ``isweep`` environment before running workflows: ``mamba activate isweep``.
The penultimate rule of all workflows is to copy the YAML parameters file to your analysis folder, supporting reproducibility and logging.
Most of the Python scripts under ``scripts/`` use ``argparse``. You can run them solo, with description and input support via ``python your-script.py --help``. For examples, see how scripts are run in the snakemake ``rules/``.
.. note::
You should always run the workflows on cluster nodes with ``nohup snakemake [...] --cluster "sbatch [...]" &`` to manage submissions in the background. Jobs that are not ``localrules`` could take considerable RAM and time. See :ref:`snakemake-options` and :ref:`slurm-options`.
.. _selection-scan:
Selection scan
##############
The ``worfklow/scan-selection`` implements the IBD rate selection scan with two multiple-testing corrections. You should use the ``Snakefile-scan.smk`` file as input to the ``-s`` option.
Recipe YAML files to modify are ``sequence.yaml`` and ``array.yaml`` for WGS and SNP array data, respectively. There is a hierarchy of ``change`` versus ``fixed`` parameters, where ``change`` you should modify for your dataset and ``fixed`` you should reach out for advice.
The main command is
.. code-block:: shell
nohup snakemake -s Snakefile-scan.smk [...] --cluster "sbatch [...]" --jobs XX --configfile sequence.yaml
The parameters are:
* Many parameters under ``files`` determine where your data is and where you want outputs to be.
* You can use ``chromosome_low`` and ``chromosome_high`` to determine a range of such to study. All chromosome ``.vcf.gz`` and ``.map`` must be numbered.
* ``subsample``: text file with sample IDs in VCF files, which can/likely is a subset of larger consortium dataset
* ``ibd_ends:error_rate``: set this from estimated error in pilot study of your smallest chromosomes (log files from `ibd-ends `_ software)
* ``ploidy``: if your ploidy is not 1 or 2, see :ref:`ploidy-extension`
* ``step_size_cm``: you perform a hypothesis every X.XX centiMorgans
* ``scan_cutoff``: minimum length of detected IBD segments (recommended >= 2.0 or >= 3.0)
* ``confidence_level``: the family-wise error rate you want to control (e.g., 0.05)
* ``num_sims``: simulations to derive one of the multiple-testing corrections
* ``chromosome_exclude``: empty, or a text file with chromosome numbers to not analyze
.. note::
Chromosomes smaller that the centiMorgan parameter ``fixed:isweep:chromosome_size_cutoff`` will not be analyzed. You cannot reliably estimate the autocovariance decay of IBD rates in mini chromosomes. The default is 25 cM.
The outputs are:
* ``scan.png``: the IBD rates along autosomes and the estimated significance thresholds
* ``scan.modified.ibd.tsv``: the difference in IBD rates along the autosomes in tabular format
* ``autocovariance.png```: model fit for autocovariances of IBD rates
* ``zhistogram.png``: an empirical distribution of IBD rates (should look Gaussian)
* ``roi.tsv``: summary table of the genome-wide significant loci
* ``fwer.analytical.txt``: details about parameters and estimates for multiple testing
* ``chromosome-sizes-kept.tsv``: numbers and cM sizes of chromosomes analyzed
* ``ibdsegs/``: folder and subfolders with detected IBD segments
The main selection scan figure is:
.. image:: images/scan.png
:align: center
:width: 600px
|
|
Description of key columns in the ``scan.modified.ibd.tsv`` file:
* ``BPWINDOW``: base pair location
* ``CMWINDOW``: centiMorgan location
* ``COUNT``: number of overlapping IBD segments
* ``Z``: standardized IBD rate
* ``PVALUE``: the p value which is valid asymptotically
* ``UPPER``: an IBD count-based threshold
* ``Z_UPPER``: a standardized IBD rate-based threshold
* ``GW_LEVEL``: genome-wide significance threshold
* ``_ANALYTICAL``: discrete-spacing analytical approach
* ``_SIMULATE``: Ornstein-Uhlenbeck simulation-based approach
* ``_CONTINUOUS``: continuous-spacing analytical approach (very conservative)
* ``ADJ_MEAN``: mean IBD count
* ``ADJ_STDDEV``: standard deviation of IBD count
Description of columns in the ``roi.tsv`` file:
* ``NAME``: autogenerated locus name
* ``CHROM``: chromosome number
* ``MAXCOUNT``: maximum significant IBD count in the locus
* ``MINCOUNT``: minimum significant IBD count in the locus
* ``BPCENTER``: base pair where the max IBD count is
* ``CMCENTER``: centiMorgan where the max IBD count is
* ``CMLEFT``: centiMorgan where the locus ends on the left
* ``CMRIGHT``: centiMorgan where the locus ends on the right
* ``BPLEFTCENTER``: same as ``CMLEFT`` except plus a buffer if locus is too small for hard sweep modeling
* ``MODEL``: autogenerated as additive genic selection for hard sweep modeling
* ``INTERVALCOVERAGE``: autogenerated as 95 percent confidence intervals for hard sweep modeling
.. note::
The multiple-testing corrections are valid asymptotically (Temple and Thompson, 2024+). You can look at the IBD rate histogram to visually assess such (example below). Be wary of IBD rates being zero truncated in small samples.
.. image:: images/zhistogram.png
:align: center
:width: 600px
|
|
.. note::
You can look at the ``autocovariance.png`` files to see if the IBD rate decay fits the Ornstein-Uhlenbeck well.
.. image:: images/autocovariance.png
:align: center
:width: 600px
|
|
.. note::
There is a multiprocessing version using ``Snakefile-scan-mp.smk``, which may only be useful in enormous human biobanks.
.. _modeling-hard-sweeps:
Modeling hard sweeps
##############
The ``worfklow/model-selection`` estimates frequencies, locations, and selection coefficients of loci detected in the :ref:`selection-scan`. This workflow must be run after the selection scan. You should use the ``Snakefile-roi.smk`` file as input to the ``-s`` option.
The recipe YAML file to modify is ``sweep.yaml``. There is a hierarchy of ``change`` versus ``fixed`` parameters, where ``change`` you should modify for your dataset and ``fixed`` you should reach out for advice.
The main command is
.. code-block:: shell
nohup snakemake -s Snakefile-roi.smk [...] --cluster "sbatch [...]" --jobs XX --configfile sweep.yaml
The parameters are:
* Many parameters under ``files`` determine where your data is and where you want outputs to be.
* ``regions_of_interest``: these are the loci to analyse. The default are those GW significant in the scan. You can delete some, or rename the GW significant "hits".
* ``chromosome_prefix``: this is the name ``chr`` or blank that you see when you run ``bcftools query -f "%CHROM\n" chr.vcf.gz | head``.
* ``ploidy``: if your ploidy is not 1 or 2, see :ref:`ploidy-extension`
* ``Ne``: an estimate of recent effective population sizes (IBDNe text file format)
You can change the genic selection model in ``roi.tsv`` to "a" for additive, "m" for multiplicative, "d" for dominance, and "r" for recessive. You can also change alpha, which determines the (1-alpha) percent confidence intervals.
The script ``scripts/run-ibdne.sh`` runs IBDNe, which is good for populations with exponential growth. You may want to consider another Ne estimator as well.
.. code-block:: shell
sbatch [...] run-ibdne.sh [ibdne-jar] [memory-in-Gb] [main-folder-of-study] [path-to-subfolder-with-ibd-data] [chromosome_low] [chromosome_high] [output_file] [random_seed]
The outputs are:
* ``summary.hap.norm.tsv``: sweep model estimates for best haplotype-based analysis and Gaussian confidence intervals for selection coefficient
* ``summary.snp.norm.tsv``: sweep model estimates for best SNP-based analysis and Gaussian confidence intervals for selection coefficient
* ``hit*/second.ranks.tsv.gz``: alleles with putative evidence for selection (or strong correlation with a selected allele)
* ``hit*/outlier*.txt``: files with sample haplotype IDs in clusters on excess IBD sharing
Description of columns in ``summary.*.*.tsv`` files:
* ``MAXCOUNT``: maximum IBD count from the selection scan
* ``INTERVALCOVERAGE``: The (1 - alpha) percent coverage of confidence interval (e.g. 95 percent)
* ``LOCHAT``: Base pair estimate of a putative sweeping allele
* ``COUNTIBD``: IBD count at the base pair estimate
* ``PHAT``: Sweeping allele frequency estimate
* ``SHAT``: Selection coefficient estimate
* ``CONF_INT_LOW``: If s in (A,B) is the confidence interval, this value is A
* ``CONF_INT_UPP``: If s in (A,B) is the confidence interval, this value is B
* ``MODEL``: a for additive, m for multiplicative, d for dominance, r for recessive
* ``GINI_IMPURITY``: Gini impurity of the excess IBD sharing group
* ``NUM_OUTLIER_GROUPS``: number of disconnected clusters in excess IBD sharing group
* ``PROP_IN_OUTLIER_GROUP``: fraction of sample haplotypes in excess IBD sharing group
.. note::
The Gaussian bootstrap intervals are valid asymptotically (Temple and Thompson, 2024+). You can uncomment lines in ``rule all`` of the ``Snakefile-roi.smk`` to get percentile-based bootstrap intervals.
All of the IBD clusters are in the files ``communities.csv``. Each cluster is a comma-separated cluster with sample haplotype IDs. The ``_1`` and ``_2`` suffixes correspond to the haplotypes of a diploid individual.
.. note::
If ``NUM_OUTLIER_GROUPS`` is more than a few, ``GINI_IMPURITY`` exceeds 0.6, or ``PROP_IN_OUTLIER_GROUP`` is less than 0.10, the significant locus may not be the result of a hard sweep. The ``outliers.py`` method has been modified to call the largest cluster the excess IBD group if no cluster satisfies the heuristic. This behavior could explain a small ``PROP_IN_OUTLIER_GROUP`` value. This behavior may also identify haplotypes from one sweep where there may be multiple sweeps.
Description of columns in ``second.ranks.tsv.gz`` files:
* ``POS``: marker location
* ``AAF``: allele frequency in entire sample
* ``AAF1``: allele frequency in excess IBD sharing group
* ``AAF0``: allele frequency in the rest of the sample
* ``ZDELTA``: this value is ``AAF1`` minus ``AAF0`` divided by square root ``AAF`` times one minus ``AAF``
You can use the values in this table and the notebook ``scripts/model/telltale-v2.ipynb`` to make plots like below. The decay of intermediate frequencies is good evidence of a sweep.
.. image:: images/telltale-sweep.png
:align: center
:width: 600px
|
|
.. _case-control-scan:
Case-control scan
##############
The ``worfklow/scan-case-control`` implements the difference in IBD rates scan with two multiple-testing corrections. You should use the ``Snakefile-case.smk`` file as input to the ``-s`` option.
You must run this workflow after the selection scan workflow (where the IBD segments are detected). You should scrutinize the results to see if strong selection confounds your case-control study.
The recipe YAML file to modify is ``case.yaml``. The parameters are nearly all the same as in :ref:`selection-scan`. The ``case`` parameter is a two-column text file with sample IDs and binary phenotypes.
The main command is
.. code-block:: shell
nohup snakemake -s Snakefile-case.smk [...] --cluster "sbatch [...]" --jobs XX --configfile case.yaml.
The outputs have the same nomenclature as in the selection scan workflow, but ``.case.`` and ``.control.`` is inserted in file names:
* ``scan.case.control.png``: the standardized difference in IBD rates along autosomes and the estimated significance thresholds
* ``scan.case.ibd.tsv``: the difference in IBD rates along the autosomes in tabular format
* ``roi.case.tsv``: summary table of the genome-wide significant loci
* ``fwer.analytical.case.txt``: details about parameters and estimates for multiple testing
.. note::
The multiple-testing corrections are valid asymptotically (Temple and Thompson, 2024+). You can look at the IBD rate histogram to visually assess such. Be wary of IBD rates being zero truncated in small samples.
.. note::
There is a multiprocessing version using ``Snakefile-case-mp.smk``, which may only be useful in enormous human biobanks.
.. note::
There is a version with a randomization test using ``Snakefile-case-randomized.smk`` and ``case.randomized.yaml`` as the template YAML containing a parameter ``num_randomized``. This could be computationally intensive in enormous human biobanks.
.. note::
The randomization test is not a valid way to get the genome-wide significance threshold if there are true positive signals or loci under strong selection. You can identify confounding in the output ``scan.randomized.ibd.tsv`` file if there are loci significant in many runs of randomized phenotypes.
You can try to detect clusters of cases or controls with excess IBD sharing GW significant loci using ``Snakefile-case-roi.smk`` and the template ``--configfile case.roi.yaml``.
The output to this feature will be a tab-separated file with sample haplotype IDs, their binary phenotype, and indicators if they are in excess IBD sharing groups (``matrix.outlier.phenotypes.tsv`` for each hit). An example of this file is ``design.sorted.tsv``. You could perform regression analyses on these dataframes. Scripts ``scripts/utilities/fake-phenotypes-*.py`` can be used for testing and evaluating confounding from strong recent selection.
You can also look at the sample haplotype IDs in the ``hit*/outlier*.phenotype.tsv`` files.
.. note::
I tested that ``Snakefile-case-roi.smk`` runs smoothly, but not if it works well at its task in a simulation study.
.. _phasing-and-ancestry:
Phasing and ancestry
##############
This ``worfklow/phasing-ancestry`` provides support for automated haplotype phasing (`Beagle `_), local ancestry inference (`Flare `_), and kinship inference (`IBDkin `_).
The main command is
.. code-block:: shell
nohup snakemake -s Snakefile-*.smk [...] --cluster "sbatch [...]" --jobs XX --configfile phasing-and-lai.yaml
The Snakefiles are:
* ``Snakefile-beagle-flare-gds.smk``: your data is stored as GDS files, and you want to phase as well as LAI and IBD inference
* ``Snakefile-beagle-flare-vcf.smk``: your data is stored as VCF files, and you want to phase as well as LAI and IBD inference
* ``Snakefile-flare-only-gds.smk``: your data is already phased in GDS files, and you want to perform LAI and IBD inference
* ``Snakefile-flare-only-vcf.smk``: your data is already phased in VCF files, and you want to perform LAI and IBD inference
The YAML example file is ``phasing-and-lai.yaml``. Most of the parameters are written exactly as the parameters in `Beagle `_, `Flare `_, or `hap-ibd `_. Other parameters define file locations. The remaining parameters are:
* ``rename-chrs-map-adx``, ``rename-chrs-map-ref``: harmonizes 9 vs chr9 in VCF CHROM column with genetic map. Files are in ``rename-chrs/``. The ``num-chrnum.txt`` means 9 is in the VCF column, but chr9 is in the genetic map column.
* ``ref-panel-map``: tab-separated, headerless file with reference sample ID (column 1) and reference panel label (column 2)
* ``keep-samples``: the sample IDs to phase, LAI, and IBD infer, which may be a subset of a larger consortium dataset
* ``bcftools-parameters:c-min-mac``: minimum minor allele count, where 1 and 2 are incredibly difficult to phase
.. note::
We strongly recommend against setting ``flare-parameter:probs`` equal to ``true``, which may create enormous file sizes and require a lot of RAM.
.. note::
The output files are in ``gtdata/``, ``lai/``, and ``ibdsegs/``. Rephasing is unphasing the reference panel and phasing them again with all the admixed samples; reference phasing is using the existing phase of the reference panel. Rephasing takes longer and creates more disk memory. You can uncomment or comment these output files in the ``rule all`` of the Snakefile.
.. note::
You can use ``run-ibdkin.sh`` (with `IBDkin `_), ``high-kinship.py``, and ``keep-one-family-member.py`` in ``scripts/pre-processing/`` to filter out close relatives, say kinship >= 0.125. These scripts are not documented, so I recommend copy and paste into an LLM and ask it what these do.
.. _publication-ready-figures:
Publication-ready Figures
##############
Here, I provide some examples terminal commands to make publication-ready figures. The pipelines make some initial plots, but you may want to modify labels and colors. Many of the terminal options are the options in the Python ``matplotlib`` library.
This script makes figures for selection scans.
.. code-block:: shell
python scripts/plotting/plot-scan-pipeline.py \
--input_file scan.modified.ibd.tsv \
--output_prefix scan \
--sample_size 13778 \
--ploidy 2 \
--heuristic_cutoff 4. \
--num_sims 500 \
--chr_low 1 \
--chr_high 22 \
--statistic COUNT \
--title "TOPMed European ancestry" \
--xlabel "Base pair along autosomes" \
--ylabel "IBD rate" \
--rotation 90 \
--yupp 2e-4 \
--alpha 0.25 \
--fontsize 14 \
--width 12 \
--height 4
.. note::
If ``--num_sims`` is 0, no simulation-based threshold will be plotted. Any value of ``--num_sims`` greater than 0 will plot the simulation-based threshold in ``scan.modified.ibd.tsv``.
.. note::
You could use this script, or ``scripts/plotting/plot-scan.py`` as a general use plotter. However, you would need be careful with the ``--sample_size`` and ``--statistic`` parameter. For instance, you could augment the ``scan.case.ibd.tsv`` file with a simulation-based threshold.
This script makes figures for IBD mapping / case-control scans.
.. code-block:: shell
python scripts/plotting/plot-scan-case-pipeline.py \
--input_file scan.case.ibd.tsv \
--output_prefix scan.case \
--chr_low 1 \
--chr_high 22 \
--statistic ZDIFFZ \
--title "ADSP European ancestry" \
--xlabel "Base pair along autosomes" \
--ylabel "Standardized IBD rate difference" \
--rotation 90 \
--yupp 8. \
--alpha 0.25 \
--fontsize 14 \
--width 12 \
--height 4
This script plots the empirical distribution of IBD rates of IBD rate differences.
.. code-block:: shell
python scripts/plotting/plot-histogram.py \
--input_file scan.modified.ibd.tsv \
--output_file histogram.png \
--chr_low 1 \
--chr_high 22 \
--chrom CHROM \
--statistic COUNT \
--outlier_cutoff 1000
.. note::
We reuse this plotter for the case-control scan, changing ``--statistic`` to ``ZDIFFZ``.
.. note::
The ``--outlier_cutoff`` option, in normally-distributed quantiles, is used to remove outliers before standardizing.
This script plots autocovariance by genetic distance.
.. code-block:: shell
python scripts/plotting/plot-autocovariance.py \
--input_autocov_file fwer.autocovariance.tsv \
--input_analytical_file fwer.analytical.tsv \
--output_prefix autocovariance \
--theta_type estimated-theta: \
--title "TOPMed European ancestry" \
--xlabel "Genetic distance (cM)" \
--ylabel "Autocovariance" \
--yupp 1.5 \
--color tab:blue \
--width 6.4 \
--height 4.8
.. note::
If your ``--input_analytical_file`` is ``fwer.analytical.case.tsv``, you may use ``estimated-theta0:`` or ``estimated-theta1:`` to plot results for case and control sample sets. You must also modify the ``input_autocov_file`` parameter.
.. note::
We reuse this plotter for the case-control scan, changing the ``--input_autocov_file`` and ``--input_analytical_file``.
This script makes figures for sweep modeling.
.. code-block:: shell
python scripts/plotting/plot-sweep.py \
--output_file sweep.LCT.png \
--s 0.03 \
--su 0.04 \
--z \
--p 0.70 \
--Ne ibdne.ne \
--standing_variation -0.01 \
--genetic_model a \
--ploidy 2 \
--nboot 1000 \
--upper_quantile 0.99 \
--lower_quantile 0.01 \
--xaxis_length 150 \
--line_color tab:blue \
--font_size 14 \
--alpha 0.25 \
--title "LCT gene"
The parameters are:
- ``--s`` estimate of selection coefficient
- ``--su`` the upper bound of selection coefficient confidence interval
- ``--z`` this corresponds to to the quantile of a N(0,1), e.g. 1.96 for 95 percent confidence intervals
- ``--p`` estimate of sweeping allele frequency
- ``--Ne`` file with estimates of population effective sizes
- ``--standing_variation`` positive selection stops once this allele frequency is reached
- ``--genetic_model`` a for additive, m for multiplicative, d for dominance, r for recessive
- ``--nboot`` number of Wright-Fisher simulations with selection
- ``--upper_quantile`` and ``--lower_quantile`` concern the sweep frequency intervals
- ``--xaxis_length`` is the number of generations
.. note::
You should get these parameters from ``summary.hap.norm.tsv`` or ``summary.snp.norm.tsv`` files. The plotter assumes normally-distributed confidence interval, which is only reasonable when Temple and Thompson conditions hold.