There are a number of parameters I always want to run for snakemake, and writing them on the command line every time can be tedious. For example, I nearly always want snakemake to “keep going” (--keep-going
) with independent jobs, even if a single one fails. I nearly always use dedicated conda environments for each rule, so I’d like --use-conda
to be the default execution.
Luckily, that’s where --profile
comes in! You can now set up a snakemake profile with these bits of information. If you use a job scheduler, you can also set default partition, cpu, and memory resources to submit to the scheduler.
Building your profile
For jobs executed without a job scheduler
If you don’t use a job scheduler (or you sometimes run jobs directly, e.g. on an interactive session) you can build your own profile like so:
mkdir -p ~/.config/snakemake/default
touch ~/.config/snakemake/default/config.yaml
Then, open the config.yaml file using your favorite text editor, and add settings you like. Here’s my default
profile config:
# non-slurm profile defaults
restart-times: 3
local-cores: 1
latency-wait: 60
use-conda: True
jobs: 1
keep-going: True
rerun-incomplete: True
shadow-prefix: /scratch/ntpierce
printshellcmds: True
If you don’t want to set up another profile for a job scheduler, skip to the run section below :).
For job schedulers
Snakemake profiles for a number of job schedulers (e.g. slurm, pbs) can be found here. I followed the instructions to set up a SLURM profile for our farm
hpc at uc davis, so I’ll walk through set up for that.
Make sure cookiecutter
is available. For me, that requires conda activate snakemake
.
mkdir -p ~/.config/snakemake
cd ~/.config/snakemake
cookiecutter https://github.com/Snakemake-Profiles/slurm.git
Follow the setup prompts. If you don’t know what to write, leave it blank - you can always edit the information later (this is what I did).
Go find your profile. Mine is at ~/.config/snakemake/slurm
. You should see the following files in your profile folder:
config.yaml
slurm-jobscript.sh
slurm-status.py
slurm-submit.py
slurm_utils.py
The main settings are in config.yaml
For slurm, that currently looks like this:
restart-times: 3
jobscript: "slurm-jobscript.sh"
cluster: "slurm-submit.py"
cluster-status: "slurm-status.py"
max-jobs-per-second: 1
max-status-checks-per-second: 10
local-cores: 1
latency-wait: 60
Now, add or modify anything you need to. I added the following:
# snakemake settings I like
use-conda: True
jobs: 10
keep-going: True
rerun-incomplete: True
shadow-prefix: /scratch/ntpierce
printshellcmds: True
NOTE: For now, I actually commented out the the slurm-status.py
line, as sacct
isn’t enabled on our system, and the output was very verbose.
Setting default resource utilization
Using your favorite text editor, open a new file (mine is: ~/.config/snakemake/slurm/cluster_config.yml
)
Write in your desired default resources. For me, this file contains:
__default__:
account: ctbrowngrp # your hpc account
partition: bml # the partition you use
mail-user: ntpierce@ucdavis.edu # your email (optional!)
time: 360 # default time (minutes)
nodes: 1
ntasks: 1
mem: 14GB # default memory
snakemake will apply anything in __default__
to all rules, unless they specifically override the parameters.
Now edit the slurm-submit.py
file so the CLUSTER_CONFIG
variable points to the file you just wrote. For me:
CLUSTER_CONFIG = "/home/ntpierce/.config/snakemake/slurm/cluster_config.yml"
Run snakemake using your profile
To execute using a profile, use:
snakemake --profile slurm
or
snakemake --profile default
Voila! No need to keep writing --keep-going
on the command line anymore!
Setting Rule-specific Resource Utilization
I don’t always want to execute my rules with the default resources. Sometimes I’d like to allocate additional time or memory, or scale back the memory (my default is rather high).
You can do this for all jobs in a snakefile by setting different default resources (--default-resources
) at the command line. However, I prefer to set job-specific resources, and you can do that within the rules themselves, using the following parameters:
threads: 1
resources:
mem_mb=4000, #4GB
runtime=60 #minutes
These resources are used to specify the cpu, memory and runtime allocated via SLURM. Here’s what they look like in the context of a rule.
rule seqtk_fasta_to_fastq:
input: "sample_{rep}.fasta.gz"
output: "{refname}_{rep}.fq.gz"
threads: 1
resources:
mem_mb=1000,
runtime=10
conda: "envs/seqtk-env.yml"
shell:
"""
seqtk seq -F 'I' {input} | gzip -9 > {output}
"""
This rule converts a fasta file to a fastq file by providing dummy quality values. It’s being run on a single thread, given 1GB memory and 10 minutes to run.
Advanced Resource Specification
You can also set the resources based on the number of attempts
for running a specific command.
Example from the documentation:
rule:
input: ...
output: ...
resources:
mem_mb=lambda wildcards, attempt: attempt * 100
shell:
"..."
Here, if you have the restart-times
set to 3, then the rule will run with 100mb of memory on its first attempt, 200mb on second attempt, and 300mb on third attempt.
For more information, see the resources documentation!
What’s Missing?
It would be nice to have a default profile that is used without specifying --profile
(thanks for this idea, Titus!)
When I was writing rule-specific --cluster-config
files, I could set the rulename and slurm output files a little more intelligently, using snakemake rulenames and wildcards.
For example, here was my config for a rule called polyA_trim
:
polyA_trim:
partition: bmm
time: 2:00:00 # time limit for each job
nodes: 1
cpus_per_task: 10
mem: 5GB
chdir: /home/ntpierce/2020-pep/orthopep_out
stderr: "logs/cutadapt/slurm-%j.stderr"
stdout: "logs/cutadapt/slurm-%j.stdout"
jobname: "{rule}.w{wildcards}"
Luckily, it looks like a solution for this is in progress here!
Thanks to @johanneskoester for kindly answering my questions on resources and to everyone working on snakemake & snakemake profiles!