SnakeMake rule with Python script, conda and cluster


Problem description

I would like to get snakemake to run a Python script in a specific conda environment via an SGE cluster.

On the cluster, I have miniconda installed in my home directory. My home directory is mounted via NFS, so it is accessible to all cluster nodes.

Because miniconda is in my home directory, the conda command is not on the operating system path by default; i.e., to use conda I first need to add it to the path explicitly.

I have a conda environment specification as a YAML file, which could be used with the --use-conda option. Will this also work with the --cluster "qsub" option?
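For concreteness, a minimal sketch of the invocation I have in mind (the YAML is referenced from the rule's conda: directive; the job count is a placeholder):

    # hypothetical: combine per-rule conda environments with SGE submission
    snakemake --use-conda \
              --cluster "qsub -V -cwd" \
              --jobs 20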

FWIW, I also launch snakemake from a conda environment (in fact, the same environment in which I want to run the script).

Recommended answer

I have an existing Snakemake system, running conda, on an SGE cluster. It's delightful and very doable. I'll try to offer perspective and guidance.

The location of your miniconda, local or shared, may not matter. If you are using a login to access your cluster, you should be able to update your default variables upon logging in; this will have a global effect. If possible, I highly suggest editing the default settings in your .bashrc to accomplish this. That will properly, and automatically, set up your conda path upon login.

One of the lines in my file, "/home/tboyarski/.bashrc":

 export PATH=$HOME/share/usr/anaconda/4.3.0/bin:$PATH

EDIT 1: Good point made in a comment.

Personally, I consider it good practice to put everything under conda control; however, this may not be ideal for users who commonly require access to software not supported by conda. Typically, support issues have to do with old operating systems (e.g., CentOS 5 support was recently dropped). As suggested in the comment, manually exporting the PATH variable in a single terminal session may be more suitable for users who do not work on pipelines exclusively, as this will not have a global effect.
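A one-line sketch of that per-session alternative (the miniconda path below is an assumption; substitute your own):

    # per-session alternative: prepend conda to PATH for the current shell only
    export PATH="$HOME/miniconda3/bin:$PATH"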

With that said, like myself prior to Snakemake execution, I recommend initializing the conda environment used by the majority, or the entirety, of your pipeline. I find this the preferred way as it allows conda to create the environment, instead of getting Snakemake to ask conda to create the environment. I don't have the link to the web discussion, but I believe I read somewhere that individuals who relied only on Snakemake to create the environments, rather than launching from a base environment, found that the environments were being stored in the .snakemake directory and that it was getting excessively large. Feel free to look for the post. The issue was addressed by the author, who reduced the load on the hidden folder, but still, I think it makes more sense to launch the jobs from an existing Snakemake environment, which interacts with your head node and then passes the corresponding environment variables to its child nodes. I like a bit of hierarchy.
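As a rough sketch of that launch pattern (the environment name and YAML file name are placeholders, not my actual files):

    # create the pipeline environment once from its YAML spec, then launch
    # Snakemake from inside it so submitted jobs can inherit its variables
    conda env create -n pipeline -f envs/pipeline.yaml
    source activate pipeline          # `conda activate pipeline` on newer conda
    snakemake --jobs 20 --cluster "qsub -V -S /bin/bash"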

With that said, you will likely need to pass the environment to your child nodes if you are running Snakemake from your head node's environment and letting Snakemake interact with the SGE job scheduler via qsub. I actually use the built-in DRMAA feature, which I highly recommend. Both submission methods require me to provide the following arguments:

   -V     Available for qsub, qsh, qrsh with command and qalter.

         Specifies that all environment variables active within the qsub
          utility be exported to the context of the job.

Also...

  -S [[hostname]:]pathname,...
         Available for qsub, qsh and qalter.

         Specifies the interpreting shell for the job.  pathname must be
          an executable file which interprets command-line options -c and
          -s as /bin/sh does.

To give you a better starting point, I also specify virtual memory and core counts; this might be specific to my SGE system, I don't know.

-V -S /bin/bash -l h_vmem=10G -pe ncpus 1

I highly expect you'll require both arguments when submitting to the SGE cluster, as I do personally. I recommend putting your cluster submission variables in JSON format, in a separate file. The code snippet above can be found in an example of what I've done personally. I've organized it slightly differently than in the tutorial, but that's because I needed a bit more granularity.
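A minimal sketch of that separation (the file name, keys, and values below are illustrative, not my real configuration):

    # hypothetical cluster.json holding a default resource profile:
    #     { "__default__": { "h_vmem": "10G", "ncpus": 1 } }
    # its values are referenced from the submission command as {cluster.<key>}
    snakemake --jobs 20 \
              --cluster-config cluster.json \
              --cluster "qsub -V -S /bin/bash -l h_vmem={cluster.h_vmem} -pe ncpus {cluster.ncpus}"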

Personally, I only use the --use-conda option when running a conda environment different from the one I used to launch and submit my Snakemake jobs. For example, my main conda environment runs python 3, but if I need to use a tool that requires python 2, I will then, and only then, use Snakemake to launch the rule with that specific environment, so that the execution of that rule uses a path corresponding to a python2 installation. This was of huge importance to my employer, as the existing system I was replacing struggled to switch seamlessly between python2 and 3; with conda and snakemake, this is very easy.
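A hedged sketch of that case (the environment file and rule directive are assumptions, not taken from my pipeline):

    # assuming a rule in the Snakefile carries a per-rule directive such as
    #     conda: "envs/py2.yaml"
    # where envs/py2.yaml pins python=2.7, running with --use-conda lets
    # Snakemake build and activate that python2 environment for that rule only
    snakemake --use-conda --jobs 20 --cluster "qsub -V -S /bin/bash"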

In principle, I think it is good practice to launch a base conda environment and run Snakemake from there. It encourages the use of a single environment for the entire run. Keep it simple, right? Complicate things only when necessary, like when needing to run both python2 and python3 in the same pipeline. :)
