Processing a sub-list of variable size within a larger list

Question

I'm a biological engineering PhD student trying to teach myself Python so I can automate part of my research, but I've run into a problem with processing sub-lists within a bigger list that I can't seem to solve.

Basically, my goal is to write a small script that processes a CSV file containing a list of plasmid sequences that I'm building using various DNA assembly methods, and then spits out the primer sequences I need to order to build each plasmid.

Here's the situation I'm dealing with:

When I want to build a plasmid, I have to enter its full sequence into my Excel spreadsheet. I have to choose between two DNA assembly methods, called "Gibson" and "iPCR". Each "iPCR" assembly requires only one line in the list, so I already know how to process those: I just put the full sequence of the plasmid I'm trying to build in one cell. "Gibson" assemblies, on the other hand, require me to split the full DNA sequence into smaller chunks, so I sometimes need 2-5 lines in the spreadsheet to fully describe one plasmid.

So I end up with a spreadsheet that looks something like this:

Construct  Strategy  Name
1          Gibson    P(OmpC)-cI::P(cI)-LacZ controller
1          Gibson    P(OmpC)-cI::P(cI)-LacZ controller
1          Gibson    P(OmpC)-cI::P(cI)-LacZ controller
2          iPCR      P(cpcG2)-K1F controller with K1F pos. feedback
3          Gibson    P(cpcG2)-K1F controller with swapped promoter positions
3          Gibson    P(cpcG2)-K1F controller with swapped promoter positions
4          iPCR      P(cpcG2)-K1F controller with stronger K1F RBS library

I think the list at this length is representative enough.

So the problem I'm running into is, I'd like to be able to run through the list and process the Gibsons, but I can't seem to get the code to work the way I want. Here's the code I've written so far:

#import BioPython Tools
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

#import csv tools
import csv
import sys
import os

with open('constructs-to-make.csv', 'rU') as constructs:
    construct_list = csv.reader(constructs, delimiter=',')
    construct_list.next()
    construct_number = 1
    primer_list = []
    temp_list = []
    counter = 2

    for row in construct_list:
        print('Current row is row number ' + str(counter))
        print('Current construct number is ' + str(construct_number))
        print('Current assembly type is ' + row[1])
        if row[1] == "Gibson": #here, we process the Gibson assemblies first
            print('Current construct number is: #' + row[0] + ' on row ' + str(counter) + ', which is a Gibson assembly')
##            print(int(row[0]))
##            print(row[3])
            if int(row[0]) == construct_number:
                print('Adding DNA sequence from row ' + str(counter) + ' for construct number ' + row[0])
                temp_list.append(str(row[3]))
                counter += 1
            if int(row[0]) > construct_number:
                print('Current construct number is ' + str(row[0]) + ', which is greater than the current construct number, ' + str(construct_number))
                print('Therefore, going to work on construct number ' + str(construct_number))
                for part in temp_list: #process the primer design work here
                    print('test')
##                    print(part)
                construct_number += 1
                temp_list = []
                print('Adding DNA from row #' + str(counter) + ' from construct number ' + str(construct_number))
                temp_list.append(row)
                print('Next construct number is number ' + str(construct_number))
                counter += 1
##            counter += 1
        if str(row[1]) == "iPCR":
            print('Current construct number is: ' + row[0] + ' on row ' + str(counter) + ', which is an iPCR assembly.')
            #process the primer design work here
            #get first 60 nucleotides from the sequence
            sequence = row[3]
            fw_primer = sequence[1:61]
            print('Sequence of forward primer:')
            print(fw_primer)
            last_sixty = sequence[-60:]
##            print(last_sixty)
            re_primer = Seq(last_sixty).reverse_complement()
            print('Sequence of reverse primer:')
            print(re_primer)
            #ending code: add 1 to counter and construct number
            counter += 1
            construct_number += 1
##            if int(row[0]) == construct_number:
##        else:
##            counter += 1
##            construct_number += 1
##    print(temp_list)

##        for row in temp_list:
##    print(temp_list)        
##    print(temp_list[-1])
#                fw_primer = temp_list[counter - 1].

(I know the code probably looks noob - I've never done any programming class beyond introductory Java.)

The problem with this code is that if I have n "constructs" (a.k.a. plasmids) that I'm trying to build by "Gibson" assembly, it will process the first n-1 plasmids, but not the last one. I can't think of a better way to write this code. For the workflow I'm trying to implement, though, knowing how to process "n" things in a list, where each "thing" spans a variable number of rows, would come in really handy.

I'd really appreciate anybody's help here! Thanks a lot!

Recommended answer

Just some general coding help with Python. If you haven't read PEP 8, do so.

To maintain clear code it can be helpful to assign variables to fields referenced in a record/row.

I would add something like this for any field referenced:

construct_idx = 0
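
As a rough sketch of how that might look for the other columns the question's script touches: the names `strategy_idx` and `sequence_idx` are hypothetical, the column positions are taken from the `row[0]`, `row[1]` and `row[3]` references in the question, and the example row is made up.

```python
# Hypothetical names for the column positions used in the question's script.
construct_idx = 0  # construct number, was row[0]
strategy_idx = 1   # "Gibson" or "iPCR", was row[1]
sequence_idx = 3   # DNA sequence, was row[3]

# Made-up example row; the sequence is a placeholder, not real data.
row = ['2', 'iPCR', 'example name', 'ATGC']

construct_number = int(row[construct_idx])
strategy = row[strategy_idx]
sequence = row[sequence_idx]

print('Construct #{} uses {} assembly'.format(construct_number, strategy))
```

If the spreadsheet layout ever changes, only the three index assignments need updating.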

Also, I would recommend using string formatting; it's cleaner.

So:

print('Current construct number is: #{} on row {}, which is a Gibson assembly'.format(row[construct_idx], counter))

Instead of:

print('Current construct number is: #' + row[0] + ' on row ' + str(counter) + ', which is a Gibson assembly')

If you're creating a csv reader object, naming the variable "*_list" can be misleading. Calling it "*_reader" is more intuitive.

construct_reader = csv.reader(constructs, delimiter=',')

Instead of:

construct_list = csv.reader(constructs, delimiter=',')
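
On the actual question of handling a variable number of rows per construct: the code in the question only processes a group of Gibson rows when it reaches a row with a higher construct number, so a group at the end of the file is never handled. One possible approach, not necessarily the only one, is to let `itertools.groupby` from the standard library collect consecutive rows that share a construct number. The sketch below assumes Python 3, assumes that rows of the same construct are adjacent (as in the example spreadsheet), and assumes the sequence sits in the fourth column because the question reads `row[3]`. The helper name `group_constructs` and the in-memory CSV data are made up for illustration.

```python
import csv
import io
from itertools import groupby

construct_idx = 0
strategy_idx = 1
sequence_idx = 3  # assumption: the question's code reads the sequence from row[3]


def group_constructs(rows):
    """Yield (construct_number, strategy, sequences) once per construct.

    Relies on rows of the same construct being next to each other,
    because groupby only merges consecutive items with equal keys.
    """
    for number, group in groupby(rows, key=lambda row: row[construct_idx]):
        group = list(group)
        strategy = group[0][strategy_idx]
        sequences = [row[sequence_idx] for row in group]
        yield int(number), strategy, sequences


# Made-up stand-in for constructs-to-make.csv; the sequences are placeholders.
example_csv = io.StringIO(
    "Construct,Strategy,Name,Sequence\n"
    "1,Gibson,A,AAAA\n"
    "1,Gibson,A,CCCC\n"
    "1,Gibson,A,GGGG\n"
    "2,iPCR,B,TTTT\n"
    "3,Gibson,C,ACGT\n"
    "3,Gibson,C,TGCA\n"
)

construct_reader = csv.reader(example_csv, delimiter=',')
next(construct_reader)  # skip the header row

constructs = list(group_constructs(construct_reader))
for number, strategy, sequences in constructs:
    print('Construct #{} ({}) has {} part(s)'.format(number, strategy, len(sequences)))
```

Because each group is yielded as a whole, the last construct in the file is treated like every other one, and the primer design for Gibson and iPCR constructs can branch on `strategy` inside the final loop.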
