Filtering a CSV file in Python
Question
I have downloaded this csv file, which contains a spreadsheet of gene information. What matters is the gene information in the HLA-* columns. If a gene's resolution is too low, e.g. DQB1*03, the row should be deleted. If the resolution is too high, e.g. DQB1*03:02:01, the :01 tag at the end needs to be removed. Ideally, I want the proteins in the format DQB1*03:02, with two levels of resolution after DQB1*. How can I tell Python to look for these formats regardless of the actual data stored in them?
e.g.
if (csvCell is of format DQB1*03:02:01):
    delete the :01 # but do this in a general format
elif (csvCell is of format DQB1*03):
    delete row
else:
    goto next line
UPDATE: Edited code I referenced
import csv
import re

# Python 3: open CSV files in text mode with newline=''; the output file needs write mode.
csvdictreader = csv.DictReader(open('mhc.csv', newline=''), delimiter=',')
csvdictwriter = csv.DictWriter(open('mhc_fixed.csv', 'w', newline=''), fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-D')]
for rowfields in csvdictreader:
    keep = True
    for field in targets:
        value = rowfields[field]
        if re.match(r'^\w+\*\d\d$', value):  # gene resolution too low
            keep = False
            break  # quit processing target fields
        elif re.match(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', value):
            rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
        else:  # reduce gene resolution if too high
            # by only keeping first two alleles if three are present
            rowfields[field] = re.sub(r'^(\w+)\*(\d+):(\d+):(\d+)$', r'\1*\2:\3', value)
    if keep:
        csvdictwriter.writerow(rowfields)
Recommended answer
Here's something that I think will do what you want. It's not as simple as Peter's answer because it uses Python's csv module to process the file. It could probably be rewritten and simplified to treat the file as plain text, as his does, but that should be easy.
import csv
import re
import sys

# Read the CSV from stdin and write the filtered result to stdout.
csvdictreader = csv.DictReader(sys.stdin, delimiter=',')
csvdictwriter = csv.DictWriter(sys.stdout, fieldnames=csvdictreader.fieldnames, delimiter=',')
csvdictwriter.writeheader()
targets = [name for name in csvdictreader.fieldnames if name.startswith('HLA-')]
for rowfields in csvdictreader:
    keep = True
    for field in targets:
        value = rowfields[field]
        if re.match(r'^DQB1\*\d\d$', value):  # gene resolution too low?
            keep = False
            break  # quit processing target fields
        else:  # reduce gene resolution if too high
            # by only keeping first two alleles if three are present
            rowfields[field] = re.sub(r'^DQB1\*(\d\d):(\d\d):(\d\d)$',
                                      r'DQB1*\1:\2', value)
    if keep:
        csvdictwriter.writerow(rowfields)
The hardest part for me was determining what you wanted to do.
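The patterns above are tied to the DQB1 gene and to exactly three two-digit fields. If the HLA-* columns hold other genes or more fields, the same idea might be generalized along the following lines. This is only a sketch, assuming Python 3 and assuming every allele looks like a name, a `*`, and colon-separated numbers; `ALLELE_RE`, `normalize_allele` and `filter_rows` are hypothetical names introduced here, not part of the original answer.

```python
import csv
import io
import re

# Assumed allele shape: gene name, '*', then one or more colon-separated numeric fields.
ALLELE_RE = re.compile(r'^(\w+)\*(\d+(?::\d+)*)$')


def normalize_allele(value):
    """Trim an allele to two fields; return None if it has only one field.

    Values that do not look like an allele (e.g. empty cells) are returned unchanged.
    """
    match = ALLELE_RE.match(value)
    if match is None:
        return value
    gene, fields = match.group(1), match.group(2).split(':')
    if len(fields) < 2:
        return None  # resolution too low: caller should drop the row
    return '{}*{}'.format(gene, ':'.join(fields[:2]))


def filter_rows(reader, prefix='HLA-'):
    """Yield rows from a csv.DictReader with the target columns normalized."""
    targets = [name for name in reader.fieldnames if name.startswith(prefix)]
    for row in reader:
        cleaned = {name: normalize_allele(row[name]) for name in targets}
        if None in cleaned.values():
            continue  # at least one gene was too low resolution
        row.update(cleaned)
        yield row


# Small in-memory example; the column names and values are made up for illustration.
sample = "id,HLA-DQB1\n1,DQB1*03\n2,DQB1*03:02:01\n3,DQB1*03:02\n"
rows = list(filter_rows(csv.DictReader(io.StringIO(sample))))
for row in rows:
    print(row['id'], row['HLA-DQB1'])
```

Returning None for low-resolution values keeps the "drop the row" decision in one place, and cells that do not match the assumed allele shape pass through untouched rather than being silently rewritten.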