Delete or remove unexpected records and strings based on multiple criteria using a Python or R script
Problem description

I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete the unnecessary records/rows and strings based on multiple criteria using a Python or R script, and save the remaining records to a new .csv file named resultFile.csv.

What I want to do is as follows:

- Delete the first column.
- Split column BB into two columns named a_id and c_id (shown as b_id in the expected output below). Separate the value at the _ (underscore): the left side goes to a_id and the right side goes to c_id.
- Keep only records that have the .csv file extension in the files column but do not contain No Bi in the cut column.
- Assign a new name to each of the columns.
- Delete the records that contain strings like less in the CC column.
- Trim all other unnecessary strings from the records.
- Delete the remaining fields of each row after "Mi" is found in that row.
My fileOne.csv is as follows:

    AA  BB       CC     DD    EE    FF    GG
    1   1_1.csv  (=0    =10"  27"   =57   "Mi"  0.97  0.9  0.8  NaN  0.9  od  0.2
    2   1_3.csv  (=0    =10"  27"   "Mi"  0.5   0.97  0.5  0.8  NaN  0.9  od  0.4
    3   1_6.csv  (=0    =10"  "Mi"  =53   cnt   0.97  0.9  0.8  NaN  0.9  od  0.6
    4   2_6.csv  No Bi  000   000   000   000
    5   2_8.csv  No Bi  000   000   000   000
    6   6_9.csv  less   000   000   000   000
    7   7_9.csv  s(=0   =26"  =46"  "Mi"  121
My first expected result file would be as follows:

    a_id  b_id  CC  DD  EE  FF  GG
    1     1     0   10  27  57  Mi
    1     3     0   10  27  Mi  0.5
    1     6     0   10  Mi  53  cnt
    7     9     0   26  46  Mi  121

My final expected result file would be as follows:

    a_id  b_id  CC  DD  EE  FF  GG
    1     1     0   10  27  57
    1     3     0   10  27
    1     6     0   10
    7     9     0   26  46
Solution

This can be achieved with the following Python script:

    import csv
    import re
    import string

    output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

    # Python 2 translation tables: the identity table, and the set of
    # all characters except digits (used as the characters to delete)
    sanitise_table = string.maketrans("", "")
    nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

    def sanitise_cell(cell):
        return cell.translate(sanitise_table, nodigits_table)   # Keep digits

    with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
        csv_input = csv.reader(f_input)
        csv_output = csv.writer(f_output)

        input_header = next(csv_input)   # Skip the original header row
        csv_output.writerow(output_header)

        for row in csv_input:
            bb = re.match(r'(\d+)_(\d+)\.csv', row[1])

            if bb and row[2] not in ['No Bi', 'less']:
                # Blank out 'Mi' and all columns after it, if present
                try:
                    mi = row.index('Mi')
                    row[:] = row[:mi] + [''] * (len(row) - mi)
                except ValueError:
                    pass

                row[:] = [sanitise_cell(col) for col in row]
                row[0] = bb.group(1)
                row[1] = bb.group(2)
                csv_output.writerow(row)
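To see what the two building blocks of that script do in isolation, here is a minimal sketch run on cells taken from the sample data. The helper names `split_ids` and `digits_only` are hypothetical and not part of the answer, and `digits_only` uses a plain character filter as a stand-in for the Python 2 translation tables.

```python
import re
import string

# Same pattern as in the script: "<digits>_<digits>.csv" at the start of the cell.
BB_PATTERN = re.compile(r'(\d+)_(\d+)\.csv')

def split_ids(bb_cell):
    """Return (a_id, b_id) from a BB cell such as '1_3.csv', or None if it does not match."""
    match = BB_PATTERN.match(bb_cell)
    return match.groups() if match else None

def digits_only(cell):
    """Drop every character that is not an ASCII digit (stand-in for sanitise_cell)."""
    return ''.join(ch for ch in cell if ch in string.digits)

print(split_ids('1_3.csv'))     # ('1', '3')
print(split_ids('report.txt'))  # None
print(digits_only('(=0'))       # 0
print(digits_only('=10"'))      # 10
print(digits_only('0.5'))       # 05
```

Note that a cell such as 0.5 loses its decimal point. In the sample data this does no harm, because every such cell comes after Mi and is blanked before the digit filter runs.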
To simply remove the Mi columns from an existing file, the following can be used:

    import csv

    with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
        csv_input = csv.reader(f_input)
        csv_output = csv.writer(f_output)

        for row in csv_input:
            # Blank out 'Mi' and all columns after it, if present
            try:
                mi = row.index('Mi')
                row[:] = row[:mi] + [''] * (len(row) - mi)
            except ValueError:
                pass

            csv_output.writerow(row)
Tested using Python 2.7.9
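The scripts above depend on Python 2 behaviour: `string.maketrans` and the two-argument form of `str.translate` are not available in Python 3, and the Python 3 `csv` module expects text-mode files opened with `newline=''` rather than `'wb'`. A minimal Python 3 sketch of the same logic might look like the following. The names `clean_rows`, `convert` and `digits_only` are illustrative assumptions, the cells are assumed to parse as shown in the question's table, and unlike the original it trims each row to the seven output columns instead of preserving the input row width.

```python
import csv
import re
import string

OUTPUT_HEADER = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']
BB_PATTERN = re.compile(r'(\d+)_(\d+)\.csv')

def digits_only(cell):
    # Keep only the ASCII digits of a cell.
    return ''.join(ch for ch in cell if ch in string.digits)

def clean_rows(rows):
    """Yield cleaned rows; `rows` is an iterable of lists without the header row."""
    for row in rows:
        # Rows too short to have a CC column are skipped (an added safeguard).
        bb = BB_PATTERN.match(row[1]) if len(row) > 2 else None
        if not bb or row[2] in ('No Bi', 'less'):
            continue
        # Drop 'Mi' and everything after it.
        if 'Mi' in row:
            row = row[:row.index('Mi')]
        # Sanitise the (up to) five value columns and pad to a fixed width.
        values = [digits_only(cell) for cell in row[2:7]]
        values += [''] * (5 - len(values))
        yield [bb.group(1), bb.group(2)] + values

def convert(in_path, out_path):
    """Read in_path, write the cleaned rows to out_path."""
    with open(in_path, newline='') as f_in, open(out_path, 'w', newline='') as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        next(reader, None)               # Skip the original header row
        writer.writerow(OUTPUT_HEADER)
        writer.writerows(clean_rows(reader))

# Rows as csv.reader would be expected to return them for part of the sample data.
sample_rows = [
    ['1', '1_1.csv', '(=0', '=10"', '27"', '=57', 'Mi', '0.97', '0.9'],
    ['3', '1_6.csv', '(=0', '=10"', 'Mi', '=53', 'cnt'],
    ['4', '2_6.csv', 'No Bi', '000', '000', '000', '000'],
    ['6', '6_9.csv', 'less', '000', '000', '000', '000'],
]
for cleaned in clean_rows(sample_rows):
    print(cleaned)
# ['1', '1', '0', '10', '27', '57', '']
# ['1', '6', '0', '10', '', '', '']
```

Calling `convert('fileOne.csv', 'resultFile.csv')` would then process the actual file, assuming it is comma-separated.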