Delete or remove unexpected records and strings based on multiple criteria with a Python or R script


Problem description

I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete the unnecessary records (rows) and strings based on multiple criteria using a Python or R script, and save the remaining records to a new .csv file named resultFile.csv.

What I want to do is as follows:

  1. Delete the first column.

  2. Split column BB into two columns named a_id and b_id. Split the value on _ (underscore): the left side goes to a_id and the right side goes to b_id.

  3. Keep only records whose file name (column BB) has the .csv extension and that do not contain No Bi in column CC.

  4. Assign new name to each of the columns.

  5. Delete the records that contain strings like less in the CC column.

  6. Trim all other unnecessary strings from the records.

  7. Delete the remaining fields of each row once "Mi" is found in that row.

My fileOne.csv is as follows:

   AA      BB       CC       DD     EE      FF    GG
   1       1_1.csv  (=0      =10"   27"     =57   "Mi"
   0.97    0.9      0.8      NaN    0.9     od    0.2
   2       1_3.csv  (=0      =10"   27"     "Mi"  0.5
   0.97    0.5      0.8      NaN    0.9     od    0.4
   3       1_6.csv  (=0      =10"   "Mi"     =53  cnt
   0.97    0.9      0.8      NaN    0.9     od    0.6
   4       2_6.csv  No Bi    000    000     000   000
   5       2_8.csv  No Bi    000    000     000   000
   6       6_9.csv  less     000    000     000   000
   7       7_9.csv  s(=0     =26"   =46"    "Mi"  121     

My first expected result file would be as follows:

a_id    b_id    CC    DD    EE    FF    GG             
1       1       0     10    27    57    Mi              
1       3       0     10    27    Mi    0.5
1       6       0     10    Mi    53    cnt 
7       9       0     26    46    Mi    121  

My final expected result file would be as follows:

a_id    b_id    CC    DD    EE    FF    GG             
1       1       0     10    27    57              
1       3       0     10    27
1       6       0     10 
7       9       0     26    46  

Solution

This can be achieved with the following Python script:

import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

sanitise_table = string.maketrans("","")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)       # Keep digits

with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    input_header = next(csv_input)      # Skip the input header row via the reader
    csv_output.writerow(output_header)

    for row in csv_input:
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])

        if bb and row[2] not in ['No Bi', 'less']:
            # Remove all columns after 'Mi' if present
            try:
                mi = row.index('Mi')
                row[:] = row[:mi] + [''] * (len(row) - mi)
            except ValueError:
                pass

            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)
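
For readers on Python 3, where `string.maketrans` is no longer available, the per-row logic of the script above could be sketched as a small function. This is a port under assumptions, not the tested code from the answer: the name `clean_row` and the constant `BB_PATTERN` are hypothetical, `re.sub` stands in for the translate-table idiom that keeps only digits, the length guard is an addition, and the sample rows assume the CSV reader has already stripped the quotes around `Mi`.

```python
import re

# Matches file names such as "1_3.csv" and captures both ids.
BB_PATTERN = re.compile(r'(\d+)_(\d+)\.csv')

def clean_row(row):
    """Return the cleaned row, or None if the row should be dropped.

    Hypothetical helper mirroring the loop body of the script above.
    """
    if len(row) < 3:                      # added guard for short rows
        return None
    bb = BB_PATTERN.match(row[1])
    if not bb or row[2] in ('No Bi', 'less'):
        return None

    # Blank out 'Mi' and every field after it, if present.
    if 'Mi' in row:
        mi = row.index('Mi')
        row = row[:mi] + [''] * (len(row) - mi)

    # Keep only the digits of every cell.
    row = [re.sub(r'\D', '', cell) for cell in row]
    row[0], row[1] = bb.group(1), bb.group(2)
    return row

rows = [
    ['2', '1_3.csv', '(=0', '=10', '27', 'Mi', '0.5'],
    ['0.97', '0.5', '0.8', 'NaN', '0.9', 'od', '0.4'],
    ['4', '2_6.csv', 'No Bi', '000', '000', '000', '000'],
]
cleaned = [r for r in map(clean_row, rows) if r is not None]
print(cleaned)
```

Returning `None` for dropped rows keeps the filtering and the cleaning in one place, so the file-handling loop only has to skip `None` results before writing.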

To simply blank out the Mi field and every field after it in an existing file, the following can be used:

import csv

with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        try:
            mi = row.index('Mi')
            row[:] = row[:mi] + [''] * (len(row) - mi)
        except ValueError:
            pass

        csv_output.writerow(row)
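
The slice arithmetic in that loop may be easier to follow in isolation. The sketch below wraps it in a function; the name `blank_from_marker` and its `marker` parameter are hypothetical and only serve the illustration. It shows that the row keeps its original length, so the remaining columns stay aligned with the header.

```python
def blank_from_marker(row, marker='Mi'):
    """Return a copy of row with the marker and all later fields emptied."""
    try:
        pos = row.index(marker)
    except ValueError:
        return list(row)                  # marker absent: row is unchanged
    # Keep everything before the marker, pad the rest with empty strings.
    return row[:pos] + [''] * (len(row) - pos)

with_marker = blank_from_marker(['1', '6', '0', '10', 'Mi', '53', 'cnt'])
without_marker = blank_from_marker(['1', '1', '0', '10', '27', '57'])
print(with_marker)
print(without_marker)
```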

Tested using Python 2.7.9
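
On Python 3 the file handling needs a small change: `csv.writer` expects a text-mode file, so opening the output with `'wb'` fails on the first `writerow` call, and the `csv` documentation recommends passing `newline=''` to `open`. A minimal sketch of that pattern follows, assuming a temporary directory and a two-column sample file that are made up for the illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample input written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'input.csv')
out_path = os.path.join(tmp_dir, 'output.csv')
with open(in_path, 'w', newline='') as f:
    f.write('AA,BB\n1,Mi\n')

# Text mode plus newline='' replaces the Python 2 style 'wb'.
with open(in_path, newline='') as f_input, \
        open(out_path, 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    header = next(csv_input)              # skip the original header
    csv_output.writerow(['a_id', 'b_id'])
    for row in csv_input:
        csv_output.writerow(row)

with open(out_path, newline='') as f:
    result = f.read()
print(repr(result))
```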
