Python 3.4: match from csv and return new csv with matched values



QUESTION

How can I scan the reader csv for any items in the reader2 csv and return a new csv with the matched information?

Reader2 csv format

66740,1800,1001463,1467373,896159

reader csv format

1001385|NORTHWEST PIPE CO|10-Q|2015-05-06|edgar/data/1001385/0001193125-15-174814.txt
1001426|PERICOM SEMICONDUCTOR CORP|10-Q|2015-05-05|edgar/data/1001426/0001145443-15-000628.txt
1001463|Acacia Diversified Holdings, Inc.|10-K|2015-05-20|edgar/data/1001463/0001185185-15-001386.txt
1001463|Acacia Diversified Holdings, Inc.|10-K|2015-05-20|edgar/data/1001463/0001185185-15-001394.txt
1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001388.txt
1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001390.txt
1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001392.txt
1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001396.txt

Current Code

with open('newCIK.csv') as reader2:
    reader2 = csv.reader(reader2)

    with open('search.file') as f_in, open('SP500_10K.csv', 'w') as f_out:
        reader = csv.reader(f_in, delimiter='|')
        writer = csv.writer(f_out, delimiter='|')

        for line in reader:
            for cik in reader2:
                if cik in line:
                    writer.writerow(line)

SOLUTION

You are trying to treat a file object as a list, looping over it more than once. That won't work without doing extra work. Moreover, you are not looping over the columns of the one row; you are trying to test if the whole row is in the other CSV file rows. You'd want to test each value, and then only against the last column of the rows in the search.file CSV data.

File objects have a file position; as you read from the file the position moves from start to end. Once at the end it won't move to the start again automatically.
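The file-position behaviour can be seen in a small sketch. It assumes an in-memory StringIO object standing in for newCIK.csv (the names cik_file, first_pass and so on are only illustrative); a real file opened with open() behaves the same way:

```python
import csv
from io import StringIO

# In-memory stand-in for newCIK.csv (values taken from the question).
cik_file = StringIO("66740,1800,1001463,1467373,896159\n")
reader2 = csv.reader(cik_file)

first_pass = list(reader2)   # reads the single row; the position is now at the end
second_pass = list(reader2)  # nothing left to read, so this list is empty

cik_file.seek(0)             # rewind the underlying file object
third_pass = list(reader2)   # the same reader yields the row again

print(len(first_pass), len(second_pass), len(third_pass))
```

The second pass is empty, which is why the inner loop in the question's code only ever runs for the first line of search.file.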

You could seek the file object to the start again:

with open('newCIK.csv') as reader2_file:
    reader2 = csv.reader(reader2_file)

    with open('search.file') as f_in, open('SP500_10K.csv', 'w') as f_out:
        reader = csv.reader(f_in, delimiter='|')
        writer = csv.writer(f_out, delimiter='|')

        for line in reader:
            reader2_file.seek(0)  # rewind to the start
            for cik in reader2:
                if cik in line:
                    writer.writerow(line)

However, reading a file over and over is slow. You'd be better off reading the whole thing into memory at the start. And the above doesn't address the other problem, namely that you are testing each row, and not each column, from newCIK.csv.

Read the one row into memory, then loop over that:

import csv

with open('newCIK.csv', newline='') as reader2:
    reader2 = csv.reader(reader2)
    cik_values = next(reader2)  # first row

with open('search.file', newline='') as f_in, open('SP500_10K.csv', 'w', newline='') as f_out:
    reader = csv.reader(f_in, delimiter='|')
    writer = csv.writer(f_out, delimiter='|')

    for line in reader:
        for cik in cik_values:
            if cik in line[-1]:  # test only the last column
                writer.writerow(line)

Note that I added newline='' arguments to the open() calls. The csv module needs to control newline handling itself; without this argument an extra \r can be written on each row on Windows, and values containing embedded newlines may not be read correctly.
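As a possible variation, here is a minimal sketch that matches on the first column exactly instead of doing a substring test on the last column. It assumes the first column of search.file always holds the CIK, which the sample data suggests but the question does not state. The helper name filter_rows is hypothetical, and StringIO objects stand in for the real files:

```python
import csv
from io import StringIO

def filter_rows(cik_file, search_file, out_file):
    """Copy rows whose first column is one of the CIK values; return the row count."""
    # A set gives fast exact lookups and avoids partial matches such as
    # '1800' appearing somewhere inside a longer number.
    cik_values = {value.strip() for value in next(csv.reader(cik_file))}
    reader = csv.reader(search_file, delimiter='|')
    writer = csv.writer(out_file, delimiter='|')
    written = 0
    for line in reader:
        if line and line[0] in cik_values:  # assumption: column 0 is the CIK
            writer.writerow(line)
            written += 1
    return written

# Sample data copied from the question.
ciks = StringIO("66740,1800,1001463,1467373,896159\n")
search = StringIO(
    "1001385|NORTHWEST PIPE CO|10-Q|2015-05-06|edgar/data/1001385/0001193125-15-174814.txt\n"
    "1001463|Acacia Diversified Holdings, Inc.|10-K|2015-05-20|edgar/data/1001463/0001185185-15-001386.txt\n"
)
out = StringIO()
count = filter_rows(ciks, search, out)
print(count)
print(out.getvalue(), end='')
```

Because each row is checked once against the set, a row can never be written twice, which the nested loop could do if two CIK values both occurred in the same path.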

Demo:

>>> from io import StringIO
>>> import csv, sys
>>> newcik = '''\
... 66740,1800,1001463,1467373,896159
... '''
>>> search_file = '''\
... 1001385|NORTHWEST PIPE CO|10-Q|2015-05-06|edgar/data/1001385/0001193125-15-174814.txt
... 1001426|PERICOM SEMICONDUCTOR CORP|10-Q|2015-05-05|edgar/data/1001426/0001145443-15-000628.txt
... 1001463|Acacia Diversified Holdings, Inc.|10-K|2015-05-20|edgar/data/1001463/0001185185-15-001386.txt
... 1001463|Acacia Diversified Holdings, Inc.|10-K|2015-05-20|edgar/data/1001463/0001185185-15-001394.txt
... 1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001388.txt
... 1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001390.txt
... 1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001392.txt
... 1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001396.txt
... '''
>>> with StringIO(newcik) as reader2:
...     reader2 = csv.reader(reader2)
...     cik_values = next(reader2)  # first row
... 
>>> with StringIO(search_file) as f_in:
...     reader = csv.reader(f_in, delimiter='|')
...     writer = csv.writer(sys.stdout, delimiter='|')
...     for line in reader:
...         for cik in cik_values:
...             if cik in line[-1]:  # test only the last column
...                 writer.writerow(line)
... 
1001463|Acacia Diversified Holdings, Inc.|10-K|2015-05-20|edgar/data/1001463/0001185185-15-001386.txt
103
1001463|Acacia Diversified Holdings, Inc.|10-K|2015-05-20|edgar/data/1001463/0001185185-15-001394.txt
103
1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001388.txt
103
1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001390.txt
103
1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001392.txt
103
1001463|Acacia Diversified Holdings, Inc.|10-Q|2015-05-20|edgar/data/1001463/0001185185-15-001396.txt
103

The 103 values are the return value of each writer.writerow() call (the number of characters written, including the \r\n line terminator), echoed by the REPL.
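To see where that number comes from, here is a small sketch that writes one of the sample rows to an in-memory StringIO object (an assumption made for illustration; the demo writes to sys.stdout) and captures the return value instead of letting the REPL echo it:

```python
import csv
from io import StringIO

row = ["1001463", "Acacia Diversified Holdings, Inc.", "10-K", "2015-05-20",
       "edgar/data/1001463/0001185185-15-001386.txt"]

out = StringIO()
writer = csv.writer(out, delimiter='|')

# writerow() passes on whatever the underlying write() call returns,
# which for text streams is the number of characters written.
returned = writer.writerow(row)

# Length of the joined row without the line terminator.
line_length = len('|'.join(row))

print(returned, line_length)
```

The difference between the two numbers is the two-character \r\n terminator that the csv writer appends by default.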
