Counting csv column occurrences on the fly in Python

Problem description

I have a csv file that has 200 million lines in it.

The best way to load in this file is line by line using a csv reader (as I have many of these files and so parallelising the code later requires not loading in massive datasets and overloading RAM).

I'm trying to count the number of occurrences of values in a certain column and record both their values and frequency in a dictionary. For example, count the number of unique IDs in a column and the number of times those IDs appear.

Here is an example of how I would do this:

import csv
from tqdm import tqdm

field_names = ['A','B','ID','C','D']
filename = '/PATH/file'

ID_dict = {}
with open(data_path+filename) as f:
    reader = csv.DictReader(f,field_names,delimiter=',')
    for row in tqdm(reader):
        label = row['ID']
        if label not in ID_dict.keys():
                ID_dict[label] = 0
        ID_dict[label] += 1

So the column labelled 'ID' is what I'm interested in here, but imagine it has ~200 million entries in it.

Iterating through all those lines and populating a dictionary is slow (it would take around 10 hours on my machine).

Alternatively, appending the values to a new array and then using Counter to find the number of appearances of each unique element also takes too long. (see How do I count unique values inside a list)
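
For reference, the list-then-Counter alternative mentioned above would look roughly like this minimal sketch (assuming filename points at the CSV from the question); it still has to materialise a 200-million-element list before any counting happens:

import csv
from collections import Counter

with open(filename) as f:
    reader = csv.reader(f)
    ids = [row[2] for row in reader]  # build the full list of IDs first
    id_counts = Counter(ids)          # then count each unique ID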

Is there a much faster way of doing this that I'm missing? Perhaps there is a Pandas way that is faster? Thanks in advance.

Recommended answer

Don't use DictReader(). DictReader() does a lot of work converting a row to a dictionary, with configurable handling of missing and extra columns, that you really don't need here. Just use a regular reader and access the 3rd column of each row.

You can further speed this up by using a Counter() object to begin with (it'll automatically handle the 0 case for you). You may be able to get a very minor speed boost by opening the file with newline=''; the CSV module recommends you do this anyway as it wants to make sure it knows about line endings versus possible embedded newlines in columns.
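
Put together, an intermediate version of that advice (a plain csv.reader, a Counter, and newline='') could look like this sketch, assuming filename is the path from the question, before applying the map()/itemgetter() trick below:

import csv
from collections import Counter

id_counts = Counter()
with open(filename, newline='') as f:  # newline='' as the csv docs recommend
    reader = csv.reader(f)
    for row in reader:
        id_counts[row[2]] += 1  # 'ID' is the third column; Counter treats missing keys as 0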

If you use a map() object and operator.itemgetter(), you can further avoid the evaluation loop overhead, and pass the ids directly to the counter:

import csv
import os.path
from collections import Counter
from operator import itemgetter

filename = '/PATH/file'

with open(os.path.join(data_path, filename), newline='') as f:
    reader = csv.reader(f)
    id_counts = Counter(map(itemgetter(2), reader))

Still, 200 million rows is a lot of work to handle. I generated 1 million rows of semi-realistic data using Faker, copied those rows 200 times into a new file, and my 2017 model MacBook Pro with SSD processed the resulting 12GB of data in just over 6 minutes with tqdm included, and 5 minutes 14 seconds without. tqdm claims it adds only 60 nanoseconds to each iteration (12 seconds over 200 million rows), but in my tests it appears to be easily 3 or 4 times that number.
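
(If you want to generate a similar benchmark file yourself, a sketch along these lines, using Faker's standard providers, would do; the exact fake fields here are my own assumption, not the data used for the timings above:)

import csv
from faker import Faker

fake = Faker()
with open('testdata.csv', 'w', newline='') as f:  # hypothetical output file
    writer = csv.writer(f)
    for _ in range(10 ** 6):
        # columns A, B, ID, C, D; only the ID column matters for the benchmark
        writer.writerow([fake.word(), fake.word(), fake.random_int(0, 99),
                         fake.word(), fake.word()])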

Pandas reading the data is going to be about the same speed, as Pandas' read_csv() is built on top of csv.reader(), and the above is as fast as Python can make reading a file with 200 million rows. However, it'll then build a dataframe for those 200 million rows, and that'll take a significant amount of memory to process. You'd have to process your data in chunks and aggregate the results for this to be feasible at all.
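
A chunked Pandas version might look roughly like this sketch (assuming the same field_names and filename as above; the chunksize of one million rows is an arbitrary choice):

import pandas as pd
from collections import Counter

field_names = ['A', 'B', 'ID', 'C', 'D']
id_counts = Counter()
# usecols limits parsing to the ID column, keeping each chunk small
for chunk in pd.read_csv(filename, names=field_names, usecols=['ID'],
                         chunksize=10 ** 6):
    id_counts.update(chunk['ID'].value_counts().to_dict())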

Let's do some speed tests, comparing your version (one with and one without the tqdm speed bump), Pandas, and the above approach. We'll use a test set of 10k rows with about 100 unique IDs to compare things evenly, without using I/O. This tests just the counting capabilities of each approach. So, setting up the test data and tests; the name=name keyword assignments help avoid global name lookups in the repeated tests:

>>> import csv, pandas
>>> from timeit import Timer
>>> from collections import Counter
>>> from contextlib import redirect_stderr
>>> from io import StringIO
>>> from operator import itemgetter
>>> from random import randrange
>>> from tqdm import tqdm
>>> row = lambda: f",,{randrange(100)},,\r\n"  # 5 columns, only care about middle column
>>> test_data = ''.join([row() for _ in range(10 ** 4)])  # CSV of 10.000 rows
>>> field_names = ['A', 'B', 'ID', 'C', 'D']
>>> filename = '/PATH/file'
>>> tests = []
>>> def as_test(f):
...     tests.append((f.__name__, f))
...
>>> @as_test
... def in_question(f, csv=csv, tqdm=tqdm, field_names=field_names):
...     ID_dict = {}
...     reader = csv.DictReader(f, field_names, delimiter=',')
...     for row in tqdm(reader):
...         label = row['ID']
...         if label not in ID_dict.keys():
...                 ID_dict[label] = 0
...         ID_dict[label] += 1
...
>>> @as_test
... def in_question_no_tqdm(f, csv=csv, tqdm=tqdm, field_names=field_names):
...     ID_dict = {}
...     reader = csv.DictReader(f, field_names, delimiter=',')
...     for row in reader:
...         label = row['ID']
...         if label not in ID_dict.keys():
...                 ID_dict[label] = 0
...         ID_dict[label] += 1
...
>>> @as_test
... def pandas_groupby_count(f, pandas=pandas, field_names=field_names):
...     df = pandas.read_csv(f, names=field_names)
...     grouped_counts = df.groupby('ID').count()
...
>>> @as_test
... def pandas_value_counts(f, pandas=pandas, field_names=field_names):
...     df = pandas.read_csv(f, names=field_names)
...     counts = df['ID'].value_counts()
...
>>> @as_test
... def counter_over_map(f, csv=csv, Counter=Counter, ig2=itemgetter(2)):
...     reader = csv.reader(f)
...     id_counts = Counter(map(ig2, reader))
...

And run the timed tests:

>>> for testname, testfunc in tests:
...     timer = Timer(lambda s=StringIO, t=test_data: testfunc(s(t)))
...     with redirect_stderr(StringIO()):  # silence tqdm
...         count, totaltime = timer.autorange()
...         print(f"{testname:>25}: {totaltime / count * 1000:6.3f} milliseconds ({count:>2d} runs)")
...
              in_question: 33.303 milliseconds (10 runs)
      in_question_no_tqdm: 30.467 milliseconds (10 runs)
     pandas_groupby_count:  5.298 milliseconds (50 runs)
      pandas_value_counts:  5.975 milliseconds (50 runs)
         counter_over_map:  4.047 milliseconds (50 runs)

The combination of DictReader() and a Python for loop is what really causes your version to be 6 to 7 times as slow. tqdm's overhead drops to roughly 0.3 microseconds per iteration with stderr suppressed; dropping the with redirect_stderr() context manager makes the output more verbose and increases the time to 50 milliseconds, so about 2 microseconds per iteration:

>>> timer = Timer(lambda s=StringIO, t=test_data: tests[0][1](s(t)))
>>> count, totaltime = timer.autorange()
10000it [00:00, 263935.46it/s]
10000it [00:00, 240672.96it/s]
10000it [00:00, 215298.98it/s]
10000it [00:00, 226025.18it/s]
10000it [00:00, 201787.96it/s]
10000it [00:00, 202984.24it/s]
10000it [00:00, 192296.06it/s]
10000it [00:00, 195963.46it/s]
>>> print(f"{totaltime / count * 1000:6.3f} milliseconds ({count:>2d} runs)")
50.193 milliseconds ( 5 runs)

Pandas is keeping up well here, however! But without chunking, reading all 200 million rows into memory (with the actual dataset, not the mostly empty columns I produced here) will require gigabytes of memory, be a lot slower, and perhaps be more than your machine can actually carry. Using a Counter() doesn't require gigabytes of memory here.

If you need to do more processing of the CSV dataset, then using SQLite is going to be a good idea too. I wouldn't even use Python then; just use the SQLite command line tool to import the CSV data directly:

$ sqlite3 csvanalysis.db
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
sqlite> CREATE TABLE csvdata (A, B, ID, C, D);
sqlite> CREATE INDEX csvdata_id on csvdata (ID);
sqlite> .mode csv
sqlite> .import /path/to/file.csv csvdata
sqlite> SELECT ID, COUNT(*) FROM csvdata GROUP BY ID;
