比较python(numpy)中两个巨大的csv文件的最快方法 [英] Fastest way to compare two huge csv files in python(numpy)

查看:411
本文介绍了比较python(numpy)中两个巨大的csv文件的最快方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在两个相当大的csv文件之间找到intesect子集 电话号码(一个有60万行,另一个有3亿行).我目前正在使用pandas打开两个文件,然后将所需的列转换为一维numpy数组,然后使用numpy相交来获得相交.有没有更好的方法可以做到这一点,可以使用python或任何其他方法.感谢您的帮助

I am trying find the intesect sub set between two pretty big csv files of phone numbers(one has 600k rows, and the other has 300mil). I am currently using pandas to open both files and then converting the needed columns into 1d numpy arrays and then using numpy intersect to get the intersect. Is there a better way of doing this, either with python or any other method. Thanks for any help

import pandas as pd
import numpy as np

df_dnc = pd.read_csv('dncTest.csv', names = ['phone'])
df_test = pd.read_csv('phoneTest.csv', names = ['phone'])

dnc_phone = df_dnc['phone']
test_phone = df_test['phone']

np.intersect1d(dnc_phone, test_phone)

推荐答案

我将为您提供一些Python伪代码的通用解决方案.您要在这里解决的是来自

I will give you general solution with some Python pseudo code. What you are trying to solve here is the classical problem from the book "Programming Pearls" by Jon Bentley.

只需一个简单的位数组就可以非常有效地解决此问题,因此,我的评论是电话号码有多少位(多少位数).

This is solved very efficiently with just a simple bit array, hence my comment, how long is (how many digits does have) the phone number.

比方说,电话号码的长度最多为10位数字,而您可以拥有的最大电话号码为:9 999 999 999(使用空格可以提高可读性).在这里,我们可以使用每个数字1位来标识该数字是否已设置(分别设置位或未设置位),因此,我们将使用9 999 999 999位来标识每个数字,即:

Let's say the phone number is at most 10 digits long, than the max phone number you can have is: 9 999 999 999 (spaces are used for better readability). Here we can use 1bit per number to identify if the number is in set or not (bit is set or not set respectively), thus we are going to use 9 999 999 999 bits to identify each number, i.e.:

  • bits[0]标识数字0 000 000 000
  • bits[193]标识数字0 000 000 193
  • 电话号码659 234-4567将由bits[6592344567]
  • 处理
  • bits[0] identifies the number 0 000 000 000
  • bits[193] identifies the number 0 000 000 193
  • having a number 659 234-4567 would be addressed by the bits[6592344567]

这样做,我们需要预先分配最初设置为09 999 999 999位,即:9 999 999 999 / 8 / 1024 / 1024 =大约1.2 GB的内存.

Doing so we'd need to pre-allocate 9 999 999 999 bits initially set to 0, which is: 9 999 999 999 / 8 / 1024 / 1024 = around 1.2 GB of memory.

我认为在末尾保留数字的交集将比位表示形式占用更多的空间=>最多将存储600k int => 64bit * 600k =约4.6 GB(

I think that holding the intersection of numbers at the end will use more space than the bits representation => at most 600k ints will be stored => 64bit * 600k = around 4.6 GB (actually int is not stored that efficiently and might use much more), if these are string you'll probably end with even more memory requirements.

从CSV文件(逐行或缓冲的文件阅读器)中解析电话号码字符串,将其转换为数字,并且比进行固定时间的内存查找要比处理字符串和合并它们快IMO.不幸的是,我没有这些电话号码文件可以测试,但是希望听到您的发现.

Parsing a phone number string from CSV file (line by line or buffered file reader), converting it to a number and than doing a constant time memory lookup will be IMO faster than dealing with strings and merging them. Unfortunately, I don't have these phone number files to test, but would be interested to hear your findings.

from bitstring import BitArray


max_number = 9999999999

found_phone_numbers = BitArray(length=max_number+1)


# replace this function with the file open function and retrieving
#  the next found phone number
def number_from_file_iteator(dummy_data):
    for number in dummy_data:
        yield number


def calculate_intersect():
    # should be open a file1 and getting the generator with numbers from it
    #  we use dummy data here
    for number in number_from_file_iteator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True

    # open second file and check if the number is there
    for number in number_from_file_iteator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number


number_intersection = set(calculate_intersect())

print number_intersection

我从 bitstring pip包中使用了BitArray, 2秒钟初始化整个位串.之后,扫描文件将使用固定内存.最后,我使用set来存储项目.

I used BitArray from bitstring pip package and it needed around 2 secs to initialize the entire bitstring. Afterwards, scanning the file will use constant memory. At the end I used a set to store the items.

注释1:可以修改此算法以仅使用list.在这种情况下,一旦位数匹配,就必须进行第二次循环,以确保重复项不再匹配.

Note 1: This algorithm can be modified to just use the list. In that case a second loop as soon as bit number matches this bit must be reset, so that duplicates do not match again.

注2::由于我们在第二个for循环中使用了生成器,因此懒惰地存储在set/list中.运行时复杂度是线性的,即O(N).

Note 2: Storing in the set/list occurs lazy, because we use the generator in the second for loop. Runtime complexity is linear, i.e. O(N).

这篇关于比较python(numpy)中两个巨大的csv文件的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆