Parse a very large CSV dataset

Views: 297 Published: 2020/5/24 3:46:27 Tags: python python-2.7 csv pandas scikit-learn


Problem description

I have a very large CSV dataset (900M records) in the following format:

URL | IP | ActivityId

Sample data:

http://google.com/ | 127.0.0.1 | 2
http://google.com/ | 12.3.3.1 | 2

For this format, I want to get all the unique activities per URL.

What I tried was to create a dictionary where the key is the URL and the value is a set of unique activities. However, this performs very poorly: it eats up all the RAM and is very slow (an O(n) operation).
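For reference, the dictionary-of-sets approach described above might look roughly like the sketch below. It is not the asker's actual code. It assumes Python 3, that the fields are separated by `|` with surrounding spaces, and that ActivityId is a non-negative integer. The function name `unique_activities_per_url` is made up for illustration. The file is read one row at a time, so only the resulting dictionary has to fit in memory.

```python
import csv
import io
from collections import defaultdict


def unique_activities_per_url(lines):
    """Stream pipe-delimited rows and collect the set of activity IDs per URL."""
    activities = defaultdict(set)
    reader = csv.reader(lines, delimiter="|")
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed rows
        url, _ip, activity_id = (field.strip() for field in row)
        if not activity_id.isdigit():
            continue  # skip a header row or a non-numeric ID (assumption: IDs are integers)
        # Storing the ID as an int rather than a string keeps each set entry small.
        activities[url].add(int(activity_id))
    return activities


# Small in-memory stand-in for the real file, using the sample rows from the question.
sample = io.StringIO(
    "http://google.com/ | 127.0.0.1 | 2\n"
    "http://google.com/ | 12.3.3.1 | 2\n"
)
result = unique_activities_per_url(sample)
print(dict(result))  # {'http://google.com/': {2}}
```

With a real file, an open file object can be passed instead of the `StringIO` object, for example `open(path, newline="")`, where `path` is a placeholder for the dataset's location.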

Is there a faster approach?
