Why is there such a large difference in memory usage for dataframes between pandas and R?

Posted: 2021/6/2 19:29:22 Tags: python r pandas dataframe memory

Question

I am working with the data from https://opendata.rdw.nl/Voertuigen/Open-Data-RDW-Gekentekende_voertuigen_brandstof/8ys7-d773 (download the CSV file using the 'Exporteer' button).

When I import the data into R using read.csv(), it takes 3.75 GB of memory, but when I import it into pandas using pd.read_csv(), it takes 6.6 GB.

Why is this difference so large?

I used the following code to determine the memory usage of the dataframe in R:

library(pryr) 
object_size(df)

And in Python:

df.info(memory_usage="deep")
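A per-column breakdown can help show where the memory goes. The following is a minimal sketch, not a measurement of the RDW file: the column names `brandstof` and `volgnummer` and their values are made up for illustration. It compares `DataFrame.memory_usage` with and without `deep=True`; for object-dtype string columns, the deep figure also counts the Python string objects, which is usually where most of the footprint sits.

```python
import pandas as pd

# Hypothetical stand-in for the real data: one string column, one integer column.
df = pd.DataFrame({
    "brandstof": pd.Series(
        ["Benzine", "Diesel", "Benzine", "Elektriciteit"] * 1000, dtype="object"
    ),
    "volgnummer": [1, 2, 1, 1] * 1000,
})

# deep=False counts only the array of pointers for object columns;
# deep=True also counts the Python string objects they point to.
shallow = df.memory_usage(deep=False)
deep = df.memory_usage(deep=True)

print(df.dtypes)
print(deep.sort_values(ascending=False))
```

Columns whose deep figure is much larger than their shallow figure are the ones most likely to benefit from a different dtype.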

Recommended answer

I found that link super useful and figured it's worth breaking out from the comments and summarizing:

Reducing Pandas memory usage #1: lossless compression

Load only the columns you need with usecols

df = pd.read_csv('voters.csv', usecols=['First Name', 'Last Name'])
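To see the effect without the real file, here is a small self-contained sketch. The CSV text is invented sample data that reuses the column names from the snippets in this answer, and it is read from an in-memory buffer instead of `voters.csv`.

```python
import io
import pandas as pd

# Invented sample data; only the column names come from the answer's snippets.
csv_text = (
    "First Name,Last Name,Ward Number,Party Affiliation\n"
    "Ada,Lovelace,3,Independent\n"
    "Alan,Turing,7,Independent\n"
)

# Load everything, then load only the two name columns.
full = pd.read_csv(io.StringIO(csv_text))
subset = pd.read_csv(io.StringIO(csv_text), usecols=["First Name", "Last Name"])

print(list(subset.columns))
print(full.memory_usage(deep=True).sum(), subset.memory_usage(deep=True).sum())
```

Columns excluded by usecols are never turned into pandas columns, so they add nothing to the resulting dataframe's memory.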

Shrink numerical columns with smaller dtypes

int64: (default) -9223372036854775808 to 9223372036854775807
int16: -32768 to 32767
int8: -128 to 127

df = pd.read_csv('voters.csv', dtype={'Ward Number': 'int8'})
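Casting to a smaller integer type is only lossless if every value fits in the target range. The sketch below uses invented ward numbers. It checks the range with `np.iinfo` before casting, then shows `pd.to_numeric(..., downcast="integer")` as an alternative that picks the smallest signed integer dtype automatically.

```python
import numpy as np
import pandas as pd

# Invented values standing in for the 'Ward Number' column.
ward = pd.Series([1, 5, 12, 40, 127], dtype="int64")

# Confirm the values fit in int8 before casting, so nothing overflows silently.
limits = np.iinfo("int8")
if ward.min() < limits.min or ward.max() > limits.max:
    raise ValueError("values do not fit in int8")
small = ward.astype("int8")

# Alternative: let pandas choose the smallest signed integer dtype that fits.
auto = pd.to_numeric(ward, downcast="integer")

print(ward.memory_usage(index=False), small.memory_usage(index=False))
print(auto.dtype)
```

Here each value shrinks from 8 bytes to 1, so the column's data takes one eighth of the original space.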

Shrink columns with few distinct values using dtype category

df = pd.read_csv('voters.csv', dtype={'Party Affiliation': 'category'})
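As a rough illustration of why this helps: a category column stores each distinct string once, plus a small integer code per row, whereas an object-dtype column holds a reference to a Python string in every row. The party names below are invented sample values.

```python
import pandas as pd

# Invented sample values: 30,000 rows but only three distinct strings.
party = pd.Series(
    ["Democrat", "Republican", "Independent"] * 10_000, dtype="object"
)
as_category = party.astype("category")

print(party.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.dtype)
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows, so this suits low-cardinality columns best.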

Convert mostly nan data to dtype Sparse

sparse_str_series = series.astype('Sparse[str]')
sparse_int16_series = series.astype('Sparse[int16]')
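A small sketch of the same idea with a float column, using invented data that is 99.9% NaN. A float subtype is used here because NumPy integer dtypes cannot represent NaN. The sparse version stores only the non-missing values and their positions.

```python
import numpy as np
import pandas as pd

# Invented data: 9,990 missing values and 10 real ones.
dense = pd.Series([np.nan] * 9_990 + [1.5] * 10)
sparse = dense.astype("Sparse[float64]")

print(dense.memory_usage(index=False))
print(sparse.memory_usage(index=False))
print(sparse.sparse.density)  # fraction of values actually stored

# The conversion is lossless: converting back gives the original values.
round_trip = sparse.sparse.to_dense()
```

The benefit depends on how sparse the column is. When most values are present, the extra position index can make the sparse form larger than the dense one.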
