How can I tell when my dataset in R is going to be too large?

Question

I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a keyval store, maybe?). So I am wondering how to tell ahead of time how much room my data is going to take up in RAM, and whether I will have enough. I know how much RAM I have (not a huge amount - 3GB under XP), and I know how many rows and cols my logfile will end up as and what data types the col entries ought to be (which presumably I need to check as it reads).
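
A quick empirical check along these lines, sketched in base R (the file name, sample size, and expected row count below are hypothetical placeholders): read a small sample of the file, measure it with object.size(), and scale linearly to the full row count.

    ## Estimate the full dataset's in-memory size from a small sample.
    ## "logfile.csv" and the counts below are illustrative placeholders.
    sample_rows <- 10000
    sample_df   <- read.csv("logfile.csv", nrows = sample_rows)
    sample_mb   <- as.numeric(object.size(sample_df)) / 1024^2
    total_rows  <- 5e6                    # expected rows in the full logfile
    est_mb      <- sample_mb * total_rows / sample_rows
    cat(sprintf("Estimated in-memory size: ~%.0f MB\n", est_mb))

Leave generous headroom on top of this estimate: R often copies objects during operations, so a working set of two to three times the data size is a safer planning figure.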

How do I put this together into a go/nogo decision for undertaking the analysis in R? (Presumably R needs to be able to have some RAM to do operations, as well as holding the data!) My immediate required output is a bunch of simple summary stats, frequencies, contingencies, etc, and so I could probably write some kind of parser/tabulator that will give me the output I need short term, but I also want to play around with lots of different approaches to this data as a next step, so am looking at feasibility of using R.

I have seen lots of useful advice about large datasets in R here, which I have read and will reread, but for now I would like to understand better how to figure out whether I should (a) go there at all, (b) go there but expect to have to do some extra stuff to make it manageable, or (c) run away before it's too late and do something in some other language/environment (suggestions welcome...!). Thanks!

Answer

R is well suited for big datasets, either using out-of-the-box solutions like bigmemory or the ff package (especially read.csv.ffdf), or by processing your data in chunks using your own scripts. In almost all cases a little programming makes processing large datasets (>> memory, say 100 GB) very possible. Doing this kind of programming yourself takes some time to learn (I don't know your level), but it makes you really flexible. Whether this is your cup of tea, or whether you need to run away, depends on the time you want to invest in learning these skills. But once you have them, they will make your life as a data analyst much easier.
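
As a rough illustration of the chunked approach (plain base R, not the bigmemory/ff APIs), here is a minimal sketch that tabulates one column of a large CSV without ever holding the whole file in memory; the file name, chunk size, and "status" column are hypothetical.

    ## Tabulate one column of a large CSV in fixed-size chunks.
    con    <- file("big_logfile.csv", open = "r")
    header <- strsplit(readLines(con, n = 1), ",")[[1]]
    counts <- integer(0)                  # named vector of running counts
    repeat {
      lines <- readLines(con, n = 100000) # next chunk of raw lines
      if (length(lines) == 0) break
      chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                        stringsAsFactors = FALSE)
      tab <- table(chunk$status)          # per-chunk frequency table
      old <- counts[names(tab)]
      old[is.na(old)] <- 0                # levels seen for the first time
      counts[names(tab)] <- old + as.integer(tab)
    }
    close(con)
    print(counts)

Peak memory use is bounded by the chunk size rather than the file size, which is what makes this approach (almost) insensitive to how large the logfile grows.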

In regard to analyzing logfiles, I know that stats pages generated from Call of Duty 4 (computer multiplayer game) work by parsing the log file iteratively into a database, and then retrieving the statistics per user from the database. See here for an example of the interface. The iterative (in chunks) approach means that logfile size is (almost) unlimited. However, getting good performance is not trivial.
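
The same pattern sketched in R with the RSQLite package (assuming it is installed; the file, table, and column names are hypothetical): append each parsed chunk to an SQLite table on disk, then let the database do the per-user aggregation.

    ## Iteratively load a logfile into SQLite, then aggregate per user.
    library(DBI)
    library(RSQLite)
    db     <- dbConnect(SQLite(), "logs.sqlite")
    con    <- file("big_logfile.csv", open = "r")
    header <- strsplit(readLines(con, n = 1), ",")[[1]]
    repeat {
      lines <- readLines(con, n = 100000)
      if (length(lines) == 0) break
      chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                        stringsAsFactors = FALSE)
      dbWriteTable(db, "logs", chunk, append = TRUE)  # grows on disk, not in RAM
    }
    close(con)
    ## Per-user stats become a query, whatever the logfile size:
    stats <- dbGetQuery(db, "SELECT user, COUNT(*) AS n FROM logs GROUP BY user")
    dbDisconnect(db)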

A lot of the stuff you can do in R, you can also do in Python or Matlab, even C++ or Fortran. But only if a tool has out-of-the-box support for what you want would I see a distinct advantage of that tool over R. For processing large data see the HPC Task view. See also an earlier answer of mine about reading a very large text file in chunks. Other related links that might be interesting for you:

  • Quickly reading very large tables as dataframes in R
  • https://stackoverflow.com/questions/1257021/suitable-functional-language-for-scientific-statistical-computing (the discussion includes what to use for large data processing).
  • Trimming a huge (3.5 GB) csv file to read into R
  • A blog post of mine showing how to estimate the RAM usage of a dataset. Note that this assumes the data will be stored in a matrix or array of a single datatype (a minimal sketch of that calculation follows this list).
  • Log file processing with R
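
On the RAM-estimation point above, a minimal sketch of the arithmetic (the column and row counts are hypothetical): a numeric cell costs 8 bytes and an integer cell 4, so the footprint is roughly rows times bytes-per-row.

    ## Back-of-envelope RAM estimate for columns of homogeneous types.
    est_ram_gb <- function(n_rows, n_numeric = 0, n_integer = 0) {
      bytes <- n_rows * (8 * n_numeric + 4 * n_integer)  # 8 B/double, 4 B/int
      bytes / 1024^3
    }
    est_ram_gb(n_rows = 10e6, n_numeric = 5, n_integer = 3)  # ~0.48 GB

Character columns are harder to pin down (R interns strings in a global cache), so for text-heavy logfiles the sample-and-scale check shown under the question is the more reliable estimate.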

In regard to choosing R or some other tool, I'd say if it's good enough for Google it is good enough for me ;).
