"Large data" workflows using pandas

Problem Description

I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.

My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier to use alternative. My question is this:

What are some best-practice workflows for accomplishing the following:

  1. Loading flat files into a permanent, on-disk database structure
  2. Querying that database to retrieve data to feed into a pandas data structure
  3. Updating the database after manipulating pieces in pandas

Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

Edit -- an example of how I would like this to work:

  1. Import a large flat file iteratively and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
  2. In order to use pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
  3. I would create new columns by performing various operations on the selected columns.
  4. I would then have to append these new columns into the database structure.

I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables, it seems that appending a new column could be a problem.

Edit -- Responding to Jeff's questions specifically:

  1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc... The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
  2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B' (a pandas sketch of this appears after this list). The result of these operations is a new column for every record in my dataset.
  3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
  4. A typical project file is usually about 1GB. Files are organized in such a manner that a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
  5. It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
  6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.
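
For concreteness, here is a minimal pandas sketch of the conditional-logic operation from point 2 above, using toy data. The names var1, var2 and newvar come from the pseudocode; the default label for rows matching neither condition is an assumption, since the pseudocode does not specify one.

import numpy as np
import pandas as pd

# toy data standing in for two of the ~1,000-2,000 fields
df = pd.DataFrame({'var1': [1, 3, 0, 5], 'var2': [4, 1, 4, 2]})

# conditions are evaluated in order, so the var1 > 2 rule wins when both apply
conditions = [df['var1'] > 2, df['var2'] == 4]
choices = ['A', 'B']

# rows matching neither condition get an assumed default label
df['newvar'] = np.select(conditions, choices, default='C')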

It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).

Answer

I routinely use tens of gigabytes of data in just this fashion e.g. I have tables on disk that I read via queries, create data and append back.

It's worth reading the docs and late in this thread for several suggestions for how to store your data.

Details which will affect how you store your data, like:
Give as much detail as you can; and I can help you develop a structure.

  1. Size of data, number of rows and columns, types of columns; are you appending rows, or just columns?
  2. What will typical operations look like? E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, and save these.
    (Providing a toy example would enable us to offer more specific recommendations.)
  3. After that processing, what do you do then? Is step 2 ad hoc, or repeatable?
  4. Input flat files: rough total size in GB. How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
  5. Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5) and then do something, or do you just select fields A, B, C for all of the records (and then do something)?
  6. Do you "work on" all of your columns (in groups), or is there a good proportion that you only use for reports (e.g. you want to keep the data around, but don't need to pull in those columns explicitly until final-results time)?

Solution

Make sure you have at least pandas 0.10.1 installed.

Read the docs on iterating over files chunk-by-chunk and on multiple table queries.

Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (which will work with a big table, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
(The following is pseudocode.)

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......        ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),

)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))

Reading in the files and creating the storage (essentially doing what append_to_multiple does):

for f in files:
   # read in the file, additional options may be necessary here
   # the chunksize is not strictly necessary, you may be able to slurp each 
   # file into memory in which case just eliminate this part of the loop 
   # (you can also change chunksize if necessary)
   for chunk in pd.read_table(f, chunksize=50000):
       # we are going to append to each table by group
       # we are not going to create indexes at this time
       # but we *ARE* going to create (some) data_columns

       # figure out the field groupings
       for g, v in group_map.items():
             # create the frame for this group
             frame = chunk.reindex(columns = v['fields'], copy = False)    

             # append it
             store.append(g, frame, index=False, data_columns = v['dc'])

Now you have all of the tables in the file (actually you could store them in separate files if you wish, you would prob have to add the filename to the group_map, but probably this isn't necessary).
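
If you did go the separate-files route, one possible (purely illustrative) arrangement is to carry a path for each group in the group_map and open one store per group; the path key, the store_for helper and the field names below are assumptions, not part of the code above.

# hypothetical variant: one HDF5 file per group, with the path kept in group_map
group_map = dict(
    A = dict(fields = ['field_1', 'field_2'], dc = ['field_1'], path = 'store_A.h5'),
    B = dict(fields = ['field_10'],           dc = ['field_10'], path = 'store_B.h5'),
)

def store_for(g):
    # open (or create) the per-group file on demand
    return pd.HDFStore(group_map[g]['path'])

# appending then happens per group, e.g.:
# with store_for('A') as s:
#     s.append('A', frame, index=False, data_columns=group_map['A']['dc'])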

This is how you get columns and create new ones:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)

When you are ready for post_processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)

About data_columns, you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).

You may also want to do the following (a rough sketch is given after this list):

  • Create a function that takes a list of fields, looks up the groups in the group_map, selects them, and concatenates the results so that you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you.
  • Build indexes on certain data columns (this makes row-subsetting much faster).
  • Enable compression.
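
A rough sketch of those three suggestions, assuming the group_map / group_map_inverted dicts and store from earlier, plus tables whose rows line up because they were appended from the same chunks; the helper name select_fields and the specific index/compression settings are illustrative only.

def select_fields(store, fields):
    # look up which group each requested field lives in
    groups = sorted(set(group_map_inverted[f] for f in fields))
    frames = [
        store.select(g, columns=[f for f in fields if group_map_inverted[f] == g])
        for g in groups
    ]
    # concatenate column-wise; for row filtering that spans groups,
    # use store.select_as_multiple with a selector group instead
    return pd.concat(frames, axis=1)

# index the data columns you query on most often (faster row subsetting)
store.create_table_index('A', columns=['field_1'], optlevel=9, kind='full')

# enable compression by passing complib/complevel on the first append to a table
# (or by opening the HDFStore with complib=... / complevel=...), e.g. in the loading loop:
# store.append(g, frame, index=False, data_columns=v['dc'],
#              complib='blosc', complevel=9)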

Let me know when you have questions!
