Read large .json file with index format into Pandas dataframe


Problem description

I was following this answer, but after some discussion with its author, it seems it only gives a solution for the orient='records' data format.

Here is the difference:

# orient='records'
[
    {"Product":"Desktop Computer","Price":700},
    {"Product":"Tablet","Price":250},
    {"Product":"iPhone","Price":800},
    {"Product":"Laptop","Price":1200}
]

# orient='index'
{
    "0":{"Product":"Desktop Computer","Price":700},
    "1":{"Product":"Tablet","Price":250},
    "2":{"Product":"iPhone","Price":800},
    "3":{"Product":"Laptop","Price":1200}
}
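For reference, both forms can be produced from the same frame with to_json (a minimal sketch with toy data):

```python
import pandas as pd

# Toy stand-in for the real table
df = pd.DataFrame({"Product": ["Tablet"], "Price": [250]})

# orient='records' produces a JSON array; the index is dropped
records_json = df.to_json(orient="records")

# orient='index' produces a JSON object keyed by the index values
index_json = df.to_json(orient="index")
```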

I have the index format because my data comes from an SQL database read into a dataframe, and the index field is needed to identify every record.

My json file is 2.5 GB; it had been exported from the dataframe with the orient='index' format:

df.to_json('test.json', orient='index')

This means that the whole file is actually one huge string, not a list-like collection of records:

{"0":{"Product":"Desktop Computer","Price":700},"1":{"Product":"Tablet","Price":250},"2":{"Product":"iPhone","Price":800},"3":{"Product":"Laptop","Price":1200}}

This means I can't use any line- or chunk-based iterative solution like this:

df = pd.read_json('test.json', orient='index', lines=True, chunksize=5)

According to the documentation, lines=True can only be used if the records are in a list-like format, which is why pandas.DataFrame.to_json does not even accept this argument unless orient='records'. The restriction on chunksize= comes from this as well; the docs say:

"This can only be passed if lines=True. If this is None, the file will be read into memory all at once."

And exactly this is the reason for the question: trying to read such a huge .json file gives back:

df = pd.read_json('test.json', orient='index')

File "C:\Users\Username\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\json\_json.py", line 1100, in _parse_no_numpy
    loads(json, precise_float=self.precise_float),
MemoryError

I was also thinking about adding the index values as a first column, in which case they wouldn't be lost with the records format; or maybe even storing the index list separately. My only fear is that it would decrease the search performance later on.

Is there any solution to handle this situation strictly using the .json file, without any other database or big-data based technology?

Update #1

As requested, here is the actual structure of my data. The SQL table:

          Serial           Date                   PatientID     Type Gender  YearWeek
0         425571118001461E 2011-06-30 20:59:30    186092        3    1.0     2011-w26
1         425571118001461E 2011-06-30 20:55:30    186092        3    1.0     2011-w26
2         425571118001461E 2013-08-28 09:29:30    186092        3    1.0     2013-w35
3         425571118001461E 2013-08-28 07:44:30    186092        3    1.0     2013-w35
4         425571118001461E 2013-08-27 20:44:30    186092        3    1.0     2013-w35
...                    ...                 ...       ...      ...    ...         ...
32290281  4183116300254921 2020-04-09 08:07:50    217553        8    2.0     2020-w15
32290282  4183116300254921 2020-04-08 10:29:50    217553        8    2.0     2020-w15
32290283  4141119420031548 2020-04-20 10:18:02    217555       12    2.0     2020-w17
32290284  4141119420043226 2020-04-20 12:33:11    217560       12    NaN     2020-w17
32290285  4141119420000825 2020-04-20 17:31:44    217568       12    1.0     2020-w17

The pandas pivot table is almost the same as in the example, but with 50,000 rows and 4,000 columns:

df = df.pivot_table(index='PatientID', values='Serial', columns='YearWeek', aggfunc=len, fill_value=0)

YearWeek  1969-w01  1969-w02  1969-w03  1969-w04  1969-w05  ...  2138-w17  2138-w18  2138-w19  2138-w20  2138-w21
PatientID
0                0         0         0         0         0  ...         0         0         0         0         0
455              1         0         3         0         0  ...         0         0         0         0         0
40036            0         0         0         0         0  ...         0         0         0         0         0
40070            0         0         0         0         0  ...         0         0         0         0         0
40082            0         0         0         0         0  ...         0         0         0         0         0
...            ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...
217559           0         0         0         0         0  ...         0         0         0         0         0
217560           0         0         0         0         0  ...         0         0         0         0         0
217561           0         0         0         0         0  ...         0         0         0         0         0
217563           0         0         0         0         0  ...         0         0         0         0         0
217568           0         1         0         2         0  ...         0         0         0         0         0

And this is how it is saved as an index-formatted json:

{
    "0":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...},
    "455":{"1969-w01":1,"1969-w02":0,"1969-w03":3,"1969-w04":0, ...},
    "40036":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...},
    ...
    "217568":{"1969-w01":0,"1969-w02":1,"1969-w03":0,"1969-w04":2, ...}
}

Only I could not pass the lines=True argument, so it is actually crammed into one huge string, making it a one-liner json:

{"0":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...},"455":{"1969-w01":1,"1969-w02":0,"1969-w03":3,"1969-w04":0, ...},"40036":{"1969-w01":0,"1969-w02":0,"1969-w03":0,"1969-w04":0, ...}, ... "217568":{"1969-w01":0,"1969-w02":1,"1969-w03":0,"1969-w04":2, ...}}


Answer

A few solutions, listed from easiest to most involved:

If you can perform the query on the DB, maybe the best solution would be to try writing the data in a nicer format. Alternatively, you could try reading directly from the database - pandas can do that too :) Here is the documentation for pd.read_sql().
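A minimal sketch of the read-from-database route, using an in-memory SQLite table as a stand-in for the real server (the table name and columns are hypothetical). Passing chunksize= makes read_sql return an iterator of DataFrames, so memory stays bounded even for very large tables:

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory DB standing in for the real SQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (PatientID INTEGER, Type INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?)", [(186092, 3), (217553, 8)])

# chunksize= turns the result into an iterator of DataFrames
parts = pd.read_sql("SELECT * FROM records", conn, chunksize=1)
df = pd.concat(parts, ignore_index=True)
```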

To read the JSON file as in your example and create a DataFrame in a form comparable to your pivot-table example (JSON keys as dataframe index), you can try this simple approach:

# read and transpose!
df = pd.read_json("test.json").T

However, this probably doesn't solve the memory issue.

Perhaps the fastest way would be to simply cut the large file into smaller files, each of which can be read into a pandas DataFrame (limiting the working memory required at any given time), and then pd.merge or pd.concat the resulting dataframes.

There is a nice tool in Linux called split, which could do it. I notice you are using Windows (newer Windows versions offer a Linux terminal if you enable it!). Otherwise perhaps there is a similar tool, but I'm afraid I don't know one.

If you only need to do this once and then get on with your life, you might be able to open the file with a text editor like Emacs or VS Code, then copy-paste portions into new files... lame, but might work ¯\_(ツ)_/¯

A package called ijson will iteratively load a JSON file, which allows you to define breaks or do processing on each nested division - you could then, for example, create the records format for pandas on the fly. This solution also promises low memory consumption, being an iterator (a.k.a. generator) - you will need to learn how it works though. Have a look here for a nice explanation.

Another package called json-streamer can also read partial JSON contents, although it is perhaps going a bit far, given that you have a static file.
