How to properly iterate through a big json file

Question

Dear Stackoverflow community,

I have a 34 GB JSON file containing a lot of data. I tried to import it into MongoDB using mongoimport --file file.json, but it failed: the file is too big and the import threw a memory error. Is it possible to iterate through the file with a cursor in PHP code? I have zero experience with this, but someone told me it should be possible. I want to know how the file is structured, but I do not know how to view a sample record from it. From the source I could get an example record:

{
     "_id": ObjectId("53b29644aafd413977b23b7e"),
     "summonerId": NumberLong(24570940),
     "region": "euw",
     "updatedAt": NumberLong(1404212804),
     "season": NumberLong(4),
     "stats": {
         "110": {
             "totalSessionsPlayed": NumberLong(3),
             "totalSessionsLost": NumberLong(2),
             "totalSessionsWon": NumberLong(1),
             "totalChampionKills": NumberLong(34),
             "totalDamageDealt": NumberLong(415051),
             "totalDamageTaken": NumberLong(63237),
             "mostChampionKillsPerSession": NumberLong(12),
             "totalMinionKills": NumberLong(538),
             "totalDoubleKills": NumberLong(5),
             "totalTripleKills": NumberLong(1),
             "totalDeathsPerSession": NumberLong(18),
             "totalGoldEarned": NumberLong(40977),
             "totalTurretsKilled": NumberLong(6),
             "totalPhysicalDamageDealt": NumberLong(381668),
             "totalMagicDamageDealt": NumberLong(31340),
             "totalAssists": NumberLong(25),
             "maxChampionsKilled": NumberLong(12),
             "maxNumDeaths": NumberLong(10)
         }
     }
 }

The stats field contains more entries; 110 is just an example. How can I iterate through this big file, or how can I import it into MongoDB? For example, I want to echo summonerId, championId (which is 110 in this case), and totalSessionsPlayed. It has to keep looping until there is no championId left for that particular summonerId.

To restate: a summonerId has a list of champions played over that player's career. The champions are the keys under stats (110 in this example). Every summonerId can have multiple champions, and I want all of them, along with how many times each was played (totalSessionsPlayed) by that summonerId.

Solution

You'll want to use a streaming parser. These only pull small portions of your file into memory at a time.
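To illustrate the "small portions at a time" idea without any library, here is a minimal sketch that assumes the file holds one JSON document per line, which may not be how your file is laid out. The function name readDocuments and the temporary demo file are made up for illustration, and JSON_THROW_ON_ERROR assumes PHP 7.3 or later. Note that the ObjectId(...) and NumberLong(...) wrappers in the question's sample are not plain JSON, so json_decode would reject such lines as they stand.

```php
<?php
// Lazily yields one decoded document per line, so only the current line
// is held in memory. Assumption: newline-delimited JSON input.
function readDocuments(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // skip blank lines
            }
            yield json_decode($line, true, 512, JSON_THROW_ON_ERROR);
        }
    } finally {
        fclose($handle); // runs when iteration ends or an error is thrown
    }
}

// Demo with a small temporary file standing in for the real data.
$path = tempnam(sys_get_temp_dir(), 'ndjson');
file_put_contents($path, implode("\n", [
    '{"summonerId": 24570940, "stats": {"110": {"totalSessionsPlayed": 3}, "112": {"totalSessionsPlayed": 45}}}',
    '{"summonerId": 555555, "stats": {"42": {"totalSessionsPlayed": 65}}}',
]) . "\n");

$rows = [];
foreach (readDocuments($path) as $doc) {
    foreach ($doc['stats'] as $championId => $stats) {
        $rows[] = "{$doc['summonerId']},$championId,{$stats['totalSessionsPlayed']}";
    }
}
unlink($path);

echo implode("\n", $rows), "\n";
```

If the file is instead one huge array or a single object, reading by line won't help, and that is where the streaming parsers come in.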

They come in a couple of different flavors: SAX-like push parsers and pull parsers. The article "XML reader models: SAX versus XML pull parser" gives an overview of the difference.
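The difference in control flow can be sketched with a toy token source. Nothing here is a real JSON parser: tokens() and pushParse() are hypothetical stand-ins, only meant to show who drives the loop in each style.

```php
<?php
// Toy token source standing in for a real JSON tokenizer (hypothetical).
function tokens(): Generator
{
    yield from ['summonerId', 24570940, 'region', 'euw'];
}

// Push style: the parser owns the loop and calls back into your code.
function pushParse(callable $onToken): void
{
    foreach (tokens() as $token) {
        $onToken($token);
    }
}

$pushed = [];
pushParse(function ($token) use (&$pushed) {
    $pushed[] = $token; // you only react to what you are handed
});

// Pull style: your code owns the loop and asks for the next token.
$pulled = [];
$source = tokens();
while ($source->valid()) {
    $pulled[] = $source->current();
    $source->next(); // you decide when to advance, or when to stop
}
```

Both styles see the same tokens in the same order. With push you keep state between callbacks yourself; with pull you can keep it in ordinary local variables.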


Push Parser

This is a quick example using salsify/json-streaming-parser.

As it rolls through the file we'll keep track of the summonerId, championId, and state. It's all event-based - you don't get random access with a sequential parser so you have to keep track of things yourself. Every time a totalSessionsPlayed comes up it'll echo out the summonerId, championId, and totalSessionsPlayed.


data.json

This is a pared-down json file for demonstration purposes.

[
    {
        "_id": "53b29644aafd413977b23b7e",
        "summonerId": 24570940,
        "region": "euw",
        "stats": {
            "110": {
                "totalSessionsPlayed": 3,
                "totalSessionsLost": 2,
                "totalSessionsWon": 1
            },
            "112": {
                "totalSessionsPlayed": 45,
                "totalSessionsLost": 2,
                "totalSessionsWon": 1
            }
        }
    },
    {
        "_id": "asdfasdfasdf",
        "summonerId": 555555,
        "region": "euw",
        "stats": {
            "42": {
                "totalSessionsPlayed": 65,
                "totalSessionsLost": 2,
                "totalSessionsWon": 1
            },
            "88": {
                "totalSessionsPlayed": 99,
                "totalSessionsLost": 2,
                "totalSessionsWon": 1
            }
        }
    }
]

Example:

class ListMatchUps extends JsonStreamingParser\Listener\IdleListener
{

    private $key;
    private $summonerId;
    private $championId;
    private $inStats;

    public function start_document()
    {
        $this->key        = null;
        $this->summonerId = null;
        $this->championId = null;
        $this->inStats    = false;
    }

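    public function start_object()
    {
        // The object that follows the "stats" key holds the per-champion entries.
        if ($this->key === 'stats') {
            $this->inStats = true;
        } else if ($this->inStats) {
            // Inside stats, each nested object is keyed by a champion ID.
            $this->championId = $this->key;
        }
    }

    public function end_object()
    {
        // Unwind one level per closing brace: champion entry, then stats, then the record.
        if ($this->championId !== null) {
            $this->championId = null;
        } else if ($this->inStats) {
            $this->inStats = false;
        } else {
            $this->summonerId = null;
        }
    }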

    public function key($key)
    {
        $this->key = $key;
    }

    public function value($value)
    {
        switch ($this->key) {
            case 'summonerId':
                $this->summonerId = $value;
                break;
            case 'totalSessionsPlayed':
                echo "{$this->summonerId},{$this->championId},$value\n";
                break;
        }
    }
}

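$stream = fopen('data.json', 'r');
$listener = new ListMatchUps();
try {
    // Class name as given here; namespaced releases of the library expose it as
    // JsonStreamingParser\Parser, so match whichever version is installed.
    $parser = new JsonStreamingParser_Parser($stream, $listener);
    $parser->parse();
} finally {
    // Close the stream whether parsing succeeds or throws.
    fclose($stream);
}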

Output:

24570940,110,3
24570940,112,45
555555,42,65
555555,88,99


Pull Parser

This uses a parser I recently wrote, pcrov/jsonreader (requires PHP 7).

Same data.json as above.

Example:

use pcrov\JsonReader\JsonReader;

$reader = new JsonReader();
$reader->open("data.json");

while($reader->read("summonerId")) {
    $summonerId = $reader->value();
    $reader->next("stats");
    foreach($reader->value() as $championId => $stats) {
        echo "$summonerId, $championId, {$stats['totalSessionsPlayed']}\n";
    }
}
$reader->close();

Output:

24570940, 110, 3
24570940, 112, 45
555555, 42, 65
555555, 88, 99
