Way to read (or edit) big JSON from / to stream


Problem description

(Already answered - at least 3 solutions are left here instead of the original question.)

I had been trying to parse & split a big JSON file, but did not want to modify its content. Floating-point conversion changed the numbers until FloatParseHandling was changed.
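The effect of FloatParseHandling can be seen in a minimal round-trip sketch, assuming Newtonsoft.Json is referenced (the FloatDemo class and RoundTrip helper are illustrative names, not part of the original code):

```csharp
using System;
using System.IO;
using Newtonsoft.Json;

public static class FloatDemo
{
    // Copies tokens from reader to writer one by one, the same way the
    // splitting loop below does, and returns the re-serialized JSON.
    public static string RoundTrip(string json, FloatParseHandling handling)
    {
        var sw = new StringWriter();
        using (var writer = new JsonTextWriter(sw))
        using (var reader = new JsonTextReader(new StringReader(json)))
        {
            reader.FloatParseHandling = handling;
            while (reader.Read())
                writer.WriteToken(reader, false); // current token only, no children
        }
        return sw.ToString();
    }

    public static void Main()
    {
        // Double may re-serialize 0.0000001 in scientific notation;
        // Decimal keeps the original digits.
        Console.WriteLine(RoundTrip("[0.0000001]", FloatParseHandling.Double));
        Console.WriteLine(RoundTrip("[0.0000001]", FloatParseHandling.Decimal)); // [0.0000001]
    }
}
```

With Decimal the output bytes match the input, which is what makes the binary verification below possible.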

A similar loop can split a 1/4 GB JSON file on my machine in 40 s using only 14 MB of RAM, compared to 30 s / 5-7 GB for the common Stream.ReadToEnd -> run out of (or exhaust) free RAM -> crash or "stop" approach.

I also wanted to verify the results by binary comparison, but a lot of the numbers had changed, until:
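The binary verification itself can be sketched as a byte-wise file check (BinaryCompare and FirstDifference are hypothetical names; this assumes FileStream delivers full buffers, which holds for local files):

```csharp
using System;
using System.IO;

public static class BinaryCompare
{
    // Returns the offset of the first differing byte, or -1 when both
    // files are byte-for-byte identical.
    public static long FirstDifference(string pathA, string pathB)
    {
        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            var bufA = new byte[81920];
            var bufB = new byte[81920];
            long offset = 0;
            while (true)
            {
                int readA = a.Read(bufA, 0, bufA.Length);
                int readB = b.Read(bufB, 0, bufB.Length);
                int n = Math.Min(readA, readB);
                for (int i = 0; i < n; i++)
                    if (bufA[i] != bufB[i]) return offset + i;
                offset += n;
                if (readA != readB) return offset; // one file ended early
                if (readA == 0) return -1;         // both at EOF, no difference
            }
        }
    }
}
```

A return value other than -1 is the offset of the first mismatch, which is handy for locating exactly which number the round-trip changed.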

jsonReader.FloatParseHandling = FloatParseHandling.Decimal;

using System.IO;
using Newtonsoft.Json; // intentionally ugly - complete working code

long batchSize = 500000, start = 0, end = 0, pos = 0;
bool neverEnd = true;
while (neverEnd) {
  end = start + batchSize - 1;
  var sr = new StreamReader(File.Open("bigOne.json", FileMode.Open, FileAccess.Read));
  var sw = new StreamWriter(new FileStream(@"PartNo" + start + ".json", FileMode.Create));
  using (JsonWriter writer = new JsonTextWriter(sw))
  using (var jsonR = new JsonTextReader(sr)) {
    jsonR.FloatParseHandling = FloatParseHandling.Decimal;
    while (neverEnd) {
      neverEnd &= jsonR.Read();
      if (jsonR.TokenType == JsonToken.StartObject
       && jsonR.Path.IndexOf("BigArrayPathStart") == 0) { // batters[0] ... batters[3]
        if (pos > end) break;
        if (pos++ < start) {
          do { jsonR.Read(); } while (jsonR.TokenType != JsonToken.EndObject);
          continue;
        }
      }

      if (jsonR.TokenType >= JsonToken.PropertyName){ writer.WriteToken(jsonR); }
      else if (jsonR.TokenType == JsonToken.StartObject) { writer.WriteStartObject(); }
      else if (jsonR.TokenType == JsonToken.StartArray) { writer.WriteStartArray(); }
      else if (jsonR.TokenType == JsonToken.StartConstructor) {
          writer.WriteStartConstructor(jsonR.Value.ToString());
      }
    }
    start = pos; pos = 0;
  }
}
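To sanity-check a produced part, the same token-walking idea can count the items that landed in each PartNoX.json (SplitCheck and CountItems are illustrative names; "batters" matches the sample array path used above, and nested objects inside an item would be counted too, the same caveat as in the loop above):

```csharp
using System;
using System.IO;
using Newtonsoft.Json;

public static class SplitCheck
{
    // Counts StartObject tokens whose path begins with arrayPath,
    // i.e. the objects sitting directly in the target array.
    public static int CountItems(TextReader input, string arrayPath)
    {
        int count = 0;
        using (var jsonR = new JsonTextReader(input))
        {
            jsonR.FloatParseHandling = FloatParseHandling.Decimal; // keep numbers intact
            while (jsonR.Read())
                if (jsonR.TokenType == JsonToken.StartObject
                 && jsonR.Path.IndexOf(arrayPath) == 0)
                    count++;
        }
        return count;
    }
}
// e.g. SplitCheck.CountItems(new StreamReader("PartNo0.json"), "batters")
```

If each part reports batchSize items (and the last one the remainder), the split covered the whole source array.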


Recommended answer

Gason translated to C# is probably the quickest parser in the C# language now; its speed is similar to the C++ version's (in a Debug build; 2x slower than its Release build), with memory consumption about 2x bigger: https://github.com/eltomjan/gason

(Disclaimer: I am affiliated with the C# fork of Gason.)

The parser has an experimental feature - exit after parsing a predefined number of rows in the last array, and next time continue after the last item with the next batch:

using System;
using System.IO;
using System.Text;
using Gason;

int endPos = -1;
JsonValue jsn;
Byte[] raw;

String json = @"{""id"":""0001"",""type"":""donut"",""name"":""Cake"",""ppu"":0.55, 
  ""batters"": [ { ""id"": ""1001"", ""type"": ""Regular"" },
                 { ""id"": ""1002"", ""type"": ""Chocolate"" },
                 { ""id"": ""1003"", ""type"": ""Blueberry"" }, 
                 { ""id"": ""1004"", ""type"": ""Devil's Food"" } ]
  }";
raw = Encoding.UTF8.GetBytes(json);
ByteString[] keys = new ByteString[]
{
    new ByteString("batters"),
    null
};
Parser jsonParser = new Parser(true); // FloatAsDecimal (,JSON stack array size=32)
jsonParser.Parse(raw, ref endPos, out jsn, keys, 2, 0, 2); // batters / null path...
ValueWriter wr = new ValueWriter(); // read only 1st 2
using (StreamWriter sw = new StreamWriter(Console.OpenStandardOutput()))
{
    sw.AutoFlush = true;
    wr.DumpValueIterative(sw, jsn, raw);
}
jsonParser.Parse(raw, ref endPos, out jsn, keys, 2, endPos, 2); // and now the following 2
using (StreamWriter sw = new StreamWriter(Console.OpenStandardOutput()))
{
    sw.AutoFlush = true;
    wr.DumpValueIterative(sw, jsn, raw);
}

It is a quick and simple option for splitting long JSONs now - a whole 1/4 GB file with <18 million rows in the main array in <5.3 s on a quick machine (Debug build) using <950 MB of RAM, where Newtonsoft.Json consumed >30 s / 5.36 GB. If parsing only the first 100 rows: <330 ms, >250 MB of RAM.

In a Release build it is even better: <3.2 s where Newtonsoft spent >29.3 s (>10.8x better performance).

1st Parse:
{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters": [
    {
      "id": "1001",
      "type": "Regular"
    },
    {
      "id": "1002",
      "type": "Chocolate"
    }
  ]
}
2nd Parse:
{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters": [
    {
      "id": "1003",
      "type": "Blueberry"
    },
    {
      "id": "1004",
      "type": "Devil's Food"
    }
  ]
}
