Strategy for splitting a large JSON file

Question

I'm trying to split very large JSON files into smaller files for a given array. For example:

{
    "headerName1": "headerVal1",
    "headerName2": "headerVal2",
    "headerName3": [{
        "element1Name1": "element1Value1"
    },
    {
        "element2Name1": "element2Value1"
    },
    {
        "element3Name1": "element3Value1"
    },
    {
        "element4Name1": "element4Value1"
    },
    {
        "element5Name1": "element5Value1"
    },
    {
        "element6Name1": "element6Value1"
    }]
}

...down to { "elementNName1": "elementNValue1" } where N is a large number

The user provides the name which represents the array to be split (in this example "headerName3") and the number of array objects per file, e.g. 1,000,000

This would result in N files, each containing the top-level name:value pairs (headerName1, headerName2) and up to 1,000,000 of the headerName3 objects.
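
For illustration (an assumption, not part of the original question), splitting the sample above with two objects per file would make the first output file look roughly like this:

{
    "headerName1": "headerVal1",
    "headerName2": "headerVal2",
    "headerName3": [{
        "element1Name1": "element1Value1"
    },
    {
        "element2Name1": "element2Value1"
    }]
}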

I'm using the excellent Newtonsoft JSON.net and understand that I need to do this using a stream.

So far I have looked at reading in JToken objects to establish where PropertyName == "headerName3" occurs in the token stream, but what I would like to do is then read in the entire JSON object for each object in the array, without having to keep parsing the JSON into JTokens.

Here's a snippet of the code I am building so far:

        using (StreamReader oSR = File.OpenText(strInput))
        {
            using (var reader = new JsonTextReader(oSR))
            {
                while (reader.Read())
                {
                    if (reader.TokenType == JsonToken.StartObject)
                    {
                        intObjectCount++;
                    }
                    else if (reader.TokenType == JsonToken.EndObject)
                    {
                        intObjectCount--;

                        if (intObjectCount == 1)
                        {
                            intArrayRecordCount++;
                            // Here I want to read the entire object for this record into an untyped JSON object

                            if( intArrayRecordCount % 1000000 == 0)
                            {
                                //write these to the split file
                            }
                        }
                    }
                }
            }
        }
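
One way to fill the gap at the comment above (a minimal sketch, assuming using Newtonsoft.Json.Linq; is in scope, and not necessarily the final code) is to load each array element into an untyped JObject as soon as its StartObject token is reached, instead of counting EndObject tokens:

    // Sketch: for a layout of root object -> property -> array -> element,
    // each array element's StartObject token appears at reader.Depth == 2.
    if (reader.TokenType == JsonToken.StartObject && reader.Depth == 2)
    {
        // JObject.Load() consumes the entire object, nested tokens included,
        // and leaves the reader positioned on that object's EndObject token.
        JObject record = JObject.Load(reader);
        // Buffer 'record' and flush the batch to a split file every 1,000,000 records.
    }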

I don't know - and in fact am not concerned with - the structure of the JSON itself, and the objects can have varying structures within the array. I am therefore not serializing to classes.

Is this the right approach? Is there a set of methods in the JSON.net library I can easily use to perform such an operation?

Any help appreciated.

Answer

You can use JsonWriter.WriteToken(JsonReader reader, true) to stream individual array entries and their descendants from a JsonReader to a JsonWriter. You can also use JProperty.Load(JsonReader reader) and JProperty.WriteTo(JsonWriter writer) to read and write entire properties and their descendants.

Using these methods, you can create a state machine that parses the JSON file, iterates through the root object, loads "prefix" and "postfix" properties, splits the array property, and writes the prefix, array slice, and postfix properties out to new file(s).

Here's a prototype implementation that takes a TextReader and a callback function to create sequential output TextWriter objects for the split files:

    enum SplitState
    {
        InPrefix,
        InSplitProperty,
        InSplitArray,
        InPostfix,
    }

    public static void SplitJson(TextReader textReader, string tokenName, long maxItems, Func<int, TextWriter> createStream, Formatting formatting)
    {
        List<JProperty> prefixProperties = new List<JProperty>();
        List<JProperty> postFixProperties = new List<JProperty>();
        List<JsonWriter> writers = new List<JsonWriter>();

        SplitState state = SplitState.InPrefix;
        long count = 0;

        try
        {
            using (var reader = new JsonTextReader(textReader))
            {
                bool doRead = true;
                while (doRead ? reader.Read() : true)
                {
                    doRead = true;
                    if (reader.TokenType == JsonToken.Comment || reader.TokenType == JsonToken.None)
                        continue;
                    if (reader.Depth == 0)
                    {
                        if (reader.TokenType != JsonToken.StartObject && reader.TokenType != JsonToken.EndObject)
                            throw new JsonException("JSON root container is not an Object");
                    }
                    else if (reader.Depth == 1 && reader.TokenType == JsonToken.PropertyName)
                    {
                        if ((string)reader.Value == tokenName)
                        {
                            state = SplitState.InSplitProperty;
                        }
                        else
                        {
                            if (state == SplitState.InSplitProperty)
                                state = SplitState.InPostfix;
                            var property = JProperty.Load(reader);
                            doRead = false; // JProperty.Load() will have already advanced the reader.
                            if (state == SplitState.InPrefix)
                            {
                                prefixProperties.Add(property);
                            }
                            else
                            {
                                postFixProperties.Add(property);
                            }
                        }
                    }
                    else if (reader.Depth == 1 && reader.TokenType == JsonToken.StartArray && state == SplitState.InSplitProperty)
                    {
                        state = SplitState.InSplitArray;
                    }
                    else if (reader.Depth == 1 && reader.TokenType == JsonToken.EndArray && state == SplitState.InSplitArray)
                    {
                        state = SplitState.InSplitProperty;
                    }
                    else if (state == SplitState.InSplitArray && reader.Depth == 2)
                    {
                        if (count % maxItems == 0)
                        {
                            var writer = new JsonTextWriter(createStream(writers.Count)) { Formatting = formatting };
                            writers.Add(writer);
                            writer.WriteStartObject();
                            foreach (var property in prefixProperties)
                                property.WriteTo(writer);
                            writer.WritePropertyName(tokenName);
                            writer.WriteStartArray();
                        }
                        count++;
                        writers.Last().WriteToken(reader, true);
                    }
                    else
                    {
                        throw new JsonException("Internal error");
                    }
                }
            }
            foreach (var writer in writers)
                using (writer)
                {
                    writer.WriteEndArray();
                    foreach (var property in postFixProperties)
                        property.WriteTo(writer);
                    writer.WriteEndObject();
                }
        }
        finally
        {
            // Make sure files are closed in the event of an exception.
            foreach (var writer in writers)
                using (writer)
                {
                }

        }
    }

This method leaves all the files open until the end in case "postfix" properties, appearing after the array property, need to be appended. Be aware that there is a limit of 16384 open files at one time, so if you need to create more split files, this won't work. If postfix properties are never encountered in practice, you can just close each file before opening the next and throw an exception in case any postfix properties are found. Otherwise you may need to parse the large file in two passes or close and reopen the split files to append them.
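
A rough sketch of that close-as-you-go variant (my assumption, not part of the original answer): inside SplitJson, the count % maxItems == 0 branch could seal and dispose the previous writer before opening the next file:

    if (count % maxItems == 0)
    {
        if (writers.Count > 0)
        {
            // Finish and close the previous split file; no postfix properties
            // can be appended later, so its root object can be sealed here.
            var previous = writers[writers.Count - 1];
            previous.WriteEndArray();
            previous.WriteEndObject();
            previous.Close();   // also closes the underlying TextWriter by default
        }
        var writer = new JsonTextWriter(createStream(writers.Count)) { Formatting = formatting };
        writers.Add(writer);
        writer.WriteStartObject();
        foreach (var property in prefixProperties)
            property.WriteTo(writer);
        writer.WritePropertyName(tokenName);
        writer.WriteStartArray();
    }

With this change, the final loop over writers only needs to finish the last, still-open writer, and the property-loading branch should throw a JsonException once state reaches SplitState.InPostfix.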

Here is an example of how to use the method with an in-memory JSON string:

    private static void TestSplitJson(string json, string tokenName)
    {
        var builders = new List<StringBuilder>();
        using (var reader = new StringReader(json))
        {
            SplitJson(reader, tokenName, 2, i => { builders.Add(new StringBuilder()); return new StringWriter(builders.Last()); }, Formatting.Indented);
        }
        foreach (var s in builders.Select(b => b.ToString()))
        {
            Console.WriteLine(s);
        }
    }

Prototype fiddle.
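
To apply the method to a large file on disk rather than an in-memory string, the callback can create sequential output files (a sketch; the paths and naming pattern are only illustrative):

    using (var reader = File.OpenText(@"c:\temp\input.json"))
    {
        SplitJson(reader, "headerName3", 1000000,
            i => File.CreateText(string.Format(@"c:\temp\split{0}.json", i)),
            Formatting.Indented);
    }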
