Filter invalid values in a JSON string


Problem description

I'm getting a string in an HTML body that I am trying to process into valid JSON. The string I receive isn't valid JSON and has the following form:

äÄ
    "key1": "  10",
    "key2": "beigef}gtem Zahlschein",
    "key3": "     G E L \ S C H T",
    "key4": "M}nchen",
    "key5": "M{rz",
    "key6": "[huus"
Ü
ä

I've written a function that replaces all the faulty characters to create a valid JSON string, but how do I do the reverse without destroying the characters that the JSON syntax itself needs?

This is how I replaced the characters:

private static string FixChars(string input)
    {
        if (!string.IsNullOrEmpty(input))
        {
            if (input.Contains("["))
            {
                input = input.Replace("[", "Ä");
            }
            if (input.Contains(@"\"))
            {
                input = input.Replace(@"\", "Ö");
            }
            if (input.Contains("]"))
            {
                input = input.Replace("]", "Ü");
            }
            if (input.Contains("{"))
            {
                input = input.Replace("{", "ä");
            }
            if (input.Contains("|"))
            {
                input = input.Replace("|", "ö");
            }
            if (input.Contains("}"))
            {
                input = input.Replace("}", "ü");
            }
            if (input.Contains("~"))
            {
                input = input.Replace("~", "ß");
            }
            // DS_Stern caused problems when creating the XML
            //if (input.Contains("*"))
            //{
            //    input = input.Replace("*", "Stern");
            //}
        }
        return input;
    }

Then I tried to deserialize the JSON array into an array of dictionaries like this:

deserializedRequest = JsonConvert.DeserializeObject<Dictionary<string, string>[]>(json);

How do I access the individual dictionaries, apply my FixChars method to the values, and reserialize a valid JSON string from the result?

Encoding via IBM273 and decoding via IBM037 works well enough to produce a valid JSON string, but the result still contains a minor error: the character 'ö' comes out as '|' with that pairing.

Recommended answer

It looks as though the HTML page containing your JSON was encoded into a byte stream on your Unisys A-Series machine (cobol74) using one encoding and then decoded by your code using a different one, which caused some characters to be remapped or lost. To fix the problem, you need to determine the original encoding used on the Unisys machine and decode the HTML stream with it. What makes this a little more complicated is that we also don't know which encoding .NET chose when it decoded the HTML.
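To make the mechanism concrete, here is a minimal sketch of how such a mismatch arises and how decoding the raw bytes with the sender's code page avoids it. The class and method names (`EncodingMismatchDemo`, `Mangle`, `DecodeBody`) are made up for illustration, and the code page names are only the candidates identified later in this answer, not a confirmed description of your system. The `RegisterProvider` call is an assumption about your runtime: it is needed on .NET Core and .NET 5+, while on .NET Framework the EBCDIC code pages are available without it.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Encodes text with one code page and decodes the resulting bytes with
    // another, reproducing the kind of mismatch suspected in the question.
    public static string Mangle(string text, string encodeWith, string decodeWith)
    {
        // Assumption: running on .NET Core / .NET 5+, where the EBCDIC code
        // pages are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] bytes = Encoding.GetEncoding(encodeWith).GetBytes(text);
        return Encoding.GetEncoding(decodeWith).GetString(bytes);
    }

    // Decodes the raw response bytes directly with the sender's code page,
    // so no wrong intermediate decoding ever happens.
    public static string DecodeBody(byte[] rawBody, string senderEncoding)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(senderEncoding).GetString(rawBody);
    }
}
```

For example, `EncodingMismatchDemo.Mangle("München", "IBM273", "IBM870")` should reproduce the `M}nchen` seen in the question if that pair is indeed the one involved. If you can get at the raw response bytes instead of an already-decoded string, `DecodeBody` is the cleaner route.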

One way to find out is to take a sample of the expected JSON, then encode and decode it using every possible pair of encodings available in .NET. If a pair of encodings reproduces the incorrect result you are seeing, the encoding used for the encoding step may be the one used on the Unisys machine. By reversing that transformation you may then be able to fix your string, assuming no characters were dropped along the way.

The following code does this test:

var correctString = "{}[]";
var observedString = "äüÄÜ";

int count = 0;
foreach (var toEncoding in Encoding.GetEncodings())
    foreach (var fromEncoding in Encoding.GetEncodings())
    {
        var s = toEncoding.GetEncoding().GetString(fromEncoding.GetEncoding().GetBytes(correctString));
        if (s == observedString)
        {
            Console.WriteLine(string.Format("Match Found: Encoding via {0} and decoding via {1}", fromEncoding.Name, toEncoding.Name));
            count++;
        }
    }
Console.WriteLine("Found {0} matches", count);

This produces 147 matches, including a number of pairs of EBCDIC encodings. For the full list, see this fiddle.

Next, let's try to cut down on the matches by testing the full JSON string:

var correctJson = @"{[
    ""key1"": ""  10"",
    ""key2"": ""beigefügtem Zahlschein"",
    ""key3"": ""     G E L Ö S C H T"",
    ""key4"": ""München"",
    ""key5"": ""März"",
    ""key6"": ""Ähuus"",
    ""key7"": ""ö"",
    ""key8"": ""ß"",
]
{";
var observedJson = @"äÄ
    ""key1"": ""  10"",
    ""key2"": ""beigef}gtem Zahlschein"",
    ""key3"": ""     G E L \ S C H T"",
    ""key4"": ""M}nchen"",
    ""key5"": ""M{rz"",
    ""key6"": ""[huus"",
    ""key7"": ""|"",
    ""key8"": ""~"",
Ü
ä";

int count = 0;
foreach (var toEncoding in Encoding.GetEncodings())
    foreach (var fromEncoding in Encoding.GetEncodings())
    {
        var s = toEncoding.GetEncoding().GetString(fromEncoding.GetEncoding().GetBytes(correctJson));
        if (s == observedJson)
        {
            Console.WriteLine(string.Format("Match Found: Encoding via {0} and decoding via {1}", fromEncoding.Name, toEncoding.Name));
            count++;
        }
    }
Console.WriteLine("Found {0} matches", count);

This produces just 2 EBCDIC matches:

Match Found: Encoding via IBM01141 and decoding via IBM870
Match Found: Encoding via IBM273 and decoding via IBM870

So one of these is almost certainly the correct pair of encodings. But which one? According to Wikipedia:

CCSID 1141 is the Euro currency update of code page/CCSID 273. In that code page, the "¤" (currency) character at code point 9F is replaced with the "€" (Euro) character.

So to narrow down the encoding to a single choice, you'll need to test a sample that contains the "€" character.
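If you want to check for yourself where the two candidate code pages differ, a small sketch like the following lists every byte value that they decode differently. The names `CodePageDiff` and `DifferingBytes` are hypothetical, and the `RegisterProvider` call again assumes .NET Core or .NET 5+.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CodePageDiff
{
    // Returns every byte value that the two single-byte encodings
    // decode to different characters.
    public static List<byte> DifferingBytes(string nameA, string nameB)
    {
        // Assumption: needed on .NET Core / .NET 5+ to make the
        // EBCDIC code pages available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding a = Encoding.GetEncoding(nameA);
        Encoding b = Encoding.GetEncoding(nameB);

        var result = new List<byte>();
        for (int value = 0; value <= 255; value++)
        {
            byte[] single = { (byte)value };
            if (a.GetString(single) != b.GetString(single))
            {
                result.Add((byte)value);
            }
        }
        return result;
    }
}
```

Calling `CodePageDiff.DifferingBytes("IBM273", "IBM01141")` and printing each entry with the `X2` format shows which byte values matter. Going by the Wikipedia description, 9F should be among them, so any sample containing that byte is enough to tell the two apart.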

Then, if I add the following extension method:

public static class TextExtensions
{
    public static string Reencode(this string s, Encoding toEncoding, Encoding fromEncoding)
    {
        return toEncoding.GetString(fromEncoding.GetBytes(s));
    }
}

I can fix your JSON by doing:

var fixedJson = observedJson.Reencode(Encoding.GetEncoding("IBM01141"), Encoding.GetEncoding("IBM870"));
Console.WriteLine(fixedJson);
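This repair is only trustworthy if no characters were dropped during either conversion. By default, .NET encodings silently substitute a replacement character such as '?' for anything they cannot convert. As a hedged variation on `Reencode`, the sketch below requests exception fallbacks so that a lossy conversion fails loudly instead. `StrictReencoder`, `ReencodeStrict`, and `TryReencode` are illustrative names, not part of any library, and the `RegisterProvider` call assumes .NET Core or .NET 5+.

```csharp
using System;
using System.Text;

public static class StrictReencoder
{
    // Same transformation as Reencode, but throws instead of silently
    // substituting a replacement character when a conversion is lossy.
    public static string ReencodeStrict(string s, string toName, string fromName)
    {
        // Assumption: needed on .NET Core / .NET 5+ for the EBCDIC code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding from = Encoding.GetEncoding(
            fromName, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding to = Encoding.GetEncoding(
            toName, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        return to.GetString(from.GetBytes(s));
    }

    // Returns false (and a null result) when the text cannot be converted
    // without losing characters.
    public static bool TryReencode(string s, string toName, string fromName, out string result)
    {
        try
        {
            result = ReencodeStrict(s, toName, fromName);
            return true;
        }
        catch (EncoderFallbackException)
        {
            result = null;
            return false;
        }
        catch (DecoderFallbackException)
        {
            result = null;
            return false;
        }
    }
}
```

With this in place, `StrictReencoder.TryReencode(observedJson, "IBM01141", "IBM870", out var fixedJson)` either yields the repaired text or tells you that the input contains characters the assumed code page cannot represent, which would suggest the wrong pair of encodings was picked.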
