C# Split a string with mixed language into different language chunks

Problem description

I am trying to solve a problem where I have a string with mixed language as input.

E.g. "Hyundai Motor Company 현대자동차 现代 Some other English words"

And I want to split the string into different language chunks.

E.g. ["Hyundai Motor Company", "현대자동차", "现代", "Some other English words"]

OR (Space/Punctuation marks and order do not matter)

["HyundaiMotorCompany", "현대자동차", "现代", "SomeotherEnglishwords"]

Is there an easy way to solve this problem? Or any assembly/nuget package I can use?

Thanks

Edit: I realized that my term "language chunks" is ambiguous. What I mean by "language chunks" is language character sets.

For example "Hyundai Motor Company" is in English character set, "현대자동차" in Korean set, "现代" in Chinese set, "Some other English words" in English set.

Additions to clarify the requirements of my problem:

1: The input can have spaces or any other punctuation marks, but I can always use regular expressions to ignore them.

2: I will pre-process the input to remove diacritics, so "å" becomes "a" in my input. All English-like characters will therefore become English characters.

What I really want is to find a way to parse the inputs into different language-character sets ignoring spaces & punctuation marks.

E.g. From "HyundaiMotorCompany현대자동차现代SomeotherEnglishwords"

To ["HyundaiMotorCompany", "현대자동차", "现代", "SomeotherEnglishwords"]

Solution

Language chunks can be defined by using UNICODE blocks. The current list of UNICODE blocks is available at ftp://www.unicode.org/Public/UNIDATA/Blocks.txt. Here is an excerpt from the list:

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
0180..024F; Latin Extended-B
0250..02AF; IPA Extensions
02B0..02FF; Spacing Modifier Letters
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplement

The idea is to classify the characters using the UNICODE block. Consecutive characters belonging to the same UNICODE block define a language chunk.
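
As a rough illustration of block-based classification, and not the approach developed in the rest of this answer, .NET regular expressions already expose a number of Unicode blocks through the `\p{IsBlockName}` syntax. The sketch below assumes the input contains only ASCII letters, Hangul syllables and CJK unified ideographs, as in the question's pre-processed example; the class name `NamedBlockSplitter` is made up for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NamedBlockSplitter
{
    // Assumption: only these three kinds of text matter. Any character that
    // matches none of the alternatives (space, punctuation, other scripts)
    // is skipped and also ends the current chunk.
    static readonly Regex ChunkPattern = new Regex(
        @"[A-Za-z]+|\p{IsHangulSyllables}+|\p{IsCJKUnifiedIdeographs}+");

    // Returns each maximal run of characters that belong to one alternative.
    public static List<string> Split(string text) =>
        ChunkPattern.Matches(text).Cast<Match>().Select(m => m.Value).ToList();
}
```

Because skipped characters also end a chunk, this only produces whole-language chunks when spaces and punctuation have been removed beforehand. It also does not scale to arbitrary scripts, which is why the code later in this answer reads the full block list instead.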

The first problem with this definition is that what you might consider a single script (or language) can span several blocks, like Cyrillic and Cyrillic Supplement. To handle this you can merge blocks whose names contain the same script name, so that all Latin blocks are merged into a single Latin script, and so on.

However, this creates several new problems:

  1. Should the blocks Greek and Coptic, Coptic, and Greek Extended be merged into a single script, or should you try to make a distinction between Greek and Coptic script?
  2. You should probably merge all the CJK blocks. However, because these blocks contain both Chinese as well as Kanji (Japanese) and Hanja (Korean) characters you will not be able to distinguish between these scripts when CJK characters are used.

Assuming that you have a plan for how to use UNICODE blocks to classify characters into scripts, you then have to decide how to handle spacing and punctuation. The space character and several forms of punctuation belong to the Basic Latin block. However, other blocks may also contain non-letter characters.

A strategy for dealing with this is to "ignore" the UNICODE block of non-letter characters but still include them in chunks. In your example you have two non-Latin chunks that happen not to contain space or punctuation, but many scripts, e.g. Cyrillic, use space the same way the Latin script does. Even though a space is classified as Latin, you still want a sequence of Cyrillic words separated by spaces to be considered a single Cyrillic chunk, instead of a Cyrillic word followed by a Latin space and then another Cyrillic word, etc.

Finally, you need to decide how to handle numbers. You can treat them as space and punctuation or classify them as the block they belong to, e.g. Latin digits are Latin while Devanagari digits are Devanagari etc.
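
The two options for digits can be sketched as a small predicate. This is only an illustration; `DigitHandling` and `CharacterKind` are hypothetical names that do not appear in the code below. Note that `char.IsLetterOrDigit`, which the code below uses, corresponds to the second option (digits keep their own block).

```csharp
public enum DigitHandling
{
    AsSeparator, // treat digits like space and punctuation
    AsOwnBlock   // classify digits by the block they belong to
}

public static class CharacterKind
{
    // Returns true when the character should be treated like space or
    // punctuation, i.e. when its own Unicode block should be ignored.
    public static bool IsNeutral(char c, DigitHandling digits)
    {
        if (char.IsLetter(c))
            return false;
        if (char.IsDigit(c)) // decimal digits of any script, e.g. '7' or Devanagari digits
            return digits == DigitHandling.AsSeparator;
        return true; // space, punctuation, symbols, combining marks, ...
    }
}
```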

Here is some code putting all this together. First a class to represent a script (based on UNICODE blocks like "Greek and Coptic": 0x0370 - 0x03FF):

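// Namespaces used by the code in this answer.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public class Script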
{
    public Script(int from, int to, string name)
    {
        From = from;
        To = to;
        Name = name;
    }

    public int From { get; }
    public int To { get; }
    public string Name { get; }

    public bool Contains(char c) => From <= (int) c && (int) c <= To;
}

Next is a class for downloading and parsing the UNICODE blocks file. This code downloads the text in the constructor, which might not be ideal. Instead you can use a local copy of the file or something similar.

public class Scripts
{
    readonly List<Script> scripts;

    public Scripts()
    {
        using (var webClient = new WebClient())
        {
            const string url = "ftp://www.unicode.org/Public/UNIDATA/Blocks.txt";
            var blocks = webClient.DownloadString(url);
            var regex = new Regex(@"^(?<from>[0-9A-F]{4})\.\.(?<to>[0-9A-F]{4}); (?<name>.+)$");
            scripts = blocks
                .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(line => regex.Match(line))
                .Where(match => match.Success)
                .Select(match => new Script(
                    Convert.ToInt32(match.Groups["from"].Value, 16),
                    Convert.ToInt32(match.Groups["to"].Value, 16),
                    NormalizeName(match.Groups["name"].Value)))
                .ToList();
        }
    }

    public string GetScript(char c)
    {
        if (!char.IsLetterOrDigit(c))
            // Use the empty string to signal space and punctuation.
            return string.Empty;
        // Linear search - can be improved by using binary search.
        foreach (var script in scripts)
            if (script.Contains(c))
                return script.Name;
        return string.Empty;
    }

    // Add more special names if required.
    readonly string[] specialNames = new[] { "Latin", "Cyrillic", "Arabic", "CJK" };

    string NormalizeName(string name) => specialNames.FirstOrDefault(sn => name.Contains(sn)) ?? name;
}
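
The two improvements mentioned above (using a local copy of the file, and replacing the linear search with a binary search) could look roughly like the following. This is a sketch under assumptions, not part of the original answer: `BlockTable` is a hypothetical class that takes the content of Blocks.txt as a string (read, for example, with `File.ReadAllText`) and returns raw block names without the name merging done by `NormalizeName`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public sealed class BlockTable
{
    // Accepts 4 to 6 hex digits so ranges above 0xFFFF are parsed as well.
    static readonly Regex LinePattern = new Regex(
        @"^(?<from>[0-9A-F]{4,6})\.\.(?<to>[0-9A-F]{4,6});\s*(?<name>.+)$");

    readonly List<(int From, int To, string Name)> blocks =
        new List<(int From, int To, string Name)>();

    // blocksText is the full text of a local copy of Blocks.txt.
    public BlockTable(string blocksText)
    {
        var lines = blocksText.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            var match = LinePattern.Match(line.Trim());
            if (!match.Success)
                continue; // comment lines starting with '#'
            blocks.Add((
                ParseHex(match.Groups["from"].Value),
                ParseHex(match.Groups["to"].Value),
                match.Groups["name"].Value));
        }
        // Binary search requires the ranges to be sorted by start.
        blocks.Sort((a, b) => a.From.CompareTo(b.From));
    }

    // Binary search over the sorted, non-overlapping ranges.
    // Returns the empty string when no parsed block contains the code point.
    public string Find(int codePoint)
    {
        int low = 0, high = blocks.Count - 1;
        while (low <= high)
        {
            var mid = low + (high - low) / 2;
            var block = blocks[mid];
            if (codePoint < block.From)
                high = mid - 1;
            else if (codePoint > block.To)
                low = mid + 1;
            else
                return block.Name;
        }
        return string.Empty;
    }

    static int ParseHex(string hex) =>
        int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
}
```

`Find` takes an `int` rather than a `char` so that the same lookup also works for code points above 0xFFFF.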

Notice that blocks above UNICODE code point 0xFFFF are ignored, because the regular expression only matches four-digit ranges. The code I have provided assumes that a UNICODE character is represented by a single 16 bit value, so if you have to work with characters above 0xFFFF you will have to expand on it considerably.
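
One piece of such an expansion is walking the string by code point instead of by `char`, so that a surrogate pair is seen as a single character. A minimal sketch, assuming a hypothetical helper named `CodePoints` that is not part of the code in this answer:

```csharp
using System.Collections.Generic;

public static class CodePoints
{
    // Yields each Unicode code point of the string together with the text
    // (one or two UTF-16 chars) that encodes it.
    public static IEnumerable<(int CodePoint, string Text)> Enumerate(string text)
    {
        for (var i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                yield return (char.ConvertToUtf32(text[i], text[i + 1]), text.Substring(i, 2));
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // A character in the range 0x0000-0xFFFF, or an unpaired
                // surrogate, which is passed through unchanged.
                yield return ((int)text[i], text.Substring(i, 1));
            }
        }
    }
}
```

The block lookup would then have to accept an `int` code point, and the regular expression that parses Blocks.txt would have to accept ranges with more than four hex digits.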

The next task is to split a string into UNICODE blocks. The method returns words, each consisting of consecutive characters that belong to the same script (the second element of the tuple). The scripts variable is an instance of the Scripts class defined above.

public IEnumerable<(string text, string script)> SplitIntoWords(string text)
{
    if (text.Length == 0)
        yield break;
    var script = scripts.GetScript(text[0]);
    var start = 0;
    // Visit every character, including the last one.
    for (var i = 1; i < text.Length; i += 1)
    {
        var nextScript = scripts.GetScript(text[i]);
        if (nextScript != script)
        {
            yield return (text.Substring(start, i - start), script);
            start = i;
            script = nextScript;
        }
    }
    yield return (text.Substring(start, text.Length - start), script);
}

Executing SplitIntoWords on your text will return something like this:

Text      | Script
----------+----------------
Hyundai   | Latin
[space]   | [empty string]
Motor     | Latin
[space]   | [empty string]
Company   | Latin
[space]   | [empty string]
현대자동차 | Hangul Syllables
[space]   | [empty string]
现代      | CJK
...

Next step is to join consecutive words belonging to the same script ignoring space and punctuation:

public IEnumerable<string> JoinWords(IEnumerable<(string text, string script)> words)
{
    using (var enumerator = words.GetEnumerator())
    {
        if (!enumerator.MoveNext())
            yield break;
        var (text, script) = enumerator.Current;
        var stringBuilder = new StringBuilder(text);
        while (enumerator.MoveNext())
        {
            var (nextText, nextScript) = enumerator.Current;
            if (script == string.Empty)
            {
                stringBuilder.Append(nextText);
                script = nextScript;
            }
            else if (nextScript != string.Empty && nextScript != script)
            {
                yield return stringBuilder.ToString();
                stringBuilder = new StringBuilder(nextText);
                script = nextScript;
            }
            else
                stringBuilder.Append(nextText);
        }
        yield return stringBuilder.ToString();
    }
}

This code will include any space and punctuation with the preceding words using the same script.

Putting it all together:

var chunks = JoinWords(SplitIntoWords(text));

This will result in these chunks:

  • Hyundai Motor Company
  • 현대자동차
  • 现代
  • Some other English words

All chunks except the last have a trailing space.
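
If the trailing spaces are unwanted, or if you prefer the compact form from the question ("HyundaiMotorCompany"), the chunks can be post-processed. A minimal sketch; `ChunkCleanup` is a hypothetical helper name:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ChunkCleanup
{
    // Removes leading and trailing white space but keeps spaces inside a chunk.
    public static IEnumerable<string> Trimmed(IEnumerable<string> chunks) =>
        chunks.Select(chunk => chunk.Trim());

    // Keeps only letters and digits, dropping all space and punctuation.
    public static IEnumerable<string> Compacted(IEnumerable<string> chunks) =>
        chunks.Select(chunk =>
            new string(chunk.Where(c => char.IsLetterOrDigit(c)).ToArray()));
}
```

For example, `ChunkCleanup.Trimmed(JoinWords(SplitIntoWords(text)))` would yield the chunks without the trailing spaces.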
