Regex: unexpected results for Unicode text


Problem description

 

Hello,

I am trying to extract some information from a text file that contains Arabic and English strings laid out as follows:

=================================

XXXX: some Name here               YYYYY:and some text here

ZZZZ:01234567890

XXXXXX:some extra text here also

=================================

Here XXXX, YYYYY, ZZZZ and XXXXXX are Arabic words for Name, Address and so on.

What I need is the information after the colon ( : ) of each field name. That information may be in either Arabic or English.

The code below works as needed, but only for text files that are entirely in English:

 


public static void REGEXX(string file)
{
    // Read the whole file, decoding it as Windows-1256 (Arabic).
    // On .NET Core / .NET 5+, code page 1256 is only available after
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has run.
    string str = File.ReadAllText(file, Encoding.GetEncoding(1256));

    // Singleline lets '.' match newlines, so each lazy .+? can span lines.
    var re = new Regex(
        @"\n?Name:\s*(?<name>.+?)\n.+?ID:\s*(?<id>.+?)\n.+?Address:\s*"
        + @"(?<addr>.+?)Notes:",
        RegexOptions.IgnoreCase
        | RegexOptions.Singleline
        | RegexOptions.Compiled);

    //re.Options = RegexOptions.RightToLeft;

    var m = re.Match(str);
    if (m.Success)
    {
        // Copy the named groups into the customer fields.
        CustomerName = m.Groups["name"].Value;
        CustomerID = m.Groups["id"].Value;
        CustomerAddress = m.Groups["addr"].Value;
    }
}

Recommended answer

Can you provide us with the actual text snippet you're trying to parse?

One of the possible issues you're running into is the left-to-right/right-to-left directionality of Arabic, especially when it is mixed with English. As far as I'm aware, the Regex engine strictly processes the text from left to right, in the order the characters are stored rather than the order they are displayed, so make sure your Arabic inserts keep this in mind.
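To make the point about processing order concrete, here is a minimal sketch. It assumes a hypothetical Arabic label (the real labels are not shown in the question) followed by a colon and an English value. The class and method names are made up for illustration. In memory the label occupies the first indexes of the string, even though an editor displays it on the right, so the pattern simply puts the label first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LogicalOrderDemo
{
    // Hypothetical label: the Arabic word for "the name", written with escapes
    // so the source looks the same in any editor. Not taken from the question.
    public const string NameLabel = "\u0627\u0644\u0627\u0633\u0645";

    // Returns the text after "<label>:" on the first line that starts with
    // the label, or an empty string when no such line exists.
    public static string ExtractName(string text)
    {
        var re = new Regex(
            "^" + Regex.Escape(NameLabel) + @":[ \t]*(?<name>[^\r\n]+)",
            RegexOptions.Multiline);
        Match m = re.Match(text);
        return m.Success ? m.Groups["name"].Value.Trim() : string.Empty;
    }
}
```

Calling LogicalOrderDemo.ExtractName on a string built as NameLabel + ": Ahmed Ali" returns the value after the colon, regardless of how the line is rendered on screen.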

One point to keep in mind: frequent use of .+? is a road to performance disaster. Your text looks neatly separated by newlines. If that is the case, [^\n]+ or a plain .+ is a lot faster than .+? (a plain .+ only stops at the end of the line when RegexOptions.Singleline is not set, because that option lets the dot match newlines too). Even ((?!XXXX:).)+ will be faster than .+? given the right input.
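As a rough sketch of that advice, the pattern below replaces every lazy .+? with a negated character class, and uses the ((?!label:).)+ form only where two fields share a line. The labels Name, City, ID and Notes are placeholders chosen for illustration, as are the class and method names. The real file uses Arabic labels that are not shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordParser
{
    // Placeholder labels; substitute the real (Arabic) ones.
    private static readonly Regex RecordRegex = new Regex(
        @"^Name:[ \t]*(?<name>(?:(?!City:)[^\r\n])+)" // stop before the next label on this line
        + @"City:[ \t]*(?<city>[^\r\n]+)\r?\n"        // rest of the line, no lazy quantifier
        + @"ID:[ \t]*(?<id>[^\r\n]+)\r?\n"
        + @"Notes:[ \t]*(?<notes>[^\r\n]+)",
        RegexOptions.Multiline);                      // Singleline is deliberately not set

    // Returns the trimmed field values, or an empty dictionary when nothing matches.
    public static Dictionary<string, string> Parse(string text)
    {
        var fields = new Dictionary<string, string>();
        Match m = RecordRegex.Match(text);
        if (!m.Success)
        {
            return fields;
        }
        foreach (string key in new[] { "name", "city", "id", "notes" })
        {
            fields[key] = m.Groups[key].Value.Trim();
        }
        return fields;
    }
}
```

Each value group can only consume characters up to the end of its own line, so the engine never has to expand a lazy match step by step across the rest of the file.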

The option RegexOptions.RightToLeft actually makes the engine scan the text in the opposite direction, starting from the end of the string. It has nothing to do with right-to-left scripts and does not help with a string that mixes English and Arabic text. So remove it from your expression constructor.
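A small sketch shows what RegexOptions.RightToLeft changes: the direction of the search, not the handling of any script. The helper name and sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    // Returns the first match the engine finds for "a lowercase letter followed
    // by a digit", searching in the direction selected by the options.
    public static string FirstMatch(string input, RegexOptions options)
    {
        return Regex.Match(input, @"[a-z]\d", options).Value;
    }
}
```

With the input "a1 b2 c3", RegexOptions.None finds "a1" first, while RegexOptions.RightToLeft finds "c3" first because the search starts at the end of the string.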

RegexOptions.CultureInvariant changes the way IgnoreCase works. For now I'd try to make the expression work without IgnoreCase and without the CultureInvariant option.

Don't set the RegexOptions both in the constructor and through the property. As a best practice, use only the constructor (in .NET the Regex.Options property is read-only, so the constructor is the only place to set them in any case).

 

