Split text file at sentence boundary

Question

I have to process a text file (an e-book) so that there is one sentence per line (a "newline-separated file", yes?). How would I do this with sed, the UNIX utility? Does it have a symbol for "sentence boundary", like the symbol for "word boundary" (I think the GNU version has that)? Please note that a sentence can end in a period, an ellipsis, or a question or exclamation mark, and the last two can be combined (for example, ?, !, !?, and !!!!! are all valid "sentence terminators"). The input file is formatted in such a way that some sentences contain newlines, which have to be removed.

I thought about a script like s/...|. |[!?]+ |/\n/g (unescaped for readability), but it does not remove the newlines from inside the sentences.

How about in C#? Would it be noticeably faster if I used regular expressions there, as in sed? (I think not.) Is there another, faster way?

Either way (sed or C#) is fine. Thank you.

Recommended answer

Regex is a good option, and one I used for a long time.

A regex that worked well for me is

 string[] sentences = Regex.Split(sentence, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");

However, regex is not efficient. Also, although the logic works for ideal cases, it does not work well in a production environment.

For example, if my text is

U.S.A. is a wonderful nation. Most people feel happy living there.

A regex that splits at each period will break this into 5 pieces. But logically it should be split into only two sentences.
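The difference between splitting at every period and using the lookaround regex shown earlier can be checked with a small sketch. The class and method names below are hypothetical, and the "Dr. Smith" input is an illustration added here, not taken from the original post. On the U.S.A. text the naive split yields 5 pieces while the lookaround regex yields 2; the regex still breaks after an abbreviation that is followed by a capitalized word, which is the kind of case that causes trouble on real text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitComparison
{
    // Naive approach: break at every period and drop empty pieces.
    public static string[] SplitAtEveryPeriod(string text) =>
        text.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries);

    // The lookaround regex from the answer: split on whitespace that follows
    // a letter/digit/quote plus a terminator and precedes an uppercase letter.
    public static string[] SplitWithLookarounds(string text) =>
        Regex.Split(text, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");

    public static void Main()
    {
        string text = "U.S.A. is a wonderful nation. Most people feel happy living there.";
        Console.WriteLine(SplitAtEveryPeriod(text).Length);   // 5
        Console.WriteLine(SplitWithLookarounds(text).Length); // 2

        // Illustrative input (not from the original post): "Dr." is followed
        // by a capitalized word, so the regex splits after the abbreviation.
        string tricky = "Dr. Smith arrived. He was late.";
        foreach (string piece in SplitWithLookarounds(tricky))
            Console.WriteLine(piece); // three pieces, the first being "Dr."
    }
}
```

Note also that the lookbehind requires a letter, digit, or quote directly before the terminator, so combined terminators such as !? are not treated as split points by this pattern.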

This is what made me look for a machine learning technique, and in the end SharpNLP worked pretty well for me.

    // Folder that contains the SharpNLP model files (adjust to your machine).
    private string mModelPath = @"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
    private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;

    private string[] SplitSentences(string paragraph)
    {
        // Load the pre-trained model once, on first use.
        if (mSentenceDetector == null)
        {
            mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
        }

        return mSentenceDetector.SentenceDetect(paragraph);
    }

In this example I used SharpNLP together with EnglishSD.nbin, a pre-trained model for sentence detection.

Now if I apply the same input to this method, it splits the text into the two logical sentences.
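To get from a sentence splitter to the one-sentence-per-line file the question asks for, the line breaks inside sentences have to be removed before splitting. The following is a minimal sketch under stated assumptions: the class, the method names, and the command-line usage are hypothetical, the whole file is read into memory, and the lookaround regex from earlier stands in for the splitter. Any function with the same shape, such as the SplitSentences method above, could be passed instead.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class OneSentencePerLine
{
    // Collapse every run of whitespace, including line breaks that fall
    // inside a sentence, into a single space.
    public static string Unwrap(string text) =>
        Regex.Replace(text, @"\s+", " ").Trim();

    // Stand-in splitter: the lookaround regex from the answer.
    public static string[] RegexSplit(string paragraph) =>
        Regex.Split(paragraph, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");

    // Unwrap first, then hand the text to whichever splitter is supplied.
    public static string[] Process(string text, Func<string, string[]> splitter) =>
        splitter(Unwrap(text));

    public static void Main(string[] args)
    {
        // Hypothetical usage: OneSentencePerLine <input file> <output file>
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: OneSentencePerLine <input> <output>");
            return;
        }

        string[] sentences = Process(File.ReadAllText(args[0]), RegexSplit);
        File.WriteAllLines(args[1], sentences); // one sentence per line
    }
}
```

Because Unwrap flattens all whitespace, paragraph breaks and headings without a terminator are merged into the neighboring sentence; an e-book with chapter titles would need those handled separately.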

You can also tokenize, POS-tag, chunk, and so on with the SharpNLP project.

For step-by-step integration of SharpNLP into your C# application, read through the detailed article I have written. It walks through the integration with code snippets.

Thanks
