Split text file at sentence boundary
Question
I have to process a text file (an e-book). I'd like to process it so that there is one sentence per line (a "newline-separated file", yes?). How would I do this with the UNIX utility sed? Does it have a symbol for "sentence boundary", like the symbol for "word boundary" (I think the GNU version has that)? Please note that a sentence can end in a period, an ellipsis, a question mark, or an exclamation mark, and the last two can be combined (for example, ?, !, !?, and !!!!! are all valid "sentence terminators"). The input file is formatted in such a way that some sentences contain newlines that have to be removed.
I thought about a script like s/...|. |[!?]+ |/\n/g
(unescaped for better reading). But it does not remove the newlines from inside the sentences.
How about in C#? Would it be noticeably faster if I used regular expressions there, as in sed? (I think not.) Is there another, faster way?
Either way (sed or C#) is fine. Thank you.
Recommended answer
Regex is a good option, and I used it for a long time.
A regex that worked well for me is
string[] sentences = Regex.Split(sentence, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
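For the one-sentence-per-line output the question asks for, the wrapped lines have to be joined before splitting. Below is a minimal sketch of that step, assuming .NET's System.Text.RegularExpressions. The class name OneSentencePerLine and the helper Reflow are illustrative names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OneSentencePerLine
{
    // Same pattern as in the Regex.Split call above.
    private static readonly Regex Boundary =
        new Regex(@"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");

    // Hypothetical helper: joins wrapped lines, then puts each sentence on its own line.
    public static string Reflow(string text)
    {
        // Collapse every run of whitespace (newlines included) into a single space.
        string flat = Regex.Replace(text, @"\s+", " ").Trim();

        // Split at the sentence boundaries and join the pieces with newlines.
        return string.Join("\n", Boundary.Split(flat));
    }

    public static void Main()
    {
        string wrapped = "It was late.\nThe train had\nalready left! We waited.";
        Console.WriteLine(Reflow(wrapped));
        // Prints:
        // It was late.
        // The train had already left!
        // We waited.
    }
}
```

Note that this pattern expects a letter, digit, or quote directly before a single terminator, so runs such as "..." or "!?" are not treated as boundaries by it.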
However, regex is not efficient. Also, although the logic works for ideal cases, it does not work well in a production environment.
For example, if my text is
U.S.A. is a wonderful nation. Most people feel happy living there.
A regex method that splits at each period will classify it as 5 sentences. But we know that logically it should be split into only two sentences.
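The count is easy to check. The sketch below assumes .NET's Regex, and the class and method names are illustrative. It compares a split at every period with the lookaround pattern shown earlier. On this sample the lookaround pattern actually returns two pieces, because it only splits where whitespace is followed by a capital letter. For the same reason it still splits after an abbreviation such as "Mr." in "Mr. Smith".

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSplitLimits
{
    // The lookaround pattern from the answer.
    private const string Pattern = @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])";

    // Naive approach: cut at every period and count the non-empty pieces.
    public static int CountNaive(string text)
    {
        return text.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Split only where a terminator is followed by whitespace and a capital letter.
    public static string[] SplitWithLookarounds(string text)
    {
        return Regex.Split(text, Pattern);
    }

    public static void Main()
    {
        string usa = "U.S.A. is a wonderful nation. Most people feel happy living there.";
        string mr = "Mr. Smith went home. He slept.";

        Console.WriteLine(CountNaive(usa));                  // 5
        Console.WriteLine(SplitWithLookarounds(usa).Length); // 2
        Console.WriteLine(SplitWithLookarounds(mr).Length);  // 3 ("Mr." is cut off)
    }
}
```

Abbreviations followed by a capitalized word are the kind of case a fixed pattern cannot tell apart from a real sentence end.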
This is what made me look for a machine learning technique, and in the end SharpNLP worked pretty well for me.
// Folder that contains the pre-trained model file EnglishSD.nbin.
private string mModelPath = @"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;

private string[] SplitSentences(string paragraph)
{
    // Load the model lazily on first use, then reuse the same detector.
    if (mSentenceDetector == null)
    {
        mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
    }
    return mSentenceDetector.SentenceDetect(paragraph);
}
In this example I have used SharpNLP with EnglishSD.nbin, a pre-trained model for sentence detection.
Now if I apply the same input to this method, it correctly splits the text into two logical sentences.
You can even tokenize, POS-tag, chunk, etc., using the SharpNLP project.
For step-by-step integration of SharpNLP into your C# application, read through the detailed article I have written. It explains the integration with code snippets.
Thanks