Removing stopwords from a text file


Problem description

I used the following code, but when I write, for example, "my name is asmaa",
the result is "my ne asmaa".
Why is "name" converted to "ne"?



using (TextWriter tw = new StreamWriter(@"D:\output.txt"))
{
    using (StreamReader reader = new StreamReader("D:\\input.txt"))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] parts = line.Split(' ');
            string[] stopWord = new string[] { "is", "are", "am", "could", "will", "in" };
            foreach (string word in stopWord)
            {
                line = line.Replace(word, "");
            }
            tw.Write(line);
        }
    }
}

Recommended answer

It's doing exactly what you told it to: replace all instances of those strings with nothing. The word "name" contains "am", so removing "am" leaves "ne". You haven't specified in any way that you mean only whole words. What you'll want to use here is regular expressions. In this case, what will be of specific interest to you is the special anchor "\b", which matches a word boundary. For example, the regular expression "\bam\b" will only match "am" if it isn't part of another word (note that this will also match in some cases where there is no space before and after, such as at the beginning or end of a string, or next to punctuation, as in "am.", but I'm guessing you want that).

In C#, regular expressions are available through the Regex class (in the System.Text.RegularExpressions namespace).

For example:
line = Regex.Replace(line, @"\bam\b", "");








This replaces each standalone "am" in line with an empty string. (In case you haven't seen it before, the @ before the string makes it a verbatim string literal, in which the backslash is treated as an ordinary character rather than as an escape character. That is helpful when writing regular expressions: without it the pattern would have to be written "\\bam\\b", which becomes harder to read as patterns grow more complicated.)
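
To apply the same idea to the whole stop-word list from the question, one option is to build a single pattern that joins the words with "|" and wraps the group in "\b" on both sides. The following is only a minimal sketch: the names StopwordFilter and RemoveStopwords are hypothetical, and the case-insensitive matching and the collapsing of leftover spaces are assumptions about what you probably want, not something the question specifies.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopwordFilter
{
    // Stop words taken from the question; extend the list as needed.
    private static readonly string[] StopWords =
        { "is", "are", "am", "could", "will", "in" };

    // Builds \b(?:is|are|am|could|will|in)\b so only whole words match.
    // Regex.Escape guards against stop words containing regex metacharacters.
    private static readonly Regex StopWordPattern = new Regex(
        @"\b(?:" + string.Join("|", StopWords.Select(w => Regex.Escape(w))) + @")\b",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: removes whole-word stop words from one line.
    public static string RemoveStopwords(string line)
    {
        string stripped = StopWordPattern.Replace(line, "");
        // Removing a word leaves its surrounding spaces behind (assumption:
        // we want single spaces and no leading/trailing whitespace).
        return Regex.Replace(stripped, @"\s+", " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        // "name" and "asmaa" both contain "am" but are left untouched.
        Console.WriteLine(StopwordFilter.RemoveStopwords("my name is asmaa"));
        // prints "my name asmaa"
    }
}
```

In the loop from the question, this would take the place of the foreach over stopWord, for example tw.WriteLine(StopwordFilter.RemoveStopwords(line)). Using WriteLine rather than Write keeps the line breaks, since ReadLine strips the newline from each line it returns.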

