Search for some phrases in a text file using Regex C#
Question
Task:
Write a program that counts the phrases in a text file. Any sequence of characters could be given as a phrase for counting, even sequences containing separators. For instance, in the text "I am a student in Sofia" the phrases "s", "stu", "a" and "I am" are found 2, 1, 3 and 1 times respectively.
I know the solution with string.IndexOf, with LINQ, or with an algorithm such as Aho-Corasick. I want to do the same thing with Regex.
This is what I've done so far:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace CountThePhrasesInATextFile
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = ReadInput("file.txt");
            input.ToLower();
            List<string> phrases = new List<string>();
            using (StreamReader reader = new StreamReader("words.txt"))
            {
                string line = reader.ReadLine();
                while (line != null)
                {
                    phrases.Add(line.Trim());
                    line = reader.ReadLine();
                }
            }
            foreach (string phrase in phrases)
            {
                Regex regex = new Regex(String.Format(".*" + phrase.ToLower() + ".*"));
                int mathes = regex.Matches(input).Count;
                Console.WriteLine(phrase + " ----> " + mathes);
            }
        }

        private static string ReadInput(string fileName)
        {
            string output;
            using (StreamReader reader = new StreamReader(fileName))
            {
                output = reader.ReadToEnd();
            }
            return output;
        }
    }
}
I know my regular expression is incorrect but I don't know what to change.
Output:
Word ----> 2
S ----> 2
MissingWord ----> 0
DS ----> 2
aa ----> 0
Correct output:
Word --> 9
S --> 13
MissingWord --> 0
DS --> 2
aa --> 3
file.txt contains:
Word? We have few words: first word, second word, third word.
Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD
words.txt contains:
Word
S
MissingWord
DS
aa
Recommended answer
You need to post the file.txt contents first, otherwise it's difficult to verify if the regex is working correctly or not.
That being said, check out the Regex answer here: Finding ALL positions of a substring in a large string in C# and see if that helps with your code in the meantime.
So there's a simple solution: wrap each of your phrases in "(?=(" and "))". This is a lookahead assertion in regex. It matches at a position without consuming any characters, so overlapping occurrences are each counted. The following code handles what you want.
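foreach (string phrase in phrases) {
    // Escape the phrase so characters such as "?" or "." are matched literally,
    // then wrap it in a lookahead so overlapping occurrences are all counted.
    string MatchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))";
    int mathes = Regex.Matches(input, MatchPhrase).Count;
    Console.WriteLine(phrase + " ----> " + mathes);
}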
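To see why the lookahead makes a difference, here is a minimal sketch that compares three patterns on a small made-up string. The class name PatternComparison, the Count helper and the sample text are assumptions for illustration only; they are not part of the asker's program.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternComparison
{
    // Hypothetical two-line sample, not the asker's file.txt.
    public const string Sample = "first word, second word\npasswords: aaaa";

    // Returns how many matches Regex.Matches reports for the pattern.
    public static int Count(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }

    static void Main()
    {
        // ".*" is greedy and "." does not cross a newline by default,
        // so each line containing the phrase produces exactly one match.
        Console.WriteLine(Count(Sample, ".*word.*"));   // 2

        // The bare phrase counts every non-overlapping occurrence.
        Console.WriteLine(Count(Sample, "word"));       // 3
        Console.WriteLine(Count(Sample, "aa"));         // 2

        // A lookahead consumes nothing, so overlapping occurrences count too.
        Console.WriteLine(Count(Sample, "(?=(aa))"));   // 3
    }
}
```

This is also consistent with the question's original output of 2 for "Word", "S" and "DS": file.txt has two lines, and each of them contains those phrases at least once.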
You also have a problem where
input.ToLower();
should be
input = input.ToLower();
because strings in C# are immutable: ToLower() returns a new string and leaves the original unchanged. Altogether, your code should be:
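using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading;

class Program {
    static void Main(string[] args) {
        string input = ReadInput("file.txt");
        // Strings are immutable, so keep the lowercased copy that ToLower() returns.
        input = input.ToLower();
        List<string> phrases = new List<string>();
        using (StreamReader reader = new StreamReader("words.txt")) {
            string line = reader.ReadLine();
            while (line != null) {
                phrases.Add(line.Trim());
                line = reader.ReadLine();
            }
        }
        foreach (string phrase in phrases) {
            // Escaped phrase inside a lookahead: counts overlapping, literal occurrences.
            string MatchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))";
            int mathes = Regex.Matches(input, MatchPhrase).Count;
            Console.WriteLine(phrase + " ----> " + mathes);
        }
        // Keep the console window open for 50 seconds.
        Thread.Sleep(50000);
    }

    // Reads the whole file into a single string.
    private static string ReadInput(string fileName) {
        string output;
        using (StreamReader reader = new StreamReader(fileName)) {
            output = reader.ReadToEnd();
        }
        return output;
    }
}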
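As a variation, the counting step could be pulled into a small helper that uses RegexOptions.IgnoreCase instead of lowercasing both strings. This is only a sketch: the PhraseCounter class, the CountOccurrences method and the extra phrase "Word?" are hypothetical additions, and the sample text is copied from the file.txt shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhraseCounter
{
    // The two lines of file.txt from the question.
    public const string SampleText =
        "Word? We have few words: first word, second word, third word.\n" +
        "Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD";

    // Counts every occurrence of phrase in text, overlapping ones included,
    // ignoring case. The phrase is treated as literal text, not as a pattern.
    public static int CountOccurrences(string text, string phrase)
    {
        if (string.IsNullOrEmpty(phrase))
        {
            return 0; // an empty lookahead would match at every position
        }

        // Regex.Escape makes metacharacters such as "?" match literally.
        string pattern = "(?=" + Regex.Escape(phrase) + ")";
        return Regex.Matches(text, pattern,
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant).Count;
    }

    static void Main()
    {
        string[] phrases = { "Word", "S", "MissingWord", "DS", "aa", "Word?" };
        foreach (string phrase in phrases)
        {
            Console.WriteLine(phrase + " ----> " + CountOccurrences(SampleText, phrase));
        }
    }
}
```

For the five phrases in words.txt this gives the counts listed under "Correct output" in the question (9, 13, 0, 2 and 3). The extra phrase "Word?" is counted once; without Regex.Escape the "?" would make the "d" optional and the lookahead would match wherever "wor" starts.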