Search for some phrases in a text file using Regex C#

Question

Task:

Write a program that counts the phrases in a text file. Any sequence of characters can be given as a phrase for counting, even sequences containing separators. For instance, in the text "I am a student in Sofia" the phrases "s", "stu", "a" and "I am" are found 2, 1, 3 and 1 times respectively.

I know the solutions with string.IndexOf, with LINQ, or with an algorithm such as Aho-Corasick. I want to do the same thing with Regex.

This is what I've done so far:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace CountThePhrasesInATextFile
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = ReadInput("file.txt");
            input.ToLower();
            List<string> phrases = new List<string>();
            using (StreamReader reader = new StreamReader("words.txt"))
            {
                string line = reader.ReadLine();
                while (line != null)
                {
                    phrases.Add(line.Trim());
                    line = reader.ReadLine();
                }
            }
            foreach (string phrase in phrases)
            {
                Regex regex = new Regex(String.Format(".*" + phrase.ToLower() + ".*"));
                int mathes = regex.Matches(input).Count;
                Console.WriteLine(phrase + " ----> " + mathes);
            }
        }

        private static string ReadInput(string fileName)
        {
            string output;
            using (StreamReader reader = new StreamReader(fileName))
            {
                output  = reader.ReadToEnd();
            }
            return output;
        }
    }
}

I know my regular expression is incorrect, but I don't know what to change.

Output:

Word ----> 2
S ----> 2
MissingWord ----> 0
DS ----> 2
aa ----> 0

Correct output:

Word --> 9
S --> 13
MissingWord --> 0
DS --> 2
aa --> 3

file.txt contains:

Word? We have few words: first word, second word, third word.
Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD

words.txt contains:

Word
S
MissingWord
DS
aa

Recommended answer

You need to post the file.txt contents first; otherwise it's difficult to verify whether the regex is working correctly.

That being said, check out the Regex answer in "Finding ALL positions of a substring in a large string in C#" and see if that helps with your code in the meantime.

So there's a simple solution: wrap each of your phrases in "(?=(" and "))". This is a lookahead assertion in regex. It matches at every position where the phrase starts without consuming any characters, so overlapping occurrences are counted too. The following code handles what you want.

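        foreach (string phrase in phrases) {
            // Escape the phrase so characters like '?' or '.' are matched literally,
            // then wrap it in a lookahead so overlapping occurrences are counted.
            string matchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))";
            int matches = Regex.Matches(input, matchPhrase).Count;
            Console.WriteLine(phrase + " ----> " + matches);
        }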
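
To illustrate why the lookahead matters, here is a minimal sketch comparing three patterns on a small made-up string. The class name LookaheadDemo, the helper Count and the sample text are assumptions chosen for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Hypothetical helper: number of matches of pattern in text.
    public static int Count(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }

    public static void Main()
    {
        // Two lines: "aaaa" and "aa".
        string text = "aaaa\naa";

        // Greedy .* swallows the rest of each line, so there is at most one match per line.
        Console.WriteLine(Count(text, ".*aa.*"));   // prints 2

        // A plain pattern consumes the characters it matches, so overlaps are skipped.
        Console.WriteLine(Count(text, "aa"));       // prints 3

        // A lookahead consumes nothing, so every starting position is counted.
        Console.WriteLine(Count(text, "(?=(aa))")); // prints 4
    }
}
```

The first pattern has the same shape as the one in the question, which is why its counts never exceeded the number of lines in file.txt.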

You also had a problem with

input.ToLower();

which should be

input = input.ToLower();

as strings in C# are immutable: ToLower() returns a new string and leaves the original unchanged. In total, your code should be:

    static void Main(string[] args) {
        string input = ReadInput("file.txt");
        input = input.ToLower();
        List<string> phrases = new List<string>();
        using (StreamReader reader = new StreamReader("words.txt")) {
            string line = reader.ReadLine();
            while (line != null) {
                phrases.Add(line.Trim());
                line = reader.ReadLine();
            }
        }
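        foreach (string phrase in phrases) {
            // Escape regex metacharacters in the phrase, then wrap it in a
            // lookahead so overlapping occurrences are counted as well.
            string matchPhrase = "(?=(" + Regex.Escape(phrase.ToLower()) + "))";
            int matches = Regex.Matches(input, matchPhrase).Count;
            Console.WriteLine(phrase + " ----> " + matches);
        }
        // Keep the console window open for a while (fully qualified, so no extra using is needed).
        System.Threading.Thread.Sleep(50000);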
    }

    private static string ReadInput(string fileName) {
        string output;
        using (StreamReader reader = new StreamReader(fileName)) {
            output = reader.ReadToEnd();
        }
        return output;
    }
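
As a possible variation, the lowercasing can be left to the regex engine with RegexOptions.IgnoreCase, and Regex.Escape can guard against phrases that contain regex metacharacters. The sketch below is one way to do this, not the answer's own method; the class name PhraseCounter and the helper CountOccurrences are hypothetical, and the file contents from the question are inlined as string literals so the example runs without any files.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhraseCounter
{
    // Hypothetical helper: counts every occurrence of phrase in text,
    // including overlapping ones, ignoring case.
    public static int CountOccurrences(string text, string phrase)
    {
        // Regex.Escape makes characters such as '?' or '.' match literally.
        string pattern = "(?=" + Regex.Escape(phrase) + ")";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        // Contents of file.txt and words.txt as given in the question.
        string text =
            "Word? We have few words: first word, second word, third word.\n" +
            "Some passwords: PASSWORD123, @PaSsWoRd!456, AAaA, !PASSWORD";
        string[] phrases = { "Word", "S", "MissingWord", "DS", "aa" };

        foreach (string phrase in phrases)
        {
            Console.WriteLine(phrase + " --> " + CountOccurrences(text, phrase));
        }
    }
}
```

For this input the counts should come out as 9, 13, 0, 2 and 3, which is the correct output listed in the question. Note that an empty phrase would match at every position, so blank lines in words.txt are worth skipping.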
