File content search in C#

Question

I'm trying to implement this feature in my application.

It should work just as in Windows: I type into the search box, and if "File contents" is checked in the settings, the search returns every file that contains the search string, whether it is a text file or a PDF/Word file.

I have already built an application for file and folder search, and its content search works well for text files and Word files. I use Word interop for the Word files.
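For reference, the plain-text part of such a content search can be sketched as follows. This is a minimal sketch, not the asker's actual code: the class name TextContentSearch and both method names are hypothetical, and the match is case-sensitive and line-based, so a term that spans a line break is not found.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper for illustration; not taken from the original post.
public static class TextContentSearch
{
    // Returns true if any line of the file contains the term (case-sensitive).
    public static bool FileContainsText(string path, string term)
    {
        // File.ReadLines streams the file lazily, so large files are not loaded whole,
        // and Any() stops reading at the first matching line.
        return File.ReadLines(path).Any(line => line.Contains(term));
    }

    // Returns the full paths of all files under root that match the
    // file-name pattern and whose content contains the term.
    public static List<string> FindFiles(string root, string pattern, string term)
    {
        return Directory.EnumerateFiles(root, pattern, SearchOption.AllDirectories)
                        .Where(path => FileContainsText(path, term))
                        .ToList();
    }
}
```

A call such as TextContentSearch.FindFiles(@"c:\temp", "*.txt", "Sushi") would list the matching text files. Folders the process cannot read will throw, so a real application would need to handle that.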

I know I can use iTextSharp or some other third-party library to do this for PDF files, but that doesn't satisfy me. I want to find out how Windows does it, or whether anyone has done it in a different way. I would rather not use a third-party tool, though that doesn't mean I can't; I just want to keep my application light and not load it with many dependencies.

Recommended answer

As far as I know, it is not possible to search PDF content without a third-party tool, library, or utility installed; pdfgrep is one example. If you are writing a C# program anyway, I would include a third-party library to do the job.

I posted a solution for something similar in this answer: Read specific value based on label name from PDF in C#. With a bit of tweaking you can get what you are looking for. The only catch is that PdfClown targets .NET Framework; on the other hand, it is open source, free, and has no limitations. If you need .NET Core, you may find free (with limitations) or paid PDF libraries.

As you requested in the comments, here is a sample solution that finds text inside PDF pages. I have left comments in the code:

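using System.Collections.Generic;
using System.IO;
// PdfClown namespaces; adjust them if your version of the library differs
using org.pdfclown.documents.contents;
using org.pdfclown.documents.contents.objects;
// Alias avoids the name clash between System.IO.File and PdfClown's File
using File = org.pdfclown.files.File;

public class SearchPdfContent
{
    // Text fragments collected from the PDF currently being searched
    private List<string> _contentList;

    // Returns true if the given PDF file contains the word (case-sensitive)
    public bool SearchPdf(FileInfo fileInfo, string word)
    {
        _contentList = new List<string>();
        ExtractPages(fileInfo.FullName);
        // Fragments are joined with spaces before matching
        var content = string.Join(" ", _contentList);
        return content.Contains(word);
    }

    // Extracts the text content of every page of the given PDF file
    private void ExtractPages(string filePath)
    {
        using (var file = new File(filePath))
        {
            var document = file.Document;

            foreach (var page in document.Pages)
            {
                Extract(new ContentScanner(page));
            }
        }
    }

    // Walks the content of a PDF page recursively and adds every text fragment to _contentList
    private void Extract(ContentScanner level)
    {
        if (level == null)
            return;

        while (level.MoveNext())
        {
            var content = level.Current;
            switch (content)
            {
                case ShowText text:
                    {
                        // Decode the raw bytes with the font that is active at this point
                        var font = level.State.Font;
                        _contentList.Add(font.Decode(text.Text));
                        break;
                    }
                case Text _:
                case ContainerObject _:
                    // Text blocks and containers hold nested content; scan it as well
                    Extract(level.ChildLevel);
                    break;
            }
        }
    }
}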

Now let's do a quick test, assuming all your invoices are in the c:\temp folder:

static void Main(string[] args)
{
    var program = new SearchPdfContent();

    DirectoryInfo d = new DirectoryInfo(@"c:\temp");
    FileInfo[] Files = d.GetFiles("*.pdf");
    var word = "Sushi";
    foreach (FileInfo file in Files)
    {
        var found = program.SearchPdf(file, word);
        if (found)
        {
            Console.WriteLine($"{file.FullName} contains word {word}");
        }
    }
}

In my case, for example, the invoice contains the word Sushi:

c:\temp\invoice0001.pdf contains word Sushi
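Note that string.Contains(string) is case-sensitive, so the search above finds "Sushi" but would miss "sushi". If a Windows-style, case-insensitive match is wanted, a small helper along the following lines could replace the Contains call in SearchPdf. This is only a sketch; the class and method names are hypothetical and not part of the original answer.

```csharp
using System;

// Hypothetical extension method for illustration; not part of the original answer.
public static class StringSearchExtensions
{
    // Returns true if text contains word, ignoring case (ordinal comparison).
    public static bool ContainsIgnoreCase(this string text, string word)
    {
        // Treat missing input as "no match" instead of throwing.
        if (text == null || string.IsNullOrEmpty(word))
            return false;

        return text.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With this in place, the last line of SearchPdf would read return content.ContainsIgnoreCase(word);.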

All that said, this is just an example solution. You can take it from here and bring it to the next level. Enjoy your day.

I leave some links of what I have searched for:
