Extract text from PDF and Word files

Question

How can I extract plain text from PDF or Word files in C#, stripping out bold, images, and other rich-text formatting and media?

Recommended answer

You can use the filters (IFilter implementations) that the Windows indexing service uses. They are designed to extract plain text from various document types, which is what makes searching inside a document possible. They work for Office files, PDFs, HTML, and so on: basically any file type that has a filter registered for it.

The only downside is that the filters must be installed on the machine that runs your code. If that is a server you don't have direct access to, this approach may not be possible. Some filters come pre-installed with Windows, but others, such as the PDF filter, you have to install yourself.

For a C# implementation, check out the article "Using IFilter in C#".
