How to extract text from Word files using C#?

Question

I am trying to convert a large number (100,000) of Word DOC files. They are quite old, from around the 1995 to 2000 versions of Word, I suppose. I keep going around in circles with what I find here on Stack Overflow and in the MS documentation.

What I want to do is simply read the file, put the text into a string, parse the string, and pull out the structured fields (each file is actually a structured report, with lines that look like Patient: Jon Doe). From that point on, I know what I am doing: I can parse the string data, put it into useful variables, and then store that data in a database. But I do not know how to actually get the text into a string. Any help?

PPS: I found this reference, which supposedly converts a DOC file into a text file. It's a start, but I'd rather avoid doing a bunch of file manipulation.

Recommended answer

If you try to use the Word object model, you always have to instantiate a specific version of Word on the client (since running Word on a server is not recommended). Unfortunately, you then depend on the restrictions that version of Word places on older files; e.g. Word 2010 opens files from Office 95 only in sandbox mode (i.e. you are not able to access the file content programmatically). Additionally, you'll have to deal with unknown template content (documents with macros attached, for example).
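For reference, going through the Word object model could look roughly like the sketch below. It is only an illustration under several assumptions: Word is installed on the machine, the project references the Microsoft.Office.Interop.Word assembly, and the installed Word version is actually willing to open the old file formats (which, as noted above, it may not be). The class name `DocTextExtractor` and the method `ReadText` are made up for this example.

```csharp
using System;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper, not part of any library.
static class DocTextExtractor
{
    // Opens one document read-only in an existing Word instance
    // and returns its main text as a single string.
    public static string ReadText(Word._Application app, string path)
    {
        Word._Document doc = null;
        try
        {
            doc = app.Documents.Open(
                FileName: path,
                ReadOnly: true,
                AddToRecentFiles: false,
                Visible: false);

            // Content covers the main document story; paragraphs end with '\r'.
            return doc.Content.Text;
        }
        finally
        {
            if (doc != null)
            {
                doc.Close(SaveChanges: false);
                Marshal.ReleaseComObject(doc);
            }
        }
    }

    static void Main(string[] args)
    {
        // Assumes the first command-line argument is the full path to a DOC file.
        Word._Application app = new Word.Application();
        app.Visible = false;
        try
        {
            string text = ReadText(app, args[0]);
            Console.WriteLine(text);
        }
        finally
        {
            app.Quit(SaveChanges: false);
            Marshal.ReleaseComObject(app);
        }
    }
}
```

For a batch of 100,000 files it would probably make sense to reuse a single Word instance across many calls to `ReadText` rather than starting Word once per file, and to log the files that Word refuses to open so they can be handled separately.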

In your case I'd rather look for a third-party component that gives you access to the content. I know that document management systems such as OpenText eDocs and Autonomy iManage use other tools to full-text index documents of all types and can present the content in a viewer application. So if you look in that direction, you may find something useful.
