Is there a way to read a Word document line by line?
Question
I am trying to extract all the words in a Word document. I can do it all in one go as follows:
    Word.Application word = new Word.Application();
    doc = word.Documents.Open(@"C:SampleText.doc");
    doc.Activate();
    foreach (Word.Range docRange in doc.Words) // loads all words in document
    {
        IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
            .Select(i => docRange.Text.Substring(i))
            .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));
        wordPosition =
            (int)
            docRange.get_Information(
                Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);
        foreach (var substring in sortedSubstrings)
        {
            index = docRange.Text.IndexOf(substring) + wordPosition;
            charLocation[index] = substring;
        }
    }
However, I would prefer to load the document one line at a time. Is that possible?
I can load it by paragraph, but I am unable to iterate through each paragraph to extract all the words.
    foreach (Word.Paragraph para in doc.Paragraphs)
    {
        foreach (Word.Range docRange in para) // Error: type Word.Paragraph is not enumerable
        {
            IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
                .Select(i => docRange.Text.Substring(i))
                .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));
            wordPosition =
                (int)
                docRange.get_Information(
                    Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);
            foreach (var substring in sortedSubstrings)
            {
                index = docRange.Text.IndexOf(substring) + wordPosition;
                charLocation[index] = substring;
            }
        }
    }
Recommended answer
I would suggest following the code on this page here.
The crux of it is that you read the document with a Word.ApplicationClass (Microsoft.Office.Interop.Word) object, although where that page gets its "Doc" object is beyond me. I would assume you create it through the ApplicationClass.
The document is retrieved with this call:
    Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
                                               ref nullobj, ref nullobj, ref nullobj,
                                               ref nullobj, ref nullobj, ref nullobj,
                                               ref nullobj, ref nullobj, ref nullobj);
Sadly, the code on the page I linked is not formatted very readably.
From there you can loop through the document's paragraphs; however, as far as I can see, there is no way of looping through lines. I would suggest using some pattern matching to find the line breaks.
To extract the text from a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside the paragraph. You then have to search that text for line-break characters; I would use string.IndexOf().
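As a rough sketch of that search, the helper below splits a paragraph's text at line-break characters. It assumes the text came from Word.Paragraph.Range.Text and that a manual line break (Shift+Enter) appears there as '\v' and the paragraph mark as '\r'. The names ParagraphLines and SplitLines are made up for illustration, and string.IndexOfAny stands in for repeated string.IndexOf calls. Lines that Word wraps automatically during layout carry no character in the text, so this approach cannot find them.

```csharp
using System;
using System.Collections.Generic;

static class ParagraphLines
{
    // Assumed break characters: '\v' = manual line break, '\r' = paragraph mark,
    // '\n' included defensively for text that did not come straight from Word.
    private static readonly char[] LineBreaks = { '\v', '\r', '\n' };

    // Hypothetical helper: returns the non-empty pieces between break characters.
    public static List<string> SplitLines(string paragraphText)
    {
        var lines = new List<string>();
        int start = 0;
        while (start < paragraphText.Length)
        {
            // Find the next break character at or after 'start'.
            int end = paragraphText.IndexOfAny(LineBreaks, start);
            if (end < 0)
            {
                end = paragraphText.Length; // no more breaks: take the rest
            }
            if (end > start)
            {
                lines.Add(paragraphText.Substring(start, end - start));
            }
            start = end + 1; // skip past the break character
        }
        return lines;
    }

    static void Main()
    {
        // Stand-in for para.Range.Text; a real paragraph ends with '\r'.
        foreach (string line in SplitLines("first line\vsecond line\r"))
        {
            Console.WriteLine(line);
        }
    }
}
```

In the paragraph loop you would pass para.Range.Text to SplitLines and then process each returned string.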
Alternatively, if by "lines" you mean extracting one sentence at a time, you can simply iterate through Range.Sentences.
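A minimal sketch of that iteration follows. It also addresses the error in the question: a Word.Paragraph is not enumerable, but its Range exposes the Sentences and Words collections, which are. This assumes a reference to Microsoft.Office.Interop.Word, C# 4 or later (so the optional ref arguments to Documents.Open can be omitted), and a document path passed as the first command-line argument. The class name SentenceReader is made up for illustration.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

class SentenceReader
{
    static void Main(string[] args)
    {
        var wordApp = new Word.Application();
        Word.Document doc = null;
        try
        {
            // args[0] is assumed to be the full path of the document to read.
            doc = wordApp.Documents.Open(args[0], ReadOnly: true);

            foreach (Word.Paragraph para in doc.Paragraphs)
            {
                // para itself cannot be enumerated; go through para.Range.
                foreach (Word.Range sentence in para.Range.Sentences)
                {
                    Console.WriteLine("Sentence: " + sentence.Text.Trim());

                    // Each sentence is itself a Range, so its words can be walked too.
                    foreach (Word.Range wordRange in sentence.Words)
                    {
                        Console.WriteLine("  Word: " + wordRange.Text.Trim());
                    }
                }
            }
        }
        finally
        {
            // Close without saving and shut Word down so no hidden instance lingers.
            if (doc != null)
            {
                doc.Close(false);
            }
            wordApp.Quit();
        }
    }
}
```

The per-word logic from the question could replace the inner Console.WriteLine call.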