Is there a way to read a Word document line by line?

Question

I am trying to extract all the words in a Word document. I can do it all in one go as follows:

Word.Application word = new Word.Application();
doc = word.Documents.Open(@"C:\SampleText.doc");
doc.Activate();

foreach (Word.Range docRange in doc.Words) // loads all words in document
{
    IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
        .Select(i => docRange.Text.Substring(i))
        .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

    wordPosition =
        (int)
        docRange.get_Information(
            Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

    foreach (var substring in sortedSubstrings)
    {
        index = docRange.Text.IndexOf(substring) + wordPosition;
        charLocation[index] = substring;
    }
}

However, I would prefer to load the document one line at a time. Is it possible to do so?

I can load it paragraph by paragraph, but I am unable to iterate through each paragraph to extract all of its words.

foreach (Word.Paragraph para in doc.Paragraphs)
{
    foreach (Word.Range docRange in para) // Error: type Word.Paragraph is not enumerable
    {
        IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
            .Select(i => docRange.Text.Substring(i))
            .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

        wordPosition =
            (int)
            docRange.get_Information(
                Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

        foreach (var substring in sortedSubstrings)
        {
            index = docRange.Text.IndexOf(substring) + wordPosition;
            charLocation[index] = substring;
        }

    }
}

Recommended answer

I would suggest following the code on this page here.

The crux of it is that you read the document with a Word.ApplicationClass (Microsoft.Office.Interop.Word) object, although where the author of that page gets the "Doc" object is beyond me. I would assume you create it with the ApplicationClass.

You retrieve the document by calling this:

Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj);

Sadly, the code on the page I linked is not formatted in a way that is easy to read.

From there you can loop through the document's paragraphs. However, as far as I can see, there is no way of looping through lines. I would suggest using some pattern matching to find line breaks.

To extract the text from a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside the paragraph. Then you must search for line break characters. I'd use string.IndexOf().
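As a rough sketch of the line break search described above: in the text Word returns from Range.Text, a paragraph mark appears as '\r' and a manual line break (Shift+Enter) as '\v'. The class and method names below (ParagraphLines, SplitLines) are hypothetical, and the sketch uses string.IndexOfAny rather than a single IndexOf so it can look for several break characters at once. Note that it only finds explicit breaks; lines produced by automatic wrapping on the page are not stored as characters in the text, so this approach cannot see them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the Word object model.
public static class ParagraphLines
{
    // '\v' = manual line break, '\r' = paragraph mark; '\n' is included
    // defensively in case the text was pasted from another source.
    private static readonly char[] LineBreaks = { '\v', '\r', '\n' };

    // Splits the text of one paragraph into its explicit lines.
    // Empty lines are skipped.
    public static List<string> SplitLines(string paragraphText)
    {
        var lines = new List<string>();
        int start = 0;
        while (start < paragraphText.Length)
        {
            // Find the next break character at or after 'start'.
            int end = paragraphText.IndexOfAny(LineBreaks, start);
            if (end < 0) end = paragraphText.Length;
            if (end > start)
            {
                lines.Add(paragraphText.Substring(start, end - start));
            }
            start = end + 1;
        }
        return lines;
    }
}
```

Inside the paragraph loop, this could be called as ParagraphLines.SplitLines(para.Range.Text), and each returned string processed as one line.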

Alternatively, if by "lines" you mean extracting one sentence at a time, you can simply iterate through Range.Sentences.
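A minimal sketch of that iteration, assuming Word is installed and the project references Microsoft.Office.Interop.Word. The class and method names (WordReader, PrintSentences) are hypothetical. It also shows why the inner loop in the question fails: a Paragraph is not enumerable, but its Range exposes the Sentences and Words collections, so para.Range.Words could be used in the same way to walk the words of each paragraph.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper; requires a local Word installation to run.
public static class WordReader
{
    // Prints every sentence of the document, paragraph by paragraph.
    public static void PrintSentences(string path)
    {
        var app = new Word.Application();
        Word.Document doc = null;
        try
        {
            doc = app.Documents.Open(path, ReadOnly: true);
            foreach (Word.Paragraph para in doc.Paragraphs)
            {
                // Enumerate the paragraph's Range, not the Paragraph itself.
                foreach (Word.Range sentence in para.Range.Sentences)
                {
                    Console.WriteLine(sentence.Text.Trim());
                }
            }
        }
        finally
        {
            // The casts avoid the method/event name ambiguity on Close and Quit.
            if (doc != null) ((Word._Document)doc).Close(SaveChanges: false);
            ((Word._Application)app).Quit();
        }
    }
}
```

Closing the document and quitting the application in a finally block keeps a hidden Word process from being left behind if reading fails partway through.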
