Is there a way to read a Word document line by line?

Problem description

I am trying to extract all the words in a Word document. I am able to do it all in one go as follows...

Word.Application word = new Word.Application();
Word.Document doc = word.Documents.Open(@"C:\SampleText.doc");
doc.Activate();

foreach (Word.Range docRange in doc.Words) // iterates over every word in the document
{
    IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
        .Select(i => docRange.Text.Substring(i))
        .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

    wordPosition = (int)docRange.get_Information(
        Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

    foreach (var substring in sortedSubstrings)
    {
        index = docRange.Text.IndexOf(substring) + wordPosition;
        charLocation[index] = substring;
    }
}

However, I would prefer to load the document one line at a time. Is that possible?

I can load it paragraph by paragraph, but I am unable to iterate through a paragraph to extract its words.

foreach (Word.Paragraph para in doc.Paragraphs)
{
    foreach (Word.Range docRange in para) // Error: type Word.Paragraph is not enumerable
    {
        IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
            .Select(i => docRange.Text.Substring(i))
            .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

        wordPosition = (int)docRange.get_Information(
            Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

        foreach (var substring in sortedSubstrings)
        {
            index = docRange.Text.IndexOf(substring) + wordPosition;
            charLocation[index] = substring;
        }

    }
}

Solution

I would suggest following the code on this page here

The crux of it is that you read the document with a Word.ApplicationClass (Microsoft.Office.Interop.Word) object, although where the author of that page gets the "Doc" object is beyond me. I would assume you create it through the ApplicationClass.

EDIT: The document is retrieved by calling this:

Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj);

Sadly, the code on the page I linked is not formatted in a way that makes it easy to read.

EDIT2: From there you can loop through the document's paragraphs; however, as far as I can see, there is no way of looping through lines. I would suggest using some pattern matching to find line breaks.

In order to extract the text from a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside the paragraph. Then you must search for line-break characters; I'd use string.IndexOf().
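
A minimal sketch of that search, under a few assumptions: the class and method names (ParagraphLines, SplitLines) are hypothetical, and it uses string.IndexOfAny rather than plain IndexOf so that several break characters can be matched in one pass. In Range.Text a manual line break (Shift+Enter) appears as '\v' and the paragraph mark as '\r'. Lines that Word merely wraps at the page margin carry no character at all, so this approach will not find them.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphLines
{
    // '\v' (char 11) is a manual line break in Word's Range.Text,
    // '\r' (char 13) is the paragraph mark; '\n' is included defensively.
    private static readonly char[] LineBreaks = { '\v', '\r', '\n' };

    // Hypothetical helper: walks the text with string.IndexOfAny and
    // yields each non-empty segment found between line-break characters.
    public static IEnumerable<string> SplitLines(string paragraphText)
    {
        int start = 0;
        while (start < paragraphText.Length)
        {
            int end = paragraphText.IndexOfAny(LineBreaks, start);
            if (end < 0)
            {
                end = paragraphText.Length; // no more breaks: take the rest
            }
            if (end > start)
            {
                yield return paragraphText.Substring(start, end - start);
            }
            start = end + 1; // skip past the break character
        }
    }
}
```

It could then be called as foreach (string line in ParagraphLines.SplitLines(para.Range.Text)) inside the paragraph loop.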

Alternatively, if by lines you mean extracting one sentence at a time, you can simply iterate through Range.Sentences.
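
As a rough sketch of how this fits the loop from the question: a Word.Paragraph cannot be enumerated directly, but its Range exposes the Words and Sentences collections, and both yield Word.Range items. The class and method names below are hypothetical, `doc` is assumed to be an already opened document, and the project is assumed to reference the Microsoft.Office.Interop.Word assembly.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

public static class ParagraphWalker
{
    // Hypothetical helper: prints every word and sentence, paragraph by paragraph.
    // Assumes `doc` was opened through Word.Application.Documents.Open.
    public static void DumpWordsAndSentences(Word.Document doc)
    {
        foreach (Word.Paragraph para in doc.Paragraphs)
        {
            // Enumerate para.Range.Words instead of para itself.
            foreach (Word.Range wordRange in para.Range.Words)
            {
                Console.WriteLine("word: " + wordRange.Text.Trim());
            }

            // Sentences are exposed the same way.
            foreach (Word.Range sentence in para.Range.Sentences)
            {
                Console.WriteLine("sentence: " + sentence.Text.Trim());
            }
        }
    }
}
```

The body of the inner word loop is where the substring processing from the question would go.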
