Is there a way to read a Word document line by line?
Problem description
I am trying to extract all the words in a Word document. I am able to do it all in one go as follows...
Word.Application word = new Word.Application();
Word.Document doc = word.Documents.Open(@"C:\SampleText.doc");
doc.Activate();
foreach (Word.Range docRange in doc.Words) // loads all words in the document
{
    IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
        .Select(i => docRange.Text.Substring(i))
        .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));
    // wordPosition, index and charLocation are declared elsewhere (not shown)
    wordPosition = (int)docRange.get_Information(
        Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);
    foreach (var substring in sortedSubstrings)
    {
        index = docRange.Text.IndexOf(substring) + wordPosition;
        charLocation[index] = substring;
    }
}
However, I would prefer to load the document one line at a time. Is it possible to do so?
I can load it by paragraph, but I am unable to iterate through the paragraphs to extract all the words.
foreach (Word.Paragraph para in doc.Paragraphs)
{
    foreach (Word.Range docRange in para) // Error: type Word.Paragraph is not enumerable
    {
        IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
            .Select(i => docRange.Text.Substring(i))
            .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));
        wordPosition = (int)docRange.get_Information(
            Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);
        foreach (var substring in sortedSubstrings)
        {
            index = docRange.Text.IndexOf(substring) + wordPosition;
            charLocation[index] = substring;
        }
    }
}
I would suggest following the code on this page here
The crux of it is that you read the document with a Word.ApplicationClass (Microsoft.Office.Interop.Word) object, although where the author gets the "Doc" object is beyond me. I would assume you create it with the ApplicationClass.
EDIT: The document is retrieved by calling this:
Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj);
Sadly, the code on the page I linked is not formatted in a way that makes it easy to read.
EDIT2: From there you can loop through the document's paragraphs; however, as far as I can see, there is no way of looping through lines. I would suggest using some pattern matching to find line breaks.
To extract the text of a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside the paragraph. You then have to search that text for line break characters; I'd use string.IndexOf().
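The IndexOf() approach could look roughly like the sketch below. It works on a plain string standing in for para.Range.Text, so it runs without Word. The names ParagraphLines and SplitLines are made up for illustration. It assumes Word reports a manual line break (Shift+Enter) as a vertical tab ('\v') and the paragraph mark as '\r'. Lines that only wrap because of the page layout have no character in the text, so this will not find them.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphLines
{
    // Splits paragraph text on manual line breaks.
    // Assumption: '\v' marks a manual line break and '\r' ends the paragraph.
    public static List<string> SplitLines(string paragraphText)
    {
        var lines = new List<string>();
        string text = paragraphText.TrimEnd('\r'); // drop the trailing paragraph mark
        int start = 0;
        while (true)
        {
            int breakAt = text.IndexOf('\v', start);
            if (breakAt < 0)
            {
                // No further break: the rest of the text is the last line.
                lines.Add(text.Substring(start));
                break;
            }
            lines.Add(text.Substring(start, breakAt - start));
            start = breakAt + 1; // continue after the break character
        }
        return lines;
    }

    public static void Main()
    {
        // Stand-in for para.Range.Text
        string sample = "first line\vsecond line\vthird line\r";
        foreach (string line in SplitLines(sample))
        {
            Console.WriteLine(line);
        }
    }
}
```

In real use you would pass para.Range.Text for each paragraph instead of the sample string.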
Alternatively, if by "lines" you mean extracting one sentence at a time, you can simply iterate through Range.Sentences.
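As a rough sketch of the paragraph-based approach: a Paragraph is not enumerable itself, but its Range exposes Sentences and Words collections that can be iterated. The example assumes Word is installed, the project references Microsoft.Office.Interop.Word, and C# 4 or later is used so the optional COM arguments can be omitted. The class name ParagraphWalker is made up, and the file path is the one from the question.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

public static class ParagraphWalker
{
    public static void Main()
    {
        // Assumption: Word is installed on this machine.
        var word = new Word.Application();
        Word.Document doc = null;
        try
        {
            // Path taken from the question; opened read-only.
            doc = word.Documents.Open(@"C:\SampleText.doc", ReadOnly: true);

            foreach (Word.Paragraph para in doc.Paragraphs)
            {
                // Iterate the paragraph's Range, not the Paragraph itself.
                foreach (Word.Range sentence in para.Range.Sentences)
                {
                    Console.WriteLine("Sentence: " + sentence.Text.Trim());
                }

                foreach (Word.Range wordRange in para.Range.Words)
                {
                    string text = wordRange.Text.Trim();
                    if (text.Length > 0) // skip ranges that are only whitespace or a paragraph mark
                    {
                        Console.WriteLine("  Word: " + text);
                    }
                }
            }
        }
        finally
        {
            // Close without saving and shut Word down so no hidden instance is left running.
            if (doc != null) ((Word._Document)doc).Close(false);
            ((Word._Application)word).Quit();
        }
    }
}
```

Replacing the inner `foreach (Word.Range docRange in para)` from the question with `para.Range.Words` should make that loop compile.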