How to grab text from .odt file


Question

I need to extract all the text from ODF (OpenDocument Format) files in C#. I found the AODL library and installed it. I went to AODL's page at https://wiki.openoffice.org looking for examples of how to do this, but none of them worked for my case. For some reason, every example builds a new document; none shows how to load an existing document and grab all of its text (something like what OpenXML offers). Does anyone know of a reference that could guide me?

My attempt:

var doc = new AODL.Document.TextDocuments.TextDocument();
doc.Load(@"C:\path/to/Sample.odt");

But I can't figure out how to iterate over the contents of the loaded doc.

Recommended answer

Finally, I figured it out. This is the method I wrote to extract all the text. It may be incomplete, because I don't know all the parts that make up an .odt file. It grabs headers and footers, text boxes, and paragraphs, and concatenates them with a carriage return plus line feed ("\r\n") as the separator. You need the AODL package, which can be installed through the Package Manager Console: PM> Install-Package AODL. Then add

using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using AODL.Document.Content;
using AODL.Document.TextDocuments;

at the top of your program.

/// <summary>
/// Gets all plain text from an .odt file
/// </summary>
/// <param name="path">
/// the physical path of the file
/// </param>
/// <returns>a string with all text content</returns>
public String GetTextFromOdt(String path)
{
    var sb = new StringBuilder();
    using (var doc = new TextDocument())
    {
        doc.Load(path);

        // The headers and footers live in the DocumentStyles part. Grab the XML of this part
        XElement stylesPart = XElement.Parse(doc.DocumentStyles.Styles.OuterXml);
        // Take the text of every header and footer, joined with "\r\n"
        string stylesText = string.Join("\r\n", stylesPart.Descendants().Where(x => x.Name.LocalName == "header" || x.Name.LocalName == "footer").Select(y => y.Value));

        // Main content: one entry per top-level content item (paragraphs, text boxes, ...)
        var mainPart = doc.Content.Cast<IContent>();
        var mainText = String.Join("\r\n", mainPart.Select(x => x.Node.InnerText));

        // Append both pieces of text
        sb.Append(stylesText + "\r\n");
        sb.Append(mainText);
    }

    return sb.ToString();
}
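For anyone wondering which parts make up an .odt file: it is a ZIP package, and the body text lives in content.xml while header and footer text lives in styles.xml. The sketch below is a different technique from the AODL method above, not a replacement confirmed by the AODL docs. It reads those two entries directly with System.IO.Compression and LINQ to XML. The class and method names (OdtPlainText, ExtractFromXml, ExtractFromOdt) are made up for illustration, and it assumes that collecting text:p and text:h elements is enough for your documents. It does not expand text:tab, text:s, or text:line-break elements into whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of AODL.
public static class OdtPlainText
{
    // The ODF "text" namespace used by paragraphs (text:p) and headings (text:h).
    private static readonly XNamespace TextNs =
        "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    private static bool IsBlock(XElement e)
    {
        return e.Name == TextNs + "p" || e.Name == TextNs + "h";
    }

    // Returns the text of every outermost paragraph/heading in one XML part.
    // Blocks nested inside another block (e.g. a text box anchored in a
    // paragraph) are skipped because the outer block's Value already
    // contains their text.
    public static string ExtractFromXml(string xml)
    {
        XDocument doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);
        IEnumerable<string> blocks = doc.Descendants()
            .Where(e => IsBlock(e) && !e.Ancestors().Any(IsBlock))
            .Select(e => e.Value);
        return string.Join("\r\n", blocks);
    }

    // Opens the .odt as a ZIP archive and extracts text from styles.xml
    // (headers/footers) and content.xml (document body), in that order.
    public static string ExtractFromOdt(string path)
    {
        var parts = new List<string>();
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            foreach (string name in new[] { "styles.xml", "content.xml" })
            {
                ZipArchiveEntry entry = zip.GetEntry(name);
                if (entry == null) continue; // part missing: skip it
                using (var reader = new StreamReader(entry.Open()))
                {
                    parts.Add(ExtractFromXml(reader.ReadToEnd()));
                }
            }
        }
        return string.Join("\r\n", parts);
    }
}
```

Usage would look like string text = OdtPlainText.ExtractFromOdt(path);. On .NET Framework, ZipFile needs a reference to the System.IO.Compression.FileSystem assembly in addition to System.IO.Compression.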
