Generically read any file format and convert it to .txt format


Question

Before any further processing, I need to make sure that a file supplied by the user is converted to a .txt file if it contains text.

At the moment I have a switch statement that checks for specific formats and converts each of them to .txt.

switch (extension)
{
    case ".pdf":
        //Convert from .pdf to .txt file
        break;
    case ".doc":
        //Convert from .doc to .txt file
        break;
    default:
        Console.WriteLine("The file could not be converted!");
        break;
}

The problem is that I need something more generic: a way to check whether the given file is already a .txt file or, if it is not, whether it can be converted to one.
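One way to make the check less rigid than a switch is a lookup table keyed by extension. The following is only a minimal sketch: the `TextConverters` class and its `Register` and `TryConvert` methods are hypothetical names introduced for illustration, and the actual conversion functions are assumed to be supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of any library mentioned in this post.
public static class TextConverters
{
    // Maps a file extension to a function that returns the file's text.
    // The comparer makes ".PDF" and ".pdf" equivalent.
    private static readonly Dictionary<string, Func<string, string>> Converters =
        new Dictionary<string, Func<string, string>>(StringComparer.OrdinalIgnoreCase)
        {
            // A .txt file needs no conversion, so it is simply read.
            { ".txt", path => File.ReadAllText(path) },
        };

    // Adds or replaces the converter for one extension (including the dot).
    public static void Register(string extension, Func<string, string> converter)
    {
        Converters[extension] = converter;
    }

    // Returns true and the text when a converter is registered for the
    // file's extension; returns false when the format is not supported.
    public static bool TryConvert(string path, out string text)
    {
        Func<string, string> converter;
        if (Converters.TryGetValue(Path.GetExtension(path), out converter))
        {
            text = converter(path);
            return true;
        }

        text = null;
        return false;
    }
}
```

New formats are then added with a call such as `TextConverters.Register(".pdf", ...)` instead of another `case`. This still decides by extension only, so it cannot recognise a text-bearing file with an unknown or wrong extension.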

Recommended answer

Following L.B's advice, I am going to reincarnate the original blog post here.

This may sound scary and heretical, but did you know that you can use Java libraries from .NET applications without TCP sockets or web services getting caught in the crossfire? Let me introduce you to IKVM, which is frankly magic:

IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:

  • A Java Virtual Machine implemented in .NET
  • A .NET implementation of the Java class libraries
  • Tools that enable Java and .NET interoperability

Using IKVM, we have successfully integrated our Dovetail Seeker search application with the Tika text extraction library, which is implemented in Java. With Tika we can easily pull text out of rich documents in many supported formats. Why Tika? Because nothing in the .NET world is comparable to it.

This post reviews how we integrated with Tika. If you like code, you can find this example in a repo on GitHub.

Compiling the jar into an assembly

First, we need to get our hands on the latest version of Tika. I downloaded the Tika source and built it with Maven as instructed. The result was a few jar files. The one we are interested in is tika-app-x.x.jar, which bundles everything we need into one useful container.

Next, we need to convert the jar we built into a .NET assembly. Do this using ikvmc.exe.

tika\build>ikvmc.exe -target:library tika-app-0.7.jar

Unfortunately, you will see tons of troublesome-looking warnings, but the end result is a .NET assembly wrapping the Java jar, which you can reference in your projects.

Using Tika from .NET

IKVM is pretty transparent. You simply reference the Tika app assembly and your .NET code is talking to Java types. It is a bit weird at first, because you have both Java versions and .NET versions of types. Next, make sure that all the dependent IKVM runtime assemblies are included with your project. Using Reflector, I found that the Tika app assembly references a lot of IKVM assemblies that do not appear to be used. I had to figure out through trial and error which assemblies were not being touched by the rich document extraction. If need be, you could simply include all of the referenced IKVM assemblies with your application. I have done the work for you and eliminated the references to all the IKVM assemblies that do not appear to be in play.

16 assemblies down to 5. A much smaller deployment.

Using Tika

To do some text extraction we ask Tika, very nicely, to parse the files we throw at it. For my purposes, this meant having Tika automatically determine how to parse the stream and extract the text and metadata of the document.

// getTransformerHandler(), assembleExtractionResult() and _outputWriter are
// members of the extractor class that are not shown in this excerpt.
public TextExtractionResult Extract(string filePath)
{
    // AutoDetectParser lets Tika work out the document type on its own.
    var parser = new AutoDetectParser();
    var metadata = new Metadata();
    var parseContext = new ParseContext();
    java.lang.Class parserClass = parser.GetType();
    parseContext.set(parserClass, parser);

    try
    {
        var file = new File(filePath);
        var url = file.toURI().toURL();
        using (var inputStream = MetadataHelper.getInputStream(url, metadata))
        {
            parser.parse(inputStream, getTransformerHandler(), metadata, parseContext);
            inputStream.close();
        }

        return assembleExtractionResult(_outputWriter.toString(), metadata);
    }
    catch (Exception ex)
    {
        throw new ApplicationException("Extraction of text from the file '{0}' failed.".ToFormat(filePath), ex);
    }
}

An important caveat

Java has a concept called a ClassLoader, which has something to do with how Java types are found and loaded. There is probably a better way around this, but for some reason Tika does not work properly unless you implement a custom ClassLoader and also set an application setting that tells the IKVM runtime which .NET type to use as the ClassLoader.

public class MySystemClassLoader : ClassLoader
{
    public MySystemClassLoader(ClassLoader parent)
        : base(new AppDomainAssemblyClassLoader(typeof(MySystemClassLoader).Assembly))
    {
    }
}

Here is an example app.config telling IKVM where to find the ClassLoader.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="ikvm:java.system.class.loader" value="TikaOnDotNet.MySystemClassLoader, TikaOnDotNet" />
  </appSettings>
</configuration>

This step is very important. If, for some horrible reason, IKVM cannot find the class loader, Tika will appear to work fine but will extract only empty documents with no metadata. What makes this troubling is that no exception is raised. For this reason our application includes a validation step that ensures the app setting is present and resolves to a valid type.
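The validation step is not shown in the post. A check along these lines could serve as a starting point; it is a sketch only, and the `ClassLoaderSettingCheck` class and its `Validate` method are hypothetical names. It takes the setting value as a plain string so that it does not depend on where the configuration is read from.

```csharp
using System;

// Hypothetical helper, not the validation code used by the original author.
public static class ClassLoaderSettingCheck
{
    // The key used in the app.config example.
    public const string SettingKey = "ikvm:java.system.class.loader";

    // Returns null when the value is present and names a loadable type,
    // otherwise a message describing what is wrong.
    public static string Validate(string settingValue)
    {
        if (string.IsNullOrWhiteSpace(settingValue))
        {
            return "App setting '" + SettingKey + "' is missing or empty.";
        }

        // With throwOnError set to false, Type.GetType returns null for a
        // name it cannot find instead of throwing.
        Type type = Type.GetType(settingValue, false);
        if (type == null)
        {
            return "App setting '" + SettingKey + "' does not resolve to a type: " + settingValue;
        }

        return null;
    }
}
```

In a .NET Framework application the value would typically be read with `ConfigurationManager.AppSettings[ClassLoaderSettingCheck.SettingKey]` at startup, and a non-null result treated as a fatal configuration error. The sketch does not verify that the resolved type actually derives from ClassLoader.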

Demo

Here is a test demonstrating an extraction and its result.

[Test]
public void should_extract_from_pdf()
{
    var textExtractionResult = new TextExtractor().Extract("Tika.pdf");

    textExtractionResult.Text.ShouldContain("pack of pickled almonds");

    Console.WriteLine(textExtractionResult);
}

Put simply, rich documents like this go in.

And a TextExtractionResult comes out:

public class TextExtractionResult
{
    public string Text { get; set; }
    public string ContentType { get; set; }
    public IDictionary<string, string> Metadata { get; set; }
    //toString() override
}
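To get from the extracted text back to the .txt file the question asks for, the Text value still has to be written to disk. The sketch below assumes the text has already been extracted; `TextFileWriter` and `SaveAsTxt` are hypothetical names and are not part of TikaOnDotnet.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for persisting extracted text.
public static class TextFileWriter
{
    // Writes the text next to the source file, replacing the source
    // extension with .txt, and returns the path that was written.
    public static string SaveAsTxt(string sourcePath, string extractedText)
    {
        string txtPath = Path.ChangeExtension(sourcePath, ".txt");

        // A null text is written as an empty file rather than failing.
        File.WriteAllText(txtPath, extractedText ?? string.Empty, Encoding.UTF8);
        return txtPath;
    }
}
```

It would be called with something like `TextFileWriter.SaveAsTxt(filePath, result.Text)`. Note that if the source is itself a .txt file, this overwrites it, so a real implementation would probably want to skip that case or write to a separate folder.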

(The raw output from Tika is not reproduced here.)

Conclusion

I hope this boosts your confidence that you can use Java libraries in your .NET code, and I hope my example repo will be of assistance if you need to do some work with Tika on the .NET platform. Enjoy.

Info to get it set up:

Use NuGet to look up TikaOnDotnet and install both TikaOnDotnet and TikaOnDotnet.TextExtractor into your project. Here is the code to test it out in a WinForms app:

public partial class Form1 : Form
{
    private System.Windows.Forms.TextBox textBox1;
    private TextExtractor _textExtractor;
    public Form1()
    {
        InitializeComponent();
        _textExtractor = new TextExtractor();

        textBox1 = new System.Windows.Forms.TextBox();
        textBox1.Dock = System.Windows.Forms.DockStyle.Fill;
        textBox1.Multiline = true;
        textBox1.Name = "textBox1";
        textBox1.ScrollBars = System.Windows.Forms.ScrollBars.Vertical;
        textBox1.AllowDrop = true;
        textBox1.DragDrop += new System.Windows.Forms.DragEventHandler(this.textBox1_DragDrop);
        textBox1.DragOver += new System.Windows.Forms.DragEventHandler(this.textBox1_DragOver);
        Controls.Add(this.textBox1);
        Name = "Drag/Drop any file on to the TextBox";
        ClientSize = new System.Drawing.Size(867, 523);
    }

    private void textBox1_DragOver(object sender, DragEventArgs e)
    {
        if (e.Data.GetDataPresent(DataFormats.FileDrop))
            e.Effect = DragDropEffects.Copy;
        else
            e.Effect = DragDropEffects.None;
    }

    private void textBox1_DragDrop(object sender, DragEventArgs e)
    {
        string[] files = (string[])e.Data.GetData(DataFormats.FileDrop);
        if (files != null && files.Length != 0)
        {
            TextExtractionResult textExtractionResult = _textExtractor.Extract(files[0]);
            textBox1.Text = textExtractionResult.Text;
        }
    }
}

The original blog page has moved, but no permanent (HTTP 301) redirect is in place: http://clarify.dovetailsoftware.com/kmiller/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm/
