Parse text from a PDF, TXT, or DOCX file at a URL without downloading it in Java 8

Question

I need to be able to parse the text contained in an online file at a given URL, e.g. http://website.com/document.pdf.

I am making a search engine that tells me whether the searched word appears in some file online and retrieves that file's URL, so I don't need to download the file, only read it.

I looked for a way and found something involving InputStream and openConnection, but didn't manage to actually get it working.

I am using jsoup to crawl a website and retrieve the URLs, and I tried to parse the file with a Jsoup method, but it does not work.

So what is the best way to do this?

I want to be able to do something like this:

File in = new File("http://website.com/document.pdf");
Document doc = Jsoup.parse(in, "UTF-8");
System.out.println(doc.toString());

Recommended answer

You can use a URL instead of a File to access the document. Using Apache Tika, you should be able to grab the content as a string this way.

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://website.com/document.pdf");
        ContentHandler contenthandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        PDFParser pdfparser = new PDFParser();

        // Open the URL as a stream so the file is parsed without being saved to disk.
        try (InputStream is = url.openStream()) {
            pdfparser.parse(is, contenthandler, metadata, new ParseContext());
        }

        // The handler accumulates the extracted body text.
        System.out.println(contenthandler.toString());
    }
}
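
For plain .txt files, Tika may not be needed at all. The following is a minimal sketch, using only the JDK, of the "read it without saving it" idea from the question: it streams the response body and checks each line for the search word. The class name UrlTextSearch and the helper containsWord are hypothetical names chosen for illustration, and the code assumes the remote text is UTF-8 encoded. Note that the bytes still travel over the network; they are just never written to a local file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class UrlTextSearch {

    // Hypothetical helper (not part of the original answer): returns true if
    // `word` occurs, ignoring case, in the plain-text resource at `url`.
    static boolean containsWord(URL url, String word) throws IOException {
        String needle = word.toLowerCase(Locale.ROOT);
        // openStream() exposes the response body as a stream; nothing is saved to disk.
        // Assumption: the text is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                    return true; // stop reading as soon as a match is found
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java UrlTextSearch <url> <word>
        URL url = new URL(args[0]);
        System.out.println(containsWord(url, args[1]));
    }
}
```

This sketch does a simple substring match per line, so it will not find a phrase that is split across a line break, and it only makes sense for text formats. PDF and DOCX files are binary and still need a parser such as Tika, as in the answer above.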