Parse text from a PDF, txt, or docx file from a URL without downloading it in Java 8

Views: 165 Posted: 2020/4/24 10:00:37 Tags: java parsing pdf stream jsoup

Question

I need to be able to parse the text contained in an online file at a given URL, e.g. http://website.com/document.pdf.

I am building a search engine that tells me whether a searched word appears in some file online and returns that file's URL, so I don't need to download the file, only to read it.

I looked for a way to do this and found something involving InputStream and openConnection, but I didn't manage to get it working.

I am using jsoup to crawl a website and collect the URLs, and I tried to parse the files with a Jsoup method, but that does not work.

So what is the best way to do this?

I want to be able to do something like this:

File in = new File("http://website.com/document.pdf");
Document doc = Jsoup.parse(in, "UTF-8");
System.out.println(doc.toString());

Recommended answer

You can use a URL instead of a File to access the document. With Apache Tika, you should be able to extract the content as a string this way:

import java.io.InputStream;
import java.net.URL;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://website.com/document.pdf");
        ContentHandler contenthandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        PDFParser pdfparser = new PDFParser();

        // Open a stream on the URL and parse it directly; nothing is saved to disk.
        try (InputStream is = url.openStream()) {
            pdfparser.parse(is, contenthandler, metadata, new ParseContext());
        }

        // The handler accumulates the extracted body text.
        System.out.println(contenthandler.toString());
    }
}
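
For the plain-text (txt) case mentioned in the question, the same stream-based idea can be sketched with the standard library alone. The class name StreamWordSearch and the method containsTerm are hypothetical names chosen for illustration, and UTF-8 is an assumed encoding; none of this comes from the original answer. Note that the bytes still travel over the network as they are read; the file is just never written to disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original answer.
final class StreamWordSearch {

    // Returns true if any line of the stream contains the term
    // (case-insensitive substring match). Stops reading at the first hit.
    static boolean containsTerm(InputStream in, String term, Charset charset) throws IOException {
        String needle = term.toLowerCase(Locale.ROOT);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: StreamWordSearch <url> <term>");
            return;
        }
        // The response body is read straight from the connection (UTF-8 assumed).
        try (InputStream in = new URL(args[0]).openStream()) {
            System.out.println(containsTerm(in, args[1], StandardCharsets.UTF_8));
        }
    }
}
```

Because containsTerm accepts any InputStream, it could also be pointed at text that has already been extracted from a PDF or docx file, which may suit the search-engine use case described in the question.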
