Text Extraction from HTML Java

Views: 104  Published: 2018/6/13 15:50:27  Tags: java html screen-scraping html-content-extraction text-extraction


Problem description

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.

I want to extract the information that is in between the paragraph tags, but I can only get one line of each paragraph. My code is as follows:
FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;

while ((s = buffRd.readLine()) != null) {
    // Only the line that contains the opening tag is written
    if (s.contains("<p>")) {
        try {
            out.write(s);
        } catch (IOException e) {
        }
    }
}
I tried adding another while loop, which would tell the program to keep writing to the file until the line contains the </p> tag:
while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        // s is never re-read inside this loop
        while (!s.contains("</p>")) {
            try {
                out.write(s);
            } catch (IOException e) {
            }
        }
    }
}
But this doesn't work. Could someone please help?
Solution
jsoup

Another HTML parser I really liked using was jsoup. You can get all the <p> elements in two lines of code:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements ps = doc.select("p");
Then write it out to a file in one more line:
out.write(ps.text());  // appends the text of all the p elements together in one long string
Or, if you want them on separate lines, you can iterate through the elements and write them out separately.
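
Writing each paragraph on its own line could look roughly like the sketch below. It assumes the jsoup library is on the classpath. The class name ParagraphExtractor and the method writeParagraphs are made up for illustration, and the HTML is parsed from an inline string with Jsoup.parse so that the example does not need a network connection.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup or of the original answer.
public class ParagraphExtractor {

    // Writes the text of every <p> element in the given HTML to out,
    // one paragraph per line.
    static void writeParagraphs(String html, Writer out) throws IOException {
        Document doc = Jsoup.parse(html);
        Elements ps = doc.select("p");
        for (Element p : ps) {
            // Element.text() returns the paragraph's text with whitespace
            // normalized, so a paragraph that spans several source lines
            // comes out as a single line.
            out.write(p.text());
            out.write(System.lineSeparator());
        }
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>First\nparagraph</p>"
                + "<div>not a paragraph</div><p>Second</p></body></html>";
        StringWriter sw = new StringWriter();
        writeParagraphs(html, sw);
        System.out.print(sw);
    }
}
```

To write to a file instead, the same method can be given a BufferedWriter wrapped around a FileWriter, as in the question, ideally opened in a try-with-resources block so the writer is flushed and closed.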
