Text Extraction from HTML in Java


Question

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.

I want to extract the text between the paragraph tags, but I can only get one line of each paragraph. My code is as follows:

FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;

while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        try {
            out.write(s);
        } catch (IOException e) {
        }
    }
}

I tried adding another while loop that would keep writing to the file until the line contains the </p> tag:

while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        while (!s.contains("</p>")) {
            try {
                out.write(s);
            } catch (IOException e) {
            }
        }
    }
}

But this doesn't work. Could someone please help?

Solution

jsoup

Another HTML parser I really liked using was jsoup. You can get all the <p> elements in two lines of code.

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements ps = doc.select("p");

Then write them out to a file in one more line:

out.write(ps.text());  // joins the text of all the <p> elements into one long string

Or, if you want them on separate lines, you can iterate through the elements and write them out separately.
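
Writing each paragraph on its own line could look roughly like the sketch below. It assumes jsoup is on the classpath; the class name ParagraphExtractor, the helper writeParagraphs, and the sample markup are made up for illustration, and the sketch parses an in-memory string with Jsoup.parse instead of downloading a page so that it runs without network access.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParagraphExtractor {

    // Hypothetical helper: writes the text of every <p> element to `out`,
    // one paragraph per line.
    static void writeParagraphs(Document doc, Writer out) throws IOException {
        for (Element p : doc.select("p")) {
            out.write(p.text());               // text() strips the tags and normalizes whitespace
            out.write(System.lineSeparator()); // keep each paragraph on its own line
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample markup (an assumption for this sketch). The first paragraph
        // spans two source lines, which is the case the line-based loop missed.
        String html = "<html><body>"
                + "<p>First\nparagraph</p>"
                + "<div>not a paragraph</div>"
                + "<p>Second <b>one</b></p>"
                + "</body></html>";

        Document doc = Jsoup.parse(html);
        StringWriter out = new StringWriter(); // stand-in for the BufferedWriter over a file
        writeParagraphs(doc, out);
        System.out.print(out);
    }
}
```

In the real program, the Document would come from Jsoup.connect(url).get() and the Writer would be the BufferedWriter over the output file, closed once writing is done. Because jsoup parses the whole document, each paragraph's full text is returned no matter how many source lines it spans, so no line-by-line loop is needed.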
