Text Extraction from HTML in Java
Problem description
I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.
I want to extract the information between the paragraph tags, but I can only get one line of the paragraph. My code is as follows:
FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;
while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        try {
            out.write(s);
        } catch (IOException e) {
        }
    }
}
I was trying to add another while loop, which would tell the program to keep writing to the file until the line contains the </p> tag:
while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        while (!s.contains("</p>")) {
            try {
                out.write(s);
            } catch (IOException e) {
            }
        }
    }
}
But this doesn't work. Could someone please help?
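For reference, the inner loop above never reads a new line, so s never changes and the loop cannot reach the line that contains </p>. A minimal line-based sketch of the intended behaviour keeps a flag instead of nesting a second loop. The class name ParagraphLines, the method collect, and the flag inParagraph are hypothetical names introduced for illustration, and the sketch assumes plain <p> tags without attributes and at most one paragraph per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
class ParagraphLines {

    // Returns every line from one containing "<p>" through the next one containing "</p>".
    // Assumes plain <p> tags (no attributes) and at most one paragraph per line.
    static List<String> collect(BufferedReader reader) {
        List<String> kept = new ArrayList<>();
        boolean inParagraph = false; // true while we are between <p> and </p>
        try {
            String s;
            while ((s = reader.readLine()) != null) {
                if (s.contains("<p>")) {
                    inParagraph = true;
                }
                if (inParagraph) {
                    kept.add(s);
                }
                if (s.contains("</p>")) {
                    inParagraph = false;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1>\n<p>first line\nsecond line</p>\n<div>skip</div>\n";
        for (String line : collect(new BufferedReader(new StringReader(html)))) {
            System.out.println(line);
        }
    }
}
```

The single readLine call in the loop condition is the only place a line is consumed, so every line is examined exactly once. String matching like this is still fragile on real-world HTML, where tags carry attributes or several paragraphs share a line.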
Solution
jsoup
Another HTML parser I really liked using was jsoup. You could get all the <p> elements in two lines of code.
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements ps = doc.select("p");
Then write it out to a file in one more line:
out.write(ps.text()); //it will append all of the p elements together in one long string
Or, if you want them on separate lines, you can iterate through the elements and write them out separately.
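Writing each paragraph on its own line might look roughly like the sketch below. It assumes the jsoup library is on the classpath and parses an inline HTML string with Jsoup.parse instead of fetching a page, so it runs without network access. The class name ParagraphText, the method paragraphs, and the sample HTML are made up for illustration; newFile.txt is the output name from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class ParagraphText {

    // Returns the text of each <p> element, one list entry per paragraph.
    static List<String> paragraphs(String html) {
        Document doc = Jsoup.parse(html);
        Elements ps = doc.select("p");
        List<String> texts = new ArrayList<>();
        for (Element p : ps) {
            // text() returns the element's text with the markup stripped
            texts.add(p.text());
        }
        return texts;
    }

    public static void main(String[] args) throws IOException {
        // Sample input (an assumption); a real program would load the downloaded page instead.
        String html = "<html><body><p>first\nparagraph</p><div>skip</div><p>second</p></body></html>";
        // try-with-resources closes the writer, which also flushes it
        try (BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"))) {
            for (String text : paragraphs(html)) {
                out.write(text);
                out.newLine();
            }
        }
    }
}
```

Because the parser works on the document tree rather than on individual lines, a paragraph that spans several lines of the source file still comes back as a single element.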