Help extracting text from html tag with Java and Regex
Problem description
I would like to extract some text from an HTML file using regex. I am still learning regex and have trouble understanding it all. I have some code that extracts all the text between <body> and </body>. Here it is:
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {
    public static void main(String[] args) throws IOException {
        String toMatch = readFile();
        //Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
        Matcher matcher = pattern.matcher(toMatch);
        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

    private static String readFile() {
        try {
            // Open the file
            FileInputStream fstream = new FileInputStream("user.html");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            // Read file line by line
            while (br.readLine() != null) {
                // Print the content on the console
                //System.out.println(strLine);
                strLine += br.readLine();
            }
            // Close the input stream
            in.close();
            return strLine;
        } catch (Exception e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}
This works fine, but now I would like to extract the text between the tags <table class="claroTable"> and </table>.
So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"
I have also tried
".*?<table class=\"claroTable\">(.*?)</table>.*?"
but it doesn't work and I don't understand why. There is only one table in the HTML file, but "table" also occurs in some JavaScript code: "...dataTables.js...". Could that be the reason for the mistake?
Thank you in advance for helping me,
EDIT: the HTML text to extract from is something like:
<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>
What I would like to extract is anything between <table class="claroTable"> and </table>.
Solution
Here's how you can do it with the JSoup parser:
File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();
Yes, you can also somehow do it with regex, but it will never be this easy.
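Update: The main problem with your regex pattern is that you are missing the DOTALL flag. Without it, the dot (.) does not match line terminators, so the pattern cannot span multiple lines: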
Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
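To make the effect of the flag concrete, here is a minimal sketch that runs the same pattern with and without Pattern.DOTALL against a small multi-line input. The class name DotAllDemo, the helper method names, and the sample HTML string are made up for illustration; only the regex and the flag come from the text above.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Same shape as the <body> pattern from the question.
    static final String REGEX = ".*?<body.*?>(.*?)</body>.*?";

    // Without flags, '.' stops at line terminators.
    static boolean matchesWithoutDotAll(String input) {
        return Pattern.compile(REGEX).matcher(input).matches();
    }

    // With DOTALL, '.' also matches line terminators.
    static boolean matchesWithDotAll(String input) {
        return Pattern.compile(REGEX, Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample input containing line breaks.
        String html = "<html>\n<body>\nhello\n</body>\n</html>";
        System.out.println(matchesWithoutDotAll(html)); // prints false
        System.out.println(matchesWithDotAll(html));    // prints true
    }
}
```

Note that matches() requires the whole input to match, which is why the pattern needs the leading and trailing .*? parts at all.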
And if you just want the contents of the specified table tag, you can do something like this:
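String tableTag =
    // html holds the full document text. The trailing .* is greedy so that, with DOTALL,
    // replaceFirst also discards everything after </table>.
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*", Pattern.DOTALL)
        .matcher(html)
        .replaceFirst("$1");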
(Updated: now returns the contents of the table tag only, not the table tag itself)
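As an alternative to replaceFirst, Matcher.find() combined with group(1) searches for the first occurrence anywhere in the input and returns only the captured part, so the pattern needs no surrounding .*? pieces. The sketch below also reads the file in one call: the loop in the question calls readLine() twice per iteration (discarding every other line) and starts concatenating from null, so part of the document can be lost before the regex even runs. The class name TableExtractor, its method names, and the sample string are hypothetical, and the ISO-8859-1 charset is simply carried over from the JSoup snippet above as an assumption about the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableExtractor {
    // DOTALL lets '.' match line breaks inside the table body.
    // [^>]* keeps the attribute matching inside the opening tag.
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*class=\"claroTable\"[^>]*>(.*?)</table>", Pattern.DOTALL);

    // Returns the text between the opening and closing table tags,
    // or null if no such table is found.
    static String extractTable(String html) {
        Matcher m = TABLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Reads the whole file, line breaks included (charset is an assumption).
    static String readFile(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, read that file; otherwise use a small inline sample.
        String html = args.length > 0
                ? readFile(args[0])
                : "<body>\n<table class=\"claroTable\">\n<tr><td>data</td></tr>\n</table>\n</body>";
        System.out.println(extractTable(html));
    }
}
```

Because the pattern anchors on the literal "<table", a script reference such as dataTables.js elsewhere in the page does not interfere with the match.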