Help extracting text from an HTML tag with Java and regex


Problem description

I would like to extract some text from an HTML file using regex. I am learning regular expressions and still have trouble understanding them. I have some code that extracts all the text between <body> and </body>:

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Harn2 {

    public static void main(String[] args) throws IOException {
        String toMatch = readFile();
        // Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); // this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); // I want this one to work
        Matcher matcher = pattern.matcher(toMatch);

        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }
    }

    private static String readFile() {
        try {
            // Open the input file
            FileInputStream fstream = new FileInputStream("user.html");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            // Read file line by line
            while (br.readLine() != null) {
                // Print the content on the console
                // System.out.println(strLine);
                strLine += br.readLine();
            }
            // Close the input stream
            in.close();
            return strLine;
        } catch (Exception e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}

This works fine, but now I would like to extract the text between <table class="claroTable"> and </table>.

So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?". I have also tried ".*?<table class=\"claroTable\">(.*?)</table>.*?", but it doesn't work and I don't understand why. There is only one table in the HTML file, but "table" also occurs in some JavaScript code: "...dataTables.js...". Could that be the reason for the failure?

Thank you in advance for your help.

EDIT: the HTML text to extract from is something like:

<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>

What I would like to extract is anything between <table class="claroTable"> and </table>

Solution

Here's how you can do it with the JSoup parser:

File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();

Yes, you can also somehow do it with regex, but it will never be this easy.

Update: The main problem with your regex pattern is that you are missing the DOTALL flag:

Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
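
To see why the flag matters, here is a minimal, self-contained sketch. The class name DotallDemo, its helper methods, and the sample string are made up for illustration; only Pattern.DOTALL and the pattern itself come from the line above. By default `.` does not match line terminators, so matches() fails on input that spans several lines:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {

    // Same pattern as above; only the flags argument varies between calls.
    private static final String BODY_REGEX = ".*?<body.*?>(.*?)</body>.*?";

    /** Returns the text captured between the body tags, or null if the whole input does not match. */
    static String extractBody(String html, int flags) {
        Matcher matcher = Pattern.compile(BODY_REGEX, flags).matcher(html);
        return matcher.matches() ? matcher.group(1) : null;
    }

    /** No flags: '.' stops at line terminators. */
    static String withoutDotall(String html) {
        return extractBody(html, 0);
    }

    /** DOTALL: '.' also matches line terminators. */
    static String withDotall(String html) {
        return extractBody(html, Pattern.DOTALL);
    }

    public static void main(String[] args) {
        // Hypothetical multi-line input, not the asker's real file.
        String html = "<html>\n<body>\nhello\n</body>\n</html>";
        System.out.println("without DOTALL matched: " + (withoutDotall(html) != null));
        System.out.println("with DOTALL captured: " + withDotall(html));
    }
}
```

With the sample input, the call without the flag finds no match, while the call with Pattern.DOTALL captures the text between the tags, line breaks included.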

And if you just want the contents of the specified table tag, you can do something like this:

String tableTag = 
    // The trailing .* is greedy so that everything after </table> is consumed and dropped too
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*",Pattern.DOTALL)
           .matcher(html)
           .replaceFirst("$1");

(Updated: now returns the contents of the table tag only, not the table tag itself)
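
As an alternative to replaceFirst, here is a sketch based on Matcher.find(). It is not the code from the answer above: find() searches for the pattern anywhere in the input, so the leading and trailing wildcards are not needed, and a missing table can be reported explicitly. The class name TableExtractor, the method extractTable, and the sample HTML are hypothetical, and the pattern assumes the attribute is written exactly as class="claroTable" and that the table contains no nested table.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableExtractor {

    // [^>]* allows other attributes before or after the class attribute.
    // DOTALL lets (.*?) span the line breaks inside the table.
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*class=\"claroTable\"[^>]*>(.*?)</table>",
            Pattern.DOTALL);

    /** Returns the inner HTML of the first claroTable table, or empty if there is none. */
    static Optional<String> extractTable(String html) {
        Matcher matcher = TABLE.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical input that also mentions dataTables.js, as in the question.
        String html = "<script src=\"dataTables.js\"></script>\n"
                + "<body>\n"
                + "<table class=\"claroTable\">\n"
                + "<tr><td>some data</td></tr>\n"
                + "</table>\n"
                + "</body>";
        System.out.println(extractTable(html).orElse("no matching table"));
    }
}
```

Because the pattern requires a literal `<table`, the string dataTables.js in a script reference cannot match it, so that occurrence should not be what breaks the match.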
