Help extracting text from html tag with Java and Regex

Views: 169  Posted: 2018/6/23 14:54:42  Tags: java html regex tags


Problem description

I would like to extract some text from an HTML file using regex. I am learning regex and I still have trouble understanding it all. I have some code that extracts all the text between <body> and </body>. Here it is:
public class Harn2 {

    public static void main(String[] args) throws IOException {

        String toMatch = readFile();
        //Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?"); this one works fine
        Pattern pattern = Pattern.compile(".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"); //I want this one to work
        Matcher matcher = pattern.matcher(toMatch);

        if (matcher.matches()) {
            System.out.println(matcher.group(1));
        }

    }

    private static String readFile() {

        try {
            // Open the file that is the first
            // command line parameter
            FileInputStream fstream = new FileInputStream("user.html");
            // Get the object of DataInputStream
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            // Read File Line By Line
            while (br.readLine() != null) {
                // Print the content on the console
                //System.out.println (strLine);
                strLine += br.readLine();
            }
            // Close the input stream
            in.close();
            return strLine;
        } catch (Exception e) { // Catch exception if any

            System.err.println("Error: " + e.getMessage());
            return "";
        }
    }
}
This works fine, but now I would like to extract the text between the tags
<table class="claroTable"> and </table>

So I replaced my regex string with ".*?<table class=\"claroTable\".*?>(.*?)</table>.*?"
I have also tried
".*?<table class=\"claroTable\">(.*?)</table>.*?"
but it doesn't work and I don't understand why. There is only one table in the HTML file, but there is an occurrence of "table" in some JavaScript code: "...dataTables.js...". Could that be the reason for the mistake?

Thank you in advance for helping me,

EDIT: the HTML text to extract from is something like:
<body>
.....
<table class="claroTable">
<td><th>some data and many, many tags </td>
.....
</table>
What I would like to extract is anything between <table class="claroTable"> and </table> 
Solution
Here's how you can do it with the JSoup parser:
File file = new File("path/to/your/file.html");
String charSet = "ISO-8859-1";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();
Yes, you can also somehow do it with regex, but it will never be this easy.

Update: The main problem with your regex pattern is that you are missing the DOTALL flag:
Pattern pattern=Pattern.compile(".*?<body.*?>(.*?)</body>.*?",Pattern.DOTALL);
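To make the effect of the flag concrete, here is a minimal sketch using the same body pattern. The class name DotAllDemo, the helper extractBody and the sample HTML string are assumptions made for illustration; they are not part of the original code. By default '.' does not match line terminators, so matches() fails on multi-line input; with Pattern.DOTALL the same pattern succeeds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {

    // Returns the text captured between <body ...> and </body>,
    // or null when the pattern does not match the whole input.
    static String extractBody(String html, int flags) {
        Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?", flags);
        Matcher matcher = pattern.matcher(html);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical multi-line input standing in for the real file.
        String html = "<html>\n<body>\nHello\n</body>\n</html>";

        // Default flags: '.' stops at '\n', so matches() fails here.
        System.out.println(extractBody(html, 0));

        // DOTALL: '.' also matches line terminators, so the pattern succeeds.
        System.out.println(extractBody(html, Pattern.DOTALL));
    }
}
```

On input that contains no line breaks, both calls behave the same, which would explain a pattern appearing to work without the flag on some inputs.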
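And if you just want the contents of the specified table tag, you can do something like this:
// html holds the full document text; the trailing .* (greedy) consumes
// everything after </table> so that only the captured group remains
String tableTag = 
    Pattern.compile(".*?<table.*?claroTable.*?>(.*?)</table>.*",Pattern.DOTALL)
           .matcher(html)
           .replaceFirst("$1");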
(Updated: now returns the contents of the table tag only, not the table tag itself)
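
As an alternative to replaceFirst, the table contents can be read with Matcher.find() and group(1), which makes the "no match" case easy to detect (replaceFirst returns the input unchanged when nothing matches). The following is only a sketch under stated assumptions: the class name TableExtractor, the helpers readFile and extractTable, and the sample string are hypothetical, and the pattern assumes the class attribute is written exactly as class="claroTable" and that the table contains no nested tables. It also reads the file in one call, because the readFile() in the question calls readLine() twice per loop iteration (discarding every other line), starts from a null string, and drops all line breaks.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableExtractor {

    // DOTALL lets '.' span line breaks; [^>]* keeps the attribute match
    // inside a single opening tag.
    private static final Pattern TABLE = Pattern.compile(
            "<table[^>]*class=\"claroTable\"[^>]*>(.*?)</table>", Pattern.DOTALL);

    // Hypothetical helper: reads the whole file, line breaks included.
    // ISO-8859-1 mirrors the charset used in the JSoup example above.
    static String readFile(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.ISO_8859_1);
    }

    // Returns the inner HTML of the first matching table, or null if none is found.
    static String extractTable(String html) {
        Matcher matcher = TABLE.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Inline sample standing in for the contents of user.html.
        String html = "<body>\n<script src=\"dataTables.js\"></script>\n"
                + "<table class=\"claroTable\">\n<tr><td>some data</td></tr>\n</table>\n</body>";
        System.out.println(extractTable(html));
    }
}
```

Because the pattern requires the literal text <table, the "dataTables.js" reference in a script tag is not matched by it. For anything beyond a single, simple table, an HTML parser such as JSoup remains the more robust choice.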
