Extracting anchor tags from HTML using Java

Question

I have several anchor tags in a text:

Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a>

Output: http://stackoverflow.com

How can I find all of those input strings and convert them to the output strings in Java, without using a third-party API?

Recommended answer

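import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorExtractor {
    public static void main(String[] args) {
        String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd"
                + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf";

        // Group 1 captures the href value including its quotes;
        // [^<]* then skips the rest of the tag and the link text up to </a>.
        String regex = "<a href=(\"[^\"]*\")[^<]*</a>";

        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(test);

        // Replace each whole anchor element with its captured, still quoted, URL.
        System.out.println(m.replaceAll("$1"));
    }
}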
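If the goal is to collect the URLs themselves rather than print the rewritten text, the same pattern can be driven with `Matcher.find()`. The following is a minimal sketch and not part of the original answer: the class name `AnchorLinks` and the method `extractHrefs` are hypothetical, and the capture group is moved inside the quotes so that the returned values are unquoted, as in the question's expected output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not from the original answer.
final class AnchorLinks {
    // Same shape as the answer's regex, but group 1 excludes the quotes.
    private static final Pattern ANCHOR =
            Pattern.compile("<a href=\"([^\"]*)\"[^<]*</a>");

    // Returns the href value of every anchor the pattern matches, in order.
    static List<String> extractHrefs(String text) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANCHOR.matcher(text);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd"
                + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf";
        for (String href : extractHrefs(test)) {
            System.out.println(href);
        }
    }
}
```

This variant has the same restrictions as the pattern it is based on: `href` must be the first attribute and must be double-quoted.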

NOTE: All of Andrzej Doyle's points are valid. If your input contains more than simple <a href="X">Y</a> anchors, and you are sure that it is parsable HTML, then you are better off with an HTML parser.

To summarize:

  1. The regex I posted doesn't work if you have an <a> inside a comment. (You can treat it as a special case.)
  2. It doesn't work if you have other attributes in the <a> tag. (Again, you can treat it as a special case.)
  3. There are many other cases where the regex won't work, and you cannot cover all of them with a regex, since HTML is not a regular language.
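The first two points can be seen in a small experiment. This is only a sketch under stated assumptions: the class `AnchorRegexLimits`, the `TOLERANT` pattern, and the sample URL `http://hidden.example` are hypothetical additions, and `STRICT` is the answer's pattern with the capture group moved inside the quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not from the original answer.
final class AnchorRegexLimits {
    // The answer's pattern shape: href must directly follow "<a ".
    static final Pattern STRICT =
            Pattern.compile("<a href=\"([^\"]*)\"[^<]*</a>");

    // Assumed looser variant: allows other attributes before and after href
    // and matches only the opening tag.
    static final Pattern TOLERANT = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Collects group 1 of every match of the given pattern.
    static List<String> hrefs(Pattern pattern, String html) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a class=\"nav\" href=\"http://stackoverflow.com\">SO</a>"
                + "<!-- <a href=\"http://hidden.example\">old</a> -->";
        System.out.println(hrefs(STRICT, html));
        System.out.println(hrefs(TOLERANT, html));
    }
}
```

On this input the strict pattern misses the first anchor because of the `class` attribute, and both patterns still report the commented-out anchor. Loosening the regex handles one special case at a time but does not make it an HTML parser.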

However, if your requirement is always to replace <a href="X">Y</a> with "X" without considering the context, then the code I've posted will work.
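For input where context does matter, one option that still avoids third-party libraries is the HTML parser bundled with the JDK's Swing packages. The sketch below is not from the original answer; the class `ParsedAnchorLinks` and its method `extractHrefs` are hypothetical names, and it assumes a JDK that includes the `java.desktop` module. That parser targets an old HTML dialect, so treat it as a starting point rather than a complete solution.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not from the original answer.
final class ParsedAnchorLinks {
    // Parses the text as HTML and returns the href of every <a> start tag.
    static List<String> extractHrefs(String html) {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try {
            // The last argument tells the parser to ignore any declared charset.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String test = "qazwsx<a class=\"nav\" href=\"http://stackoverflow.com\" target=\"_blank\">"
                + "Take me to StackOverflow</a>dcgdf";
        System.out.println(extractHrefs(test));
    }
}
```

Because the parser reads attributes by name, the position of `href` within the tag no longer matters.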
