Extracting anchor tags from HTML using Java
Question
I have several anchor tags in a text,
Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a>
Output:
http://stackoverflow.com
How can I find all of those input strings and convert them to the output strings in Java, without using a third-party API?
Recommended answer
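import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorToHref {
    public static void main(String[] args) {
        String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd"
                + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf";
        // Group 1 captures the quoted href value; [^<]* skips the closing '>' and the link text.
        String regex = "<a href=(\"[^\"]*\")[^<]*</a>";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(test);
        // Replace each whole anchor with its captured href (surrounding quotes included).
        System.out.println(m.replaceAll("$1"));
    }
}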
NOTE: All of Andrzej Doyle's points are valid. If your input contains more than simple <a href="X">Y</a> tags,
and you are sure that it is parsable HTML, then you are better off using an HTML parser.
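If "no third-party API" is the real constraint, note that the JDK itself ships an older, lenient HTML parser under javax.swing.text.html. The following is a minimal sketch of that approach, not part of the original answer. It assumes the java.desktop module is available in your runtime; the class name AnchorHrefParser and the method extractHrefs are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href values using the JDK's built-in HTML parser.
public class AnchorHrefParser {

    public static List<String> extractHrefs(String html) throws IOException {
        final List<String> urls = new ArrayList<>();

        // The parser invokes this callback for every opening tag it recognizes.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        urls.add(href.toString());
                    }
                }
            }
        };

        // The third argument tells the parser to ignore any charset declaration in the input.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return urls;
    }

    public static void main(String[] args) throws IOException {
        String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd"
                + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf";
        for (String url : extractHrefs(test)) {
            System.out.println(url);
        }
    }
}
```

Because the parser reads attributes by name, the position of href within the tag should not matter here. Whether this parser copes with your particular markup is something to verify against real input.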
To summarize:
- The regex I posted doesn't work if you have an <a> inside a comment (you can treat that as a special case).
- It doesn't work if the <a> tag has other attributes (again, you can treat that as a special case).
- There are many other cases where a regex won't work, and you cannot cover all of them with a regex, since HTML is not a regular language.
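The "other attributes" case can be loosened somewhat while staying with java.util.regex. The sketch below is an assumption-laden variant, not the code from the answer: the pattern, the class name AnchorHrefRegex and the method extractHrefs are illustrative. It collects the href values into a list with Matcher.find() instead of rewriting the text, and it only handles double-quoted values. An <a> inside a comment is still matched, so the first limitation remains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls href values out of <a ...> tags with a regex.
public class AnchorHrefRegex {

    // "<a", optionally other attributes, whitespace, then href="...".
    // Group 1 is the URL without the surrounding quotes.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractHrefs(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(text);
        // find() advances through the input one match at a time.
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String test = "qazwsx<a class=\"link\" href=\"http://stackoverflow.com\" >Take me to StackOverflow</a>"
                + "fdgfdhgfd<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf";
        for (String url : extractHrefs(test)) {
            System.out.println(url);
        }
    }
}
```

Requiring whitespace directly before href keeps attributes such as data-href from matching, but an earlier attribute value that contains '>' would still defeat the pattern.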
However, if your requirement is always to replace <a href="X">Y</a> with "X" without considering the context, then the code I've posted will work.