Removing all HTML markup
Question
I have a string that holds a complete XML get request.
In the request, there is a lot of HTML and some custom commands which I would like to remove.
The only way of doing so I know is by using jSoup.
For example, like this.
Now, because the website the request came from also features custom commands, I was not able to completely remove all code.
For example here is a string I would like to 'clean':
\u0027s normal text here\u003c/b\u003e http://a_random_link_here.com\r\n\r\nSome more text here
As you can see, the custom commands all have backslashes in front of them.
How would I go about removing these commands with Java?
If I use a regex, how can I write it so that it removes only the command and nothing after it? (I don't know the length of each command beforehand, and I don't want to hardcode all the commands.)
Recommended answer
See http://regex101.com/r/gJ2yN2
The regex (\\.\d{3,}.*?\s|(\\r|\\n)+)
works to remove the things you were pointing out.
Result (replacing the match with a single space):
normal text here http://a_random_link_here.com Some more text here
If this was not the result you were looking for, please edit your question with the expected result.
EDIT: Explanation of the regex:
() - match everything inside the parentheses (later, the "match" gets replaced with "space")
\\ - an 'escaped' backslash (i.e. an actual backslash; the first one "protects" the second
so it is not interpreted as a special character)
. - any character (I saw 'u', but there might be others)
\d - a digit
{3,} - "at least three"
.*? - any characters, "lazy" (stop as soon as possible)
\s - until you hit a white space
| - or
() - one of these things
\\r - backslash - r (again, with escaped '\')
\\n - backslash - n
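Since the question asks how to do this in Java, here is a minimal sketch of applying the pattern above with java.util.regex. The class name StripEscapes and the method clean are made up for illustration, and the trailing trim() is an addition that drops the space left where the first command was replaced. It assumes the input contains literal backslash sequences, as in the sample string. Note that every backslash in the regex has to be doubled again inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class StripEscapes {
    // The answer's regex as a Java string literal. A single regex
    // backslash is written as two in Java source, so an escaped
    // regex backslash becomes four.
    private static final Pattern COMMANDS =
            Pattern.compile("(\\\\.\\d{3,}.*?\\s|(\\\\r|\\\\n)+)");

    // Replace every match with one space, then trim the ends.
    static String clean(String raw) {
        return COMMANDS.matcher(raw).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The sample from the question, with literal backslashes.
        String raw = "\\u0027s normal text here\\u003c/b\\u003e "
                + "http://a_random_link_here.com\\r\\n\\r\\nSome more text here";
        System.out.println(clean(raw));
    }
}
```

Run on the sample string, this should print the same line as the result shown above. Because the first alternative consumes everything up to the next whitespace, a command that is immediately followed by wanted text (with no space in between) would take that text with it, so the pattern may need tightening for other inputs.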