Removing all HTML markup

Question

I have a string that holds a complete XML get request.

In the request, there is a lot of HTML and some custom commands which I would like to remove.

The only way I know of doing this is with jSoup.

For example, like this.

Now, because the website the request came from also features custom commands, I was not able to completely remove all code.

For example here is a string I would like to 'clean':

\u0027s normal text here\u003c/b\u003e http://a_random_link_here.com\r\n\r\nSome more text here

As you can see, the custom commands all have backslashes in front of them.

How would I go about removing these commands with Java?

If I use a regex, how can I write it so that it removes only the command and nothing after it? (I don't know the length of each command in advance, and I don't want to hardcode every command.)

Recommended answer

See http://regex101.com/r/gJ2yN2

The regex (\\.\d{3,}.*?\s|(\\r|\\n)+) works to remove the things you were pointing out.

Result (replacing the match with a single space):

normal text here http://a_random_link_here.com Some more text here
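
In Java the pattern has to be escaped a second time as a string literal, so every backslash in the regex is doubled. A minimal sketch, assuming the input holds literal backslash sequences (a backslash followed by `u0027`, not an already-decoded apostrophe); the names `MarkupCleaner` and `clean` are made up for illustration:

```java
import java.util.regex.Pattern;

public class MarkupCleaner {
    // The regex from the answer, written as a Java string literal:
    // each regex backslash becomes two backslashes in source code.
    private static final Pattern COMMANDS =
            Pattern.compile("(\\\\.\\d{3,}.*?\\s|(\\\\r|\\\\n)+)");

    // Replaces every match with a single space, as the answer does.
    static String clean(String raw) {
        return COMMANDS.matcher(raw).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The sample string from the question, with literal backslashes.
        String raw = "\\u0027s normal text here\\u003c/b\\u003e "
                + "http://a_random_link_here.com\\r\\n\\r\\nSome more text here";
        System.out.println(clean(raw));
    }
}
```

The first match sits at the very start of the sample, so the cleaned string begins with a space; calling `trim()` on the result drops it if that matters.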

If this was not the result you were looking for, please edit your question with the expected result.

EDIT Explanation of the regex:

()  - match everything inside the parentheses (later, the "match" gets replaced with "space")
\\  - an 'escaped' backslash (i.e. an actual backslash; the first one "protects" the second
      so it is not interpreted as a special character)
.   - any character (I saw 'u', but there might be others)
\d  - a digit
{3,} - "at least three"
.*? - any characters, "lazy" (stop as soon as possible)
\s  - until you hit a white space
|   - or
()  - one of these things
\\r - backslash - r (again, with escaped '\')
\\n - backslash - n
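
To see how the pieces above combine, it can help to list what the pattern actually matches before replacing anything. The sketch below is only an illustration (the class `MatchLister` and the second test string are assumptions, not part of the original answer); it collects each match with `Matcher.find()`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchLister {
    private static final Pattern COMMANDS =
            Pattern.compile("(\\\\.\\d{3,}.*?\\s|(\\\\r|\\\\n)+)");

    // Collects every substring the pattern would remove, in order.
    static List<String> matches(String raw) {
        List<String> found = new ArrayList<>();
        Matcher m = COMMANDS.matcher(raw);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "\\u0027s normal text here\\u003c/b\\u003e "
                + "http://a_random_link_here.com\\r\\n\\r\\nSome more text here";
        // Brackets make the trailing whitespace of each match visible.
        for (String s : matches(sample)) {
            System.out.println("[" + s + "]");
        }

        // Hypothetical input where text follows a command with no space:
        // the lazy part keeps consuming until the next whitespace.
        String glued = "\\u003cb\\u003eHello world";
        System.out.println(matches(glued));
    }
}
```

On the sample string this yields three matches: the first command with its trailing `s` and space, the closing-tag sequence with its space, and the run of `\r\n\r\n`. The second input shows a limitation of the `.*?\s` part: when a command is not followed by whitespace, the match extends to the next whitespace, so `Hello` is removed along with the command.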
