C++: Remove all HTML formatting from string?

Views: 191 | Posted: 2016/8/23 11:45:07 | Tags: c++ html c decode

Question

I have a string which might include <br> or <span>...</span> tags or other HTML characters/entities. I want a robust way of stripping all that and getting the remaining UTF-8 characters. This should be cross-platform, ideally.

Something like this would be ideal:

http://snipplr.com/view/15261/python-decode-and-strip-html-entites-to-unicode/

but one that also removes tags.

Recommended answer

Just how stringent are your requirements? A simple two-state FSA ought to do. Start in the READCHAR state. Whenever you read a '<' in that state, transition to the READTAG state; otherwise, write the character to your result string. Whenever you're in the READTAG state and read a '>', transition back to the READCHAR state.
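The two-state machine described above could be sketched roughly as follows. The function name strip_tags and the State enum are made up for illustration and do not come from the answer; the sketch also assumes every '<' opens a tag and the next '>' closes it, which is not true for comments, script bodies, or '>' inside attribute values.

```cpp
#include <string>

// Hypothetical helper: copies the input, dropping everything from a '<'
// up to and including the next '>'.
std::string strip_tags(const std::string& input) {
    enum class State { ReadChar, ReadTag };
    State state = State::ReadChar;
    std::string result;
    result.reserve(input.size());

    for (char c : input) {
        switch (state) {
        case State::ReadChar:
            if (c == '<') {
                state = State::ReadTag;   // start of a tag: stop copying
            } else {
                result.push_back(c);      // ordinary byte: keep it
            }
            break;
        case State::ReadTag:
            if (c == '>') {
                state = State::ReadChar;  // end of the tag: resume copying
            }
            break;
        }
    }
    return result;
}
```

Because '<' and '>' are ASCII and never occur inside a multi-byte UTF-8 sequence, working byte by byte leaves the remaining UTF-8 text intact. For example, strip_tags("a<br>b") yields "ab".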

Edit: Oops. I missed the part about entities. You'll need a READENTITY state for that too. When you transition out of it, you could also convert the code into the corresponding UTF-8 character.
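One possible shape for the three-state version is sketched below. All names (strip_html, decode_entity, append_utf8) are hypothetical, and the named-entity list is deliberately tiny; a real implementation would need the full HTML entity table. Unknown or malformed entities are passed through unchanged, which is a design assumption rather than something the answer specifies.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Append the Unicode code point cp to out, encoded as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Decode the text between '&' and ';'. Returns false (and leaves out
// untouched) if the entity is not recognised.
bool decode_entity(const std::string& name, std::string& out) {
    if (name == "amp")  { out += '&';  return true; }
    if (name == "lt")   { out += '<';  return true; }
    if (name == "gt")   { out += '>';  return true; }
    if (name == "quot") { out += '"';  return true; }
    if (name == "apos") { out += '\''; return true; }
    if (name == "nbsp") { append_utf8(out, 0xA0); return true; }
    if (name.size() < 2 || name[0] != '#') return false;

    // Numeric reference: &#NNN; (decimal) or &#xHHH; (hexadecimal).
    const bool hex = (name[1] == 'x' || name[1] == 'X');
    const std::size_t start = hex ? 2 : 1;
    if (start >= name.size()) return false;
    std::uint32_t cp = 0;
    for (std::size_t i = start; i < name.size(); ++i) {
        const unsigned char ch = static_cast<unsigned char>(name[i]);
        std::uint32_t digit;
        if (std::isdigit(ch)) digit = ch - '0';
        else if (hex && std::isxdigit(ch)) digit = std::tolower(ch) - 'a' + 10;
        else return false;
        cp = cp * (hex ? 16u : 10u) + digit;
        if (cp > 0x10FFFF) return false;              // beyond Unicode range
    }
    if (cp == 0 || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    append_utf8(out, cp);
    return true;
}

// Three-state FSA: READCHAR, READTAG, READENTITY.
std::string strip_html(const std::string& input) {
    enum class State { ReadChar, ReadTag, ReadEntity };
    State state = State::ReadChar;
    std::string result, entity;

    for (std::size_t i = 0; i < input.size(); ++i) {
        const char c = input[i];
        switch (state) {
        case State::ReadChar:
            if (c == '<') state = State::ReadTag;
            else if (c == '&') { entity.clear(); state = State::ReadEntity; }
            else result.push_back(c);
            break;
        case State::ReadTag:
            if (c == '>') state = State::ReadChar;
            break;
        case State::ReadEntity:
            if (c == ';') {
                if (!decode_entity(entity, result)) {
                    result += '&'; result += entity; result += ';';
                }
                state = State::ReadChar;
            } else if (std::isalnum(static_cast<unsigned char>(c)) || c == '#') {
                entity.push_back(c);
            } else {
                // Not an entity after all: emit it literally and reprocess
                // the current character in the READCHAR state.
                result += '&'; result += entity;
                state = State::ReadChar;
                --i;
            }
            break;
        }
    }
    if (state == State::ReadEntity) { result += '&'; result += entity; }
    return result;
}
```

With this sketch, strip_html("a &amp; b<br>") gives "a & b", and a bare ampersand such as in "x & y" is left as it is. Like any FSA this simple, it does not understand comments, script or style content, or a '>' inside a quoted attribute value; if the input can contain those, an HTML parser is probably the safer choice.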
