So we've got our HTML escape functions that really work in a C++ manner; how do we unescape?
Problem description
Here I've found a great way to HTML encode/escape special chars. Now I wonder how to unescape HTML-encoded text in C++?
So the code base is:
#include <algorithm>
#include <cstddef>
#include <iterator>

namespace xml {

    // Helper for null-terminated ASCII strings (no end of string iterator).
    template<typename InIter, typename OutIter>
    OutIter copy_asciiz ( InIter begin, OutIter out )
    {
        while ( *begin != '\0' ) {
            *out++ = *begin++;
        }
        return (out);
    }

    // XML escaping in its general form. Note that 'out' is expected
    // to be an "infinite" sequence.
    template<typename InIter, typename OutIter>
    OutIter escape ( InIter begin, InIter end, OutIter out )
    {
        static const char bad[] = "&<>";
        static const char* rep[] = {"&amp;", "&lt;", "&gt;"};
        // Exclude the terminating '\0' of 'bad', so that a NUL character
        // in the input is never looked up in 'rep'.
        static const std::size_t n = sizeof(bad)/sizeof(bad[0]) - 1;
        for ( ; (begin != end); ++begin )
        {
            // Find which replacement to use.
            const std::size_t i =
                std::distance(bad, std::find(bad, bad+n, *begin));
            // No need for escaping.
            if ( i == n ) {
                *out++ = *begin;
            }
            // Escape the character.
            else {
                out = copy_asciiz(rep[i], out);
            }
        }
        return (out);
    }

}
and
#include <istream>
#include <iterator>
#include <ostream>
#include <string>

namespace xml {

    // Get escaped version of "content".
    std::string escape ( const std::string& content )
    {
        std::string result;
        result.reserve(content.size());
        escape(content.begin(), content.end(), std::back_inserter(result));
        return (result);
    }

    // Escape data on the fly, using "constant" memory.
    void escape ( std::istream& in, std::ostream& out )
    {
        escape(std::istreambuf_iterator<char>(in),
               std::istreambuf_iterator<char>(),
               std::ostreambuf_iterator<char>(out));
    }

}
It works for:
#include <iostream>

int main ( int, char ** )
{
    std::cout << xml::escape("<foo>bar & qux</foo>") << std::endl;
}
So I wonder: how can I unescape HTML in the same manner?
Recommended answer
Take a look at how I've solved a similar problem for '&#(\d+);' strings, i.e., numeric character references (NCRs), using boost::spirit, boost::regex_token_iterator, Flex, and Perl.
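To make the NCR idea concrete, here is a minimal sketch that decodes decimal references such as '&#65;' into UTF-8. It swaps in std::regex from the standard library rather than any of the tools named above, and the names decode_ncr and append_utf8 are hypothetical, chosen only for this illustration. It assumes the output should be UTF-8 and leaves invalid references (zero, surrogates, values above 0x10FFFF) untouched.

```cpp
#include <regex>
#include <string>

namespace xml {

    // Hypothetical helper: append the UTF-8 encoding of code point 'cp'.
    // Assumes the caller has already checked that 'cp' is a valid scalar value.
    inline void append_utf8 ( unsigned long cp, std::string& out )
    {
        if ( cp < 0x80 ) {
            out += static_cast<char>(cp);
        } else if ( cp < 0x800 ) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if ( cp < 0x10000 ) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // Hypothetical function: replace decimal NCRs like "&#65;" with UTF-8 text.
    inline std::string decode_ncr ( const std::string& content )
    {
        // At most 7 digits, so std::stoul cannot overflow.
        static const std::regex ncr("&#([0-9]{1,7});");
        std::string result;
        result.reserve(content.size());

        std::string::const_iterator last = content.begin();
        const std::sregex_iterator end;
        for ( std::sregex_iterator it(content.begin(), content.end(), ncr);
              it != end; ++it )
        {
            const std::smatch& m = *it;
            // Copy the text between the previous match and this one.
            result.append(last, m[0].first);
            const unsigned long cp = std::stoul(m[1].str());
            const bool valid = cp != 0 && cp <= 0x10FFFF
                            && !(cp >= 0xD800 && cp <= 0xDFFF);
            if ( valid ) {
                append_utf8(cp, result);
            } else {
                // Leave references that do not name a valid code point as-is.
                result.append(m[0].first, m[0].second);
            }
            last = m[0].second;
        }
        result.append(last, content.end());
        return (result);
    }

}
```

With this sketch, xml::decode_ncr("x &#38; y") yields "x & y". Hexadecimal references of the form '&#x41;' are not handled here.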
In your case the regex is &(amp|lt|gt); if you don't need to convert all HTML entities.
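As a rough illustration of that regex, the sketch below reverses the three replacements made by xml::escape. It uses std::regex from the standard library (an assumption on my part; the answer does not say which regex engine to use), and the function name unescape is hypothetical. The input is scanned in a single pass, so text such as '&amp;lt;' becomes '&lt;' and is not unescaped a second time.

```cpp
#include <regex>
#include <string>

namespace xml {

    // Hypothetical counterpart to xml::escape: replaces &amp;, &lt; and &gt;
    // with the characters they stand for. Everything else is copied unchanged.
    inline std::string unescape ( const std::string& content )
    {
        static const std::regex entity("&(amp|lt|gt);");
        std::string result;
        result.reserve(content.size());

        std::string::const_iterator last = content.begin();
        const std::sregex_iterator end;
        for ( std::sregex_iterator it(content.begin(), content.end(), entity);
              it != end; ++it )
        {
            const std::smatch& m = *it;
            // Copy the text between the previous match and this one.
            result.append(last, m[0].first);
            // Map the entity name back to its character.
            const std::string name = m[1].str();
            if ( name == "amp" ) {
                result += '&';
            } else if ( name == "lt" ) {
                result += '<';
            } else {
                result += '>';
            }
            last = m[0].second;
        }
        // Copy whatever follows the last match.
        result.append(last, content.end());
        return (result);
    }

}
```

For example, xml::unescape("&lt;foo&gt;bar &amp; qux&lt;/foo&gt;") gives back "<foo>bar & qux</foo>". Entities outside the three listed, such as '&quot;', pass through unchanged, which matches the "if you don't need to convert all HTML entities" caveat.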