Encoding troubles - one format to another
Question
I have a scraper that collects data from a source I have no control over. The source data contains all sorts of interesting Unicode characters, but it converts them to a pretty unhelpful format, so I get
\u00e4
for a small 'a' with an umlaut (sans the double quotes that I think are supposed to be there)*. Of course, this gets rendered in my HTML as plain text.
Is there any realistic way to convert the Unicode source into proper characters that doesn't involve manually working out every single escape sequence and replacing them during the scrape?
*Here is a sample of the JSON that it spits out:
({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n})
Recommended answer
Considering that \u00e4 is the JavaScript representation of a Unicode character, one possibility is to use the PHP function json_decode() to decode it into a PHP string.
A valid JSON string would be:
$json = '"\u00e4"';
This code:
header('Content-type: text/html; charset=UTF-8');
$php = json_decode($json);
var_dump($php);
would give you the right output:
string 'ä' (length=2)
(That is one character, but it is two bytes long in UTF-8.)
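To make the "one character, two bytes" point concrete, here is a minimal sketch comparing byte length and character length. It assumes the mbstring extension is available for mb_strlen(); the variable name $char is only illustrative.

```php
<?php
// Decode the JSON string literal "\u00e4" into a PHP (UTF-8) string.
$char = json_decode('"\u00e4"');

var_dump(strlen($char));             // int(2): strlen() counts bytes
var_dump(mb_strlen($char, 'UTF-8')); // int(1): mb_strlen() counts characters (assumes mbstring)
var_dump(bin2hex($char));            // string(4) "c3a4": the two UTF-8 bytes of "ä"
```

This is why var_dump() reports length=2 for a single visible character.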
Still, it feels a bit hackish ^^
And it might not work too well, depending on the kind of string you get as input: json_decode() expects valid JSON, so a bare \u00e4 that is not inside a quoted JSON string will not decode.
I've just seen your comment, where you seem to indicate that you get JSON as input. If so, json_decode() might really be the right tool for the job ;-)
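If the scraper receives a whole payload like the sample in the question, a rough sketch of decoding it in one go might look like this. The payload below is a hypothetical, well-formed version of that sample (the real response may be shaped differently), and stripping a single pair of wrapping parentheses is an assumption based on how the sample starts and ends.

```php
<?php
// Hypothetical payload modelled on the sample in the question.
$raw = '({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n"}})';

// Assumption: the JSON is wrapped in one pair of parentheses, which is not valid JSON.
$json = preg_replace('/^\s*\(|\)\s*$/', '', $raw);

// Decode into nested associative arrays; \u00e4 and \/ are unescaped for us.
$data = json_decode($json, true);
if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

$html = $data['content']['pagelet_tab_content'];

header('Content-type: text/html; charset=UTF-8');
echo $html; // the span now contains "Dävid" as real UTF-8 text
```

Decoding the whole document once means no per-sequence search and replace is needed; every \uXXXX escape inside the JSON strings is handled by json_decode().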