Encoding troubles - one format to another

Problem description

I have a scraper that collects data from a source I have no control over. The source data contains all sorts of interesting Unicode characters, but it converts them to a pretty unhelpful format, so I get

\u00e4

for a small 'a' with an umlaut (sans the double quotes that I think are supposed to be there)*. Of course, this gets rendered in my HTML as plain text.

Is there any realistic way to convert the Unicode escapes in the source into proper characters that doesn't involve me manually working through every single escape sequence and replacing it during the scrape?

*Here is a sample of the JSON that it spits out:

({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n})

Recommended answer

Considering that \u00e4 is the JavaScript representation of a Unicode character, one possibility is to use PHP's json_decode() function to decode it into a PHP string...

The valid JSON string would be:

$json = '"\u00e4"';

And this:

header('Content-type: text/html; charset=UTF-8');
$php = json_decode($json);
var_dump($php);

would give you the right output:

string 'ä' (length=2)

(That is one character, but it is two bytes long.)
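
To make the "one character, two bytes" point concrete, here is a minimal sketch that compares the byte count with the character count. It assumes the mbstring extension is available for mb_strlen(), which the answer above does not mention.

```php
<?php
// Decode the JSON string literal "\u00e4" into a UTF-8 PHP string.
$decoded = json_decode('"\u00e4"');

// strlen() counts bytes; UTF-8 encodes U+00E4 as the two bytes C3 A4.
var_dump(strlen($decoded));              // int(2)

// mb_strlen() counts characters (assumes the mbstring extension is loaded).
var_dump(mb_strlen($decoded, 'UTF-8'));  // int(1)

// The raw bytes, shown as hexadecimal.
var_dump(bin2hex($decoded));             // string(4) "c3a4"
```

This is why the page should be served as UTF-8, as the header() call above does: the browser has to read those two bytes as a single character.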


Still, it feels a bit hackish ^^
And it might not work too well, depending on the kind of string you get as input...
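
If the input turns out to be a plain text fragment rather than valid JSON, one possible workaround is to decode only the \uXXXX runs and leave everything else alone. The sketch below is an assumption on my part, not something from the question: decodeUnicodeEscapes() is a hypothetical helper name, and it reuses json_decode() on each run of escapes so that surrogate pairs are also handled.

```php
<?php
/**
 * Hypothetical helper: replace runs of \uXXXX escapes with UTF-8 characters.
 * Each run is wrapped in double quotes and passed to json_decode(), so
 * surrogate pairs such as \ud83d\ude00 decode to a single character.
 */
function decodeUnicodeEscapes(string $text): string
{
    return preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $match): string {
            $decoded = json_decode('"' . $match[0] . '"');
            // Leave the run unchanged if it cannot be decoded (e.g. a lone surrogate).
            return is_string($decoded) ? $decoded : $match[0];
        },
        $text
    );
}

echo decodeUnicodeEscapes('Latest post by D\u00e4vid'), "\n";
```

Note that this treats every \uXXXX it finds as an escape, so a source that really contains an escaped backslash followed by "u00e4" would be decoded incorrectly. When the input is real JSON, decoding the whole document is the safer route.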

I've just seen your comment, where you seem to indicate that you get JSON as input? If so, json_decode() might really be the right tool for the job ;-)
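
For the JSON case, a rough sketch of decoding the whole response could look like the following. The sample in the question is cut off, so the payload below is a hypothetical, well-formed completion of it (closing quote, braces and parenthesis added by me), and stripping the wrapping parentheses is an assumption about what the source sends.

```php
<?php
// Hypothetical, completed version of the truncated sample from the question.
$raw = <<<'JSON'
({"content":{"pagelet_tab_content":"<div class=\"post_user\">Latest post by <span>D\u00e4vid<\/span><\/div>\n"}})
JSON;

// Assumption: the response is JSON wrapped in one pair of parentheses.
$json = trim($raw);
if (substr($json, 0, 1) === '(' && substr($json, -1) === ')') {
    $json = substr($json, 1, -1);
}

// Decode to an associative array; \u00e4, \" and \/ are all unescaped here.
$data = json_decode($json, true);
if (!is_array($data)) {
    throw new RuntimeException('Could not decode JSON: ' . json_last_error_msg());
}

// Send a UTF-8 Content-Type header before echoing this in a web page.
$html = $data['content']['pagelet_tab_content'];
echo $html;
```

Decoding the whole document also takes care of the escaped quotes and slashes in the HTML fragment, so no per-character replacement is needed during the scrape.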
