Why does Perl's LWP give me a different encoding than the original website?

Question

Let's say I have this code:

use strict;
use warnings;
use LWP::Simple qw(get);   # get() is exported by LWP::Simple, not by LWP itself

my $content = get("http://www.msn.co.il");

print STDERR $content;

The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94", which I'm guessing is UTF-16?

The website declares its encoding with:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">

So why do these characters appear instead of the windows-1255 characters?

Another weird thing is that I have two servers:

The first server returns CP1255 characters, which I can simply convert to UTF-8. The current server gives me these characters, and I can't do anything with them.

Is there a configuration file in Apache, Perl, or a module that is messing up the encoding, or forcing something?

On the second server, the Perl file and the headers of my website are all UTF-8. When I write text that isn't English, the content from the example above displays correctly (even though it is those weird UTF characters), but my own static text looks like "×ס'××ר××:"

One more thing that I tested:

Via Perl:

my $content = `curl "http://www.anglo-saxon.co.il"`;    

I get UTF-8 encoding.

Via Bash:

curl "http://www.anglo-saxon.co.il"

Here I get CP1255 (Windows-1255) encoding.

Also, when I run the script in Bash it gives CP1255, and when I run it through the web it's UTF-8 again.

I fixed the problem by converting the content from UTF-8 to what it is supposed to be, and then back to UTF-8:

use Text::Iconv;

# Round trip: UTF-8 -> CP1255, then CP1255 -> UTF-8
my $to_cp1255 = Text::Iconv->new("utf8", "CP1255");
$content = $to_cp1255->convert($content);

my $to_utf8 = Text::Iconv->new("CP1255", "utf8");
$content = $to_utf8->convert($content);

Recommended answer

The string of hex values that you gave appears to be UTF-8. You are getting this because Perl "likes to" use UTF-8 when it deals with strings. LWP::Simple's get() automatically decodes the content from the server, which includes undoing any Content-Encoding as well as converting the character encoding to UTF-8.
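One way to check the UTF-8 claim is to decode the logged bytes strictly and look at the resulting code points. The following is a minimal sketch using the core Encode module; the variable names are illustrative and not from the original post.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The byte sequence from the error log in the question.
my $logged_bytes = "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94";

# Strict decode: dies if the bytes are not well-formed UTF-8.
# LEAVE_SRC keeps $logged_bytes unmodified.
my $chars = decode('UTF-8', $logged_bytes,
                   Encode::FB_CROAK | Encode::LEAVE_SRC);

# List the code points as ASCII so the output is safe on any terminal.
my $code_points = join ' ', map { sprintf 'U+%04X', ord } split //, $chars;

print length($logged_bytes), " bytes -> ", length($chars), " characters\n";
print "$code_points\n";
# prints:
# 12 bytes -> 6 characters
# U+05DC U+05D4 U+05D3 U+05E4 U+05E1 U+05D4
```

All six code points fall in the Hebrew block (U+0590 to U+05FF), which is consistent with the bytes being UTF-8-encoded Hebrew text rather than UTF-16.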

You could dig into the internals and get a version that does not change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like:

use Encode qw(decode encode);

# $utf8_bytes holds the UTF-8 encoded content obtained earlier
my $cp1255_bytes = encode('CP1255', decode('UTF-8', $utf8_bytes));
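The other route mentioned above, keeping the server's original bytes, might look like the sketch below. It assumes the HTTP::Message distribution (installed alongside libwww-perl) is available. To stay runnable offline, it builds an HTTP::Response by hand as a stand-in for what LWP::UserAgent's get would return; the body bytes and header value are made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Stand-in for: my $response = LWP::UserAgent->new->get($url);
# The body is a hypothetical CP1255-encoded Hebrew word.
my $cp1255_body = "\xec\xe4\xe3\xf4\xf1\xe4";
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=windows-1255' ],
    $cp1255_body,
);

# charset => 'none' still undoes any Content-Encoding (gzip etc.)
# but leaves the character encoding alone, so these are CP1255 bytes.
my $raw_bytes = $response->decoded_content(charset => 'none');

# The default call also decodes the charset, yielding Perl characters.
my $characters = $response->decoded_content;

printf "raw: %d bytes, decoded: %d characters, first is U+%04X\n",
    length($raw_bytes), length($characters), ord($characters);
```

Which of the two forms is more convenient depends on what the rest of the program expects: raw bytes if the page is passed through unchanged, characters if the text is processed or mixed with other strings.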

The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255-encoded characters into it. You need to either label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters, and convert between them appropriately.
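A minimal sketch of the "one encoding per stream" rule follows: every input is decoded to characters first, and a single encoding layer on the output handle does the encoding. The in-memory filehandle stands in for STDOUT or a web response, and the sample strings are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Static text as characters. Written with \x{...} escapes here;
# literal Hebrew in a UTF-8 source file would need "use utf8".
my $static_text = "\x{05E9}\x{05DC}\x{05D5}\x{05DD}";

# Fetched content arrives as CP1255 bytes (hypothetical sample)
# and is decoded to characters before being mixed with other text.
my $fetched_bytes = "\xf9\xec\xe5\xed";
my $fetched_text  = decode('CP1255', $fetched_bytes);

# One output stream, one label: the layer encodes everything as UTF-8.
# For real output this would be: binmode(STDOUT, ':encoding(UTF-8)');
open my $out, '>:encoding(UTF-8)', \my $buffer
    or die "cannot open in-memory handle: $!";
print {$out} $static_text, ' ', $fetched_text;
close $out;

# Both words were the same characters, so they yield the same bytes.
printf "%s\n", join ' ', map { sprintf '%02x', ord } split //, $buffer;
```

Because both strings are characters by the time they are printed, the stream contains only UTF-8, regardless of where each piece of text originally came from.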
