Unable to encode to ISO-8859-1 for some chars using the Perl Encode module


Question

I have an HTML string in ISO-8859-1 encoding. I need to pass this string to HTML::Entities::decode_entities() to convert some HTML entities to their respective characters. To do so I am using the HTML::Entities module from HTML::Parser 3.65, but after the decode_entities() operation my whole string changes to a utf-8 string. This behavior seems consistent with the HTML::Parser documentation. As I need this string back in ISO-8859-1 for further processing, I have used Encode::encode("iso-8859-1",$str) to change the string back to ISO-8859-1 encoding. My results are fine except for some chars, where a question mark comes out instead. One example is the single quote entity &rsquo; (’).

Can anybody tell me whether the Encode module has a limitation here? Any other pointers would also help to solve the problem. I am pasting sample text containing the chars that cause the issue:

my $str = "This is a test string to test the encoding of some chars like ’ “ ” etc these are failing to encode; some of them which encode correctly are é « etc.";

Thanks

Recommended answer

The fundamental problem is that the characters represented by &rsquo;, &ldquo;, and &rdquo; (’, “, and ”) do not exist in ISO-8859-1, so Encode substitutes a question mark for each of them. You'll have to decide what you want to do with them.
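
To make the failure visible, here is a minimal sketch (the sample string is made up for illustration). It shows the default substitution and then uses Encode's FB_CROAK check value so that an unmappable character raises an error instead of silently turning into "?".

```perl
use strict;
use warnings;
use Encode ();

# U+2019 RIGHT SINGLE QUOTATION MARK, the character &rsquo; decodes to.
# Illustrative sample string, not taken from the question.
my $str = "it\x{2019}s";

# Default behaviour: characters with no ISO-8859-1 mapping become "?".
my $lossy = Encode::encode("iso-8859-1", $str);

# With FB_CROAK, encode() dies on the first unmappable character.
# Encode a copy, because encode() may modify its source argument
# when a CHECK value is given.
my $copy = $str;
my $ok = eval { Encode::encode("iso-8859-1", $copy, Encode::FB_CROAK()); 1 };

print "lossy:  $lossy\n";
print "strict: ", ($ok ? "encoded cleanly\n" : "failed: $@");
```

Running the strict variant first is one way to find out which characters need special handling before choosing among the options that follow.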

Some possibilities:

Use cp1252, Microsoft's "extended" version of ISO-8859-1, instead of the real thing. It does include those characters.

Re-encode the entities outside the ISO-8859-1 range (plus &), before converting from utf-8 to ISO-8859-1:

use HTML::Entities ();
# Characters to turn back into entities: "&" plus everything above U+00FF.
my $toEncode = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
$string = HTML::Entities::encode_entities($string, $toEncode);

(The no warnings bit is there because U+10FFFF is a Unicode noncharacter rather than an assigned character, and some Perl versions warn about it.)
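
As a rough end-to-end sketch of both possibilities, the snippet below decodes a small sample string and then encodes it each way. The input string and the variable names are assumptions made for illustration; only decode_entities(), encode_entities() and Encode::encode() come from the discussion above.

```perl
use strict;
use warnings;
use Encode ();
use HTML::Entities ();

# Hypothetical sample input containing entities.
my $html = "caf&eacute; &rsquo;quoted&rsquo;";

# decode_entities() returns a Perl character string.
my $text = HTML::Entities::decode_entities($html);

# Possibility 1: cp1252 has a slot for U+2019 (byte 0x92),
# so nothing is lost, but the result is no longer strict ISO-8859-1.
my $cp1252 = Encode::encode("cp1252", $text);

# Possibility 2: turn "&" and every character above U+00FF back into
# an entity, so only ISO-8859-1 characters are left to encode.
my $unsafe = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
my $latin1 = Encode::encode("iso-8859-1",
    HTML::Entities::encode_entities($text, $unsafe));

printf "cp1252: %vX\n", $cp1252;   # bytes as dot-separated hex
print  "latin1: $latin1\n";
```

The first variant suits consumers that really expect Windows-1252; the second keeps the output valid ISO-8859-1 at the cost of leaving some entities undecoded.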

There are other possibilities. It really depends on what you're trying to accomplish.
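
One such possibility, offered here as a sketch rather than as part of the original answer, is Encode's FB_HTMLCREF check value, which writes a decimal character reference for every character the target encoding cannot represent. The sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Illustrative character string: U+2019 has no ISO-8859-1 mapping,
# U+00E9 does.
my $text = "it\x{2019}s caf\x{e9}";

# FB_HTMLCREF replaces each unmappable character with a decimal
# character reference instead of "?". Encode a copy, because a CHECK
# value lets encode() modify its source argument.
my $copy  = $text;
my $bytes = Encode::encode("iso-8859-1", $copy, Encode::FB_HTMLCREF());

print "$bytes\n";
```

Unlike the encode_entities() approach, this does not escape a literal "&" already present in the text, so it fits best when the output will be treated as HTML anyway.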
