Unable to encode to iso-8859-1 encoding for some chars using Perl Encode module
Problem description
I have an HTML string in ISO-8859-1 encoding. I need to pass this string to HTML::Entities::decode_entities() to convert some of the HTML entities to their respective chars. To do so I am using the module HTML::Parser::Entities 3.65, but after the decode_entities() operation my whole string changes to a utf-8 string. This behavior seems fine according to the HTML::Parser documentation. As I need this string back in ISO-8859-1 format for further processing, I have used Encode::encode("iso-8859-1", $str) to change the string back to ISO-8859-1 encoding. My results are fine except for some chars, where a question mark appears instead. One example is the right single quote entity (&rsquo;).
Can anybody tell me whether the Encode module has a limitation here? Any other pointers that help solve the problem would also be welcome. I am pasting sample text containing the chars that cause the issue:
my $str = "This is a test string to test the encoding of some chars like &rsquo; &ldquo; &rdquo; etc these are failing to encode; some of them which encode correctly are &eacute; &laquo; etc.";
Thanks
Recommended answer
The fundamental problem is that the characters represented by &rsquo;, &ldquo;, and &rdquo; do not exist in ISO-8859-1. You'll have to decide what it is that you want to do with them.
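To see the limitation directly, here is a minimal sketch (the sample text and variable names are made up for illustration). By default Encode::encode() quietly substitutes "?" for anything the target encoding cannot represent; passing FB_CROAK as the third argument makes it die instead, which is handy for spotting the offending characters.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# U+00E9 (e-acute) exists in ISO-8859-1; U+2019 (right single quote) does not.
my $text = "caf\x{E9} \x{2019}";

# Default behaviour: the unmappable character becomes "?" (0x3F).
my $lossy = encode("iso-8859-1", $text);
print unpack("H*", $lossy), "\n";    # hex dump of the resulting bytes

# Strict behaviour: encode() dies on the first unmappable character.
my $copy   = $text;                  # encode() may modify its input when a CHECK value is given
my $strict = eval { encode("iso-8859-1", $copy, FB_CROAK) };
my $error  = $@;
print "encode failed: $error" unless defined $strict;
```

The hex dump ends in 3f, the question mark from the original post, while the é survives as the single byte e9.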
Some possibilities:
Use cp1252, Microsoft's "extended" version of ISO-8859-1, instead of the real thing. It does include those characters.
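A small sketch of the cp1252 route, assuming whatever consumes the bytes afterwards is happy to treat them as cp1252 (in real ISO-8859-1 the bytes 0x80-0x9F are control codes, so this only helps if the downstream side agrees on the encoding). The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The three curly quotes from the question: U+2019, U+201C, U+201D.
my $quotes = "\x{2019} \x{201C} \x{201D}";

# cp1252 maps them to the single bytes 0x92, 0x93 and 0x94.
my $bytes = encode("cp1252", $quotes);
print unpack("H*", $bytes), "\n";    # prints 9220932094

# Decoding with the same encoding gives the original characters back.
my $roundtrip = decode("cp1252", $bytes);
print "round trip ok\n" if $roundtrip eq $quotes;
```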
Re-encode the entities outside the ISO-8859-1 range (plus &), before converting from utf-8 to ISO-8859-1:
my $toEncode = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
$string = HTML::Entities::encode_entities($string, $toEncode);
(The no warnings bit is there because U+10FFFF is a Unicode noncharacter rather than an assigned character, and some Perl versions warn when it appears in a string.)
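Put together, the whole round trip might look something like the sketch below. The input string is invented for the example; the rest uses only decode_entities() and encode_entities() from HTML::Entities and encode() from Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(decode_entities encode_entities);

# Hypothetical input: one entity outside ISO-8859-1, one inside it, and an ampersand.
my $html = "It&rsquo;s a caf&eacute; &amp; bar";

# After decoding, the string holds U+2019, U+00E9 and a bare "&".
my $string = decode_entities($html);

# Turn "&" and everything above U+00FF back into entities.
my $toEncode = do { no warnings 'utf8'; "&\x{0100}-\x{10FFFF}" };
$string = encode_entities($string, $toEncode);

# Everything left fits in ISO-8859-1, so no "?" substitution happens.
my $latin1 = encode("iso-8859-1", $string);
print unpack("H*", $latin1), "\n";   # hex dump, to avoid terminal encoding issues
```

The é ends up as a real ISO-8859-1 byte (0xE9), while the curly quote stays in the text as an entity for whatever processes the string next. Including & in the character class keeps literal ampersands distinguishable from the re-encoded entities.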
There are other possibilities. It really depends on what you're trying to accomplish.