Perl - read file with encoding method?


Problem description


    I'm not too good when it comes to encoding, and I want to figure out how to return data in the same encoding it started with.

    I have a file containing characters such as '»'. By the time I have edited it and inserted it into the database, they have turned into Â&raquo.

    decode_entities() does nothing, and encode_entities encodes the characters again. So I created my own sub to fix that, but it appears that the data read from the file is not being retrieved in the right format.

    my $file = "c:/perlscripts/" . md5_hex($md5Con) . "-code.php";
    {
        local( $/ ); # undefine the record separator
        open FILE, "<", $file or die "Cannot open:$!\n";
        my $fileContents = unicodeConvert(<FILE>);
        ...
        .. 
    

    Is there not an encoding option like:

    my $file = "c:/perlscripts/" . md5_hex($md5Con) . "-code.php";
    {
        local( $/ ); # undefine the record separator
        open FILE, "<", $file or die "Cannot open:$!\n", "UTF-8";
        my $fileContents = unicodeConvert(<FILE>);
        ...
        .. 
    

    And my sub is:

    sub unicodeConvert($) {
       my $str = shift;
        my %entityRef = ("&" => "&amp;", '¢' => "&cent;", '¤' => "&curren;", '¦' => "&brvbar;", '¨' => "&uml;", 'ª' => "&ordf;", '¬' => "&not;", '®' => "&reg;", '°' => "&deg;", '²' => "&sup2;", '´' => "&acute;", '¶' => "&para;", '¸' => "&cedil;", 'º' => "&ordm;", '¼' => "&frac14;", '¾' => "&frac34;", 'À' => "&Agrave;", 'Â' => "&Acirc;", 'Ä' => "&Auml;", 'Æ' => "&AElig;", 'È' => "&Egrave;", 'Ê' => "&Ecirc;", 'Ì' => "&Igrave;", 'Î' => "&Icirc;", 'Ð' => "&ETH;", 'Ò' => "&Ograve;", 'Ô' => "&Ocirc;", 'Ö' => "&Ouml;", 'Ø' => "&Oslash;", 'Ú' => "&Uacute;", 'Ü' => "&Uuml;", 'Þ' => "&THORN;", 'à' => "&agrave;", 'â' => "&acirc;", 'ä' => "&auml;", 'æ' => "&aelig;", 'è' => "&egrave;", 'ê' => "&ecirc;", 'ì' => "&igrave;", 'î' => "&icirc;", 'ð' => "&eth;", 'ò' => "&ograve;", 'ô' => "&ocirc;", 'ö' => "&ouml;", 'ø' => "&oslash;", 'ú' => "&uacute;", 'ü' => "&uuml;", 'þ' => "&thorn;", '¡' => "&iexcl;", '£' => "&pound;", '¥' => "&yen;", '§' => "&sect;", '©' => "&copy;", '«' => "&laquo;", '¯' => "&macr;", '±' => "&plusmn;", '³' => "&sup3;", 'µ' => "&micro;", '·' => "&middot;", '¹' => "&sup1;", '»' => "&raquo;", '½' => "&frac12;", '¿' => "&iquest;", 'Á' => "&Aacute;", 'Ã' => "&Atilde;", 'Å' => "&Aring;", 'Ç' => "&Ccedil;", 'É' => "&Eacute;", 'Ë' => "&Euml;", 'Í' => "&Iacute;", 'Ï' => "&Iuml;", 'Ñ' => "&Ntilde;", 'Ó' => "&Oacute;", 'Õ' => "&Otilde;", '×' => "&times;", 'Ù' => "&Ugrave;", 'Û' => "&Ucirc;", 'Ý' => "&Yacute;", 'ß' => "&szlig;", 'á' => "&aacute;", 'ã' => "&atilde;", 'å' => "&aring;", 'ç' => "&ccedil;", 'é' => "&eacute;", 'ë' => "&euml;", 'í' => "&iacute;", 'ï' => "&iuml;", 'ñ' => "&ntilde;", 'ó' => "&oacute;", 'õ' => "&otilde;", '÷' => "&divide;", 'ù' => "&ugrave;", 'û' => "&ucirc;", 'ý' => "&yacute;", 'ÿ' => "&yuml;");
        while( ( my $key, my $obj ) = each( %entityRef ) ) {
            if( $key ne '&' ) {
                    $str =~ s/$key/$obj/gis
            } else {
                    $str =~ s#&((?!(quot;)|(amp;)|(cent;)|(curren;)|(brvbar;)|(uml;)|(ordf;)|(not;)|(reg;)|(deg;)|(sup2;)|(acute;)|(para;)|(cedil;)|(ordm;)|(frac14;)|(frac34;)|(Agrave;)|(Acirc;)|(Auml;)|(AElig;)|(Egrave;)|(Ecirc;)|(Igrave;)|(Icirc;)|(ETH;)|(Ograve;)|(Ocirc;)|(Ouml;)|(Oslash;)|(Uacute;)|(Uuml;)|(THORN;)|(agrave;)|(acirc;)|(auml;)|(aelig;)|(egrave;)|(ecirc;)|(igrave;)|(icirc;)|(eth;)|(ograve;)|(ocirc;)|(ouml;)|(oslash;)|(uacute;)|(uuml;)|(thorn;)|(iexcl;)|(pound;)|(yen;)|(sect;)|(copy;)|(laquo;)|(macr;)|(plusmn;)|(sup3;)|(micro;)|(middot;)|(sup1;)|(raquo;)|(frac12;)|(iquest;)|(Aacute;)|(Atilde;)|(Aring;)|(Ccedil;)|(Eacute;)|(Euml;)|(Iacute;)|(Iuml;)|(Ntilde;)|(Oacute;)|(Otilde;)|(times;)|(Ugrave;)|(Ucirc;)|(Yacute;)|(szlig;)|(aacute;)|(atilde;)|(aring;)|(ccedil;)|(eacute;)|(euml;)|(iacute;)|(iuml;)|(ntilde;)|(oacute;)|(otilde;)|(divide;)|(ugrave;)|(ucirc;)|(yacute;)|(yuml;)|(nbsp;)))#$obj#gis;   
            }
        }
        return $str;
    }
    

    Solution

    As noted in the comment on your question, I'm unsure what exactly you're asking.

    So I'm assuming you're trying to convert Unicode characters into HTML entities. In which case, using one of the pre-made modules should be better. If that is not working due to encoding problems (which are quite tricky in Perl), then the answer to your question:

    Is there not an encoding option like

    open FILE, "<", $file or die "Cannot open:$!\n", "UTF-8";
    

    ... will probably solve it, and would probably make your own attempt work as well, though it is better to use a ready-made module ;-) (By the way, as you wrote it, "UTF-8" is passed as an extra argument to die rather than to open, which made it a little hard to understand what you were asking ;-)

    Yes, there is a UTF-8 option, assuming you have a recent perl (>= v5.8):

    open(my $fh,'<:encoding(UTF-8)', $file) or die "Error opening $file: $!";
    

    (example adapted from perluniintro)
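
A minimal sketch of how that open line could fit into the slurp-style read from the question. The helper name slurp_utf8 and the temporary file are assumptions made for illustration and are not part of the original post; only core modules are used.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the original post): read a whole file,
# decoding it from UTF-8 while reading.
sub slurp_utf8 {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Error opening $path: $!";
    local $/;                 # undefine the record separator: read everything
    my $contents = <$fh>;
    close($fh);
    return $contents;
}

# Demo: write the two raw bytes that encode '»' in UTF-8, then read them back.
my ($out, $tmpfile) = tempfile(UNLINK => 1);
binmode($out);                # write raw bytes, no translation layers
print {$out} "\xC2\xBB";
close($out) or die "Error closing $tmpfile: $!";

my $text = slurp_utf8($tmpfile);
printf "%d character(s), first is U+%04X\n", length($text), ord($text);
```

With the layer in place the two bytes arrive as one character (U+00BB), which is what a character-based substitution such as the one in unicodeConvert needs.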

    You can also use binmode to change an already open filehandle (e.g. STDIN/OUT).

    binmode(STDOUT, ":encoding(UTF-8)");
    

    You can also set the default encoding with the open pragma.
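
As a rough sketch of the open pragma, assuming a throwaway temporary file created only for this demonstration: the pragma sets a default layer for open calls in its lexical scope, so a plain '<' mode picks up UTF-8 decoding.

```perl
use strict;
use warnings;
use open IO => ':encoding(UTF-8)';   # default layer for open() calls in this scope
use File::Temp qw(tempfile);

# Create a file holding the raw UTF-8 bytes for '»' plus a newline.
my ($out, $tmpfile) = tempfile(UNLINK => 1);
binmode($out);                       # raw bytes for the demo file
print {$out} "\xC2\xBB\n";
close($out) or die "Error closing $tmpfile: $!";

# No layer is named in the mode, so the pragma's default applies.
open(my $fh, '<', $tmpfile) or die "Error opening $tmpfile: $!";
my $line = <$fh>;
close($fh);
chomp($line);

printf "%d character(s), first is U+%04X\n", length($line), ord($line);
```

An explicit layer in the mode string still takes precedence, so individual open calls can opt out of the default.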

    But for this I suggest trying binmode or changing your open line to see if that solves it.
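
Why a decoding layer is likely to help can be sketched with the core Encode module. This assumes the file is stored as UTF-8 and that the stray 'Â' comes from its bytes being treated as individual characters; the question does not confirm either point.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw = "\xC2\xBB";    # the bytes for '»' in a UTF-8 file

# Without a decoding layer, Perl sees each byte as one character:
# U+00C2 ('Â') followed by U+00BB ('»').
my $undecoded = $raw;

# With decoding, the two bytes become the single character U+00BB.
my $decoded = decode('UTF-8', $raw);

# What each string turns into if it is later encoded as UTF-8 on output:
my $double = encode('UTF-8', $undecoded);   # 4 bytes: C3 82 C2 BB
my $single = encode('UTF-8', $decoded);     # 2 bytes: C2 BB

printf "undecoded: %d chars, re-encoded to %d bytes\n",
    length($undecoded), length($double);
printf "decoded:   %d chars, re-encoded to %d bytes\n",
    length($decoded), length($single);
```

The undecoded path yields an extra 'Â' character in front of '»', which matches the symptom described in the question.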

    If your perl is older than v5.8, things are trickier, but maybe resolvable if you tell us the version.

    A couple of other things I noticed by the way:

    • Not essential, but it's considered better to use a lexically scoped filehandle (my $fh instead of FILE).
    • When you put a newline on the die string, it suppresses the line number information that is normally added to help you find the problem.
    • If you put the name of the file that couldn't be opened (or the SQL that failed, or whatever) in the die message it will be easier to debug.
    • Don't use sub prototypes in Perl 5 (sub unicodeConvert($)); leave the $/@/% etc. out of the declaration. A prototype doesn't just check arguments: it can change how the call is parsed, in confusing ways. Prototypes are only needed to create new "built-in style" operators.
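
Two of the points above can be seen in a short sketch. The sub names first_plain and first_proto are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# 1. A trailing newline in the die message suppresses " at FILE line N."
my $with_newline    = do { eval { die "Cannot open\n" }; $@ };
my $without_newline = do { eval { die "Cannot open" };   $@ };
print "with newline:    $with_newline";
print "without newline: $without_newline";

# 2. A ($) prototype imposes scalar context on the argument, so an array
#    passed to the sub arrives as its element count.
sub first_plain    { my $x = shift; return $x }   # no prototype
sub first_proto($) { my $x = shift; return $x }   # same body, with prototype

my @items = ('a', 'b', 'c');
print "plain: ", first_plain(@items), "\n";   # 'a', the first element
print "proto: ", first_proto(@items), "\n";   # 3, the element count
```

The two subs have identical bodies; only the prototype changes what the caller ends up passing.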
