Why do I get an extra newline in the middle of a UTF-8 character with XML::Parser?

Question

I have run into a problem involving UTF-8, XML, and Perl. Below is the smallest piece of code and data that reproduces it.

Here is the XML file that needs to be parsed:

<?xml version="1.0" encoding="utf-8"?>
<test>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>

  [<words> .... </words> 148 times repeated]

  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
  <words>בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת</words>
</test>

It is parsed with this Perl script:

use warnings;
use strict;

use XML::Parser;
use Data::Dump;

my $in_words = 0;

my $xml_parser = XML::Parser->new(Style => 'Stream');

$xml_parser->setHandlers (
   Start   => \&start_element,
   End     => \&end_element,
   Char    => \&character_data,
   Default => \&default);

open OUT, '>', 'out.txt' or die "Cannot open out.txt: $!";
binmode(OUT, ":utf8");
open XML, 'xml_test.xml' or die;
$xml_parser->parse(*XML);
close XML;
close OUT;


sub start_element {
  my($parseinst, $element, %attributes) = @_;

  if ($element eq 'words') {
    $in_words = 1;
  }
  else {
    $in_words = 0;
  }
}

sub end_element {
  my($parseinst, $element) = @_;   # End handlers receive no attributes

  if ($element eq 'words') {
    $in_words = 0;
  }
}

sub default {
  # nothing to see here;
}

sub character_data {
  my($parseinst, $data) = @_;

  # Print each chunk of character data inside <words> on its own line.
  if ($in_words) {
    print OUT "$data\n";
  }
}

Running the script produces the file out.txt. The problem appears on line 147 of that file: the 22nd character (which in UTF-8 consists of the bytes \xd6 \xb8) is split between the d6 and the b8 by a newline. This should not happen.
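To pin down exactly which output lines are affected, one option is to read out.txt as raw bytes and report every line that is not valid UTF-8 by itself. The following is only a sketch: `invalid_utf8_lines` is a hypothetical helper name, not part of the original script, and it relies on the core Encode module's strict `UTF-8` decoder.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return the numbers of all lines in $path whose
# bytes do not form valid UTF-8 when decoded on their own.
sub invalid_utf8_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @bad;
    while (my $line = <$fh>) {
        chomp $line;
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps $line unmodified.
        my $ok = eval { decode('UTF-8', $line, FB_CROAK | LEAVE_SRC); 1 };
        push @bad, $. unless $ok;
    }
    close $fh;
    return @bad;
}

# Usage: perl check_utf8.pl out.txt
if (@ARGV) {
    print "line $_ is not valid UTF-8\n" for invalid_utf8_lines($ARGV[0]);
}
```

A line that ends in a lone \xd6 and a following line that starts with a lone \xb8 would both be reported, which matches the split described above.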

I would like to know whether anyone else has seen this problem or can reproduce it, and why it happens. I am running the script on Windows:

C:\temp>perl -v

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Copyright 1987-2007, Larry Wall

Binary build 1003 [285500] provided by ActiveState http://www.ActiveState.com
Built May 13 2008 16:52:49

Recommended answer

I do not observe this with:

C:\Temp> perl -v

This is perl, v5.10.1 built for MSWin32-x86-multi-thread
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2009, Larry Wall

Binary build 1006 [291086] provided by ActiveState http://www.ActiveState.com
Built Aug 24 2009 13:48:26

C:\Temp> perl -MXML::Parser -e "print $XML::Parser::VERSION"
2.36
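Separately from the version difference, and not confirmed as the cause in this thread: the XML::Parser documentation notes that a single run of character data may be delivered through several calls to the Char handler. Printing a newline inside that handler can therefore break one text node across output lines. The sketch below assumes the goal is one output line per <words> element; `collect_words` is a hypothetical helper name, and the text is buffered until the end tag instead of being printed chunk by chunk.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: return the complete text of every <words> element,
# no matter how many Char callbacks each element's text is split into.
sub collect_words {
    my ($xml) = @_;
    my @words;
    my $buffer;    # defined only while the parser is inside <words>

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element) = @_;
                $buffer = '' if $element eq 'words';
            },
            Char => sub {
                my ($expat, $data) = @_;
                # Append only; never emit output from here.
                $buffer .= $data if defined $buffer;
            },
            End => sub {
                my ($expat, $element) = @_;
                if ($element eq 'words' && defined $buffer) {
                    push @words, $buffer;
                    undef $buffer;
                }
            },
        },
    );
    $parser->parse($xml);
    return @words;
}

binmode(STDOUT, ':utf8');
print "$_\n"
    for collect_words('<test><words>a &amp; b</words><words>c</words></test>');
```

In the original script the same idea would mean appending to a buffer in character_data and writing to OUT from end_element.');
die "expected 2 elements, got " . scalar(@w) unless @w == 2;
die "first element wrong: $w[0]" unless $w[0] eq 'ab';
die "second element wrong: $w[1]" unless $w[1] eq 'c&d';
</test>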
