Reading double byte files


Question

I was wondering if there is a simple way in Tcl to read a double byte file (or so I think it is called). My problem is that I get files that look fine when opened in Notepad (I'm on Win7), but when I read them in Tcl there are spaces (or rather, null characters) between each and every character.

My current workaround has been to first run a string map to remove all the null characters:

string map {\0 {}} $file

and then process the information normally, but is there a simpler way to do this, through fconfigure, encoding or another way?

I'm not familiar with encodings so I'm not sure what arguments I should use.

fconfigure $input -encoding double

of course fails because double is not a valid encoding. Same with 'doublebyte'.
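The set of valid encoding names depends on the Tcl build, so one way to find out what -encoding accepts is to ask the interpreter with encoding names. A minimal sketch; the list it prints differs between Tcl versions, and on Tcl 8.5/8.6 the 16-bit encoding is usually listed as unicode rather than under a utf-16 name (worth confirming on your own installation):

```tcl
# Print every encoding name that fconfigure -encoding will accept,
# one per line, sorted so that related names end up next to each other.
foreach name [lsort [encoding names]] {
    puts $name
}
```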

I'm actually working on big text files (above 2 GB) and doing my 'workaround' on a line by line basis, so I believe that this slows the process down.


EDIT: As pointed out by @mhawke, the file is UTF-16-LE encoded and this apparently is not a supported encoding. Is there an elegant way to circumvent this shortcoming, maybe through a proc? Or would this make things more complex than using string map?

Solution

I decided to write a small proc to convert the file. It uses a while loop because reading a 3 GB file into a single variable locked the process completely. The comments make the proc look long, but the code itself is short.

proc itrans {infile outfile} {
  set f [open $infile r]

  # Note: files I have been getting have CRLF, so I split on CR to keep the LF and
  # used -nonewline in puts
  fconfigure $f -translation cr -eof ""

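  # Simple switch just to remove the BOM, since the result will be UTF-8
  set bom 0
  # Default to little-endian (keep the first byte of each pair) so that
  # $re is defined even when the file has no BOM
  set re "(..).."
  set o [open $outfile w]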
  while {[gets $f l] != -1} {
    # Convert to binary where the specific characters can be easily identified
    binary scan $l H* l

    # Ignore empty lines
    if {$l == "" || $l == "00"} {continue}

    # If it is the first line, there's the BOM
    if {!$bom} {
      set bom 1

      # Identify and remove the BOM and set what byte should be removed and kept
      if {[regexp -nocase -- {^(?:FFFE|FEFF)} $l m]} {
        regsub -- "^$m" $l "" l

        if {[string toupper $m] eq "FFFE"} {
          set re "(..).."
        } elseif {[string toupper $m] eq "FEFF"} {
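          # Big-endian: keep the second byte of each pair; splitting on CR
          # leaves a lone trailing 00 on this line, so drop that as well
          set re {..(..)|00$}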
        }
      }
      regsub -all -- $re $l {\1} new
    } else {
      # Regardless of utf-16-le or utf-16-be, that should work since we split on CR
      regsub -all -- {..(..)|00$} $l {\1} new
    }
    puts -nonewline $o [binary format H* $new]
  }
  close $o
  close $f
}

itrans infile.txt outfile.txt

Final warning: this silently corrupts any character that actually uses all 16 bits. For example, the code unit sequence 04 30 loses the 04 and becomes 30 instead of D0 B0, as it should per Table 3-4, whereas 00 4D is correctly mapped to 4D. Make sure you either don't mind that or that your file contains no such characters before trying the above.
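If the file may contain characters outside the single-byte range, one alternative (a different technique from the proc above, not the author's method) is to let Tcl's channel encodings do the decoding, which also works line by line. The sketch below is written under some assumptions: pick_utf16_encoding and utf16_to_utf8 are illustrative names; the input is little-endian UTF-16; and on Tcl 8.5/8.6 the unicode encoding follows the machine's native byte order, so the fallback only suits little-endian hardware such as x86. Whether a utf-16le name is available depends on the Tcl release, which is why the code checks encoding names first.

```tcl
# Sketch: convert a little-endian UTF-16 text file to UTF-8 line by line.
# pick_utf16_encoding and utf16_to_utf8 are hypothetical helper names.
proc pick_utf16_encoding {} {
    # Prefer an explicit little-endian name when this Tcl build offers one;
    # otherwise fall back to "unicode" (native byte order on Tcl 8.5/8.6).
    if {"utf-16le" in [encoding names]} {
        return utf-16le
    }
    return unicode
}

proc utf16_to_utf8 {infile outfile} {
    set in [open $infile r]
    # Decode 16-bit code units; auto translation accepts CRLF, CR or LF
    fconfigure $in -encoding [pick_utf16_encoding] -translation auto
    set out [open $outfile w]
    # Write UTF-8 with plain LF line endings
    fconfigure $out -encoding utf-8 -translation lf

    set first 1
    while {[gets $in line] >= 0} {
        if {$first} {
            # The BOM shows up as the character U+FEFF at the start of line one
            set line [string trimleft $line \uFEFF]
            set first 0
        }
        puts $out $line
    }
    close $out
    close $in
}
```

It is called the same way as itrans, for example utf16_to_utf8 infile.txt outfile.txt. Because the text is decoded rather than having every other byte dropped, a code unit such as 04 30 is written as the two UTF-8 bytes D0 B0.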
