How to read Chinese files?


Question

I'm stuck on all this confusing encoding stuff. I have a file containing Chinese subtitles. I believe it is UTF-8, because opening it as UTF-8 in Notepad++ gives a very good result. If I set gb2312 instead, the Chinese part is still fine, but I see some UTF-8 sequences that are not converted.

The goal is to loop through the text in the file and count how many times each character occurs.

import os
import re
import io

character_dict = {}
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        if "srt" in filename:
            import codecs
            f = codecs.open(filename, 'r', 'gb2312', errors='ignore')
            s = f.read()

            # deleting {}
            s = re.sub('{[^}]+}', '', s)
            # deleting every line that does not start with a chinese char
            s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
            # delete non chinese chars
            s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
            #print s
            s = s.encode('gb2312')
            print s
            for c in s:
                #print c
                pass

This actually gives me the complete Chinese text. But when I print inside the loop at the bottom, I just get question marks instead of the individual characters.

Also note that I said the file is UTF-8, but I have to use gb2312 both for the encoding in the code and as the setting in my gnome-terminal. If I set it to UTF-8 in the code, I just get garbage, no matter whether the terminal is set to UTF-8 or gb2312. So maybe this file is not UTF-8 after all!?

In any case, s contains the full Chinese text. Why can't I loop over it?

Please help me understand this. It is very confusing for me and the docs are getting me nowhere. Google just leads me to similar problems that somebody solved, but so far no explanation has helped me understand this.

Recommended answer

gb2312 is a multi-byte encoding. If you iterate over a bytestring encoded with it, you iterate over the bytes, not over the characters you want to count (or print). Each Chinese character takes two bytes in gb2312, and a single byte on its own is not a valid character, which is most likely why the terminal shows question marks. You probably want to do your iteration on the unicode string before encoding it. If necessary, you can encode the individual code points (characters) to their own bytestrings for output:

# don't do s = s.encode('gb2312')
for c in s:      # iterate over the unicode codepoints
    print c.encode('gb2312')  # encode them individually for output, if necessary
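
The difference between iterating over bytes and iterating over characters can be sketched as follows. This is only a minimal illustration in Python 3 (the code above is Python 2), where `str` is the unicode type and `bytes` is the bytestring type. The sample text and the helper name `count_characters` are made up for the example, and the input is assumed to decode cleanly as gb2312.

```python
from collections import Counter


def count_characters(data, encoding="gb2312"):
    """Decode raw bytes first, then count code points (hypothetical helper)."""
    text = data.decode(encoding)  # bytes -> str; raises UnicodeDecodeError on bad input
    return Counter(text)          # iterating a str yields one character at a time


# Made-up sample standing in for the bytes read from a subtitle file.
sample_text = "你好你好吗"
encoded = sample_text.encode("gb2312")

# Five characters become ten bytes, because each one takes two bytes in gb2312.
print(len(sample_text), len(encoded))

# Iterating over the bytes yields single byte values, not characters.
print(list(encoded)[:4])

# Decoding before counting gives per-character totals.
char_counts = count_characters(encoded)
print(char_counts.most_common(1))
```

Counting on the decoded string also keeps the totals independent of the terminal's encoding; only the final printing step depends on it.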
