Work with Chinese in Python

Problem description

I'm trying to work with Chinese text and big data in Python. Part of the work is cleaning unneeded data out of the text, and I'm using regexes for that. However, I've run into some problems, both with Python regexes and with the PyCharm application:

1) The data is stored in PostgreSQL and displays correctly in the columns. However, after I select it and pull it into a variable, it is displayed as squares:

When the value is printed to the console, it looks like this:

Mentholatum 曼秀雷敦 男士 深层活炭洁面乳100g(新包装)

So I presume the problem is not with the application's encoding but with how the debugger displays the values. However, I did not find any solution for this behaviour.

2) One example of a regex I need is one that removes the values between Chinese brackets, including the brackets themselves. The code I used is:

#!/usr/bin/env python
# -*- coding: utf-8 -*

import re
from pprint import pprint 
import sys, locale, os

    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    nonASCIIregex = re.compile('([^\x00-\x7F])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)

    if isInclude:
        regex = startFrom + '(.*)' + endWith
    else:
        regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'
    if nonASCIIregex.match(regex):
        p = re.compile(ur'' + regex)
    else:
        p = re.compile(regex)
    row[columnName] = p.sub("", columnString).strip()

But the regex has no effect on the given string. I ran a test with the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*
import re

reg = re.compile(ur'((.*))')
string = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩(原男士劲能净爽洁面啫哩)100ml"
print string
string = reg.sub("", string)
print string

And it works fine for me. The only difference between the two code examples is that in the first one the regex values come from a txt file containing JSON, encoded as UTF-8:

{
                "between": {
                    "startsTo": "(",
                    "endsAt": ")",
                    "include": true,
                    "sequenceID": "1"
                }
            }, {
                "between": {
                    "startsTo": "(",
                    "endsAt": ")",
                    "include": true,
                    "sequenceID": "2"
                }
            },{
                "between": {
                    "startsTo": "(",
                    "endsAt": ")",
                    "include": true,
                    "sequenceID": "2"
                }
            },{
                "between": {
                    "startsTo": "(",
                    "endsAt": ")",
                    "include": true,
                    "sequenceID": "2"
                }
            }

The Chinese brackets from the file are also displayed as squares:

I can't find an explanation or a solution for this behavior, so I need the community's help.

Thanks for your help.

Recommended answer

The problem is that the text you're reading in isn't being interpreted as Unicode correctly (this is one of the big gotchas that prompted the sweeping changes in Python 3k). Instead of:

data_file = myfile.read()

You need to tell it to decode the file:

data_file = myfile.read().decode("utf8")

Then continue with json.loads, etc., and it should work out fine. Alternatively,

data = json.load(myfile, "utf8")
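
For readers on Python 3, here is a minimal sketch of the same idea: decode the file as UTF-8 while reading, then build the bracket-removal pattern from the loaded values. It is not the asker's code. The helper names `load_rules` and `remove_between`, the temporary rule file, and the top-level JSON list are all assumptions made for illustration. The sketch also assumes the brackets in question are the fullwidth pair U+FF08 and U+FF09, which are written as escapes so they stay unambiguous.

```python
import io
import json
import os
import re
import tempfile


def load_rules(path):
    """Read a UTF-8 JSON file and return the parsed rules (hypothetical helper)."""
    # io.open decodes the bytes to text while reading, so json sees Unicode.
    with io.open(path, encoding="utf-8") as handle:
        return json.load(handle)


def remove_between(text, start, end, include=True):
    """Remove the text between start and end (hypothetical helper).

    With include=True the delimiters are removed as well.
    """
    # Escaping unconditionally is harmless for characters that are not special.
    start, end = re.escape(start), re.escape(end)
    if include:
        pattern = start + ".*?" + end
    else:
        pattern = "(?<=" + start + ").*?(?=" + end + ")"
    return re.sub(pattern, "", text).strip()


# Hypothetical rule file mirroring the "between" layout from the question.
# \uff08 and \uff09 are the fullwidth left and right parentheses.
rules_text = '[{"between": {"startsTo": "\uff08", "endsAt": "\uff09", "include": true}}]'
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    tmp.write(rules_text.encode("utf-8"))

rules = load_rules(tmp.name)
os.remove(tmp.name)

sample = "巴黎欧莱雅 男士 劲能冰爽洁面啫哩\uff08原男士劲能净爽洁面啫哩\uff09100ml"
for rule in rules:
    between = rule["between"]
    sample = remove_between(
        sample, between["startsTo"], between["endsAt"], between["include"]
    )
print(sample)
```

Two design choices differ from the question's code. The sketch always calls `re.escape` instead of testing the delimiter first, because `match` only inspects the start of the string. It also uses the non-greedy `.*?`, so two bracketed groups on one line are removed separately rather than together with everything between them.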
