Work with Chinese in Python

Views: 57 Published: 2021/5/4 19:19:42 Tags: python regex python-2.7 encoding

Problem description

I'm trying to work with Chinese text and big data in Python. Part of the work is cleaning unneeded data out of the text, and I am using regexes for that. However, I have run into problems both with Python regexes and with the PyCharm application:

1) The data is stored in PostgreSQL and displays correctly in the columns. However, after I select it and pull it into a variable, it is displayed as squares.

When the value is printed to the console, it looks like this:

Mentholatum 曼秀雷敦男士深层活炭洁面乳100g（新包装）

So I presume the problem is not with the application's encoding but with how the debugger handles the encoding. However, I have not found any solution for this behaviour.
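
One way to narrow this down is to check what type the fetched value actually has and which code points it contains. Squares in a viewer often point to a font or display issue rather than damaged data, and the code points make that easy to tell apart. The sketch below is written for Python 3 (the question targets 2.7, where `str` and `unicode` play the roles of `bytes` and `str`). The helper name `describe_value` and the assumption that raw bytes are UTF-8 are illustrative and not taken from the question.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper: report whether a value is bytes or text and
# list its code points. Assumes any raw bytes are UTF-8 encoded.


def describe_value(value, encoding="utf-8"):
    """Return (kind, text, code_points) for a bytes or str value."""
    if isinstance(value, bytes):
        kind = "bytes"
        # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
        text = value.decode(encoding)
    else:
        kind = "text"
        text = value
    code_points = ["U+%04X" % ord(ch) for ch in text]
    return kind, text, code_points


sample = "曼秀雷敦（新包装）"
kind, text, points = describe_value(sample.encode("utf-8"))
print(kind, points[4], points[-1])  # bytes U+FF08 U+FF09
```

If the code points are the expected ones (for example U+FF08 and U+FF09 for the full-width brackets), the data itself is intact and only the display is at fault.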

2) The regex I need to get right removes everything between Chinese brackets, including the brackets themselves. The code I use is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re
from pprint import pprint 
import sys, locale, os

    columnString = row[columnName]
    startFrom = valuestoremove["startsTo"]
    endWith = valuestoremove["endsAt"]
    isInclude = valuestoremove["include"]
    escapeCharsRegex = re.compile('([\.\^\$\*\+\?\(\)\[\{\|])')
    nonASCIIregex = re.compile('([^\x00-\x7F])')
    if escapeCharsRegex.match(startFrom):
        startFrom = re.escape(startFrom)
    if escapeCharsRegex.match(endWith):
        endWith = re.escape(endWith)

    if isInclude:
        regex = startFrom + '(.*)' + endWith
    else:
        regex = '(?<=' + startFrom + ').*?(?=' + endWith + ')'
    if nonASCIIregex.match(regex):
        p = re.compile(ur'' + regex)
    else:
        p = re.compile(regex)
    row[columnName] = p.sub("", columnString).strip()
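
For comparison, the pattern-building step can be written more compactly. This is a minimal sketch under assumptions, not the asker's code: it targets Python 3, where every string is Unicode text (under 2.7 the delimiters and the input would both need to be `unicode`), and the function name `build_between_pattern` and its parameters are hypothetical. It calls `re.escape` unconditionally, since escaping a character that is not special is harmless, and it swaps the greedy `(.*)` for a non-greedy `.*?` so that a line with several bracketed groups is not swallowed whole.

```python
import re


def build_between_pattern(starts_with, ends_with, include=True):
    """Compile a regex matching the text between two delimiters.

    include=True  -> the delimiters are part of the match
    include=False -> only the text between them is matched
    """
    start = re.escape(starts_with)
    end = re.escape(ends_with)
    if include:
        pattern = start + ".*?" + end
    else:
        # Look-behind requires a fixed-width pattern; an escaped literal is one.
        pattern = "(?<=" + start + ").*?(?=" + end + ")"
    return re.compile(pattern)


text = "巴黎欧莱雅 男士 劲能冰爽洁面啫哩（原男士劲能净爽洁面啫哩）100ml"
cleaned = build_between_pattern("（", "）").sub("", text).strip()
print(cleaned)  # 巴黎欧莱雅 男士 劲能冰爽洁面啫哩100ml
```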

However, my regex has no effect on the given string. I ran a test with the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

reg = re.compile(ur'（(.*)）')
string = u"巴黎欧莱雅 男士 劲能冰爽洁面啫哩（原男士劲能净爽洁面啫哩）100ml"
print string
string = reg.sub("", string)
print string

That test works fine for me. The only difference between the two code samples is that in the first one the regex values come from a txt file containing JSON, encoded as UTF-8:

{
    "between": {
        "startsTo": "(",
        "endsAt": "）",
        "include": true,
        "sequenceID": "1"
    }
}, {
    "between": {
        "startsTo": "（",
        "endsAt": ")",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "(",
        "endsAt": ")",
        "include": true,
        "sequenceID": "2"
    }
}, {
    "between": {
        "startsTo": "（",
        "endsAt": "）",
        "include": true,
        "sequenceID": "2"
    }
}
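
Rules like these can be read as Unicode text by opening the file with an explicit encoding before parsing. The following is a minimal sketch, assuming Python 3, a hypothetical file name `rules.json`, and that the rule objects sit inside a JSON array (the excerpt does not show the enclosing structure). It writes a small sample file first so that it runs on its own, and the helper names `load_rules` and `apply_rules` are illustrative.

```python
import io
import json
import os
import re
import tempfile

# Sample rules; the enclosing array is an assumption about the file layout.
SAMPLE = """[
    {"between": {"startsTo": "（", "endsAt": "）", "include": true}},
    {"between": {"startsTo": "(", "endsAt": ")", "include": true}}
]"""


def load_rules(path):
    """Parse the rule file as UTF-8 so the delimiters arrive as text."""
    with io.open(path, encoding="utf-8") as handle:
        return json.load(handle)


def apply_rules(text, rules):
    """Remove every delimited span, delimiters included.

    The "include" flag is ignored here for brevity; every sample
    rule sets it to true.
    """
    for rule in rules:
        spec = rule["between"]
        pattern = re.escape(spec["startsTo"]) + ".*?" + re.escape(spec["endsAt"])
        text = re.sub(pattern, "", text)
    return text.strip()


# Hypothetical file name, written to a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "rules.json")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(SAMPLE)

rules = load_rules(path)
result = apply_rules("洁面乳100g（新包装） (new)", rules)
print(result)  # 洁面乳100g
```

Keeping both the delimiters and the input as text of the same type avoids the situation where a Unicode pattern is applied to raw bytes, or the reverse, and silently matches nothing.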

The Chinese brackets read from the file are also displayed as squares.

I can't find an explanation or a solution for this behaviour, so I am asking the community for help.

Thanks for your help.
