Python re.compile and split with ÆØÅ characters


Question

I am very new to Python. I have a file with a list of words. They contain Danish letters (ÆØÅ), but re.compile does not understand these characters: the function splits the words at each ÆØÅ. The text is downloaded from Twitter and Facebook and does not always contain only letters.

import re

text = "Rød grød med fløde.... !! :)"
pattern_split = re.compile(r"\W+")
words = pattern_split.split(text.lower())
# Result I get:
# words = ['r', 'd', 'gr', 'd', 'med', 'fl', 'de']

The correct result should be

    words = ['rød', 'grød', 'med', 'fløde']

How can I get the correct result?

Full code

#!/usr/bin/python 
# -*- coding: utf-8 -*-

import math, re, sys, os
reload(sys)
sys.setdefaultencoding('utf-8')

# AFINN-111 is as of June 2011 the most recent version of AFINN
#filenameAFINN = 'AFINN/AFINN-111.txt'

# Get location of file
__location__ = os.path.realpath(
    os.path.join(os.getcwd(), os.path.dirname(__file__)))


filenameAFINN = __location__ + '/AFINN/AFINN-111DK.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [ 
            ws.strip().split('\t') for ws in open(filenameAFINN) ]))

# Word splitter pattern
pattern_split = re.compile(r"\W+")
#pattern_split = re.compile('[ .,:();!?]+')

def sentiment(text):
    print(text)
    words = pattern_split.split(text.lower().strip())
    print(words)
    sentiments = map(lambda word: afinn.get(word, 0), words)
    if sentiments:
        sentiment = float(sum(sentiments))/math.sqrt(len(sentiments))

    else:
        sentiment = 0
    return sentiment


# Print result
text = "ånd ånd med fløde... :)asd "
id = 999
split = "###"
print("%6.2f%s%s%s%s" % (sentiment(text), split, id, split, text))

Recommended answer

Reworking your script to use best practices:

import csv
import math
import os
import re

LOCATION = os.path.dirname(os.path.abspath(__file__))
# No leading slash: an absolute second argument would make join() discard LOCATION
afinn_filename = os.path.join(LOCATION, 'AFINN', 'AFINN-111DK.txt')

pattern_split = re.compile(r"\W+")

with open(afinn_filename, encoding='utf8', newline='') as infile:
    reader = csv.reader(infile, delimiter='\t')
    afinn = {key: int(score) for key, score in reader}


def sentiment(text):
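    # Drop the empty strings split() yields when text starts or ends with a separator
    words = [word for word in pattern_split.split(text.lower()) if word]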
    if not words:
        return 0
    sentiments = [afinn.get(word, 0) for word in words]
    return sum(sentiments) / math.sqrt(len(sentiments))


# Print result
text = "ånd ånd med fløde... :)asd "
id = 999
split = "###"
print('{sentiment:6.2f}{split}{id}{split}{text}'.format(
    sentiment=sentiment(text), id=id, split=split, text=text))

Running this with Python 3 means that text is a Unicode string (str) and that the regular expression is interpreted with the re.UNICODE flag set.
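
To see the effect of that flag in isolation, here is a minimal sketch for Python 3 that uses only the standard re module. The names unicode_split and ascii_split are made up for this illustration. Passing re.ASCII switches off the Unicode-aware matching and reproduces the broken split from the question:

```python
import re

text = "Rød grød med fløde.... !! :)"

# Python 3 str patterns are Unicode-aware by default, so ø is a word character.
unicode_split = re.compile(r"\W+")
# re.ASCII restricts \w to [a-zA-Z0-9_], so ø is treated as a separator.
ascii_split = re.compile(r"\W+", re.ASCII)

# split() leaves an empty string when the text ends with a separator; drop it.
unicode_words = [w for w in unicode_split.split(text.lower()) if w]
ascii_words = [w for w in ascii_split.split(text.lower()) if w]

print(unicode_words)  # ['rød', 'grød', 'med', 'fløde']
print(ascii_words)    # ['r', 'd', 'gr', 'd', 'med', 'fl', 'de']
```

If only the words are needed, re.findall(r"\w+", text.lower()) is an alternative that never produces empty strings.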

In Python 2, you'd use:

text = u"ånd ånd med fløde... :)asd "

(note the leading u prefix on the string) and

pattern_split = re.compile(ur"\W+", re.UNICODE)

Your AFINN file would still be read as CSV, but with the key decoded from UTF-8 after the fact:

with open(afinn_filename, 'rb') as infile:
    reader = csv.reader(infile, delimiter='\t')
    afinn = {key.decode('utf8'): int(score) for key, score in reader}
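
Why the decoding matters can also be shown in Python 3, where a Python 2 byte string corresponds to a bytes object. The sketch below is an illustration rather than part of the original answer. In UTF-8, each of æ, ø and å is encoded as two bytes, and a bytes pattern does not count either byte as a word character, so the split lands inside the word. Decoding to text before splitting avoids that:

```python
import re

# What a UTF-8 encoded file holds on disk for this phrase.
raw = "rød grød med fløde".encode("utf-8")

# A bytes pattern only knows ASCII word characters; both bytes of ø match \W.
byte_words = re.compile(rb"\W+").split(raw)

# Decode first, then split with a str pattern: ø is one word character.
text_words = re.compile(r"\W+").split(raw.decode("utf-8"))

print(byte_words)  # [b'r', b'd', b'gr', b'd', b'med', b'fl', b'de']
print(text_words)  # ['rød', 'grød', 'med', 'fløde']
```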
