Encoding Unicode in the Dictionary Key to Japanese


Problem description

I just started working on text clustering in Japanese with Python 2. However, when I create a dictionary from these Japanese words/terms, the dictionary keys are shown as Unicode escape sequences instead of Japanese. The code is as follows:

# imports needed by the snippet
import numpy as np
import pandas as pd
import scipy.sparse as sp

# load data
allWrdMat10 = pd.read_csv("../../data/allWrdMat10.csv.gz", encoding='CP932')

## Set X as CSR Sparse Matrix
X = np.array(allWrdMat10)
X = sp.csr_matrix(X)

## create dictionary
dict_index = {t:i for i,t in enumerate(allWrdMat10.columns)}

freqrank = np.array(dict_index.values()).argsort()
X_transform = X[:, freqrank < 1000].transpose().toarray()

The result of allWrdMat10.columns is still displayed in Japanese, as follows:

Index([u'?', u'.', u'・', u'%', u'0', u'1', u'10月', u'11月', u'12月', u'1つ',
       ...
       u'瀋陽', u'疆', u'盧', u'籠', u'絆', u'胚', u'諫早', u'趙', u'鉉', u'鎔基'],
      dtype='object', length=8655)

However, the result of dict_index.keys() is as follows:

[u'\u77ed\u9283',
 u'\u5efa\u3066',
 u'\u4f0a',
 u'\u5e73\u5b89',
 u'\u6025\u9a30',
 u'\u897f\u65e5\u672c',
 u'\u5e03\u9663',
 ...]

Is there any way I can keep the Japanese words/terms in the dictionary keys? Or is there any way I can convert the Unicode escapes back to Japanese words/terms? Thanks.

Recommended answer

You did not prefix the string with u, which is needed in Python 2. Even better, use from __future__ import unicode_literals.
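As an illustration of the suggestion above, here is a minimal sketch. The sample terms are hypothetical stand-ins for allWrdMat10.columns, and the names terms, ordered and joined are not from the original post. It relies on the fact that, in Python 2, displaying a list of unicode strings shows each element's repr, which writes non-ASCII characters as \uXXXX escapes. The strings themselves are unchanged, so printing a key on its own should show the Japanese text, assuming the terminal encoding can display it.

```python
# -*- coding: utf-8 -*-
# Written to run under both Python 2 and Python 3.
from __future__ import print_function, unicode_literals

# Hypothetical sample terms standing in for allWrdMat10.columns.
terms = ['西日本', '平安', '伊']

# Same construction as dict_index in the question.
dict_index = {t: i for i, t in enumerate(terms)}

# The escaped spelling and the Japanese spelling denote the same string;
# the escapes only appear when Python 2 shows the repr of the keys.
same_string = ('\u897f\u65e5\u672c' == '西日本')
print(same_string)

# Recover the keys in column order, then print them as text
# rather than as the repr of a list.
ordered = sorted(dict_index, key=dict_index.get)
joined = ', '.join(ordered)
print(joined)
```

With unicode_literals in effect, every plain string literal in the file is a unicode object under Python 2, so no u prefix is needed on the sample terms.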
