Filtering UTF-8 encoded text to contain only Latin alphabet characters


Question

I'm trying to filter text data so that it contains only Latin characters, for further text analysis. The original text source most likely contained Korean characters. They show up like this in the text file:

\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION

What would be the fastest/easiest/most complete way to remove these? I tried writing a script that removes all \xXX combinations, but it turns out there are too many exceptions for this to be reliable.

Is there a way to remove all non-Latin characters from UTF-8 encoded text?

Thanks.

Solution:

import string

textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
outtext = ''

for char in textin:
    if char in string.printable:
        outtext += char

print(outtext)

My data ended up as bytes for some reason, don't ask me why. :D
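For comparison, here is a minimal sketch of the same filtering without an explicit loop. It assumes that "Latin" can be read as plain ASCII for this data, which the question does not confirm. The variable names raw, text and ascii_only are made up for illustration. Decoding the sample shows that the escaped bytes are ordinary UTF-8 for two non-ASCII characters, and re-encoding to ASCII with errors='ignore' drops them:

```python
raw = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'

# Decode the raw bytes into a str; the \xXX sequences are UTF-8 encoded characters.
text = raw.decode('utf-8')

# Re-encode as ASCII and silently drop every character that has no ASCII form
# (assumption: ASCII is an acceptable stand-in for "Latin" here).
ascii_only = text.encode('ascii', errors='ignore').decode('ascii')

print(ascii_only)  # 8 ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION
```

Unlike the string.printable check above, this variant keeps every ASCII character, including control characters, so the two approaches can differ on input that contains them.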

Recommended answer

How about this:

import string

intext = b'<your funny characters>'
outtext = ''

for char in intext.decode('utf-8'):
    if char in string.ascii_letters:
        outtext += char

I'm not sure this is what you want, however. For the given intext, outtext is empty. If you append string.digits to string.ascii_letters, outtext is '11'.

(edited to fix a mistake in the code, pointed out by OP)
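The effect of the chosen character set can be seen with a small sketch. The helper keep_chars is hypothetical and not part of the answer. The sample is a shortened version of the string from the question. Which characters count as allowed is an assumption you would adjust to your own data:

```python
import string


def keep_chars(text, allowed):
    """Return text with every character not in allowed removed (hypothetical helper)."""
    allowed_set = set(allowed)  # set membership avoids rescanning a string per character
    return ''.join(ch for ch in text if ch in allowed_set)


sample = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray'.decode('utf-8')

# Letters only: digits, spaces and the hyphen are removed as well.
letters_only = keep_chars(sample, string.ascii_letters)

# Letters, digits and spaces: the text stays readable, only the hyphen is lost.
letters_digits_spaces = keep_chars(sample, string.ascii_letters + string.digits + ' ')

print(letters_only)           # ONEPIECEFILMGOLDBluray
print(letters_digits_spaces)  # 8 ONE PIECE FILM GOLD Bluray
```

Filtering on string.ascii_letters alone also strips spaces and punctuation, so for later text analysis it is probably worth adding at least the space character to the allowed set.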
