Filtering text encoded with UTF-8 to only contain Latin alphabet characters

Views: 154 | Published: 2020/7/13 5:54:52 | Tags: python, encoding, utf-8

Question

I'm trying to filter text data so that it contains only Latin characters, for further text analysis. The original text source most likely contained Korean characters. This is how it shows up in the text file:

\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION

What would be the fastest/easiest/most complete way to remove these? I tried writing a script that removes all \xXX combinations, but it turns out there are too many exceptions for this to be reliable.

Is there a way to remove all non-Latin characters from UTF-8 encoded text?

Thanks.

Solution:

import string

textin = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'.decode('UTF-8')
outtext = ''

for char in textin:
    if char in string.printable:
        outtext += char

print(outtext)

For some reason my data came in as bytes rather than text, don't ask me why. :D
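Since the input may arrive either as bytes or as already-decoded text, the same filter can be wrapped in a small helper. This is only a sketch: the name `keep_printable_ascii` and the choice of `errors="replace"` are assumptions for illustration, not part of the original post. For reference, the escaped bytes in the sample decode to the CJK characters 第 and 位, which `string.printable` does not contain.

```python
import string

# ASCII letters, digits, punctuation and whitespace
PRINTABLE = set(string.printable)

def keep_printable_ascii(data, encoding="utf-8"):
    """Return only the printable-ASCII characters of `data` (bytes or str)."""
    if isinstance(data, bytes):
        # Invalid byte sequences become U+FFFD, which the filter below drops.
        text = data.decode(encoding, errors="replace")
    else:
        text = data
    return "".join(ch for ch in text if ch in PRINTABLE)

raw = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'
print(keep_printable_ascii(raw))
# prints: 8 ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION
```

Building the result with `"".join(...)` avoids repeated string concatenation in the loop, which matters mostly for large inputs.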

Recommended answer

How about this:

import string

intext = b'<your funny characters>'
outtext = ''

for char in intext.decode('utf-8'):
    if char in string.ascii_letters:
        outtext += char

I'm not sure this is what you want, however. For the given intext, outtext is empty. If you append string.digits to string.ascii_letters, outtext is '11'.
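To make the effect of the allowed character set concrete, here is a minimal sketch applied to the sample line from the question. The helper name `keep_allowed` and the decision to also allow the space character are assumptions for illustration; they are not in the answer above.

```python
import string

def keep_allowed(data, allowed):
    """Decode UTF-8 bytes and keep only characters found in `allowed`."""
    allowed = set(allowed)
    return "".join(ch for ch in data.decode("utf-8") if ch in allowed)

raw = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE FILM GOLD Blu-ray GOLDEN LIMITED EDITION'

# Letters only: digits, spaces and the hyphen are all dropped.
letters_only = keep_allowed(raw, string.ascii_letters)

# Letters, digits and spaces: the "8" and the word breaks survive.
letters_digits_space = keep_allowed(raw, string.ascii_letters + string.digits + " ")

print(letters_only)
print(letters_digits_space)
```

Note that `string.ascii_letters` alone also strips the spaces between words, so some whitespace usually needs to be allowed if the result is meant for text analysis.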

(edited to fix a mistake in the code, pointed out by OP)
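Both snippets above treat "Latin" as plain ASCII, so accented letters such as é would be removed as well. If those should be kept, one possible alternative (a different technique from the answer, shown here only as a sketch) is to check each character's Unicode name with the standard `unicodedata` module. The helper name `keep_latin` and its `extra` parameter are hypothetical.

```python
import unicodedata

def keep_latin(text, extra=" "):
    """Keep letters whose Unicode name starts with 'LATIN', plus any character in `extra`."""
    kept = []
    for ch in text:
        # isalpha() excludes symbols such as U+271D LATIN CROSS;
        # name(ch, "") returns "" for characters that have no name.
        is_latin_letter = ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
        if ch in extra or is_latin_letter:
            kept.append(ch)
    return "".join(kept)

sample = b'\xe7\xac\xac8\xe4\xbd\x8d ONE PIECE caf\xc3\xa9'.decode('utf-8')
print(keep_latin(sample))
# prints: " ONE PIECE café" (without the quotes, with a leading space)
```

Digits and punctuation are dropped unless they are passed in `extra`, for example `extra=" 0123456789-"`.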
