Apostrophe turning into \x92


Problem description

mycorpus.txt

Human where's machine interface for lab abc computer applications   
A where's survey of user opinion of computer system response time

stopwords.txt

let's
ain't
there's

The following code

corpus = set()
for line in open("path\\to\\mycorpus.txt"):
    corpus.update(set(line.lower().split()))
print corpus

stoplist = set()
for line in open("C:\\Users\\Pankaj\\Desktop\\BTP\\stopwords_new.txt"):
    stoplist.add(line.lower().strip())
print stoplist

gives the following output

set(['a', "where's", 'abc', 'for', 'of', 'system', 'lab', 'machine', 'applications', 'computer', 'survey', 'user', 'human', 'time', 'interface', 'opinion', 'response'])
set(['let\x92s', 'ain\x92t', 'there\x92s'])

Why is the apostrophe turning into \x92 in the second set?

Recommended answer

Code point 92 (hex) in the windows-1252 encoding corresponds to Unicode code point 2019 (hex), which is 'RIGHT SINGLE QUOTATION MARK'. This looks very much like an apostrophe and is likely to be the actual character in your stopwords file. Judging from the way Python has interpreted it, I would guess the file is encoded in windows-1252 or in another encoding that shares its code point values for ASCII and for ’.
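To check this interpretation, here is a minimal sketch that decodes one of the byte strings from the printed set as windows-1252 (Python's codec name for it is `cp1252`) and looks up the resulting character. It is written for Python 3, and the variable names are illustrative only.

```python
import unicodedata

# One of the byte strings from the printed stoplist, taken as raw bytes.
raw = b"let\x92s"

# Interpret the bytes as windows-1252; byte 0x92 should map to U+2019.
text = raw.decode("cp1252")
quote = text[3]

print(hex(ord(quote)))           # code point of the decoded character
print(unicodedata.name(quote))   # official Unicode name of that character

# Replace the typographic quote with a plain ASCII apostrophe (U+0027).
normalized = text.replace("\u2019", "'")
print(normalized)
```

If the file really is windows-1252, the decoded character is the typographic quote, and replacing it with `'` makes the word comparable with tokens such as `where's` from the corpus.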

’ (U+2019, RIGHT SINGLE QUOTATION MARK) vs ' (U+0027, APOSTROPHE)
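One possible way to avoid the mismatch when loading the stoplist is to decode the file explicitly and map the typographic quote to a plain apostrophe. This is only a sketch: `load_stopwords` is a hypothetical helper name, and the `cp1252` default is an assumption based on the guess above, so pass a different encoding if the file turns out to be saved otherwise.

```python
import io


def load_stopwords(path, encoding="cp1252"):
    """Read one stopword per line and normalize U+2019 to a plain apostrophe.

    The default encoding is an assumption; change it if the file is
    actually stored in another encoding.
    """
    stoplist = set()
    # io.open decodes the bytes to text using the given encoding.
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            word = line.strip().lower().replace(u"\u2019", u"'")
            if word:  # skip blank lines
                stoplist.add(word)
    return stoplist
```

Calling `load_stopwords` with the path to the stopwords file would then yield entries such as `let's` with an ASCII apostrophe, in the same form as the words read from the corpus.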
