Doctest fails due to unicode leading u

Question

I am writing a doctest for a function that outputs a list of tokenized words.

r'''

>>> s = "This is a tokenized sentence s\u00f3"
>>> tokenizer.tokenize(s)
['This', 'is', 'a', 'tokenized', 'sentence', 'só']

'''

Using Python3.4 my test passes with no problems.

Using Python2.7 I get:

Expected:
  ['This', 'is', 'a', 'tokenized', 'sentence', 'só']
Got:
  [u'This', u'is', u'a', u'tokenized', u'sentence', u's\xf3']

My code has to work on both Python3.4 and Python2.7. How can I solve this problem?

Recommended answer

Python 3 uses a different canonical representation (repr) for Unicode strings: there is no u prefix, and printable non-ASCII characters are shown literally. For example, 'só' is a Unicode string in Python 3, whereas the same text seen in Python 2 output is a bytestring.
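The difference is only in how the strings are displayed, not in their values. A minimal sketch of that point, assuming whitespace splitting as a stand-in for the tokenizer (the names `tokens`, `shown`, `has_u_prefix` and `same_value` are illustrative only):

```python
# The u prefix is accepted by Python 2 and by Python 3.3+.
tokens = u"tokenized sentence s\u00f3".split()

# doctest compares the repr, and the repr is version dependent:
# Python 2 adds the u prefix and escapes the non-ASCII character,
# Python 3 does neither.
shown = repr(tokens)
has_u_prefix = shown.startswith("[u'")
print(has_u_prefix)  # False on Python 3, True on Python 2

# The values themselves compare equal on both versions.
same_value = tokens == [u"tokenized", u"sentence", u"s\xf3"]
print(same_value)
```

Comparing values instead of reprs sidesteps the prefix, at the cost of a less readable failure message.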

If all you are interested in is how the function splits an input text into tokens, you could print each token on a separate line to make the result Python 2/3 compatible:

>>> print("\n".join(tokenizer.tokenize(s)))
This
is
a
tokenized
sentence
só
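
As a runnable illustration of the print-based approach, here is a minimal sketch written for Python 3. The `tokenize` function is a hypothetical stand-in that splits on whitespace (the real `tokenizer.tokenize` is not shown in the question), and `failures_in` is a helper introduced only for this example:

```python
# -*- coding: utf-8 -*-
import doctest


def tokenize(text):
    r"""Hypothetical stand-in for tokenizer.tokenize: split on whitespace.

    >>> s = u"This is a tokenized sentence s\u00f3"
    >>> print("\n".join(tokenize(s)))
    This
    is
    a
    tokenized
    sentence
    só
    """
    return text.split()


def failures_in(func):
    """Run the doctest examples in func's docstring; return the failure count."""
    test = doctest.DocTestParser().get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0)
    return doctest.DocTestRunner(verbose=False).run(test).failed


if __name__ == "__main__":
    print(failures_in(tokenize))
```

Because print writes the text itself rather than its repr, neither the u prefix nor the quotes appear in the expected output.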

As an alternative, you could customize doctest.OutputChecker, for example:

#!/usr/bin/env python
r"""
>>> u"This is a tokenized sentence s\u00f3".split()
[u'This', u'is', u'a', u'tokenized', u'sentence', u's\xf3']
"""
import doctest
import re
import sys

class Py23DocChecker(doctest.OutputChecker):
    """Compare doctest output while ignoring the Python 2 u prefix."""

    def check_output(self, want, got, optionflags):
        if sys.version_info[0] > 2:
            # On Python 3, strip the u prefix from the expected output
            # so that u'...' and u"..." compare equal to '...' and "...".
            want = re.sub(r"u'(.*?)'", r"'\1'", want)
            want = re.sub(r'u"(.*?)"', r'"\1"', want)
        return doctest.OutputChecker.check_output(self, want, got, optionflags)

if __name__ == "__main__":
    import unittest

    # Run this module's doctests through unittest with the custom checker;
    # the exit status is the number of failed tests.
    suite = doctest.DocTestSuite(sys.modules['__main__'], checker=Py23DocChecker())
    sys.exit(len(unittest.TextTestRunner().run(suite).failures))
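
One caveat: on Python 3 the repr shows ó literally, so stripping the u prefix alone may not be enough when the expected output contains an escape such as \xf3. The sketch below is one possible extension, not part of the original answer. It assumes that re-escaping the actual output with the backslashreplace error handler is acceptable for your tests. `EscapingPy23DocChecker`, `run_docstring` and `EXAMPLE` are hypothetical names, and the added `\b` keeps the pattern from matching a u that ends a longer word.

```python
import doctest
import re
import sys


class EscapingPy23DocChecker(doctest.OutputChecker):
    """Hypothetical variant: strip u prefixes and re-escape non-ASCII output."""

    def check_output(self, want, got, optionflags):
        if sys.version_info[0] > 2:
            # Drop the Python 2 style u prefix from the expected output.
            want = re.sub(r"\bu'(.*?)'", r"'\1'", want)
            want = re.sub(r'\bu"(.*?)"', r'"\1"', want)
            # Python 3 reprs show non-ASCII characters literally; escape
            # them so they compare equal to Python 2 style text (\xf3).
            got = got.encode("ascii", "backslashreplace").decode("ascii")
        return doctest.OutputChecker.check_output(self, want, got, optionflags)


def run_docstring(docstring, checker):
    """Run the examples in docstring; return (failed, attempted)."""
    test = doctest.DocTestParser().get_doctest(docstring, {}, "sketch", None, 0)
    runner = doctest.DocTestRunner(checker=checker, verbose=False)
    # Discard the failure report; only the counts are of interest here.
    return runner.run(test, out=lambda text: None)


EXAMPLE = r"""
>>> u"This is a tokenized sentence s\u00f3".split()
[u'This', u'is', u'a', u'tokenized', u'sentence', u's\xf3']
"""

if __name__ == "__main__":
    failed, attempted = run_docstring(EXAMPLE, EscapingPy23DocChecker())
    print("failed=%d attempted=%d" % (failed, attempted))
```

The trade-off is that expected output written with literal non-ASCII characters would no longer match under this checker, so it suits test suites that consistently use Python 2 style escapes.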
