Scrapy: Why are extracted strings in this format?

Problem description

I'm doing

item['desc'] = site.select('a/text()').extract()

but the result is printed like this:

[u'\n                    A mano libera\n                  ']

What must I do to trim the value and remove the strange characters, such as [u'\n , the trailing spaces and '] ?

I cannot trim (strip) the result directly, because that raises:

exceptions.AttributeError: 'list' object has no attribute 'strip'

If I convert the result to a string and then strip it, I still get the string shown above, which I assume is in UTF-8.

Solution

The HTML page may very well contain these whitespace characters.

What you retrieve is a list of unicode strings, which is why you can't simply call strip on it. If you want to strip these whitespace characters from each string in this list, you can run the following:

>>> [s.strip() for s in [u'\n                    A mano libera\n                  ']]
[u'A mano libera']
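
The question also mentions that converting the result to a string and then stripping it changed nothing. A minimal sketch of why, assuming Python 3 (where the u prefix is no longer displayed) and using the sample value from the question: str() on a list returns its printable representation, so the brackets, quotes and escaped newlines become literal text and strip() finds no whitespace at either end.

```python
# Sample value copied from the question's output (Python 3, so no u prefix).
extracted = ['\n                    A mano libera\n                  ']

# str() on a list yields its representation: '[', the quotes and the
# two-character sequence backslash + n are now ordinary characters.
as_text = str(extracted)

# strip() only removes leading/trailing whitespace; the text starts with '['
# and ends with ']', so nothing is removed.
stripped_text = as_text.strip()

# Stripping each element works because every element is a real string.
cleaned = [s.strip() for s in extracted]

print(stripped_text == as_text)  # True
print(cleaned)                   # ['A mano libera']
```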

If only the first element matters to you, then simply do:

>>> [u'\n                    A mano libera\n                  '][0].strip()
u'A mano libera'
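
Indexing with [0] raises IndexError when the list is empty, which happens if the XPath matches nothing. One way to guard against that is a small helper such as the one sketched below. The name first_stripped and its default parameter are hypothetical and not part of Scrapy; the sketch assumes Python 3.

```python
def first_stripped(values, default=''):
    """Return the first string in `values` without surrounding whitespace.

    Hypothetical helper, not part of Scrapy. Returns `default` when the
    list is empty, i.e. when the selector matched nothing.
    """
    for value in values:
        # Return on the first element; later elements are ignored.
        return value.strip()
    return default


print(first_stripped(['\n        A mano libera\n      ']))  # A mano libera
print(repr(first_stripped([])))                             # ''
```

In the question's code, the extracted list would be passed to this helper before the result is assigned to item['desc'].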
