Is there a way to extract text along with text-links in Scrapy using CSS?

Question

I'm brand new to Scrapy. I have learned how to use response.css() for reading specific aspects from a web page, and am avoiding learning the xpath system. It seems to do the exact same thing, but in a different format (correct me if I'm wrong)

The site I'm scraping has long paragraphs of text, with an occasional linked text right in the middle. This sentence with a link to a picture of a dog is an example. I'm not sure if there is a way to have a spider read the text, with links in place (I've only been using response.css("p::text").extract())

Is there a way, using CSS (preferably) or xpath that I can grab all text in the paragraphs including the link-embedded text, without moving the links or link-text out of the sentence? The wording is difficult on this so apologies if I need to re-explain or give an example.

Edit: some clarification is needed, since this was poorly explained initially. A statement in this webpage can look like:

<p>My sentence has a <a href="https://www.google.com">link to google</a> in it.</p>

But when you use response.css("p::text").extract(), that sentence shows up as the list ["My sentence has a ", " in it."], completely omitting the text in the link. My goal is to get: ["My sentence has a link to google in it."]

Solution

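You can extract the text with the expression below. Note the space in 'p ::text': it selects every text node inside the <p>, including those nested in child elements such as <a>, whereas 'p::text' returns only the paragraph's direct text nodes: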

>>> txt = """<p>My sentence has a <a href="https://www.google.com">link to google</a> in it.</p>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> sel.css('p ::text').extract()
['My sentence has a ', 'link to google', ' in it.']
>>> # Join with an empty string: the text nodes already carry their own spaces.
>>> ''.join(sel.css('p ::text').extract())
'My sentence has a link to google in it.'

Or, for example, use the w3lib.html library to strip the HTML tags from your response, like this:

from w3lib.html import remove_tags
with_tags = response.css("p").get()
clean_text = remove_tags(with_tags)

But the first variant looks shorter and more readable.
