(Scrapy) How to get the CSS rule for an HTML element?


Question

I am building a crawler using Scrapy. I need to get the font-family assigned to a particular HTML element.

Let's say there is a CSS file, styles.css, which contains the following:

p {
    font-family: "Times New Roman", Georgia, Serif;
}

And the HTML page contains the following text:

<p>Hello how are you?</p>

It's easy for me to extract the text using Scrapy, but I would also like to know the font-family applied to "Hello how are you?"

I am hoping it is simply a case of an (imaginary) XPath such as /p[font-family] or something like that.

Do you know how I can do this?

Thanks for your input.

Recommended answer

Scrapy does not apply stylesheets, so you need to download and parse the CSS separately. For CSS parsing you can use tinycss or even regex:

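import tinycss
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'myspider'
    start_urls = [
        'http://some.url.com'
    ]
    # maps each selector string to the list of its declaration values
    css_rules = {}

    def parse(self, response):
        # find css url and parse it; the XPath is left empty in the
        # original answer and depends on the page being crawled
        css_url = response.xpath("").extract_first()
        # resolve relative stylesheet links against the page URL
        yield Request(response.urljoin(css_url), self.parse_css)

    def parse_css(self, response):
        parser = tinycss.make_parser()
        # response.body is bytes, so use the bytes entry point
        stylesheet = parser.parse_stylesheet_bytes(response.body)
        for rule in stylesheet.rules:
            # at-rules such as @import have no selector; skip them
            if not getattr(rule, 'selector', None):
                continue
            path = rule.selector.as_css()
            css = [d.value.as_css() for d in rule.declarations]
            self.css_rules[path] = css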
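
The answer mentions regex as an alternative to tinycss. A minimal sketch of that route, using only the standard library, might look like the following. The function name `parse_css_rules` and the `{selector: {property: value}}` result shape are assumptions made for illustration, not part of the original answer, and the pattern only copes with flat rules.

```python
import re

# Matches "selector { declarations }" for flat rules. This pattern is not
# aware of at-rules such as @media, so any media context is ignored.
RULE_RE = re.compile(r'([^{}]+)\{([^{}]*)\}')


def parse_css_rules(css_text):
    """Return {selector: {property: value}} for flat CSS rules (hypothetical helper)."""
    # Drop /* ... */ comments first so they do not leak into selectors.
    css_text = re.sub(r'/\*.*?\*/', '', css_text, flags=re.S)
    rules = {}
    for selector_group, body in RULE_RE.findall(css_text):
        declarations = {}
        # Naive split: a ';' inside a quoted string or data URI would break it.
        for declaration in body.split(';'):
            name, sep, value = declaration.partition(':')
            if sep:
                declarations[name.strip().lower()] = value.strip()
        # "h1, p { ... }" applies to each selector in the group.
        for selector in selector_group.split(','):
            selector = selector.strip()
            if selector:
                rules.setdefault(selector, {}).update(declarations)
    return rules


css = '''
p {
    font-family: "Times New Roman", Georgia, Serif;
}
'''
rules = parse_css_rules(css)
print(rules['p']['font-family'])
```

For anything beyond simple stylesheets a real CSS parser is the safer choice, since regular expressions cannot track nesting or quoting reliably.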

Now you have a dictionary that maps CSS selectors to their declaration values, which you can use later in your spider's request chain to assign some values:

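def parse_item(self, response):
    # method of MySpider; uses the css_rules filled in by parse_css
    item = {}
    item['name'] = response.css('div.name').extract_first()
    name_css = []
    for k, v in self.css_rules.items():
        # crude substring match on the selector text
        if 'div' in k and '.name' in k:
            name_css.append(v)
    item['name_css'] = name_css
    yield item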
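
The substring test on the selector text is quite loose. As a rough sketch of a slightly stricter lookup, the helpers below match only simple selectors such as `p`, `.name` or `div.name`. The names `matches_simple` and `declared_value` are hypothetical, and the code assumes the rules are stored as `{selector: {property: value}}` rather than as plain value lists. It ignores specificity, inheritance and combinators, so it is not a real CSS cascade.

```python
def matches_simple(selector, tag, classes=()):
    """True if a simple selector like 'p', '.name' or 'div.name' matches."""
    selector = selector.strip()
    # Reject empty selectors and anything with combinators, ids,
    # attribute selectors or pseudo-classes; this sketch does not handle them.
    if not selector or any(ch in selector for ch in ' >+~#[:'):
        return False
    sel_tag, *sel_classes = selector.split('.')
    if sel_tag and sel_tag != '*' and sel_tag.lower() != tag.lower():
        return False
    return all(c in classes for c in sel_classes)


def declared_value(prop, tag, classes, css_rules):
    """Return the last declared value of prop among matching rules, or None."""
    value = None
    for selector, declarations in css_rules.items():
        if matches_simple(selector, tag, classes) and prop in declarations:
            # Later rules override earlier ones; specificity is ignored.
            value = declarations[prop]
    return value


css_rules = {
    'p': {'font-family': '"Times New Roman", Georgia, Serif'},
    'div.name': {'font-family': 'Arial'},
}
print(declared_value('font-family', 'p', [], css_rules))
```

If the value the browser actually renders is needed, including inherited and computed styles, a headless browser is likely a better fit than static parsing.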
