Grabbing items using a selector within a Python script

Question

I've written some Python code to get company details and contact names from a webpage, using CSS selectors to collect those items. However, when I run it, I get only the first portion of "Contact Details" and "Contact": the text up to the first "br" tag, rather than the full string. How can I get the whole text instead of just that first part?

The script I'm trying:

import requests
from lxml import html

tree = html.fromstring(requests.get("https://www.austrade.gov.au/SupplierDetails.aspx?ORGID=ORG8000000314&folderid=1736").text)
for title in tree.cssselect("div.contact-details"):
    cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text
    cContact = title.cssselect("h4:contains('Contact')+p")[0].text
    print(cDetails, cContact)

The element containing the results:

<div class="contact-details block dark">
                <h3>Contact Details</h3><p>Company Name: Distance Learning Australia Pty Ltd<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:rto@dla.com.au">rto@dla.com.au</a><br>Web: <a target="_blank" href="http://dla.edu.au">http://dla.edu.au</a></p><h4>Address</h4><p>Suite 108A, 49 Phillip Avenue<br>Watson<br>ACT<br>2602</p><h4>Contact</h4><p>Name: Christine Jarrett<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:chris.jarrett@dla.com.au">chris.jarrett@dla.com.au</a></p>
            </div>

The result I get:

Company Name: Distance Learning Australia Pty Ltd Name: Christine Jarrett

The result I'm after:

Company Name: Distance Learning Australia Pty Ltd
Phone: +61 2 6262 2964
Fax: +61 2 6169 3168
Email: rto@dla.com.au

Name: Christine Jarrett
Phone: +61 2 6262 2964
Fax: +61 2 6169 3168
Email: chris.jarrett@dla.com.au

By the way, I want to do this using CSS selectors only, not XPath. Thanks in advance.

Recommended answer

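Simply replace the text property with the text_content() method, as below, to get the required output. The text property holds only the text that comes before the element's first child (here, the first "br"), whereas text_content() returns the text of the element together with that of all its descendants: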

cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text_content()
cContact = title.cssselect("h4:contains('Contact')+p")[0].text_content()
