Python scraping Reuters site... bad XPath?

Question

I am trying to do something that appeared to be simple: scrape the company names in the Reuters list at this link:

http://www.reuters.com/finance/markets/index?symbol=us!spx&sortBy=&sortDir=&pn=

However, I just can't access the company names. After trying a lot of XPath queries, I still have problems accessing the table. I am trying to grab names such as "3M Company" and "Abbott Laboratories".

Here are snippets of code I have used:

scrape = []
companies =[]
import lxml
import lxml.html
import lxml.etree

urlbase = 'http://reuters.com/finance/markets/index?symbol=us!spx&sortBy=&sortDir=&pn='
for i in range(1:18):
    url = urlbase+str(i)
    content = lxml.html.parse(url)
    item = content.xpath('XPATH HERE')
    ticker = [thing.text for thing in item]

Here are the XPaths I have been trying:

'//*[@id="topContent"]/div/div[2]/div[1]/table/tr[2]/td[1]/a'
'//*[@id="topContent"]/div/div[2]/div[1]/table/tbody/tr[2]/td[1]/a'
'/html/body/div[3]/div[3]/div/div[2]/div/table/tbody/tr[3]/td/a'
'/html/body/div[3]/div[3]/div/div[2]/div/table/tr[3]/td/a'

I have also tried accessing that one particular table through '//table[@class="dataTable sortable"]', but have not had any luck.

Can anyone help? I feel like this is something that someone who knows what they are doing will be able to fix rather quickly. Thanks!

Recommended answer

The page you're trying to scrape has a form inside the table. The correct xpath should be '//table[@class="dataTable sortable"]/form/tr/td[1]/a'
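To see why the extra /form step matters, here is a minimal sketch that runs offline. It makes two assumptions: the markup is a hand-written, simplified stand-in for the page (not the real Reuters HTML), and it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, in place of lxml. Because /tr selects direct children only, the path without /form finds nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (not the real Reuters HTML).
SAMPLE = """
<html><body>
<table class="dataTable sortable">
  <form>
    <tr><th>Name</th><th>Symbol</th></tr>
    <tr><td><a href="#">3M Company</a></td><td>MMM</td></tr>
    <tr><td><a href="#">Abbott Laboratories</a></td><td>ABT</td></tr>
  </form>
</table>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# "/" selects direct children only, so skipping the <form> level matches nothing.
without_form = root.findall(".//table[@class='dataTable sortable']/tr/td[1]/a")

# Including the <form> step reaches the rows nested inside it.
with_form = root.findall(".//table[@class='dataTable sortable']/form/tr/td[1]/a")

names_without_form = [a.text for a in without_form]
names_with_form = [a.text for a in with_form]

print(names_without_form)
print(names_with_form)
```

With lxml, the equivalent query on the parsed page would start with //table instead of .//table.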

Also, your loop has a syntax error: it should be range(1,18) instead of range(1:18). Here's the final code that works on my side:

import lxml.html

urlbase = 'http://reuters.com/finance/markets/index?symbol=us!spx&sortBy=&sortDir=&pn='
# range(1, 18) covers result pages 1 through 17.
for i in range(1, 18):
    url = urlbase + str(i)
    content = lxml.html.parse(url)
    # The rows sit inside a <form> that is a direct child of the table.
    item = content.xpath('//table[@class="dataTable sortable"]/form/tr/td[1]/a')
    # Collect the link text (the company name) from the first cell of each row.
    ticker = [thing.text for thing in item]
    print(ticker)
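On the range fix: the colon form is slice syntax, which is only valid inside square brackets, so Python rejects range(1:18) before any code runs. The short check below is only an illustration; the names pages and colon_form_is_valid are hypothetical, not part of the original code.

```python
# range(1, 18) yields the page numbers 1 through 17 (the stop value is excluded).
pages = list(range(1, 18))

# The colon form cannot even be compiled, so it fails before any request is made.
try:
    compile("range(1:18)", "<example>", "eval")
    colon_form_is_valid = True
except SyntaxError:
    colon_form_is_valid = False

print(pages[0], pages[-1], len(pages))
print(colon_form_is_valid)
```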
