How do I scrape pages with dynamically generated URLs using Python?


Problem Description

I am trying to scrape http://www.dailyfinance.com/quote/NYSE/international-business-machines/IBM/financial-ratios, but the traditional URL string building technique doesn't work, because the full company name is inserted into the path and the exact full company name isn't known in advance. Only the company symbol, "IBM", is known.

Essentially, the way I scrape is by looping through an array of company symbols and building the URL string before sending it to urllib2.urlopen(url). But in this case, that can't be done.
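
For illustration, this is roughly what that loop looks like when the path is fully determined by the symbol (a simplified sketch; the URL template and ticker list below are placeholders, not the real site's pattern):

import urllib2

# Simplified sketch of the usual approach: the URL is built from the
# symbol alone, which is exactly what does not work for dailyfinance.com.
symbols = ["IBM", "CSCO", "AAPL"]                               # illustrative list
url_template = "http://example.com/quote/%s/financial-ratios"   # placeholder pattern

for symbol in symbols:
    url = url_template % symbol
    page = urllib2.urlopen(url).read()
    # ... parse the page here ...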

For example, the URL string for CSCO is

http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios

and another example URL string, for AAPL, is:

http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios

So in order to get the URL, I have to search for the symbol in the input box on the main page:

http://www.dailyfinance.com/

I've noticed that when I type "CSCO" into the search input at http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios and watch the Firefox web developer Network tab, the GET request is sent to

http://j.foolcdn.com/tmf/predictivesearch?callback=_predictiveSearch_csco&term=csco&domain=dailyfinance.com

and that the Referer actually gives the path that I want to capture:

Host: j.foolcdn.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios?source=itxwebtxt0000007
Connection: keep-alive
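
That search request can also be reproduced outside the browser. The sketch below just fetches the endpoint for a given symbol and prints the raw JSONP response; nothing about the response format is assumed here:

import urllib2

# Sketch only: replay the predictive-search request seen in the Network tab.
# The response is JSONP whose exact structure isn't shown in this question.
symbol = "csco"
search_url = ("http://j.foolcdn.com/tmf/predictivesearch"
              "?callback=_predictiveSearch_%s&term=%s&domain=dailyfinance.com"
              % (symbol, symbol))
print(urllib2.urlopen(search_url).read())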

Sorry for the long explanation. So the question is: how do I extract the URL in the Referer? If that is not possible, how should I approach this problem? Is there another way?

Thank you very much for your help.

Recommended Answer

I like this question, and because of that, I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting over to Mechanize, if you really want to use that, is up to you. Requests will save you tons of headaches, though.

First off, you're probably looking for a POST request. However, a POST request is often not needed if a search function brings you straight to the page you're looking for. So let's inspect it, shall we?

When I land on the base URL, http://www.dailyfinance.com/, I can do a simple check via Firebug or Chrome's inspect tool: when I put CSCO or AAPL into the search bar and trigger the "jump", there's a 301 Moved Permanently status code. What does this mean?

In simple terms, I was transferred somewhere. The URL for this GET request is the following:

http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO

Now, we test whether it works with AAPL by using a simple URL manipulation.

import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
print r.url

The above gives the following result:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
[Finished in 2.3s]
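
If you'd rather confirm the redirect in code instead of in the browser tools, Requests keeps the redirect chain on the response object (a small sketch along the same lines):

import requests as rq

# requests follows the 301 automatically; the intermediate responses are
# kept in r.history, while r.url is the final, fully resolved address.
r = rq.get("http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=AAPL")
for hop in r.history:
    print("%s %s" % (hop.status_code, hop.url))
print(r.url)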

See how the URL of the response changed? Let's take the URL manipulation one step further and look for the /financial-ratios page by appending the following to the code above:

new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
print p.url

When run, the result is as follows:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios
[Finished in 6.0s]

Now we're on the right track. I will now try to parse the data using BeautifulSoup. My complete code is as follows:

from bs4 import BeautifulSoup as bsoup
import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)

soup = bsoup(p.content)
div = soup.find("div", id="clear").table
rows = div.find_all("tr")
for row in rows:
    print row

I then try running this code, only to encounter an error with the following traceback:

  File "C:\Users\nanashi\Desktop\test.py", line 13, in <module>
    div = soup.find("div", id="clear").table
AttributeError: 'NoneType' object has no attribute 'table'

Of note is the line 'NoneType' object.... This means our target div does not exist! Egads, but the table clearly shows up when I view the page in a browser.

There can only be one explanation: the table is loaded dynamically! Rats. Let's see if we can find another source for the table. I study the page and see that there are scrollbars at the bottom. This might mean that the table was loaded inside a frame, or was loaded straight from another source entirely and placed into a div on the page.
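
One quick way to back up that suspicion is to check the HTML we actually downloaded: if the target div isn't in the raw response, the table must be coming from somewhere else. A small heuristic sketch (refetching the page for clarity):

from bs4 import BeautifulSoup as bsoup
import requests as rq

# Heuristic check: is the target div in the raw HTML at all, and does the
# page reference any iframes that might be hosting the table?
p = rq.get("http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios")
soup = bsoup(p.content)
print(soup.find("div", id="clear"))   # None means the div isn't in the raw HTML
print(soup.find_all("iframe"))        # any frames worth following up on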

I refresh the page and watch the GET requests again. Bingo, I find something that looks promising:

A third-party source URL, and look, it's easily manipulable using the ticker symbol! Let's try loading it in a new tab: it serves up exactly the financial ratios table we're after.

WOW! We now have the very exact source of our data. The last hurdle, though, is whether it will work when we try to pull the CSCO data using this string (remember we went CSCO -> AAPL and are now back to CSCO again, so you don't get confused). Let's clean up the string and drop www.dailyfinance.com's role here completely. Our new URL is as follows:

http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL

Let's try using that in our final scraper!

from bs4 import BeautifulSoup as bsoup
import requests as rq

csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US="
new_url = url + csco_tick

r = rq.get(new_url)
soup = bsoup(r.content)

table = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print row.get_text()

And our raw results for CSCO's financial ratios data are as follows:

Company
Industry


Valuation Ratios


P/E Ratio (TTM)
15.40
14.80


P/E High - Last 5 Yrs 
24.00
28.90


P/E Low - Last 5 Yrs
8.40
12.10


Beta
1.37
1.50


Price to Sales (TTM)
2.51
2.59


Price to Book (MRQ)
2.14
2.17


Price to Tangible Book (MRQ)
4.25
3.83


Price to Cash Flow (TTM)
11.40
11.60


Price to Free Cash Flow (TTM)
28.20
60.20


Dividends


Dividend Yield (%)
3.30
2.50


Dividend Yield - 5 Yr Avg (%)
N.A.
1.20


Dividend 5 Yr Growth Rate (%)
N.A.
144.07


Payout Ratio (TTM)
45.00
32.00


Sales (MRQ) vs Qtr 1 Yr Ago (%)
-7.80
-3.70


Sales (TTM) vs TTM 1 Yr Ago (%)
5.50
5.60


Growth Rates (%)


Sales - 5 Yr Growth Rate (%)
5.51
5.12


EPS (MRQ) vs Qtr 1 Yr Ago (%)
-54.50
-51.90


EPS (TTM) vs TTM 1 Yr Ago (%)
-54.50
-51.90


EPS - 5 Yr Growth Rate (%)
8.91
9.04


Capital Spending - 5 Yr Growth Rate (%)
20.30
20.94


Financial Strength


Quick Ratio (MRQ)
2.40
2.70


Current Ratio (MRQ)
2.60
2.90


LT Debt to Equity (MRQ)
0.22
0.20


Total Debt to Equity (MRQ)
0.31
0.25


Interest Coverage (TTM)
18.90
19.10


Profitability Ratios (%)


Gross Margin (TTM)
63.20
62.50


Gross Margin - 5 Yr Avg
66.30
64.00


EBITD Margin (TTM)
26.20
25.00


EBITD - 5 Yr Avg
28.82
0.00


Pre-Tax Margin (TTM)
21.10
20.00


Pre-Tax Margin - 5 Yr Avg
21.60
18.80


Management Effectiveness (%)


Net Profit Margin (TTM)
17.10
17.65


Net Profit Margin - 5 Yr Avg
17.90
15.40


Return on Assets (TTM)
8.30
8.90


Return on Assets - 5 Yr Avg
8.90
8.00


Return on Investment (TTM)
11.90
12.30


Return on Investment - 5 Yr Avg
12.50
10.90


Efficiency


Revenue/Employee (TTM)
637,890.00
556,027.00


Net Income/Employee (TTM)
108,902.00
98,118.00


Receivable Turnover (TTM)
5.70
5.80


Inventory Turnover (TTM)
11.30
9.70


Asset Turnover (TTM)
0.50
0.50

[Finished in 2.0s]

Cleaning up the data is up to you.
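
If you want something more structured than raw text, one possible direction is sketched below. It assumes each data row in the table has a label cell followed by the Company and Industry values, which is how the output above reads; adjust the cell handling if the markup differs:

from bs4 import BeautifulSoup as bsoup
import requests as rq

csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=" + csco_tick
soup = bsoup(rq.get(url).content)
table = soup.find("div", id="clear").table

results = []
for row in table.find_all("tr"):
    cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
    if len(cells) == 3:                       # assumed layout: label, Company, Industry
        results.append({"metric": cells[0],
                        "company": cells[1],
                        "industry": cells[2]})

for item in results:
    print("%(metric)s: company=%(company)s, industry=%(industry)s" % item)

From there it is straightforward to write the rows out to CSV or key them by metric name.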

One good lesson to learn from this scrape is that not all the data is contained in a single page. It's pretty nice to see it coming from another static site. If it had been produced via JavaScript or AJAX calls or the like, our approach would likely have run into some difficulties.

Hopefully you learned something from this. Let us know if this helps, and good luck.
