Scrape multiple pages with BeautifulSoup and Python


Question

My code successfully scrapes the tr align=center tags from http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY and writes the td elements to a text file.

However, there are multiple pages available at the site above that I would also like to scrape.

For example, with the url above, when I click the link to "page 2" the overall url does NOT change. I looked at the page source and saw a javascript code to advance to the next page.

How can my code be changed to scrape data from all the available listed pages?

My code that works for page 1 only:

import bs4
import requests 

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')

soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")

for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())

    acct.write(", ".join(stack) + '\n')

Solution

The trick here is to check the requests that are coming in and out of the page-change action when you click on the link to view the other pages. The way to check this is to use Chrome's inspection tool (press F12) or to install the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer. See below for my settings.

Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, only one request will appear, and it's a POST method. All the other elements will quickly follow and fill the page. See below for what we're looking for.

Click on the above POST method. It should bring up a sub-window of sorts that has tabs. Click on the Headers tab. This page lists the request headers, pretty much the identification stuff that the other side (the site, for example) needs from you to be able to connect (someone else can explain this muuuch better than I do).

Whenever the URL has variables like page numbers, location markers, or categories, more often than not, the site uses query-strings. Long story short, it's similar to an SQL query (actually, it sometimes is an SQL query) that allows the site to pull the information you need. If this is the case, you can check the request headers for query string parameters. Scroll down a bit and you should find it.

As you can see, the query string parameters match the variables in our URL. A little bit below, you can see Form Data with pageNum: 2 beneath it. This is the key.
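To make the query-string idea concrete, here is a small sketch of my own (not part of the original answer) that decodes the URL's query string with the standard library; the try/except import just covers Python 2, which the rest of this answer's code targets, as well as Python 3:

try:  # Python 3
    from urllib.parse import urlparse, parse_qs
except ImportError:  # Python 2
    from urlparse import urlparse, parse_qs

url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
# Split the URL apart and decode its query string into a dict of parameters.
print(parse_qs(urlparse(url).query))
# {'campId': ['1'], 'termId': ['201501'], 'subjId': ['ACCY']} (order may vary)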

POST requests are more commonly known as form requests because these are the kind of requests made when you submit forms, log in to websites, etc. Basically, pretty much anything where you have to submit information. What most people don't see is that POST requests have a URL that they follow. A good example of this is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or somesuch.

What the above paragraph basically means is that you can (but not always) append the form data to your URL and it will carry out the POST request for you on execution. To know the exact string you have to append, click on view source.

Test if it works by adding it to the URL.
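If it helps to see that test as code rather than in the address bar, here is a minimal sketch (my addition, assuming the server accepts pageNum as an ordinary query-string parameter, which is exactly what the rest of this answer relies on):

import requests

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
# Append the form field from the POST request to the query string and issue a plain GET.
r = requests.get(base_url + '&pageNum=2')
print(r.status_code)  # 200 if the server is happy with the extra parameter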

Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.

Modified code is below:

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)

soup = bsoup(r.text)
# Use regex to isolate only the links of the page numbers, the one you click on.
page_count_links = soup.find_all("a",href=re.compile(r".*javascript:goToPage.*"))
try: # Make sure there are more than one page, otherwise, set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because Python range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save self from grief.
with open("results.txt","wb") as acct:
    for url_ in url_list:
        print "Processing {}...".format(url_)
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')

We use regular expressions to get the proper links. Then, using a list comprehension, we build a list of URL strings. Finally, we iterate over them.
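To see what that regex actually matches, here is a tiny self-contained demo of my own; the markup is hypothetical, just shaped like the pager links on the course pages:

from bs4 import BeautifulSoup
import re

# Hypothetical pager markup, only to illustrate what the regex is matching.
html = '''
<a href="javascript:goToPage(1);">1</a>
<a href="javascript:goToPage(2);">2</a>
<a href="javascript:goToPage(3);">3</a>
'''
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a', href=re.compile(r'.*javascript:goToPage.*'))
print(int(links[-1].get_text()))  # -> 3, i.e. the number of pages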

Results:

Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]

Hope that helps.

EDIT:

Out of sheer boredom, I think I just created a scraper for the entire class directory. Also, I updated both the code above and the code below so that they don't error out when only a single page is available.

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list

with open("results.txt","wb") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)

        soup = bsoup(r.text)
        # Use regex to isolate only the links of the page numbers, the one you click on.
        page_count_links = soup.find_all("a",href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1

        # Add 1 because Python range.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

        # Open the text file. Use with to save self from grief.
        for url_ in url_list:
            print "Processing {}...".format(url_)
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text)
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.findAll('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')
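One last note of my own, not part of the original answer: the code above is written for Python 2 (print statements, writing str to a file opened in 'wb'). On Python 3, the single-subject loop could be adapted roughly like this:

from bs4 import BeautifulSoup
import requests
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')

# Same trick as above: the text of the last goToPage link is the page count.
page_links = soup.find_all('a', href=re.compile(r'.*javascript:goToPage.*'))
num_pages = int(page_links[-1].get_text()) if page_links else 1

with open('results.txt', 'w') as acct:            # text mode on Python 3
    for page in range(1, num_pages + 1):
        url = '{}&pageNum={}'.format(base_url, page)
        print('Processing {}...'.format(url))     # print() is a function here
        page_soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for tr in page_soup.find_all('tr', align='center'):
            cells = [td.text.replace('\n', '').replace('\t', '').strip()
                     for td in tr.find_all('td')]
            acct.write(', '.join(cells) + '\n')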
