Scrape multiple pages with BeautifulSoup and Python


Question

My code successfully scrapes the tr align=center tags from [ http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY ] and writes the td elements to a text file.

However, there are multiple pages available at the site above in which I would like to be able to scrape.

For example, with the url above, when I click the link to "page 2" the overall url does NOT change. I looked at the page source and saw a javascript code to advance to the next page.

How can my code be changed to scrape data from all the available listed pages?

My code that works for page 1 only:

import bs4
import requests 

response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')

soup = bs4.BeautifulSoup(response.text)
soup.prettify()

acct = open("/Users/it/Desktop/accounting.txt", "w")

for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())

    acct.write(", ".join(stack) + '\n')

Answer

The trick here is to check the requests that are coming in and out of the page-change action when you click on the link to view the other pages. The way to check this is to use Chrome's inspection tool (via pressing F12) or installing the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer. See below for my settings.

Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, there will only be one request that will appear, and it's a POST method. All the other elements will quickly follow and fill the page. See below for what we're looking for.

Click on the above POST method. It should bring up a sub-window of sorts that has tabs. Click on the Headers tab. This page lists the request headers, pretty much the identification stuff that the other side (the site, for example) needs from you to be able to connect (someone else can explain this muuuch better than I do).
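
As a side note, here is a minimal sketch (my addition, not part of the original answer) of how those request headers map onto requests code. Supplying a browser-like User-Agent is shown purely as an illustration of what the Headers tab corresponds to; the scraping code further down works without it.

import requests

# Hypothetical illustration: explicitly passing a request header.
# The code below does not need this; it only shows what the
# "Request Headers" section in the inspector looks like in code.
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(
    "http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY",
    headers=headers,
)
print(resp.status_code)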

Whenever the URL has variables like page numbers, location markers, or categories, more often than not the site uses query strings. Long story short, it's similar to an SQL query (actually, it sometimes is an SQL query) that allows the site to pull the information you need. If this is the case, you can check the request headers for query string parameters. Scroll down a bit and you should find it.
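
To make that concrete, here is a small sketch (my addition, not from the original answer) of how those query-string variables map onto a requests call; requests will URL-encode a params dict into exactly this kind of query string.

import requests

# The three variables visible in the URL, expressed as a dict.
params = {"campId": 1, "termId": 201501, "subjId": "ACCY"}
resp = requests.get("http://my.gwu.edu/mod/pws/courses.cfm", params=params)
# resp.url now ends in ?campId=1&termId=201501&subjId=ACCY
print(resp.url)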

As you can see, the query string parameters match the variables in our URL. A little bit below, you can see Form Data with pageNum: 2 beneath it. This is the key.
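
If you wanted to replicate that request as-is, a sketch along these lines would send pageNum as POST form data (this is my assumption about how the endpoint behaves; the rest of the answer shows the same value also works as a plain query parameter, which is what the final code uses).

import requests

base_url = ("http://my.gwu.edu/mod/pws/courses.cfm"
            "?campId=1&termId=201501&subjId=ACCY")
# Assumption: the server accepts pageNum as form data, mirroring what the
# inspector shows when a page link is clicked.
resp = requests.post(base_url, data={"pageNum": 2})
print(resp.status_code)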

POST requests are more commonly known as form requests because these are the kind of requests made when you submit forms, log in to websites, etc. Basically, pretty much anything where you have to submit information. What most people don't see is that POST requests have a URL that they follow. A good example of this is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or some such.

What the above paragraph basically means is that you can (but not always) append the form data to your URL and it will carry out the POST request for you on execution. To know the exact string you have to append, click on view source.

Test if it works by adding it to the URL.
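
For example, a quick sanity check along these lines (a sketch, not part of the original answer) fetches page 1 and page 2 and confirms they list different course rows:

import bs4
import requests

base = "http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY"
page1 = bs4.BeautifulSoup(requests.get(base).text, "html.parser")
page2 = bs4.BeautifulSoup(requests.get(base + "&pageNum=2").text, "html.parser")
# If pageNum is honored, the two pages contain different course rows.
rows1 = [tr.get_text(strip=True) for tr in page1.find_all("tr", align="center")]
rows2 = [tr.get_text(strip=True) for tr in page2.find_all("tr", align="center")]
print(rows1 != rows2)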

Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.

The modified code follows:

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)

soup = bsoup(r.text)
# Use regex to isolate only the links of the page numbers, the one you click on.
page_count_links = soup.find_all("a",href=re.compile(r".*javascript:goToPage.*"))
try: # Make sure there are more than one page, otherwise, set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because Python range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save self from grief.
with open("results.txt","wb") as acct:
    for url_ in url_list:
        print "Processing {}...".format(url_)
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')

We use regular expressions to get the proper links. Then, using a list comprehension, we build a list of URL strings. Finally, we iterate over them.

The result:

Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]

Hope that helps.

Out of sheer boredom, I think I just created a scraper for the entire class directory. I also updated both the code above and the code below so they do not error out when only a single page is available.

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses\.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list

with open("results.txt","wb") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)

        soup = bsoup(r.text)
        # Use regex to isolate only the links of the page numbers, the one you click on.
        page_count_links = soup.find_all("a",href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1

        # Add 1 because Python range.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

        # Open the text file. Use with to save self from grief.
        for url_ in url_list:
            print "Processing {}...".format(url_)
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text)
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.findAll('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')

