Scraping 'N' pages with BeautifulSoup and Requests (how to obtain the true page number)

Views: 224 | Posted: 2016/8/5 19:11:31 | Tags: python selenium beautifulsoup python-requests scrape

Question

I want to get all the titles on this website:

http://www.shyan.gov.cn/zwhd/web/webindex.action

Right now, my code successfully scrapes only one page. However, the site above has multiple pages, and I would like to scrape all of them.

For example, with the URL above, when I click the link to "page 2", the overall URL does not change. I looked at the page source and saw JavaScript code to advance to the next page like this: javascript:gotopage(2) or javascript:void(0). My code, which gets page 1, is here:

from bs4 import BeautifulSoup
import requests

url = 'http://www.shyan.gov.cn/zwhd/web/webindex.action'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
# Each title is a link inside a table cell with class "tit3"
titles = soup.select('td.tit3 > a')
for title in titles:
    print(title.get_text())

How can my code be changed to scrape titles from all the available listed pages? Thank you very much!

Recommended answer

Try using the following URL format:

http://www.shiyan.gov.cn/zwhd/web/webindex.action?keyWord=&searchType=3&page.currentpage=2&page.pagesize=15&page.pagecount=2357&docStatus=&sendOrg=
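
As a rough illustration, the URL format above can be generated for any page number and fed into the scraping loop from the question. This is only a sketch: the helper names page_url and scrape_titles are made up here, the selector td.tit3 > a is taken from the question, and it assumes the server honors page.currentpage as a query parameter, as the URL suggests. What the value 2357 counts, and whether the server validates it, is not stated, so it is passed through unchanged.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.shiyan.gov.cn/zwhd/web/webindex.action"


def page_url(page, page_size=15, page_count=2357):
    """Build the listing URL for one page (hypothetical helper)."""
    params = {
        "keyWord": "",
        "searchType": 3,
        "page.currentpage": page,
        "page.pagesize": page_size,
        "page.pagecount": page_count,  # copied from the answer's URL
        "docStatus": "",
        "sendOrg": "",
    }
    return BASE_URL + "?" + urlencode(params)


def scrape_titles(first_page, last_page):
    """Yield title text from each page in the inclusive range."""
    # Imported here so page_url() can be used without these packages.
    import requests
    from bs4 import BeautifulSoup

    for page in range(first_page, last_page + 1):
        response = requests.get(page_url(page), timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "lxml")
        # Same selector as in the question's single-page code.
        for link in soup.select("td.tit3 > a"):
            yield link.get_text(strip=True)
```

Calling scrape_titles(1, 3) and printing each item would then list the titles from the first three pages, provided the site still responds to this URL format.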

The site uses JavaScript to pass hidden page information to the server when requesting the next page. When you view the source you will find:

<form action="/zwhd/web/webindex.action" id="searchForm" name="searchForm" method="post">
 <div class="item">
     <div class="titlel">
      <span>留言查询</span>
     <label class="dow"></label>
     </div>
     <input type="text" name="keyWord" id="keyword" value="" class="text"/>
     <div class="key">
        <ul>
            <li><span><input type="radio" checked="checked" value="3" name="searchType"/></span><p>编号</p></li>
            <li><span><input type="radio" value="2" name="searchType"/></span><p>关键字</p></li>
        </ul>    
     </div>
     <input type="button" class="btn1" onclick="search();" value="查询"/>
  </div>
  <input type="hidden" id="pageIndex" name="page.currentpage" value="2"/>
  <input type="hidden" id="pageSize" name="page.pagesize" value="15"/>
  <input type="hidden" id="pageCount" name="page.pagecount" value="2357"/>
  <input type="hidden" id="docStatus" name="docStatus" value=""/>
  <input type="hidden" id="sendorg" name="sendOrg" value=""/>
  </form>
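
Since the form above uses method="post", another option is to read its fields from the downloaded page and submit them with a different page number. The following is a minimal sketch under that assumption; form_fields and payload_for_page are hypothetical helper names, and whether the server accepts such a POST has not been verified here.

```python
from bs4 import BeautifulSoup


def form_fields(html, form_id="searchForm"):
    """Collect name/value pairs from the inputs of the given form."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form", id=form_id)
    if form is None:
        return {}
    fields = {}
    for tag in form.find_all("input"):
        name = tag.get("name")
        if not name:
            continue  # inputs without a name (the search button) are not submitted
        if tag.get("type") == "radio" and not tag.has_attr("checked"):
            continue  # only the selected radio button is submitted
        fields[name] = tag.get("value", "")
    return fields


def payload_for_page(fields, page):
    """Return a copy of the form fields that requests the given page."""
    payload = dict(fields)
    payload["page.currentpage"] = str(page)
    return payload
```

With the HTML of page 1 in hand, something like requests.post(url, data=payload_for_page(form_fields(r.content), 2)) would request page 2. The page.pagecount field read this way reflects whatever the site currently reports, so it may help in deciding where to stop.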
