beautifulsoup find_all can't get div data

Question

I am trying to get HTML data from a website, but data_table comes back empty. When I trace the code and try to get the header data, it returns HTML content.

import requests
from bs4 import BeautifulSoup
import html.parser
from html.parser import HTMLParser
import time
from random import randint
import sys
from IPython.display import clear_output
import pymysql

links = ['https://www.ptt.cc/bbs/Gossiping/index'+str(i+1)+'.html' for i in range(10)]
data_links=[]

for link in links:
    res = requests.get(link)
    soup = BeautifulSoup(res.text.encode("utf-8"),"html.parser")
    data_table = soup.findAll("div",{"id":"r-ent"})
    print(data_table)

Recommended answer

When you visit the page in a browser, you have to confirm that you are over 18 before you reach the actual content, so the confirmation page is what your code is receiving. You need to send a POST request to https://www.ptt.cc/ask/over18 with the data yes=yes and from="/bbs/Gossiping/index{the_number}.html". You can see the form if you print the source that is returned:

<form action="/ask/over18" method="post">
    <input type="hidden" name="from" value="/bbs/Gossiping/index1.html">
    <div class="over18-button-container">
        <button class="btn-big" type="submit" name="yes" value="yes">我同意,我已年滿十八歲<br><small>進入</small></button>
    </div>
    <div class="over18-button-container">
        <button class="btn-big" type="submit" name="no" value="no">未滿十八歲或不同意本條款<br><small>離開</small></button>
    </div>
</form>
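
As a minimal sketch of what that form implies, the helpers below build the fields the "yes" button would submit and check whether a response body still looks like the confirmation page. The names build_over18_payload and looks_like_age_check are hypothetical, and the substring check is only a heuristic based on the form markup shown above; nothing here contacts the site.

```python
from urllib.parse import urlencode

OVER18_URL = "https://www.ptt.cc/ask/over18"


def build_over18_payload(index):
    """Return the form fields submitted by the 'yes' button for one index page.

    Hypothetical helper: the field names come from the form shown above.
    """
    return {"from": "/bbs/Gossiping/index{}.html".format(index), "yes": "yes"}


def looks_like_age_check(html):
    """Heuristic: the confirmation page contains a form that posts to /ask/over18."""
    return 'action="/ask/over18"' in html


payload = build_over18_payload(1)
# urlencode shows the body a form POST would carry for this payload.
print(urlencode(payload))
```

If looks_like_age_check returns True for res.text, the request was answered with the confirmation page and the r-ent divs will not be present.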

Also, there is no element with the id r-ent on the page; r-ent is a class, and only divs carry it:

import requests
from bs4 import BeautifulSoup

links = ['https://www.ptt.cc/bbs/Gossiping/index{}.html'.format(i) for i in range(1,11)]
data_links = []
data = {"yes":"yes"}
head = {"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}

for ind, link in enumerate(links, 1):
    with requests.Session() as s:
        data["from"] = "/bbs/Gossiping/index{}.html".format(ind)
        s.post("https://www.ptt.cc/ask/over18", data=data, headers=head)
        res = s.get(link, headers=head)
        soup = BeautifulSoup(res.text,"html.parser")
        data_divs= soup.select("div.r-ent")
        print(data_divs)

The code above gets you all the divs with the class r-ent.
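
To see the id versus class difference in isolation, here is a small offline sketch. SAMPLE_HTML is a made-up stand-in, not the real page markup; it only assumes, as described above, that the divs carry class="r-ent" and no id.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a listing page: two divs with class "r-ent", none with that id.
SAMPLE_HTML = """
<div class="r-ent"><a href="/a.html">first</a></div>
<div class="r-ent"><a href="/b.html">second</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_id = soup.find_all("div", {"id": "r-ent"})    # filters on the id attribute
by_class = soup.find_all("div", class_="r-ent")  # filters on the class attribute
by_css = soup.select("div.r-ent")                # CSS selector, same result as by_class

print(len(by_id), len(by_class), len(by_css))  # prints: 0 2 2
```

The id lookup comes back empty even though the divs are present, which is the same symptom as in the question once the age check is out of the way.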

It is probably fine to post just once when using a Session, because the cookies are stored, so the following code should work:

import requests
from bs4 import BeautifulSoup

links = ['https://www.ptt.cc/bbs/Gossiping/index{}.html'.format(i) for i in range(1,11)]
data_links=[]
data = {"yes":"yes"}
head = {"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}
with requests.Session() as s:
    # Confirm the age check once; the session keeps the resulting cookie.
    data["from"] = "/bbs/Gossiping/index1.html"
    s.post("https://www.ptt.cc/ask/over18", data=data, headers=head)
    for link in links:
        res = s.get(link, headers=head)
        soup = BeautifulSoup(res.text,"html.parser")
        data_divs= soup.select("div.r-ent")
        print(data_divs)
