Beautiful Soup: scraping a table after logging in to a website

Problem description

I have a piece of Python code which logs me into a website. I am trying to extract the data from a particular table, but I'm getting errors and I'm not sure how to resolve them after searching online.

Here is my code written in my f.py file:

import mechanize
from bs4 import BeautifulSoup
import cookielib  # Python 2 module name; on Python 3 this is http.cookiejar
import requests

# Browser with a cookie jar so the login session persists across requests
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("http://kingmedia.tv/home")

# Fill in and submit the first form on the page (the login form)
br.select_form(nr=0)
br.form['vb_login_username'] = 'abcde'
br.form['vb_login_password'] = '12345'
br.submit()

a = br.response().read()

# Fetch the forum page after logging in
url = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()

print(url)

soup = BeautifulSoup(requests.get(url).text, 'lxml')
for table in soup.select('table#tborder tr')[1:]:
    cell = table.select_one('td').get_text(strip=True)
    print(cell)

print(url) gives me the HTML of the page, which I have shown below, and from which I want to extract the table data. The table I am interested in is the one with class="tborder".

Update: 5/7/2021

Using soup = BeautifulSoup(content, 'lxml') as suggested by @Code-Apprentice, I am able to get the desired data. However, I am struggling to obtain it fully.

I need this table, and the source code from the link is as follows:

<td class="alt1" width="100%"><div><font size="2"><div>
    <a href="calendar.php?do=getinfo&amp;day=2021-5-7&amp;e=18116&amp;c=1">Live: EPL - Leicester v Newcastle (CH3)</a>: 05/07/21 to 05/07/21
</div><div>
    <a href="calendar.php?do=getinfo&amp;day=2021-5-8&amp;e=18121&amp;c=1">Live: EPL - Liverpool v Southampton (CH3)</a>: 05/08/21 to 05/08/21
</div><div>
    <a href="calendar.php?do=getinfo&amp;day=2021-5-8&amp;e=18123&amp;c=1">Live: UFC PreLims (CH2)</a>: 05/08/21 to 05/08/21
</div><div>
    <a href="calendar.php?do=getinfo&amp;day=2021-5-8&amp;e=18124&amp;c=1">Live: UFC - Sandhagen v Dillashaw (CH2)</a>: 05/08/21 to 05/09/21
</div><div>
    <a href="calendar.php?do=getinfo&amp;day=2021-5-8&amp;e=18120&amp;c=1">Live: EPL - Man City v Chelsea (CH3)</a>: 05/08/21 to 05/08/21
</div><div>
    <a href="calendar.php?do=getinfo&amp;day=2021-5-8&amp;e=18122&amp;c=1">Live: La Liga - Barcelona v Atletico Madrid (CH6)(beIn)</a>: 05/08/21 to 05/08/21
</div><div>
    <a href="calendar.php?do=getinfo&amp;day=2021-5-8&amp;e=18118&amp;c=1">Live: EPL - Leeds v Tottenham (CH3)</a>: 05/08/21 to 05/08/21
</div><div>
    <a href="calendar.php?do=getinfo&amp;day=2021-5-8&amp;e=18125&amp;c=1">Live: F1 Qualifying (CH2)</a>: 05/08/21 to 05/08/21
</div><div>
    <a href="calendar.php?do=getinfo&amp;day=2021-5-8&amp;e=18119&amp;c=1">Live: EPL - Sheff Utd v Crystal Palace (CH3)</a>: 05/08/21 to 05/08/21
</div></font><br>View More Detailed Calendar <a href="/home/calendar.php">HERE</a></div></td>
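As a minimal sketch (not part of the original post, and using illustrative variable names), the event titles and date ranges could be pulled out of a fragment like this by parsing it directly with BeautifulSoup and filtering the links on their calendar.php?do=getinfo href prefix:

from bs4 import BeautifulSoup

# Shortened stand-in for the full <td class="alt1"> fragment quoted above.
fragment = '''<td class="alt1"><div><font size="2"><div>
<a href="calendar.php?do=getinfo&amp;day=2021-5-7&amp;e=18116&amp;c=1">Live: EPL - Leicester v Newcastle (CH3)</a>: 05/07/21 to 05/07/21
</div><div>
<a href="calendar.php?do=getinfo&amp;day=2021-5-8&amp;e=18121&amp;c=1">Live: EPL - Liverpool v Southampton (CH3)</a>: 05/08/21 to 05/08/21
</div></font><br>View More Detailed Calendar <a href="/home/calendar.php">HERE</a></div></td>'''

soup = BeautifulSoup(fragment, 'lxml')

# Match only the calendar-event links, skipping the trailing "HERE" link.
for link in soup.select('a[href^="calendar.php?do=getinfo"]'):
    title = link.get_text(strip=True)
    # The date range is the text node immediately following the </a>.
    dates = (link.next_sibling or '').strip(' :\n')
    print(title, '->', dates)

The same loop can be run against the soup built from the logged-in response instead of a literal string.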

Recommended answer

url = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()

print(url)

soup = BeautifulSoup(requests.get(url).text, 'lxml')

This looks very suspicious. You are reading the content of one HTTP response, then using it as the URL for another request. Instead, just parse the content of the first request with Beautiful Soup:

content = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()
soup = BeautifulSoup(content, 'lxml')

First, I renamed url to content to reflect what the variable actually represents. Second, I use content directly in the creation of the BeautifulSoup object.

Disclaimer: this still might not be exactly correct, but it should get you headed in the right direction.
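Building on that, here is a sketch of the corrected end-to-end flow (it assumes the mechanize login from the question succeeded and br is the logged-in browser). One further detail worth noting: the markup quoted in the question uses class="tborder", so the selector should be table.tborder; the original table#tborder would only match an element with id="tborder".

from bs4 import BeautifulSoup

# Sketch only: continues from the logged-in mechanize browser `br` above.
content = br.open("http://kingmedia.tv/home/forumdisplay.php?f=2").read()
soup = BeautifulSoup(content, 'lxml')

# class="tborder" is matched with a dot; '#tborder' would need id="tborder".
for row in soup.select('table.tborder tr')[1:]:
    cell = row.select_one('td')
    if cell is not None:  # skip rows without a <td>, e.g. header rows using <th>
        print(cell.get_text(strip=True))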
