Not able to parse webpage contents using Beautiful Soup

Problem description

I have been using Beautiful Soup to parse webpages for some data extraction, and so far it has worked perfectly well on other pages. However, I'm now trying to count the number of <a> tags in this page:

from bs4 import BeautifulSoup
import requests
catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89

url = url_base + catsection + "?page=" + str(i)
print(url)

#This is the page I'm trying to parse and also the one in the hyperlink
#I get the correct url i'm looking for at this stage

r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

j=0
for num in soup.find_all('a'):
    j=j+1
print(j)

I'm getting 0 as the output. This makes me think that the two lines after r = requests.get(url) are probably not working (there is obviously no chance that the page contains zero <a> tags), and I'm not sure what alternative solution I can use here. Has anybody faced a similar problem or found a solution? Thanks in advance.

Recommended answer

You need to pass some additional information to the server along with the request, in this case a User-agent header.
The following code should work. You can experiment with other parameters as well.

from bs4 import BeautifulSoup
import requests
catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89

url = url_base + catsection + "?page=" + str(i)
print(url)

headers = {
    'User-agent': 'Mozilla/5.0'
}

#This is the page I'm trying to parse and also the one in the hyperlink
#I get the correct url i'm looking for at this stage

r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

j=0
for num in soup.find_all('a'):
    j=j+1
print(j)
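
If the count still looks wrong, it can help to separate the download from the parsing so that a blocked or failed request is not mistaken for a page without links. The following is only a minimal sketch under a few assumptions: `fetch_html`, `count_links`, and `HEADERS` are hypothetical names introduced here, the 10-second timeout is an arbitrary choice, and a browser-like User-Agent may not be sufficient for every site.

```python
from bs4 import BeautifulSoup
import requests

# Assumption: a browser-like User-Agent is enough for this site; other
# sites may expect different or additional headers.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def count_links(html):
    """Return the number of <a> tags found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("a"))


def fetch_html(url, headers=None):
    """Download a page and return its text (hypothetical helper)."""
    response = requests.get(url, headers=headers, timeout=10)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    # so an error response is not silently parsed as if it were the page.
    response.raise_for_status()
    return response.text
```

With these helpers, `count_links(fetch_html(url, headers=HEADERS))` gives the same count as the loop above, and comparing it against a call without `headers` shows whether the header is what changes the response.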
