Web Crawler not working in nested divs


Problem Description


I am trying to make a web crawler that picks the interest of the people. Here is the code:

import requests
from bs4 import BeautifulSoup

def facebook_spider():
    url = 'https://www.facebook.com/abhas.mittal7'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # look for every div with the target class
    for div in soup.find_all('div', attrs={'class': 'mediaRowWrapper'}):
        print(div.text)

facebook_spider()


It is not showing any results. However, if I target a different div class (the divs at the top of the page), it does show content. I thought there might be a problem with the nested divs, but I tried this code on a sample HTML page with many nested divs and it worked. Kindly help.
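One useful first check is whether the target class appears in the raw HTML at all: pages that render their content with JavaScript often do not include those divs in the static source that `requests` sees. A minimal sketch of such a check (the helper name `class_in_source` and the sample HTML here are hypothetical, for illustration only):

```python
import re

def class_in_source(html_text, class_name):
    """Return True if class_name appears inside a class attribute of the raw HTML."""
    pattern = r'class="[^"]*\b' + re.escape(class_name) + r'\b'
    return re.search(pattern, html_text) is not None

# sample static HTML with a nested div
sample = '<div class="top"><div class="mediaRowWrapper">hi</div></div>'
print(class_in_source(sample, "mediaRowWrapper"))  # → True
print(class_in_source(sample, "someOtherClass"))   # → False
```

If the check returns `False` on the fetched page source, the parser is working fine; the content simply is not in the HTML that the server returns.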

Recommended Answer

See if this works:

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.facebook.com/abhas.mittal7'
html = urllib.request.urlopen(url)
htmltext = html.read()

def gettext(htmltext):
    soup = BeautifulSoup(htmltext, "html.parser")
    for script in soup(["script", "style"]):
        script.extract()  # remove scripts and stylesheets

    text = soup.get_text()
    # normalize whitespace: strip each line, break on double spaces, drop empties
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text  # or print it, or whatever you see fit

gettext(htmltext)
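The same script/style-stripping idea can also be sketched with only the standard library's `html.parser`, with no third-party dependency. This `TextExtractor` class and the sample HTML are illustrative assumptions, not part of the original answer:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a script/style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep only non-blank text outside script/style
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """<html><head><style>body{color:red}</style></head>
<body><div><div>nested <b>text</b></div></div>
<script>var x = 1;</script></body></html>"""

parser = TextExtractor()
parser.feed(html)
print('\n'.join(parser.chunks))  # → nested / text (one chunk per line)
```

Note that text inside nested divs is reached the same way as top-level text: `handle_data` fires for every text node regardless of nesting depth.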

