Using BeautifulSoup to parse Facebook

Question

I'm trying to parse public Facebook pages using BeautifulSoup. I've managed to scrape LinkedIn successfully, but I've spent hours trying to get it to work on Facebook with no luck. The code I'm trying to use looks like this:

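import urllib2
from bs4 import BeautifulSoup

for url in my_urls:
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        continue  # skip pages that fail to load
    soup = BeautifulSoup(page)
    # find_all returns a list, so look for links inside each matching div
    for div in soup.find_all("div", class_="fsl fwb fcb"):
        links = div.find_all("a")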

The frustrating part is that I can get the title element out, and I can even get pretty far down the document, but I can't reach the part I actually need.

This line successfully grabs the pageTitle:

info = soup.find_all("title", attrs={"id": "pageTitle"})

This line gets pretty far down the tree of elements, but can't go any farther:

info = soup.find_all(id="pagelet_timeline_main_column")

Here's a sample page that I'm trying to parse; I want the current city from it:

https://www.facebook.com/100004210542493

And here's a quick screenshot of what the part I want looks like:

http://prntscr.com/1t8xx6

I feel like I'm really close, but I just can't figure it out. Thanks in advance for any help!

EDIT 2: I should also mention that I can successfully print the whole soup and visually find the part I need, but for whatever reason the parsing just won't work the way it should.

Recommended answer

Try looking at the content returned by curl or wget. What you see in the browser is what has been rendered after JavaScript has executed.

wget https://www.facebook.com/100004210542493
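
The same check can be made from Python. The sketch below is only an illustration: it assumes Python 3 (`urllib.request` rather than the `urllib2` used in the question), and the helper names `fetch_raw_html` and `marker_in_raw_html` as well as the `User-Agent` value are made up for this example. It fetches the page without running any JavaScript and reports whether a piece of text you expect to see is present in the raw response at all.

```python
from urllib.request import Request, urlopen


def fetch_raw_html(url, timeout=10):
    """Return the HTML exactly as the server sends it (no JavaScript is run)."""
    # Hypothetical User-Agent value; some sites answer differently without one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def marker_in_raw_html(raw_html, marker):
    """Case-insensitive check for a piece of text in the raw response."""
    return marker.lower() in raw_html.lower()


# Example usage (performs a network request, so it is left commented out):
# html = fetch_raw_html("https://www.facebook.com/100004210542493")
# print(marker_in_raw_html(html, "fsl fwb fcb"))
```

If the marker is missing from the raw response, the element is probably being added by scripts in the browser, and no BeautifulSoup query on that response will find it.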

You might want to use mechanize or Selenium, since you want to simulate a client browser (instead of handling raw content).
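
A minimal sketch of the Selenium route, under several assumptions: the `selenium` package plus Firefox and its driver are installed, and the function name `render_page` and its `driver_factory` parameter are invented for this example (the parameter exists only so the function can be tried with a stand-in driver). It does not deal with logins or with content that loads some time after the page itself.

```python
def render_page(url, driver_factory=None):
    """Load url in a real browser and return the HTML after scripts have run."""
    if driver_factory is None:
        # Imported lazily so the function can be exercised with a stand-in driver.
        from selenium import webdriver
        # Assumption: Firefox and its driver are available on this machine.
        driver_factory = webdriver.Firefox
    driver = driver_factory()
    try:
        driver.get(url)
        # page_source is the DOM serialized at this moment, scripts included.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


# Example usage (opens a browser window, so it is left commented out):
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(render_page("https://www.facebook.com/100004210542493"), "html.parser")
```

The returned string can be handed to BeautifulSoup just like the result of `urlopen`, so the parsing code from the question would stay largely the same.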

Another related issue might be that Beautiful Soup cannot find a CSS class if the object has other classes, too.
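
The class-matching behaviour can be seen on a small, made-up HTML fragment (the markup below is invented for illustration and is not Facebook's real markup). With bs4, passing a multi-class string to `class_` matches only when the `class` attribute is exactly that string, whereas a CSS selector matches any element carrying all of the listed classes, in any order and alongside others.

```python
from bs4 import BeautifulSoup

# Invented sample markup: the second div has the same three classes in a
# different order, plus one extra class.
SAMPLE_HTML = """
<div class="fsl fwb fcb"><a href="/city-a">City A</a></div>
<div class="fcb fsl fwb extra"><a href="/city-b">City B</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match on the class attribute: only the first div qualifies.
exact = soup.find_all("div", class_="fsl fwb fcb")

# CSS selector: every div that carries all three classes qualifies.
by_selector = soup.select("div.fsl.fwb.fcb")

# Collect the links inside each matched div.
links = [a["href"] for div in by_selector for a in div.find_all("a")]

print(len(exact), len(by_selector), links)
```

If the target element's class list varies between pages, the selector form is likely the safer query.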
