Scrape original links and headlines from Facebook posts


Question

I need to gather some information that Facebook Analytics does not provide, for example the original URL and headline of an article promoted on Facebook as a link post. This information is buried in the HTML of a Facebook post, but I am struggling to dig it out. I would appreciate your help.

Let's take an example: https://www.facebook.com/bbcnews/posts/10156428513547217

I identified the classes for the link (bbc.in...): "_6ks", and for the headline: "mbs _6m6 _2cnj _5s6c".

The code below doesn't return anything:

from bs4 import BeautifulSoup
import requests

link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'
r = requests.get(link)
soup = BeautifulSoup(r.content, "lxml")
for paragraph in soup.find_all("div", class_="_6ks"):
    for a in paragraph("a"):
        print(a.get('href'))
for paragraph in soup.find_all("div", class_='mbs _6m6 _2cnj _5s6c'):
    for a in paragraph("a"):
        print(a.get('hover'))

Recommended answer

Another way to achieve the same would be something like below:

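from bs4 import BeautifulSoup
import requests

link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'

# Send a browser-like User-Agent instead of the default one used by requests.
res = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
# Drop the HTML comment markers so that any markup inside comments gets parsed.
comment = res.text.replace("-->", "").replace("<!--", "")
soup = BeautifulSoup(comment, "lxml")
# First link inside an element with class "mbs" (the headline block).
item = soup.select_one('.mbs a')
if item is not None:
    print(item.get("href"), item.text, sep="\n")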
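The answer does not say why stripping the comment markers helps. A plausible reading is that the post markup was delivered inside HTML comments, which a parser treats as opaque text rather than as elements. The sketch below illustrates that effect offline. Note that `SAMPLE_HTML`, its URL, and the helper `find_headline_link` are hypothetical stand-ins made up for illustration, not Facebook's actual markup, and that it uses the built-in "html.parser" instead of "lxml" only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a page whose interesting markup
# is wrapped in an HTML comment (not real Facebook output).
SAMPLE_HTML = """
<html><body>
<div class="hidden_elem"><code><!--
<div class="mbs _6m6 _2cnj _5s6c">
  <a href="https://example.com/article">Example headline</a>
</div>
--></code></div>
</body></html>
"""


def find_headline_link(html, unwrap_comments):
    """Return (href, text) of the first '.mbs a' match, or None if absent."""
    if unwrap_comments:
        # Same trick as in the answer: remove the comment delimiters so the
        # commented-out markup becomes part of the parsed tree.
        html = html.replace("<!--", "").replace("-->", "")
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.select_one(".mbs a")
    if anchor is None:
        return None
    return anchor.get("href"), anchor.get_text(strip=True)


# Without unwrapping, the selector finds nothing; with it, the link appears.
print(find_headline_link(SAMPLE_HTML, unwrap_comments=False))
print(find_headline_link(SAMPLE_HTML, unwrap_comments=True))
```

Returning None instead of raising keeps the caller in control when the selector does not match, which matters here because the class names come from generated markup and may change.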
