Extract public posts from Facebook page without API/APP key/token/secret


Question

Just to clarify in advance: I don't have a Facebook account and I have no intention of creating one. Also, what I'm trying to achieve is perfectly legal in my country and in the USA.

Instead of using the Facebook API to get the latest timeline posts of a Facebook page, I want to send a GET request directly to the page URL (e.g. this page) and extract the posts from the HTML source code.
(I'd like to get the text and the creation time of each post.)

When I run this in the web console:

document.getElementsByClassName('userContent')

I get a list of elements containing the text of the latest posts.

But I'd like to extract that information from a Node.js script. I could probably do it quite easily using a headless browser such as puppeteer, but that would create a ton of unnecessary overhead. I'd really like a simple approach: download the HTML code, pass it to cheerio, and use cheerio's jQuery-like API to extract the posts.

Here is my attempt at doing exactly that:

// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

rp.get('https://www.facebook.com/pg/officialstackoverflow/posts/').then( postsHtml => {
    const $ = cheerio.load(postsHtml);

    const timeLinePostEls = $('.userContent');
    console.log(timeLinePostEls.html()); // should NOT be null
    const newestPostEl = timeLinePostEls.eq(0); // .eq(0) keeps the cheerio wrapper; .get(0) returns a raw node without .html()/.text()
    console.log(newestPostEl.html()); // should NOT be null
    const newestPostText = newestPostEl.text();
    console.log(newestPostText);
    //const newestPostTime = newestPostEl.parent(??).child('.livetimestamp').title;
    //console.log(newestPostTime);
}).catch(console.error);

Unfortunately, $('.userContent') does not work. However, I was able to verify that the data I'm looking for is embedded somewhere in that HTML code.

But I couldn't really come up with a good regex approach or the like to extract that data.

Depending on the post content, the number of HTML tags within the post varies heavily.

Here is a simple example of a post containing one link:

<div class="_5pbx userContent _3576" data-ft="&#123;&quot;tn&quot;:&quot;K&quot;&#125;"><p>We&#039;re proud to be named one of Built In NYC&#039;s Best Places to Work in 2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for Best Perks and Benefits. See what it took to make the list and check out our profile to see some of our job openings. <a href="https://l.facebook.com/l.php?u=https%3A%2F%2Fbit.ly%2F2H3Kbr2&amp;h=AT29h2HyDsEk0rHRWqJA-Fa4M1qi3nJT1NBi95othaR3qeFuFAMNiVS2Dgtv5KR5m0xqjw6kfwZdhZt0_D3UQT1Oel2UhxRql-KwkA1xqWvrql4u1jDhzrkGVT_XxoUd8_w8_fzLZzzhz23a8yPCK6IPfWKB76_CEFjG3b78y4dFJvY9Z08AYlR01dmi5_FvWVEVytkN-123u6alYE8pqL6Jb6dtIQUTWGXYJPaNMrtxkCUZniEVXEcILkwHGSuHqCTAarboyMP55F1vhYO3OAiVMkvjbN274fVq92YvbK3bi90bU9T-5ADWHDUJ-CwcofSBTW47chstQeY0n_UluD_rBIPLsfXVSnCtpRkR2kXi9zzHLnNeIYeNssv3i7UKS_f5Z2pnVT6xe3zJbNpB68doH1Z__I9nsTCNIyFyKx2VxabecoL03DIawbRrzBoxLAwzNPLACBjTkpEQhdVn4_wdAIjXRL4cLQDcZkLEoG_sspBgRePH23TFbNufQOBly-FNtLHnkUDO2Ca-FYvAGXpcu6J4B1aH3XFPB803lsz-GRdACyOFOgXDXJfwr4WtWzUHxfiOPULWiI43yI5L4aU6wYRhPjxua3RuRZ8oj9fXa1w4Jrht94Ue2wfKtz8" target="_blank" data-ft="&#123;&quot;tn&quot;:&quot;-U&quot;&#125;" rel="noopener nofollow" data-lynx-mode="async">http://*******/2H3Kbr2</a></p></div>

Formatted in a more readable form, it looks somewhat like this:

<div class="_5pbx userContent _3576" data-ft="&#123;&quot;tn&quot;:&quot;K&quot;&#125;">
    <p>
        We&#039;re proud to be named one of Built In NYC&#039;s Best Places to Work in 
        2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for 
        Best Perks and Benefits. See what it took to make the list and check out our 
        profile to see some of our job openings.
        <a href="VERY_LONG_URL.........." target="_blank" data-ft="&#123;&quot;tn&quot;:&quot;-U&quot;&#125;" rel="noopener nofollow" data-lynx-mode="async">SHORT_LINK.....</a>
    </p>
</div>

This regex seems to work okay, but I don't think it is very reliable:

/<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g

If, for example, the post contained another div element, it wouldn't work properly. In addition, this approach gives me no way of knowing the time/date the post was created.
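To illustrate the nested-div problem, here is a minimal sketch that runs the regex above against a made-up post fragment. The sample markup is hypothetical and far simpler than real Facebook HTML; it only shows that the lazy `(.+?)` group stops at the first closing `</div>` it meets.

```javascript
// The regex from above, unchanged.
const postRegex = /<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g;

// Hypothetical post fragment containing a nested <div> (not real Facebook markup).
const sampleHtml =
    '<div class="_5pbx userContent _3576" data-ft="x">' +
    '<p>Hello</p><div class="inner">nested</div><p>tail</p>' +
    '</div>';

const match = postRegex.exec(sampleHtml);
const captured = match[1];

// The capture ends at the inner </div>, so the trailing paragraph is lost.
console.log(captured);
```

The captured text is cut off before `<p>tail</p>`, which is why a real HTML parser is the safer choice here.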

Any ideas on how I could relatively reliably extract the most recent 2-3 posts, including the creation date/time?

Recommended answer

Okay, I finally figured it out. I hope this will be useful to others. This function will extract the 20 latest posts, including the creation time:

// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

function GetFbPosts(pageUrl) {
    const requestOptions = {
        url: pageUrl,
        headers: {
            'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0'
        }
    };
    return rp.get(requestOptions).then( postsHtml => {
        const $ = cheerio.load(postsHtml);
        const timeLinePostEls = $('.userContent').map((i,el)=>$(el)).get();
        const posts = timeLinePostEls.map(post=>{
            return {
                message: post.html(),
                created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
            }
        });
        return posts;
    });
}
GetFbPosts('https://www.facebook.com/pg/officialstackoverflow/posts/').then(posts=>{
    // Log all posts
    for (const post of posts) {
        console.log(post.created_at, post.message);
    }
});

Since Facebook messages can have complicated formatting, the message is returned as HTML rather than plain text. To drop the formatting and get only the text, replace message: post.html() with message: post.text().

Edit:
If you want to get more than the latest 20 posts, it is more complicated. The first 20 posts are served statically on the initial HTML page. All following posts are retrieved via ajax in chunks of 8 posts. It can be achieved like this:

// make sure your node.js version supports async/await (v10 and above should be fine)
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

class FbScrape {
    constructor(options={}) {
        this.headers = options.headers || {
            'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' // you may have to update this at some point
        };
    }

    async getPosts(pageUrl, limit=20) {
        const staticPostsHtml = await rp.get({ url: pageUrl, headers: this.headers });
        if (limit <= 20) {
            return this._parsePostsHtml(staticPostsHtml);
        } else {
            let staticPosts = this._parsePostsHtml(staticPostsHtml);
            const nextResultsUrl = this._getNextPageAjaxUrl(staticPostsHtml);
            const ajaxPosts = await this._getAjaxPosts(nextResultsUrl, limit-20);
            return staticPosts.concat(ajaxPosts);
        }
    }

    _parsePostsHtml(postsHtml) {
        const $ = cheerio.load(postsHtml);
        const timeLinePostEls = $('.userContent').map((i,el)=>$(el)).get();
        const posts = timeLinePostEls.map(post => {
            return {
                message: post.html(),
                created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
            }
        });
        return posts;
    }

    async _getAjaxPosts(resultsUrl, limit=8, posts=[]) {
        const responseBody = await rp.get({ url: resultsUrl, headers: this.headers });
        // Skip the 9-character non-JSON prefix at the start of the response before parsing
        const extractedJson = JSON.parse(responseBody.slice(9));
        const postsHtml = extractedJson.domops[0][3].__html;
        const newPosts = this._parsePostsHtml(postsHtml);
        const allPosts = posts.concat(newPosts);
        // Stop once at least `limit` posts have been collected
        if (allPosts.length >= limit)
            return allPosts;
        // Only look up the next page when another request is actually needed
        const nextResultsUrl = this._getNextPageAjaxUrl(postsHtml);
        return await this._getAjaxPosts(nextResultsUrl, limit, allPosts);
    }

    _getNextPageAjaxUrl(html) {
        return 'https://www.facebook.com' + /"(\/pages_reaction_units\/more[^"]+)"/g.exec(html)[1].replace(/&amp;/g, '&') + '&__a=1';
    }
}

const fbScrape = new FbScrape();
const minimum = 28; // minimum number of posts to request; gets rounded up to 20, 28, 36, 44, 52, 60, 68 etc. because of page sizes (page 1 = 20; all following pages = 8)
fbScrape.getPosts('https://www.facebook.com/pg/officialstackoverflow/posts/', minimum).then(posts => { // get at least the 28 latest posts
    // Log all posts
    for (const post of posts) {
        console.log(post.created_at, post.message);
    }
});
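Two pieces of the paging logic can be tried in isolation, without any network access. The sketch below is only an illustration under stated assumptions: `expectedPostCount` and `extractNextPageUrl` are hypothetical helper names, the page sizes (20, then 8) are the ones reported above, and the sample HTML is made up to match the shape the regex looks for. It also assumes the page has enough posts to fill every chunk.

```javascript
// Page sizes as reported above: 20 posts in the static HTML, 8 per ajax chunk.
const FIRST_PAGE_SIZE = 20;
const AJAX_PAGE_SIZE = 8;

// Hypothetical helper: how many posts come back when at least `minimum` are requested.
function expectedPostCount(minimum) {
    if (minimum <= FIRST_PAGE_SIZE) return FIRST_PAGE_SIZE;
    const ajaxPages = Math.ceil((minimum - FIRST_PAGE_SIZE) / AJAX_PAGE_SIZE);
    return FIRST_PAGE_SIZE + ajaxPages * AJAX_PAGE_SIZE;
}

// Hypothetical variant of _getNextPageAjaxUrl that returns null
// instead of throwing when no "more" link is present.
function extractNextPageUrl(html) {
    const match = /"(\/pages_reaction_units\/more[^"]+)"/.exec(html);
    if (!match) return null;
    // Un-escape &amp; and append the __a=1 parameter, as in the class above.
    return 'https://www.facebook.com' + match[1].replace(/&amp;/g, '&') + '&__a=1';
}

// Made-up fragment; only the quoted path matters to the regex.
const sampleHtml = '<a href="/pages_reaction_units/more/?page_id=1&amp;cursor=abc">more</a>';

console.log(expectedPostCount(28));
console.log(expectedPostCount(29));
console.log(extractNextPageUrl(sampleHtml));
console.log(extractNextPageUrl('<p>no link here</p>'));
```

Returning null when no link is found is a design choice for this sketch; it lets a caller stop paging cleanly once the last chunk has been reached.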
