How can I fetch the post link in a Facebook Group post div by selecting it out of multiple divs?


Problem description


I'm working on a Puppeteer app that needs to fetch the link to a post in a Facebook Group, specifically the link attached to the post's date and time, shown below the author's name. I only need that link for the first post in the feed.

To do that, I have to start by selecting the outermost div of the post, which is the parent. Apparently every post in the feed carries the same class, as shown in this photo:

The photo above shows the typical HTML structure of a Facebook feed. The first child div is the "New Activity" title div, and the other children are post divs. I'm only interested in the first post div, which is Post 1 in the picture above.

The anchor link I'm interested in is nested deep inside, probably 10-15 levels down, and there are a huge number of other anchor links around it. To narrow it down, I can target only the links in the header of the post.

The image below shows the structure of the parent div and the header div:

This image shows which link I'm trying to fetch.

I know the pictures above are a lot to take in, but this is the simplest way I could explain what I have been trying to do. The problem is that I can't work out the actual selectors. I'm new to Puppeteer and its syntax is a bit complicated for me. In the simplest terms, what I need is to select the first post div (Post 1) out of multiple divs that share the same class; this is the most important part. Then I need to walk down through the inner divs, by class, to the actual anchor link.

Of all the code I have tried, this is one attempt:

const postDivs = await page.$$( 'div[role="feed"] .du4w35lb' );

const hrefs = await page.$$eval( `${postDivs[ 0 ]} .pybr56ya .buofh1pr a`, links => links.map( a => a.href ) );

console.log( 'anchor link: ', hrefs );

The above code returns an error that says:

Error: Evaluation failed: DOMException: Failed to execute 'querySelectorAll' on
'Document': 'JSHandle@node .pybr56ya .buofh1pr a' is not a valid selector.

Hoping to get a positive reply from you.

UPDATE*** This is the code I'm using to scrape the anchor links:

const puppeteer = require( 'puppeteer' );

( async () => {
    try {

        const browser = await puppeteer.launch( {
            headless: false,
            args: [ '--no-sandbox', '--allow-third-party-modules', '--start-maximized' ],
            slowMo: 10
        } );

        const context = await browser.createIncognitoBrowserContext();
        const page = await context.newPage();

        // go to webpage
        await page.goto( 'https://www.facebook.com', { waitUntil: 'networkidle2' } );

        // fill login details and submit
        await page.waitForSelector( "#email" );
        await page.focus( "#email" );
        await page.type( "#email", "myEmailId", { delay: 50 } );
        await page.waitForSelector( "#pass" );
        await page.focus( "#pass" );
        await page.type( "#pass", "myPassword", { delay: 50 } );
        await page.click( `[type="submit"]` );

        await page.waitForNavigation();
        await page.goto( "https://www.facebook.com/groups/groupName", { waitUntil: 'networkidle2' } );


        await page.waitForTimeout( 5000 );

        // code to fetch the links
        const links = await page.evaluate( function () {
            return [ ...document.querySelectorAll( 'div[role=feed] .du4w35lb .buofh1pr .tojvnm2t .oajrlxb2[role=link]' ) ].map( ( link ) => link.href );
        } );

        console.log( 'links: ', links );

        await page.waitForTimeout( 5000 );


        // close browser
        await browser.close()


    } catch ( err ) {
        console.log( err );
    }
} )();

Solution

You have to find a selector that is unique enough to pick out only what you need. To do that, you can try stringing together classes from several levels that lead to the link with the date (but not to other links, such as profile links, hence "unique enough").
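As for the error in the question: `page.$$` resolves to an array of element handles, and interpolating a handle into a template string turns it into the text `JSHandle@node`, which is not a valid CSS selector. One way around that is to run the query on the first handle itself instead of building a selector string. The following is only a minimal sketch, not a tested scraper: `firstPostHeaderLinks` is a hypothetical helper name, and the class names are copied from the question, so they are assumptions about Facebook's generated markup and may no longer match.

```javascript
// Sketch: collect header links from the first post only.
// Assumption: the class names below (taken from the question) still match
// Facebook's markup; they are generated and can change at any time.
async function firstPostHeaderLinks(page) {
  // page.$$ resolves to an array of ElementHandles, one per matching post div.
  const postDivs = await page.$$('div[role="feed"] .du4w35lb');
  if (postDivs.length === 0) {
    return []; // feed not rendered yet, or the class names no longer match
  }
  // Query inside the first post's handle rather than interpolating the
  // handle into a selector string.
  return postDivs[0].$$eval('.pybr56ya .buofh1pr a', (links) =>
    links.map((a) => a.href)
  );
}
```

It would be called as `const hrefs = await firstPostHeaderLinks(page);` once the group page has loaded.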

I did a quick experiment on a random FB discussion group where selectors are currently very similar to yours and came up with this selector to find the links to posts:

const links = await page.evaluate(function(){
  return [...document.querySelectorAll('div[role=feed] .du4w35lb .buofh1pr .tojvnm2t .oajrlxb2[role=link]')].map((link) => link.href);
});

It should produce an array like this:

[
  "https://www.facebook.com/groups/somegroup/permalink/1244304367068568/?__cft__[0]=AZXxG8lKJxPS9bC&__tn__=%2CO%2CP-R",
  "https://www.facebook.com/groups/somegroup/permalink/1243163367516017/?__cft__[0]=AZXcER8tI9lU1EL&__tn__=%2CO%2CP-R",
  "https://www.facebook.com/groups/somegroup/permalink/1245602367605409/?__cft__[0]=AZW9cets_p3QIyB&__tn__=%2CO%2CP-R",
  "https://www.facebook.com/groups/somegroup/permalink/1248223367343307/?__cft__[0]=AZV-htDstk_4Gsn&__tn__=%2CO%2CP-R",
  "https://www.facebook.com/groups/somegroup/permalink/1247711367061195/?__cft__[0]=AZW2depBCCmRtXC&__tn__=%2CO%2CP-R",
  "https://www.facebook.com/groups/somegroup#",
  "https://www.facebook.com/groups/somegroup#"
]

Notice the last two elements, though: they are only `#` placeholders. Apparently FB computes the real href dynamically once the cursor hovers over the date link, so keep that in mind.
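If only the first post's link is needed, the hover step can be sketched roughly as follows. This is a hedged illustration rather than a verified solution: `firstPostLink` and `isPermalink` are hypothetical helper names, the `/permalink/` check is inferred from the sample output above, and it is an assumption that hovering is enough for FB to fill in the real href.

```javascript
// Selector copied from the answer above. The class names are generated by
// Facebook, so treating them as stable is an assumption.
const POST_LINK_SELECTOR =
  'div[role=feed] .du4w35lb .buofh1pr .tojvnm2t .oajrlxb2[role=link]';

// True for hrefs that look like post permalinks rather than "#" placeholders.
// The "/permalink/" path segment is inferred from the sample output.
function isPermalink(href) {
  return new URL(href).pathname.includes('/permalink/');
}

// Hover over the first matching date link, then read its href.
async function firstPostLink(page, selector = POST_LINK_SELECTOR) {
  const link = await page.$(selector); // first match in document order, or null
  if (link === null) {
    return null;
  }
  await link.hover(); // assumption: hovering makes FB compute the real href
  const href = await link.evaluate((a) => a.href);
  return isPermalink(href) ? href : null;
}
```

A call such as `const link = await firstPostLink(page);` returns `null` when nothing matches or when the href is still a placeholder, so the caller can wait and retry.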
