Web scraping every forum post (Python, BeautifulSoup)


Problem description


Hello once again, fellow stack'ers. Short description: I am web scraping some data from an automotive forum using Python and saving all the data into CSV files. With some help from other Stack Overflow members, I managed to get as far as mining through all the pages for a certain topic, gathering the date, title and link for each post.

I also have a separate script that I am now struggling to implement (for every link found, Python creates a new soup for it, scrapes through all the posts and then goes back to the previous link).

I would really appreciate any other tips or advice on how to make this better, as it's my first time working with Python. I think it might be my nested loop logic that's messed up, but checking through it multiple times, it seems right to me.

Here's the code snippet:

        link += (div.get('href'))
        savedData += "\n" + title + ", " + link
        tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)
        while tempNumber < 3:
            for tempRow in tempSoup.find_all(id=re.compile("^td_post_")):
                for tempNext in tempSoup.find_all(title=re.compile("^Next Page -")):
                    tempNextPage = ""
                    tempNextPage += (tempNext.get('href'))
                post = ""
                post += tempRow.get_text(strip=True)
                postData += post + "\n"
            tempNumber += 1
            tempNewUrl = "http://www.automotiveforums.com/vbulletin/" + tempNextPage
            tempSoup = make_soup(tempNewUrl)
            print(tempNewUrl)
    tempNumber = 1
    number += 1
    print(number)
    newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
    soup = make_soup(newUrl)

My main issue with it so far is that tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link) does not seem to create a new soup after it has finished scraping all the posts for a forum thread.
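
For reference, make_soup isn't shown in the snippet above; it's just a small helper that fetches a page and parses it with BeautifulSoup. A minimal sketch of that kind of helper (the actual one I'm using may differ a bit) would be:

    # Minimal sketch of a make_soup()-style helper: fetch a URL and parse the HTML.
    # The real helper may use a different parser or add error handling.
    import urllib.request
    from bs4 import BeautifulSoup

    def make_soup(url):
        html = urllib.request.urlopen(url).read()
        return BeautifulSoup(html, "html.parser")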

This is the output I'm getting:

    http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=2
    http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=3
    1

So it does seem to find the correct links for the new pages and scrape them; however, for the next iteration it prints the new dates AND the exact same pages. There's also a really weird 10-12 second delay after the last link is printed, and only then does it hop down to print the number 1 and then bash out all the new dates.

But after going on to the next forum thread's link, it scrapes the exact same data every time.

Sorry if it looks really messy; it is sort of a side project and my first attempt at doing something useful, so I am very new at this. Any advice or tips would be much appreciated. I'm not asking you to solve the code for me; even some pointers about my possibly wrong logic would be greatly appreciated!

Kind regards, and thanks for reading such an annoyingly long post!

EDIT: I've cut out the majority of the post / code snippet, as I believe people were getting overwhelmed, and just left the essential bit I am trying to work with. Any help would be much appreciated!

Solution

So after spending a little bit more time, I have managed to ALMOST crack it. It's now at the point where Python finds every thread and its link on the forum, then goes to each link, reads all the pages and continues on with the next link.

This is the fixed code, in case anyone can make use of it.

    link += (div.get('href'))
    savedData += "\n" + title + ", " + link
    # Build the soup for the first page of this thread
    soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + link)
    while tempNumber < 4:
        # First loop: scrape every post body on the current page
        for postScrape in soup3.find_all(id=re.compile("^td_post_")):
            post = ""
            post += postScrape.get_text(strip=True)
            postData += post + "\n"
            print(post)
        # Second loop: grab the href of the "Next Page" link, if the page has one
        for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
            tempNextPage = ""
            tempNextPage += (tempNext.get('href'))
            print(tempNextPage)
        # Load the next page of the thread and go round again
        soup3 = ""
        soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)
        tempNumber += 1
    tempNumber = 1
number += 1
print(number)
# Move on to the next page of the thread listing
newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
soup = make_soup(newUrl)

All I had to do was separate the two for loops that were nested within each other into their own loops. Still not a perfect solution, but hey, it ALMOST works.

The non-working bit: the first 2 threads of the provided link have multiple pages of posts; the following 10+ threads do not. I cannot figure out a way to check the result of for tempNext in soup3.find_all(title=re.compile("^Next Page -")): outside of the loop to see whether it's empty or not, because if it does not find a next-page element / href, it just reuses the last one. But if I reset the value after each run, it no longer mines every page =l A solution that just created another problem :D.
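
One thing that might fix it (an untested sketch, reusing the same names and make_soup helper as above; nextLink is just an illustrative variable) is to swap that second for loop for find(), which returns None when there is no "Next Page" link, so the paging loop can simply stop on single-page threads:

    # Untested sketch: stop paging when the thread has no "Next Page" link.
    # find() returns None if nothing matches, unlike looping over find_all().
    while tempNumber < 4:
        for postScrape in soup3.find_all(id=re.compile("^td_post_")):
            postData += postScrape.get_text(strip=True) + "\n"
        nextLink = soup3.find(title=re.compile("^Next Page -"))
        if nextLink is None:
            break  # single-page thread, nothing more to load
        soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + nextLink.get('href'))
        tempNumber += 1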
