Cannot pull data from pantip.com


Question

I have been trying to pull data from pantip.com, including the title, the post story, and all comments, using BeautifulSoup. However, I could pull only the title and the post story; I could not get the comments. Here is the code for the title and post story:

import requests
import re
from bs4 import BeautifulSoup


# specify the url
url = 'https://pantip.com/topic/38372443'

# Split Topic number
topic_number = re.split('https://pantip.com/topic/', url)
topic_number = topic_number[1]


page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Capture title
elementTag_title = soup.find(id = 'topic-'+ topic_number)
title = str(elementTag_title.find_all(class_ = 'display-post-title')[0].string)

# Capture post story
resultSet_post = elementTag_title.find_all(class_ = 'display-post-story')[0]
post = resultSet_post.contents[1].text.strip()

I tried to find the comments by ID:

elementTag_comment = soup.find(id = "comments-jsrender")

I got the result below:

elementTag_comment =

<div id="comments-jsrender">
<div class="loadmore-bar loadmore-bar-paging"> <a href="javascript:void(0)">
<span class="icon-expand-left"><small>▼</small></span>
<span class="focus-txt"><span class="loading-txt">กำลังโหลดข้อมูล...</span></span>
<span class="icon-expand-right"><small>▼</small></span> </a> </div>
</div>

The question is: how can I get all the comments? Please suggest how to fix this.

Recommended answer

The reason you're having trouble locating the rest of these posts is that the site fills in the comments dynamically with JavaScript. requests only downloads the initial HTML and never runs that JavaScript, so the comments are not in the markup that Beautiful Soup receives; all you get is the "loading" placeholder shown in your output. To get around this, you can implement a solution with Selenium. Download the correct driver from https://github.com/mozilla/geckodriver/releases and add it to your system PATH. Selenium will load the page in a real browser, and you will then have full access to all the attributes you see in your screenshot.
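To make the difference concrete, here is a minimal offline sketch that counts comment containers in two HTML snippets. Both snippets are hypothetical: the first is abbreviated from the output in the question, and the second only assumes that rendered comments are divs whose ids start with "comment-". The helper name count_rendered_comments is likewise made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: what the server returns before any JavaScript runs,
# abbreviated from the output shown in the question.
STATIC_HTML = """
<div id="comments-jsrender">
  <div class="loadmore-bar loadmore-bar-paging">
    <span class="loading-txt">กำลังโหลดข้อมูล...</span>
  </div>
</div>
"""

# Hypothetical sample: the same container after a browser has rendered
# comments into it. The "comment-N" id pattern is an assumption.
RENDERED_HTML = """
<div id="comments-jsrender">
  <div id="comment-1">First comment</div>
  <div id="comment-2">Second comment</div>
</div>
"""


def count_rendered_comments(html):
    """Count divs whose id starts with "comment-" in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(
        "div", id=lambda value: value and value.startswith("comment-")
    )
    return len(matches)


print(count_rendered_comments(STATIC_HTML))
print(count_rendered_comments(RENDERED_HTML))
```

If the count is zero for the HTML that requests returns, the comments are probably being added by JavaScript after the page loads, and a browser-driven approach is needed.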

Once the driver is installed, you can use the following to print the text of each comment:

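from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://pantip.com/topic/38372443'

# Selenium drives a real browser, so the page's JavaScript actually runs
driver = webdriver.Firefox()
try:
    driver.get(url)
    content = driver.page_source
finally:
    # Always close the browser, even if loading the page fails
    driver.quit()

# 'lxml' needs the lxml package installed; 'html.parser' also works
soup = BeautifulSoup(content, 'lxml')

# Keep only divs whose id starts with "comment-"
for div in soup.find_all("div", id=lambda value: value and value.startswith("comment-")):
    text = div.text.strip()
    if len(text) > 1:
        print(text)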
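One possible refinement, offered only as a sketch: page_source may be read before the comments have finished rendering, so an explicit wait can help. The version below assumes that rendered comments match the CSS selector div[id^='comment-'] (taken from the filter above, not verified against the live site) and that a 15-second timeout is enough. The function names fetch_rendered_html and extract_comments are hypothetical. It does not click the "load more" bar, so very long topics may still be incomplete.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=15):
    """Load url in Firefox and return the HTML once a comment div exists."""
    # Imported here so the parsing helper below works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Assumption: rendered comments are divs with ids like "comment-...".
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "div[id^='comment-']")
            )
        )
        return driver.page_source
    finally:
        # Close the browser whether or not the wait succeeded.
        driver.quit()


def extract_comments(html):
    """Return the stripped text of every non-empty comment div."""
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all(
        "div", id=lambda value: value and value.startswith("comment-")
    )
    texts = (div.get_text(strip=True) for div in divs)
    return [text for text in texts if text]


# Offline demo on a hypothetical snippet; no browser is started here.
sample = '<div id="comment-1"> Hello </div><div id="comment-2"></div>'
print(extract_comments(sample))
```

With geckodriver installed, extract_comments(fetch_rendered_html('https://pantip.com/topic/38372443')) would fetch and parse the live page. Keeping the parsing step separate from the browser step makes it possible to test the parsing on saved HTML.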
