Using Python, Selenium, and BeautifulSoup to scrape for content of a tag?


Question

I'm a relative beginner. There are similar topics to this one, but I can see that my solution works; I just need help connecting the last few dots. I'd like to scrape follower counts from Instagram without using the API. Here's what I have so far:

Python 3.7.0
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()

> DevTools listening on ws://.......

driver.get("https://www.instagram.com/cocacola")
soup = BeautifulSoup(driver.page_source)
elements = soup.find_all(attrs={"class":"g47SY "}) 
# Note the full class is 'g47SY lOXF2' but I can't get this to work
for element in elements:
    print(element)

>[<span class="g47SY ">667</span>,
  <span class="g47SY " title="2,598,456">2.5m</span>, # Need what's in title, 2,598,456
  <span class="g47SY ">582</span>]

for element in elements:
    t = element.get('title')
    if t:
        count = t
        count = count.replace(",","")
    else:
        pass

print(int(count))

>2598456 # Success

Is there an easier or quicker way to get to the 2,598,456 number? My original hope was that I could just use the class 'g47SY lOXF2', but as far as I'm aware, spaces in the class name don't work in BS4. I just want to make sure this code is succinct and functional.
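On the point about spaces in the class name: a space in an HTML class attribute separates two classes, so 'g47SY lOXF2' is the pair of classes g47SY and lOXF2. The following is a minimal sketch of matching both at once with BeautifulSoup. The HTML string is a hypothetical stand-in modelled on the spans printed above, not Instagram's real markup, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the spans printed in the question.
HTML = """
<ul>
  <li><span class="g47SY lOXF2">667</span> posts</li>
  <li><a href="#"><span class="g47SY lOXF2" title="2,598,456">2.5m</span> followers</a></li>
  <li><a href="#"><span class="g47SY lOXF2">582</span> following</a></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# In a CSS selector, chain the classes with dots instead of a space.
by_selector = soup.select("span.g47SY.lOXF2")

# class_ with a single name matches any element that carries that class.
by_single_class = soup.find_all("span", class_="lOXF2")

# Keep only the span that has a title attribute and read the exact count.
titled = [span["title"] for span in by_selector if span.has_attr("title")]
follower_count = int(titled[0].replace(",", ""))

print(len(by_selector), len(by_single_class), follower_count)
```

Generated class names such as these tend to change between site releases, so a selector based on structure or attributes is likely to last longer than one based on the class.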

Recommended answer

I had to use the headless option and add executable_path for testing. You can remove both.

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
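# chrome_options is the deprecated spelling of this keyword; use options instead.
# On Selenium 4.10+, executable_path is gone as well: pass
# service=Service("chromedriver.exe") (from selenium.webdriver.chrome.service) instead.
driver = webdriver.Chrome(executable_path="chromedriver.exe", options=options)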

driver.get('https://www.instagram.com/cocacola')

soup = BeautifulSoup(driver.page_source,'lxml')

# 'span[title]' alone matches every span that has a title attribute, which can give multiple results.
# The follower count sits inside an <a> tag, so 'a > span[title]' narrows the match to that span.
followers = soup.select_one('a > span[title]')['title'].replace(',','')

print(followers)
#Output 2598552
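One caveat with the one-liner: select_one returns None when nothing matches (for example if the page layout changes or a login page is served instead), and indexing None with ['title'] then raises a TypeError. A defensive variant might look like the sketch below. The function name extract_follower_count and the sample HTML strings are hypothetical, and "html.parser" is used only so the example has no extra dependency.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_follower_count(html: str) -> Optional[int]:
    """Return the count held in the first `a > span[title]`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.select_one("a > span[title]")
    if span is None:
        # No matching element: report that instead of raising.
        return None
    digits = span["title"].replace(",", "")
    # Guard against a title that is not a plain number.
    return int(digits) if digits.isdigit() else None


# Hypothetical snippets standing in for driver.page_source.
sample = '<a href="#"><span class="g47SY" title="2,598,456">2.5m</span> followers</a>'
print(extract_follower_count(sample))
print(extract_follower_count("<p>login wall</p>"))
```

With a live browser session, the same function could be called as extract_follower_count(driver.page_source).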
