How to scrape data from Shopee using Beautiful Soup


Problem description

I'm currently a student studying beautifulsoup, and my lecturer asked me to scrape data from Shopee; however, I cannot scrape the details of the products. Currently, I'm trying to scrape data from https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I only want to scrape the name and price of the products. Can someone tell me why I cannot scrape the data using beautifulsoup?

Here is my code:

from requests import get
from bs4 import BeautifulSoup

url = "https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

Recommended answer

This question is a bit tricky (for Python beginners) because it involves a combination of Selenium (for headless browsing) and BeautifulSoup (for HTML data extraction). Moreover, the problem becomes difficult because the Document Object Model (DOM) is rendered by JavaScript. We know JavaScript is involved because we get an empty response from the website when it is accessed using BeautifulSoup alone; for example, for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'): print(item_n.get_text()) prints nothing.

Therefore, to extract data from a webpage whose DOM is controlled by a scripting language, we have to use Selenium for headless browsing (this tells the website that a browser is accessing it). We also have to use some sort of delay parameter (which tells the website that it is being accessed by a human). For this, the function WebDriverWait() from the Selenium library will help.
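
As a side note, WebDriverWait only does useful work when it is combined with an expected condition via .until(); on its own it merely constructs a wait object. A minimal sketch of an explicit wait (the class name _1NoI8_ _16BAGk is the one used later in this answer and, as noted at the end, may change):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# block for up to 5 seconds until at least one product-name div is present;
# raises TimeoutException if it never appears
WebDriverWait(browser, 5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div._1NoI8_._16BAGk"))
)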

I now present snippets of code that explain the process.

First, import the required libraries.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from time import sleep

Next, initialize the settings for the headless browser. I'm using Chrome.

# create object for chrome options
chrome_options = Options()
base_url = 'https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales'

# set chrome driver options to disable any popups from the website
# to find the local path for the chrome profile, open the chrome browser
# and type "chrome://version" in the address bar
chrome_options.add_argument('disable-notifications')
# this also disables the message "Chrome is being controlled by automated test software"
chrome_options.add_argument('--disable-infobars')
chrome_options.add_argument('start-maximized')
chrome_options.add_argument('user-data-dir=C:\\Users\\username\\AppData\\Local\\Google\\Chrome\\User Data\\Default')
# pass the argument 1 to allow and 2 to block notifications
chrome_options.add_experimental_option("prefs", { 
    "profile.default_content_setting_values.notifications": 2
    })
# invoke the webdriver
browser = webdriver.Chrome(executable_path = r'C:/Users/username/Documents/playground_python/chromedriver.exe',
                          options = chrome_options)
browser.get(base_url)
delay = 5  # seconds
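
One caveat: none of the options above actually puts Chrome into headless mode, so a visible browser window will open. If a truly headless run is wanted, Chrome accepts the standard flag below (added here as a suggestion, not part of the original answer); it must be set before webdriver.Chrome() is invoked:

# optional: run chrome without a visible window
chrome_options.add_argument('--headless')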

Next, I declare empty list variables to hold the data.

# declare empty lists
item_cost, item_init_cost, item_loc = [], [], []
item_name, items_sold, discount_percent = [], [], []
while True:
    try:
        # note: WebDriverWait only waits when paired with .until();
        # here the fixed sleep(5) below does the actual waiting
        WebDriverWait(browser, delay)
        print("Page is ready")
        sleep(5)
        html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        #print(html)
        soup = BeautifulSoup(html, "html.parser")

        # find_all() returns an array of elements.
        # We have to go through all of them, select the ones we need, and then call get_text()
        for item_n in soup.find_all('div', class_='_1NoI8_ _16BAGk'):
            print(item_n.get_text())
            item_name.append(item_n.text)

        # find the price of items
        for item_c in soup.find_all('span', class_='_341bF0'):
            print(item_c.get_text())
            item_cost.append(item_c.text)

        # find the initial item cost
        for item_ic in soup.find_all('div', class_='_1w9jLI QbH7Ig U90Nhh'):
            print(item_ic.get_text())
            item_init_cost.append(item_ic.text)

        # find the total number of items sold per month
        for items_s in soup.find_all('div', class_='_18SLBt'):
            print(items_s.get_text())
            items_sold.append(items_s.text)

        # find the item discount percent
        for dp in soup.find_all('span', class_='percent'):
            print(dp.get_text())
            discount_percent.append(dp.text)

        # find the item location
        for il in soup.find_all('div', class_='_3amru2'):
            print(il.get_text())
            item_loc.append(il.text)

        break  # break out of the loop once the elements have been scraped
    except TimeoutException:
        print("Loading took too much time! - Try again")

Thereafter, I use the zip function to combine the different list items.

rows = zip(item_name, item_init_cost,discount_percent,item_cost,items_sold,item_loc)
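
One caveat with zip: it stops at the shortest input, so if one list ends up shorter (for example, discount_percent when some products are not discounted), rows are silently dropped and fields can drift out of alignment. For instance:

# zip truncates to the shortest list
list(zip([1, 2, 3], ['a', 'b']))   # -> [(1, 'a'), (2, 'b')]

Scraping each product card as a unit, rather than each field across the whole page, is one way to keep the columns aligned.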

Finally, I write this data to disk,

import csv
newFilePath = 'shopee_item_list.csv'
# newline='' stops the csv module from inserting blank lines between rows on Windows
with open(newFilePath, "w", newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

As a good practice, it's wise to close the headless browser once the task is complete. And so I code it as,

# close the automated browser
browser.close()

Results

Nestle MILO Activ-Go Chocolate Malt Powder (2kg)
NESCAFE GOLD Refill (170g)
Nestle MILO Activ-Go Chocolate Malt Powder (1kg)
MAGGI Hot Cup - Asam Asam Laksa (60g)
MAGGI 2-Minit Curry (79g x 5 Packs x 2)
MAGGI PAZZTA Cheese Macaroni 70g
.......
29.90
21.90
16.48
1.69
8.50
3.15
5.90
.......
RM40.70
RM26.76
RM21.40
RM1.80
RM9.62
........
9k sold/month
2.3k sold/month
1.8k sold/month
1.7k sold/month
.................
27%
18%
23%
6%
.............
Selangor
Selangor
Selangor
Selangor

A note to the readers

The OP brought to my attention that the XPath given in my answer was not working. I checked the website again after 2 days and noticed a strange phenomenon. The class_ attribute of the div class had indeed changed. I found a similar Q. But it did not help much. So for now, I'm concluding that the div attributes on the Shopee website can change again. I leave this as an open problem to solve later.
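
If the generated class names do keep rotating, one possible mitigation (a sketch of my own, not part of the original answer) is to stop hard-coding the full class string and instead match classes by pattern, which BeautifulSoup supports via regular expressions:

import re

# match any div that carries a class starting with the current hash prefix;
# the pattern is purely illustrative and must be adapted to whatever
# classes the site currently emits
for item_n in soup.find_all('div', class_=re.compile(r'^_1NoI8_')):
    print(item_n.get_text())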

A note to the OP

Ana, the above code will work for just one page, i.e., it will work only for the webpage https://shopee.com.my/shop/13377506/search?page=0&sortBy=sales. I invite you to further enhance your skills by solving how to scrape data for multiple webpages under the sales tag. Your hint is the 1/9 seen on the top right of this page and/or the 1 2 3 4 5 links at the bottom of the page. Another hint for you is to look at urljoin in the urllib.parse library. Hope this gets you started.
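
For what it's worth, a minimal sketch of that pagination idea (the page count of 9 comes from the 1/9 hint above; all other names reuse objects defined earlier in this answer):

from urllib.parse import urljoin

# the page index lives in the query string, so rebuild the URL for each
# page and repeat the scrape-and-parse steps shown in the while-loop above
for page in range(9):
    page_url = urljoin(base_url, '/shop/13377506/search?page={}&sortBy=sales'.format(page))
    browser.get(page_url)
    sleep(5)  # same crude delay as before; an explicit WebDriverWait would be cleaner
    html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup = BeautifulSoup(html, "html.parser")
    # ...extract name, price, etc. exactly as shown earlier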
