Scraping @Type from HTML Script into a csv file using pandas

Views: 21 Published: 2021/9/24 19:04:07 Tags: python csv web-scraping beautifulsoup data-science

Question

I am trying web scraping for the first time and am having a lot of trouble, mainly because the website I am supposed to use does its best to block scraping libraries. I downloaded the HTML, but the data I want to put into a csv file is not in regular tags (such as div or li). Instead, it appears inside a script as dictionaries marked with @type. I need to build a dataset whose columns are the fields in those dictionaries (rating value, author, URL, and description). The HTML source code I downloaded is attached below. I would appreciate your help!

Here is the code I used to scrape it:

import codecs
import os
import re

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

os.system('cls')  # clear the Windows console

PATH = "C:\\Users\\HCES\\Downloads\\chromedriver.exe"
driver = webdriver.Chrome(PATH)
i = 1  # review page number
# No space after "page=", otherwise the page number is sent as " 1"
driver.get("https://www.zomato.com/beirut/divvy-ashrafieh/reviews?page={}&sort=dd&filter=reviews-dd".format(i))
page_source = driver.page_source
soup = BeautifulSoup(page_source, "lxml")

Recommended answer

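import json
import re

import requests

# Send a browser-like User-Agent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0'
}


def main(url, page):
    # Query-string parameters; requests encodes them into the URL
    params = {
        'page': page,
        'sort': 'dd',
        'filter': 'reviews-dd'
    }
    r = requests.get(url, params=params, headers=headers)
    # Capture the argument of the .parse(...) call embedded in the page
    match = re.search(r'\.parse\((.*)\)', r.text).group(1)
    # The argument is presumably a quoted string that itself holds JSON:
    # the inner loads() decodes the string, the outer one decodes the JSON
    goal = json.loads(json.loads(match))

    print(goal.keys())


main('https://www.zomato.com/beirut/divvy-ashrafieh/reviews', 1)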

Output:

dict_keys(['pages', 'blogData', 'pageUrlMappings', 'careers', 'allJobs', 'department', 'aboutus', 'sneakpeek', 'apiState', 'entities', 'user', 'uiLogic', 'location', 'gAds', 'footer', 'langKeys', 'deviceSpecificInfo', 'pageBlockerInfo', 'fullPageAds', 'networkState', 'fetchConfigs', 'hrefLangInfo', 'pageConfig', 'partnershipLoginModal', 'partnershipLoginOptionModal', 'doesNotDeliverModal', 'backButton'])
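
The output above lists only the top-level keys of the decoded page state. For the columns the question asks for (rating value, author, URL, description), one possible route is to decode the `@type` dictionaries directly and flatten them with pandas. This assumes they are schema.org JSON-LD stored in `<script type="application/ld+json">` tags of the downloaded HTML, which has not been confirmed against the actual page. In the minimal sketch below, `SAMPLE_HTML`, `extract_reviews`, and the field names (`reviews`, `reviewRating`, `ratingValue`) are illustrative assumptions, so adjust them to whatever the real source contains.

```python
import json
import re

import pandas as pd

# Hypothetical sample standing in for the downloaded page source;
# the real page's field names and nesting may differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Restaurant", "name": "Example",
 "reviews": [
   {"@type": "Review", "url": "https://example.com/review/1",
    "author": "Alice", "description": "Great food.",
    "reviewRating": {"@type": "Rating", "ratingValue": 5}},
   {"@type": "Review", "url": "https://example.com/review/2",
    "author": "Bob", "description": "Slow service.",
    "reviewRating": {"@type": "Rating", "ratingValue": 2}}
 ]}
</script>
</head><body></body></html>
"""

# Matches the body of every JSON-LD script tag in the page source
LD_JSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

COLUMNS = ["rating_value", "author", "url", "description"]


def extract_reviews(html):
    """Return one row (dict) per review found in the JSON-LD blocks."""
    rows = []
    for block in LD_JSON_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip script bodies that are not valid JSON
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            for review in item.get("reviews", []):
                rating = review.get("reviewRating") or {}
                author = review.get("author")
                if isinstance(author, dict):  # author may be a nested object
                    author = author.get("name")
                rows.append({
                    "rating_value": rating.get("ratingValue"),
                    "author": author,
                    "url": review.get("url"),
                    "description": review.get("description"),
                })
    return rows


df = pd.DataFrame(extract_reviews(SAMPLE_HTML), columns=COLUMNS)
# With no path, to_csv returns the CSV text; pass a filename such as
# "reviews.csv" to write a file instead.
csv_text = df.to_csv(index=False)
print(csv_text)
```

With the question's Selenium code, `page_source` could be passed to `extract_reviews` in place of `SAMPLE_HTML`. The regular expression could also be swapped for BeautifulSoup's `soup.find_all("script", type="application/ld+json")` if parsing the HTML properly is preferred.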
