How Can I Export Scraped Data to Excel Horizontally?


Problem Description


I'm relatively new to Python. Using this site as an example, I'm trying to scrape the restaurants' information but I'm not sure how to pivot this data horizontally when it's being read vertically. I'd like the Excel sheet to have six columns as follows: Name, Street, City, State, Zip, Phone. This is the code I'm using:

from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.request import urlopen
import time

driver = webdriver.Chrome(executable_path=r"C:\Downloads\chromedriver_win32\chromedriver.exe")


driver.get('https://www.restaurant.com/listing?&&st=KS&p=KS&p=PA&page=1&&searchradius=50&loc=10021')
time.sleep(10)
with urlopen(driver.current_url) as response:
    soup = BeautifulSoup(response, 'html.parser')
    pageList = soup.findAll("div", attrs={"class": {"details"}})
    list_of_inner_text = [x.text for x in pageList]
    text = ', '.join(list_of_inner_text)
    print(text)

Thanks

EDIT: Based on feedback, here's what I would expect from the first five restaurants on this page (screenshot: FirstFiveRestaurants)

Solution

Here is one way. Your mileage may vary on other pages.

This line

details = [re.sub(r'\s{2,}|[,]', '', i) for i in restaurant.select_one('h3 + p').text.strip().split('\n') if i != '']

basically handles the generation of the output columns (all except the name) by splitting the p tag's text on '\n' and doing a little string cleaning.
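To see concretely what that cleaning does, here is a minimal sketch run on a made-up details string (the real page's markup may differ):

```python
import re

# Made-up example of the text inside a details <p> tag:
# each field on its own line, with stray whitespace and trailing commas.
raw = "  123 Main St,\n  Wichita,\n  KS\n  67202\n  (316) 555-0100  "

# Same expression as in the answer: drop runs of 2+ whitespace chars and commas,
# skip any empty lines left over after the split.
details = [re.sub(r'\s{2,}|[,]', '', i) for i in raw.strip().split('\n') if i != '']
print(details)  # → ['123 Main St', 'Wichita', 'KS', '67202', '(316) 555-0100']
```

Note that single spaces survive (the pattern only matches runs of two or more), which is why "123 Main St" stays intact.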

import re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome(executable_path=r"C:\Users\User\Documents\chromedriver.exe")
driver.get('https://www.restaurant.com/listing?&&st=KS&p=KS&p=PA&page=1&&searchradius=50&loc=10021')
# Wait for the listings to load instead of sleeping a fixed time
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".restaurants")))
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

restaurants = soup.select('.restaurants')
results = []

for restaurant in restaurants:
    # Split the details <p> on newlines, stripping extra whitespace and commas
    details = [re.sub(r'\s{2,}|[,]', '', i) for i in restaurant.select_one('h3 + p').text.strip().split('\n') if i != '']
    details.insert(0, restaurant.select_one('h3 a').text)  # prepend the name
    results.append(details)

df = pd.DataFrame(results, columns=['Name', 'Address', 'City', 'State', 'Zip', 'Phone'])
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)
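A note on the output: encoding='utf-8-sig' writes a byte-order mark so Excel detects the CSV as UTF-8 when you open it. If you want a native .xlsx workbook instead, pandas can write one directly via to_excel (which requires an engine such as openpyxl). A minimal sketch with one made-up row in the shape the loop above produces:

```python
import pandas as pd

# One made-up row matching the six columns built above.
rows = [["Bob's Diner", '123 Main St', 'Wichita', 'KS', '67202', '(316) 555-0100']]
df = pd.DataFrame(rows, columns=['Name', 'Address', 'City', 'State', 'Zip', 'Phone'])

# CSV with a UTF-8 BOM so Excel picks the right encoding.
df.to_csv('Data.csv', encoding='utf-8-sig', index=False)

# Or, for a native Excel workbook (needs openpyxl installed):
# df.to_excel('Data.xlsx', index=False)
```

Either way, each restaurant lands on its own row with the fields spread horizontally across the six columns, which is the pivot the question asks for.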

