Open csv file from website directly in pandas without downloading to folder

Problem Description

This website contains an 'Export Data' link, which downloads the contents of the page into a csv file. The button does not contain a link to the csv file, but instead runs a javascript procedure. I want to open the csv file directly with pandas, rather than downloading it, figuring out the download folder, then opening it from there. Is this possible?
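
Ideally this would be a one-liner like the sketch below, which works whenever a site exposes a direct, stable CSV link (the URL here is just a hypothetical placeholder):

import pandas as pd

# pandas can read a remote CSV straight into a DataFrame
df = pd.read_csv('http://example.com/projections.csv')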

My existing code uses selenium to click the button, although if there is a better way to do that, I'd love to hear it.

from selenium import webdriver

# assign chrome driver path to variable (placeholder path)
chrome_path = '/path/to/chromedriver'

# create browser object
driver = webdriver.Chrome(chrome_path)

# assign url variable    
url = 'http://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc&team=0&lg=all&players=0&sort=24%2cd'

# navigate to web page    
driver.get(url)

# click export data button    
driver.find_element_by_link_text("Export Data").click()

#close driver
driver.quit()
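
One way to at least avoid hunting for the download folder is to pin Chrome's download directory up front and read the file from there. Below is a sketch along these lines, reusing chrome_path and url from above and assuming Selenium 3-style APIs; download_dir is an arbitrary placeholder:

import glob
import os
import time

import pandas as pd
from selenium import webdriver

# placeholder: any folder you control
download_dir = '/tmp/fangraphs_csv'
os.makedirs(download_dir, exist_ok=True)

# tell Chrome to drop downloads into the known folder
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {'download.default_directory': download_dir})

driver = webdriver.Chrome(chrome_path, options=options)
driver.get(url)
driver.find_element_by_link_text("Export Data").click()
time.sleep(10)  # crude wait for the download to finish
driver.quit()

# read whichever CSV just landed in the folder
latest_csv = max(glob.glob(os.path.join(download_dir, '*.csv')), key=os.path.getctime)
df = pd.read_csv(latest_csv)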

Recommended Answer

Just happened to come across this and have a script that should work if you change the URL. Instead of using selenium to download the CSV, BeautifulSoup is used to scrape the tables within the page and pandas is used to build the table(s) for CSV export.

Just make sure it has the "page=1_100000" at the end to get all rows. Let me know if you have any questions.

import requests
from random import choice
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urlparse, parse_qs
from functools import reduce

desktop_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0']

url = "https://www.fangraphs.com/leaders.aspx?pos=np&stats=bat&lg=all&qual=0&type=c,4,6,5,23,9,10,11,13,12,21,22,60,18,35,34,50,40,206,207,208,44,43,46,45,24,26,25,47,41,28,110,191,192,193,194,195,196,197,200&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_100000"

# build request headers with a random desktop User-Agent
def random_headers():
    return {'User-Agent': choice(desktop_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

# request the page with a random User-Agent and parse it
page_request = requests.get(url, headers=random_headers())
soup = BeautifulSoup(page_request.text, "lxml")

# the stats grid is the 12th <table> on the page
table = soup.find_all('table')[11]
data = []
# pull the column headings from the fangraphs table, skipping the
# first heading so the headers line up with the data rows below,
# which drop the leading rank cell
column_headers = []
headingrows = table.find_all('th')
for row in headingrows[1:]:
    column_headers.append(row.text.strip())

data.append(column_headers)
table_body = table.find('tbody')
rows = table_body.find_all('tr')

# pull each data row, dropping the leading rank cell
for row in rows:
    cols = [ele.text.strip() for ele in row.find_all('td')]
    data.append(cols[1:])

# collect the player-ID links; parse_qs applied to the raw href
# yields a dict keyed 'statss.aspx?playerid' (plus 'position')
ID = []

for tag in soup.select('a[href^="statss.aspx?playerid="]'):
    link = tag['href']
    query = parse_qs(link)
    ID.append(query)

# first row of data holds the headers; promote it, then drop it
df1 = pd.DataFrame(data)
df1 = df1.rename(columns=df1.iloc[0])
df1 = df1.loc[1:].reset_index(drop=True)

# one row per parsed href; keep only the playerid values
df2 = pd.DataFrame(ID)
df2.drop(['position'], axis=1, inplace=True, errors='ignore')
df2['statss.aspx?playerid'] = df2['statss.aspx?playerid'].str[0]

# join the stats and the IDs side by side and write the CSV
df3 = pd.concat([df1, df2], axis=1)

df3.to_csv("HittingGA2018.csv")
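
If you don't need the player IDs, a shorter variant is to let pandas parse the page's tables itself via read_html. A minimal sketch, assuming the stats grid is still the 12th table as above:

import requests
import pandas as pd

html = requests.get(url, headers=random_headers()).text
tables = pd.read_html(html)  # one DataFrame per <table> on the page
df = tables[11]              # same table index as the soup version (assumption)
df.to_csv("HittingGA2018.csv")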
