Open csv file from website directly in pandas without downloading to folder
Question
This website contains an 'Export Data' link, which downloads the contents of the page into a CSV file. The button does not link to the CSV file directly; instead it runs a JavaScript procedure. I want to open the CSV file directly with pandas, rather than downloading it, figuring out the download folder, and then opening it from there. Is this possible?
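For sites that do expose a plain CSV endpoint, pandas can skip the download step entirely: `read_csv` accepts URLs and file-like objects alike, so a response fetched into memory never has to touch the download folder. A minimal sketch with made-up data standing in for a fetched response body:

```python
import io
import pandas as pd

# pandas accepts any file-like object, so CSV text fetched into memory
# (e.g. with requests) can be read without ever hitting the disk.
# The data below is invented purely for illustration.
csv_text = "Name,HR,AVG\nTrout,39,0.312\nBetts,32,0.346\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 3)
```

The catch in this question is that FanGraphs serves the CSV through a JavaScript handler rather than a stable URL, which is what forces the selenium or scraping workarounds below.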
My existing code uses selenium to click the button, although if there is a better way to do that, I'd love to hear it.
from selenium import webdriver

# assign chrome driver path to variable
chrome_path = chromepath
# create browser object
driver = webdriver.Chrome(chrome_path)
# assign url variable
url = 'http://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc&team=0&lg=all&players=0&sort=24%2cd'
# navigate to web page
driver.get(url)
# click export data button
driver.find_element_by_link_text("Export Data").click()
# close driver
driver.quit()
Answer
Just happened to come across this and have a script that should work if you change the URL. Instead of using selenium to download the CSV, BeautifulSoup is used to scrape the tables within the page and pandas is used to create the table(s) for CSV export.

Just make sure the URL has "page=1_100000" at the end to get all rows. Let me know if you have any questions.
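One way to guarantee that suffix is to rewrite the query string with the standard library rather than editing the URL by hand — a small sketch (the `set_page` helper name is my own):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def set_page(url, page="1_100000"):
    # Rewrite (or add) the 'page' query parameter so every row is returned.
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["page"] = [page]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

print(set_page("https://www.fangraphs.com/leaders.aspx?pos=np&page=1_50"))
# https://www.fangraphs.com/leaders.aspx?pos=np&page=1_100000
```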
import requests
from random import choice
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import parse_qs

desktop_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0']

url = "https://www.fangraphs.com/leaders.aspx?pos=np&stats=bat&lg=all&qual=0&type=c,4,6,5,23,9,10,11,13,12,21,22,60,18,35,34,50,40,206,207,208,44,43,46,45,24,26,25,47,41,28,110,191,192,193,194,195,196,197,200&season=2018&month=0&season1=2018&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_100000"

def random_headers():
    # rotate user agents so repeated requests look less like a bot
    return {'User-Agent': choice(desktop_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

# get the url
page_request = requests.get(url, headers=random_headers())
soup = BeautifulSoup(page_request.text, "lxml")
table = soup.find_all('table')[11]

# pull headings from the fangraphs table
data = []
column_headers = []
headingrows = table.find_all('th')
for row in headingrows:
    column_headers.append(row.text.strip())
data.append(column_headers)

# pull the body rows, dropping the leading rank column
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = [ele.text.strip() for ele in row.find_all('td')]
    data.append(cols[1:])

# collect player IDs from the stat-page links
# (the attribute value in the CSS selector must be quoted)
ID = []
for tag in soup.select('a[href^="statss.aspx?playerid="]'):
    link = tag['href']
    query = parse_qs(link)
    ID.append(query)

df1 = pd.DataFrame(data)
df1 = df1.rename(columns=df1.iloc[0])
df1 = df1.loc[1:].reset_index(drop=True)

df2 = pd.DataFrame(ID)
df2.drop(['position'], axis=1, inplace=True, errors='ignore')
df2['statss.aspx?playerid'] = df2['statss.aspx?playerid'].str[0]

df3 = pd.concat([df1, df2], axis=1)
df3.to_csv("HittingGA2018.csv")
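As an aside, `pd.read_html` can often replace the manual BeautifulSoup row loop above: it turns every `<table>` in a page into a DataFrame in one call (it needs lxml or html5lib installed, which this script already uses). A small sketch on inline markup rather than the live page:

```python
import io
import pandas as pd

# pd.read_html returns one DataFrame per <table> found in the markup;
# the markup below is a toy stand-in for the FanGraphs leaderboard table.
html = """
<table>
  <thead><tr><th>Name</th><th>HR</th></tr></thead>
  <tbody>
    <tr><td>Trout</td><td>39</td></tr>
    <tr><td>Betts</td><td>32</td></tr>
  </tbody>
</table>
"""
df = pd.read_html(io.StringIO(html))[0]
print(df.shape)  # (2, 2)
```

It will not recover the player IDs from the link hrefs, though, so the `soup.select` step would still be needed for that part.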