Python scraping data online, but the CSV file doesn't show the correct data format
Question
I am trying to work on a small data scraping project because I want to do some data analysis. The data comes from foxsports; the URL links are included in the code. The steps are explained in the comments. If possible, you could just paste and run it.
For the data, I want to loop over the 2013-2018 season web pages and scrape all the data in the table on each page. So my code is here:
import requests
from lxml import html
import csv
# Set up the urls for Bayern Muenchen's Team Stats starting from the 2013-14 Season
# up to the 2017-18 Season
# The data is stored on the foxsports website
urls = ["https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2013&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2014&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2015&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2016&category=STANDARD",
        "https://www.foxsports.com/soccer/bayern-munich-team-stats?competition=4&season=2017&category=STANDARD"
        ]
seasons = ["2013/2014","2014/2015", "2015/2016", "2016/2017", "2017/2018"]
data = ["Season", "Team", "Name", "Games_Played", "Games_Started", "Minutes_Played", "Goals", "Assists", "Shots_On_Goal", "Shots", "Yellow_Cards", "Red_Cards"]
csvFile = "bayern_munich_team_stats_2013_18.csv"
# Having set up the dataframe and urls for various season standard stats, we
# are going to examine the xpath of the same player Lewandowski's same data feature
# for various pages (namely the different season pages)
# See if we can find some pattern
# 2017-18 Season Name xpath:
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# 2016-17 Season Name xpath:
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# 2015-16 Season Name xpath:
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# tr xpath 17-18:
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
# tr xpath 16=17:
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
# tr xpath 15-16:
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
# For a single season's team stats, the tbody and tr relationship is like:
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[2]
# lewandowski
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[1]/td[1]/div/a/span[1]
# Wagner
# //*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[2]/td[1]/div/a/span[1]
# ********
# for each row with player names, the name proceeds with tr[num], num += 1 gives
# new name in a new row.
# ********
i = 0
for url in urls:
    print(url)
    response = requests.get(url)
    result = html.fromstring(response.content)
    j = 1
    for tr in result.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr'):
        # Except for season and team, we open the foxsports webpage for the given team, here
        # Bayern Munich, and the given season, here starting from 13-14, and use F12 to
        # view page elements, look for the tbody of the figure table, then copy the corresponding
        # xpath here. Adjust the xpath as described above.
        season = seasons[i]  # seasons[i] changes with i, but stays the same for each season
        data.append(season)
        team = ["FC BAYERN MUNICH"]  # this doesn't change since we are extracting solely Bayern
        data.append(team)
        name = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[1]/div/a/span[1]' % j)
        data.append(name)
        gamep = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[2]' % j)
        data.append(gamep)
        games = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[3]' % j)
        data.append(games)
        mp = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[4]' % j)
        data.append(mp)
        goals = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[5]' % j)
        data.append(goals)
        assists = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[6]' % j)
        data.append(assists)
        shots_on_goal = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[7]' % j)
        data.append(shots_on_goal)
        shots = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[8]' % j)
        data.append(shots)
        yellow = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[9]' % j)
        data.append(yellow)
        red = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[10]' % j)
        data.append(red)
        # update j for the next row of players
        j += 1
    # update i
    i += 1

with open(csvFile, "w") as file:
    writer = csv.writer(file)
    writer.writerow(data)
print("Done")
I tried to use data.extend([season, name, team, ...]) but the result is still the same, so I just appended everything here. The CSV file content is not what I expected, as you can see in the picture:
I am not quite sure what went wrong; the output shows results like "Element span at XXXXXX#####", and I am still new to programming. I'd really appreciate it if anyone could help me with this issue so I can keep going on this little project, which is only for educational purposes. Thank you very much for your time and help!
Answer
Here is what you can do. I have done this before as well:
import csv

with open(output_file, 'w', newline='') as csvfile:
    field_names = ['f6s_profile', 'linkedin_profile', 'Name', 'job_type', 'Status']
    writer = csv.DictWriter(csvfile, fieldnames=field_names)
    writer.writerow(
        {'f6s_profile': 'F6S Profile', 'linkedin_profile': 'LinkedIn Profile',
         'Name': 'Name', 'job_type': 'Job Type', 'Status': 'Status'})
    for row in data2:
        data = []
        # get your data using selenium
        # data.append(...)

        writer.writerow(
            {'f6s_profile': data[0], 'linkedin_profile': data[1],
             'Name': name_person, 'job_type': data[2], 'Status': status})
The first writer.writerow will be your header, and field_names are just used as keys to fill your data into particular columns.
To get the value out of [<Element td at 0x151ca980638>], you can take the first match and read its text, e.g. data.append(name[0].text).
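To see why the CSV ends up full of "Element span at 0x..." strings, here is a tiny self-contained sketch; the inline HTML stands in for the foxsports page:

```python
from lxml import html

# xpath() returns a LIST of Element objects, not strings. Appending that
# list to your data writes the elements' repr into the CSV.
doc = html.fromstring('<table><tr><td><span>Lewandowski</span></td></tr></table>')

elements = doc.xpath('//span')  # a list like [<Element span at 0x...>]
print(elements)                 # not the player name

# Take the first match and read its text to get a plain string instead
name = elements[0].text
print(name)  # Lewandowski
```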
You can also do this: add [0].text right after your xpath call:

name = tr.xpath('//*[@id="wisfoxbox"]/section[2]/div[1]/table/tbody/tr[%d]/td[1]/div/a/span[1]' % j)[0].text
data.append(name)
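Putting both suggestions together, here is one way the question's loop could be restructured so that each player becomes one CSV row instead of everything landing in a single flat list. This is only a sketch: the tiny inline table, its id "stats", and the shortened column list are stand-ins for the real foxsports pages:

```python
import csv
from lxml import html

# Stand-in for one foxsports season page (structure assumed for illustration)
PAGE = '''
<table id="stats"><tbody>
  <tr><td><div><a><span>Lewandowski</span></a></div></td><td>30</td><td>29</td></tr>
  <tr><td><div><a><span>Wagner</span></a></div></td><td>14</td><td>4</td></tr>
</tbody></table>
'''

rows = [["Season", "Team", "Name", "Games_Played", "Games_Started"]]  # header row
doc = html.fromstring(PAGE)
for tr in doc.xpath('//table[@id="stats"]/tbody/tr'):
    # xpath relative to the current row ('./...'), so no tr[%d] counter is needed
    name = tr.xpath('./td[1]/div/a/span[1]')[0].text
    cells = [td.text_content().strip() for td in tr.xpath('./td[position()>1]')]
    rows.append(["2017/2018", "FC BAYERN MUNICH", name] + cells)

with open("bayern_sketch.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)  # one list per row -> correct CSV layout
```

The key change from the question's code is that each table row is collected into its own list and written with writerows, rather than appending lxml Element objects onto one ever-growing data list.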