Export BeautifulSoup scraping results to CSV; scrape + include image values in column
Question
For this project, I am scraping data from a database and attempting to export this data to a spreadsheet for further analysis.
While my code seems mostly to work well, when it comes to the last bit--exporting to CSV--I am having no luck. This question has been asked a few times; however, the answers seem geared towards different approaches, and I haven't had any luck adapting them.
My code is as follows:
from bs4 import BeautifulSoup
import requests
import re

url1 = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno="
url2 = "&totalpages=55&totalcount=1368&secondaryaction=prev25"

date1 = []
date2 = []
date3 = []
party = []
riding = []
candidate = []
winning = []
number = []

for i in range(1, 56):
    r = requests.get(url1 + str(i) + url2)
    data = r.text
    cat = BeautifulSoup(data)
    links = []
    for link in cat.find_all('a', href=re.compile('selectedid=')):
        links.append("http://www.elections.ca" + link.get('href'))
    for link in links:
        r = requests.get(link)
        data = r.text
        cat = BeautifulSoup(data)
        date1.append(cat.find_all('span')[2].contents)
        date2.append(cat.find_all('span')[3].contents)
        date3.append(cat.find_all('span')[5].contents)
        party.append(re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip())
        riding.append(re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip())
        cs = cat.find_all("table")[0].find_all("td", headers="name/1")
        elected = []
        for c in cs:
            elected.append(c.contents[0].strip())
        number.append(len(elected))
        candidate.append(elected)
        winning.append(cs[0].contents[0].strip())
import csv

file = ""
for i in range(0, len(date1)):
    file = [file, date1[i], date2[i], date3[i], party[i], riding[i], "\n"]

with open('filename.csv', 'rb') as file:
    writer = csv.writer(file)
    for row in file:
        writer.writerow(row)
Really--any tips would be GREATLY appreciated. Thanks a lot.
*PART 2: Another question: I previously thought that finding the winning candidate in the table could be simplified by always selecting the first name that appears in the table, as I thought the "winners" always appeared first. However, this is not the case. Whether or not a candidate was elected is stored in the form of a picture in the first column. How would I scrape this and store it in a spreadsheet? It's located under `<td headers>` as:
<img src="/WPAPPS/WPR/Content/Images/selected_box.gif" alt="contestant won this nomination contest">
I had an idea for attempting some sort of Boolean sorting measure, but I am unsure of how to implement it. Thanks a lot.*

UPDATE: This question is now a separate post here.
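One possible approach to that Boolean idea, sketched against a stand-in HTML snippet (the table markup here is illustrative, not copied from the actual elections.ca page): check each name cell for the `selected_box.gif` image and record a True/False flag alongside the name.

```python
from bs4 import BeautifulSoup

# Stand-in HTML approximating the structure described in the question:
# the "elected" status is only conveyed by an <img> inside the name cell.
html = """
<table>
  <tr><td headers="name/1"><img src="/WPAPPS/WPR/Content/Images/selected_box.gif"
       alt="contestant won this nomination contest"> Jane Doe</td></tr>
  <tr><td headers="name/1">John Roe</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for td in soup.find_all("td", headers="name/1"):
    # The cell holds the winner image only if the candidate was elected,
    # so the presence of that <img> is the Boolean flag we want.
    won = td.find("img", src=lambda s: s and "selected_box" in s) is not None
    results.append((td.get_text(strip=True), won))

print(results)  # [('Jane Doe', True), ('John Roe', False)]
```

The resulting flag can then be stored as its own column in each CSV row.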
Answer
The following should correctly export your data to a CSV file:
from bs4 import BeautifulSoup
import requests
import re
import csv

url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"

rows = []

for i in range(1, 56):
    print(i)
    r = requests.get(url.format(i))
    data = r.text
    cat = BeautifulSoup(data, "html.parser")
    links = []
    for link in cat.find_all('a', href=re.compile('selectedid=')):
        links.append("http://www.elections.ca" + link.get('href'))
    for link in links:
        r = requests.get(link)
        data = r.text
        cat = BeautifulSoup(data, "html.parser")
        lspans = cat.find_all('span')
        cs = cat.find_all("table")[0].find_all("td", headers="name/1")
        elected = []
        for c in cs:
            elected.append(c.contents[0].strip())
        rows.append([
            lspans[2].contents[0],
            lspans[3].contents[0],
            lspans[5].contents[0],
            re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
            re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip(),
            len(elected),
            cs[0].contents[0].strip()
        ])

# Keep everything as text and set the encoding on the file itself:
# in Python 3, calling .encode() on the fields would write b'...' repr
# strings into a text-mode CSV.
with open('filename.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(rows)
Giving you the following kind of output in your CSV file:
"September 17, 2016","September 13, 2016","September 17, 2016",Liberal,Medicine Hat--Cardston--Warner,1,Stanley Sakamoto
"June 25, 2016","May 12, 2016","June 25, 2016",Conservative,Medicine Hat--Cardston--Warner,6,Brian Benoit
"September 28, 2015","September 28, 2015","September 28, 2015",Liberal,Cowichan--Malahat--Langford,1,Luke Krayenhoff
There is no need to build up lots of separate lists for each column of your data; it is easier to build a list of rows directly. That list can then be written to the CSV in one go (or a row at a time as you are gathering the data).
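The row-at-a-time variant mentioned above looks like this in outline (a minimal sketch using hard-coded sample rows in place of the scraping loop):

```python
import csv

# Sample rows standing in for the data assembled inside the scraping loop.
rows = [
    ["September 17, 2016", "Liberal", "Medicine Hat--Cardston--Warner"],
    ["June 25, 2016", "Conservative", "Medicine Hat--Cardston--Warner"],
]

# Open the file once, before the loop, and write each row as soon as it
# is assembled instead of accumulating everything in memory first.
with open("filename.csv", "w", newline="") as f_output:
    writer = csv.writer(f_output)
    for row in rows:  # in the scraper, this would be the per-link loop
        writer.writerow(row)
```

Writing incrementally means partial results survive if the scrape fails midway through the 55 pages, at the cost of holding the file open for the whole run.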