Scraping: add data stored as a picture to CSV file in Python 3.5


Question

For this project, I am scraping data from a database and attempting to export it to a spreadsheet for further analysis. (Previously posted here; thanks for the help over there reworking my code!)

I previously thought that finding the winning candidate in the table could be simplified by always selecting the first name that appears in the table, because I assumed the "winners" always appeared first. However, this is not the case.

Whether or not a candidate was elected is stored in the form of a picture in the first column. How would I scrape this and store it in a spreadsheet?

It's located under <td headers> as:

<img src="/WPAPPS/WPR/Content/Images/selected_box.gif" alt="contestant won this nomination contest">

My question is: how would I use BeautifulSoup to parse the HTML table and extract a value from the first column, which is stored in the table as an image rather than text?

I had an idea for attempting some sort of Boolean sorting measure, but I am unsure how to implement it.

My code is as follows:

from bs4 import BeautifulSoup
import requests
import re
import csv


url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"
rows = []

for i in range(1, 56):
    print(i)
    r  = requests.get(url.format(i))
    data = r.text
    cat = BeautifulSoup(data, "html.parser")
    links = []

    for link in cat.find_all('a', href=re.compile('selectedid=')):
        links.append("http://www.elections.ca" + link.get('href'))  

    for link in links:
        r  = requests.get(link)
        data = r.text
        cat = BeautifulSoup(data, "html.parser")
        lspans = cat.find_all('span')
        cs = cat.find_all("table")[0].find_all("td", headers="name/1")        
        elected = []

        for c in cs:
            elected.append(c.contents[0].strip())

        rows.append([
            lspans[2].contents[0], 
            lspans[3].contents[0], 
            lspans[5].contents[0],
            re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
            re.sub("[\n\r/]", "",  cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
            len(elected),
            cs[0].contents[0].strip().encode('latin-1')
            ])

with open('filename.csv', 'w', newline='') as f_output:
   csv_output = csv.writer(f_output)
   csv_output.writerows(rows)

Any help would be appreciated. Thank you very much.

Recommended answer

This snippet will print the name of the elected person:

from bs4 import BeautifulSoup
import requests

# Fetch the details page of a single nomination contest.
req = requests.get("http://www.elections.ca/WPAPPS/WPR/EN/NC/Details?province=-1&distyear=2013&district=-1&party=-1&selectedid=8548")
page_source = BeautifulSoup(req.text, "html.parser")
table = page_source.find("table", {"id": "gvContestants/1"})
for row in table.find_all("tr"):
    img = row.find("img")
    # Skip rows that contain no image at all.
    if img is None:
        continue
    # The winner's row is marked with selected_box.gif.
    if "selected_box.gif" in img.get("src", ""):
        # Print the name cell with all whitespace removed.
        print(''.join(row.find("td", {"headers": "name/1"}).text.split()))
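The same check can be turned into the Boolean flag the question asks about. What follows is only a minimal sketch: the function name contestants_with_flags, the sample markup, and the "elected/1" header id are hypothetical stand-ins modelled on the description in the question, not taken from the real page. It runs on an inline HTML string so it can be tried without hitting the site, and it joins name parts with a single space rather than removing all whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question's description of the page.
# The "elected/1" header id is an assumption made for illustration only.
SAMPLE_HTML = """
<table id="gvContestants/1">
  <tr><th id="elected/1">Elected</th><th id="name/1">Name</th></tr>
  <tr>
    <td headers="elected/1"></td>
    <td headers="name/1"> Jane Doe </td>
  </tr>
  <tr>
    <td headers="elected/1"><img src="/WPAPPS/WPR/Content/Images/selected_box.gif" alt="contestant won this nomination contest"></td>
    <td headers="name/1"> John Smith </td>
  </tr>
</table>
"""


def contestants_with_flags(html):
    """Return a list of (name, elected) tuples, one per contestant row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "gvContestants/1"})
    results = []
    for row in table.find_all("tr"):
        name_cell = row.find("td", {"headers": "name/1"})
        if name_cell is None:
            continue  # no name cell in this row (the header row in the sample)
        img = row.find("img")
        # Match the file name exactly so a differently named image is not counted.
        elected = img is not None and img.get("src", "").endswith("/selected_box.gif")
        # Collapse runs of whitespace in the name to single spaces.
        name = " ".join(name_cell.get_text().split())
        results.append((name, elected))
    return results


print(contestants_with_flags(SAMPLE_HTML))
```

In the question's loop, the HTML passed in would be the r.text of each details page, and the resulting flag could be added as an extra column in each row before writing the CSV.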



As a side note, please refrain from declaring variables with meaningless names. It hurts the eyes of anyone trying to help you, and it will hurt you in the future when you look at the code again.
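To illustrate the naming advice on the export step from the question, here is a small sketch that writes an elected flag to CSV using descriptive names. The function write_contest_rows, the column names, and the sample rows are all hypothetical; the sketch writes to an in-memory buffer so it can be run without creating a file.

```python
import csv
import io


def write_contest_rows(contest_rows, output_file):
    """Write one CSV row per contestant, with the elected flag in its own column."""
    writer = csv.writer(output_file)
    writer.writerow(["district", "contestant_name", "elected"])
    for district, contestant_name, elected in contest_rows:
        writer.writerow([district, contestant_name, "yes" if elected else "no"])


# Hypothetical sample data standing in for the scraped values.
contest_rows = [
    ("Example District", "Jane Doe", False),
    ("Example District", "John Smith", True),
]

buffer = io.StringIO()
write_contest_rows(contest_rows, buffer)
print(buffer.getvalue())
```

With a real file, the buffer would be replaced by open('filename.csv', 'w', newline='') as in the question. Passing plain strings rather than the result of .encode('latin-1') is likely preferable in Python 3, since csv.writer writes bytes objects in their b'...' representation.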
