Python - Web Scraping - BeautifulSoup & CSV


Problem description



I am hoping to extract the change in cost of living from one city against many cities. I plan to list the cities I would like to compare in a CSV file and use this list to create the web links that take me to the pages with the information I am looking for.

Here is the link to an example: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city

Unfortunately, I am running into several challenges. Any help with the following is greatly appreciated!

  1. The output only shows the percentage, but no indication whether it is more expensive or cheaper. For the example listed above, my output based on the current code shows 48%, 129%, 63%, 43%, 42%, and 42%. I tried to correct for this by adding an 'if-statement' to add '+' sign if it is more expensive, or a '-' sign if it is cheaper. However, this 'if-statement' is not functioning correctly.
  2. When I write the data to a CSV file, each of the percentages is written to a new row. I can't seem to figure out how to write it as a list on one line.
  3. (related to item 2) When I write the data to a CSV file for the example listed above, the data is written in the format listed below. How can I correct the format and have the data written in the preferred format listed below (also without the percentage sign)?

CURRENT CSV FORMAT (Note: 'if-statement' not functioning correctly):

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%

PREFERRED CSV FORMAT:

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,48,129,63,43,42,42

Here is my current code:

import requests
import csv
from bs4 import BeautifulSoup

#Read text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")
HomeCity = 'Phoenix'

i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page  = requests.get(url).text
    soup_expatistan = BeautifulSoup(page)

    #Prepare CSV writer.
    WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
    WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])

    expatistan_table = soup_expatistan.find("table",class_="comparison")
    expatistan_titles = expatistan_table.find_all("tr",class_="expandable")

    for expatistan_title in expatistan_titles:
            percent_difference = expatistan_title.find("th",class_="percent")
            percent_difference_title = percent_difference.span['class']
            if percent_difference_title == "expensiver":
                WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
            else:
                WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
    i+=1

Solution

Answers:

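  • Question 1: BeautifulSoup returns the class attribute of the span as a list, so you need to check whether expensiver is in that list. In other words, replace:

    if percent_difference_title == "expensiver":

    with:

    if "expensiver" in percent_difference.span['class']:

  • Questions 2 and 3: you need to pass a list of column values to writerow(), not a string. A string is itself a sequence, so writerow() writes each character as a separate column, which is what produces the n,e,w,-,y,o,r,k,... rows. And, since you want only one record per city, call writerow() once per city, outside of the inner loop over the tr elements.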
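
Both points can be reproduced without touching the website. The following is a minimal sketch, not the asker's actual data: span_classes is a hand-written stand-in for percent_difference.span['class'], and rows_as_text is a hypothetical helper that writes to an in-memory buffer instead of Expatistan.csv.

```python
import csv
import io

# Stand-in for percent_difference.span['class']; BeautifulSoup returns
# multi-valued attributes such as class as a list of strings.
span_classes = ["expensiver"]
print(span_classes == "expensiver")   # a list never equals a string
print("expensiver" in span_classes)   # membership test is what is needed


def rows_as_text(rows):
    """Write each row with csv.writer and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Passing a string: every character becomes its own column.
as_string = rows_as_text(["new-york-city" + "+" + "48%"])
# Passing a list: every list item becomes one column.
as_list = rows_as_text([["new-york-city", "+48%", "+129%"]])
print(as_string)
print(as_list)
```

The first call shows the one-character-per-column rows from the question; the second shows one column per list item.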

Other issues:

  • open the CSV file for writing once, before the loop; opening it with "w" inside the loop truncates it on every iteration
  • use with context managers when working with files
  • try to follow the PEP8 style guide
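
The first point is easy to check in isolation. This sketch uses a temporary file and made-up city names (both are assumptions for illustration, not part of the original code) to compare reopening the file for every city with opening it once:

```python
import csv
import os
import tempfile


def write_reopening_each_time(path, cities):
    # Mode "w" truncates the file, so each pass discards earlier rows.
    for city in cities:
        with open(path, "w", newline="") as f:
            csv.writer(f).writerow([city])


def write_opening_once(path, cities):
    # The file is opened a single time; all rows are kept.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for city in cities:
            writer.writerow([city])


def read_rows(path):
    with open(path, newline="") as f:
        return list(csv.reader(f))


cities = ["new-york-city", "london", "tokyo"]  # illustrative values
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    write_reopening_each_time(path, cities)
    reopened = read_rows(path)
    write_opening_once(path, cities)
    opened_once = read_rows(path)

print(reopened)
print(opened_once)
```

With the file reopened on every pass, only the last city survives; opened once, all three rows are written.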

Here's the code with modifications:

import requests
import csv
from bs4 import BeautifulSoup

BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'

with open('City.txt') as input_file:
    # newline="" (Python 3) lets the csv module control line endings
    with open("Expatistan.csv", "w", newline="") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
        for line in input_file:
            city = line.strip()
            url = BASE_URL.format(home_city=home_city, city=city)
            # name the parser explicitly so the result does not depend on what is installed
            soup = BeautifulSoup(requests.get(url).text, "html.parser")

            table = soup.find("table", class_="comparison")
            differences = []
            for title in table.find_all("tr", class_="expandable"):
                percent_difference = title.find("th", class_="percent")
                if "expensiver" in percent_difference.span['class']:
                    differences.append('+' + percent_difference.span.string)
                else:
                    differences.append('-' + percent_difference.span.string)
            writer.writerow([city] + differences)

For a City.txt containing just one line, new-york-city, it produces Expatistan.csv with the following content:

City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%
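
The output above still carries the percent sign, while the question asked for bare numbers. One possible way to handle that is sketched here, assuming the span text is always a whole number followed by %. The helper name to_signed_number and the sample values are hypothetical, not taken from the site.

```python
def to_signed_number(text, more_expensive):
    """Turn a cell such as '48%' into 48 or -48 (hypothetical helper)."""
    value = int(text.strip().rstrip("%"))
    return value if more_expensive else -value


# Made-up stand-ins for (span text, "expensiver" in span classes) pairs.
scraped = [("48%", True), ("129%", True), ("12%", False)]
row = ["new-york-city"] + [to_signed_number(t, e) for t, e in scraped]
print(row)
```

A row built this way can be passed straight to writer.writerow(), since csv.writer converts numbers to strings itself.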

Make sure you understand what changes I have made. Let me know if you need further help.
