Python - Web Scraping - BeautifulSoup & CSV
Question:
I am hoping to extract the change in cost of living between one home city and many other cities. I plan to list the cities I would like to compare in a CSV file and use this list to build the web links that take me to the pages with the information I am looking for.
Here is the link to an example: http://www.expatistan.com/cost-of-living/comparison/phoenix/new-york-city
Unfortunately, I am running into several challenges. Any assistance with the following is greatly appreciated!
- The output only shows the percentage, with no indication of whether the city is more expensive or cheaper. For the example listed above, my output based on the current code shows 48%, 129%, 63%, 43%, 42%, and 42%. I tried to correct for this by adding an 'if-statement' that adds a '+' sign if it is more expensive, or a '-' sign if it is cheaper. However, this 'if-statement' is not functioning correctly.
- When I write the data to a CSV file, each of the percentages is written to a new row. I can't seem to figure out how to write it as a list on one line.
- (related to item 2) When I write the data to a CSV file for the example listed above, the data is written in the format listed below. How can I correct the format and have the data written in the preferred format listed below (also without the percentage sign)?
CURRENT CSV FORMAT (Note: 'if-statement' not functioning correctly):
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,8,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,1,2,9,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,6,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,3,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
n,e,w,-,y,o,r,k,-,c,i,t,y,-,4,2,%
PREFERRED CSV FORMAT:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city, 48,129,63,43,42,42
Here is my current code:
import requests
import csv
from bs4 import BeautifulSoup
#Read text file
Textfile = open("City.txt")
Textfilelist = Textfile.read()
Textfilelistsplit = Textfilelist.split("\n")
HomeCity = 'Phoenix'
i=0
while i<len(Textfilelistsplit):
    url = "http://www.expatistan.com/cost-of-living/comparison/" + HomeCity + "/" + Textfilelistsplit[i]
    page = requests.get(url).text
    soup_expatistan = BeautifulSoup(page)
    #Prepare CSV writer.
    WriteResultsFile = csv.writer(open("Expatistan.csv","w"))
    WriteResultsFile.writerow(["City","Food","Housing","Clothes","Transportation","Personal Care", "Entertainment"])
    expatistan_table = soup_expatistan.find("table",class_="comparison")
    expatistan_titles = expatistan_table.find_all("tr",class_="expandable")
    for expatistan_title in expatistan_titles:
        percent_difference = expatistan_title.find("th",class_="percent")
        percent_difference_title = percent_difference.span['class']
        if percent_difference_title == "expensiver":
            WriteResultsFile.writerow(Textfilelistsplit[i] + '+' + percent_difference.span.string)
        else:
            WriteResultsFile.writerow(Textfilelistsplit[i] + '-' + percent_difference.span.string)
    i+=1
Answers:
- Question 1: the class of the span is a list, so you need to check whether expensiver is inside this list. In other words, replace:
    if percent_difference_title == "expensiver"
  with:
    if "expensiver" in percent_difference.span['class']
- Questions 2 and 3: you need to pass a list of column values to writerow(), not a string. And, since you want only one record per city, call writerow() outside of the loop (over the trs).
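Both points can be reproduced without touching the website. The following is a minimal sketch using only the standard library: the list span_classes is a stand-in for what percent_difference.span['class'] returns (BeautifulSoup gives multi-valued attributes such as class back as a list of strings), and the helper rows_written and the sample values are made up for illustration.

```python
import csv
import io

# Stand-in for percent_difference.span['class']; BeautifulSoup returns
# the class attribute as a list of strings, not a single string.
span_classes = ["expensiver"]

is_equal = span_classes == "expensiver"    # list compared with a str
is_member = "expensiver" in span_classes   # membership test on the list


def rows_written(row):
    """Write one row to an in-memory CSV and return the resulting text.

    Hypothetical helper used only for this demonstration.
    """
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


# writerow() iterates over its argument. A string is iterated character
# by character, so every character ends up in its own column.
from_string = rows_written("nyc" + "+" + "48%")

# A list produces one column per element.
from_list = rows_written(["nyc", "+48%"])

print(is_equal, is_member)
print(from_string, end="")
print(from_list, end="")
```

The equality test is false while the membership test is true, which is why the else branch always ran in the question's code. The string row comes out as n,y,c,+,4,8,% and the list row as nyc,+48%, matching the broken and the intended CSV shapes respectively.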
Other issues:
- open the csv file for writing before the loop
- use with context managers while working with files
- try to follow the PEP8 style guide
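To see why the file should be opened before the loop, here is a small sketch that writes to a temporary file, so it does not depend on City.txt or the website. The city names and the demo.csv path are placeholders. Passing newline="" to open() is what the Python 3 csv documentation recommends; the code in this thread does not do so.

```python
import csv
import os
import tempfile

cities = ["new-york-city", "chicago"]  # placeholder input
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Opening in "w" mode inside the loop truncates the file on every
# iteration, so only the last city survives.
for city in cities:
    with open(path, "w", newline="") as output_file:
        csv.writer(output_file).writerow([city])
with open(path, newline="") as result_file:
    reopened_each_time = list(csv.reader(result_file))

# Opening once before the loop keeps every row, and the with block
# closes the file even if an exception is raised inside it.
with open(path, "w", newline="") as output_file:
    writer = csv.writer(output_file)
    for city in cities:
        writer.writerow([city])
with open(path, newline="") as result_file:
    opened_once = list(csv.reader(result_file))

print(reopened_each_time)
print(opened_once)
```

With a single city in City.txt the difference is invisible, which is likely why the original code appeared to work apart from the row format.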
Here's the code with modifications:
import requests
import csv
from bs4 import BeautifulSoup
BASE_URL = 'http://www.expatistan.com/cost-of-living/comparison/{home_city}/{city}'
home_city = 'Phoenix'
with open('City.txt') as input_file:
    with open("Expatistan.csv", "w") as output_file:
        writer = csv.writer(output_file)
        writer.writerow(["City", "Food", "Housing", "Clothes", "Transportation", "Personal Care", "Entertainment"])
        for line in input_file:
            city = line.strip()
            url = BASE_URL.format(home_city=home_city, city=city)
            soup = BeautifulSoup(requests.get(url).text)
            table = soup.find("table", class_="comparison")
            differences = []
            for title in table.find_all("tr", class_="expandable"):
                percent_difference = title.find("th", class_="percent")
                if "expensiver" in percent_difference.span['class']:
                    differences.append('+' + percent_difference.span.string)
                else:
                    differences.append('-' + percent_difference.span.string)
            writer.writerow([city] + differences)
For a City.txt containing just one new-york-city line, it produces Expatistan.csv with the following content:
City,Food,Housing,Clothes,Transportation,Personal Care,Entertainment
new-york-city,+48%,+129%,+63%,+43%,+42%,+42%
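This output still carries the percent sign, while the preferred format in the question asked for bare numbers. One possible way to drop it is sketched below; the helper signed_percent is hypothetical and not part of the answer's code, and it assumes the span text looks like "48%".

```python
def signed_percent(text, more_expensive):
    """Turn scraped text such as '48%' into '+48' or '-48'.

    Hypothetical helper: `text` stands for percent_difference.span.string
    and `more_expensive` for the result of the "expensiver" class check.
    """
    number = text.strip().rstrip("%")
    sign = "+" if more_expensive else "-"
    return sign + number


# Sample values taken from the output shown in this thread.
scraped = ["48%", "129%", "63%"]
row = ["new-york-city"] + [signed_percent(value, True) for value in scraped]
print(row)
```

In the modified code, the two differences.append(...) calls could then be replaced by a single call that appends the helper's return value.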
Make sure you understand the changes I have made. Let me know if you need further help.