Scraping XML element attributes with BeautifulSoup

Question

I have the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://api.stlouisfed.org/fred/...")
bsObj = BeautifulSoup(html.read(), "lxml")

print(bsObj)

It returns something like this:

<?xml version="1.0" encoding="utf-8" ?><html><body><observations count="276" file_type="xml" limit="100000" observation_end="9999-12-31" observation_start="1776-07-04" offset="0" order_by="observation_date" output_type="1" realtime_end="2016-06-22" realtime_start="2016-06-22" sort_order="asc" units="lin">
<observation date="1947-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.4"></observation>
<observation date="1948-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6"></observation>
<observation date="1948-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.7"></observation>
<observation date="1948-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="2.3"></observation>
<observation date="1948-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="0.4"></observation>
<observation date="1949-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-5.4"></observation>
<observation date="1949-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-1.3"></observation>
<observation date="1949-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="4.5"></observation>
<observation date="1949-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-3.5"></observation>
<observation date="1950-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.9"></observation>
<observation date="1950-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="12.7"></observation>
<observation date="1950-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.3"></observation>
</observations>
</body></html>

I want to extract only the "date" and the "value", so that in the end I have something like this:

1947-04-01 -0.4
1947-07-01 -0.4
1947-10-01 6.4
1948-01-01 6
and so on...

So far I'm using replace() to strip the markup and the csv module to write the csv file:

string = str(bsObj)

string = string.replace("realtime_start=","")
string = string.replace("realtime_end=","")
string = string.replace("observation","")
string = string.replace("date=","")
string = string.replace('"2016-06-22"',"")
string = string.replace("value=","")
string = string.replace("<","")
string = string.replace(">","")
string = string.replace("/","")
string = string.replace('"',"")
print(string)

import csv
with open('test.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    data = string
    a.writerows(data)

This, though, is almost a disaster. It pushes the text into the csv, but every symbol ends up on its own row.
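
The underlying reason is that csv.writer(...).writerows() expects an iterable of rows, and iterating a plain string yields one character at a time, so every character becomes its own one-field row. A minimal sketch of the effect (chars.csv is just an illustrative filename):

import csv

with open('chars.csv', 'w', newline='') as fp:
    # writerows() iterates its argument; a string yields single characters,
    # so each character of "1947-04-01" is written out as its own row.
    csv.writer(fp, delimiter=',').writerows("1947-04-01")

# chars.csv now contains one character per line: 1, 9, 4, 7, -, 0, ...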

I want to know if there is a more elegant way to extract what I need. For example:

for line in f:
   extract "date" and "value"

or similar. And what is the most appropriate way to insert it into a .csv file? I'll be rewriting the .csv file every time I call this script. The fields have to be separated by "," and the lines by "\n".

Answer

Find all the observation tags and extract just the attributes you want:

x = """<?xml version="1.0" encoding="utf-8" ?><html><body><observations count="276" file_type="xml" limit="100000" observation_end="9999-12-31" observation_start="1776-07-04" offset="0" order_by="observation_date" output_type="1" realtime_end="2016-06-22" realtime_start="2016-06-22" sort_order="asc" units="lin">
<observation date="1947-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.4"></observation>
<observation date="1948-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6"></observation>
<observation date="1948-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.7"></observation>
<observation date="1948-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="2.3"></observation>
<observation date="1948-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="0.4"></observation>
<observation date="1949-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-5.4"></observation>
<observation date="1949-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-1.3"></observation>
<observation date="1949-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="4.5"></observation>
<observation date="1949-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-3.5"></observation>
<observation date="1950-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.9"></observation>
<observation date="1950-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="12.7"></observation>
<observation date="1950-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.3"></observation>
</observations>
</body></html>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(x,"lxml")

for ob in soup.find_all("observation"):
    print(ob["date"])
    print(ob["value"])

Which will give you:

1947-04-01
-0.4
1947-07-01
-0.4
1947-10-01
6.4
1948-01-01
6
1948-04-01
6.7
1948-07-01
2.3
1948-10-01
0.4
1949-01-01
-5.4
1949-04-01
-1.3
1949-07-01
4.5
1949-10-01
-3.5
1950-01-01
16.9
1950-04-01
12.7
1950-07-01
16.3
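
As a side note, since the response is XML rather than HTML, BeautifulSoup can also be handed the "xml" parser (it likewise relies on lxml being installed); the attribute access is identical:

from bs4 import BeautifulSoup

# "xml" selects lxml's XML parser: tag-name case is preserved and no
# <html>/<body> wrapper is added around the document.
soup = BeautifulSoup(x, "xml")

for ob in soup.find_all("observation"):
    print(ob["date"], ob["value"])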

To write it to a csv file:

from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(x, "lxml")
with open("out.csv", "w", newline="") as f:
    csv.writer(f).writerows((ob["date"], ob["value"])
                            for ob in soup.find_all("observation"))

Which gives you a csv file containing:

1947-04-01,-0.4
1947-07-01,-0.4
1947-10-01,6.4
1948-01-01,6
1948-04-01,6.7
1948-07-01,2.3
1948-10-01,0.4
1949-01-01,-5.4
1949-04-01,-1.3
1949-07-01,4.5
1949-10-01,-3.5
1950-01-01,16.9
1950-04-01,12.7
1950-07-01,16.3
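
Putting the pieces together, here is a sketch of the full fetch-and-write flow. The URL is left truncated exactly as in the question; substitute the real FRED endpoint and api_key. The newline="" argument keeps the csv module from inserting blank lines on Windows:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

# Truncated placeholder -- fill in the actual FRED request parameters.
url = "https://api.stlouisfed.org/fred/..."

with urlopen(url) as response:
    soup = BeautifulSoup(response.read(), "lxml")

# The file is overwritten on every run, as the question requires.
with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "value"])  # optional header row
    writer.writerows((ob["date"], ob["value"])
                     for ob in soup.find_all("observation"))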
