Parsing XML with Beautiful Soup
Edit: resolved. Thought I'd add my answer at the bottom...
Note: the desired output is a bunch of lines like
US D0591026
I have data that looks like the following in XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0591026-20090428.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20090414" date-publ="20090428">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0591026</doc-number>
<kind>S1</kind>
<date>20090428</date>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29303426</doc-number>
<date>20080208</date>
</document-id>
</application-reference>
<us-application-series-code>29</us-application-series-code>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>CA</country>
<doc-number>122078</doc-number>
<date>20070830</date>
</priority-claim>
</priority-claims>
<us-term-of-grant>
<length-of-grant>14</length-of-grant>
</us-term-of-grant>
<classification-locarno>
<edition>9</edition>
<main-classification>0101</main-classification>
</classification-locarno>
<classification-national>
<country>US</country>
<main-classification>D 1106</main-classification>
</classification-national>
<invention-title id="d0e71">Edible fruit product in the shape of a rocketship</invention-title>
<references-cited>
I am trying to pull out the country and the document number. I've gotten to this point:
import os
import io
from bs4 import BeautifulSoup
import csv
import requests

directory_in_str = 'C:/Users/somedirectory'
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    full_name = directory_in_str + filename
    handler = open(full_name).read()
    soup = BeautifulSoup(handler, 'lxml')
    patents = soup.find_all('us-patent-grant')
    pub_ref = soup.find_all('publication-reference')
    country = soup.find_all('country')
    doc_num = soup.find_all('doc-number')
    for patent in pub_ref:
        for doc_num in patent:
            print(doc_num)
        continue
This prints out a nice block that includes those elements, but everything I have tried in order to get at those two specific elements (and then concatenate them) has failed. I've been able to do it with string operations, but the dataset isn't consistently formatted enough (later I will be pulling out text fields without a standard length) for me to feel confident performing the whole analysis by slicing strings.
Any ideas how I can drill down into those further tags and return just those two elements?
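For drilling into nested tags, `Tag.find()` can be called on any element to search only within that element's subtree. A minimal, self-contained sketch of the idea (the XML fragment is inlined here rather than read from the asker's files, and the built-in `html.parser` is used so there is no lxml dependency):

```python
from bs4 import BeautifulSoup

# Inlined fragment standing in for one patent file
xml = """<publication-reference><document-id>
<country>US</country><doc-number>D0591026</doc-number>
</document-id></publication-reference>"""

soup = BeautifulSoup(xml, "html.parser")
pub_ref = soup.find("publication-reference")
country = pub_ref.find("country")      # first <country> inside this subtree only
doc_num = pub_ref.find("doc-number")   # first <doc-number> inside this subtree only
print(country.text, doc_num.text)      # US D0591026
```

Because `find()` is scoped to `pub_ref`, it ignores the `<country>` tags that appear later under `application-reference` and `priority-claim`.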
Ok, so I have made some changes, and gotten my code to:
import os
import io
from bs4 import BeautifulSoup
import csv
import requests

directory_in_str = 'C:/somedir'
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    full_name = directory_in_str + filename
    handler = open(full_name).read()
    soup = BeautifulSoup(handler, 'lxml')
    patents = soup.find_all('us-patent-grant')
    pub_ref = soup.find_all('publication-reference')
    for patent in pub_ref:
        country = patent.find_all('country')
        doc_num = patent.find_all('doc-number')
        print(country + doc_num)
        continue
That gives me most of what I want. I am getting this:
[<country>US</country>, <doc-number>D0591026</doc-number>]
but what I want is just:
US D0591026
I understand the type of the object is a bs4 ResultSet, but I am not familiar enough with the API to return only the text inside the tags. Eventually this is going to a CSV, so I don't want those tags in there.
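A ResultSet behaves like a list of Tag objects, and each Tag exposes its inner text via `.get_text()` (or the `.text` property). A short sketch on an inlined fragment, using the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# find_all returns a ResultSet of Tag objects; .get_text() returns
# the text inside a tag without the surrounding markup
soup = BeautifulSoup("<country>US</country><doc-number>D0591026</doc-number>", "html.parser")
country = soup.find_all('country')
doc_num = soup.find_all('doc-number')
print(country[0].get_text(), doc_num[0].get_text())  # US D0591026
```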
I converted the soup objects to strings and used regular expressions to get the desired output:

...
import re
...
...
    country = patent.find_all('country')
    doc_num = patent.find_all('doc-number')
    country_str = str(country)
    doc_num_str = str(doc_num)
    country_str2 = re.search('>(.*)<', country_str)
    doc_num_str2 = re.search('>(.*)<', doc_num_str)
    print(country_str2.group(1) + ' ' + doc_num_str2.group(1))
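One caveat about this regex workaround: `str()` on a ResultSet includes the list brackets and every tag in it, and the greedy `(.*)` spans from the first `>` to the last `<`. It only yields clean text here because each `find_all` inside the per-patent loop happens to return a single tag. A minimal demonstration of how it over-matches otherwise:

```python
import re

# The greedy pattern '>(.*)<' is fine when the stringified ResultSet
# holds exactly one tag, but grabs too much when it holds several:
one = "[<country>US</country>]"
many = "[<country>US</country>, <country>CA</country>]"
print(re.search('>(.*)<', one).group(1))   # US
print(re.search('>(.*)<', many).group(1))  # US</country>, <country>CA
```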
To get a list with each doc-number and its related country using a list comprehension and zip, a simple one-liner would be:
>>> [(country.text,number.text) for country, number in zip(soup.findAll("country"), soup.findAll("doc-number"))]
[('US', 'D0591026'), ('US', '29303426'), ('CA', '122078')]
Or perhaps a more readable way if you are not used to list comprehensions:
>>> lst = []
>>> for country, number in zip(soup.findAll("country"), soup.findAll("doc-number")):
...     print(country.text, number.text)
...     lst.append((country.text, number.text))
US D0591026
US 29303426
CA 122078
>>> lst
[('US', 'D0591026'), ('US', '29303426'), ('CA', '122078')]
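Since the asker's end goal is a CSV, the list of pairs produced above can be written out with the `csv` module already imported in the original script. A sketch, where `doc_numbers.csv` is a placeholder filename not taken from the original post:

```python
import csv

# pairs as produced by the zip approach above
rows = [('US', 'D0591026'), ('US', '29303426'), ('CA', '122078')]

# 'doc_numbers.csv' is a placeholder path for illustration
with open('doc_numbers.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['country', 'doc_number'])  # header row
    writer.writerows(rows)
```

`newline=''` is the documented way to open CSV files for writing on all platforms, avoiding blank lines on Windows.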