Parsing XML with Beautiful Soup


Question


Edit: resolved. Thought I'd add my answer at the bottom...

Note: the desired output is a bunch of lines like

US D0591026

I have data that looks like the following in XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0591026-20090428.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20090414" date-publ="20090428">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>D0591026</doc-number>
<kind>S1</kind>
<date>20090428</date>
</document-id>
</publication-reference>
<application-reference appl-type="design">
<document-id>
<country>US</country>
<doc-number>29303426</doc-number>
<date>20080208</date>
</document-id>
</application-reference>
<us-application-series-code>29</us-application-series-code>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>CA</country>
<doc-number>122078</doc-number>
<date>20070830</date>
</priority-claim>
</priority-claims>
<us-term-of-grant>
<length-of-grant>14</length-of-grant>
</us-term-of-grant>
<classification-locarno>
<edition>9</edition>
<main-classification>0101</main-classification>
</classification-locarno>
<classification-national>
<country>US</country>
<main-classification>D 1106</main-classification>
</classification-national>
<invention-title id="d0e71">Edible fruit product in the shape of a rocketship</invention-title>
<references-cited>

I am trying to pull out the country, and the document number. I've gotten to this point:

import os
import io
from bs4 import BeautifulSoup
import csv
import requests

directory_in_str = 'C:/Users/somedirectory'
directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    full_name = os.path.join(directory_in_str, filename)  # plain '+' would drop the path separator
    handler = open(full_name).read()
    soup = BeautifulSoup(handler, 'lxml')
    patents=soup.find_all('us-patent-grant')
    pub_ref=soup.find_all('publication-reference')
    country=soup.find_all('country')
    doc_num=soup.find_all('doc-number')
    for patent in pub_ref:
        for doc_num in patent:
            print(doc_num)

    continue

With this I can print out a nice block that includes those elements (which is what the code above does), but everything I have tried in order to get at those two specific elements (and then concatenate them) has failed. I've been able to do it with string operations, but the dataset isn't consistently formatted enough (I will be pulling out text fields without a standard length later) for me to feel confident performing the whole analysis by slicing strings.

Any ideas how I can drill down into those further tags and return just those two elements?

Ok, so I have made some changes, and gotten my code to:

import os
import io
from bs4 import BeautifulSoup
import csv
import requests

directory_in_str = 'C:/somedir'

directory = os.fsencode(directory_in_str)

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    full_name = os.path.join(directory_in_str, filename)  # plain '+' would drop the path separator
    handler = open(full_name).read()
    soup = BeautifulSoup(handler, 'lxml')
    patents=soup.find_all('us-patent-grant')
    pub_ref=soup.find_all('publication-reference')
    for patent in pub_ref:
        country = patent.find_all('country')
        doc_num = patent.find_all('doc-number')
        print(country + doc_num)

    continue

Which gives me most of what I want. I am getting this:

[<country>US</country>, <doc-number>D0591026</doc-number>]

but what I want is just:

US D0591026

I understand that the object is a bs4 ResultSet, but I am not familiar enough with it to return only the contents inside the tags. Eventually this is going to a CSV, so I don't want those tags in there.

I converted the soup objects to strings and used regular expressions to get the desired output:

...
import re
...
...
        country = patent.find_all('country')
        doc_num = patent.find_all('doc-number')
        country_str = str(country)
        doc_num_str = str(doc_num)
        country_str2 = re.search('>(.*)<', country_str)
        doc_num_str2 = re.search('>(.*)<', doc_num_str)
        print(country_str2.group(1) + ' ' + doc_num_str2.group(1))
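
Since these values are eventually headed for a CSV (and csv is already imported at the top of the script), the writing step might look something like the sketch below; the output filename and header names are placeholders, not part of the original code:

...
# Sketch only: open the output file once, before the directory loop
out_file = open('output.csv', 'w', newline='')
writer = csv.writer(out_file)
writer.writerow(['country', 'doc_number'])
...
        # inside the per-patent loop, after the regex matches above
        writer.writerow([country_str2.group(1), doc_num_str2.group(1)])
...
out_file.close()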

Solution

To get a list of each doc-number with its related country using a list comprehension and zip, a simple one-liner would be:

>>> [(country.text,number.text) for country, number in zip(soup.findAll("country"), soup.findAll("doc-number"))]
[('US', 'D0591026'), ('US', '29303426'), ('CA', '122078')]
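
(zip pairs the tags purely by position and stops at the shorter sequence, so the final <country>US</country> under classification-national, which has no matching <doc-number>, is simply dropped; the pairing comes out right here because each doc-number is preceded by its own country tag.)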

Or perhaps a more readable way if you are not used to list comprehensions:

>>> lst = []
>>> for country, number in zip(soup.findAll("country"), soup.findAll("doc-number")):
    print(country.text, number.text)
    lst.append((country.text, number.text))


US D0591026
US 29303426
CA 122078
>>> lst
[('US', 'D0591026'), ('US', '29303426'), ('CA', '122078')]
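
If only the publication-reference pair from the question ("US D0591026") is wanted, one option (a sketch reusing the same soup object) is to narrow the search to that tag before reading the text:

>>> pub = soup.find("publication-reference")
>>> print(pub.find("country").text, pub.find("doc-number").text)
US D0591026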
