Use BeautifulSoup to iterate over XML to pull specific tags and store in variable
Question
I'm fairly new to programming and have been trying to find a solution for this but all I can find are bits and pieces with no real luck putting it all together.
I'm trying to use BeautifulSoup4 in Python to scrape some XML and store the text value in between specific tags in variables. The data is from a med student training program, and right now everything needed has to be found manually, so I'm trying to increase efficiency a bit with a scraping program.
Let's say for example that I was looking at this type of test data to experiment with:
<AllergyList>
    <Allergy>
        <Deleted>n</Deleted>
        <Status>
            <Active/>
        </Status>
        <ExternalID/>
        <Patient>
            <ExternalID/>
            <FirstName>Testcase</FirstName>
            <LastName>casetest</LastName>
        </Patient>
        <Allergen>
            <Name>Flagyl (metronidazole)</Name>
            <Drug>
                <NDCID>00025182151,00025182131,00025182150</NDCID>
            </Drug>
        </Allergen>
        <Reaction>difficulty breathing</Reaction>
        <OnsetDate>02/02/2013</OnsetDate>
    </Allergy>
    <Allergy>
        <Deleted>n</Deleted>
        <Status>
            <Active/>
        </Status>
        <ExternalID/>
        <Patient>
            <ExternalID/>
            <FirstName>Testcase</FirstName>
            <LastName>casetest</LastName>
        </Patient>
        <Allergen>
            <Name>Bactrim (sulfamethoxazole-trimethoprim)</Name>
            <Drug>
                <NDCID>13310014501,49999023220</NDCID>
            </Drug>
        </Allergen>
        <Reaction>swelling</Reaction>
        <OnsetDate>05/03/2002</OnsetDate>
    </Allergy>
    <Number>2</Number>
</AllergyList>
I've been trying to pull the <Name> tag from inside each of the multiple <Allergen> tags, as well as the corresponding data from the <OnsetDate> and <Reaction> tags, while storing the results in respective variables.
So, for example, I would want to pull Flagyl (metronidazole), difficulty breathing, 02/02/2013, then Bactrim (sulfamethoxazole-trimethoprim), swelling, 05/03/2002, and so on, while placing them in separate variables that I can use later.
Pulling the first set from the <Allergen> tag is easy, but I'm having trouble figuring out how to iterate over the XML and store the pulled data in variables. I've been trying to use a for loop while storing the data in an array or list, but the way I've been writing it, I always pull the same data over and over again, as many times as the number of iterations I calculate from the len() function, and I have so far failed to store any of it in an array.
I've been racking my brain about this for a while now and I think I may just not be that smart so any help or even pointing me in the right direction would be immensely appreciated.
Recommended answer
This seems like a simple task because there aren't many nested tags:
from bs4 import BeautifulSoup
import sys

# Parse the XML file named on the command line (the 'xml' parser requires lxml).
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'xml')

allergies = []
# Look up each field relative to the current <Allergy> element.
for allergy in soup.find_all('Allergy'):
    d = {
        'name': allergy.Allergen.Name.string,
        'reaction': allergy.Reaction.string,
        'on_set_date': allergy.OnsetDate.string,
    }
    allergies.append(d)

# Use the 'allergies' list of dictionaries as you want.
# Example:
print(allergies[1]['reaction'])
Run it with the XML file as argument:
python3 script.py xmlfile
This test yields:
swelling
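The question doesn't show the loop that kept returning the same data, so the following is only a guess at the cause: a common way to get that symptom is to search from the top-level soup object inside the loop, which always returns the first match. The sketch below contrasts that with searching relative to each <Allergy> element, then unpacks the results into separate lists. The inlined XML is a trimmed copy of the sample data, and text_of is a hypothetical helper introduced for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's sample data, inlined so the sketch is self-contained.
XML = """
<AllergyList>
  <Allergy>
    <Allergen><Name>Flagyl (metronidazole)</Name></Allergen>
    <Reaction>difficulty breathing</Reaction>
    <OnsetDate>02/02/2013</OnsetDate>
  </Allergy>
  <Allergy>
    <Allergen><Name>Bactrim (sulfamethoxazole-trimethoprim)</Name></Allergen>
    <Reaction>swelling</Reaction>
    <OnsetDate>05/03/2002</OnsetDate>
  </Allergy>
</AllergyList>
"""

soup = BeautifulSoup(XML, "xml")  # the "xml" parser requires lxml to be installed

# Assumed pitfall: searching from `soup` inside the loop always finds the first <Reaction>.
repeated = [soup.find("Reaction").string for _ in soup.find_all("Allergy")]


def text_of(parent, tag_name):
    """Hypothetical helper: stripped text of the first `tag_name` under `parent`, or None."""
    tag = parent.find(tag_name)
    return tag.get_text(strip=True) if tag is not None else None


# Fix: search relative to each <Allergy> element instead of the whole document.
records = [
    (text_of(allergy, "Name"), text_of(allergy, "Reaction"), text_of(allergy, "OnsetDate"))
    for allergy in soup.find_all("Allergy")
]

# Unpack into separate variables (this line assumes at least one <Allergy> was found).
names, reactions, onset_dates = map(list, zip(*records))

print(repeated)
print(names, reactions, onset_dates)
```

Returning None for a missing tag keeps the loop from raising AttributeError on incomplete records, which attribute chains such as allergy.Allergen.Name.string would do if an <Allergen> element were absent.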