Retrieve all content between a closing and opening html tag using Beautiful Soup
Question
I am parsing content using Python and Beautiful Soup, then writing it to a CSV file, and have run into a bugger of a problem getting a certain set of data. The data is run through an implementation of TidyHTML that I have crafted, and other unneeded data is then stripped out.
The issue is that I need to retrieve all data between a set of <h3> tags.
Sample Data:
<h3><a href="Vol-1-pages-001.pdf">Pages 1-18</a></h3>
<ul><li>September 13 1880. First regular meeting of the faculty;
September 14 1880. Discussion of curricular matters. Students are
debarred from taking algebra until they have completed both mental
and fractional arithmetic; October 4 1880.</li><li>All members present.</li></ul>
<ul><li>Moved the faculty henceforth hold regular weekkly meetings in the
President's room of the University building; 11 October 1880. All
members present; 18 October 1880. Regular meeting 2. Moved that the
President wait on the property holders on 12th street and request
them to abate the nuisance on their property; 25 October 1880.
Moved that the senior and junior classes for rhetoricals be...</li></ul>
<h3><a href="Vol-1-pages-019.pdf">Pages 19-33</a></h3>
I need to retrieve all of the content between the first closing </h3> tag and the next opening <h3> tag. This shouldn't be hard, but my thick head isn't making the necessary connections. I can grab all of the <ul> tags, but that doesn't work because there is not a one-to-one relationship between <h3> tags and <ul> tags.
The output I am looking to achieve is:
Pages 1-18|Vol-1-pages-001.pdf|content between the </h3> and <h3> tags.
The first two parts have not been a problem but content between a set of tags is difficult for me.
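For the first two fields, the pipe-delimited row itself is straightforward to produce with the csv module; a minimal Python 3 sketch (the notes string is shortened for illustration, and QUOTE_MINIMAL is a safer choice than QUOTE_NONE for free-form text):

```python
import csv
import io

# Build one pipe-delimited row in memory; QUOTE_MINIMAL only quotes a field
# when it actually contains the delimiter, so plain fields pass through as-is.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|', quoting=csv.QUOTE_MINIMAL)
writer.writerow(['Pages 1-18', 'Vol-1-pages-001.pdf',
                 'September 13 1880. First regular meeting of the faculty'])
row = buf.getvalue().strip()
print(row)
```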
My current code is as follows:
import glob, re, os, csv
from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document
from collections import deque

html_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1'
csv_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1\\archiveVol1.csv'
html_cleanup = {'\r\r\n': '', '\n\n': '', '\n': '', '\r': '', '\r\r': '',
                '<img src="UOSymbol1.jpg" alt="" />': ''}

for infile in glob.glob(os.path.join(html_path, '*.html')):
    print "current file is: " + infile
    html = open(infile).read()
    for i, j in html_cleanup.iteritems():
        html = html.replace(i, j)
    # parse cleaned up html with Beautiful Soup
    soup = BeautifulSoup(html)
    #print soup
    html_to_csv = csv.writer(open(csv_path, 'a'), delimiter='|',
                             quoting=csv.QUOTE_NONE, escapechar=' ')
    # retrieve the string that has the page range and file name
    volume = deque()
    fileName = deque()
    summary = deque()
    i = 0
    for title in soup.findAll('a'):
        if title['href'].startswith('V'):
            #print title.string
            volume.append(title.string)
            i += 1
            #print soup('a')[i]['href']
            fileName.append(soup('a')[i]['href'])
    #print html_to_csv
    #html_to_csv.writerow([volume, fileName])
    # retrieve the summary of each archive and store
    #for body in soup.findAll('ul') or soup.findAll('ol'):
    #    summary.append(body)
    for body in soup.findAll('h3'):
        body.findNextSibling(text=True)
        summary.append(body)
    # print out each field into the csv file
    for c in range(i):
        pages = volume.popleft()
        path = fileName.popleft()
        notes = summary
        if not summary:
            notes = "help"
        if summary:
            notes = summary.popleft()
        html_to_csv.writerow([pages, path, notes])
Extract content between the closing </h3> and the next opening <h3> tags:
from itertools import takewhile

h3s = soup('h3')  # find all <h3> elements
for h3, h3next in zip(h3s, h3s[1:]):
    # get elements in between
    between_it = takewhile(lambda el: el is not h3next, h3.nextSiblingGenerator())
    # extract text
    print(''.join(getattr(el, 'text', el) for el in between_it))
The code assumes that all <h3> elements are siblings. If that is not the case, then you could use h3.nextGenerator() instead of h3.nextSiblingGenerator().
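If BeautifulSoup is not available, the same </h3>-to-<h3> extraction can also be sketched with the standard library's html.parser (Python 3; the class name and the inlined sample input are illustrative, not part of the original code):

```python
from html.parser import HTMLParser

class BetweenH3(HTMLParser):
    """Collects the text between each closing </h3> and the next opening <h3>."""
    def __init__(self):
        super().__init__()
        self.sections = []  # one text chunk per </h3> ... <h3> span
        self._buf = None    # None means: not currently inside a span

    def handle_endtag(self, tag):
        if tag == 'h3':
            self._buf = []  # a section starts right after </h3>

    def handle_starttag(self, tag, attrs):
        if tag == 'h3' and self._buf is not None:
            self.sections.append(''.join(self._buf).strip())
            self._buf = None  # the next <h3> ends the section

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def close(self):
        super().close()
        if self._buf is not None:  # flush the text after the last </h3>
            self.sections.append(''.join(self._buf).strip())
            self._buf = None

parser = BetweenH3()
parser.feed('<h3><a href="Vol-1-pages-001.pdf">Pages 1-18</a></h3>'
            '<ul><li>first notes</li></ul>'
            '<h3><a href="Vol-1-pages-019.pdf">Pages 19-33</a></h3>'
            '<ul><li>second notes</li></ul>')
parser.close()
print(parser.sections)
```

Because the heading text sits between <h3> and </h3>, it never lands in the buffer, so only the in-between content is collected.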