Retrieve all content between a closing and opening HTML tag using Beautiful Soup


Question


I am parsing content using Python and Beautiful Soup then writing it to a CSV file, and have run into a bugger of a problem getting a certain set of data. The data is run through an implementation of TidyHTML that I have crafted, and then other unneeded data is stripped out.

The issue is that I need to retrieve all data between a set of <h3> tags.

Sample Data:

<h3><a href="Vol-1-pages-001.pdf">Pages 1-18</a></h3>
<ul><li>September 13 1880. First regular meeting of the faculty;
 September 14 1880. Discussion of curricular matters. Students are
 debarred from taking algebra until they have completed both mental
 and fractional arithmetic; October 4 1880.</li><li>All members present.</li></ul>
 <ul><li>Moved the faculty henceforth hold regular weekkly meetings in the
 President's room of the University building; 11 October 1880. All
 members present; 18 October 1880. Regular meeting 2. Moved that the
 President wait on the property holders on 12th street and request
 them to abate the nuisance on their property; 25 October 1880.
 Moved that the senior and junior classes for rhetoricals be...</li></ul>
 <h3><a href="Vol-1-pages-019.pdf">Pages 19-33</a></h3>

I need to retrieve all of the content between the first closing </h3> tag and the next opening <h3> tag. This shouldn't be hard, but my thick head isn't making the necessary connections. I can grab all of the <ul> tags but that doesn't work because there is not a one to one relationship between <h3> tags and <ul> tags.

The output I am looking to achieve is:

Pages 1-18|Vol-1-pages-001.pdf|content between the </h3> and <h3> tags.

The first two parts have not been a problem but content between a set of tags is difficult for me.
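(For context, the target pipe-delimited row can be produced with Python's csv module. A minimal sketch using the sample values above; note that escapechar='\\' is my choice here for safety, whereas the code below uses a space:)

```python
import csv
import io

# Build one pipe-delimited row in memory (values taken from the sample above).
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|', quoting=csv.QUOTE_NONE, escapechar='\\')
writer.writerow(['Pages 1-18', 'Vol-1-pages-001.pdf', 'summary text here'])
print(buf.getvalue().strip())
# → Pages 1-18|Vol-1-pages-001.pdf|summary text here
```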

My current code is as follows:

import glob, re, os, csv
from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document
from collections import deque

html_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1'
csv_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1\\archiveVol1.csv'

html_cleanup = {'\r\r\n':'', '\n\n':'', '\n':'', '\r':'', '\r\r': '', '<img src="UOSymbol1.jpg"    alt="" />':''}

for infile in glob.glob( os.path.join(html_path, '*.html') ):
    print "current file is: " + infile

    html = open(infile).read()

    for i, j in html_cleanup.iteritems():
            html = html.replace(i, j)

    #parse cleaned up html with Beautiful Soup
    soup = BeautifulSoup(html)

    #print soup
    html_to_csv = csv.writer(open(csv_path, 'a'), delimiter='|',
                      quoting=csv.QUOTE_NONE, escapechar=' ')  
    #retrieve the string that has the page range and file name
    volume = deque()
    fileName = deque()
    summary = deque()
    i = 0
    for title in soup.findAll('a'):
            if title['href'].startswith('V'):
             #print title.string
             volume.append(title.string)
             i+=1
             #print soup('a')[i]['href']
             fileName.append(soup('a')[i]['href'])
             #print html_to_csv
             #html_to_csv.writerow([volume, fileName])

    #retrieve the summary of each archive and store
    #for body in soup.findAll('ul') or soup.findAll('ol'):
    #        summary.append(body)
    for body in soup.findAll('h3'):
            body.findNextSibling(text=True)
            summary.append(body)

    #print out each field into the csv file
    for c in range(i):
            pages = volume.popleft()
            path = fileName.popleft()
            notes = summary
            if not summary: 
                    notes = "help"
            if summary:
                    notes = summary.popleft()
            html_to_csv.writerow([pages, path, notes])

Solution

Extract content between </h3> and <h3> tags:

from itertools import takewhile

h3s = soup('h3') # find all <h3> elements
for h3, h3next in zip(h3s, h3s[1:]):
  # get elements in between
  between_it = takewhile(lambda el: el is not h3next, h3.nextSiblingGenerator())
  # extract text
  print(''.join(getattr(el, 'text', el) for el in between_it))

The code assumes that all <h3> elements are siblings. If that is not the case, then you could use h3.nextGenerator() instead of h3.nextSiblingGenerator().
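For readers on current Python, the same idea ports to BeautifulSoup 4, where the nextSiblingGenerator() method became the next_siblings iterator. A sketch against the sample data from the question (not the asker's actual files), collecting each between-headings section as one string:

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag  # BeautifulSoup 4; the question uses version 3

html = '''<h3><a href="Vol-1-pages-001.pdf">Pages 1-18</a></h3>
<ul><li>First summary entry</li></ul>
<ul><li>Second summary entry</li></ul>
<h3><a href="Vol-1-pages-019.pdf">Pages 19-33</a></h3>'''

soup = BeautifulSoup(html, 'html.parser')
h3s = soup('h3')

sections = []
for h3, h3next in zip(h3s, h3s[1:]):
    # Everything after this <h3>, up to (but not including) the next one.
    between = takewhile(lambda el: el is not h3next, h3.next_siblings)
    # Tags contribute their text; bare NavigableStrings contribute themselves.
    text = ''.join(el.get_text() if isinstance(el, Tag) else str(el)
                   for el in between)
    sections.append(text.strip())

print(sections)
```

A zip over h3s and h3s[1:] pairs each heading with its successor, so the content after the final <h3> is not emitted; append None to the list (and let takewhile run to exhaustion) if that trailing section is also needed.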
