Get a specific section from an HTML doc

Views: 106 | Published: 2016/8/5 19:09:29 | Tags: python, beautifulsoup

Question

Hello, I would like to get a specific section from an HTML document. The section belongs to a div and is wrapped in a span tag, and it normally appears at the beginning of the document.

self.contents = BeautifulSoup(convert_pdf_to_html(self.path), 'html.parser')
self.keywords = self.contents.find('span',text=re.compile("(.*keywords.*|.*key-words.*)",re.IGNORECASE)).parent

The problem is that the span's text always contains a newline character, which prevents me from retrieving the related div. For example:

<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">keywords
<br/></span>

Even with a regular expression it doesn't work, and there is no option to strip the text.

Recommended answer

First, a note on your regex: the hyphen can be written escaped, as \-, although outside a character class a plain - also matches literally.
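A quick check with the standard re module suggests that the pattern itself is unlikely to be the cause. The sketch below compares an unescaped hyphen, an escaped hyphen, and the pattern from the question; the sample strings are made up for illustration.

```python
import re

# Three variants: unescaped hyphen, escaped hyphen, and the question's pattern.
plain = re.compile(r"key-?words", re.IGNORECASE)
escaped = re.compile(r"key\-?words", re.IGNORECASE)
original = re.compile("(.*keywords.*|.*key-words.*)", re.IGNORECASE)

# Hypothetical sample strings, including one with a trailing newline.
samples = ["keywords\n", "Key-Words", "KEYWORDS", "abstract"]

# For each sample, record whether each pattern finds a match anywhere in it.
results = {
    s: (bool(plain.search(s)), bool(escaped.search(s)), bool(original.search(s)))
    for s in samples
}

for sample, flags in results.items():
    print(repr(sample), flags)
```

All three patterns agree on every sample here, including the one that ends in a newline, because search() only needs to find the keyword somewhere in the string.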

Anyway, something like the following worked for me, although lately I have not been able to combine regexes with find either.

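import re
from bs4 import BeautifulSoup as bs
contents = bs(open(path), 'html.parser')  # path: location of the HTML file
keywords = contents.find(text=re.compile(r"key\-?words", re.I | re.U)).parent  # match the text node, then take its enclosing tag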
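To see why searching text nodes directly helps, here is a minimal, self-contained sketch. It assumes the span from the question sits inside a div; the surrounding markup, the id "abstract" and the second span are invented for illustration. It uses the string argument, which is the newer name for text in Beautiful Soup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the span from the question placed inside an assumed div.
html = (
    '<div id="abstract">'
    '<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">keywords\n'
    '<br/></span>'
    '<span>parsing, extraction</span>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"key-?words", re.IGNORECASE)

# Tag name plus string filter: the filter is tested against the tag's .string,
# which is None here because the span holds a text node and a <br/>.
by_tag = soup.find("span", string=pattern)

# String filter alone: matches the text node itself.
text_node = soup.find(string=pattern)

span = text_node.parent   # the enclosing <span>
section = span.parent     # the assumed enclosing <div>

print(by_tag)             # None
print(span.name, section["id"], repr(text_node.strip()))
```

A tag's .string is None whenever the tag has more than one child, so the newline is not really the obstacle; the extra <br/> inside the span is. Once the text node is found, strip() can be applied to it as to any Python string.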
