Python3.5 BeautifulSoup4 get text from 'p' in div

Views: 49 Published: 2021/5/14 20:35:40 Tags: html python-3.x beautifulsoup python-requests


Question

I am trying to pull all the text from the div with class 'caselawcontent searchable-content'. This code just prints the HTML rather than the text from the web page. What am I missing to get the text?

The following link is in the 'filteredcasesdoc.txt' file:
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html

import requests
from bs4 import BeautifulSoup

with open('filteredcasesdoc.txt', 'r') as openfile1:
    for line in openfile1:
        rulingpage = requests.get(line).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext)

Recommended answer

from bs4 import BeautifulSoup
import requests

url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
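
The question reads its URLs from a file, one per line. Lines yielded by iterating over a file keep their trailing newline, so it may be safer to strip them before making a request. A minimal sketch, where `read_urls` is a hypothetical helper name that does not appear in the original post:

```python
from pathlib import Path


def read_urls(path):
    """Return the non-empty lines of a text file with surrounding whitespace removed."""
    # read_urls is a hypothetical helper; the file is assumed to hold one URL per line.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Each returned URL can then be passed to requests.get(url) as in the code above.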

I've used what I find to be a much more reliable form of .find, which takes a dict of attributes ({key: value}):

whole_section = soup.find('div',{'class':'caselawcontent searchable-content'})
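
For comparison, here is a small sketch of three ways to select a div by its classes. The HTML snippet is a made-up stand-in for the real page, and the variable names are only illustrative. Passing the full string (through class_ or the attribute dict) matches the exact value of the class attribute, so it would miss a div whose classes appear in a different order; a CSS selector through select_one checks each class independently.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page.
html = '<div class="caselawcontent searchable-content"><p>Body text</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Keyword form, as in the question.
by_keyword = soup.find("div", class_="caselawcontent searchable-content")
# Attribute-dict form, as in the answer.
by_dict = soup.find("div", {"class": "caselawcontent searchable-content"})
# CSS selector form: requires both classes, in any order.
by_css = soup.select_one("div.caselawcontent.searchable-content")

print(by_keyword == by_dict == by_css)
print(by_css.get_text())
```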


the_title = whole_section.center.h2
# e.g. Missouri Court of Appeals, Southern District, Division Two.
second_title = whole_section.center.h3.p
# e.g. STATE of Missouri, Plaintiff-Appellant v....
number_text = whole_section.center.h3.next_sibling.next_sibling
#e.g.
the_date = number_text.next_sibling.next_sibling
# authors
authors = whole_section.center.next_sibling
para = whole_section.find_all('p')[1:]
# [1:] because we don't want the paragraph h3.p.
# We could also use find_all('p', recursive=False), which doesn't pick up nested children.
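
The doubled .next_sibling calls are presumably needed because the whitespace between two tags is itself a node in the tree. A hedged sketch on a made-up snippet (not the real page) that shows this, with find_next_sibling as one way to skip the whitespace:

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the header layout described above.
html = """<center>
<h2>Court name</h2>
<h3><p>Case title</p></h3>
<p>No. 12345</p>
<p>Jan. 1, 2000</p>
</center>"""
soup = BeautifulSoup(html, "html.parser")
h3 = soup.center.h3

# The node right after </h3> is the newline between the tags.
whitespace = h3.next_sibling
# A second hop reaches the next element.
number_tag = h3.next_sibling.next_sibling
# find_next_sibling() skips text nodes and returns the next matching tag directly.
same_tag = h3.find_next_sibling("p")

print(repr(whitespace))
print(number_tag.get_text())
print(same_tag is number_tag)
```

If the real page has no whitespace between two tags, a single .next_sibling would already land on the next element, so the count of hops depends on the markup.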

Basically, I've dissected the whole tree. As for the paragraphs (i.e. the main text, the variable para), you'll have to loop over them.

print(authors)

# You can add .text (e.g. print(authors.text)) to get the text without the tags,
# or use a simple function that returns only the text:
def rettext(something):
    return something.text
# Usage: print(rettext(authors))
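
Looping over the paragraphs might look like the sketch below. The HTML is a made-up stand-in for the real page, and `paragraph_texts` is a hypothetical helper name. It calls get_text() followed by strip() rather than get_text(strip=True), because the latter strips every text fragment separately and can glue together the words around inline tags.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page.
html = """
<div class="caselawcontent searchable-content">
  <center><h2>Court name</h2><h3><p>Case title</p></h3></center>
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</div>
"""


def paragraph_texts(page_html):
    """Return the text of each body paragraph, skipping the title paragraph."""
    soup = BeautifulSoup(page_html, "html.parser")
    section = soup.find("div", {"class": "caselawcontent searchable-content"})
    if section is None:
        # find() returns None when nothing matches.
        return []
    # The first <p> is the one nested in <h3>, so drop it as the answer does.
    return [p.get_text().strip() for p in section.find_all("p")[1:]]


for text in paragraph_texts(html):
    print(text)
```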
