Python3.5 BeautifulSoup4 get text from 'p' in div
Problem description
I am trying to pull all the text from the div class 'caselawcontent searchable-content'. This code just prints the HTML without the text from the web page. What am I missing to get the text?
The following link is in the 'filteredcasesdoc.txt' file:
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html
import requests
from bs4 import BeautifulSoup
with open('filteredcasesdoc.txt', 'r') as openfile1:
    for line in openfile1:
        rulingpage = requests.get(line).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext)
Recommended answer
from bs4 import BeautifulSoup
import requests
url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
I've used what I find to be a more reliable form of .find, passing the attributes as a dict ({key: value}):
whole_section = soup.find('div',{'class':'caselawcontent searchable-content'})
the_title = whole_section.center.h2
#e.g. Missouri Court of Appeals,Southern District,Division Two.
second_title = whole_section.center.h3.p
#e.g. STATE of Missouri, Plaintiff-Appellant v....
number_text = whole_section.center.h3.next_sibling.next_sibling
#e.g.
the_date = number_text.next_sibling.next_sibling
#authors
authors = whole_section.center.next_sibling
para = whole_section.findAll('p')[1:]
# [1:] skips the first <p>, because we don't want the paragraph h3.p.
# We could also use findAll('p', recursive=False), which only matches direct children and does not pick up nested ones.
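The difference between the default search and recursive=False can be seen on a small example. This is only a sketch: the markup below is hypothetical and much simpler than the real page, where the main paragraphs may or may not be direct children of the div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to contrast the two searches.
html = """
<div id="outer">
  <h3><p>nested in h3</p></h3>
  <p>direct child one</p>
  <p>direct child two</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div', {'id': 'outer'})

all_p = outer.find_all('p')                      # searches every descendant
direct_p = outer.find_all('p', recursive=False)  # searches direct children only

print(len(all_p), len(direct_p))
print([p.get_text() for p in direct_p])
```

With recursive=False the <p> nested inside <h3> is not returned, so no [1:] slice is needed in this simplified case.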
Basically, I've dissected the whole tree. As for the paragraphs (i.e. the main text, the variable para), you'll have to loop over them.
print(authors)
# You can also add .text (e.g. print(authors.text)) to get the text without the tags,
# or use a simple function that returns only the text:
def rettext(something):
    return something.text
# Usage: print(rettext(authors))
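The loop over para could look roughly like the sketch below. To keep it runnable without a network request, it parses a small inline sample instead of the real page; SAMPLE_HTML and the helper name paragraph_texts are assumptions made for illustration, and the real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<div class="caselawcontent searchable-content">
  <center><h3><p>STATE of Missouri v. Example</p></h3></center>
  <p>First paragraph of the opinion.</p>
  <p>Second   paragraph,
     wrapped across lines.</p>
</div>
"""

def paragraph_texts(html):
    """Return the text of each <p> in the case-law div, skipping the first."""
    soup = BeautifulSoup(html, 'html.parser')
    whole_section = soup.find('div', {'class': 'caselawcontent searchable-content'})
    if whole_section is None:  # the div was not found on this page
        return []
    para = whole_section.find_all('p')[1:]  # drop the <p> nested in <h3>
    # get_text() strips the tags; split/join collapses runs of whitespace
    return [' '.join(p.get_text().split()) for p in para]

for text in paragraph_texts(SAMPLE_HTML):
    print(text)
```

Calling get_text() (or reading .text) on each tag is what turns the printed HTML from the question into plain text. For the real page, the string passed to paragraph_texts would be requests.get(url).text.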