Python: scrape text between HTML tags (continuation of a previous topic)


Question

All, this is a continuation of my previous post, but for a different scenario.

I now have a specific scenario where I need to extract the text between the tags.

    data='''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 17, 2016 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section A; Column 0; Classified; Pg. 19</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; Richard Thornton, 89, of Peoria, IL, died peacefully and surrounded by family on Friday, March 11, 2016. Bob was born October 16, 1926, in Jersey City, New Jersey. He graduated from Regis High School in New York City on June 15, 1945, and immediately thereafter served in the U.S. Navy. He received a B.A. from Georgetown University in 1950 and a J.D. from Columbia University Law School in 1953. He practiced law in New York City for 17 years with the law firms of Dorr Hand and Nixon, Mudge, Rose, Guthrie &amp; Alexander. He joined the legal department of Caterpillar Tractor Co. in 1970 and served as the company's General Counsel and Corporate Secretary from 1983 to 1991. He is survived by his wife, Dorothy (McGuire) of Peoria; and his children, Matthew, Nicholas, Jennifer, and Julia. In lieu of flowers, donations may be made in the name of Robert and Dorothy Thornton to St. Philomena's School in Peoria, IL, Regis High School in New York City, or the National Association for Rare Disorders (www.rare diseases.org). 1/3</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); HIGH SCHOOLS (90%); LAWYERS (87%); LAW SCHOOLS (77%); CORPORATE COUNSEL (75%); LEGAL SERVICES (70%); GRADUATE &amp; PROFESSIONAL SCHOOLS (70%); ASSOCIATIONS &amp; ORGANIZATIONS (65%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); NAICS333120 CONSTRUCTION MACHINERY MANUFACTURING (70%); NAICS333111 FARM MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); SIC3531 CONSTRUCTION MACHINERY &amp; EQUIPMENT (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 17, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<DIV CLASS="c10">&nbsp;</DIV>
<A NAME="DOC_ID_0_1"></A><!-- Hide XML section from browser
<DOC NUMBER=2>
<DOCFULL> -->
<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">2 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times Company</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 16, 2016 Wednesday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section B; Column 0; Classified; Pg. 16</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; Richard Thornton, 89, of Peoria, IL, died peacefully and surrounded by family on Friday, March 11, 2016. Bob was born October 16, 1926, in Jersey City, New Jersey. He graduated from Regis High School in New York City on June 15, 1945, and immediately thereafter served in the U.S. Navy. He received a B.A. from Georgetown University in 1950 and a J.D. from Columbia University Law School in 1953. He practiced law in New York City for 17 years with the law firms of Dorr Hand and Nixon, Mudge, Rose, Guthrie &amp; Alexander. He joined the legal department of Caterpillar Tractor Co. in 1970 and served as the company's General Counsel and Corporate Secretary from 1983 to 1991. He is survived by his wife, Dorothy (McGuire) of Peoria; and his children, Matthew, Nicholas, Jennifer, and Julia. In lieu of flowers, donations may be made in the name of Robert and Dorothy Thornton to St. Philomena's School in Peoria, IL, Regis High School in New York City, or the National Association for Rare Disorders (www.rare diseases.org). 1/3 </SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); HIGH SCHOOLS (90%); LAWYERS (87%); LAW SCHOOLS (77%); CORPORATE COUNSEL (75%); LEGAL SERVICES (70%); GRADUATE &amp; PROFESSIONAL SCHOOLS (70%); ASSOCIATIONS &amp; ORGANIZATIONS (65%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); NAICS333120 CONSTRUCTION MACHINERY MANUFACTURING (70%); NAICS333111 FARM MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); SIC3531 CONSTRUCTION MACHINERY &amp; EQUIPMENT (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 16, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>

'''

The solution I tried:

import re

publicationnamepattern = r'<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">(.*)</SPAN></P>'
copyrightpattern = r'<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">([^<]*)</SPAN>'

publicationnamepatternvalues = [a.strip("*") for a in re.findall(publicationnamepattern, data)]
copyrightpatternvalues = [a.strip("*") for a in re.findall(copyrightpattern, data)]

print(str(publicationnamepatternvalues))
print(str(copyrightpatternvalues))

Result:

['The </SPAN><SPAN CLASS="c3">New York Times', 'Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company', 'The </SPAN><SPAN CLASS="c3">New York Times', 'Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company']

I need only "The New York Times" for publicationnamepatternvalues and "Copyright 2016 The New York Times Company" for copyrightpatternvalues.

I cannot put more static values into the patterns, because only these fields (i.e. "New York Times") are common across the data.

Could anyone please help me solve this kind of scenario?
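
The stray markup in the result above comes from the capture group: `(.*)` is greedy and runs to the last `</SPAN></P>` on the line, so it swallows the inner `</SPAN><SPAN CLASS="c3">` tags. A minimal regex-only sketch is to capture the whole paragraph and then remove the tags from the captured text. It assumes every block of interest starts with `<DIV CLASS="c0"><BR><P CLASS="c1">`, as in the sample, and that copyright lines begin with the word "Copyright", which holds here but may not in other data. The shortened `sample` string and the helper name `strip_tags` are illustrative and not part of the original post.

```python
import re

# Shortened stand-in for the `data` string in the question (illustrative only).
sample = '''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>
'''

# Capture everything inside the <P>, non-greedily, up to the first </P>.
block_pattern = re.compile(r'<DIV CLASS="c0"><BR><P CLASS="c1">(.*?)</P>')


def strip_tags(fragment):
    """Remove anything that looks like <...> from the fragment (hypothetical helper)."""
    return re.sub(r'<[^>]+>', '', fragment).strip()


blocks = [strip_tags(m) for m in block_pattern.findall(sample)]

# Assumption: blocks that start with "Copyright" are copyright lines.
copyrightpatternvalues = [b for b in blocks if b.startswith('Copyright')]
publicationnamepatternvalues = [b for b in blocks if not b.startswith('Copyright')]

print(publicationnamepatternvalues)
print(copyrightpatternvalues)
```

Entities such as `&amp;` are left undecoded by this approach; `html.unescape` from the standard library can decode them if needed. For markup that is less regular than this sample, an HTML parser is usually the safer choice.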

Recommended answer

from bs4 import BeautifulSoup

a="""
data='''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 17, 2016 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section A; Column 0; Classified; Pg. 19</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; 1/3</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 17, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>'''
"""
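# The literal "data='''" text pasted inside `a` sits outside every <DIV>, so it does not affect the result.
# Name the parser explicitly; bs4 warns when it has to guess one.
soup = BeautifulSoup(a, "html.parser")
# Every <DIV CLASS="c0"> block; html.parser lowercases tag names, so 'div.c0' matches.
soup2 = soup.select('div.c0')
# .text joins the text of all nested <SPAN>s, which drops the markup between them.
list1 = [b.text.strip() for b in soup2]
print(list1)
var1, var2 = list1[1], list1[2]
print(var1)
print(var2)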

Output:

['1 of 2 DOCUMENTS', 'The New York Times', 'Copyright 2016 The New York Times Company']
The New York Times
Copyright 2016 The New York Times Company
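
The answer reads fixed positions (`list1[1]`, `list1[2]`), which covers a single document, while the `data` in the question holds two. A minimal sketch of grouping the `c0` blocks per document follows. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra packages. It assumes that `c0` divs are never nested and that each document contributes exactly three of them, in the order "N of M DOCUMENTS", publication name, copyright line. The class name `C0TextCollector`, the shortened `sample` string and the `records` layout are illustrative, not from the original answer.

```python
from html.parser import HTMLParser


class C0TextCollector(HTMLParser):
    """Collect the text of every <div class="c0"> (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished text blocks, one per div.c0
        self._parts = None  # text pieces of the div.c0 being read, None outside one

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attribute values keep their case.
        if tag == 'div' and dict(attrs).get('class') == 'c0':
            self._parts = []

    def handle_endtag(self, tag):
        if tag == 'div' and self._parts is not None:
            self.blocks.append(''.join(self._parts).strip())
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


# Shortened stand-in for the two-document `data` string (illustrative only).
sample = '''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">2 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times Company</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>
'''

collector = C0TextCollector()
collector.feed(sample)
collector.close()  # flush any text still buffered by the parser

# Assumption: three c0 blocks per document, in a fixed order (see the text above).
records = [
    {'publication': collector.blocks[i + 1], 'copyright': collector.blocks[i + 2]}
    for i in range(0, len(collector.blocks) - 2, 3)
]
for record in records:
    print(record)
```

If a document can omit one of the three blocks, fixed-size chunking will misalign the records; splitting on the "N of M DOCUMENTS" marker would be a more defensive choice in that case.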
