Cleaning text string after getting body text using Beautifulsoup

Problem description

I'm trying to get the text of articles from various webpages and write it out as clean text documents. I don't want all of the visible text, because that often includes irrelevant links from the sides of the page. I'm using BeautifulSoup to extract the information from the pages. However, extra links still make it into the final product: not just the ones on the side of the page, but sometimes also ones in the middle of the body text and at the bottom of the article.

Does anyone know how to keep these extra links, which get converted to text but aren't actually part of the article, out of the result?

#Some of the imports are for other portions of the code not shown here.
#I'm new to Python and am bad at remembering which library has which functions.
import os
import sys
import urllib2
import webbrowser
from bs4 import BeautifulSoup
from os import path
from cookielib import CookieJar

#I made an opener to deal with proxies and put *** instead of my information
#cookielib helps me get articles from nytimes
proxy = urllib2.ProxyHandler({'http': '***' % '***'})
auth = urllib2.HTTPBasicAuthHandler()
cj = CookieJar()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler, urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

#Uses url input as a string to open a webpage and pulls out all the information.
def baumeister(url):
    req = urllib2.Request(url)
    opened = urllib2.urlopen(req)
    html_doc = opened.read()
    soup = BeautifulSoup(html_doc)
    return soup

#Gets the body text from that html information.
def substanz(url):
    soup = baumeister(url)
    body = soup.find_all("p") #This is where I have tried to fix the problem and failed
    result = ""    
    for e in body:
        i = e.getText().replace("\t", "").replace("  ", " ").strip().encode(errors="ignore")
        result += i + "\r\n\r\n"
    return result

One article I've used to test substanz, and that gets cleaned exactly the way I want, is:

http://blogs.hbr.org/2014/06/do-you-really-want-to-be-yourself-at-work/

I'm trying to test with more articles from different sites, so I'm trying to clean the result of substanz (the result is one big string). The problem I have is with this article:

http://www.cnbc.com/id/101790001?__source=yahoo%7Cfinance%7Cheadline%7Cheadline%7Cstory&par=yahoo&doc=101790001%7CThink%20college%20is%20expensiv#.

I've just used print substanz('url') to see what the result looks like. With the CNBC article, I get extra links turned into text that aren't really part of the article, whereas with the Harvard Business Review article everything works out fine, because the included links are part of the actual text.

I'm not going to attach the full result for each article here, because each one is a full page of text.

If you run exactly the code I've posted above, the opener won't work, so use whatever opener you like to access websites. I have to go through a certain proxy at work, so that's the format that works for me.

Final note: I'm using Python 3.4 and writing the code in an IPython notebook.

Solution

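import requests
from bs4 import BeautifulSoup

# Fetch the page; naming the parser explicitly avoids bs4's "no parser specified" warning.
r = requests.get("http://www.cnbc.com/id/101790001?__source=yahoo%7Cfinance%7Cheadline%7Cheadline%7Cstory&par=yahoo&doc=101790001%7CThink%20college%20is%20expensiv#")
soup = BeautifulSoup(r.content, "html.parser")
# Collect the text of every <p> tag on the page (article text plus everything else).
text = [s.get_text() for s in soup.find_all('p')]
print(text)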


  ['>> View All Results for ""', 'Enter multiple symbols separated by commas', 'London quotes now available', 'Interest rates on loans to jump', "Because federal student loans are tied to the 10-year Treasury note, CNBC's Sharon Epperson reports borrowers will see the impact of the rise in Treasury yields over the past year.", '  Congratulations, graduates, on your diploma. Now what about that $29,000 student loan debt? ', '  More than 70 percent of graduates will carry student debt into the real world, according to the Institute for College Access and Success. And the average debt is just shy of $30,000.  ', '  But the news will get worse next week when interest rates on student loans are set to rise again.   ', '  Though federal student loan rates are fixed for the life of the loan, these rates reset for new borrowers every July 1, thanks to legislation that ties the rates to the performance of the financial markets.  ', '  The interest rate on federal Stafford loans will go from its current fixed rate of just under 4 percent to 4.66 percent for loans that are distributed between July 1 and June 30, 2015.  ', ' Read MoreStudent loan problem an easy fix: Sen. Warren ', '  For graduate students, the rate on Stafford loans will rise from just over 5 percent  to 6.21 percent.  ', '  Direct PLUS Loans for graduates and parents are still the most expensive, with rates rising to 7.21 percent.', 'Which college major pays off most?', "CNBC's Sharon Epperson reports majoring in engineering is the most lucrative. ", "  The increase in monthly federal student loan payments can add up quickly, but shouldn't be too burdensome for most students. For every $10,000 in loans, new borrowers will pay about $4 more a month based on a 10-year repayment period.   ", " Read MoreWhy millennial women don't save for retirement ", '  Still, experts warn that this is only just the beginning.  
', '  "Federal student loan rates will continue to increase in the next few years and will likely hit the maximum rate caps which are as high as 10.5 percent for some loans," said Mark Kantrowitz, senior vice president and publisher of Edvisors.com.  ', '  For sophomore student Samantha Cook, the decision to go to George Washington University was a big one financially. She says she had doubts about it.  ', '  "My parents wanted to assure me that no matter what I picked, we\'d find a way to make it work," Cook said. Like most families, Cook and her parents are making it work by combining their household savings, scholarships and grants—and student loans.    ', ' Read MoreCramer: Offset high cost of higher education ', '  Despite rising tuition and borrowing costs, the Cook family decided against Samantha transferring to an in-state university.  ', '  Despite the debt load she is taking on, she said, "the value of a GW degree for me at least would be more valuable when looking for jobs later on." ', " —By CNBC's Sharon Epperson ", 'Hosting a yard sale may not be the most profitable way to get rid of your old junk.', 'Many Americans with debit cards tied to their checking accounts are still confused about how these programs work. ', "Here's how to avoid these deadly sins if you're contemplating or already in a divorce.", "The IRS offers a lot of help for students. Problem is, the educational tax breaks and how they work together -- or don't -- are confusing.", 'Get the best of CNBC in your inbox', 'Tips for home buyers that will help you find the right home for your bank account.', 'Complaints about movers are down. How to find the right one—and save.', "Forget bathing suit season. Why it's really time to join the gym. 
", 'Drivers might see lower gas prices this year, but smart shopping tactics could help them save even more.', 'Data is a real-time snapshot *Data is delayed at least 15 minutesGlobal Business and Financial News, Stock Quotes, and Market Data and Analysis', '© 2014 CNBC LLC.  All Rights Reserved.', 'A Division of NBCUniversal']
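The list above mixes article paragraphs with search-box, teaser and footer text, because every p tag on the page matches. One site-independent heuristic, which is not part of the original answer, is to drop paragraphs whose text is mostly link text. The sketch below assumes the unwanted teasers are rendered as links (the output above doesn't show the markup), uses a made-up `SAMPLE_HTML` page instead of a live request, and picks `max_density=0.5` as an arbitrary threshold. The helper names `link_density` and `article_paragraphs` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two link-only paragraphs, one plain paragraph,
# and one paragraph that contains an inline link as part of its text.
SAMPLE_HTML = """
<html><body>
  <p><a href="/results">View All Results</a></p>
  <p>More than 70 percent of graduates will carry student debt.</p>
  <p>Rates reset every July 1, according to <a href="/law">legislation</a> passed earlier.</p>
  <p><a href="/warren">Read More</a><a href="/warren">Student loan problem an easy fix</a></p>
</body></html>
"""


def link_density(tag):
    """Return the fraction of a tag's text that sits inside <a> elements."""
    total = len(tag.get_text(strip=True))
    if total == 0:
        return 1.0  # treat empty paragraphs as non-content
    linked = sum(len(a.get_text(strip=True)) for a in tag.find_all("a"))
    return linked / total


def article_paragraphs(html, max_density=0.5):
    """Keep the text of <p> tags whose link density is at most max_density."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for p in soup.find_all("p"):
        if link_density(p) <= max_density:
            # Collapse runs of whitespace while we are at it.
            kept.append(" ".join(p.get_text().split()))
    return kept


paragraphs = article_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

A paragraph with an inline link, as in the Harvard Business Review article, survives because only a small share of its text is linked. Plain-text clutter such as "Enter multiple symbols separated by commas" would not be caught by this heuristic, so it complements selecting the article container and doesn't replace it.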

On the page you linked, the article text sits in div elements with the class "group", so selecting those instead of every p tag returns only the main article:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnbc.com/id/101790001?__source=yahoo%7Cfinance%7Cheadline%7Cheadline%7Cstory&par=yahoo&doc=101790001%7CThink%20college%20is%20expensiv#")
soup = BeautifulSoup(r.content, "html.parser")
# Restrict the search to the article containers: <div class="group">.
text = [s.get_text() for s in soup.find_all("div", {"class": "group"})]
print(text)
['\n  Congratulations, graduates, on your diploma. Now what about that $29,000 student loan debt? \n  More than 70 percent of graduates will carry student debt into the real world, according to the Institute for College Access and Success. And the average debt is just shy of $30,000.  \n  But the news will get worse next week when interest rates on student loans are set to rise again.   \n  Though federal student loan rates are fixed for the life of the loan, these rates reset for new borrowers every July 1, thanks to legislation that ties the rates to the performance of the financial markets.  \n  The interest rate on federal Stafford loans will go from its current fixed rate of just under 4 percent to 4.66 percent for loans that are distributed between July 1 and June 30, 2015.  \n Read MoreStudent loan problem an easy fix: Sen. Warren \n  For graduate students, the rate on Stafford loans will rise from just over 5 percent  to 6.21 percent.  \n  Direct PLUS Loans for graduates and parents are still the most expensive, with rates rising to 7.21 percent.\n', '\n  The increase in monthly federal student loan payments can add up quickly, but shouldn\'t be too burdensome for most students. For every $10,000 in loans, new borrowers will pay about $4 more a month based on a 10-year repayment period.   \n Read MoreWhy millennial women don\'t save for retirement \n  Still, experts warn that this is only just the beginning.  \n  "Federal student loan rates will continue to increase in the next few years and will likely hit the maximum rate caps which are as high as 10.5 percent for some loans," said Mark Kantrowitz, senior vice president and publisher of Edvisors.com.  \n  For sophomore student Samantha Cook, the decision to go to George Washington University was a big one financially. She says she had doubts about it.  \n  "My parents wanted to assure me that no matter what I picked, we\'d find a way to make it work," Cook said. 
Like most families, Cook and her parents are making it work by combining their household savings, scholarships and grants—and student loans.    \n Read MoreCramer: Offset high cost of higher education \n  Despite rising tuition and borrowing costs, the Cook family decided against Samantha transferring to an in-state university.  \n  Despite the debt load she is taking on, she said, "the value of a GW degree for me at least would be more valuable when looking for jobs later on." \n —By CNBC\'s Sharon Epperson \n']
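The strings above still carry leading newlines, doubled spaces and "Read More..." teaser lines. A minimal clean-up sketch follows, using only the standard library. The `chunks` list holds a few lines copied from the output above, the helper name `clean_chunks` is hypothetical, and dropping every line that starts with "Read More" is an assumption about this particular site that could also discard a genuine sentence beginning with those words.

```python
import re

# A few lines copied from the output above, in the same shape (a list of strings).
chunks = [
    "\n  Congratulations, graduates, on your diploma. Now what about that $29,000 student loan debt? "
    "\n Read MoreStudent loan problem an easy fix: Sen. Warren "
    "\n  For graduate students, the rate on Stafford loans will rise from just over 5 percent  to 6.21 percent.  \n",
    "\n  Still, experts warn that this is only just the beginning.  "
    "\n Read MoreCramer: Offset high cost of higher education "
    "\n  Despite rising tuition and borrowing costs, the Cook family decided against Samantha "
    "transferring to an in-state university.  \n",
]


def clean_chunks(chunks, drop_prefixes=("Read More",)):
    """Split chunks into lines, normalize whitespace, and drop teaser lines."""
    paragraphs = []
    for chunk in chunks:
        for line in chunk.splitlines():
            line = re.sub(r"\s+", " ", line).strip()
            if not line or line.startswith(drop_prefixes):
                continue  # skip blank lines and "Read More..." teasers
            paragraphs.append(line)
    # One blank line between paragraphs; swap in "\r\n\r\n" if that is preferred.
    return "\n\n".join(paragraphs)


cleaned = clean_chunks(chunks)
print(cleaned)
```

The result is a single string with one paragraph per block, which can be written straight to a text file.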
