Removing new line '\n' from the output of python BeautifulSoup


Problem description

I am using Python Beautiful Soup to get the contents of:

<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>

My code is as follows:

html_doc="""<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)

print breadcrum

The output is as follows:

[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi',u'\n']

How can I get the result in this form only: abc,def,ghi as a single string?

I would also like to understand the output obtained above.

Recommended answer

You can do the following:

breadcrum = [item.strip() for item in breadcrum if item.strip()]

The if item.strip() condition gets rid of the list items that are empty after stripping the newline characters. A plain truthiness test on the unstripped item would not be enough, because '\n' is itself a non-empty string.

If you want to join the strings, then do:

','.join(breadcrum)

This will give you abc,def,ghi
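
Put together, the filter-and-join approach might look like the sketch below. It assumes Python 3 and a recent bs4, which is why it names the built-in "html.parser" explicitly and uses find_all(string=True) in place of the older findAll(text=True) spelling from the question. The variable names raw, breadcrumb and result are illustrative only. It also shows where the '\n' entries come from: the whitespace between the tags is itself a text node inside the div.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", attrs={"class": "path"})

# Every text node inside the div, including the whitespace between the tags.
raw = path.find_all(string=True)

# Strip each node and keep only the ones that still contain text.
breadcrumb = [item.strip() for item in raw if item.strip()]

result = ",".join(breadcrumb)
print(result)
```

The whitespace-only nodes become empty strings once stripped, so the filter removes them before the join.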

Edit

Although the above gives you what you want, as pointed out by others in the thread, the way you are using BS to extract the anchor texts is not correct. Once you have the div you are interested in, you should use it to get its children and then read the text of each anchor. For example:

path = soup.find('div', attrs={'class': 'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    # strip() drops the leading space in each anchor's text (e.g. ' abc')
    data.append(ele.text.strip())

Then do ','.join(data)
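
As a compact variant of the anchor-based approach, the loop can be written as a list comprehension. This is only a sketch, assuming Python 3 and a recent bs4. It uses get_text(strip=True) to trim each anchor's text, and also shows the stripped_strings generator, which yields the non-empty, already stripped text nodes under a tag. The names joined and alternative are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", attrs={"class": "path"})

# Text of each anchor, with surrounding whitespace removed.
data = [a.get_text(strip=True) for a in path.find_all("a")]
joined = ",".join(data)

# Alternative: all non-empty text nodes under the div, already stripped.
alternative = ",".join(path.stripped_strings)

print(joined)
print(alternative)
```

The first form only looks at the a tags, so stray text elsewhere in the div is ignored. The second picks up any text under the div, which may or may not be what you want.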
