Removing new line '\n' from the output of python BeautifulSoup


Problem description


I am using Python's Beautiful Soup to get the contents of:

<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>

My code is as follows:

html_doc="""<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)

print breadcrum

The output is as follows:

[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi',u'\n']

How can I get the result as a single string in the form abc,def,ghi?

I would also like to understand why the output looks the way it does.

Solution

You could do this:

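breadcrum = [item.strip() for item in breadcrum if item.strip()]

The if item.strip() condition gets rid of the list items that are empty once the new line characters have been stripped. Testing the unstripped item is not enough, because a string that holds only '\n' is still non-empty and therefore truthy.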

If you want to join the strings, then do:

','.join(breadcrum)

This will give you abc,def,ghi
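
A minimal sketch of the two steps above, assuming the list printed in the question as input (no parsing involved, so it runs without Beautiful Soup installed). The u'\n' items are the whitespace-only text nodes that sit between the tags in the HTML, which is why they show up when every text node is requested. The name cleaned is only illustrative.

```python
# Text nodes as shown in the question's output; the '\n' entries are the
# line breaks between the <a> tags inside the div.
breadcrum = [u'\n', u'abc', u'\n', u'def', u'\n', u'ghi', u'\n']

# Strip each item and keep it only if something is left after stripping.
cleaned = [item.strip() for item in breadcrum if item.strip()]

# Join the remaining pieces into one comma-separated string.
result = ','.join(cleaned)
print(result)  # abc,def,ghi
```

Filtering on the stripped value matters here: filtering on the raw item would keep the '\n' entries and leave empty strings in the joined result.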

EDIT

Although the above gives you what you want, as others in the thread have pointed out, this is not the right way to use BS to extract anchor texts. Once you have the div you are interested in, you should use it to get its children and then read the anchor text from each of them. For example:

path = soup.find('div',attrs={'class':'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    data.append(ele.text)

And then do a ','.join(data)
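
Put together, the anchor-based approach might look like the sketch below. It assumes the bs4 package is installed and passes 'html.parser' explicitly; it also uses get_text(strip=True) instead of .text, an addition not in the answer above, because the anchors in the question's HTML start with a space. The stripped_strings line is an alternative, not part of the original answer.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="path">
    <a href="#"> abc</a>
    <a href="#"> def</a>
    <a href="#"> ghi</a>
</div>"""

# Name the parser explicitly so the result does not depend on which
# optional parsers happen to be installed.
soup = BeautifulSoup(html_doc, 'html.parser')
path = soup.find('div', attrs={'class': 'path'})

# Read the text of each anchor, trimming surrounding whitespace.
data = [a.get_text(strip=True) for a in path.find_all('a')]
result = ','.join(data)
print(result)  # abc,def,ghi

# Alternative: stripped_strings yields every text node under the div with
# whitespace trimmed, skipping the ones that are whitespace only.
alternative = ','.join(path.stripped_strings)
```

Going through the anchors keeps the result tied to the links themselves, so any stray text placed directly inside the div would not end up in the output.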
