Remove long dash from string


Question

I'm trying to read the HTML content of a website into Python to analyze the text and decide which category it falls into. I have an issue with long dashes: I get a NoneType error when I try to work with text that contains them. I have tried several fixes suggested on this site, but none of them have worked.

```
from bs4 import BeautifulSoup
import re
import urllib.request

response = urllib.request.urlopen('website-im-opening')
content = response.read().decode('utf-8')
# this does not work
content = content.translate({0x2014: None})
content = re.sub(u'\u2014', '', content)
# this is the other part of the code
htmlcontent = BeautifulSoup(content, "html.parser")

for cont in htmlcontent.select('p'):
    if cont.has_attr('class') == False:
        print(cont.strip())  # returns an error as the text contains a long dash
```

Any ideas how I could filter the long dashes out of the string so I can work with the rest of the text? I could replace them with a short dash or remove them completely; they're not important to me.

Thanks!

Recommended answer

You should clean the data after you extract it with bs4:

  1. BS4 converts some HTML entities for you, so you don't need to do that yourself.
  2. BS4 decodes the document for you.

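```
from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('website-im-opening')
# Pass the raw bytes; BeautifulSoup decodes the document itself
content = response.read()
htmlcontent = BeautifulSoup(content, "html.parser")

# class_=False selects only <p> tags that have no class attribute
for cont in htmlcontent.find_all('p', class_=False):
    print(cont.text)
```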
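The answer does not show the cleaning step itself. Here is a minimal sketch of it, assuming the long dash is the em dash (U+2014), as in the question's own attempts. The helper name `replace_long_dashes` and the sample strings are made up for illustration, and `html.unescape` only stands in for the entity conversion a parser performs. It also shows one possible reason why stripping the dash from the raw HTML can miss some occurrences: a dash written as `&mdash;` is not yet a U+2014 character at that point.

```python
import html

# Map the em dash (U+2014) to a plain hyphen.
# Map it to None instead to delete the character entirely.
DASH_TABLE = {0x2014: "-"}


def replace_long_dashes(text):
    """Return text with every em dash replaced by a short hyphen."""
    return text.translate(DASH_TABLE)


# Hypothetical raw markup: one literal em dash and one written as an entity.
raw = "<p>one \u2014 two &mdash; three</p>"

# Cleaning the raw markup only catches the literal character;
# the &mdash; entity is still there and turns into a dash once parsed.
cleaned_too_early = replace_long_dashes(raw)

# Cleaning text whose entities are already decoded catches both.
# html.unescape is a stand-in here for what the parser does with entities.
decoded_text = html.unescape("one \u2014 two &mdash; three")
cleaned_after = replace_long_dashes(decoded_text)

print(cleaned_too_early)
print(cleaned_after)
```

In the loop above this would be used as `print(replace_long_dashes(cont.text))`. Separately, the error in the question may not come from the dash at all: `cont` is a bs4 `Tag`, and as far as I can tell `cont.strip` is looked up as a child tag named `strip` and returns `None`, so calling it raises a NoneType error. Calling `.strip()` on `cont.text` instead avoids that.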
