Text Extracting: Used All Methods, Yet Stuck


Problem Description

I want to extract some text from a web page. I searched Stack Overflow (as well as other sites) for a suitable method. I tried html2text, BeautifulSoup, NLTK, and a few manual approaches, and all of them failed. For example:

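  • html2text works on offline (saved) pages, but I need to do this online.
  • BS4 does not work properly with Unicode (my page is Persian text encoded as UTF-8) and does not extract the text. It also returns HTML tags/code. I only need the rendered text.
  • NLTK does not work with my Persian text. Even when I try to open my page with urllib.request.urlopen, I get errors. So, as you can see, I am quite confused after trying several methods.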

Here's my target URL: http://vynylyn.yolasite.com/page2.php I want to extract only the Persian paragraphs, without tags/code.

(Note: I use Eclipse Kepler with Python 3.4. After extracting the text, I want to do POS tagging, word and sentence tokenizing, etc. on it.)

What are my options to get this working?

Recommended Answer

I'd start with your second option. BeautifulSoup 4 should (and does) support Unicode. Note that the page is UTF-8, a universal character encoding, so there is nothing Persian-specific about it.
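To illustrate the point about UTF-8, here is a minimal sketch using only the standard library. The sample string is made up for illustration (it is not taken from the target page), and the Persian word is written with escapes so the snippet does not depend on the editor's encoding.

```python
# Hypothetical sample: the Persian word "salam" (written with Unicode
# escapes) followed by an English word. Not taken from the target page.
sample = "\u0633\u0644\u0627\u0645 world"

# UTF-8 encodes any Unicode text; Persian letters simply take two bytes
# each, while ASCII letters take one.
encoded = sample.encode("utf-8")   # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(len(sample), "characters,", len(encoded), "bytes")
print("round trip ok:", decoded == sample)
```

If decoding with the correct encoding round-trips cleanly like this, garbled output usually means the bytes were decoded with a different encoding somewhere along the way, not that the library lacks Unicode support.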

And yes, you will get tags, because it is an HTML page. Try searching for a unique ID, or look at the HTML structure of the page(s). For your example, look for the element main and then the content elements below it, or use div#I1_sys_txt on that specific page. Once you have the element, you just need to call get_text().

Try this (now in Python 3):

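#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

# Fetch the page; raise_for_status() surfaces HTTP errors early.
response = requests.get('http://vynylyn.yolasite.com/page2.php')
response.raise_for_status()
# Pass the raw bytes so BeautifulSoup can detect the encoding itself, and
# name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(response.content, 'html.parser')

# Look up the content element by its id and print only its text.
tag = soup.find('div', id='I1_sys_txt')
print(tag.get_text() if tag else "<none found>")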
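
The same lookup can be tried without a network request. The sketch below is an assumption-laden illustration: the inline HTML is hypothetical and only mimics the structure described above (a div with id I1_sys_txt inside a main element), so the real page may differ. It uses select_one() with the div#I1_sys_txt selector and get_text() with a separator to return each paragraph as plain text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure described in the answer;
# the Persian words are written with Unicode escapes.
html = """
<html><body>
  <div id="main">
    <div id="I1_sys_txt">
      <p>\u0633\u0644\u0627\u0645 <b>\u062f\u0646\u06cc\u0627</b></p>
      <p>second paragraph</p>
    </div>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector equivalent of soup.find('div', id='I1_sys_txt').
tag = soup.select_one("div#I1_sys_txt")

# Collect the text of each paragraph, dropping tags and extra whitespace.
paragraphs = []
if tag is not None:
    paragraphs = [p.get_text(" ", strip=True) for p in tag.find_all("p")]

for text in paragraphs:
    print(text)
```

Keeping the paragraphs as a list of plain strings may be convenient for the later tokenizing and tagging steps mentioned in the question, since each entry is already free of markup.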
