Text Extracting: Used All Methods, Yet Stuck
Problem description
I want to extract some text from a webpage. I searched Stack Overflow (as well as other sites) for a proper method. I tried html2text, BeautifulSoup, NLTK and some manual methods, and all of them failed. For example:
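- html2text works on offline (saved) pages, but I need to do this online.
- BS4 doesn't work properly with Unicode (my page is in UTF-8 encoded Persian) and doesn't extract the text. It also returns HTML tags/code, and I need only the rendered text.
- NLTK doesn't work on my Persian text. Even opening my page with urllib.request.urlopen gives me errors. So, as you can see, after trying several methods I'm quite stuck.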
My target URL is http://vynylyn.yolasite.com/page2.php and I want to extract only the Persian paragraphs, without tags or code.
(Note: I use Eclipse Kepler with Python 3.4. After extracting the text, I want to do POS tagging, word/sentence tokenizing, etc. on it.)
What are my options to get this working?
Recommended answer
I'd start with your second option. BeautifulSoup 4 should (and does) support Unicode. Note that the page is UTF-8, a universal character encoding, so there is nothing Persian-specific about it.
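To illustrate the point about UTF-8, here is a minimal sketch using only the standard library. The sample word is an assumption chosen for illustration and is not taken from the target page.

```python
# Sample Persian word, chosen for illustration only (not from the target page).
sample = "سلام"

# UTF-8 encodes each of these Arabic-script letters as two bytes.
encoded = sample.encode("utf-8")

# Decoding with the same codec restores the original string exactly.
decoded = encoded.decode("utf-8")

print(len(sample), len(encoded), decoded == sample)
```

The same codec handles any script, so a library that reads UTF-8 correctly needs no special support for Persian.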
And yes, you will get tags, because it's an HTML page. Try searching for a unique ID, or look at the HTML structure of the page(s). For your example, look for the element main and then the content elements below it, or use div#I1_sys_txt on that specific page. Once you have your element, you just need to call get_text().
Try this (now in Python 3):
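#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

# Download the page and fail early on an HTTP error status.
response = requests.get('http://vynylyn.yolasite.com/page2.php')
response.raise_for_status()
# Pass the raw bytes so BeautifulSoup detects the encoding itself; name the parser explicitly.
soup = BeautifulSoup(response.content, 'html.parser')
# Look up the content container by its id.
tag = soup.find('div', id='I1_sys_txt')
print(tag.get_text() if tag else "<none found>")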
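If you want the paragraphs one by one rather than a single blob of text, something along these lines may help. It is only a sketch: SAMPLE_HTML is a hypothetical stand-in for the downloaded page (the real markup may well differ), and extract_paragraphs is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <div id="menu"><a href="/">Home</a></div>
  <div id="I1_sys_txt">
    <p>سلام دنیا</p>
    <p>این یک آزمایش است</p>
  </div>
</body></html>
"""


def extract_paragraphs(html, selector="div#I1_sys_txt"):
    """Return the text of each <p> inside the first element matching selector."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(selector)
    if container is None:
        # Selector did not match anything on this page.
        return []
    return [p.get_text(strip=True) for p in container.find_all("p")]


paragraphs = extract_paragraphs(SAMPLE_HTML)
print("\n".join(paragraphs))
```

Keeping the paragraphs as a list of plain strings should also make it easier to feed them to a tokenizer or tagger afterwards.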