Text Extracting: Used All Methods, Yet Stuck

Views: 57 Published: 2021/4/15 19:19:32 Tags: python beautifulsoup webpage extraction persian


Question

I want to extract some text from a webpage. I searched Stack Overflow (as well as other sites) for a suitable method. I tried html2text, BeautifulSoup, NLTK, and some manual methods, and each one failed. For example:

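html2text works on offline (saved) pages, but I need to do this online.
BS4 doesn't work well with Unicode (my page is UTF-8 encoded Persian) and doesn't extract the text. It also returns HTML tags and code. I only need the rendered text.
NLTK doesn't work with my Persian text. Even when trying to open my page with urllib.request.urlopen, I run into errors. So, as you can see, I'm quite stuck after trying several methods.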

Here's my target URL: http://vynylyn.yolasite.com/page2.php I want to extract only the Persian paragraphs, without tags or code.

(Note: I use Eclipse Kepler with Python 3.4. After extracting the text, I want to do POS tagging, word and sentence tokenizing, etc. on it.)

What are my options to get this working?

Recommended answer

I'd go with your second option first. BeautifulSoup 4 should (and does) support Unicode. Note that the page is UTF-8, a universal character encoding, so there is nothing Persian-specific about it.
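As a quick check of the Unicode claim, here is a minimal sketch that parses UTF-8 bytes containing Persian text. The HTML snippet and the sample sentence are made up for illustration and are not taken from the target page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the Persian sentence is illustrative only.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>سلام دنیا</p></body></html>'
).encode('utf-8')

# Passing bytes lets BeautifulSoup work out the encoding itself
# (here from the <meta charset> declaration).
soup = BeautifulSoup(html_bytes, 'html.parser')

# get_text() returns an ordinary Python str, with no tags.
text = soup.p.get_text()
print(text)
```

If the text prints as garbage in a console, the terminal or IDE output encoding is a more likely culprit than the parser.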

And yes, you will get tags, since it's an HTML page. Try searching for a unique ID, or look at the HTML structure of the page(s). For your example, look for the element main and then the content elements below it, or use div#I1_sys_txt on that specific page. Once you have your element, you just need to call get_text().

Try this (now in Python 3):

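#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

# Fetch the page and fail early on HTTP errors.
response = requests.get('http://vynylyn.yolasite.com/page2.php')
response.raise_for_status()
# Pass raw bytes so BeautifulSoup detects the encoding; name the parser explicitly.
soup = BeautifulSoup(response.content, 'html.parser')

# Find the div that holds the text and print it without tags.
tag = soup.find('div', id='I1_sys_txt')
print(tag.get_text() if tag else "<none found>")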
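The same lookup can also be written with CSS selectors, which makes it easy to fall back from the specific div to a broader container. The sketch below is an illustration under assumptions: the sample markup is invented, the helper name extract_text is hypothetical, and treating main and content as element IDs is a guess about the page structure rather than something confirmed here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the structure described above.
SAMPLE_HTML = """
<div id="main">
  <div id="content">
    <div id="I1_sys_txt"><p>پاراگراف اول</p><p>پاراگراف دوم</p></div>
  </div>
</div>
"""

def extract_text(html, selectors=('div#I1_sys_txt', '#main #content')):
    """Return the text of the first selector that matches, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    for selector in selectors:
        tag = soup.select_one(selector)
        if tag is not None:
            # One line per text fragment, with surrounding whitespace stripped.
            return tag.get_text(separator='\n', strip=True)
    return None

print(extract_text(SAMPLE_HTML))
```

The returned string is plain text, so it can be handed to a tokenizer or tagger afterwards without any HTML cleanup.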
