Web page text extractor


Question

Hello,

For a project, I need to develop a corpus of online news stories. I'm
looking for an application that, given the URL of a web page, "copies"
the rendered text of the web page (not the source HTML text), opens a
text editor (Notepad), and displays the copied text for the user to
examine and save into a text file. Graphics and sidebars to be
ignored. The examples I have come across are much too complex for me
to customize for this simple job. Can anyone lead me to the right
direction?

Thanks,
gk

Recommended answers

Hello jk,
> For a project, I need to develop a corpus of online news stories. I'm
> looking for an application that, given the URL of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?



Going simple :)

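from os import system
from sys import argv

OUTFILE = "geturl.txt"
# Dump the rendered text of the page (not its HTML source) into OUTFILE
system("lynx -dump %s > %s" % (argv[1], OUTFILE))
# Open the result in Notepad (Windows)
system("start notepad %s" % OUTFILE)

(You can find lynx at http://lynx.browser.org/)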
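
A URL containing characters such as & or ? can be mangled when it is pasted into a shell command string. A minimal sketch of the same lynx idea using the subprocess module is given here; the helper names build_lynx_command, dump_page and open_in_notepad are made up for illustration, and it assumes lynx is on the PATH and that Notepad is available (Windows).

```python
import subprocess


def build_lynx_command(url):
    """Return the argument list that asks lynx for a rendered-text dump."""
    return ["lynx", "-dump", url]


def dump_page(url, outfile="geturl.txt"):
    """Write the rendered text of `url` to `outfile` and return the path."""
    # A list of arguments is passed directly to the program (no shell),
    # so special characters in the URL are not interpreted.
    result = subprocess.run(build_lynx_command(url),
                            capture_output=True, check=True)
    with open(outfile, "wb") as fh:
        fh.write(result.stdout)
    return outfile


def open_in_notepad(path):
    """Open `path` in Notepad without waiting for it to close (Windows)."""
    subprocess.Popen(["notepad", path])
```

Calling open_in_notepad(dump_page(some_url)) would reproduce the two system() calls above; check=True makes a failing lynx run raise an exception instead of leaving an empty file.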

Note that removing sidebars is a very difficult problem.
Search for "wrapper induction" to see some work on the subject.
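
As a rough illustration of why this is hard, here is a crude link-density heuristic. It is not wrapper induction, only a sketch under stated assumptions: navigation sidebars tend to be short blocks made mostly of link text, while story paragraphs are longer and mostly plain text. The names BlockCollector and main_text, the list of block tags, and the thresholds min_chars and max_link_density are all arbitrary choices for the example and would need tuning per site.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption, not a complete list)
BLOCK_TAGS = {"p", "div", "td", "li", "ul", "ol", "table", "tr",
              "h1", "h2", "h3", "br", "body"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count link characters per block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []        # list of (text, link_chars)
        self._parts = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0          # depth inside <script>/<style>

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def main_text(html, min_chars=40, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = [text for text, links in collector.blocks
            if len(text) >= min_chars and links / len(text) <= max_link_density]
    return "\n\n".join(kept)
```

A heuristic like this will drop short legitimate paragraphs and keep long boilerplate, so the output still needs the manual check described in the question.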

HTH,
--
Miki <mi*********@gmail.com>
http://pythonwise.blogspot.com


On 2007-07-12 04:42:25 -0500, kublai <re*******@gmail.com> said:

> For a project, I need to develop a corpus of online news stories. I'm
> looking for an application that, given the URL of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?



You may find BeautifulSoup or templatemaker to be of assistance:

http://www.crummy.com/software/BeautifulSoup/
http://www.holovaty.com/blog/archive/2007/07/06/0128


2007/7/12, kublai <re*******@gmail.com>:

> For a project, I need to develop a corpus of online news stories. I'm
> looking for an application that, given the URL of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?



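import re
import urllib.request  # Python 3 module; the original 2007 post used urllib2


def textonly(url):
    # Get the HTML source at url and return it with the tags stripped
    f = urllib.request.urlopen(url)
    # Assumes the page is UTF-8; undecodable bytes are replaced
    text = f.read().decode("utf-8", errors="replace")
    r = re.compile(r'<[^<>]*>')
    newtext = r.sub('', text)
    # Repeat until no tag-like spans remain
    while newtext != text:
        text = newtext
        newtext = r.sub('', text)
    return text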

--
Andre Engels, an*********@gmail.com
ICQ: 6260644 -- Skype: a_engels
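
Stripping tags with a regular expression leaves the contents of script and style elements in the output and does not decode entities such as &amp;. A possible alternative, sketched here with only the standard library's html.parser, skips those elements and decodes entities; the names TextExtractor and html_to_text are made up for this example, and it makes no attempt to drop sidebars.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities like &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        """Return the collected text, one text run per line."""
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Each run of text between tags ends up on its own line, so inline markup such as bold words splits a sentence across lines; joining the chunks with spaces instead is an equally reasonable choice for a corpus.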

