Taking data from a text file to parse html page


Problem description





Hi,

I'm trying to strip the HTML and other useless junk from an HTML page.
I'd like to create something like an automated text editor that
takes the keywords from a txt file and removes them from the HTML page
(replacing the words in the HTML page with blank space). I'm new to Python
and could use a little push in the right direction. Any ideas on how to
implement this?

Thanks!

Solution

DH,
Could you be more specific in describing what you have and what you want? You are addressing people, many of whom are good at
stripping useless junk once you tell them what 'useless junk' is.
It also helps to post some of the data that you need to process and a sample of the same data as it should look once it is
processed.

Frederic

----- Original Message -----
From: "DH" <dy*********@gmail.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Thursday, August 24, 2006 2:11 AM
Subject: Taking data from a text file to parse html page

Hi,

I'm trying to strip the HTML and other useless junk from an HTML page.
I'd like to create something like an automated text editor that
takes the keywords from a txt file and removes them from the HTML page
(replacing the words in the HTML page with blank space). I'm new to Python
and could use a little push in the right direction. Any ideas on how to
implement this?

Thanks!

--
http://mail.python.org/mailman/listinfo/python-list



See Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/
It will parse even badly formed HTML and allow you to extract or change
information as you wish.

-Larry Bates
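
As a rough illustration of the parser-based approach Larry suggests, here is a minimal sketch that pulls the text out of an HTML string. Note that it uses `html.parser.HTMLParser` from the Python standard library as a stand-in, not Beautiful Soup itself; the names `TextExtractor` and `html_to_text` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &quot; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self.chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

This sketch keeps everything that is not a tag, so the contents of `<script>` or `<style>` elements would also end up in the output. Whether that matters depends on the pages being processed.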


Frederic,
Good points...

I have an HTML file and a plain text file containing the HTML fragments and words (keywords) that I want
removed from the HTML file. After processing, the result would be
saved as a plain text file.

So the program would import the keywords, remove them from the HTML
file and save the result as something.txt.

I would post the data but it's secret. I can post an example:

index.html (html page)

"
<div><p><em>&quot;Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves.
&quot;</em></p>
<p>-- Peter Norvig, <a class="reference"
"
replace.txt (keywords)
"
<div id="quote" class="homepage-box">

<div><p><em>&quot;

&quot;</em></p>

<p>-- Peter Norvig, <a class="reference"

"

something.txt (file after editing)

"

Python has been an important part of Google since the beginning, and
remains so as the system grows and evolves.
"
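
The workflow described above (load the keywords, remove them from the HTML, save the result as plain text) could be sketched roughly as follows. This is only one possible reading of the post: it assumes each non-empty line of replace.txt is one keyword, that keywords are matched literally, and that each match is replaced by a space. The function names are hypothetical.

```python
from pathlib import Path


def load_keywords(path):
    """Return the non-empty lines of the keyword file, longest first."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    keywords = [line.strip() for line in lines if line.strip()]
    # Longest first, so a short keyword cannot break up a longer one.
    return sorted(keywords, key=len, reverse=True)


def strip_keywords(text, keywords):
    """Replace every keyword with a space, then collapse runs of whitespace."""
    for keyword in keywords:
        text = text.replace(keyword, " ")
    return " ".join(text.split())


def clean_file(html_path, keyword_path, out_path):
    """Read the HTML file, strip the keywords, and write the plain-text result."""
    html = Path(html_path).read_text(encoding="utf-8")
    cleaned = strip_keywords(html, load_keywords(keyword_path))
    Path(out_path).write_text(cleaned + "\n", encoding="utf-8")
    return cleaned
```

With the sample files from this post, `clean_file("index.html", "replace.txt", "something.txt")` would leave only the quoted sentence. Because the matching is literal, a keyword only disappears when it occurs in the page exactly as written in replace.txt, including spacing and attribute order.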
Larry,

I've looked into using BeautifulSoup but came to the conclusion that my
idea would work better in the end.
Thanks for the help.

