RE Module


Problem description

I am trying to filter a column in a list of all html tags.

To do that, I have set up the following statement.

row[0] = re.sub(r'<.*?>', '', row[0])

The results I get are sporadic. Sometimes two tags are removed.
Sometimes one tag is removed. Sometimes no tags are removed. Could
somebody tell me where I have gone wrong here?

Thanks in advance

Solution

Roman wrote:

> I am trying to filter a column in a list of all html tags.

What?

> To do that, I have set up the following statement.
>
> row[0] = re.sub(r'<.*?>', '', row[0])
>
> The results I get are sporadic. Sometimes two tags are removed.
> Sometimes one tag is removed. Sometimes no tags are removed. Could
> somebody tell me where I have gone wrong here?
>
> Thanks in advance

I'm no re expert, so I won't try to advise you on your re, but it might
help those who are if you gave examples of your input and output data.
What results are you getting for what input strings?

Also, if you're just trying to strip HTML markup to get plain text from
a file, "w3m -dump some.html" works great. ;-)

HTH,
~Simon
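
For readers who would rather stay in Python than shell out to w3m, the standard library's html.parser can do something roughly comparable. This is a minimal sketch and not something proposed in the thread: `TextExtractor` and `strip_tags` are hypothetical names, and no attempt is made to reproduce w3m's text layout.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags and comments are not passed here.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup with tags and comments dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# A tag that spans two lines and a comment that contains a tag.
sample = "<p>Hello,\n<b\nclass='x'>world</b><!-- a <i>nested</i> tag --></p>"
print(repr(strip_tags(sample)))
```

Because the parser tracks where tags and comments begin and end, tags that span lines and comments that contain tags need no special handling. The contents of script and style elements count as text for this parser, so they would be kept unless filtered separately.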


Roman,

Your re works for me. I suspect you have tags spanning lines, which happens more often than not. If so, processing line by line
doesn't work. You need to catch the tags like this:

>>> text = re.sub('<(.|\n)*?>', '', text)
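
The effect described here can be reproduced with a small made-up input (the sample string is an assumption, not data from the thread). By default `.` does not match a newline, so a tag that is split across two lines is skipped by `<.*?>`. The `re.DOTALL` flag is shown as an alternative spelling of the `(.|\n)` alternation.

```python
import re

# Hypothetical sample: the opening tag is split across two lines.
text = 'before <a\nhref="x.html">link</a> after'

# '.' stops at '\n', so only the closing tag is removed.
per_default = re.sub(r'<.*?>', '', text)

# Alternation with '\n', as suggested in the reply.
with_alternation = re.sub('<(.|\n)*?>', '', text)

# re.DOTALL lets '.' match newlines as well, with the same result.
with_dotall = re.sub(r'<.*?>', '', text, flags=re.DOTALL)

print(repr(per_default))
print(repr(with_alternation))
print(repr(with_dotall))
```

With this input the first call leaves the opening tag in place while the other two remove both tags, which would match the "sometimes one tag, sometimes none" symptom if the real data contains line breaks inside tags.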

If your text is reasonably small I would recommend this solution. Otherwise you might want to take a look at SE, which is a stream editor
that does the buffering for you:

http://cheeseshop.python.org/pypi/SE/2.2%20beta

>>> import SE
>>> Tag_Stripper = SE.SE(' "~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~=" ')
>>> print(Tag_Stripper(text))

(... your text without tags ...)

The Tag_Stripper is made up of two regexes. The second one catches comments, which may nest tags. The first expression alone would
also catch comments, but would mistake the '>' of the first nested tag for the end of the comment and quit prematurely. The example
"re.sub('<(.|\n)*?>', '', text)" above would misbehave in this respect.

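The comment problem described above can be shown with plain `re`, without SE. This is a sketch on a made-up string: removing comments first and tags second is one way to mirror the two-expression idea, and is not a claim about how SE orders or applies its definitions.

```python
import re

# Hypothetical sample: a comment that contains a tag.
text = 'keep <!-- hidden <b>bold</b> note -->this'

# Tag regex alone: the match that starts at '<!--' ends at the '>' of '<b>',
# so part of the comment body and the closing '-->' are left behind.
tags_only = re.sub('<(.|\n)*?>', '', text)

# Remove whole comments first, then the remaining tags.
no_comments = re.sub(r'<!--.*?-->', '', text, flags=re.DOTALL)
comments_then_tags = re.sub(r'<.*?>', '', no_comments, flags=re.DOTALL)

print(repr(tags_only))
print(repr(comments_then_tags))
```

The first result still contains comment text and a stray `-->`, while the second contains only the visible text.
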
Your Tag_Stripper takes input from files directly:

>>> Tag_Stripper('name_of_file.htm', 'name_of_output_file')
'name_of_output_file'

Or if you want to view the output:

>>> Tag_Stripper('name_of_file.htm', '')
(... your text without tags ...)

If you want to keep the definitions for later use, do this:

>>> Tag_Stripper.save('[your_path/]tag_stripper.se')

Your definitions are now saved in the file 'tag_stripper.se'. You can edit that file. The next time you need a Tag_Stripper you can
make it simply by naming the file:

>>> Tag_Stripper = SE.SE('[your_path/]tag_stripper.se')

You can easily expand the capabilities of your Tag_Stripper. If, for instance, you want to translate the ampersand escapes (&nbsp;
etc.) you'd simply add the name of the file that defines the ampersand replacements:

>>> Tag_Stripper = SE.SE('tag_stripper.se htm2iso.se')

'htm2iso.se' comes with the SE package ready to use and serves as an example for writing one's own replacement sets.

Frederic
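
Where SE is not available, the ampersand escapes mentioned above can also be translated with the standard library. This is an alternative technique, not the 'htm2iso.se' replacement set itself, and the sample string is made up for illustration.

```python
import html

# Hypothetical text left over after the tags have been stripped.
stripped = 'Caf&eacute;&nbsp;&amp; bar &lt;3'

# html.unescape converts named and numeric character references to characters.
decoded = html.unescape(stripped)

print(repr(decoded))
```

Decoding after tag removal, rather than before, keeps an escaped `&lt;` from turning into a literal `<` that a tag regex could then mistake for markup.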
----- Original Message -----
From: "Simon Forman" <ro*********@yahoo.com>
Newsgroups: comp.lang.python
To: <py*********@python.org>
Sent: Friday, August 25, 2006 7:09 AM
Subject: Re: RE Module

Roman wrote:

> I am trying to filter a column in a list of all html tags.

What?

> To do that, I have set up the following statement.
>
> row[0] = re.sub(r'<.*?>', '', row[0])
>
> The results I get are sporadic. Sometimes two tags are removed.
> Sometimes one tag is removed. Sometimes no tags are removed. Could
> somebody tell me where I have gone wrong here?
>
> Thanks in advance

I'm no re expert, so I won't try to advise you on your re, but it might
help those who are if you gave examples of your input and output data.
What results are you getting for what input strings?

Also, if you're just trying to strip HTML markup to get plain text from
a file, "w3m -dump some.html" works great. ;-)

HTH,
~Simon

--
http://mail.python.org/mailman/listinfo/python-list


Thanks for your help.

A thing I didn't mention is that before the statement

row[0] = re.sub(r'<.*?>', '', row[0])

I have this statement:

row[0] = re.sub(r'[^ 0-9A-Za-z\"\'\.\,\#\@\!\(\)\*\&\%\%\\\/\:\;\?\`\~\<\>]', '', row[0])

Hence, the line separators are going to be gone. You
mentioned the size of the string could be a factor. If so, what is the
max size before I see problems?

Thanks again
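
The claim in this follow-up can be checked on a small made-up cell. The sketch assumes the character class begins with a space (the mail wrapped the line at that point, so this is a guess), and `WHITELIST` and `cell` are hypothetical names.

```python
import re

# The whitelist filter from the follow-up, assuming a space after '^'.
WHITELIST = r'[^ 0-9A-Za-z\"\'\.\,\#\@\!\(\)\*\&\%\%\\\/\:\;\?\`\~\<\>]'

# Hypothetical cell whose opening tag is split across two lines.
cell = 'before <a\nhref="x.html">link</a> after'

# Everything outside the whitelist is dropped, including '\n' and '='.
filtered = re.sub(WHITELIST, '', cell)

# With the newline gone, the original single-line pattern removes both tags.
stripped = re.sub(r'<.*?>', '', filtered)

print(repr(filtered))
print(repr(stripped))
```

On this input the filter does remove the line break, so the original pattern strips both tags afterwards. If tags still survive in the real data, line breaks are probably not the cause, and a sample of a failing input string would be needed to narrow it down.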