使用 python regex 从文本中提取某些 URL [英] using python regex to extract certain URLs from text

查看:39
本文介绍了使用 python regex 从文本中提取某些 URL的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

所以我有 NPR 页面的 HTML,我想使用正则表达式为我提取某些 URL(这些 URL 调用嵌套在页面中的特定故事的 URL).实际链接出现在文本中(手动检索)为:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war"><a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-锦标赛"><a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear"><a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice"><a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-帮助">

显然,如果我希望能够在一致的基础上使用它,我不能继续使用手动检索.到目前为止,我有这个代码:

import nltk进口重新f = open("/Users/shannonmcgregor/Desktop/npr.txt")npr_lines = f.readlines()f.close()

我有这个代码来抓取 (

 用于 npr_lines 中的行:re.findall('<a href="?\'?([^"\'>]*)', line)

但这会抓取所有网址.我尝试添加如下内容:

(parallels|thetwo-way|a-marines)

但这没有任何回报.那么我做错了什么?我如何将较大的 URL 剥离器与这些针对给定 URL 的特定词结合起来?

请和谢谢:)

解决方案

您可以使用 re.search 函数匹配行中的正则表达式,如果匹配则打印该行

<预><代码>>>>file = open('/Users/shannonmcgregor/Desktop/npr.txt', 'r')>>>对于文件中的行:... if re.search('<a href=[^>]*(parallels|thetwo-way|a-marines)', line):... 打印线

将给出一个输出

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war"><a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-锦标赛"><a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear"><a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice"><a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-帮助">

So I have the HTML from an NPR page, and I want to use regex to extract just certain URLs for me (these call the URLs to specific stories nested within the page). The actual links appear in the text (retrieved manually) as:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">

obviously, I cannot to continue to use manual retrieval if I want to be able to use this on a consistent basis. So far, I have this code:

import nltk
import re

f = open("/Users/shannonmcgregor/Desktop/npr.txt")
npr_lines = f.readlines()
f.close()

I have this code to grab everything between (

for line in npr_lines:
re.findall('<a href="?\'?([^"\'>]*)', line)

But that grabs all urls. I tried adding something like:

(parallels|thetwo-way|a-marines)

but that returns nothing. So what am I doing wrong? How I combine the larger URL stripper with these specific words that target the given URLs?

Please and thank you :)

解决方案

You can use re.search function to match the regex in the line and prints the line if it matches as

>>> file  = open('/Users/shannonmcgregor/Desktop/npr.txt', 'r')
>>> for line in file:
...     if re.search('<a href=[^>]*(parallels|thetwo-way|a-marines)', line):
...             print line

will give an output as

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">

这篇关于使用 python regex 从文本中提取某些 URL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆