从python的网页中读取特定的行 [英] Reading a particular line from a webpage in python
问题描述
在我的代码中,我试图从网页中将第一行文本转换为python中的变量。目前,我正在使用urlopen获取我想阅读的每个链接的整个页面。我如何才能阅读网页上的第一行字。
我的代码:
import urllib2
line_number = 10
id =(np.arange(1,5))
for n in id:
link = urllib2.urlopen(http:// www。 cv.edu/id={}\".format(n))
l = link.read()
我想从网页的以下html代码中提取单词old car:
< html> ;
< head>
< link rel =stylesheet>
< style>
.norm {font-family:arial; font-size:8.5pt;颜色:#000000;文字修饰:无; }
.norm:Visited {font-family:arial; font-size:8.5pt;颜色:#000000;文字修饰:无; }
.norm:Hover {font-family:arial; font-size:8.5pt;颜色:#000000;文字修饰:下划线; }
< / style>
< / head>
< body>
< b>旧车< / b>< br>
< sup> 13< / sup> CO< font color =red> v = 0< / font>< br>
ID:02910< br>
< p>
< p>< b> CDS< / b>< / p>
使用 XPath 。这正是我们需要的。
b b XPath , XML路径语言用于从XML文档中选择节点的查询语言。 lxml
python库将帮助我们解决这个问题。这是很多人之一。 Libxml2 ,和 PyXML 是一些选项。有很多很多库可以做这种事情。
使用XPath
类似于以下是基于您现有的代码,将工作:
从lxml导入urllib2
导入html
line_number = 10
id =(np.arange(1,5))
for n in id:
link = urllib2.urlopen(http://www.cv.edu/id= {}。format(n))
l = link.read()
tree = html.fromstring(l)
print tree.xpath(// b / text()) [0]
XPath查询 请求库是用于阅读代码中的网页的最新技术,它可能为您节省时间一些令人头痛的问题。 完整的程序可能如下所示: 这些网址并不适合我,所以您可能不得不修补一下。但是,这个概念是完全正确的。 从网页中读取,您可以使用以下内容来测试XPath: In my code I'm trying to get the first line of text from a webpage into a variable in python. At the moment I'm using urlopen to get the whole page for each link I want to read. How do I only read the first line of words on the webpage. My code: I want to extract the word "old car" from the following html code of the webpage:
Use XPath. It's exactly what we need. XPath, the XML Path Language, is a query language for selecting nodes from an XML document. The Something like the following, based on your existing code, will work: The XPath query The Requests library is the state-of-the-art when it comes to reading webpages in code. It may save you some headaches later. The complete program might look like this:
The urls didn't work for me, so you might have to tinker a bit. The concept is sound, though. Reading from the webpages aside, you can use the following to test the XPath:
这篇关于从python的网页中读取特定的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋! // b / text() code>基本上是说从页面上的
< b>
元素获取文本。 tree.xpath
函数调用返回一个列表,我们选择第一个使用 [0]
。除了请求
from lxml import html
导入请求
在范围(1,6)中为nn:
page = requests.get(http://www.cv.edu/ (b / text())[0]
code>
警告
from lxml ; link rel =stylesheet>
< / head>
< body>
< b>旧车< / b>< br>
< sup> 13< CO>< font color =red> v = 0< / font>< br>
ID:02910< br>
< p>
< b> CDS< / b>< / p>)
print tree.xpath(// b / text())[0]# 旧车
import urllib2
line_number = 10
id = (np.arange(1,5))
for n in id:
link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
l = link.read()
<html>
<head>
<link rel="stylesheet">
<style>
.norm { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
.norm:Visited { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
.norm:Hover { font-family: arial; font-size: 8.5pt; color : #000000; text-decoration : underline; }
</style>
</head>
<body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>
lxml
python library will help us with this. It's one of many. Libxml2, Element Tree, and PyXML are some of the options. There are many, many, many libraries to do this type of thing.Using XPath
import urllib2
from lxml import html
line_number = 10
id = (np.arange(1,5))
for n in id:
link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
l = link.read()
tree = html.fromstring(l)
print tree.xpath("//b/text()")[0]
//b/text()
basically says "get the text from the <b>
elements on a page. The tree.xpath
function call returns a list, and we select the first one using [0]
. Easy.An aside about Requests
from lxml import html
import requests
for nn in range(1, 6):
page = requests.get("http://www.cv.edu/id=%d" % nn)
tree = html.fromstring(page.text)
print tree.xpath("//b/text()")[0]
Caveats
from lxml import html
tree = html.fromstring("""<html>
<head>
<link rel="stylesheet">
</head>
<body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>""")
print tree.xpath("//b/text()")[0] # "Old cars"