Parsing HTML data with Python and storing it in a database



Problem description

This has been troubling me for two days. I am new to Python, and I want to parse the HTML data at the following link: http://movie.walkerplus.com/list/2015/12/

I then want to store the data in a PostgreSQL database named movie_db, which has a table named films created by the following command:

CREATE TABLE films (
title       varchar(128) NOT NULL,
description varchar(256) NOT NULL,
directors   varchar(128)[],
roles       varchar(128)[]
);

I have parsed the data into four lists: title, description, director and roles. For example, title = ['a', ..., 'b'], description = ['c', ..., 'f'], director = ['d', ..., 'g'], roles = [['f', 'g', 't'], ..., ['h', 't', 'u']].

sql = """INSERT INTO films (title, description, directors, roles)
VALUES
(%s, %s, %s, %s);"""
for obj in zip(t, des, dirt, r):
    cur.execute(cur.mogrify(sql, obj))
conn.commit()

This raises the following error:

psycopg2.DataError: malformed array literal: "サム・メンデス"
LINE 1: ...ームズ・ボンドの戦いを描く『007』シリーズ第24作', 'サム・メ...
                                                         ^
DETAIL:  Array value must start with "{" or dimension information.

Solution

I know this error. It means you are trying to insert plain string values into array columns. You can verify the generated SQL as shown below.

sql2 = cur.mogrify(sql, obj)
print(sql2)

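Your directors and roles fetched from the HTML are lists of strings, so after the zip function each obj contains the director and roles as plain strings. psycopg2 only adapts a Python list to a PostgreSQL array; a plain string is sent as a text literal, which PostgreSQL cannot parse as an array value.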
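To make that concrete, here is a small sketch of what zip hands to the insert. The values are made-up placeholders shaped like the lists described in the question, not the real scraped data; the names t, des, dirt and r follow the question's code.

```python
# Placeholder data shaped like the lists described in the question.
t = ['title a', 'title b']
des = ['description c', 'description f']
dirt = ['director d', 'director g']        # one plain string per film
r = [['f', 'g', 't'], ['h', 't', 'u']]     # one list of roles per film

rows = list(zip(t, des, dirt, r))

for title, description, director, roles in rows:
    # director is a str here, so it would be sent as a text literal;
    # roles is already a list, which psycopg2 adapts to an array.
    print(type(director).__name__, type(roles).__name__)
```

With data shaped like this, the third element of every tuple is a string rather than a list, which matches the value quoted in the error message.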

In your case you are trying to insert only one row, so there is probably no need to zip.

I am not familiar with the API you used, but can you print the values received from the HTML before inserting them? Then I can provide the exact SQL required.

Edit: about the syntax for the new array

The expression that builds the directors array is shorthand syntax for creating a new list in which each element is itself wrapped in a list. Written out in a more readable form, it is equivalent to the following:

director = ['tom', 'jack', 'john']
directors = []

for d in director:
    # wrap each name in its own one-element list
    elem_as_list = []
    elem_as_list.append(d)
    directors.append(elem_as_list)
print(director)
print(directors)
print(type(director[0]))
print(type(directors[0]))

Here is the output (under Python 3):

['tom', 'jack', 'john']
[['tom'], ['jack'], ['john']]
<class 'str'>
<class 'list'>
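Putting the pieces together, one possible way to do the insert is sketched below. It assumes `cur` is an open psycopg2 cursor and that the films table from the question exists. `build_rows` and `insert_films` are hypothetical helper names introduced only for this illustration, and the sample values are placeholders.

```python
SQL = (
    "INSERT INTO films (title, description, directors, roles) "
    "VALUES (%s, %s, %s, %s);"
)


def build_rows(titles, descriptions, director, roles):
    """Pair up the parsed lists, wrapping each director in its own list."""
    directors = [[d] for d in director]  # 'tom' -> ['tom']
    return list(zip(titles, descriptions, directors, roles))


def insert_films(cur, rows):
    """Insert every row through a DB-API cursor (assumed to be psycopg2)."""
    for row in rows:
        # Parameters go to execute() directly; mogrify() is only needed
        # when you want to inspect the final SQL string.
        cur.execute(SQL, row)


rows = build_rows(
    ['a', 'b'],
    ['c', 'f'],
    ['d', 'g'],
    [['f', 'g', 't'], ['h', 't', 'u']],
)
print(rows[0])
```

With a real connection you would call insert_films(cur, rows) and then conn.commit(). Because each director is now a one-element list, it matches the one-dimensional varchar(128)[] column.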
