Pyspark, error: input doesn't have expected number of values required by the schema and extra trailing comma after columns


Question

First, I made two tables (RDDs) using the following commands:

rdd1 = sc.textFile('checkouts').map(lambda line: line.split(',')).map(lambda fields: ((fields[0], fields[3], fields[5]), 1))
rdd2 = sc.textFile('inventory2').map(lambda line: line.split(',')).map(lambda fields: ((fields[0], fields[8], fields[10]), 1))

The keys in the first RDD are BibNum, ItemCollection, and CheckoutDateTime. When I checked the values of the first RDD using rdd1.take(2), it showed:

[((u'BibNum', u'ItemCollection', u'CheckoutDateTime'), 1), ((u'1842225', u'namys', u'05/23/2005 03:20:00 PM'), 1)]

Similarly, the keys in the second RDD are BibNum, ItemCollection, and ItemLocation. The values are as follows:

[((u'BibNum', u'ItemCollection', u'ItemLocation'), 1), ((u'3011076', u'ncrdr', u'qna'), 1)]

Once I had created the two RDDs, I tried to join them using rdd3 = rdd1.join(rdd2). After that, when I checked the value of rdd3 using rdd3.take(2), the following error happened:

IndexError: list index out of range

I do not know why it happened. Please enlighten me if you know the reason. If you have any doubts about my question or code, just let me know and I will try to clarify. Thanks.
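
A quick way to narrow this down is to check how many fields each line actually has after the split: any line with fewer fields than the largest index used above will raise an IndexError as soon as an action evaluates it. A minimal diagnostic sketch (assuming the same sc and file paths as in the code above):

# Count the comma-separated fields per line; inventory2 needs at least
# 11 fields for fields[10] to exist, checkouts at least 6 for fields[5].
field_counts = sc.textFile('inventory2').map(lambda line: len(line.split(','))).countByValue()
print(field_counts)  # any key below 11 marks a line that triggers the IndexError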

edit --- I put up my sample input data for each RDD:

BibNum,ItemBarcode,ItemType,ItemCollection,CallNumber,CheckoutDateTime,,,,,,,
1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,
1928264,10037335444,jcbk,ncpic,E TABACK,12/14/2005 05:56:00 PM,,,,,,,
1982511,10039952527,jcvhs,ncvidnf,VHS J796.2 KNOW_YO 2000,08/11/2005 01:52:00 PM,,,,,,,
2026467,10040985615,accd,nacd,CD 782.421642 Y71T,10/19/2005 07:47:00 PM,,,,,,,
2174698,10047696215,jcbk,ncpic,E KROSOCZ,12/29/2005 03:42:00 PM,,,,,,,
1602768,10028318730,jcbk,ncpic,E BLACK,10/08/2005 02:15:00 PM,,,,,,,
2285195,10053424767,accd,cacd,CD 782.42166 F19R,09/30/2005 10:16:00 AM,,,,Input,BinNumber,Date,BinNumber+Month
2245955,10048392665,jcbk,ncnf,J949.73 Or77S 2004,12/05/2005 05:03:00 PM,,,,,,,
770918,10044828100,jcbk,ncpic,E HILL,07/22/2005 03:17:00 PM,,,,,,,


BibNum,Title,Author,ISBN,PublicationYear,Publisher,Subjects,ItemType,ItemCollection,FloatingItem,ItemLocation,ReportDate,ItemCount,,,,,,,,,,,,,
3011076,A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield| Frederick Gardner| Megan Petasky| and Allen Tam.,O'Ryan| Ellie,1481425730| 1481425749| 9781481425735| 9781481425742,2014,Simon Spotlight|,Musicians Fiction| Bullfighters Fiction| Best friends Fiction| Friendship Fiction| Adventure and adventurers Fiction,jcbk,ncrdr,Floating,qna,09/01/2017,1,,,,,,,,,,,,,
2248846,Naruto. Vol. 1| Uzumaki Naruto / story and art by Masashi Kishimoto ; [English adaptation by Jo Duffy].,Kishimoto| Masashi| 1974-,1569319006,2003| c1999.,Viz|,Ninja Japan Comic books strips etc| Comic books strips etc Japan Translations into English| Graphic novels,acbk,nycomic,NA,lcy,09/01/2017,1,,,,,,,,,,,,,
3209270,Peace| love & Wi-Fi : a ZITS treasury / by Jerry Scott and Jim Borgman.,Scott| Jerry| 1955-,144945867X| 9781449458676,2014,Andrews McMeel Publishing|,Duncan Jeremy Fictitious character Comic books strips etc| Teenagers United States Comic books strips etc| Parent and teenager Comic books strips etc| Families Comic books strips etc| Comic books strips etc| Comics Graphic works| Humorous comics,acbk,nycomic,NA,bea,09/01/2017,1,,,,,,,,,,,,,
1907265,The Paris pilgrims : a novel / Clancy Carlile.,Carlile| Clancy| 1930-,786706155,c1999.,Carroll & Graf|,Hemingway Ernest 1899 1961 Fiction| Biographical fiction| Historical fiction,acbk,cafic,NA,cen,09/01/2017,1,,,,,,,,,,,,,
1644616,Erotic by nature : a celebration of life| of love| and of our wonderful bodies / edited by David Steinberg.,,094020813X,1991| c1988.,Red Alder Books/Down There Press|,Erotic literature American| American literature 20th century,acbk,canf,NA,cen,09/01/2017,1,,,,,,,,,,,,,

edit --- date_count --> DataFrame[BibNum: string, ItemCollection: string, CheckoutDateTime: string, count: bigint] shows like this, but when I checked its value using date_count.take(2), it showed an error like this: Input doesn't have expected number of values required by the schema. 6 fields are required while 7 values are provided.

df_final schema looks like this: DataFrame[BibNum: string, ItemType: string, ItemCollection: string, ItemBarcode: string, CallNumber: string, CheckoutDateTime: string, Title: string, Author: string, ISBN: string, PublicationYear: string, Publisher: string, Subjects: string, FloatingItem: string, ItemLocation: string, ReportDate: string, ItemLocation: string, : string, : string, : string .... : string, : string]

Solution

So I'll try to answer your question. The solution might be syntactically haywire, but I'll do my best (I don't have an environment to test right now). Let me know if this is what you are looking for; otherwise I can help you fine-tune the solution.

Here is the documentation for join in PySpark.

So when you read the files:

rdd1 = sc.textFile('checkouts').map(lambda line: line.split(','))
rdd2 = sc.textFile('inventory2').map(lambda line: line.split(','))

# Define the headers for both the files
rdd1_header = rdd1.first()
rdd2_header = rdd2.first()

# Define the dataframes (drop the header row, then use it for the column names)
rdd1_df = rdd1.filter(lambda line: line != rdd1_header).toDF(rdd1_header)
rdd2_df = rdd2.filter(lambda line: line != rdd2_header).toDF(rdd2_header)

# Collect every column name the two dataframes share
common_cols = [x for x in rdd1_df.columns if x in rdd2_df.columns]

# Join on all the common columns to avoid ambiguous column references
df_final = rdd1_df.join(rdd2_df, on=common_cols)
date_count = df_final.groupBy(["BibNum", "ItemCollection", "CheckoutDateTime"]).count()
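
As a quick sanity check (a minimal usage sketch reusing the names defined above, not part of the original answer):

# The join keeps a single copy of each common column, so the schema
# should now list BibNum, ItemCollection, etc. only once.
df_final.printSchema()

# Show the largest (BibNum, ItemCollection, CheckoutDateTime) groups.
date_count.orderBy('count', ascending=False).show(5, truncate=False)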

EDITS:

1) Your error pyspark.sql.utils.AnalysisException: u"Reference 'ItemCollection' is ambiguous, could be ItemCollection#3, ItemCollection#21" is due to multiple columns being generated after the join. What you need to do is include all the common columns in your join, as done in the code above with on=common_cols.
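
For illustration, a minimal sketch of the difference (the names ambiguous and resolved are hypothetical; rdd1_df, rdd2_df, and common_cols are defined above): joining on a column expression keeps both copies of every shared column, while joining on the list of common column names keeps a single copy:

# Ambiguous: both input DataFrames keep their own ItemCollection column,
# so a later reference to plain 'ItemCollection' cannot be resolved.
ambiguous = rdd1_df.join(rdd2_df, rdd1_df.BibNum == rdd2_df.BibNum)

# Unambiguous: passing the list of shared column names deduplicates them.
resolved = rdd1_df.join(rdd2_df, on=common_cols)
resolved.groupBy('BibNum', 'ItemCollection', 'CheckoutDateTime').count()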

2) Another issue: some weird parts are added at the end of each Row, such as: [Row(BibNum=u'1842225', ItemBarcode=u'10035249209', ItemType=u'acbk', ItemCollection=u'namys', CallNumber=u'MYSTERY ELKINS1999', CheckoutDateTime=u'05/23/2005 03:20:00 PM', =u'', =u'', =u'', =u'', =u'', =u'', =u'')]

For this, you had mentioned your CSV file as follows:

BibNum,ItemBarcode,ItemType,ItemCollection,CallNumber,CheckoutDateTime,,,,,,,
1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,
1928264,10037335444,jcbk,ncpic,E TABACK,12/14/2005 05:56:00 PM,,,,,,,

Now, as you can see, there are a lot of trailing commas after the date column (i.e. ',,,,,,,'). After the split on commas, these produce the extra empty columns, which you can drop.
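
As a sketch of that cleanup (reusing rdd1 and rdd1_header from the code above), keep only the fields that have a real header name before building the DataFrame:

# Keep only the columns whose header name is non-empty; the trailing
# commas in the header row produce empty names that can simply be trimmed.
named_cols = [c for c in rdd1_header if c.strip()]
n = len(named_cols)

rdd1_df = (rdd1.filter(lambda line: line != rdd1_header)
               .map(lambda fields: fields[:n])   # drop the trailing empty fields
               .toDF(named_cols))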
