需要跳过包含“值错误"的行. [英] Need to skip line containing "Value Error"

查看:66
本文介绍了需要跳过包含“值错误"的行.的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从Teradata服务器中提取一些旧数据,但是某些记录包含未在python中注册的奇怪字符,例如"U + ffffffc2".

I'm trying to extract some legacy data from a Teradata server, but some of the records contain weird characters that don't register in python, such as "U+ffffffc2".

当前

  1. 我正在使用pyodbc从Teradata中提取数据

  1. I'm using pyodbc to extract the data from Teradata

将结果放入numpy数组中(因为当我将其直接放入pandas时,它将所有列解释为字符串类型的单个列)

Placing the results into a numpy array (because when I put it directly into pandas, It interprets all of the columns as a single column of type string)

然后,我将numpy数组转换为熊猫数据帧,以将Decimal("09809")和Date("2015,11,14")之类的内容更改为[09809,"11,14,2015"]

Then I turn the numpy array into a pandas dataframe to change things like Decimal("09809") and Date("2015,11,14") into [09809,"11,14,2015"]

然后我尝试将其写入发生错误的文件中

Then I try to write it to a file, where this error occurs

ValueError:字符U + ffffffc2不在[U + 0000; U + 10ffff]

ValueError: character U+ffffffc2 is not in range [U+0000; U+10ffff]

我无权编辑此数据,因此,从客户端角度来看,我可以做些什么来跳过或最好删除该字符,然后再尝试将其写入文件并得到错误?

I don't have access to edit this data, so from a client perspective what can I do to skip or, preferably, remove the character before writing it trying to write it to a file and getting the error?

当前,我有一个"try andexcept"块来跳过对错误数据的查询,但是我必须以至少100个行块的形式查询数据.因此,如果我跳过它,则会丢失100条或更多行一个时间.但是,正如我之前提到的,我希望保留该行,但删除该字符.

Currently, I have a "try and except" block to skip queries with erroneous data, but I have to query the data in row chunks of at least 100. So if I just skip it, I lose 100 or more lines at a time. As I mentioned before, however, I would prefer to keep the line, but remove the character.

这是我的代码. (还可以指出任何不良做法!)

Here's my code. (Feel free to point out any bad practices as well!)

#Python 3.4
 #Python Teradata Extraction
 #Created 01/28/16 by Maz Baig

 #dependencies
 import pyodbc
 import numpy as np
 import pandas as pd
 import sys
 import os
 import psutil
 from datetime import datetime


 #create a global variable for start time
 start_time=datetime.now()
 #create global process variable to keep track of memory usage
 process=psutil.Process(os.getpid())

 def ResultIter(curs, arraysize):
         #Get the specified number of rows at a time
         while True:
                 results = curs.fetchmany(arraysize)
                 if not results:
                         break
                 #for result in results:
                 yield results

 def WriteResult(curs,file_path,full_count):
         rate=100
         rows_extracted=0
         for result in ResultIter(curs,rate):
                 table_matrix=np.array(result)
                 #Get shape to make sure its not a 1d matrix
                 rows, length = table_matrix.shape
                 #if it is a 1D matrix, add a row of nothing to make sure pandas doesn't throw an error
                 if rows < 2:
                         dummyrow=np.zeros((1,length))
                         dummyrow[:]=None
                 df = pd.DataFrame(table_matrix)
                 #give the user a status update
                 rows_extracted=rows+rows_extracted
                 StatusUpdate(rows_extracted,full_count)
                 with open(file_path,'a') as f:
                         try:
                                 df.to_csv(file_path,sep='\u0001',encoding='latin-1',header=False,index=False)
                         except ValueError:
                                 #pass afterwards
                                 print("This record was giving you issues")
                                 print(table_matrix)
                                 pass
         print('\n')
         if (rows_extracted < full_count):
                 print("All of the records were not extracted")
                 #print the run durration
                 print("Duration:  "+str(datetime.now() - start_time))
                 sys.exit(3)
         f.close()




 def StatusUpdate(rows_ex,full_count):
         print("                                      ::Rows Extracted:"+str(rows_ex)+" of "+str(full_count)+"    |    Memory Usage: "+str(process.memory_info().rss/78



 def main(args):
         #get Username and Password
         usr = args[1]
         pwd = args[2]
         #Define Table
         view_name=args[3]
         table_name=args[4]
         run_date=args[5]
         #get the select statement as an input
         select_statement=args[6]
         if select_statement=='':
                 select_statement='*'
         #create the output filename from tablename and run date
         file_name=run_date + "_" + table_name +"_hist.dat"
         file_path="/prod/data/cohl/rfnry/cohl_mort_loan_perfnc/temp/"+file_name
         if ( not os.path.exists(file_path)):
                 #create connection
                 print("Logging In")
                 con_str = 'DRIVER={Teradata};DBCNAME=oneview;UID='+usr+';PWD='+pwd+';QUIETMODE=YES;'
                 conn = pyodbc.connect(con_str)
                 print("Logged In")

                 #Get number of records in the file
                 count_query = 'select count (*) from '+view_name+'.'+table_name
                 count_curs = conn.cursor()
                 count_curs.execute(count_query)
                 full_count = count_curs.fetchone()[0]

                 #Generate query to retrieve all of the table data
                 query = 'select '+select_statement+'  from '+view_name+'.'+table_name
                 #create cursor
                 curs = conn.cursor()
                 #execute query
                 curs.execute(query)
                 #save contents of the query into a matrix
                 print("Writting Result Into File Now")
                 WriteResult(curs,file_path,full_count)
                 print("Table: "+table_name+" was successfully extracted")
                 #print the scripts run duration
                 print("Duration:  "+str(datetime.now() - start_time))
                 sys.exit(0)
         else:
                 print("AlreadyThere Exception\nThe file already exists at "+file_path+". Please remove it before continuing\n")
                 #print the scripts run duration
                 print("Duration:  "+str(datetime.now() - start_time))
                 sys.exit(2)

 main(sys.argv)

谢谢

马兹

推荐答案

如果只有4个字节的unicode点给出错误,则可能会有所帮助. 一种解决方案是使用codecs.register_error注册自定义错误处理程序,该处理程序会过滤掉错误点,然后尝试解码:

If you have only 4-byte unicode points giving an error, this probably may help. One solution is to register a custom error handler using codecs.register_error, which would filter out error points and then just try to decode:

import codecs

def error_handler(error):
    return '', error.end+6

codecs.register_error('nonunicode', error_handler)

b'abc\xffffffc2def'.decode(errors='nonunicode')
# gives you 'abcdef' which's exactly what you want

您可能还想提高处理程序来捕获更复杂的错误,请参见 https://docs.python.org/3/library/exceptions.html#UnicodeError https://docs.python.org/3/library/codecs.html#codecs.register_error 了解详情

You may futher impove your handler to catch more complicated errors, see https://docs.python.org/3/library/exceptions.html#UnicodeError and https://docs.python.org/3/library/codecs.html#codecs.register_error for details

这篇关于需要跳过包含“值错误"的行.的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆