pandas dataframe read_csv: specify columns and keep the whole line as a string
Problem description
In pandas read_csv, is there a way to specify, e.g., col1, col15 and the whole line?
I am trying to import about 700000 rows of data from a text file that uses hats '^' as delimiters, has no text qualifiers, and uses carriage return as the line delimiter.
From the text file I need column 1, column 15 and then the whole line, as three columns of a table/dataframe.
I've searched for how to do this in pandas, but I don't know it well enough to get the logic. I can import all 26 columns fine, but that doesn't help with my problem.
my_df = pd.read_csv("tablefile.txt", sep="^", lineterminator="\r", low_memory=False)
Alternatively, I can use standard Python to put the data into a table, but this takes about 4 hours for the 700000 rows, which is far too long for me.
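import re

# conn and cur are assumed to be an already-open database connection and cursor
count_1 = 0
for line in open('tablefile.txt'):
    if count_1 > 70:
        break
    else:
        # column 1: leading digits up to the first '^'
        col1id = re.findall(r'^(\d+)\^', line)
        # column 15: the digit field followed by exactly 11 more '^'-separated fields
        col15id = re.findall(r'^.*\^.*\^(\d+)\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*\^.*', line)
        line = line.strip()
        count_1 = count_1 + 1
        cur.execute('''INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)''',
                    (col1id[0], col15id[0], line, ) )
        # commits after every inserted row
        conn.commit()
print('row count_1=',count_1)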
So, is there a way in pandas read_csv to specify, e.g., col1, col15 and wholeline?
As in the code above, col1 and col15 are digits and wholeline is a string.
- I'd rather not rebuild the string after import, as I might lose some characters in the process.
Thanks
Committing to the database for each line was burning time.
Recommended answer
I put the conn.commit() outside the for loop. It reduced the load time to a few minutes, though I'm guessing it's less safe.
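A minimal sketch of committing once after the loop, under some assumptions: the thread never names the database library, so sqlite3 is assumed here because of the `?` placeholders; `load_lines` and the sample rows are hypothetical; and `str.split` is swapped in for the regular expressions to keep the example short.

```python
import sqlite3


def load_lines(conn, lines):
    """Insert one row per line, then commit a single time at the end."""
    cur = conn.cursor()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("^")
        # fields[0] is column 1, fields[14] is column 15
        cur.execute(
            "INSERT INTO mytable (mycol1id, mycol15id, wholeline) VALUES (?, ?, ?)",
            (fields[0], fields[14], line),
        )
        count += 1
    conn.commit()  # one commit for the whole batch, not one per row
    return count


# Hypothetical sample data: three lines of 26 '^'-separated numeric fields
conn = sqlite3.connect(":memory:")  # assumption: sqlite3, in-memory for the demo
conn.execute(
    "CREATE TABLE mytable (mycol1id INTEGER, mycol15id INTEGER, wholeline TEXT)"
)
sample = ["^".join(str(row * 100 + col) for col in range(1, 27)) for row in range(1, 4)]
inserted = load_lines(conn, sample)
rows = conn.execute("SELECT mycol1id, mycol15id, wholeline FROM mytable").fetchall()
```

On the "less safe" point: with a single commit, a failure partway through the load means none of the rows are committed, so the load has to be rerun from the start. Committing every few thousand rows would be a possible middle ground.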
Anyway, thanks for the help.
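For the original pandas question, which the answer above does not address, one possible approach is to read the raw lines first and derive the two numeric columns from them, so the whole line is never rebuilt. This is only a sketch under stated assumptions: `load_with_whole_line` is a hypothetical helper, the file is assumed to be UTF-8, and every line is assumed to have at least 15 fields with digits in fields 1 and 15.

```python
import os
import tempfile

import pandas as pd


def load_with_whole_line(path):
    """Return a dataframe with col1, col15 and the untouched whole line."""
    # Python's default universal-newline mode treats a lone "\r" as a line ending.
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as fh:
        lines = [line.rstrip("\n") for line in fh if line.strip()]
    df = pd.DataFrame({"wholeline": lines})
    # Split at most 15 times: pieces 0..14 are columns 1..15, piece 15 is the rest
    parts = df["wholeline"].str.split("^", n=15, expand=True)
    df.insert(0, "col1", pd.to_numeric(parts[0]))
    df.insert(1, "col15", pd.to_numeric(parts[14]))
    return df


# Hypothetical sample file: three lines of 26 fields, terminated by "\r"
sample = ["^".join(str(row * 100 + col) for col in range(1, 27)) for row in range(1, 4)]
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("\r".join(sample) + "\r")
    tmp_path = tmp.name

df = load_with_whole_line(tmp_path)
os.remove(tmp_path)
```

Because wholeline comes straight from the file and is only split, never joined back together, no characters should be lost along the way.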