Split data into a 3-column dataframe
Question
I'm having trouble parsing a data file into a data frame. When I read the data with pandas, I get a one-column data frame that contains all the information:
Server
7.14.182.917 - - [20/Dec/2018:08:30:21 -0500] "GET /tools/performance/log/lib/ui-bootstrap-tpls-0.23.5.min.js HTTP/1.1" 235 89583
7.18.134.196 - - [20/Dec/2018:07:40:13 -0500] "HEAD / HTTP/1.0" 502 -
...
I want to parse the data into three columns. I tried using df[['Server', 'Date', 'Address']] = pd.DataFrame([ x.split() for x in df['Server'].tolist() ])
but I'm getting the error ValueError: Columns must be same length as key
Is there a way to parse the data into 3 columns, as follows?
Server Date Address
7.14.182.917 20/Dec/2018:08:30:21 -0500. "GET /tools/performance/log/lib/ui-bootstrap-tpls-0.23.5.min.js HTTP/1.1" 235 89583
Recommended answer
Several approaches are possible, depending on the input file's type and format. If the input is a valid path to a text file, try one of these:
import pandas as pd
# approach 1
df = pd.read_fwf('inputfile.txt')
# approach 2
df = pd.read_csv("inputfile.txt", sep = "\t") # check the delimiter
# then select the columns you want
df_subset = df[['Server', 'Date', 'Address']]
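As for the ValueError in the question, a likely cause is that x.split() splits on every run of whitespace, so each log line becomes many tokens rather than three, and the resulting frame has more columns than the three names on the left-hand side. The following is a minimal sketch of that mismatch, assuming the two sample lines from the question; the names tokens, n_token_columns and error_message are illustrative only.

```python
import pandas as pd

# One-column frame holding the two sample lines from the question
df = pd.DataFrame({"Server": [
    '7.14.182.917 - - [20/Dec/2018:08:30:21 -0500] "GET /tools/performance/log/lib/ui-bootstrap-tpls-0.23.5.min.js HTTP/1.1" 235 89583',
    '7.18.134.196 - - [20/Dec/2018:07:40:13 -0500] "HEAD / HTTP/1.0" 502 -',
]})

# Splitting on whitespace yields one column per token, not three columns
tokens = pd.DataFrame([x.split() for x in df["Server"].tolist()])
n_token_columns = tokens.shape[1]

# Assigning that wider frame to three column names is what raises the error
try:
    df[["Server", "Date", "Address"]] = tokens
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(n_token_columns, error_message)
```

Checking tokens.shape before the assignment makes this kind of mismatch visible early.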
Full solution:
import pandas as pd
# read in text file
# on_bad_lines requires pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv("test_input.txt", sep=" ", on_bad_lines="skip")
# convert df to string
df = df.astype(str)
# get num rows
num_rows = df.shape[0]
# get IP from index, then reset index
df['IP'] = df.index
# reset index to proper index
new_index = pd.Series(list(range(num_rows)))
df = df.set_index([new_index])
# rename columns and drop old cols
df = df.rename(columns={'Server': 'Date', 'IP': "Server"})
# create Date col, drop old col
df['Date'] = df.Date.str.cat(df['Unnamed: 1'])
df = df.drop(["Unnamed: 1"], axis=1)
# Create address col, drop old col
df['Address'] = df['Unnamed: 2'] + df['Unnamed: 3'] + df['Unnamed: 4']
df = df.drop(["Unnamed: 2","Unnamed: 3","Unnamed: 4"], axis=1)
# Strip brackets, other chars
df['Date'] = df['Date'].str.strip("[]")
df['Server'] = df["Server"].astype(str)
df['Server'] = df['Server'].str.strip("()-'', '-',")
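An alternative sketch, not part of the original answer: because every sample line follows the same access-log layout, a single regular expression with named groups can extract the three fields directly through Series.str.extract. The pattern and the names raw and pattern are assumptions based only on the two sample lines in the question; any line that does not match the pattern comes back as NaN in all three columns.

```python
import pandas as pd

# The two sample lines from the question; in practice these would be read from the file
raw = pd.Series([
    '7.14.182.917 - - [20/Dec/2018:08:30:21 -0500] "GET /tools/performance/log/lib/ui-bootstrap-tpls-0.23.5.min.js HTTP/1.1" 235 89583',
    '7.18.134.196 - - [20/Dec/2018:07:40:13 -0500] "HEAD / HTTP/1.0" 502 -',
])

# Named groups become the column names of the extracted frame:
#   Server  = first whitespace-delimited token (the client IP)
#   Date    = everything between the square brackets
#   Address = the rest of the line (request, status code, size)
pattern = r'^(?P<Server>\S+) \S+ \S+ \[(?P<Date>[^\]]+)\] (?P<Address>.*)$'

df = raw.str.extract(pattern)
print(df)
```

To apply this to a file, the lines could be loaded into a Series first (for example by reading the file and calling splitlines()), followed by dropna() to discard a header line such as Server that does not match the pattern.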