Converting (web site) text file into data frame in R


Problem description




I have a text file from a sports betting website and want to convert the lines into a data frame.

The text file looks like this:

18 May 2013 1   X   2   B's
15:30   Augsburg - Greuther 3:1 
1.43
4.55
7.27
16
18:30   Dortmund - Hoffenheim   1:2 
1.39
5.23
6.79
16
11 May 2013 1   X   2   B's
15:30   Bayer - Hannover    3:1 
1.29
5.77
9.46
16

The data frame should look like this afterwards:

Date        Time    Team1       Team2       G1  G2  1   0   2
18 May 2013 15:30   Augsburg    Greuther    3   1   1.43    4.55    7.27
18 May 2013 18:30   Dortmund    Hoffenheim  1   2   1.39    5.23    6.79
11 May 2013 15:30   Bayer       Hannover    3   1   1.29    5.77    9.46

I was thinking about a for loop in which I check whether the current line contains a date. I would store it in a variable current_date, which is only updated when a new date appears. For example, the first two matches are on the same day, so the date variable stays at 18 May for the second match.
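The carry-forward idea can also be written without an explicit loop. The following is only a minimal sketch: it assumes the file has already been read into a character vector (here hard-coded as `lines` from the sample above), that header lines start with a date such as "18 May 2013", and that the very first line is such a header. The names `is_date`, `date_text` and `current_date` are illustrative.

```r
# Sample lines in the layout shown above (assumption: one string per line).
lines <- c("18 May 2013 1   X   2   B's",
           "15:30   Augsburg - Greuther 3:1 ",
           "1.43", "4.55", "7.27", "16",
           "18:30   Dortmund - Hoffenheim   1:2 ",
           "1.39", "5.23", "6.79", "16",
           "11 May 2013 1   X   2   B's",
           "15:30   Bayer - Hannover    3:1 ",
           "1.29", "5.77", "9.46", "16")

# A header line starts with "<day> <month name> <year>".
date_pattern <- "^([0-9]{1,2} [A-Za-z]+ [0-9]{4}).*$"
is_date <- grepl(date_pattern, lines)

# Keep only the date part of each header line.
date_text <- sub(date_pattern, "\\1", lines[is_date])

# cumsum() counts the headers seen so far, so it indexes the most recent
# header for every line (assumes the first line is a header).
current_date <- date_text[cumsum(is_date)]
```

Here `current_date` has one entry per input line, so the date belonging to any match line can be looked up by that line's index.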

I would want to produce vectors containing the current date, time, team1, team2, result (goals1, goals2), and then the odds for a win, a draw, and a loss.

Then I would just rbind them one under another.

I think my biggest problem would be reading the data file line by line and checking what type each line is. Can one specify that after the time the next token is team1, that team2 comes after the "-", that G1 and G2 sit before and after the ":", and that the next three lines are simply included raw in that vector?
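One way to express "time, then team1, then "-", then team2, then G1:G2" is a regular expression with capture groups. This is a sketch rather than a definitive parser: the pattern is an assumption based on the sample lines, and it relies on `regexec()` accepting `perl = TRUE` (available in R 3.3.0 and later) so that the lazy quantifiers work.

```r
match_line <- "18:30   Dortmund - Hoffenheim   1:2 "

# Capture groups: time, team1, team2, goals1, goals2.
# The lazy (.+?) groups stop at the first " - " and at the whitespace
# before the score, so team names are returned without padding.
pattern <- "^([0-9]{1,2}:[0-9]{2})\\s+(.+?)\\s+-\\s+(.+?)\\s+([0-9]+):([0-9]+)\\s*$"

# regmatches() returns the full match first, followed by the five groups.
parts <- regmatches(match_line, regexec(pattern, match_line, perl = TRUE))[[1]]

time  <- parts[2]
team1 <- parts[3]
team2 <- parts[4]
g1    <- as.numeric(parts[5])
g2    <- as.numeric(parts[6])
```

Because the team groups match any characters, this pattern should also cope with team names that contain spaces, as long as the name itself does not contain " - ".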

I am also not sure whether a for loop is the smartest idea if the text file grows to around 20,000 lines. Also, the fourth line after the time should be excluded.

I am sorry if I ask questions like that, I know I could try out stuff for some more hours and post my code here but I would probably end up with insufficient half-baked code :/

Solution

Here is one attempt:

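# Copy the sample text to the clipboard first, or pass a file path instead of "clipboard".
lines <- readLines("clipboard")

# Parse English month names regardless of the system locale; the locale is restored at the end.
lct <- Sys.getlocale("LC_TIME"); Sys.setlocale("LC_TIME", "C")

# Header lines start with a date; match lines contain a time such as 15:30.
idx_dates <- strptime(lines, "%d %B %Y")
idx_dates <- which(!is.na(idx_dates))
idx_times <- grep("[0-9]+:[0-9]+", lines)

parse_item <- function(i) {
    # Most recent header line above match line i
    date <- lines[[max(idx_dates[idx_dates < i])]]
    # Drop the trailing 16 characters (" 1   X   2   B's") to keep only the date
    date <- substr(date, 1, nchar(date)-16)
    # Append the kick-off time (first 5 characters of the match line)
    date <- paste(date, substr(lines[[i]], 1, 5))
    date <- strptime(date, "%d %B %Y %H:%M")
    # Rest of the match line: "team1 - team2 g1:g2" (assumes one-word team names)
    teamsgoals <- substring(lines[[i]], 9)
    teamsgoals <- gsub(" +", " ", teamsgoals)
    teamsgoals <- strsplit(teamsgoals, " ")[[1]]
    team1 <- teamsgoals[1]
    team2 <- teamsgoals[3]
    goals <- strsplit(teamsgoals[4], ":")[[1]]
    g1 <- as.numeric(goals[1])
    g2 <- as.numeric(goals[2])
    # The three odds follow the match line; the fourth line after it is ignored
    q1 <- as.numeric(lines[[i+1]])
    q0 <- as.numeric(lines[[i+2]])
    q2 <- as.numeric(lines[[i+3]])
    data.frame(date=date, team1=team1, team2=team2, g1=g1, g2=g2, q1=q1, q0=q0, q2=q2, stringsAsFactors=FALSE)
}

parsed <- lapply(idx_times, FUN=parse_item)
do.call(rbind, parsed)  # stack the one-row data frames
Sys.setlocale("LC_TIME", lct)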

which returns

                 date    team1      team2 g1 g2   q1   q0   q2
1 2013-05-18 15:30:00 Augsburg   Greuther  3  1 1.43 4.55 7.27
2 2013-05-18 18:30:00 Dortmund Hoffenheim  1  2 1.39 5.23 6.79
3 2013-05-11 15:30:00    Bayer   Hannover  3  1 1.29 5.77 9.46
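For a file of around 20,000 lines, as the question mentions, building one small data frame per match and binding them together may become slow. A possible variation, sketched here under the assumption that every match line is followed by exactly three odds lines, reads the odds for all matches at once by index offset. The names `idx` and `odds` are illustrative and not part of the answer above.

```r
# Sample lines in the layout shown in the question.
lines <- c("18 May 2013 1   X   2   B's",
           "15:30   Augsburg - Greuther 3:1 ",
           "1.43", "4.55", "7.27", "16",
           "18:30   Dortmund - Hoffenheim   1:2 ",
           "1.39", "5.23", "6.79", "16",
           "11 May 2013 1   X   2   B's",
           "15:30   Bayer - Hannover    3:1 ",
           "1.29", "5.77", "9.46", "16")

# Indices of all match lines (lines that start with a kick-off time).
idx <- grep("^[0-9]{1,2}:[0-9]{2}", lines)

# The odds sit at fixed offsets below each match line; the fourth line
# after the match line is never read, which excludes it.
odds <- data.frame(time = substr(lines[idx], 1, 5),
                   q1   = as.numeric(lines[idx + 1]),
                   q0   = as.numeric(lines[idx + 2]),
                   q2   = as.numeric(lines[idx + 3]),
                   stringsAsFactors = FALSE)
```

Each column is created in a single vectorized call, so the data frame is built once instead of being grown row by row.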
