Import raw data into R

Question

Can anyone help me import this data into R from a text or .dat file? It is space-delimited, but a city name should not be split into two fields. For example, NEW YORK is one name.

1 NEW YORK  7,262,700
2 LOS ANGELES  3,259,340
3 CHICAGO  3,009,530
4 HOUSTON  1,728,910
5 PHILADELPHIA  1,642,900
6 DETROIT  1,086,220
7 SAN DIEGO  1,015,190
8 DALLAS  1,003,520
9 SAN ANTONIO  914,350
10 PHOENIX  894,070


Recommended answer

A variation on a theme... but first, some sample data:

cat("1 NEW YORK  7,262,700",
    "2 LOS ANGELES  3,259,340",
    "3 CHICAGO  3,009,530",
    "4 HOUSTON  1,728,910",
    "5 PHILADELPHIA  1,642,900",
    "6 DETROIT  1,086,220",
    "7 SAN DIEGO  1,015,190",
    "8 DALLAS  1,003,520",
    "9 SAN ANTONIO  914,350",
    "10 PHOENIX  894,070", sep = "\n", file = "test.txt")

Step 1: Use readLines to read the file as a character vector, one element per line:

x <- readLines("test.txt")

Step 2: Figure out a regular expression that you can use to insert delimiters. Here, the pattern seems to be (reading from the end of each line) a set of digits and commas, preceded by whitespace, preceded by some words in ALL CAPS. We can capture those two groups and insert a tab delimiter (\t) in front of each. The doubled backslashes are needed because a backslash must itself be escaped inside an R string.

gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x)
#  [1] "1\t NEW YORK  \t7,262,700"     "2\t LOS ANGELES  \t3,259,340" 
#  [3] "3\t CHICAGO  \t3,009,530"      "4\t HOUSTON  \t1,728,910"     
#  [5] "5\t PHILADELPHIA  \t1,642,900" "6\t DETROIT  \t1,086,220"     
#  [7] "7\t SAN DIEGO  \t1,015,190"    "8\t DALLAS  \t1,003,520"      
#  [9] "9\t SAN ANTONIO  \t914,350"    "10\t PHOENIX  \t894,070"  
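
Because the pattern in Step 2 is not anchored at the start of the line, it can still match the tail of a line whose city name contains other characters (a period, lower-case letters, and so on), and the tabs would then land in the wrong place. A minimal sketch of a sanity check, assuming a stricter anchored pattern; the name `strict` and the last sample row are made up for illustration and are not part of the original answer:

```r
# Two rows from the question plus one made-up row that breaks the assumption
# that city names contain only capital letters and spaces.
x <- c("1 NEW YORK  7,262,700",
       "10 PHOENIX  894,070",
       "11 ST. EXAMPLE  123,456")   # hypothetical row, not in the original data

# Anchored pattern: rank, whitespace, all-caps name, whitespace, digits/commas.
strict <- "^[0-9]+\\s+[A-Z ]+\\s+[0-9,]+$"

ok <- grepl(strict, x)
ok       # which lines have the expected shape
x[!ok]   # lines that need a closer look before running gsub
```

Running a check like this before the substitution makes it easier to spot rows that would otherwise be split silently at the wrong position.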

Step 3: Since we know our gsub is working, and we know that read.delim has a "text" argument that can be used instead of a "file" argument, we can call read.delim directly on the result of gsub. Setting strip.white = TRUE trims the leftover spaces around the city names:

out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x), 
                  header = FALSE, strip.white = TRUE)
out
#    V1           V2        V3
# 1   1     NEW YORK 7,262,700
# 2   2  LOS ANGELES 3,259,340
# 3   3      CHICAGO 3,009,530
# 4   4      HOUSTON 1,728,910
# 5   5 PHILADELPHIA 1,642,900
# 6   6      DETROIT 1,086,220
# 7   7    SAN DIEGO 1,015,190
# 8   8       DALLAS 1,003,520
# 9   9  SAN ANTONIO   914,350
# 10 10      PHOENIX   894,070
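
As an alternative sketch, not the method used in this answer, the three fields can also be pulled out directly with capture groups and `sub`, skipping the intermediate tab-delimited text. The pattern `pat`, the object `out2`, and the column names are assumptions chosen for illustration:

```r
# A few rows in the same shape as the question's data.
x <- c("1 NEW YORK  7,262,700",
       "9 SAN ANTONIO  914,350",
       "10 PHOENIX  894,070")

# Anchored pattern with three groups: rank, city (ending in a letter),
# and the comma-formatted number.
pat <- "^([0-9]+) +([A-Z ]*[A-Z]) +([0-9,]+)$"

out2 <- data.frame(
  rank       = as.integer(sub(pat, "\\1", x)),
  city       = sub(pat, "\\2", x),
  population = as.numeric(gsub(",", "", sub(pat, "\\3", x))),
  stringsAsFactors = FALSE
)
out2
```

This variant assumes every line matches `pat`; a line that does not match is returned unchanged by `sub`, so the conversions would produce NA values for it.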

One possible last step would be to convert the third column to numeric by stripping the commas first:

out$V3 <- as.numeric(gsub(",", "", out$V3))
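
To illustrate why the commas have to be removed before the conversion, here is a small self-contained sketch. The data frame is rebuilt by hand from two rows of the output above, and the column names assigned at the end are assumptions, not part of the original answer:

```r
# Two rows copied by hand from the read.delim result above.
out <- data.frame(V1 = c(1L, 9L),
                  V2 = c("NEW YORK", "SAN ANTONIO"),
                  V3 = c("7,262,700", "914,350"),
                  stringsAsFactors = FALSE)

# Converting directly does not work: the commas make the strings non-numeric,
# so as.numeric() returns NA (with a warning, suppressed here).
direct <- suppressWarnings(as.numeric(out$V3))

# Strip the commas first, then convert.
out$V3 <- as.numeric(gsub(",", "", out$V3))

# Optional: give the columns descriptive (hypothetical) names.
names(out) <- c("rank", "city", "population")
str(out)
```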
