How to delete a row in a CSV file with PowerShell in R?


Problem description

Good morning,

I'm new to PowerShell and would like to ask whether somebody can help me.

I have a big CSV file of around 3.5 GB, and my goal is to load it with fread (a data.table function) in R, but the function stops partway through the file.

> n_a<-fread("C:/x/xy/xyz/name_file.csv",sep=";", fill = TRUE)

The error is:

Warning message:
In fread("C:/x/xy/xyz/name_file.csv") :
  Stopped early on line 458945. Expected 29 fields but found 30. Consider fill=TRUE and comment.char=. First discarded non-empty line

I tried different ways to solve the problem (I put fill=TRUE in my code, but it doesn't work), without success.

After some research I found this kind of solution (still run from within R):

>system("powershell Get-Content C:/a/b/c/file.csv | Select -Index (0..458944 + 1000000) > output.csv")

The point of using PowerShell from R is to delete a specific row and then load the file with fread.

My question is:

How can I delete a specific row from a CSV file in PowerShell without specifying the total number of rows (the length of the matrix)?

Thank you in advance for any kind of help.

Francesco

Recommended answer

I'd hazard a guess that the location of the invalid rows is not known in advance. In that case, it is sensible to read the original file and create a new file that contains only valid data. What's more, if the source data would benefit from manipulation, that can be done before reading it into R.
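Before filtering, it may help to see which field counts actually occur in the file. The following is a minimal sketch, not part of the original answer: `Get-FieldCountTally` is a hypothetical helper name, and it assumes a plain delimiter-separated file in which no quoted field contains the delimiter.

```powershell
# Hypothetical helper: tally how many lines have each number of fields.
# Assumes no quoted field contains the delimiter.
function Get-FieldCountTally {
    param(
        [Parameter(Mandatory)][string]$Path,
        [string]$Delimiter = ';'
    )
    # .NET resolves relative paths against the process working directory,
    # so resolve the path through PowerShell first
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    # -split takes a regular expression, so escape the delimiter
    $pattern = [regex]::Escape($Delimiter)
    $tally = @{}
    $reader = [IO.File]::OpenText($fullPath)
    try {
        # Read one line at a time so the whole file is never held in memory
        while ($null -ne ($line = $reader.ReadLine())) {
            $count = @($line -split $pattern).Count
            # A missing key yields $null, which casts to 0
            $tally[$count] = 1 + [long]$tally[$count]
        }
    }
    finally {
        $reader.Dispose()
    }
    return $tally
}
```

Calling `Get-FieldCountTally 'MyCsvFile.csv'` returns a hashtable keyed by field count. The count shared by most lines is presumably the valid one, and any other keys point to rows worth discarding or repairing.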

A file as large as 3.5 GiB is a bit on the large side to read into memory in one go. Sure, it can be done in these days of 64-bit systems, but for simple row processing it's unwieldy. A scalable solution uses .NET methods and a row-by-row approach.

To process a file row by row, use .NET methods for efficient line reading. A StringBuilder is created to store the rows that contain valid data; the others are discarded. The StringBuilder is flushed to disk every so often. Even in these days of SSDs, a write operation for each row is slow compared with writing in bulk, say, 10 000 rows at a time.

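$sb = New-Object Text.StringBuilder
# Note: .NET resolves relative paths against the process working directory,
# which can differ from PowerShell's current location; a full path is safer
$reader = [IO.File]::OpenText("MyCsvFile.csv")
$i = 0
$MaxRows = 10000
# Number of fields in a valid row (fread reported 29 expected fields
# for the file in the question); adjust for other files
$fieldCount = 29
try {
    while($null -ne ($line = $reader.ReadLine())) {
        # Split the line on semicolons
        # (naive: assumes no quoted field contains a semicolon)
        $elements = $line -split ';'
        # If there were $fieldCount elements, add the line to the builder
        if($elements.Count -eq $fieldCount) {
            # If $line's contents need modifications, do it here
            # before adding it into the builder
            [void]$sb.AppendLine($line)
            ++$i
        }
        # Write builder contents into file every now and then
        if($i -ge $MaxRows) {
            # -NoNewline (PowerShell 5.0+): AppendLine already ended each row
            Add-Content "MyCleanCsvFile.csv" $sb.ToString() -NoNewline
            [void]$sb.Clear()
            $i = 0
        }
    }
}
finally {
    # Release the file handle even if processing fails
    $reader.Close()
}
# Flush the builder after the loop if there's data
if($sb.Length -gt 0) {
    Add-Content "MyCleanCsvFile.csv" $sb.ToString() -NoNewline
}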
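
If the goal really is to drop one known line, as asked in the question, the same row-by-row idea works without knowing how many rows the file has. The following is a minimal sketch under stated assumptions rather than part of the original answer: `Remove-CsvLine` is a hypothetical helper name, line numbers are 1-based (as in the fread message), and the output is written as UTF-8 without a byte order mark.

```powershell
# Hypothetical helper: copy a text file, skipping a single 1-based line number.
# The total number of lines never needs to be known.
function Remove-CsvLine {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$Destination,
        [Parameter(Mandatory)][long]$LineNumber
    )
    # Resolve both paths through PowerShell, because .NET uses the process
    # working directory rather than PowerShell's current location
    $inPath  = (Resolve-Path -LiteralPath $Path).ProviderPath
    $outPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Destination)
    $reader = $null
    $writer = $null
    try {
        $reader = [IO.File]::OpenText($inPath)
        # StreamWriter overwrites an existing file and writes UTF-8 without BOM
        $writer = New-Object IO.StreamWriter $outPath
        $n = 0
        while ($null -ne ($line = $reader.ReadLine())) {
            $n++
            # Copy every line except the one to delete
            if ($n -ne $LineNumber) { $writer.WriteLine($line) }
        }
    }
    finally {
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
    }
}
```

For the file in the question, a call might look like `Remove-CsvLine -Path 'C:/a/b/c/file.csv' -Destination 'output.csv' -LineNumber 458945`, where 458945 is the line number reported by fread. If more than one row is malformed, filtering on the field count as above is likely the more robust route.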
