Selecting rows where a column has a string like 'hsa..' (partial string match)
Question
I have a 371MB text file containing micro RNA data. Essentially, I would like to only select those rows that have information about human microRNA.
I have read in the file using read.table. Usually, I'd accomplish what I want with sqldf, if it had a 'like' syntax (select * from <> where miRNA like 'hsa'). Unfortunately, sqldf does not support that syntax.
How can I do this in R? I have looked around Stack Overflow and do not see an example of how to do a partial string match. I even installed the stringr package, but it does not quite have what I need.
What I would like to do is something like this, where all rows matching hsa-* are selected:
selectedRows <- conservedData[, conservedData$miRNA %like% "hsa-"]
which of course, is not correct syntax.
Can somebody please help me with this? Thanks a lot for reading.
Recommended answer
I notice that you mention a function %like% in your current approach. I don't know if that's a reference to the %like% from "data.table", but if it is, you can definitely use it as follows.
Note that the object does not have to be a data.table (but also remember that subsetting approaches for data.frames and data.tables are not identical):
library(data.table)
mtcars[rownames(mtcars) %like% "Merc", ]
iris[iris$Species %like% "osa", ]
If that is what you had, then perhaps you just mixed up the row and column positions when subsetting: the condition that selects rows goes before the comma, not after it.
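To make the row/column distinction concrete, here is a minimal base R sketch. The data frame df and its contents are made up for illustration and are not the asker's data. A logical vector with one element per row belongs before the comma; placed after the comma, R treats it as a column selector.

```r
# Hypothetical toy data, for illustration only
df <- data.frame(
  miRNA = c("hsa-let-7a", "mmu-let-7a", "hsa-miR-21"),
  score = c(0.9, 0.8, 0.7),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE value per row; fixed = TRUE treats "hsa-" as a literal string
keep <- grepl("hsa-", df$miRNA, fixed = TRUE)

# Row selection: the condition goes BEFORE the comma
rowsKept <- df[keep, ]

# The same vector AFTER the comma is interpreted as a column selector.
# Here it has 3 elements but df has only 2 columns, so the call is wrapped
# in tryCatch() in case it raises an error.
colAttempt <- tryCatch(df[, keep], error = function(e) conditionMessage(e))
```

In this sketch rowsKept holds the two hsa- rows, while colAttempt shows what happens when the same vector lands in the column position.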
If you don't want to load a package, you can try using grep() to search for the string you're matching. Here's an example with the mtcars dataset, where we match all rows whose row names include "Merc":
mtcars[grep("Merc", rownames(mtcars)), ]
# mpg cyl disp hp drat wt qsec vs am gear carb
# Merc 240D 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
# Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
# Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
# Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
# Merc 450SL 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
# Merc 450SLC 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3
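As a side note, grep() returns integer positions, while its sibling grepl() returns a logical vector. Both work for selecting rows, but they behave differently when you want to exclude matches. The following sketch uses only base R and the built-in mtcars data; the search string "Tesla" is just an arbitrary value that matches nothing.

```r
# Integer positions of the matching row names
idx <- grep("Merc", rownames(mtcars))

# Logical vector with one element per row
isMerc <- grepl("Merc", rownames(mtcars))

# Either form selects the same rows
sameRows <- identical(mtcars[idx, ], mtcars[isMerc, ])

# Excluding matches: negating the logical vector keeps every row
# when nothing matches
noMatch <- mtcars[!grepl("Tesla", rownames(mtcars)), ]

# -grep() with zero matches produces an empty index, so no rows are returned
empty <- mtcars[-grep("Tesla", rownames(mtcars)), ]
```

For that reason grepl() is often the safer choice when the filter might be negated.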
And another example, using the iris dataset and searching for the string osa:
irisSubset <- iris[grep("osa", iris$Species), ]
head(irisSubset)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
For your problem, try:
selectedRows <- conservedData[grep("hsa-", conservedData$miRNA), ]
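Note that the pattern "hsa-" matches anywhere in the string. If the intent is names that begin with hsa-, anchoring the pattern with ^ may be closer to what you want. Below is a small sketch; the conservedData object here is a made-up stand-in, since the real file contents are not shown in the question, and the fourth name is invented purely to show the difference.

```r
# Hypothetical stand-in for the asker's data; real contents are unknown
conservedData <- data.frame(
  miRNA = c("hsa-miR-21", "mmu-miR-21", "hsa-let-7a", "dre-miR-hsa-like"),
  stringsAsFactors = FALSE
)

# Unanchored: matches "hsa-" anywhere, so the invented 4th name is kept too
anywhere <- conservedData[grepl("hsa-", conservedData$miRNA), , drop = FALSE]

# Anchored: only names that start with "hsa-"
atStart <- conservedData[grepl("^hsa-", conservedData$miRNA), , drop = FALSE]

# startsWith() (base R 3.3.0 and later) does the same without a regex;
# it requires a character vector, so convert factors with as.character() first
alsoAtStart <- conservedData[startsWith(conservedData$miRNA, "hsa-"), , drop = FALSE]
```

The drop = FALSE argument is only there because this toy data frame has a single column; with several columns the result stays a data frame without it.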