将MySQL转储导入R(不需要MySQL服务器) [英] Import MySQL dump into R (without requiring MySQL server)
问题描述
包括 RMySQL
和 sqldf
允许与本地或远程数据库服务器连接。我正在创建一个可移植项目,涉及在不总是访问正在运行的服务器,但是总是访问最新的.sql转储的情况下(或在设备上)导入sql数据。数据库。
Packages like RMySQL
and sqldf
allow one to interface with local or remote database servers. I'm creating a portable project which involves importing sql data in cases (or on devices) which do not always have access to a running server, but which do always have access to the latest .sql dump of the database.
目标似乎足够简单:将.sql转储导入R而不涉及MySQL服务器。想创建一个列表列表,其中元素对应于在.sql转储中定义的任何数据库(可能有多个),这些元素又由这些数据库中的表组成。
The goal seems simple enough: import an .sql dump into R without the involvement of a MySQL server. More specifically, I'd like to create a list of lists in which the elements correspond to any databases defined in the .sql dump (there may be multiple), and those elements in turn consist of the tables in those databases.
为了使这个可重现,让我们来看一个示例sportsdb SQL文件在这里 - 如果你解压缩它叫sportsdb_sample_mysql_20080303.sql。
To make this reproducible, let's take the sample sportsdb SQL file here — if you unzip it it's called sportsdb_sample_mysql_20080303.sql.
一个人会认为sqldf可以这样做:
One would think sqldf might be able to do it:
read.csv.sql('sportsdb_sample_mysql_20080303.sql',sql =SELECT * FROM addresses)
sqliteSendQuery con,statement,bind.data):
语句中的错误:no such table:addresses
转储中的表地址。 此信息 sqldf列表提到同样的错误,但没有解决方案。
This even though there certainly is a table addresses in the dump. This post on the sqldf list mentions the same error, but no solution.
然后在包 ProjectTemplate $ c>中有一个
sql.reader
$ c>,看起来很有前途。在此处,您可以在此处,它假定正在运行的数据库服务器并依赖于 RMySQL
- 而不是我需要的。
Then there is an sql.reader
function in the package ProjectTemplate
, which looks promising. Poking around, the source for the function can be found here, and it assumes a running database server and relies on RMySQL
— not what I need.
。我们似乎耗尽了选项。任何帮助从hivemind赞赏!
So... we seem to be running out of options. Any help from the hivemind appreciated!
(重申,我不是寻找依赖于访问SQL服务器的解决方案; dbreadTable
从 RMySQL
包。我非常想绕过服务器,并从.sql转储文件直接获取数据。)
(To reiterate, I am not looking for a solution that relies on access to an SQL server; that's easy with dbReadTable
from the RMySQL
package. I would very much like to bypass the server and get the data straight from the .sql dump file.)
推荐答案
取决于你想从表中提取什么,这里是如何玩弄数据
depending on what you want to extract from the table, here is how you can play around with the data
numLines <- R.utils::countLines("sportsdb_sample_mysql_20080303.sql")
# [1] 81266
linesInDB <- readLines("sportsdb_sample_mysql_20080303.sql",n=60)
可以做一些regex获取表名(在CREATE TABLE之后),列名(在第一个括号之间)和VALUES(CREATE TABLE之后和第二个括号之间的行)
Then you can do some regex to get tables names (after CREATE TABLE), column names (between first brackets) and VALUES (lines after CREATE TABLE and between second brackets)
逆向工程mysqldump输出MySQL Workbench给出从尖点开始的语句包含非UTF8字符错误
编辑:为了回应OP的回答,如果我解释python脚本正确,逐行读取,过滤INSERT INTO行,解析为csv,然后写入文件。这与我原来的建议非常相似。如果文件大小太大,最好使用一些其他R包读取大块文件。
in response to OP's answer, if i interpret the python script correct, it is also reading it line by line, filter for INSERT INTO lines, parse as csv, then write to file. This is very similar to my original suggestion. My version below in R. If the file size is too large, it would be better to read in the file in chunks using some other R package
rm(list=ls())
options(stringsAsFactors=F)
library(utils)
library(stringi)
library(plyr)
mysqldumpfile <- "sportsdb_sample_mysql_20080303.sql"
allLines <- readLines(mysqldumpfile)
insertLines <- allLines[which(stri_detect_fixed(allLines, "INSERT INTO"))]
allwords <- data.frame(stri_extract_all_words(insertLines, " "))
d_ply(allwords, .(X3), function(x) {
#x <- split(allwords, allwords$X3)[["baseball_offensive_stats"]]
print(x[1,3])
#find where the header/data columns start and end
valuesCol <- which(x[1,]=="VALUES")
lastCols <- which(apply(x, 2, function(y) all(is.na(y))))
datLastCol <- head(c(lastCols, ncol(x)+1), 1) - 1
#format and prepare for write to file
df <- data.frame(x[,(valuesCol+1):datLastCol])
df <- setNames(df, x[1,4:(valuesCol-1)])
#type convert before writing to file otherwise its all strings
df[] <- apply(df, 2, type.convert)
#write to file
write.csv(df, paste0(x[1,3],".csv"), row.names=F)
})
这篇关于将MySQL转储导入R(不需要MySQL服务器)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!