Set up large database in MySQL for analysis in R


Question

I have reached the limit of my RAM when analyzing large datasets in R. I think my next step is to import these data into a MySQL database and use the RMySQL package. Largely because I don't know database lingo, I haven't been able to get beyond installing MySQL, despite hours of Googling and RSeeking (I am running MySQL and MySQL Workbench on Mac OS X 10.6, but can also run Ubuntu 10.04).

Is there a good reference on how to get started with this usage? At this point I don't want to do any sort of relational databasing. I just want to import .csv files into a local MySQL database and do the subsetting with RMySQL.

I appreciate any pointers (including "You're way off base!"), as I'm new to R and newer to large datasets. This one is around 80 MB.

Recommended answer

The documentation for RMySQL is pretty good - but it does assume that you know the basics of SQL. These are:

  1. Create a database
  2. Create a table
  3. Put data into the table
  4. Get data out of the table

Step 1 is easy: in the MySQL console, simply run "create database DBNAME;". Alternatively, use mysqladmin from the command line, or one of the MySQL admin GUIs.

Step 2 is a little more difficult, since you have to specify the table fields and their types. These will depend on the contents of your CSV (or other delimited) file. A simple example would look something like:

use DBNAME;
create table mydata(
  -- unique row id, filled in automatically for each new record
  id INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  -- up to 5 digits in total, 2 of them after the decimal point
  height FLOAT(5,2)
);

This creates a table with 2 fields: id, which is the primary key (so it has to be unique) and auto-increments as new records are added; and height, which is specified here as a float (a numeric type) with 5 digits in total, 2 of them after the decimal point (e.g. 100.27). It's important that you understand data types.

Step 3 - there are various ways to import data into a table. One of the easiest is to use the mysqlimport utility, which expects tab-delimited input by default. In the example above, assuming that your data are in a file with the same name as the table (mydata), with no header row, and that each line starts with a tab character (leaving the id field empty) followed by the height value, this would work:

mysqlimport -u DBUSERNAME -pDBPASSWORD DBNAME mydata
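
Since the question is about .csv files, here is a rough sketch of importing a comma-separated file with a header row instead. It assumes a hypothetical file mydata.csv holding a single height column. The sample values, the import_mydata function name, and the DBUSERNAME/DBNAME placeholders are illustrative only. The function is defined but not called, because it needs a running MySQL server and the mydata table from step 2.

```shell
#!/bin/sh
# Write a small sample CSV (hypothetical data): a header row plus three heights.
printf 'height\n100.27\n62.50\n48.10\n' > mydata.csv

# Import helper (defined here, not run: it needs a MySQL server and the mydata table).
# mysqlimport takes the table name from the file name, so mydata.csv loads into mydata.
# --local reads the file from the client machine, --fields-terminated-by sets the comma
# delimiter, --ignore-lines=1 skips the header, and --columns=height loads only the
# height column so that id is left to auto-increment.
import_mydata() {
  mysqlimport --local \
    --fields-terminated-by=',' \
    --ignore-lines=1 \
    --columns=height \
    -u DBUSERNAME -p DBNAME mydata.csv
}

# Quick sanity check before importing: count the data rows (everything after the header).
rows=$(tail -n +2 mydata.csv | wc -l | tr -d ' ')
echo "rows to import: $rows"
```

Using -p without a password makes mysqlimport prompt for it, which keeps the password out of your shell history. Depending on the server configuration, --local may require local file loading to be enabled.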

Step 4 - this requires that you know how to run MySQL queries. Again, a simple example:

select * from mydata where height > 50;

This means "fetch all rows (id + height) from the table mydata where height is more than 50".
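
If you'd rather try the query without opening the MySQL console, the same kind of statement can be passed to the mysql client with -e. A rough sketch follows; the helper names build_query and run_subset, the output file subset.tsv, and the placeholder credentials are all hypothetical. run_subset is only defined, not called, since it needs a server.

```shell
#!/bin/sh
# Build the SELECT statement for a given minimum height (hypothetical helper).
# Anything other than digits and dots is rejected, so that nothing unexpected
# ends up in the SQL text.
build_query() {
  case "$1" in
    '' | *[!0-9.]*) echo "threshold must be a number" >&2; return 1 ;;
  esac
  printf 'SELECT id, height FROM mydata WHERE height > %s;\n' "$1"
}

# Run the query and save the rows as tab-separated text (defined, not called here).
# --batch makes the mysql client print tab-separated output instead of a boxed table.
run_subset() {
  query=$(build_query "$1") || return 1
  mysql --batch -u DBUSERNAME -p -e "$query" DBNAME > subset.tsv
}

# Show the statement that would be sent for a threshold of 50.
build_query 50
```

A file written this way is plain tab-delimited text with a header line, so it could be read into R with read.delim() as an alternative to going through RMySQL.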

Once you have mastered those basics, you can move on to more complex examples, such as creating 2 or more tables and running queries that join data from each.

Then you can turn to the RMySQL manual. In RMySQL, you set up the database connection, then use SQL query syntax to return rows from the table as a data frame. So it really is important that you get the SQL part - the RMySQL part is easy.

There are heaps of MySQL and SQL tutorials on the web, including the "official" tutorial at the MySQL website. Just search for "mysql tutorial".

Personally, I don't consider 80 MB to be a large dataset at all; I'm surprised that this is causing a RAM issue, and I'm sure that native R functions can handle it quite easily. But it's good to learn new skills such as SQL, even if you don't need them for this problem.
