优化我的模拟数据库的代码(2) [英] Optimizating my code simulating a database (2)

查看:161
本文介绍了优化我的模拟数据库的代码(2)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有些天前,我提出了一个问题,我得到了一些非常有用的答案。



说明



我一直在研究一个程序,模拟一个小数据库,首先从txt文件中读取信息,并将它们存储在计算机内存中,然后,我可以使查询采用正常表和/或转置表。问题是,性能还不够好。它的工作速度比我预期的慢。我改进了它,但我想我应该改进它。我有具体的问题,我的程序没有良好的性能。



当前问题



我现在的第一个问题(我的程序较慢)是我花更多的时间,例如有100,000列& 100行(0.325分钟,我已经改善这要感谢你的帮助)超过10万行& 100列(1.61198分钟,与之前相同)。但另一方面,在第二种情况下(在确定的示例中,47秒对比第一种情况下的6079秒)访问一些数据的时间更好任何想法为什么



说明



现在让我提醒您我的程式码的运作方式b
$ b

首先我有一个.txt文件模拟一个数据库表,随机字符串用|分隔。这里有一个表的例子(7行和5列)。我还有转置表



NormalTable.txt

  42sKuG ^ uM | 24465\lHXP | 2996fQo\kN | 293cvByiV | 14772cjZ`SN | 
28704HxDYjzC | 6869xXj \\\
Ie | 27530EymcTU | 9041ByZM] I | 24371fZKbNk |
24085cLKeIW | 16945TuuU\Nc | 16542M [Uz\ | 13978qMdbyF | 6271ait ^ h |
13291_rBZS | 4032aFqa | 13967r ^ \\`T | 27754k] dOTdh | 24947] v_uzg |
1656nn_FQf | 4042OAegZq | 24022nIGz | 4735Syi] \ | 18128klBfynQ |
6618t\SjC | 20601S\EEp | 11009FqZN | 20486rYVPR | 7449SqGC |
14799yNvcl | 23623MTetGw | 6192n] YU\Qe | 20329QzNZO_ | 23845byiP |

TransposedTable.txt

  42sKuG ^ uM | 28704HxDYjzC | 24085cLKeIW | 13291_rBZS | 1656nn_FQf | 6618t\SjC | 14799yNvcl | 
24465\lHXP | 6869xXj \\\
Ie | 16945TuuU\Nc | 4032aFqa | 4042OAegZq | 20601S\EEp | 23623MTetGw |
2996fQo\kN | 27530EymcTU | 16542M [Uz\ | 13967r ^ \\`T | 24022nIGz | 11009FqZN | 6192n] YU \Qe |
293cvByiV | 9041ByZM] I | 13978qMdbyF | 27754k] dOTdh | 4735Syi] \ | 20486rYVPR | 20329QzNZO_ |
14772cjZ`SN | 24371fZKbNk | 6271ait ^ h | 24947] v_uzg | 18128klBfynQ | 7449SqGC | 23845byiP |

说明



.txt文件中的此信息由我的程序读取并存储在计算机内存中。然后,当进行查询时,我将访问存储在计算机内存中的这些信息。将数据加载到计算机内存可能是一个缓慢的过程,但以后访问数据将更快,真正重要的是我。



这里您拥有从文件中读取此信息并存储在计算机中的代码部分。



从Table.txt文件读取数据并将其存储在计算机内存中的代码

  int h; 
do
{
cout<< 你想查询普通表还是转置表?(1-正常表/ 2-转置表):;
cin>> h;
} while(h!= 1& h!= 2);

string ruta_base(C:\\Users\\Raul Velez\\Desktop\\\Tables\\);
if(h == 1)
{
ruta_base + =NormalTable.txt; //找到我的Table.txt的文件夹
}

if(h == 2)
{
ruta_base + =TransposedTable.txt;
}

string temp; //变量,其中Table.txt文件中的每一行将首先存储
vector< string>缓冲; //变量,其中每个不同的行将通过标记分隔不同的元素之后被存储。
vector< ElementSet> RowsCols; //变量与我创建的类,模拟一个向量和每个向量元素是我的表的一行

ifstream ifs(ruta_base.c_str());
while(getline(ifs,temp))//我们将每行读取和存储行,直到.txt文件结束。
{
size_t tokenPosition = temp.find(|); //当我们找到simbol|我们将识别不同的元素。因此,我们将字符串temp分成将存储在向量< string>中的令牌。 buffer
// ---新部分------------------------------------
const char * p = temp.c_str();
char * p1 = strdup(p);

char * pch = strtok(p1,|);
while(pch)
{
buffer.push_back(string(pch));
pch = strtok(NULL,|);
}
free(p1);

ElementSet sss(0,buffer);
buffer.clear();
RowsCols.push_back(sss); //我们将每一行的所有元素(作为向量< string>缓冲区存储)存储在RowsCols中的不同位置
// --- NEW PART END ----------- -------------------------
}

Table TablesStorage(RowsCols); //在每个循环之后,我们将关于每个.txt文件的信息存储在向量< Table>中。 TablesDescriptor
vector< Table>表描述符;
TablesDescriptor.push_back(TablesStorage); //在向量<表> TablesDescriptor将存储所有不同的表及其所有信息

DataBase数据库(1,TablesDescriptor);

上一篇文章中已提供的信息



之后,访问信息部分。让我们假设我想进行查询,我要求输入。假设我的查询是行n,以及连续的元组numTuples和列y。 (我们必须说,列数由十进制数y定义,它将被转换为二进制,并显示要查询的列,例如,如果我要求列54(二进制中的00110110)I将要求列2,3,5和6)。然后我访问计算机内存所需的信息,并将其存储在一个向量shownVector。


在循环 if(h == 2)



在访问转置表的数据时访问所需信息的代码

  int n,numTuples; 
unsigned long long int y;

cout<< 编写要获取更多信息的行的ID:;
cin>> n; //我们得到要表示的行 - > n

cout<< 写要查询的跟随元组的数量:;
cin>> numTuples; //我们得到要查询的跟随元组的数量 - > numTuples

cout<<写要获取更多信息的'列'的ID:;
cin>> y; //我们得到要表示的列y

unsigned int r; //用于列路径的辅助变量
int t = 0; //元组路径的辅助变量
int idTable;

vector< int> columnsToBeQueried; //这里我们将存储要从位组< 500>二进制数,在与掩码比较之后
向量< string> shownVector; //向量以存储来自查询的最终信息
bitset< 5000>面具;
mask = 0x1;

clock_t t1,t2;
t1 = clock(); //查询时间的开始

bitset< 5000> binaryNumber = Utilities()。getDecToBin(y); //我们得到列 - >将数字从十进制更改为二进制。最大列数:5000

//我们看到要查询哪些列
for(r = 0; r< binaryNumber.size(); r ++)//

if(binaryNumber.test(r)& mask.test(r))//如果它们都是位1
{
columnsToBeQueried.push_back
}
mask = mask<< 1;
}

do
{
for(int z = 0; z< columnsToBeQueried.size(); z ++)
{
ElementSet selectedElementSet;
int i;
i = columnsToBeQueried.at(z);
表& selectedTable = database.getPointer()。at(0); //它模拟一个具有指向构成数据库的不同表的指针的向量,但是我们的示例数据库只有一个表,所以不要担心ElementSet selectedElementSet;
if(h == 1)
{

selectedElementSet = selectedTable.getRowsCols()。at(n);
shownVector.push_back(selectedElementSet.getElements()。at(i)); //保存在向量shownVector中的行n的元素i
}

if(h == 2)
{
selectedElementSet = selectedTable.getRowsCols()。at(i);
shownVector.push_back(selectedElementSet.getElements()。at(n)); //我们在向量shownVector中保存行i的元素n
}
n = n + 1;
t ++;
}
} while(t
t2 = clock(); //查询时间结束
showVector()。finalVector(shownVector);
float diff((float)t2-(float)t1);
float microseconds = diff / CLOCKS_PER_SEC * 1000000;
cout<<Time:<< microseconds<< endl;

类别定义 b
$ b

这里我附加了一些类定义,以便您可以编译代码,并更好地了解它的工作原理:

  class ElementSet 
{
private:
int id;
vector< string>元素;

public:
ElementSet();
ElementSet(int,vector< string>&);

const int& getId();
void setId(int);

const vector< string>& getElements();
void setElements(vector< string>);

};

class Table
{
private:
vector< ElementSet> RowsCols;

public:
Table();
表(向量< ElementSet>&);

const vector< ElementSet>& getRowsCols();
void setRowsCols(vector< ElementSet>);
};


class DataBase
{
private:
int id;
vector< Table>指针;

public:
DataBase();
DataBase(int,vector< Table>&);

const int& getId();
void setId(int);

const vector< Table> getPointer()
void setPointer(vector< Table>);

};

类实用程序
{
public:
Utilities();
static bitset< 500> getDecToBin(unsigned long long int);
};

我的问题摘要




  • 为什么根据表格式数据的加载不同

  • 为什么访问信息也取决于表(而且性能与表格数据加载的方式相反?



非常感谢您的帮助!!! )

解决方案

我看到的一件事可以解释你的问题是,你正在做很多分配,是临时的。例如,在加载您:




  • 为每行分配一个临时字符串


  • 将此行复制到临时ElementSet

  • 将它复制到RowSet

  • 复制RowSet to a Table

  • 将表复制到TableDescriptor

  • 将TableDescriptor复制到数据库



据我所知,这些副本都是对象的完整副本。如果你只有几百条或1000条记录可能是好的,但在你的情况下,你有1000万条记录,所以副本将是耗时的。



您的加载时间可能会因加载循环中每行和每列的分配数量不同而有所不同。内存碎片也可能在某一点上起作用(当处理大量小分配时,默认存储器处理器有时需要很长时间来分配新的存储器)。即使您删除了所有不必要的分配,我仍然希望100列的情况比10万的情况稍慢,因为您是如何加载和解析的行。



您的信息访问时间可能不同,因为您正在 selectedElementSet 中创建一行的完整副本。当您有100列时,这将很快,但是当您有100,000列时,会很慢。



有关改进代码的几个具体建议:




  • 减少分配和复制的数量。理想情况是对读取文件进行一次分配,然后在存储时对每个记录进行另一次分配。

  • 如果您要将数据存储在数据库中,那么请从头开始。

  • 如果可能,请使用对数据的引用,而不是实际的副本。

  • 当进行性能分析时,确保您在运行程序的新实例时有时间。



如果您在同一个实例中测试这两种情况,那么内存使用和碎片可能会产生重大影响。

修改:代码建议
要在搜索循环中提高速度,请尝试以下操作:

 code> for(int z = 0; z< columnsToBeQueried.size(); z ++)
{
int i;
i = columnsToBeQueried.at(z);
表& selectedTable = database.getPointer()。at(0);

if(h == 1)
{
ElementSet& selectedElementSet = selectedTable.getRowsCols()。at(n);
shownVector.push_back(selectedElementSet.getElements()。at(i));
}
else if(h == 2)
{
ElementSet& selectedElementSet = selectedTable.getRowsCols()。at(i);
shownVector.push_back(selectedElementSet.getElements()。at(n));
}

n = n + 1;
t ++;
}



我刚更改了selectedElementSet以使用引用应该完全消除行复制,理论上它应该对性能有显着的影响。



编辑

>

你问你在哪里制作副本。您的原始代码中的以下行:

  ElementSet selectedElementSet; 
selectedElementSet = selectedTable.getRowsCols()。at(n);

创建向量的副本< string&元素中的成员 ElementSet 。在100,000列的情况下,这将是一个包含100,000个字符串的向量,所以复制将是相对昂贵的时间。因为你实际上不需要创建一个新的副本,改变 selectedElementSet 作为参考,就像我上面的示例代码,将消除这个副本。


Some days ago I made you a question and I got some really useful answers. I will make a summary to those of you who didn't read and I will explain my new doubts and where I have problems now.

Explanation

I have been working on a program, simulating a small database, that first of all read information from txt files and store them in the computer memory and then, I can make queries taking normal tables and/or transposed tables. The problem is that the performance is not good enough yet. It works slower than what I expect. I have improved it but I think I should improve it more. I have specific points where my program doesn't have a good performance.

Current problem

The first problem that I have now (where my program is slower) is that I spend more time to, for example table with 100,000 columns & 100 rows (0.325 min, I've improved this thanks to your help) than 100,000 rows & 100 columns (1.61198 min, the same than before). But on the other hand, access time to some data is better in the second case (in a determined example, 47 seconds vs. 6079 seconds in the first case) any idea why??

Explanation

Now let me remind you how my code works (with an atached summary of my code)

First of all I have a .txt file simulating a database table with random strings separated with "|". Here you have an example of table (with 7 rows and 5 columns). I also have the transposed table

NormalTable.txt

42sKuG^uM|24465\lHXP|2996fQo\kN|293cvByiV|14772cjZ`SN|
28704HxDYjzC|6869xXj\nIe|27530EymcTU|9041ByZM]I|24371fZKbNk|
24085cLKeIW|16945TuuU\Nc|16542M[Uz\|13978qMdbyF|6271ait^h|
13291_rBZS|4032aFqa|13967r^\\`T|27754k]dOTdh|24947]v_uzg|
1656nn_FQf|4042OAegZq|24022nIGz|4735Syi]\|18128klBfynQ|
6618t\SjC|20601S\EEp|11009FqZN|20486rYVPR|7449SqGC|
14799yNvcl|23623MTetGw|6192n]YU\Qe|20329QzNZO_|23845byiP|

TransposedTable.txt (This is new from the previous post)

42sKuG^uM|28704HxDYjzC|24085cLKeIW|13291_rBZS|1656nn_FQf|6618t\SjC|14799yNvcl|
24465\lHXP|6869xXj\nIe|16945TuuU\Nc|4032aFqa|4042OAegZq|20601S\EEp|23623MTetGw|
2996fQo\kN|27530EymcTU|16542M[Uz\|13967r^\\`T|24022nIGz|11009FqZN|6192n]YU\Qe|
293cvByiV|9041ByZM]I|13978qMdbyF|27754k]dOTdh|4735Syi]\|20486rYVPR|20329QzNZO_|
14772cjZ`SN|24371fZKbNk|6271ait^h|24947]v_uzg|18128klBfynQ|7449SqGC|23845byiP|

Explanation

This information in a .txt file is read by my program and stored in the computer memory. Then, when making queries, I will access to this information stored in the computer memory. Loading the data in the computer memory can be a slow process, but accessing to the data later will be faster, what really matters me.

Here you have the part of the code that read this information from a file and store in the computer.

Code that reads data from the Table.txt file and store it in the computer memory

int h;
do
{
    cout<< "Do you want to query the normal table or the transposed table? (1- Normal table/ 2- Transposed table):" ;
    cin>>h; 
}while(h!=1 && h!=2);

string ruta_base("C:\\Users\\Raul Velez\\Desktop\\Tables\\");
if(h==1)
{
    ruta_base +="NormalTable.txt"; // Folder where my "Table.txt" is found
}

if(h==2)
{
    ruta_base +="TransposedTable.txt";
}

string temp; // Variable where every row from the Table.txt file will be firstly stored
vector<string> buffer; // Variable where every different row will be stored after separating the different elements by tokens.
vector<ElementSet> RowsCols; // Variable with a class that I have created, that simulated a vector and every vector element is a row of my table

ifstream ifs(ruta_base.c_str());
while(getline( ifs, temp )) // We will read and store line per line until the end of the ".txt" file. 
{
    size_t tokenPosition = temp.find("|"); // When we find the simbol "|" we will identify different element. So we separate the string temp into tokens that will be stored in vector<string> buffer
    // --- NEW PART ------------------------------------
    const char* p = temp.c_str();
    char* p1 = strdup(p);

    char* pch = strtok(p1, "|");
    while(pch)
    {
            buffer.push_back(string(pch));
            pch = strtok(NULL,"|");
    }
    free(p1);

    ElementSet sss(0,buffer);
    buffer.clear();
    RowsCols.push_back(sss); // We store all the elements of every row (stores as vector<string> buffer) in a different position in "RowsCols" 
    // --- NEW PART END ------------------------------------
}

Table TablesStorage(RowsCols); // After every loop we will store the information about every .txt file in the vector<Table> TablesDescriptor
vector<Table> TablesDescriptor;
TablesDescriptor.push_back(TablesStorage); // In the vector<Table> TablesDescriptor will be stores all the different tables with all its information

DataBase database(1, TablesDescriptor);

Information already given in the previous post

After this, comes the access to the information part. Let's suppose that I want to make a query, and I ask for input. Let's say that my query is row "n", and also the consecutive tuples "numTuples", and the columns "y". (We must say that the number of columns is defined by a decimal number "y", that will be transformed into binary and will show us the columns to be queried, for example, if I ask for columns 54 (00110110 in binary) I will ask for columns 2, 3, 5 and 6). Then I access to the computer memory to the required information and store it in a vector shownVector. Here I show you the part of this code.

Problem In the loop if(h == 2) where data from the transposed tables are accessed, performance is poorer ¿why?

Code that access to the required information upon my input

int n, numTuples; 
unsigned long long int y;

cout<< "Write the ID of the row you want to get more information: " ;
cin>>n; // We get the row to be represented -> "n"

cout<< "Write the number of followed tuples to be queried: " ;
cin>>numTuples; // We get the number of followed tuples to be queried-> "numTuples"

cout<<"Write the ID of the 'columns' you want to get more information: ";
cin>>y; // We get the "columns" to be represented ' "y"

unsigned int r; // Auxiliar variable for the columns path
int t=0; // Auxiliar variable for the tuples path
int idTable;

vector<int> columnsToBeQueried; // Here we will store the columns to be queried get from the bitset<500> binarynumber, after comparing with a mask
vector<string> shownVector; // Vector to store the final information from the query
bitset<5000> mask;
mask=0x1;

clock_t t1, t2;
t1=clock(); // Start of the query time

bitset<5000> binaryNumber = Utilities().getDecToBin(y); // We get the columns -> change number from decimal to binary. Max number of columns: 5000

// We see which columns will be queried
for(r=0;r<binaryNumber.size();r++) //
{               
    if(binaryNumber.test(r) & mask.test(r))  // if both of them are bit "1"
    {
        columnsToBeQueried.push_back(r);
    }
    mask=mask<<1;   
}

do
{
    for(int z=0;z<columnsToBeQueried.size();z++)
    {
        ElementSet selectedElementSet;
        int i;
        i=columnsToBeQueried.at(z);
        Table& selectedTable = database.getPointer().at(0); // It simmulates a vector with pointers to different tables that compose the database, but our example database only have one table, so don't worry ElementSet selectedElementSet;
        if(h == 1)
        {

            selectedElementSet=selectedTable.getRowsCols().at(n);
            shownVector.push_back(selectedElementSet.getElements().at(i)); // We save in the vector shownVector the element "i" of the row "n"
        }

        if(h == 2)  
        {
            selectedElementSet=selectedTable.getRowsCols().at(i);
            shownVector.push_back(selectedElementSet.getElements().at(n)); // We save in the vector shownVector the element "n" of the row "i"
        }
        n=n+1;
        t++;            
    }
}while(t<numTuples);

t2=clock(); // End of the query time
showVector().finalVector(shownVector);
float diff ((float)t2-(float)t1);
float microseconds = diff / CLOCKS_PER_SEC*1000000;
cout<<"Time: "<<microseconds<<endl;

Class definitions

Here I attached some of the class definitions so that you can compile the code, and understand better how it works:

class ElementSet
{
private:
    int id;
    vector<string> elements; 

public:
    ElementSet(); 
    ElementSet(int, vector<string>&); 

    const int& getId();
    void setId(int);

    const vector<string>& getElements();
    void setElements(vector<string>);

};

class Table
{
private:
    vector<ElementSet> RowsCols; 

public:
    Table(); 
    Table(vector<ElementSet>&); 

    const vector<ElementSet>& getRowsCols();
    void setRowsCols(vector<ElementSet>);
};


class DataBase
{
     private:
        int id;
        vector<Table> pointer; 

     public:
        DataBase(); 
        DataBase(int, vector<Table>&); 

    const int& getId();
    void setId(int);

    const vector<Table>& getPointer();
    void setPointer(vector<Table>);

    };

class Utilities
{
        public:
        Utilities();
        static bitset<500> getDecToBin(unsigned long long int);
};

Summary of my problems

  • Why the load of the data is different depending on the table format???
  • Why the access to the information also depends on the table (and the performance is in the opposite way than the table data load?

Thank you very much for all your help!!! :)

解决方案

One thing I see that may explain both your problems is that you are doing many allocations, a lot of which appear to be temporary. For example, in your loading you:

  • Allocate a temporary string per row
  • Allocate a temporary string per column
  • Copy the row to a temporary ElementSet
  • Copy that to a RowSet
  • Copy the RowSet to a Table
  • Copy the Table to a TableDescriptor
  • Copy the TableDescriptor to a Database

As far as I can tell, each of these copies is a complete new copy of the object. If you only had a few 100 or 1000 records that might be fine but in your case you have 10 million records so the copies will be time consuming.

Your loading times may differ due to the number of allocations done in the loading loop per row and per column. Memory fragmentation may also contribute at some point (when dealing with a large number of small allocations the default memory handler sometimes takes a long time to allocate new memory). Even if you removed all your unnecessary allocations I would still expect the 100 column case to be slightly slower than the 100,000 case due to how your are loading and parsing by line.

Your information access times may be different as you are creating a full copy of a row in selectedElementSet. When you have 100 columns this will be fast but when you have 100,000 columns it will be slow.

A few specific suggestions to improving your code:

  • Reduce the number of allocations and copies you make. The ideal case would be to make one allocation for reading the file and then another allocation per record when stored.
  • If you're going to store the data in a Database then put it there from the beginning. Don't make half-a-dozen complete copies of your data to go from a temporary object to the Database.
  • Make use of references to the data instead of actual copies when possible.
  • When profiling make sure you get times when running a new instance of the program. Memory use and fragmentation may have a significant impact if you test both cases in the same instance and the order in which you do the tests will matter.

Edit: Code Suggestion To hopefully improve your speed in the search loop try something like:

for(int z=0;z<columnsToBeQueried.size();z++)
    {
        int i;
        i=columnsToBeQueried.at(z);
        Table& selectedTable = database.getPointer().at(0);

        if(h == 1)
        {
            ElementSet& selectedElementSet = selectedTable.getRowsCols().at(n);
            shownVector.push_back(selectedElementSet.getElements().at(i));
        }
        else if(h == 2)  
        {
            ElementSet& selectedElementSet = selectedTable.getRowsCols().at(i);
            shownVector.push_back(selectedElementSet.getElements().at(n));
        }

        n=n+1;
        t++;            
    }

I've just changed the selectedElementSet to use a reference which should complete eliminate the row copies taking place and, in theory, it should have a noticeable impact in performance. For even more performance gain you can change shownVector to be a reference/pointer to avoid yet another copy.

Edit: Answer Comment

You asked where you were making copies. The following lines in your original code:

ElementSet selectedElementSet;
selectedElementSet = selectedTable.getRowsCols().at(n);

creates a copy of the vector<string> elements member in ElementSet. In the 100,000 column case this will be a vector containing 100,000 strings so the copy will be relatively expensive time wise. Since you don't actually need to create a new copy changing selectedElementSet to be a reference, like in my example code above, will eliminate this copy.

这篇关于优化我的模拟数据库的代码(2)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆