How to make fast the import of an excel file containing more than 5000 lines into sqlite database with django

Question


Import xls file (more than 5000 lines) into my sqlite database takes so long.

import datetime as dt

import openpyxl
import pytz
from dateutil.parser import parse  # presumably dateutil's parser, given how parse() is used below

def importeradsl(request):
    if "GET" == request.method:
        pass  # GET branch omitted in the question
    else:
        excel_file = request.FILES["excel_file"]
        #you may put validations here to check extension or file size
        wb = openpyxl.load_workbook(excel_file)
        #getting a particular sheet by name out of many sheets
        worksheet = wb["Sheet 1"]
        #iterating over the rows and getting value from each cell in row
        for row in worksheet.iter_rows(min_row=2):
            row_data = list()
            for cell in row:
                row_data.append(str(cell.value))
            #Get content fields DerangementCuivre models
            #Client
            nd = row_data[0]
            nom_client = row_data[3]
            nd_contact = row_data[4]
            #Categorie
            code_categorie = row_data[6]
            acces_reseau = row_data[8]
            etat = row_data[9]
            origine = row_data[10]
            code_sig = row_data[11]
            agent_sig = row_data[13]
            date_sig = dt.datetime.strftime(parse(row_data[14]), '%Y-%m-%d %H:%M:%S')
            date_essai = dt.datetime.strftime(parse(row_data[15]), '%Y-%m-%d %H:%M:%S')
            agent_essai = row_data[18]
            try:
                date_ori = dt.datetime.strptime(row_data[19], '%Y-%m-%d %H:%M:%S')
            except ValueError as e:
                print ("Vous", e)
            else:
                date_ori = dt.datetime.strftime(parse(row_data[19]), '%Y-%m-%d %H:%M:%S')
            agent_ori = row_data[20]
            code_ui = row_data[21]
            equipe = row_data[22]
            sous_traitant = row_data[23]
            date_pla = dt.datetime.strftime(parse(row_data[24]), '%Y-%m-%d %H:%M:%S')
            date_rel = dt.datetime.strftime(parse(row_data[25]), '%Y-%m-%d %H:%M:%S')
            date_releve = dt.datetime.strptime(row_data[25], '%Y-%m-%d %H:%M:%S')
            date_essais = dt.datetime.strptime(row_data[15], '%Y-%m-%d %H:%M:%S')
            pst = pytz.timezone('Africa/Dakar')
            date_releve = pst.localize(date_releve)
            utc = pytz.UTC
            date_releve = date_releve.astimezone(utc)
            date_essais = pst.localize(date_essais)
            date_essais = date_essais.astimezone(utc)
            code_rel = row_data[26]
            localisation = row_data[27]
            cause = row_data[28]
            commentaire = row_data[29]
            agent_releve = row_data[30]
            centre_racc = row_data[32]
            rep = row_data[33]
            srp = row_data[34]
            delai = (date_releve - date_essais).total_seconds()
            dali = divmod(delai, 86400)[0]
            semaine = date_releve.isocalendar()[1]
            mois = date_releve.month
            annee = date_releve.year
            if dali > 7:
                etats = "PEX PLUS"
            else:
                etats = "PEX"
            #Enregistrer un client
            Client(nd=nd, nom=nom_client, mobile=nd_contact).save()
            #Enregistrer la categorie
            #Code pour nom categorie - renseigner plus tard
            Categorie(code_categorie=code_categorie, nom="Public").save()
            #Enregistrer agent de signalisation
            AgentSig(matricule=agent_sig, nom="Awa").save()
            #Enregistrer agent d'essai
            AgentEssai(matricule=agent_essai).save()
            #Enregister agent d'orientation
            AgentOri(matricule=agent_ori).save()
            #Enregistrer agent de relève
            AgentRel(matricule=agent_releve).save()
            #Enregistrer le sous-traitant
            SousTraitant(nom=sous_traitant).save()
            #Enregistrer le centre
            Centre(code=centre_racc).save()
            #Enregistrer ui
            UniteIntervention(code_ui=code_ui, 
            sous_traitant=SousTraitant.objects.get(nom=sous_traitant)).save()
            #Enregistrer le repartiteur
            Repartiteur(code=rep, crac=Centre.objects.get(code=centre_racc)).save()
            #Enregistrer team
            Equipe(nom=equipe, unite=UniteIntervention.objects.get(code_ui=code_ui)).save()
            #Enregistrer le SR
            SousRepartiteur(code=srp, rep=Repartiteur.objects.get(code=rep)).save()
            #Enregistrer le drangement
            DerangementAdsl(acces_reseau=acces_reseau,
                            nd_client=Client.objects.get(nd=nd),
                            categorie=Categorie(code_categorie=code_categorie),
                            etat=etat,
                            origine=origine,
                            code_sig=code_sig,
                            agent_sig=AgentSig.objects.get(matricule=agent_sig),
                            date_sig=date_sig,
                            date_essai=date_essai,
                            agent_essai=AgentEssai.objects.get(matricule=agent_essai),
                            date_ori=date_ori,
                            agent_ori=AgentOri.objects.get(matricule=agent_ori),
                            sous_traitant=SousTraitant.objects.get(nom=sous_traitant),
                            unite_int = UniteIntervention.objects.get(code_ui=code_ui),
                            date_pla=date_pla,
                            date_rel=date_rel,
                            code_rel=code_rel,
                            code_local=localisation,
                            cause=cause,
                            comment_cause=commentaire,
                            agent_rel=AgentRel.objects.get(matricule=agent_releve),
                            centre=Centre.objects.get(code=centre_racc),
                            rep=Repartiteur.objects.get(code=rep),
                            srep=SousRepartiteur.objects.get(code=srp),
                            delai=dali,
                            etat_vr=etats,
                            semaine=semaine,
                            mois=mois,
                            annee=annee).save()

Solution

There are a few things that are incorrect. I propose to you the following approach:

  1. Make your code more readable
  2. Remove useless queries
  3. Avoid related records duplication
  4. Cache out your related instances.
  5. Use bulk_create

Looking at your code, a rough estimate gives over 30 SQL queries per row, which is a bit much...

1. Make your code more readable.

Your parsing logic can be DRYed, a lot.

First, identify what you do with your data. From my point of view, 2 main functions:

Do nothing:

def no_transformation(value):
    return str(value)

Parse dates

def strptime(value):
    """
    I can't really tell what your 'parse' function does, I let it be, but it might
    be interesting to add your logic in here
    """
    return dt.datetime.strftime(parse(str(value)), '%Y-%m-%d %H:%M:%S')

Now, you can declare your parser configuration:

PARSER_CONFIG=(
    #(column_index, variable_name, transformation_function)
    (0,'nd',no_transformation),
    (10,'origine',no_transformation),
    (11,'code_sig',no_transformation),
    (13,'agent_sig',no_transformation),
    (14,'date_sig',strptime),
    (15,'date_essai',strptime),
    (18,'agent_essai',no_transformation),
    (19,'date_ori',strptime),
    (20,'agent_ori',no_transformation),
    (21,'code_ui',no_transformation),
    (22,'equipe',no_transformation),
    (23,'sous_traitant',no_transformation),
    (24,'date_pla',strptime),
    (25,'date_rel',strptime),
    (26,'code_rel',no_transformation),
    (27,'localisation',no_transformation),
    (28,'cause',no_transformation),
    (29,'commentaire',no_transformation),
    (3,'nom_client',no_transformation),
    (30,'agent_releve',no_transformation),
    (32,'centre_racc',no_transformation),
    (33,'rep',no_transformation),
    (34,'srp',no_transformation),
    (4,'nd_contact',no_transformation),
    (6,'code_categorie',no_transformation),
    (8,'acces_reseau',no_transformation),
    (9,'etat',no_transformation),
    (15,'date_essais',strptime),
    (19,'date_ori',strptime),
    (25,'date_releve',strptime),
)

Now, you know how to parse your data, and how to name it. Let's just put that stuff into a dict.

def parse(row):
    """Transform a row into a dict
    
    Args:
        row (tuple): Your row's data
    
    Returns:
        dict: Your parsed data, named into a dict.
    """
    return {
        key: transform(row[index]) for index, key, transform in PARSER_CONFIG
    }

From here, your parser is way more readable, you know exactly what you're doing with your data.
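To make the pattern concrete, here is a self-contained toy version you can run as-is (the three columns and the sample row below are made up for illustration, not the real sheet's layout):

```python
# Toy demonstration of the PARSER_CONFIG pattern with made-up columns;
# the real config maps the spreadsheet's 30+ columns the same way.
import datetime as dt

def no_transformation(value):
    return str(value)

def strptime(value):
    # Simplified: parse an ISO-like string straight into a datetime.
    return dt.datetime.strptime(str(value), '%Y-%m-%d %H:%M:%S')

TOY_PARSER_CONFIG = (
    #(column_index, variable_name, transformation_function)
    (0, 'nd', no_transformation),
    (1, 'nom_client', no_transformation),
    (2, 'date_sig', strptime),
)

def parse(row):
    """Turn one worksheet row (a tuple) into a named dict."""
    return {key: transform(row[index]) for index, key, transform in TOY_PARSER_CONFIG}

row = ('0221234567', 'Fatou', '2019-05-01 10:30:00')
item = parse(row)
print(item['nom_client'])     # Fatou
print(item['date_sig'].year)  # 2019
```

Adding a column is now one line in the config instead of another ad-hoc `row_data[n]` assignment.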

Wrapping it all up together, you should get:

PARSER_CONFIG=(
    #(column_index, variable_name, transformation_function)
    #...
)
def no_transformation(value):
    return str(value)

def strptime(value):
    return dt.datetime.strftime(parse(str(value)), '%Y-%m-%d %H:%M:%S')

def parse(row):
    """Transform a row into a dict
    
    Args:
        row (tuple): Your row's data
    
    Returns:
        dict: Your parsed data, named into a dict.
    """
    return {
        key: transform(row[index]) for index, key, transform in PARSER_CONFIG
    }

for row in rows:
    item = parse(row) #< Your data, without related instances yet....

There is still some work left to create your related instances, but we'll get there eventually.

2. Removing useless queries.

You do:

#...First, you create a record
Client(nd=nd, nom=nom_client, mobile=nd_contact).save()
#...Then you fetch it back when saving DerangementAdsl
nd_client=Client.objects.get(nd=nd)

While a more pythonic way of doing this would be:

#...You create, save, and keep your instance.
client = Client(nd=item.get('nd'),
                nom=item.get('nom_client'),
                mobile=item.get('nd_contact'))
client.save()  # note: save() returns None, so don't assign its result
#...
nd_client=client

You just saved one SQL query per row! Apply the same logic to each model, and you'll save around 20 queries per row!

categorie=Categorie.objects.create(code_categorie=item.get('code_categorie'), nom="Public")
#Enregistrer agent de signalisation
agent_sig=AgentSig.objects.create(matricule=item.get('agent_sig'), nom="Awa")
#Enregistrer agent d'essai
agent_essai=AgentEssai.objects.create(matricule=item.get('agent_essai'))
#Enregister agent d'orientation
agent_ori=AgentOri.objects.create(matricule=item.get('agent_ori'))
#Enregistrer agent de relève
agent_rel=AgentRel.objects.create(matricule=item.get('agent_releve'))
#Enregistrer le sous-traitant
sous_traitant=SousTraitant.objects.create(nom=item.get('sous_traitant'))
#Enregistrer le centre
centre=Centre.objects.create(code=item.get('centre_racc'))
#Enregistrer ui
unite_int=UniteIntervention.objects.create(code_ui=item.get('code_ui'), sous_traitant=sous_traitant) # < You save one extra query with sous_traitant
#Enregistrer le repartiteur
rep=Repartiteur.objects.create(code=item.get('rep'), crac=centre) # < You save one extra query with centre
#Enregistrer team
equipe=Equipe.objects.create(nom=item.get('equipe'), unite=unite_int) # < You save one extra query with unite_int
#Enregistrer le SR
srep=SousRepartiteur.objects.create(code=item.get('srp'), rep=rep) # < You save one extra query with rep

3. Avoid related records duplication

Now there is one big issue:

Considering you have multiple rows for each client, you'll eventually find yourself with many duplicates, and you do not want that. Instead of using create, you should go with get_or_create.

Please note it returns a tuple: (instance, created). So your code should go like:

categorie, categorie_created=Categorie.objects.get_or_create(code_categorie=item.get('code_categorie'), nom="Public")
agent_sig, agent_sig_created=AgentSig.objects.get_or_create(matricule=item.get('agent_sig'), nom="Awa")
agent_essai, agent_essai_created=AgentEssai.objects.get_or_create(matricule=item.get('agent_essai'))
agent_ori, agent_ori_created=AgentOri.objects.get_or_create(matricule=item.get('agent_ori'))
agent_rel, agent_rel_created=AgentRel.objects.get_or_create(matricule=item.get('agent_releve'))
sous_traitant, sous_traitant_created=SousTraitant.objects.get_or_create(nom=item.get('sous_traitant'))
centre, centre_created=Centre.objects.get_or_create(code=item.get('centre_racc'))
unite_int, unite_int_created=UniteIntervention.objects.get_or_create(code_ui=item.get('code_ui'), sous_traitant=sous_traitant)
rep, rep_created=Repartiteur.objects.get_or_create(code=item.get('rep'), crac=centre)
equipe, equipe_created=Equipe.objects.get_or_create(nom=item.get('equipe'), unite=unite_int)
srep, srep_created=SousRepartiteur.objects.get_or_create(code=item.get('srp'), rep=rep)

Tadaaaaam, you'll create "only" the records that are necessary for your related objects.

4. Caching out your related objects.

As in the previous topic, I assume you have multiple rows for each related instance, and for each row you would still fetch it from your DB.

I guess it's OK if you're using SQLite in memory (it won't be as slow as with other DBs), but it will still be a bottleneck. You could use an approach like:

MODEL_CACHE = {}
def get_related_instance(model, **kwargs):
    key = (model, tuple(sorted(kwargs.items())))  # kwargs is a dict, so freeze it into something hashable
    if key in MODEL_CACHE:
        return MODEL_CACHE[key]
    else:
        instance, created = model.objects.get_or_create(**kwargs)
        MODEL_CACHE[key] = instance
    return instance

# Instead of the previous lines, you now end up with:
categorie = get_related_instance(Categorie, code_categorie=item.get('code_categorie'), nom="Public")
agent_sig = get_related_instance(AgentSig, matricule=item.get('agent_sig'), nom="Awa")
agent_essai = get_related_instance(AgentEssai, matricule=item.get('agent_essai'))
agent_ori = get_related_instance(AgentOri, matricule=item.get('agent_ori'))
agent_rel = get_related_instance(AgentRel, matricule=item.get('agent_releve'))
sous_traitant = get_related_instance(SousTraitant, nom=item.get('sous_traitant'))
centre = get_related_instance(Centre, code=item.get('centre_racc'))
unite_int = get_related_instance(UniteIntervention, code_ui=item.get('code_ui'), sous_traitant=sous_traitant)
rep = get_related_instance(Repartiteur, code=item.get('rep'), crac=centre)
equipe = get_related_instance(Equipe, nom=item.get('equipe'), unite=unite_int)
srep = get_related_instance(SousRepartiteur, code=item.get('srp'), rep=rep)

I cannot tell how much you'll gain from that, it really depends on the data set you're trying to import, but from experience, it's quite drastic!
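The gain is easy to see in a stand-alone sketch. The fake get_or_create below just counts calls so the snippet runs without Django; with Django, each cache miss would be a real query:

```python
# Stand-alone sketch of the caching idea. The key point: kwargs is a dict,
# which is unhashable, so we freeze it into a sorted tuple to use as a key.
MODEL_CACHE = {}
DB_CALLS = 0

def fake_get_or_create(model, **kwargs):
    """Stand-in for model.objects.get_or_create(**kwargs); counts 'DB' hits."""
    global DB_CALLS
    DB_CALLS += 1
    return (model, tuple(sorted(kwargs.items()))), True  # (instance, created)

def get_related_instance(model, **kwargs):
    key = (model, tuple(sorted(kwargs.items())))
    if key not in MODEL_CACHE:
        instance, created = fake_get_or_create(model, **kwargs)
        MODEL_CACHE[key] = instance
    return MODEL_CACHE[key]

# Three identical lookups -> only one "DB" call.
for _ in range(3):
    get_related_instance('Centre', code='DKR1')
print(DB_CALLS)  # 1
```

With 5000 rows sharing a handful of centres and agents, most rows become pure dictionary lookups.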

5. Use bulk_create

You are doing:

for row in rows:
    DerangementAdsl(...your data...).save() #<That's one DB call

That's one SQL query per row, while you could do:

ITEMS = []
for row in rows:
    #...Your parsing we saw previously...
    ITEMS.append(DerangementAdsl(**item))
DerangementAdsl.objects.bulk_create(ITEMS) #<That's one DB call
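One caveat: SQLite caps the number of bound variables per statement, so for very large imports you can pass batch_size to bulk_create (a real Django parameter). Conceptually the batching is just list chunking; the helper below is my illustration, not Django code:

```python
# How batching works conceptually: split the items into fixed-size chunks.
# Django's bulk_create(objs, batch_size=N) does this for you internally.
def chunked(items, size):
    for start in range(0, len(items), size):
        yield items[start:start + size]

ITEMS = list(range(5000))  # stand-ins for unsaved model instances
batches = list(chunked(ITEMS, 999))  # e.g. 999 to stay under SQLite's variable limit
print(len(batches))      # 6
print(len(batches[-1]))  # 5
```

So even a 5000-row import stays at a handful of INSERT statements instead of 5000.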

Putting it all together!

PARSER_CONFIG=(
    #(column_index, variable_name, transformation_function)
    #...
)
def no_transformation(value):
    return str(value)

def strptime(value):
    return dt.datetime.strftime(parse(str(value)), '%Y-%m-%d %H:%M:%S')

MODEL_CACHE = {}

def get_related_instance(model, **kwargs):
    key = (model, tuple(sorted(kwargs.items())))  # freeze kwargs into a hashable key
    if key in MODEL_CACHE:
        return MODEL_CACHE[key]
    else:
        instance, created = model.objects.get_or_create(**kwargs)
        MODEL_CACHE[key] = instance
    return instance

def parse(row):
    """Transform a row into a dict
    
    Args:
        row (tuple): Your row's data
    
    Returns:
        dict: Your parsed data, named into a dict.
    """
    item = {
        key: transform(row[index]) for index, key, transform in PARSER_CONFIG
    }
    # Create the instances that other instances depend on first,
    # so they can be passed along below.
    sous_traitant = get_related_instance(SousTraitant, nom=item.get('sous_traitant'))
    centre = get_related_instance(Centre, code=item.get('centre_racc'))
    unite_int = get_related_instance(UniteIntervention, code_ui=item.get('code_ui'), sous_traitant=sous_traitant)
    rep = get_related_instance(Repartiteur, code=item.get('rep'), crac=centre)
    item.update({
        'categorie': get_related_instance(Categorie, code_categorie=item.get('code_categorie'), nom="Public"),
        'agent_sig': get_related_instance(AgentSig, matricule=item.get('agent_sig'), nom="Awa"),
        'agent_essai': get_related_instance(AgentEssai, matricule=item.get('agent_essai')),
        'agent_ori': get_related_instance(AgentOri, matricule=item.get('agent_ori')),
        'agent_rel': get_related_instance(AgentRel, matricule=item.get('agent_releve')),
        'sous_traitant': sous_traitant,
        'centre': centre,
        'unite_int': unite_int,
        'rep': rep,
        'equipe': get_related_instance(Equipe, nom=item.get('equipe'), unite=unite_int),
        'srep': get_related_instance(SousRepartiteur, code=item.get('srp'), rep=rep),
    })
    return item

def importeradsl(request):
    #I skip your conditions for readability
    ITEMS = []
    for row in worksheet.iter_rows(min_row=2):
        ITEMS.append(DerangementAdsl(**parse(row)))
    
    DerangementAdsl.objects.bulk_create(ITEMS)

Conclusion

Following those recommendations, you should end up with an optimized script that will run way faster than the original one, and be way more readable and pythonic.

Roughly, depending on your dataset, 5k lines should run in somewhere between 10 seconds and a few minutes.

If each row's related instances (client, categorie...) are not unique, I'd use a more sophisticated approach, looping multiple times over your dataset to create the related models using bulk_create and caching them out, like:

CLIENTS = []
for row in rows:
    CLIENTS.append(Client(**client_parser(row)))
clients = Client.objects.bulk_create(CLIENTS) # You create *all* your clients with only one DB call!

Then you cache all the created clients. You do the same for all your related models, and eventually you'll load your data making only a dozen DB calls. But it really depends on your business logic here: it should be engineered to handle duplicated records too.
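That multi-pass idea can be sketched with plain objects standing in for the Client model (the class, keys, and sample rows here are hypothetical, just to show the shape):

```python
# Multi-pass sketch: create all related records in one pass, then index them
# by natural key so the main pass never touches the DB for lookups.
from dataclasses import dataclass

@dataclass
class FakeClient:  # stands in for the Django Client model
    nd: str
    nom: str

rows = [('001', 'Awa'), ('002', 'Moussa'), ('001', 'Awa')]  # note the duplicate

# Pass 1: dedupe by natural key and "bulk_create" the clients
# (with Django this would be one Client.objects.bulk_create call).
clients_by_nd = {nd: FakeClient(nd, nom) for nd, nom in rows}

# Pass 2: resolve each row's client from the cache, zero per-row queries.
resolved = [clients_by_nd[nd] for nd, _ in rows]
print(len(clients_by_nd))  # 2
print(resolved[2].nom)     # Awa
```

Rows sharing a client resolve to the same cached instance, which is exactly what the foreign keys need.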
