Find top x by count from MySQL in Python?

Problem description

I have a csv file like this:

nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@gmail.com, 01-05-2014

I am reading the above CSV file and extracting the domain name from each address, along with the number of email addresses per domain name and date. I need to insert all of this into a MySQL table called domains, which I can already do successfully.

Problem statement: Now I need to use the same table to report the top 50 domains by count, sorted by percentage growth over the last 30 days compared to the total. I can't work out how to do this.

Below is the code that successfully inserts into the MySQL database. It does not do the reporting task above, because I don't understand how to achieve it.

#!/usr/bin/python
import csv
import os
import MySQLdb

from collections import defaultdict, Counter

# domain -> Counter mapping each date string to the number of addresses seen
domain_counts = defaultdict(Counter)

# ======================== Defined Functions ======================
def get_file_path(filename):
    # build the path relative to the current working directory
    currentdirpath = os.getcwd()
    filepath = os.path.join(currentdirpath, filename)
    return filepath
# ===========================================================
def read_CSV(filepath):

    # open the path that was passed in rather than a hard-coded file name
    with open(filepath) as f:
        reader = csv.reader(f)
        for row in reader:
            # strip both fields: the sample rows have a space after the comma
            domain_counts[row[0].split('@')[1].strip()][row[1].strip()] += 1

    db = MySQLdb.connect(host="localhost", # your host, usually localhost
                         user="root", # your username
                         passwd="abcdef1234", # your password
                         db="test") # name of the data base
    cur = db.cursor()

    # %% is a literal % here, because MySQLdb uses %s for its own placeholders
    q = """INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE(%s, '%%d-%%m-%%Y'))"""

    # one row per (domain, date) pair; items() works on Python 2 and 3
    for domain, data in domain_counts.items():
        for email_date, email_count in data.items():
            cur.execute(q, (domain, email_count, email_date))

    db.commit()

# ======================= main program =======================================
path = get_file_path('emails.csv')
read_CSV(path) # read the input file

What is the right way to do the reporting task using the domains table?

Update:

Here is my domains table:

mysql> describe domains;
+----------------+-------------+------+-----+---------+----------------+
| Field          | Type        | Null | Key | Default | Extra          |
+----------------+-------------+------+-----+---------+----------------+
| id             | int(11)     | NO   | PRI | NULL    | auto_increment |
| domain_name    | varchar(20) | NO   |     | NULL    |                |
| cnt            | int(11)     | YES  |     | NULL    |                |
| date_of_entry  | date        | NO   |     | NULL    |                |
+----------------+-------------+------+-----+---------+----------------+

And here is the data in it:

mysql> select * from domains;
+----+---------------+-------+------------+
| id | domain_name   | count | date_entry |
+----+---------------+-------+------------+
|  1 | wawa.com      |     2 | 2014-04-30 |
|  2 | wawa.com      |     2 | 2014-05-01 |
|  3 | wawa.com      |     3 | 2014-05-31 |
|  4 | uwaterloo.ca  |     4 | 2014-04-30 |
|  5 | uwaterloo.ca  |     3 | 2014-05-01 |
|  6 | uwaterloo.ca  |     1 | 2014-05-31 |
|  7 | anonymous.com |     2 | 2014-04-30 |
|  8 | anonymous.com |     4 | 2014-05-01 |
|  9 | anonymous.com |     8 | 2014-05-31 |
| 10 | hotmail.com   |     4 | 2014-04-30 |
| 11 | hotmail.com   |     1 | 2014-05-01 |
| 12 | hotmail.com   |     3 | 2014-05-31 |
| 13 | gmail.com     |     6 | 2014-04-30 |
| 14 | gmail.com     |     4 | 2014-05-01 |
| 15 | gmail.com     |     8 | 2014-05-31 |
+----+---------------+-------+------------+

Solution

The report you need can be produced in SQL on the MySQL side; Python can then call the query, fetch the result set, and print the results.

Consider the following aggregate query with a subquery and a derived table, which follows this percentage growth formula:

((this month domain total cnt) - (last month domain total cnt))
 / (last month all domains total cnt)
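To make the formula concrete, here is a small worked example in plain Python (3.6+) on the sample rows from the question. It reads "this month" as the last 30 days and "last month" as everything before that, which is how the query defines them. The reference date 2014-06-15 and the helper name pct_growth are assumptions made for illustration only; they are not part of the original question or answer.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows copied from the question: (domain_name, cnt, date_of_entry).
ROWS = [
    ("wawa.com", 2, date(2014, 4, 30)), ("wawa.com", 2, date(2014, 5, 1)),
    ("wawa.com", 3, date(2014, 5, 31)),
    ("uwaterloo.ca", 4, date(2014, 4, 30)), ("uwaterloo.ca", 3, date(2014, 5, 1)),
    ("uwaterloo.ca", 1, date(2014, 5, 31)),
    ("anonymous.com", 2, date(2014, 4, 30)), ("anonymous.com", 4, date(2014, 5, 1)),
    ("anonymous.com", 8, date(2014, 5, 31)),
    ("hotmail.com", 4, date(2014, 4, 30)), ("hotmail.com", 1, date(2014, 5, 1)),
    ("hotmail.com", 3, date(2014, 5, 31)),
    ("gmail.com", 6, date(2014, 4, 30)), ("gmail.com", 4, date(2014, 5, 1)),
    ("gmail.com", 8, date(2014, 5, 31)),
]


def pct_growth(rows, today, window_days=30):
    """Hypothetical helper: apply the growth formula per domain."""
    cutoff = today - timedelta(days=window_days)
    recent = defaultdict(int)  # per-domain cnt on or after the cutoff
    older = defaultdict(int)   # per-domain cnt before the cutoff
    for domain, cnt, entered in rows:
        if entered >= cutoff:
            recent[domain] += cnt
        else:
            older[domain] += cnt
    older_total = sum(older.values())  # all domains, before the cutoff
    if older_total == 0:
        return {}  # nothing to compare against
    domains = set(recent) | set(older)
    return {d: (recent[d] - older[d]) / older_total for d in domains}


# Assumed reference date; the cutoff is then 2014-05-16.
growth = pct_growth(ROWS, today=date(2014, 6, 15))
for domain, value in sorted(growth.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{domain:15s} {value:+.4f}")
```

With that reference date, only the 2014-05-31 rows fall inside the window. For anonymous.com this gives (8 - 6) / 32 = 0.0625, where 32 is the sum of every domain's cnt before the cutoff.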

SQL

SELECT domain_name, pct_growth
FROM (

    SELECT t1.domain_name,
           # SUM OF SPECIFIC DOMAIN'S CNT BETWEEN TODAY AND 30 DAYS AGO
           (SUM(CASE WHEN t1.date_of_entry >= (CURRENT_DATE - INTERVAL 30 DAY)
                     THEN t1.cnt ELSE 0 END)
            -
            # SUM OF SPECIFIC DOMAIN'S CNT AS OF 30 DAYS AGO
            SUM(CASE WHEN t1.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY)
                     THEN t1.cnt ELSE 0 END)
           ) /
           # SUM OF ALL DOMAINS' CNT AS OF 30 DAYS AGO
           (SELECT SUM(t2.cnt) FROM domains t2
             WHERE t2.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY))
           AS pct_growth

    FROM domains t1
    GROUP BY t1.domain_name
) AS derivedTable

ORDER BY pct_growth DESC
LIMIT 50;

Python

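# db is the MySQLdb connection opened as in the question's code
cur = db.cursor()
sql = "SELECT * FROM ..."  # SEE ABOVE

cur.execute(sql)

# each row is a (domain_name, pct_growth) tuple
for row in cur.fetchall():
    print(row)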
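The snippet above needs a live MySQL server to run. As a self-contained sketch of the same call pattern, the example below swaps in the standard-library sqlite3 module and an in-memory copy of the sample data, so it is not the answer's exact query. The differences are assumptions made for portability: the cutoff date is computed in Python and passed as a parameter instead of using CURRENT_DATE - INTERVAL 30 DAY, the numerator is multiplied by 1.0 because SQLite would otherwise do integer division, and a secondary sort on domain_name makes ties deterministic. The function name top_growth and the reference date are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

# Same shape as the answer's query, with the cutoff passed in as a parameter.
GROWTH_SQL = """
SELECT t1.domain_name,
       (SUM(CASE WHEN t1.date_of_entry >= :cutoff THEN t1.cnt ELSE 0 END)
        - SUM(CASE WHEN t1.date_of_entry < :cutoff THEN t1.cnt ELSE 0 END)) * 1.0
       / (SELECT SUM(t2.cnt) FROM domains t2 WHERE t2.date_of_entry < :cutoff)
       AS pct_growth
FROM domains t1
GROUP BY t1.domain_name
ORDER BY pct_growth DESC, t1.domain_name
LIMIT :limit
"""


def top_growth(conn, today, limit=50, window_days=30):
    """Hypothetical helper: return (domain_name, pct_growth) rows, best first."""
    cutoff = (today - timedelta(days=window_days)).isoformat()
    cur = conn.cursor()
    cur.execute(GROWTH_SQL, {"cutoff": cutoff, "limit": limit})
    return cur.fetchall()


# In-memory stand-in for the domains table; dates are stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE domains (id INTEGER PRIMARY KEY, domain_name TEXT NOT NULL,"
    " cnt INTEGER, date_of_entry TEXT NOT NULL)"
)
counts = {  # cnt on 2014-04-30, 2014-05-01, 2014-05-31 (from the question)
    "wawa.com": (2, 2, 3), "uwaterloo.ca": (4, 3, 1), "anonymous.com": (2, 4, 8),
    "hotmail.com": (4, 1, 3), "gmail.com": (6, 4, 8),
}
days = ("2014-04-30", "2014-05-01", "2014-05-31")
conn.executemany(
    "INSERT INTO domains (domain_name, cnt, date_of_entry) VALUES (?, ?, ?)",
    [(name, cnt, day) for name, cnts in counts.items() for cnt, day in zip(cnts, days)],
)

report = top_growth(conn, today=date(2014, 6, 15))  # assumed reference date
for domain_name, pct in report:
    print(domain_name, pct)
```

With MySQLdb the named placeholders would be written as %(cutoff)s and %(limit)s instead of :cutoff and :limit, since that driver uses the format/pyformat parameter styles. If no rows are older than the cutoff, the subquery returns NULL and every pct_growth comes back as NULL (None in Python), so a caller may want to handle that case.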
