Before starting, make sure the following software and third-party libraries are installed and working:
- Oracle Database 11g client + server
- Python 3.7.7
  Note: if your Python build is 64-bit, the Oracle Database client must be 64-bit as well.
- cx_Oracle 7.3.0
  Note: connecting to Oracle requires cx_Oracle, MySQL requires PyMySQL, and Hadoop (HDFS/Hive) requires PyHive. Whichever database you target, SQLAlchemy's query syntax is almost identical, so the code is highly reusable. Raw SQL, by contrast, has dialect differences: the same statement may fail to run on a different database.
- SQLAlchemy 1.3.16
- pandas 1.0.3
  If pandas installs successfully but import pandas fails (on Windows 10), copy concrt140.dll and msvcp140.dll into the Python installation directory, or install the runtime package that provides them.
pandas is a data-analysis library that beats Excel hands down; after learning Python, the author uninstalled Office.
The author is obsessive about keeping machines clean and runs portable ("green") builds of both Python and VSCode.
The Oracle client is likewise a portable build; the server has no portable version, so it is installed on a home server and accessed over the LAN.
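The portability claim above can be sketched with SQLAlchemy Core against an in-memory SQLite database (an assumption for the sake of a runnable example; the article itself targets Oracle). Only the create_engine URL would change to point at another backend:

```python
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine("sqlite://")  # swap this URL to target Oracle, MySQL, ...
metadata = MetaData()
websites = Table(
    "websites", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(255)),
    Column("country", String(255)),
)
metadata.create_all(engine)

with engine.connect() as conn:
    conn.execute(websites.insert(), [
        {"id": 1, "name": "Google", "country": "USA"},
        {"id": 2, "name": "taobao", "country": "CN"},
    ])
    # the query itself is backend-agnostic; SQLAlchemy emits the right dialect
    rows = conn.execute(
        websites.select().where(websites.c.country == "CN")
    ).fetchall()

cn_names = [row[1] for row in rows]  # the name column
```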
Preparing the demo database
To demonstrate, first create the tables and data:
Open a new cmd terminal and log in to Oracle Database as the administrator:
> sqlplus sys/123456 as sysdba
Create a new user:
>>> create user abc identified by 123;
Grant DBA privileges:
>>> grant connect,resource,dba to abc;
Log in as the new user:
> sqlplus abc/123
Create the first table:
>>> CREATE TABLE Websites
(
id integer PRIMARY KEY,
name varchar(255),
url varchar(255),
alexa integer,
country varchar(255)
);
Insert data into the first table:
>>> INSERT INTO Websites (id, name, url, alexa, country)
VALUES ('1','Google','https://www.google.cm/','1','USA');
>>> INSERT INTO Websites (id, name, url, alexa, country)
VALUES ('2','淘宝','https://www.taobao.com/','13','CN');
>>> INSERT INTO Websites (id, name, url, alexa, country)
VALUES ('3','菜鸟教程','http://www.runoob.com/','4689','CN');
>>> INSERT INTO Websites (id, name, url, alexa, country)
VALUES ('4','微博','http://weibo.com/','20','CN');
>>> INSERT INTO Websites (id, name, url, alexa, country)
VALUES ('5','Facebook','https://www.facebook.com/','3','USA');
>>> INSERT INTO Websites (id, name, url, alexa, country)
VALUES ('6','百度','https://www.baidu.com/','4','CN');
>>> INSERT INTO Websites (id, name, url, alexa, country)
VALUES ('7','stackoverflow','http://stackoverflow.com/','0','IND');
View the data:
>>> select * from Websites;
+----+---------------+---------------------------+-------+---------+
| id | name | url | alexa | country |
+----+---------------+---------------------------+-------+---------+
| 1 | Google | https://www.google.cm/ | 1 | USA |
| 2 | 淘宝 | https://www.taobao.com/ | 13 | CN |
| 3 | 菜鸟教程 | http://www.runoob.com/ | 4689 | CN |
| 4 | 微博 | http://weibo.com/ | 20 | CN |
| 5 | Facebook | https://www.facebook.com/ | 3 | USA |
| 6 | 百度 | https://www.baidu.com/ | 4 | CN |
| 7 | stackoverflow | http://stackoverflow.com/ | 0 | IND |
+----+---------------+---------------------------+-------+---------+
Create the second table:
>>> CREATE TABLE access_log
(
aid integer PRIMARY KEY,
site_id integer,
count integer,
col_date date,
CONSTRAINT fk_site_id FOREIGN KEY (site_id) REFERENCES Websites(id)
);
Insert the data:
>>> INSERT INTO access_log (aid,site_id,count,col_date)
VALUES ('1','1','45',to_date('2016-05-10','yyyy-mm-dd'));
>>> INSERT INTO access_log (aid,site_id,count,col_date)
VALUES ('2','3','100',to_date('2016-05-13','yyyy-mm-dd'));
>>> INSERT INTO access_log (aid,site_id,count,col_date)
VALUES ('3','1','230',to_date('2016-05-14','yyyy-mm-dd'));
>>> INSERT INTO access_log (aid,site_id,count,col_date)
VALUES ('4','2','10',to_date('2016-05-14','yyyy-mm-dd'));
>>> INSERT INTO access_log (aid,site_id,count,col_date)
VALUES ('5','5','205',to_date('2016-05-14','yyyy-mm-dd'));
>>> INSERT INTO access_log (aid,site_id,count,col_date)
VALUES ('6','4','13',to_date('2016-05-15','yyyy-mm-dd'));
>>> INSERT INTO access_log (aid,site_id,count,col_date)
VALUES ('7','3','220',to_date('2016-05-15','yyyy-mm-dd'));
>>> INSERT INTO access_log (aid,site_id,count,col_date)
VALUES ('8','5','545',to_date('2016-05-16','yyyy-mm-dd'));
>>> INSERT INTO access_log (aid,site_id,count,col_date)
VALUES ('9','3','201',to_date('2016-05-17','yyyy-mm-dd'));
View the data:
>>> select * from access_log;
+-----+---------+-------+------------+
| aid | site_id | count | col_date |
+-----+---------+-------+------------+
| 1 | 1 | 45 | 2016-05-10 |
| 2 | 3 | 100 | 2016-05-13 |
| 3 | 1 | 230 | 2016-05-14 |
| 4 | 2 | 10 | 2016-05-14 |
| 5 | 5 | 205 | 2016-05-14 |
| 6 | 4 | 13 | 2016-05-15 |
| 7 | 3 | 220 | 2016-05-15 |
| 8 | 5 | 545 | 2016-05-16 |
| 9 | 3 | 201 | 2016-05-17 |
+-----+---------+-------+------------+
Create the third table:
>>> CREATE TABLE apps
(
id integer PRIMARY KEY,
app_name varchar(255),
url varchar(255),
country varchar(255)
);
Insert the data:
>>> INSERT INTO apps (id,app_name,url,country)
VALUES ('1','QQ APP','http://im.qq.com/','CN');
>>> INSERT INTO apps (id,app_name,url,country)
VALUES ('2','微博 APP','http://weibo.com/','CN');
>>> INSERT INTO apps (id,app_name,url,country)
VALUES ('3','淘宝 APP','https://www.taobao.com/','CN');
View the table:
>>> select * from apps;
+----+------------+-------------------------+---------+
| id | app_name | url | country |
+----+------------+-------------------------+---------+
| 1 | QQ APP | http://im.qq.com/ | CN |
| 2 | 微博 APP | http://weibo.com/ | CN |
| 3 | 淘宝 APP | https://www.taobao.com/ | CN |
+----+------------+-------------------------+---------+
That completes the table setup.
Connecting to the database with the SQLAlchemy ORM
Open an interactive Python session and start writing the program:
>>> import os
>>> os.environ["PATH"] += ";D:\\oracle"#add to PATH the directory where the portable Oracle client zip was unpacked (note the ';' separator). This is not sys.path: sys.path is where Python searches for packages on import.
>>> os.environ["NLS_LANG"] = "SIMPLIFIED CHINESE_CHINA.UTF8"#work around Oracle encoding problems
>>> from sqlalchemy import create_engine
>>> from sqlalchemy import MetaData, Table
>>> from sqlalchemy.orm import sessionmaker
>>> import pandas as pd
>>> from sqlalchemy import func
Create the database connection (this only creates the engine; nothing is bound to MetaData yet).
Note: this shows an Oracle connection; the format for other databases is essentially the same, see the SQLAlchemy docs.
To log in as sysdba, use 'oracle://abc:123@127.0.0.1:1521/orcl?mode=SYSDBA' instead.
>>> engine = create_engine('oracle://abc:123@127.0.0.1:1521/orcl', encoding='utf8', convert_unicode=True, echo = False)
Prepare to reflect the existing tables by creating a MetaData object (this binds the connection to the MetaData):
>>> metadata = MetaData(bind=engine)
Reflect the whole database (this is the first actual round-trip to the database):
>>> metadata.reflect()
>>> print(metadata.tables.keys())#list all table names
dict_keys(['apps', 'access_log', 'websites'])
Reflecting individual tables
>>> #reflect table one
>>> web_t = Table('websites', metadata, autoload=True, autoload_with=engine)
>>> web = web_t.alias('web')#give it an alias
>>> #reflect table two
>>> acc_t = Table('access_log', metadata, autoload=True,autoload_with=engine)
>>> acc = acc_t.alias('acc')#give it an alias
>>> #reflect table three
>>> app_t = Table('apps', metadata, autoload=True,autoload_with=engine)
>>> app = app_t.alias('app')#give it an alias
>>> # list all column names (in Python 3 the keys are already str, so no decoding tricks are needed)
>>> web_col_name = list(web.columns.keys())
>>> acc_col_name = list(acc.columns.keys())
>>> app_col_name = list(app.columns.keys())
>>> print(web_col_name,acc_col_name,app_col_name)
['id', 'name', 'url', 'alexa', 'country'] ['aid', 'site_id', 'count', 'col_date'] ['id', 'app_name', 'url', 'country']
The following way of aliasing works just as well; the drawback is one extra import:
from sqlalchemy.orm import aliased
web = aliased(web_t,name='web')
About cross-schema queries:
Suppose the database instance ORCL has two more users, 'def' and 'ghi'.
'def' owns table 'A' and 'ghi' owns table 'B', and user 'abc' has SELECT privileges on both tables.
All it takes is this:
from sqlalchemy.ext.automap import automap_base
autobase = automap_base()
sc = ['def','ghi']
for i in sc:
autobase.metadata.reflect(engine, schema=i)
autobase.prepare()
autobase.metadata.bind = engine
A_t = Table('def.A', autobase.metadata, autoload=True, autoload_with=engine)
B_t = Table('ghi.B', autobase.metadata, autoload=True, autoload_with=engine)
autobase.metadata.reflect(views=True)#views=True pulls in views as well
web_t = Table('websites', autobase.metadata, autoload=True, autoload_with=engine)
A single query can now use all three tables web_t, A_t and B_t. Don't forget the aliases, or you will easily hit the "identifier is too long" error.
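Here is a minimal, self-contained sketch of the reflection workflow, run against an in-memory SQLite database instead of Oracle (an assumption made so the example is runnable anywhere):

```python
from sqlalchemy import create_engine, MetaData, Table, text

engine = create_engine("sqlite://")
with engine.begin() as conn:
    # create a table to reflect (stand-in for the Oracle 'websites' table)
    conn.execute(text(
        "CREATE TABLE websites (id INTEGER PRIMARY KEY,"
        " name VARCHAR(255), country VARCHAR(255))"))

metadata = MetaData()
metadata.reflect(bind=engine)            # one call loads every table definition
table_names = sorted(metadata.tables.keys())

web_t = metadata.tables["websites"]      # already populated by reflect()
col_names = list(web_t.columns.keys())   # column names come back in table order
```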
============ Basic Queries ============
Querying tables through the ORM requires a global session instance first; statements are submitted to the database through this session.
>>> Session = sessionmaker()
>>> session = Session()
Aliases
Table aliases: giving a table an alias was covered during reflection; translated to SQL it looks like this:
>>> SQL = '''SELECT w.name FROM websites w'''
>>> pd.read_sql(SQL,engine)
name
0 Google
1 淘宝
2 菜鸟教程
3 微博
4 Facebook
5 百度
6 stackoverflow
Next, column aliases.
Raw SQL: aliasing columns
>>> SQL = "SELECT name AS n, country AS c FROM Websites"
>>> pd.read_sql(SQL,engine)
n c
0 Google USA
1 淘宝 CN
2 菜鸟教程 CN
3 微博 CN
4 Facebook USA
5 百度 CN
6 stackoverflow IND
ORM: aliasing columns
>>> ret = session.query(web.c.name.label('n'),web.c.country.label('c'))
>>> pd.DataFrame(ret.all())
n c
0 Google USA
1 淘宝 CN
2 菜鸟教程 CN
3 微博 CN
4 Facebook USA
5 百度 CN
6 stackoverflow IND
Querying a whole table
The SELECT statement
Raw SQL
>>> SQL = 'SELECT * FROM Websites'
>>> pd.read_sql(SQL,engine)
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 2 淘宝 https://www.taobao.com/ 13 CN
2 3 菜鸟教程 http://www.runoob.com/ 4689 CN
3 4 微博 http://weibo.com/ 20 CN
4 5 Facebook https://www.facebook.com/ 3 USA
5 6 百度 https://www.baidu.com/ 4 CN
6 7 stackoverflow http://stackoverflow.com/ 0 IND
ORM
>>> ret = session.query(web)
>>> pd.DataFrame(ret.all())
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 2 淘宝 https://www.taobao.com/ 13 CN
2 3 菜鸟教程 http://www.runoob.com/ 4689 CN
3 4 微博 http://weibo.com/ 20 CN
4 5 Facebook https://www.facebook.com/ 3 USA
5 6 百度 https://www.baidu.com/ 4 CN
6 7 stackoverflow http://stackoverflow.com/ 0 IND
>>> pd.read_sql("access_log",engine)#read_sql also accepts a bare table name
aid site_id count col_date
0 1 1 45 2016-05-10
1 2 3 100 2016-05-13
2 3 1 230 2016-05-14
3 4 2 10 2016-05-14
4 5 5 205 2016-05-14
5 6 4 13 2016-05-15
6 7 3 220 2016-05-15
7 8 5 545 2016-05-16
8 9 3 201 2016-05-17
Querying part of a table
ORM
>>> ret.first()
(1, 'Google', 'https://www.google.cm/', 1, 'USA')
>>> ret = session.query(web).slice(1,3)#rows 1 and 2 counting from 0, like Python slicing
>>> pd.DataFrame(ret.all())
id name url alexa country
0 2 淘宝 https://www.taobao.com/ 13 CN
1 3 菜鸟教程 http://www.runoob.com/ 4689 CN
Conditional queries
The WHERE clause
Raw SQL
>>> SQL = "SELECT url,alexa FROM Websites WHERE name='淘宝'"
>>> pd.read_sql(SQL,engine)
url alexa
0 https://www.taobao.com/ 13
ORM: columns are accessed as alias.c.column_name
>>> ret = session.query(web.c.url,web.c.alexa).filter(web.c.name == '淘宝')
>>> pd.DataFrame(ret.all())
url alexa
0 https://www.taobao.com/ 13
Raw SQL
>>> SQL = "SELECT url,alexa FROM Websites WHERE id>2 AND country = 'CN'"
>>> pd.read_sql(SQL,engine)
url alexa
0 http://www.runoob.com/ 4689
1 http://weibo.com/ 20
2 https://www.baidu.com/ 4
ORM
>>> ret = session.query(web.c.url,web.c.alexa).filter(web.c.id > 2, web.c.country == "CN")
>>> pd.DataFrame(ret.all())
url alexa
0 http://www.runoob.com/ 4689
1 http://weibo.com/ 20
2 https://www.baidu.com/ 4
Queries with multiple conditions
First import a few required helpers:
>>> from sqlalchemy import or_, and_, case
AND / OR
Raw SQL (AND binds tighter than OR, so the condition reads id<=2 OR (id>2 AND country='CN'))
>>> SQL = "SELECT * FROM Websites WHERE id<=2 or id>2 AND country = 'CN'"
>>> pd.read_sql(SQL,engine)
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 2 淘宝 https://www.taobao.com/ 13 CN
2 3 菜鸟教程 http://www.runoob.com/ 4689 CN
3 4 微博 http://weibo.com/ 20 CN
4 6 百度 https://www.baidu.com/ 4 CN
ORM
>>> ret = session.query(web).filter(or_(web.c.id<=2, and_(web.c.id>2, web.c.country == "CN")))
>>> pd.DataFrame(ret.all())
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 2 淘宝 https://www.taobao.com/ 13 CN
2 3 菜鸟教程 http://www.runoob.com/ 4689 CN
3 4 微博 http://weibo.com/ 20 CN
4 6 百度 https://www.baidu.com/ 4 CN
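The raw-SQL form above works only because AND binds tighter than OR. A quick check with the standard-library sqlite3 module and a tiny stand-in table (made-up data, not the Oracle tables above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE w (id INTEGER, country TEXT)")
conn.executemany("INSERT INTO w VALUES (?, ?)",
                 [(1, "USA"), (2, "CN"), (3, "CN"), (4, "USA")])

# no parentheses: relies on AND-before-OR precedence
implicit = conn.execute(
    "SELECT id FROM w WHERE id<=2 OR id>2 AND country='CN'").fetchall()
# explicit parentheses, same meaning
explicit = conn.execute(
    "SELECT id FROM w WHERE id<=2 OR (id>2 AND country='CN')").fetchall()

# id 4 is excluded: it fails id<=2 and also fails the AND branch
ids = sorted(r[0] for r in implicit)
```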
LIKE / NOT LIKE and the % wildcard
Raw SQL
>>> SQL = "SELECT * FROM Websites WHERE url NOT LIKE '%.com%'"
>>> pd.read_sql(SQL,engine)
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
ORM: ~ negates a condition
>>> ret = session.query(web).filter(~web.c.url.like("%.com%"))
>>> pd.DataFrame(ret.all())
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
>>> ret = session.query(web).filter(~web.c.url.ilike("%.com%"))#use ilike for case-insensitive matching
>>> pd.DataFrame(ret.all())
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
IN / NOT IN
Raw SQL (for NOT IN, just put NOT in front)
>>> SQL = "SELECT * FROM Websites WHERE name IN ('Google','菜鸟教程')"
>>> pd.read_sql(SQL,engine)
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 3 菜鸟教程 http://www.runoob.com/ 4689 CN
ORM (for NOT IN, just put ~ in front)
>>> ret = session.query(web).filter(web.c.name.in_(['Google','菜鸟教程']))
>>> pd.DataFrame(ret.all())
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 3 菜鸟教程 http://www.runoob.com/ 4689 CN
IS NULL / IS NOT NULL
Raw SQL
>>> SQL = "SELECT * FROM Websites WHERE name IS NULL"
>>> pd.read_sql(SQL,engine)
Empty DataFrame
Columns: [id, name, url, alexa, country]
Index: []
ORM
>>> ret = session.query(web).filter(web.c.name.is_(None))
>>> pd.DataFrame(ret.all())
Empty DataFrame
Columns: []
Index: []
>>> ret = session.query(web).filter(web.c.name == None)#or like this
>>> pd.DataFrame(ret.all())
Empty DataFrame
Columns: []
Index: []
Raw SQL
>>> SQL = "SELECT * FROM Websites WHERE name IS NOT NULL"
>>> pd.read_sql(SQL,engine)
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 2 淘宝 https://www.taobao.com/ 13 CN
2 3 菜鸟教程 http://www.runoob.com/ 4689 CN
3 4 微博 http://weibo.com/ 20 CN
4 5 Facebook https://www.facebook.com/ 3 USA
5 6 百度 https://www.baidu.com/ 4 CN
6 7 stackoverflow http://stackoverflow.com/ 0 IND
ORM
>>> ret = session.query(web).filter(web.c.name.isnot(None))
>>> pd.DataFrame(ret.all())
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 2 淘宝 https://www.taobao.com/ 13 CN
2 3 菜鸟教程 http://www.runoob.com/ 4689 CN
3 4 微博 http://weibo.com/ 20 CN
4 5 Facebook https://www.facebook.com/ 3 USA
5 6 百度 https://www.baidu.com/ 4 CN
6 7 stackoverflow http://stackoverflow.com/ 0 IND
>>> ret = session.query(web).filter(web.c.name != None)#or like this
>>> pd.DataFrame(ret.all())
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 2 淘宝 https://www.taobao.com/ 13 CN
2 3 菜鸟教程 http://www.runoob.com/ 4689 CN
3 4 微博 http://weibo.com/ 20 CN
4 5 Facebook https://www.facebook.com/ 3 USA
5 6 百度 https://www.baidu.com/ 4 CN
6 7 stackoverflow http://stackoverflow.com/ 0 IND
BETWEEN
Raw SQL
>>> SQL = "SELECT * FROM Websites WHERE alexa BETWEEN 1 AND 20"
>>> pd.read_sql(SQL,engine)
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 2 淘宝 https://www.taobao.com/ 13 CN
2 4 微博 http://weibo.com/ 20 CN
3 5 Facebook https://www.facebook.com/ 3 USA
4 6 百度 https://www.baidu.com/ 4 CN
ORM
>>> ret = session.query(web).filter(web.c.alexa.between(1, 20))
>>> pd.DataFrame(ret.all())
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 2 淘宝 https://www.taobao.com/ 13 CN
2 4 微博 http://weibo.com/ 20 CN
3 5 Facebook https://www.facebook.com/ 3 USA
4 6 百度 https://www.baidu.com/ 4 CN
The CASE WHEN expression
Raw SQL
>>> SQL = '''SELECT name,CASE
... WHEN alexa BETWEEN 1 AND 10 THEN '1 to 10'
... WHEN alexa BETWEEN 11 AND 100 THEN '11 to 100'
... ELSE 'morethen 100'
... END AS class
... FROM Websites'''
>>> pd.read_sql(SQL,engine)
name class
0 Google 1 to 10
1 淘宝 11 to 100
2 菜鸟教程 morethen 100
3 微博 11 to 100
4 Facebook 1 to 10
5 百度 1 to 10
6 stackoverflow morethen 100
ORM
>>> ret = session.query(web.c.name,case([(web.c.alexa.between(1,10),'1 to 10'),(web.c.alexa.between(11,100),'11 to 100')],else_='morethen 100').label('class'))
>>> pd.DataFrame(ret.all())
name class
0 Google 1 to 10
1 淘宝 11 to 100
2 菜鸟教程 morethen 100
3 微博 11 to 100
4 Facebook 1 to 10
5 百度 1 to 10
6 stackoverflow morethen 100
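The CASE WHEN bucketing itself is plain SQL and can be checked in isolation with the standard-library sqlite3 module (a made-up three-row table, not the Oracle data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE w (name TEXT, alexa INTEGER)")
conn.executemany("INSERT INTO w VALUES (?, ?)",
                 [("Google", 1), ("taobao", 13), ("runoob", 4689)])

# each row falls into the first WHEN branch it satisfies, else the ELSE branch
rows = conn.execute("""
    SELECT name, CASE
        WHEN alexa BETWEEN 1 AND 10 THEN '1 to 10'
        WHEN alexa BETWEEN 11 AND 100 THEN '11 to 100'
        ELSE 'more than 100'
    END FROM w""").fetchall()
```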
Data-tidying queries
Deduplication: SELECT DISTINCT
>>> from sqlalchemy import distinct
Raw SQL
>>> SQL = "SELECT DISTINCT country FROM Websites"
>>> pd.read_sql(SQL,engine)
country
0 USA
1 IND
2 CN
ORM
>>> ret = session.query(web.c.country).distinct()
>>> pd.DataFrame(ret.all())
country
0 USA
1 IND
2 CN
Or:
>>> ret = session.query(distinct(web.c.country).label('country'))
>>> pd.DataFrame(ret.all())
country
0 USA
1 IND
2 CN
Sorting: ORDER BY
Raw SQL
>>> SQL = "SELECT * FROM Websites ORDER BY alexa ASC"#ASC: ascending
>>> SQL = "SELECT * FROM Websites ORDER BY alexa DESC"#DESC: descending (this reassignment is the one actually run below)
>>> pd.read_sql(SQL,engine)
id name url alexa country
0 3 菜鸟教程 http://www.runoob.com/ 4689 CN
1 4 微博 http://weibo.com/ 20 CN
2 2 淘宝 https://www.taobao.com/ 13 CN
3 6 百度 https://www.baidu.com/ 4 CN
4 5 Facebook https://www.facebook.com/ 3 USA
5 1 Google https://www.google.cm/ 1 USA
6 7 stackoverflow http://stackoverflow.com/ 0 IND
ORM
>>> ret = session.query(web).order_by(web.c.alexa.asc())# ascending
>>> ret = session.query(web).order_by(web.c.alexa.desc())# descending
>>> pd.DataFrame(ret.all())
id name url alexa country
0 3 菜鸟教程 http://www.runoob.com/ 4689 CN
1 4 微博 http://weibo.com/ 20 CN
2 2 淘宝 https://www.taobao.com/ 13 CN
3 6 百度 https://www.baidu.com/ 4 CN
4 5 Facebook https://www.facebook.com/ 3 USA
5 1 Google https://www.google.cm/ 1 USA
6 7 stackoverflow http://stackoverflow.com/ 0 IND
Joining tables: JOIN ... ON
Raw SQL INNER JOIN: returns every pairing of rows that satisfies Websites.id=access_log.site_id
>>> SQL = """SELECT * FROM Websites JOIN access_log ON Websites.id=access_log.site_id"""
>>> pd.read_sql(SQL,engine)
id name url alexa country aid site_id count col_date
0 1 Google https://www.google.cm/ 1 USA 1 1 45 2016-05-10
1 1 Google https://www.google.cm/ 1 USA 3 1 230 2016-05-14
2 2 淘宝 https://www.taobao.com/ 13 CN 4 2 10 2016-05-14
3 3 菜鸟教程 http://www.runoob.com/ 4689 CN 2 3 100 2016-05-13
4 3 菜鸟教程 http://www.runoob.com/ 4689 CN 9 3 201 2016-05-17
5 3 菜鸟教程 http://www.runoob.com/ 4689 CN 7 3 220 2016-05-15
6 4 微博 http://weibo.com/ 20 CN 6 4 13 2016-05-15
7 5 Facebook https://www.facebook.com/ 3 USA 8 5 545 2016-05-16
8 5 Facebook https://www.facebook.com/ 3 USA 5 5 205 2016-05-14
>>> SQL = """SELECT * FROM Websites INNER JOIN access_log ON Websites.id=access_log.site_id"""
>>> pd.read_sql(SQL,engine)
id name url alexa country aid site_id count col_date
0 1 Google https://www.google.cm/ 1 USA 1 1 45 2016-05-10
1 1 Google https://www.google.cm/ 1 USA 3 1 230 2016-05-14
2 2 淘宝 https://www.taobao.com/ 13 CN 4 2 10 2016-05-14
3 3 菜鸟教程 http://www.runoob.com/ 4689 CN 2 3 100 2016-05-13
4 3 菜鸟教程 http://www.runoob.com/ 4689 CN 9 3 201 2016-05-17
5 3 菜鸟教程 http://www.runoob.com/ 4689 CN 7 3 220 2016-05-15
6 4 微博 http://weibo.com/ 20 CN 6 4 13 2016-05-15
7 5 Facebook https://www.facebook.com/ 3 USA 8 5 545 2016-05-16
8 5 Facebook https://www.facebook.com/ 3 USA 5 5 205 2016-05-14
ORM INNER JOIN
>>> ret = session.query(web, acc).filter(web.c.id == acc.c.site_id)#an inner join expressed through filter
>>> pd.DataFrame(ret.all())
id name url alexa country aid site_id count col_date
0 1 Google https://www.google.cm/ 1 USA 1 1 45 2016-05-10
1 1 Google https://www.google.cm/ 1 USA 3 1 230 2016-05-14
2 2 淘宝 https://www.taobao.com/ 13 CN 4 2 10 2016-05-14
3 3 菜鸟教程 http://www.runoob.com/ 4689 CN 2 3 100 2016-05-13
4 3 菜鸟教程 http://www.runoob.com/ 4689 CN 9 3 201 2016-05-17
5 3 菜鸟教程 http://www.runoob.com/ 4689 CN 7 3 220 2016-05-15
6 4 微博 http://weibo.com/ 20 CN 6 4 13 2016-05-15
7 5 Facebook https://www.facebook.com/ 3 USA 8 5 545 2016-05-16
8 5 Facebook https://www.facebook.com/ 3 USA 5 5 205 2016-05-14
>>> ret = session.query(web, acc).join(acc,web.c.id == acc.c.site_id)#an inner join with an explicitly supplied ON clause
>>> pd.DataFrame(ret.all())
id name url alexa country aid site_id count col_date
0 1 Google https://www.google.cm/ 1 USA 1 1 45 2016-05-10
1 1 Google https://www.google.cm/ 1 USA 3 1 230 2016-05-14
2 2 淘宝 https://www.taobao.com/ 13 CN 4 2 10 2016-05-14
3 3 菜鸟教程 http://www.runoob.com/ 4689 CN 2 3 100 2016-05-13
4 3 菜鸟教程 http://www.runoob.com/ 4689 CN 9 3 201 2016-05-17
5 3 菜鸟教程 http://www.runoob.com/ 4689 CN 7 3 220 2016-05-15
6 4 微博 http://weibo.com/ 20 CN 6 4 13 2016-05-15
7 5 Facebook https://www.facebook.com/ 3 USA 8 5 545 2016-05-16
8 5 Facebook https://www.facebook.com/ 3 USA 5 5 205 2016-05-14
>>> ret = session.query(web).join(acc)#an inner join inferred from the foreign key, keeping only the left table's columns
>>> pd.DataFrame(ret.all())
id name url alexa country
0 1 Google https://www.google.cm/ 1 USA
1 1 Google https://www.google.cm/ 1 USA
2 2 淘宝 https://www.taobao.com/ 13 CN
3 3 菜鸟教程 http://www.runoob.com/ 4689 CN
4 3 菜鸟教程 http://www.runoob.com/ 4689 CN
5 3 菜鸟教程 http://www.runoob.com/ 4689 CN
6 4 微博 http://weibo.com/ 20 CN
7 5 Facebook https://www.facebook.com/ 3 USA
8 5 Facebook https://www.facebook.com/ 3 USA
>>> ret = session.query(acc).join(web)#the same foreign-key inner join, keeping only the access_log columns
>>> pd.DataFrame(ret.all())
aid site_id count col_date
0 1 1 45 2016-05-10
1 2 3 100 2016-05-13
2 3 1 230 2016-05-14
3 4 2 10 2016-05-14
4 5 5 205 2016-05-14
5 6 4 13 2016-05-15
6 7 3 220 2016-05-15
7 8 5 545 2016-05-16
8 9 3 201 2016-05-17
Raw SQL LEFT JOIN: rows with no match in the right table are kept as well
>>> SQL = """SELECT * FROM Websites LEFT JOIN access_log ON Websites.id=access_log.site_id"""
>>> pd.read_sql(SQL,engine)
id name url alexa country aid site_id count col_date
0 1 Google https://www.google.cm/ 1 USA 1.0 1.0 45.0 2016-05-10
1 1 Google https://www.google.cm/ 1 USA 3.0 1.0 230.0 2016-05-14
2 2 淘宝 https://www.taobao.com/ 13 CN 4.0 2.0 10.0 2016-05-14
3 3 菜鸟教程 http://www.runoob.com/ 4689 CN 2.0 3.0 100.0 2016-05-13
4 3 菜鸟教程 http://www.runoob.com/ 4689 CN 9.0 3.0 201.0 2016-05-17
5 3 菜鸟教程 http://www.runoob.com/ 4689 CN 7.0 3.0 220.0 2016-05-15
6 4 微博 http://weibo.com/ 20 CN 6.0 4.0 13.0 2016-05-15
7 5 Facebook https://www.facebook.com/ 3 USA 8.0 5.0 545.0 2016-05-16
8 5 Facebook https://www.facebook.com/ 3 USA 5.0 5.0 205.0 2016-05-14
9 6 百度 https://www.baidu.com/ 4 CN NaN NaN NaN NaT
10 7 stackoverflow http://stackoverflow.com/ 0 IND NaN NaN NaN NaT
ORM: passing isouter=True is equivalent to a LEFT JOIN
>>> ret = session.query(web,acc).join(acc, isouter=True)#a left join via the foreign key
>>> pd.DataFrame(ret.all())
id name url alexa country aid site_id count col_date
0 1 Google https://www.google.cm/ 1 USA 1.0 1.0 45.0 2016-05-10
1 1 Google https://www.google.cm/ 1 USA 3.0 1.0 230.0 2016-05-14
2 2 淘宝 https://www.taobao.com/ 13 CN 4.0 2.0 10.0 2016-05-14
3 3 菜鸟教程 http://www.runoob.com/ 4689 CN 2.0 3.0 100.0 2016-05-13
4 3 菜鸟教程 http://www.runoob.com/ 4689 CN 9.0 3.0 201.0 2016-05-17
5 3 菜鸟教程 http://www.runoob.com/ 4689 CN 7.0 3.0 220.0 2016-05-15
6 4 微博 http://weibo.com/ 20 CN 6.0 4.0 13.0 2016-05-15
7 5 Facebook https://www.facebook.com/ 3 USA 8.0 5.0 545.0 2016-05-16
8 5 Facebook https://www.facebook.com/ 3 USA 5.0 5.0 205.0 2016-05-14
9 6 百度 https://www.baidu.com/ 4 CN NaN NaN NaN NaT
10 7 stackoverflow http://stackoverflow.com/ 0 IND NaN NaN NaN NaT
>>> ret = session.query(web,acc).join(acc,full=True)#a full outer join via the foreign key
>>> pd.DataFrame(ret.all())
id name url alexa country aid site_id count col_date
0 1 Google https://www.google.cm/ 1 USA 1.0 1.0 45.0 2016-05-10
1 3 菜鸟教程 http://www.runoob.com/ 4689 CN 2.0 3.0 100.0 2016-05-13
2 1 Google https://www.google.cm/ 1 USA 3.0 1.0 230.0 2016-05-14
3 2 淘宝 https://www.taobao.com/ 13 CN 4.0 2.0 10.0 2016-05-14
4 5 Facebook https://www.facebook.com/ 3 USA 5.0 5.0 205.0 2016-05-14
5 4 微博 http://weibo.com/ 20 CN 6.0 4.0 13.0 2016-05-15
6 3 菜鸟教程 http://www.runoob.com/ 4689 CN 7.0 3.0 220.0 2016-05-15
7 5 Facebook https://www.facebook.com/ 3 USA 8.0 5.0 545.0 2016-05-16
8 3 菜鸟教程 http://www.runoob.com/ 4689 CN 9.0 3.0 201.0 2016-05-17
9 6 百度 https://www.baidu.com/ 4 CN NaN NaN NaN NaT
10 7 stackoverflow http://stackoverflow.com/ 0 IND NaN NaN NaN NaT
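The difference between an inner and a left join shows up in the row counts: unmatched left-side rows disappear from the former and are NULL-padded in the latter. A sqlite3 sketch with made-up stand-in tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE w (id INTEGER, name TEXT);
    CREATE TABLE a (aid INTEGER, site_id INTEGER);
    INSERT INTO w VALUES (1,'Google'),(2,'taobao'),(3,'baidu');
    INSERT INTO a VALUES (1,1),(2,1),(3,2);
""")

inner = conn.execute(
    "SELECT w.name, a.aid FROM w JOIN a ON w.id=a.site_id").fetchall()
left = conn.execute(
    "SELECT w.name, a.aid FROM w LEFT JOIN a ON w.id=a.site_id").fetchall()
# 'baidu' has no access rows: absent from the inner join,
# present with a NULL aid in the left join
```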
Combining tables: UNION
Raw SQL UNION: select every distinct country from the "Websites" and "apps" tables (duplicates removed):
>>> SQL = '''SELECT country FROM Websites
... UNION
... SELECT country FROM apps
... ORDER BY country'''
>>> pd.read_sql(SQL,engine)
country
0 CN
1 IND
2 USA
ORM
>>> ret = session.query(web.c.country).union(session.query(app.c.country))
>>> pd.DataFrame(ret.all())
country
0 CN
1 IND
2 USA
Raw SQL UNION ALL: select every country from the "Websites" and "apps" tables (duplicates kept):
>>> SQL = '''SELECT country FROM Websites
... UNION ALL
... SELECT country FROM apps
... ORDER BY country'''
>>> pd.read_sql(SQL,engine)
country
0 CN
1 CN
2 CN
3 CN
4 CN
5 CN
6 CN
7 IND
8 USA
9 USA
ORM
>>> ret = session.query(web.c.country).union_all(session.query(app.c.country))
>>> pd.DataFrame(ret.all())
country
0 USA
1 CN
2 CN
3 CN
4 USA
5 CN
6 IND
7 CN
8 CN
9 CN
>>> #cte() corresponds to Oracle's WITH AS. Compared with the approach above, it materializes a temporary result set so the same content is not queried repeatedly, which improves performance.
>>> sub = session.query(web.c.country).union_all(session.query(app.c.country)).cte("temp_table")
>>> pd.DataFrame(session.query(sub).all())
web_country
0 USA
1 CN
2 CN
3 CN
4 USA
5 CN
6 IND
7 CN
8 CN
9 CN
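UNION vs UNION ALL is again plain SQL, so the deduplication behaviour can be verified with sqlite3 on a small made-up pair of tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE w (country TEXT);
    CREATE TABLE p (country TEXT);
    INSERT INTO w VALUES ('USA'),('CN'),('CN');
    INSERT INTO p VALUES ('CN'),('CN');
""")

# UNION removes duplicates across both inputs
distinct_rows = conn.execute(
    "SELECT country FROM w UNION SELECT country FROM p ORDER BY country"
).fetchall()
# UNION ALL keeps every row
all_rows = conn.execute(
    "SELECT country FROM w UNION ALL SELECT country FROM p").fetchall()
```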
Concatenation: CONCAT()
Raw SQL
>>> SQL = "SELECT CONCAT(id, name) FROM Websites"
>>> pd.read_sql(SQL,engine)
CONCAT(ID,NAME)
0 1Google
1 2淘宝
2 3菜鸟教程
3 4微博
4 5Facebook
5 6百度
6 7stackoverflow
ORM
>>> ret = session.query(func.concat(web.c.id, web.c.name).label('id,name'))
>>> pd.DataFrame(ret.all())
id,name
0 1Google
1 2淘宝
2 3菜鸟教程
3 4微博
4 5Facebook
5 6百度
6 7stackoverflow
Raw SQL (Oracle's CONCAT takes exactly two arguments, hence the nesting)
>>> SQL = "SELECT CONCAT(CONCAT(id, name), url) FROM Websites"
>>> pd.read_sql(SQL,engine)
CONCAT(CONCAT(ID,NAME),URL)
0 1Googlehttps://www.google.cm/
1 2淘宝https://www.taobao.com/
2 3菜鸟教程http://www.runoob.com/
3 4微博http://weibo.com/
4 5Facebookhttps://www.facebook.com/
5 6百度https://www.baidu.com/
6 7stackoverflowhttp://stackoverflow.com/
ORM
>>> ret = session.query(func.concat(func.concat(web.c.id, web.c.name),web.c.url).label('id,name,url'))
>>> pd.DataFrame(ret.all())
id,name,url
0 1Googlehttps://www.google.cm/
1 2淘宝https://www.taobao.com/
2 3菜鸟教程http://www.runoob.com/
3 4微博http://weibo.com/
4 5Facebookhttps://www.facebook.com/
5 6百度https://www.baidu.com/
6 7stackoverflowhttp://stackoverflow.com/
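The nesting is an Oracle quirk: standard SQL (and most other databases) concatenate with the || operator, which chains naturally. For comparison, with sqlite3 and a one-row stand-in table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE w (id INTEGER, name TEXT, url TEXT)")
conn.execute("INSERT INTO w VALUES (1, 'Google', 'https://www.google.cm/')")

# || coerces the integer id to text and chains without nesting
row = conn.execute("SELECT id || name || url FROM w").fetchone()
```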
Advanced queries
Aggregate functions
Average: AVG()
Raw SQL: select the "site_id" and "count" of rows whose count is above the average:
>>> SQL = "SELECT site_id, count FROM access_log WHERE count > (SELECT AVG(count) FROM access_log)"
>>> pd.read_sql(SQL,engine)
site_id count
0 1 230
1 5 205
2 3 220
3 5 545
4 3 201
ORM
>>> ret = session.query(acc.c.site_id,acc.c.count).filter(acc.c.count>session.query(func.avg(acc.c.count)))
>>> pd.DataFrame(ret.all())
site_id count
0 1 230
1 5 205
2 3 220
3 5 545
4 3 201
Counting: COUNT()
Raw SQL
>>> SQL = "SELECT count(*) FROM access_log"
>>> pd.read_sql(SQL,engine)
COUNT(*)
0 9
ORM
>>> session.query(acc).count()#a convenient shortcut
9
>>> session.query(func.count('*')).select_from(acc).scalar()#the general form
9
>>> session.query(func.count(acc.c.aid)).scalar()#counting the primary key acc.c.aid avoids select_from()
9
Raw SQL: count the number of distinct site_id values in "access_log":
>>> SQL = "SELECT COUNT(DISTINCT site_id) AS nums FROM access_log"
>>> pd.read_sql(SQL,engine)
nums
0 5
ORM
>>> session.query(distinct(acc.c.site_id)).count()#a convenient shortcut
5
>>> ret = session.query(func.count(distinct(acc.c.site_id)).label('nums'))
>>> pd.DataFrame(ret.all())
nums
0 5
>>> #and, as a bonus, a subquery version
>>> retSub = session.query(acc.c.site_id).distinct().subquery('t')
>>> ret = session.query(func.count(retSub.c.site_id).label('nums'))
>>> pd.DataFrame(ret.all())
nums
0 5
Maximum: MAX()
Raw SQL
>>> SQL = "SELECT MAX(alexa) AS max_alexa FROM Websites"
>>> pd.read_sql(SQL,engine)
max_alexa
0 4689
ORM
>>> ret = session.query(func.max(web.c.alexa).label('max_alexa'))
>>> pd.DataFrame(ret.all())
max_alexa
0 4689
Minimum: MIN()
Raw SQL
>>> SQL = "SELECT MIN(alexa) AS min_alexa FROM Websites"
>>> pd.read_sql(SQL,engine)
min_alexa
0 0
ORM
>>> ret = session.query(func.min(web.c.alexa).label('min_alexa'))
>>> pd.DataFrame(ret.all())
min_alexa
0 0
Sum: SUM()
Raw SQL
>>> SQL = "SELECT SUM(count) AS nums FROM access_log"
>>> pd.read_sql(SQL,engine)
nums
0 1569
ORM
>>> ret = session.query(func.sum(acc.c.count).label("nums"))
>>> pd.DataFrame(ret.all())
nums
0 1569
Formatting
Raw SQL
>>> SQL = "SELECT aid, site_id, count, to_char(col_date,'yyyymm') as ym FROM access_log"#to_date converts the other way, string to date
>>> pd.read_sql(SQL,engine)
aid site_id count ym
0 1 1 45 201605
1 2 3 100 201605
2 3 1 230 201605
3 4 2 10 201605
4 5 5 205 201605
5 6 4 13 201605
6 7 3 220 201605
7 8 5 545 201605
8 9 3 201 201605
ORM
>>> ret = session.query(acc.c.aid, acc.c.site_id, acc.c.count,
... func.to_char(acc.c.col_date,'yyyymm').label("ym"))
>>> pd.DataFrame(ret.all())
aid site_id count ym
0 1 1 45 201605
1 2 3 100 201605
2 3 1 230 201605
3 4 2 10 201605
4 5 5 205 201605
5 6 4 13 201605
6 7 3 220 201605
7 8 5 545 201605
8 9 3 201 201605
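Note that to_char is Oracle syntax; date formatting is one of the places where raw SQL is not portable, and func.to_char in the ORM is only as portable as the underlying function. For comparison, SQLite spells the same conversion with strftime:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite's equivalent of Oracle's to_char(col_date, 'yyyymm')
row = conn.execute("SELECT strftime('%Y%m', '2016-05-10')").fetchone()
```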
Grouping and aggregating: GROUP BY ... HAVING
Raw SQL
>>> SQL = """SELECT site_id, to_char(col_date,'yyyymm') as ym,SUM(count) as sc
FROM access_log GROUP BY site_id, to_char(col_date,'yyyymm')"""
>>> pd.read_sql(SQL,engine)
site_id ym sc
0 1 201605 275
1 2 201605 10
2 3 201605 521
3 5 201605 750
4 4 201605 13
ORM
>>> cols = [acc.c.site_id,func.to_char(acc.c.col_date,'yyyymm').label('ym')]
>>> ret = session.query(*cols,func.sum(acc.c.count).label('sc')).group_by(*cols)
>>> pd.DataFrame(ret.all())
site_id ym sc
0 1 201605 275
1 2 201605 10
2 3 201605 521
3 5 201605 750
4 4 201605 13
Note: do not write it in the following form; build the column list once and reuse it instead of spelling the same expressions out twice:
#wrong: the same expressions should not be written out a second time
session.query(acc.c.site_id,func.to_char(acc.c.col_date,'yyyymm').label('ym'),
func.sum(acc.c.count).label('sc')).group_by(acc.c.site_id,
func.to_char(acc.c.col_date,'yyyymm').label('ym'))
Raw SQL
>>> SQL = """SELECT Websites.name, SUM(access_log.count) AS nums FROM Websites
... INNER JOIN access_log
... ON Websites.id=access_log.site_id
... WHERE Websites.alexa < 200
... GROUP BY Websites.name
... HAVING SUM(access_log.count) > 200"""
>>> pd.read_sql(SQL,engine)
name nums
0 Google 275
1 Facebook 750
ORM
>>> #join through the implicit foreign key
>>> ret = session.query(web.c.name,func.sum(acc.c.count).label("nums")).\
... join(acc).filter(web.c.alexa < 200).\
... group_by(web.c.name).having(func.sum(acc.c.count) > 200)
>>> pd.DataFrame(ret.all())
name nums
0 Google 275
1 Facebook 750
>>> #join with the key condition written out explicitly
>>> ret = session.query(web.c.name,func.sum(acc.c.count).label("nums")).\
... join(acc).filter(web.c.id==acc.c.site_id,web.c.alexa < 200).\
... group_by(web.c.name).having(func.sum(acc.c.count) > 200)
>>> pd.DataFrame(ret.all())
name nums
0 Google 275
1 Facebook 750
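The GROUP BY / HAVING mechanics can be checked in isolation with sqlite3 and made-up access counts. The point is that HAVING filters whole groups after aggregation, unlike WHERE, which filters rows before it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (site_id INTEGER, count INTEGER);
    INSERT INTO a VALUES (1,45),(1,230),(2,10),(3,100),(3,220);
""")

# per-site sums: 1 -> 275, 2 -> 10, 3 -> 320; HAVING drops the 10
rows = conn.execute("""
    SELECT site_id, SUM(count) AS nums FROM a
    GROUP BY site_id HAVING SUM(count) > 200""").fetchall()
```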
EXISTS / NOT EXISTS
Raw SQL EXISTS
>>> SQL = "SELECT Websites.name, Websites.url FROM Websites \
... WHERE EXISTS (SELECT count FROM access_log WHERE Websites.id = access_log.site_id AND count > 200)"
>>> pd.read_sql(SQL,engine)
name url
0 Google https://www.google.cm/
1 菜鸟教程 http://www.runoob.com/
2 Facebook https://www.facebook.com/
ORM
>>> ret = session.query().filter(web.c.id==acc.c.site_id,acc.c.count>200).exists()
>>> ret = session.query(web.c.name,web.c.url).filter(ret)
>>> pd.DataFrame(ret.all())
name url
0 Google https://www.google.cm/
1 菜鸟教程 http://www.runoob.com/
2 Facebook https://www.facebook.com/
>>> # another way, closer to the raw SQL
>>> from sqlalchemy.sql import exists
>>> ret = exists([acc.c.count]).where(and_(web.c.id==acc.c.site_id,acc.c.count>200))
>>> ret = session.query(web.c.name,web.c.url).filter(ret)
>>> pd.DataFrame(ret.all())
name url
0 Google https://www.google.cm/
1 菜鸟教程 http://www.runoob.com/
2 Facebook https://www.facebook.com/
Raw SQL NOT EXISTS
>>> SQL = "SELECT Websites.name, Websites.url FROM Websites \
... WHERE NOT EXISTS (SELECT count FROM access_log WHERE Websites.id = access_log.site_id AND count > 200)"
>>> pd.read_sql(SQL,engine)
name url
0 淘宝 https://www.taobao.com/
1 微博 http://weibo.com/
2 百度 https://www.baidu.com/
3 stackoverflow http://stackoverflow.com/
ORM
>>> ret = session.query().filter(web.c.id==acc.c.site_id,acc.c.count>200).exists()
>>> ret = session.query(web.c.name,web.c.url).filter(~ret)
>>> pd.DataFrame(ret.all())
name url
0 淘宝 https://www.taobao.com/
1 微博 http://weibo.com/
2 百度 https://www.baidu.com/
3 stackoverflow http://stackoverflow.com/
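Finally, a self-contained check of the EXISTS / NOT EXISTS split, again with sqlite3 and a two-row stand-in for each table (made-up data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE w (id INTEGER, name TEXT);
    CREATE TABLE a (site_id INTEGER, count INTEGER);
    INSERT INTO w VALUES (1,'Google'),(2,'taobao');
    INSERT INTO a VALUES (1,230),(2,10);
""")

# sites with at least one access row over 200
hot = conn.execute("""
    SELECT name FROM w WHERE EXISTS
        (SELECT 1 FROM a WHERE w.id = a.site_id AND count > 200)""").fetchall()
# sites with no such access row
cold = conn.execute("""
    SELECT name FROM w WHERE NOT EXISTS
        (SELECT 1 FROM a WHERE w.id = a.site_id AND count > 200)""").fetchall()
```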