Hi there.
What I want to talk about today is relational database design. In particular, I'll look at two staples of relational SQL: the join and the subselect. For my experiment I'll use PostgreSQL and Python.
So, let's start:
BASH:
user@host~$ su postgres
postgres@host~$ createuser -P test
Enter password for new role:
Enter it again:
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create more new roles? (y/n) n
postgres@host~$ createdb -O test test
postgres@host~$ exit
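The scripts below talk to the server through PyGreSQL's DB-API module, pgdb, whose connect() takes a DSN of the form host:database:user:password. Before going further, here is a minimal sketch (assuming PyGreSQL is installed and the server accepts TCP connections on localhost) to check that the new role can actually log in:
PYTHON:
#!/usr/bin/env python
# Connectivity check: a minimal sketch, assuming PyGreSQL's pgdb module
# is installed and PostgreSQL accepts TCP connections on localhost.
import pgdb

# DSN format for pgdb.connect(): host:database:user:password
conn = pgdb.connect("localhost:test:test:test::")
cur = conn.cursor()
cur.execute("SELECT version()")
print cur.fetchone()[0]  # prints the server version string
cur.close()
conn.close()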
After creating the test user and database, I wrote a little script to automate creating and filling the tables:
PYTHON:
#!/usr/bin/env python
import sys, pgdb
from math import floor

# connect
try:
    conn = pgdb.connect("localhost:test:test:test::")
    print conn
except StandardError, e:
    print str(e)
    sys.exit(1)
cur = conn.cursor()
# drop tables left over from a previous run
cur.execute("DROP TABLE IF EXISTS t1")
cur.execute("DROP TABLE IF EXISTS t2")
conn.commit()
try:
    # create tables; UNIQUE gives each column an implicit index
    cur.execute("""CREATE TABLE t1 (
        f1 INTEGER UNIQUE,
        f2 INTEGER UNIQUE
    )""")
    cur.execute("""CREATE TABLE t2 (
        f1 INTEGER UNIQUE,
        f2 INTEGER UNIQUE
    )""")
    conn.commit()
    # fill tables, committing every 10000 rows to keep transactions small
    max_rows = 1000001
    sql = "INSERT INTO %s VALUES(%d, %d)"
    for i in range(1, max_rows):
        cur.execute(sql % ("t1", max_rows + i, i))
        cur.execute(sql % ("t2", max_rows - i, floor(max_rows / 42) + i))
        if i % 10000 == 0:
            conn.commit()
    conn.commit()
except StandardError, e:
    print str(e)
cur.close()
conn.close()
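Since every column in both tables is declared UNIQUE, PostgreSQL builds an implicit unique index on each of them, so all three statements we are about to benchmark should end up as index scans rather than sequential scans. That is easy to verify with EXPLAIN before timing anything; here is a small sketch (the exact plan text varies between PostgreSQL versions):
PYTHON:
#!/usr/bin/env python
# Ask the planner how it will run each test statement: a sketch, the
# plan output depends on the PostgreSQL version and its statistics.
import pgdb

conn = pgdb.connect("localhost:test:test:test::")
cur = conn.cursor()
statements = [
    "SELECT f2 FROM t2 WHERE f1 = 42",
    "SELECT * FROM t1 JOIN t2 ON t1.f2 = t2.f1 WHERE t1.f2 = 42",
    "SELECT * FROM t1 WHERE f2 IN (SELECT f1 FROM t2 WHERE f1 = 42)"]
for stmt in statements:
    cur.execute("EXPLAIN " + stmt)
    print stmt
    for row in cur.fetchall():  # EXPLAIN returns one text row per plan line
        print "    " + row[0]
cur.close()
conn.close()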
After filling the tables with 1000000 records each, we are ready to test joins, subselects and so on. The two tables are bound to each other through the fields t1.f2 and t2.f1, which hold the same values, but in reverse order. The script that compares the different statements is here:
PYTHON:
#!/usr/bin/env python
import sys, pgdb
from random import randint
from time import time
from numarray import array

# connect
try:
    conn = pgdb.connect("localhost:test:test:test::")
    print conn
except StandardError, e:
    print str(e)
    sys.exit(1)
cur = conn.cursor()
try:
    # test statements
    sql = ["SELECT f2 FROM t2 WHERE f1 = %d",
           "SELECT * FROM t1 JOIN t2 ON t1.f2 = t2.f1 WHERE t1.f2 = %d",
           "SELECT * FROM t1 WHERE f2 IN (SELECT f1 FROM t2 WHERE f1 = %d)"]
    # test statement labels
    sql_type = ["Select", "Join", "Subselect"]
    max_rand = 1000000  # largest key present in the tables
    max_test = 100      # times each statement will be called
    for j in range(len(sql)):
        times = []
        for i in range(max_test):
            time_before = time()
            cur.execute(sql[j] % randint(1, max_rand))
            time_after = time()
            times.append(time_after - time_before)
        times = array(times)
        msg = "%s: iterations: %d avg: %fs max: %fs min: %fs"
        print msg % (sql_type[j], max_test, times.sum() / len(times),
                     times.max(), times.min())
except StandardError, e:
    print str(e)
cur.close()
conn.close()
and here is what it outputs on my run-of-the-mill Sempron 3000+ with 512 MB of RAM:
BASH:
user@host~$ ./join_test.py
<pgdb.pgdbCnx instance at 0xb786772c>
Select: iterations: 100 avg: 0.010689s max: 0.042272s min: 0.000463s
Join: iterations: 100 avg: 0.016104s max: 0.027008s min: 0.000622s
Subselect: iterations: 100 avg: 0.016590s max: 0.062419s min: 0.000632s
The main thing we see here is that the direct select is roughly a third faster than the join. So in a real application, if we duplicate a column so that we don't have to join for it, we can gain about a third in performance. However, running the same script with 1000 iterations per statement gives the following:
BASH:
user@host~$ ./join_test.py
<pgdb.pgdbCnx instance at 0xb78c272c>
Select: iterations: 1000 avg: 0.009090s max: 0.136451s min: 0.000448s
Join: iterations: 1000 avg: 0.019139s max: 0.175684s min: 0.000586s
Subselect: iterations: 1000 avg: 0.017999s max: 0.398541s min: 0.000606s
Here we see that with more iterations the join costs about twice as much as the plain select.
So the best way to plan a database can vary. As we have seen, duplicating data buys performance, but taken to the extreme it would leave us with just one huge table. Splitting a database into logical parts and then continually optimizing it (adding indexes and so on) is the right approach for relational databases, which is what modern databases, as a matter of fact, are.
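For completeness, here is what the duplication mentioned above could look like in practice: a sketch, where the extra column t2_f2 is a made-up name and the UPDATE assumes the tables created by the scripts above. After it runs, the former join can be answered from t1 alone:
PYTHON:
#!/usr/bin/env python
# Denormalization sketch: copy t2.f2 into t1 so the lookup no longer
# needs a join. The column name t2_f2 is hypothetical.
import pgdb

conn = pgdb.connect("localhost:test:test:test::")
cur = conn.cursor()
cur.execute("ALTER TABLE t1 ADD COLUMN t2_f2 INTEGER")
# PostgreSQL's UPDATE ... FROM fills the new column from t2 in one pass
cur.execute("UPDATE t1 SET t2_f2 = t2.f2 FROM t2 WHERE t1.f2 = t2.f1")
conn.commit()
# the join from the benchmark becomes a plain indexed select:
cur.execute("SELECT f1, f2, t2_f2 FROM t1 WHERE f2 = 42")
print cur.fetchone()
cur.close()
conn.close()
The price is the usual one: every change to t2.f2 now has to be written in two places, kept in sync by the application or by a trigger.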
Useful links:
http://www.revsys.com/writings/postgresql-performance.html