The best of both worlds: Python and q interpreters in one process, with an IPython notebook frontend.
This is just a quick example using pyq; check out the full article by Himanshu Gupta.
Both tools are great for data analysis, but if performance is a concern you may find the numbers at the bottom interesting.
Vehicle collision data courtesy of NYC Open Data.
In [1]:
import pandas as pd
from pyq import q
In [2]:
%load_ext pyq.magic
%%q
To run q code in the interpreter, use either the q(...) function or the IPython cell magic %%q. q variables (K types) are available to Python via the q namespace, e.g. q.var1.
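As an aside, the attribute-style lookup (`q.var1`) can be pictured with a toy Python class. This is purely illustrative of the proxy pattern, not how pyq is actually implemented, and the right-hand side of the assignment here is a Python literal rather than real q syntax:

```python
# Toy sketch (NOT pyq's implementation): an object whose attribute access
# proxies variables held by an embedded interpreter.
class InterpreterNamespace:
    def __init__(self):
        self._vars = {}  # stands in for the embedded interpreter's variables

    def __call__(self, expr):
        # pyq's q('...') evaluates a q expression; this toy only supports
        # a simple "name:value" assignment, with a Python literal value.
        name, _, value = expr.partition(':')
        self._vars[name] = eval(value, {})
        return self._vars[name]

    def __getattr__(self, name):
        # q.var1-style lookup falls through to the interpreter's variables
        try:
            return self._vars[name]
        except KeyError:
            raise AttributeError(name)

q_toy = InterpreterNamespace()
q_toy('var1:[1, 2, 3]')
print(q_toy.var1)  # → [1, 2, 3]
```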
In [3]:
%%q
a:("DTSIFF****IIIIIIII*****I*****";enlist ",")0:`:NYPD_Motor_Vehicle_Collisions.csv // read the csv; one type char per column (D=date, T=time, S=symbol, I=int, F=float, *=string)
a:(cols .Q.id a) xcol a // remove the spaces in the column names
3#a // first 3 rows
Out[3]:
In [4]:
%%q
count a // number of rows - largish
Out[4]:
In [5]:
type(q.a)  # a native K type, as per k.h
Out[5]:
In [6]:
df = pd.read_csv("NYPD_Motor_Vehicle_Collisions.csv", parse_dates=[['DATE', 'TIME']])  # Python's turn to read the file
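Passing a nested list to `parse_dates` merges the two columns into a single datetime column named `DATE_TIME`, which is why the groupby further down uses `df.DATE_TIME.dt.year`. A tiny sketch on made-up rows (note that nested-list `parse_dates` is deprecated in newer pandas releases):

```python
import io
import pandas as pd

# parse_dates=[['DATE', 'TIME']] joins the DATE and TIME columns and
# parses the result into one datetime column named 'DATE_TIME'.
csv = io.StringIO("DATE,TIME,INJURED\n01/01/2015,9:30,2\n06/01/2015,14:05,0\n")
toy = pd.read_csv(csv, parse_dates=[['DATE', 'TIME']])
print(toy.DATE_TIME.dt.year.tolist())  # → [2015, 2015]
```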
In [7]:
df.columns = [c.replace(' ', '') for c in df.columns] # remove the spaces in column names
In [8]:
df[:3]
Out[8]:
Timing the q aggregation
In [9]:
%timeit -n 100 q('grp:select sum NUMBEROFPERSONSINJURED,sum NUMBEROFPERSONSKILLED by DATE.year from a')
In [10]:
q.grp
Out[10]:
Timing the pandas aggregation
7-8x slower with pandas:
In [11]:
%timeit -n 100 grp=df[['NUMBEROFPERSONSINJURED','NUMBEROFPERSONSKILLED']].groupby(df.DATE_TIME.dt.year).sum()
In [12]:
df[['NUMBEROFPERSONSINJURED','NUMBEROFPERSONSKILLED']].groupby(df.DATE_TIME.dt.year).sum()
Out[12]:
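For readers without kdb+ to hand, the pandas side of the comparison can be reproduced on a tiny synthetic frame (the values below are invented, not NYPD data):

```python
import pandas as pd

# Synthetic mini-collision table illustrating the by-year aggregation
# that both q and pandas compute above.
df_toy = pd.DataFrame({
    'DATE_TIME': pd.to_datetime(['2013-03-01', '2013-07-15', '2014-01-02']),
    'NUMBEROFPERSONSINJURED': [1, 2, 3],
    'NUMBEROFPERSONSKILLED': [0, 1, 0],
})
grp = (df_toy[['NUMBEROFPERSONSINJURED', 'NUMBEROFPERSONSKILLED']]
       .groupby(df_toy.DATE_TIME.dt.year).sum())
print(grp)  # 2013: injured 3, killed 1; 2014: injured 3, killed 0
```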