Libraries

sqlglot

sqlglot is a SQL parser, transpiler, and optimizer engine. I was inspired by the library's approach to managing SQL gracefully (each of these is sketched after the snippet below):

* SQLGlot can rewrite queries into an "optimized" form.
* SQLGlot can calculate the difference between two expressions and output the changes as a sequence of actions needed to transform a source expression into a target one.
* SQLGlot can interpret SQL queries where the tables are represented as Python dictionaries.

python
sql = """WITH baz AS (SELECT a, c FROM foo WHERE a = 1) SELECT f.a, b.b, baz.c, CAST("b"."a" AS REAL) d FROM foo f JOIN bar b ON f.a = b.a LEFT JOIN baz ON f.a = baz.a"""

print(sqlglot.transpile(sql, write="spark", identify=True, pretty=True)[0])
# ...
for column in parse_one("SELECT a, b + 1 AS c FROM d").find_all(exp.Column):
    print(column.alias_or_name)
# ...
where = condition("x=1").and_("y=1")
select("*").from_("y").where(where).sql()
sql = """WITH baz AS (SELECT a, c FROM foo WHERE a = 1) SELECT f.a, b.b, baz.c, CAST("b"."a" AS REAL) d FROM foo f JOIN bar b ON f.a = b.a LEFT JOIN baz ON f.a = baz.a"""

print(sqlglot.transpile(sql, write="spark", identify=True, pretty=True)[0])
# ...
for column in parse_one("SELECT a, b + 1 AS c FROM d").find_all(exp.Column):
    print(column.alias_or_name)
# ...
where = condition("x=1").and_("y=1")
select("*").from_("y").where(where).sql()
pandera

pandera validates DataFrame objects, which makes it ideal for managing data pipeline schemas and the underlying data. As an aside, pandera reinforced the idea of hypothesis testing (see the sketch after the snippet below).

python
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a series as input and
        # outputs a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})
validated_df = schema(df)  # raises a SchemaError if df violates the schema, otherwise returns the validated DataFrame
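
On the hypothesis-testing aside: pandera can turn a schema into a data-generation strategy. A minimal sketch, assuming the optional pandera[strategies] extra (which pulls in hypothesis) is installed; the built-in checks are kept because custom lambda checks have no auto-generated strategy.

python
import pandera as pa

simple_schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
})

# generate a synthetic DataFrame that satisfies the schema -- handy as
# property-based test input for pipeline code
print(simple_schema.example(size=3))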
pypika

pypika's focus is on assembling SQL. The library makes clever use of the builder pattern and has a nicely constructed API.

python
q = Query.from_('customers').select('id', 'fname', 'lname', 'phone')
q.get_sql()
customers = Table('customers')
q = Query.from_(customers).select(
    customers.id, customers.fname
).where(
    customers.age[18:65] & customers.status.isin(['new', 'active'])
)
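
For reference, the slice on customers.age compiles to a BETWEEN clause and isin() to an IN list; the output below is paraphrased, and exact identifier quoting may vary by pypika version.

python
print(q.get_sql())
# e.g. SELECT "id","fname" FROM "customers"
#      WHERE "age" BETWEEN 18 AND 65 AND "status" IN ('new','active')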
hamilton

hamilton is a general-purpose micro-orchestration framework for creating dataflows from Python functions. Imagine dbt, but for DataFrame construction. The developer writes functions and their relationships, and Hamilton stitches together the execution DAG (a hypothetical transform module is sketched after the driver snippet below). This is a very impressive API.

python
import importlib

from hamilton import driver

initial_columns = {...}
module = importlib.import_module(module_name)  # <-- import functions
dr = driver.Driver(initial_columns, module)  # <-- build dag
df = dr.execute(output_columns)  # <-- execute
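
To make the "functions and relationships" part concrete: the module passed to the Driver contains plain Python functions, where each function name is an output and each parameter name declares a dependency on another function or on the initial inputs. A hypothetical transform module might look like this (all names are illustrative):

python
# my_transforms.py -- hypothetical module of Hamilton transform functions
import pandas as pd

def avg_spend(spend: pd.Series) -> pd.Series:
    # 'spend' is resolved from initial_columns or from another function named spend
    return spend.rolling(3).mean()

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    # derived column that depends on two upstream columns
    return spend / signups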
chromadb

Chroma is an in-memory vector database, built on top of great open-source tokenization and embedding algorithms.

python
import chromadb
client = chromadb.Client()
collection = client.create_collection("my-collection")
# A lot of magic here --> tokenization, embedding
collection.add(
    documents=["This is a document about cat", "This is a document about car", "This is a document about bike"],
    metadatas=[{"category": "animal"}, {"category": "vehicle"}, {"category": "vehicle"}],
    ids=["id1", "id2","id3"]
)
results = collection.query(query_texts=["vehicle"], n_results=1)
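
The metadatas are not just decoration: query() also accepts a where filter. A minimal sketch, reusing the collection above (the query text is illustrative):

python
# restrict the nearest-neighbour search to documents tagged as vehicles
results = collection.query(
    query_texts=["something with wheels"],
    n_results=1,
    where={"category": "vehicle"},
)
print(results["documents"])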
duckdb

duckdb is an in-memory analytical database. One use case suggested by a colleague: it reads CSV files very fast.

python
import duckdb
duckdb.sql('SELECT * FROM "example.csv"')
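
The relation returned by duckdb.sql() can be handed straight back to pandas, and an in-scope DataFrame can be queried by its variable name; a minimal sketch (the file name is a placeholder):

python
import duckdb
import pandas as pd

# CSV -> DataFrame via duckdb's fast CSV reader
df = duckdb.sql('SELECT * FROM "example.csv"').df()

# query an in-scope pandas DataFrame by its variable name
print(duckdb.sql("SELECT COUNT(*) AS n FROM df").df())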
ibis-framework

ibis-framework is write-once: the same expression executes as SQL on many analytics engines. One use case: develop against pandas and scale to BigQuery. See Ibis Docs - Tutorial for SQL.
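
A minimal sketch of the write-once idea, assuming a recent Ibis version (memtable and to_sql are relatively new additions); the columns and the BigQuery dialect target are illustrative:

python
import ibis
import pandas as pd

df = pd.DataFrame({"fname": ["a", "b", "a"], "amount": [1.0, 2.0, 3.0]})

# wrap an in-memory DataFrame as an Ibis table expression
t = ibis.memtable(df)

# one expression, many backends: compile the same logic to BigQuery SQL
expr = t.group_by("fname").aggregate(total=t.amount.sum())
print(ibis.to_sql(expr, dialect="bigquery"))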