How we Wrangled Bind Variables From Snowflake to DuckDB

How we Wrangled Bind Variables From Snowflake to DuckDB

Greybeam routes analytical queries across multiple query engines, executing each on whichever engine is best suited for that workload. Of the many things Greybeam does to make this work, translating from Snowflake to DuckDB dialect is one of them.

We didn't start from scratch of course. As they say, we're standing on the shoulders of giants, and rely on and extend the excellent work done by the folks behind SQLGlot.

SQLGlot

At its core, SQLGlot is a SQL parser and transpiler. You give it SQL in one dialect, it gives you back SQL in another: Snowflake in, DuckDB out.

What makes this work is the Abstract Syntax Tree (AST). If you haven't worked with one before, think of an AST as the logical skeleton of a query: the SELECT, the columns, the WHERE predicates, the JOIN conditions, etc., all arranged as a tree of nodes.

from sqlglot import parse_one

parse_one("SELECT id, name FROM customers", read="snowflake")

# The AST representation of this query is below
Select(
  expressions=[
    Column(
      this=Identifier(this=id, quoted=False)),
    Column(
      this=Identifier(this=name, quoted=False))],
  from_=From(
    this=Table(
      this=Identifier(this=customers, quoted=False))))

When customers need to use bind variables like ? or :name, things start to get hairy.

A quick refresher on bind variables

Bind variables (bind vars) let you parameterize a query, that is, they are placeholders (e.g. :name) used in SQL to replace literal values. Bind vars enhance security by preventing SQL injection and improve performance by enabling the database to reuse execution plans, reducing parsing overhead.

You can have named bind vars (denoted by :name, SELECT foo FROM bar WHERE name = :name) and anonymous bind vars (denoted by ?, SELECT foo FROM bar WHERE name = ?). Named bind vars are simple! We just pass them to DuckDB's native support, and everyone is happy.

import duckdb

con = duckdb.connect(":memory:")

con.query("SELECT $name", params={"name": 1})

┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
└───────┘

Using bind vars in DuckDB

The issue with anonymous bind vars

Anonymous bind vars are a different story. Snowflake binds ? placeholders by lexical position, i.e., the first ? in the source string maps to the first parameter, the second ? to the second, and so on. That's easy to honor as long as you still know where each ? sat in the original query.

The AST doesn't.

SQLGlot's AST does not retain Span information, each node knows what it is (a Where, a Column, a BindVar), but not what line or column it came from. In fact, Parameter nodes are reordered seemingly arbitrarily. For example, a CTE node places the WHERE clause below the SELECT, which is opposite of the lexical ordering. Remembering all of these rules and preserving this part of the AST through the SQLGlot optimization pipeline is simply infeasible.

from sqlglot import exp

"""
WITH BAR AS (SELECT BAR FROM BAZ WHERE id = ?)
SELECT FOO FROM BAR WHERE id <> ?
"""

# Inverted placement of the two bindvar placeholders!

exp.Select(
    this='FOO FROM BAR WHERE id <> ?', 
    _with=exp.CTE('BAR', exp.Select('BAR FROM BAZ WHERE id = ?'))
)

Parsed CTE node inverts lexical order

All this to say that walking the AST to collect ? placeholders gives you the right placeholders, just sometimes in the wrong order. And the right order depends on the specific shape of every SQL construct SQLGlot parses.

What do you do if your data transformer has normalized away the information you needed?

You start searching earlier in your pipeline, of course! But earlier for us was just a string blob, which is a burden to parse data from correctly (--? or '?', both are anonymous bind vars if you have a crummy parser!).

select a from -- is this right?
b where bar = ? and baz = '?';

3 question marks, but only 1 is a bind var!

The Tokenizer

Soon enough, we remembered: The AST isn't the first thing SQLGlot generates from our string! There is a tokenizer in front; one that parses enough to treat ? specially:

from sqlglot.dialects import snowflake

sf_dialect = snowflake.Dialect()

tokens = sf_dialect.tokenize("SELECT PARSE_JSON('[]'), 'bar' = ?;")

# token types from this pass
assert tokens == [
    SELECT, VAR(PARSE_JSON), L_PAREN, STRING('[]'), R_PAREN, 
    COMMA, STRING('bar'), EQ, PLACEHOLDER, SEMICOLON,
]

# PLACEHOLDER is present! this maps 1:1 to anonymous bind var points!

ast = sf_dialect.parse(tokens)

return ast

And since the token stream is linear with regard to the input string, our ordering issue goes away! The magic trick here was to detect ? at the token stream level and rewrite to a named bind var then and there. Additionally, we apply a prefix (like gbeam_bindvar_0) to these bind vars so that they don't collide with any customer-defined bind vars.

This means we support mixing named and anonymous bindings just like Snowflake! All while routing to DuckDB instead!

The customer running SELECT * FROM orders WHERE id = ? never sees any of this, unless they so choose. The customer keeps writing Snowflake, the query still returns the right rows, and the Snowflake bill aggressively goes down.

Bind variables are one Snowflake-ism on a long list that we've had to walk through. Subscribe if you want to learn more!

Alex H

Alex H

SWE, developing HA systems in Python and Rust.
California