This post shows how to keep a Delta Lake table synchronized with a PostgreSQL table using only Spark SQL and a minimal shell script.
We’ll demonstrate:
- A one-time full load from PostgreSQL to Delta Lake
- A reusable merge-based sync script for applying changes (inserts, updates, deletes)
## Prerequisites

- Apache Spark 3.5.6 installed (setup guide)
- PostgreSQL accessible from the Spark node
- All commands executed as the non-root `spark` user
## First-Time Full Load

To create the Delta Lake table from the PostgreSQL source, run the following SQL using `spark-sql`:
```sql
CREATE OR REPLACE TEMP VIEW temp_device
USING JDBC
OPTIONS (
  url 'jdbc:postgresql://thingsboard.maksonlee.com:5432/thingsboard',
  dbtable 'device',
  user 'postgres',
  password 'password',
  driver 'org.postgresql.Driver'
);

CREATE TABLE thingsboard.device
USING DELTA
AS SELECT * FROM temp_device;
```
This creates a new Delta table `thingsboard.device` populated with all data from PostgreSQL.
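After the initial load, it's worth sanity-checking that the copy is complete. A minimal check is to compare row counts in the same `spark-sql` session (the temp view is session-scoped, so this must run before the session ends):

```sql
-- Both counts should be identical immediately after the full load
SELECT COUNT(*) FROM temp_device;
SELECT COUNT(*) FROM thingsboard.device;
```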
## Sync Changes Using a Shell Script

Once the Delta table is created, use the following shell script to keep it up to date. The script loads all data from PostgreSQL into a temporary view, then uses `MERGE` to sync changes.

Script: `sync_tb_pg.sh`
```bash
#!/bin/bash

# PostgreSQL connection configuration
PG_HOST="thingsboard.maksonlee.com"
PG_PORT="5432"
PG_DB="thingsboard"
PG_USER="postgres"
PG_PASSWORD="password"
PG_TABLE="device"
DELTA_DB="thingsboard"

# Run Spark SQL using an inline heredoc
spark-sql <<EOF
CREATE OR REPLACE TEMP VIEW temp_${PG_TABLE}
USING JDBC
OPTIONS (
  url 'jdbc:postgresql://${PG_HOST}:${PG_PORT}/${PG_DB}',
  dbtable '${PG_TABLE}',
  user '${PG_USER}',
  password '${PG_PASSWORD}',
  driver 'org.postgresql.Driver'
);

MERGE INTO ${DELTA_DB}.${PG_TABLE} AS target
USING temp_${PG_TABLE} AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE;
EOF
```
## How to Use

```bash
chmod +x sync_tb_pg.sh
./sync_tb_pg.sh
```
You can schedule this script with `cron` or a `systemd` timer for regular updates.
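For example, a crontab entry for the `spark` user that runs the sync every 15 minutes might look like this (the script path and log file location are assumptions; adjust them for your environment):

```
*/15 * * * * /home/spark/sync_tb_pg.sh >> /home/spark/sync_tb_pg.log 2>&1
```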
## How It Works

Although this script reloads the entire PostgreSQL table, it only applies the delta changes to the Delta Lake target using Spark SQL's `MERGE`. This includes:
- Inserting new rows
- Updating existing rows
- Deleting rows removed from the source
It’s a merge-based full reload, not a true incremental sync, but it’s simple and effective for many workloads.
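To make the three `MERGE` branches concrete, here is a small shell illustration (not part of the sync itself) that classifies ids from two sorted lists the same way the statement does, using `comm` and hypothetical sample ids:

```shell
#!/bin/bash
# Classify ids the way the three MERGE branches treat them.
target_ids=$'1\n2'   # ids currently in the Delta table
source_ids=$'1\n3'   # ids in the fresh PostgreSQL snapshot

# WHEN MATCHED                -> ids present in both lists (common column of comm)
matched=$(comm -12 <(echo "$target_ids") <(echo "$source_ids"))
# WHEN NOT MATCHED BY TARGET  -> ids only in the source (rows to insert)
inserts=$(comm -13 <(echo "$target_ids") <(echo "$source_ids"))
# WHEN NOT MATCHED BY SOURCE  -> ids only in the target (rows to delete)
deletes=$(comm -23 <(echo "$target_ids") <(echo "$source_ids"))

echo "update: $matched"   # update: 1
echo "insert: $inserts"   # insert: 3
echo "delete: $deletes"   # delete: 2
```

Every matched row is rewritten by `UPDATE SET *`, even if it is unchanged, which is why this approach rewrites more data than a true change-data-capture pipeline would.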