This post shows how to keep a Delta Lake table synchronized with a PostgreSQL table using only Spark SQL and a minimal shell script.
We’ll demonstrate:
- A one-time full load from PostgreSQL to Delta Lake
- A reusable merge-based sync script for applying changes (inserts, updates, deletes)
## Prerequisites

- Apache Spark 3.5.6 installed (setup guide)
- PostgreSQL accessible from the Spark node
- All commands executed as the non-root `spark` user
## First-Time Full Load

To create the Delta Lake table from the PostgreSQL source, run the following SQL using `spark-sql`:
```sql
CREATE OR REPLACE TEMP VIEW temp_device
USING JDBC
OPTIONS (
  url 'jdbc:postgresql://thingsboard.maksonlee.com:5432/thingsboard',
  dbtable 'device',
  user 'postgres',
  password 'password',
  driver 'org.postgresql.Driver'
);

CREATE TABLE thingsboard.device
USING DELTA
AS SELECT * FROM temp_device;
```
This creates a new Delta table `thingsboard.device` populated with all data from PostgreSQL.
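After the initial load, it's worth sanity-checking that the copy is complete. A minimal check is to compare row counts in the same `spark-sql` session (the temp view is session-scoped, so this must run before the session ends):

```sql
-- Both counts should be identical immediately after the full load
SELECT COUNT(*) FROM temp_device;
SELECT COUNT(*) FROM thingsboard.device;
```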
## Sync Changes Using a Shell Script

Once the Delta table is created, use the following shell script to keep it up to date. The script loads all data from PostgreSQL into a temporary view, then uses `MERGE` to sync changes.

Script: `sync_tb_pg.sh`
```bash
#!/bin/bash

# PostgreSQL connection configuration
PG_HOST="thingsboard.maksonlee.com"
PG_PORT="5432"
PG_DB="thingsboard"
PG_USER="postgres"
PG_PASSWORD="password"
PG_TABLE="device"
DELTA_DB="thingsboard"

# Run Spark SQL using an inline heredoc
spark-sql <<EOF
CREATE OR REPLACE TEMP VIEW temp_${PG_TABLE}
USING JDBC
OPTIONS (
  url 'jdbc:postgresql://${PG_HOST}:${PG_PORT}/${PG_DB}',
  dbtable '${PG_TABLE}',
  user '${PG_USER}',
  password '${PG_PASSWORD}',
  driver 'org.postgresql.Driver'
);

MERGE INTO ${DELTA_DB}.${PG_TABLE} AS target
USING temp_${PG_TABLE} AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED BY TARGET THEN INSERT *
WHEN NOT MATCHED BY SOURCE THEN DELETE;
EOF
```
## How to Use

```bash
chmod +x sync_tb_pg.sh
./sync_tb_pg.sh
```
You can schedule this script with `cron` or a `systemd` timer for regular updates.
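For example, a crontab entry for the `spark` user that runs the sync every 15 minutes might look like this (the script path and log file location are assumptions; adjust them for your environment):

```
*/15 * * * * /home/spark/sync_tb_pg.sh >> /home/spark/sync_tb_pg.log 2>&1
```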
## How It Works

Although this script reloads the entire PostgreSQL table, it only applies the delta changes to the Delta Lake target using Spark SQL's `MERGE`. This includes:
- Inserting new rows
- Updating existing rows
- Deleting rows removed from the source
It’s a merge-based full reload, not a true incremental sync, but it’s simple and effective for many workloads.
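To make the three `MERGE` branches concrete, here is a small shell illustration (not part of the sync itself) that classifies ids from two sorted lists the same way the statement does, using `comm` and hypothetical sample ids:

```shell
#!/bin/bash
# Classify ids the way the three MERGE branches treat them.
target_ids=$'1\n2'   # ids currently in the Delta table
source_ids=$'1\n3'   # ids in the fresh PostgreSQL snapshot

# WHEN MATCHED                -> ids present in both lists (common column of comm)
matched=$(comm -12 <(echo "$target_ids") <(echo "$source_ids"))
# WHEN NOT MATCHED BY TARGET  -> ids only in the source (rows to insert)
inserts=$(comm -13 <(echo "$target_ids") <(echo "$source_ids"))
# WHEN NOT MATCHED BY SOURCE  -> ids only in the target (rows to delete)
deletes=$(comm -23 <(echo "$target_ids") <(echo "$source_ids"))

echo "update: $matched"   # update: 1
echo "insert: $inserts"   # insert: 3
echo "delete: $deletes"   # delete: 2
```

Every matched row is rewritten by `UPDATE SET *`, even if it is unchanged, which is why this approach rewrites more data than a true change-data-capture pipeline would.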