Sync ThingsBoard Time-Series Data from Cassandra to Delta Lake Using Spark SQL

This guide shows how to mirror time-series data stored in ThingsBoard’s Cassandra database into a Delta Lake table using spark-sql. It wires the Spark Cassandra Connector and the Delta Lake extensions into Spark SQL, enabling scheduled, incremental syncing with plain SQL: no ETL framework or custom application code is required, just a small shell script driven by cron.


Prerequisites

  • Apache Spark 3.5.6 installed (setup guide)
  • ThingsBoard is configured to store time-series data in Cassandra
  • Cassandra is accessible from the Spark node:
    The Spark host can connect to Cassandra on port 9042
    Authentication is disabled or properly configured in Spark
  • All commands are executed as the non-root spark user
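
Before continuing, you can verify the Cassandra port is reachable from the Spark host (assuming nc is installed; substitute your own Cassandra hostname):

nc -zv thingsboard.maksonlee.com 9042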

  1. Configure Spark Defaults

Edit the Spark configuration:

vi /opt/spark/conf/spark-defaults.conf

Add the following:

spark.jars.packages              com.github.jnr:jnr-posix:3.1.20,com.datastax.spark:spark-cassandra-connector_2.12:3.5.1,io.delta:delta-spark_2.12:3.3.2,io.delta:delta-storage:3.3.2
spark.sql.extensions             com.datastax.spark.connector.CassandraSparkExtensions,io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalog.cass_tb        com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.catalog.cass_tb.spark.cassandra.connection.host thingsboard.maksonlee.com

This enables:

  • Cassandra and Delta Lake support in Spark SQL
  • Catalog access: cass_tb for Cassandra keyspaces, spark_catalog for Delta tables
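
If Cassandra authentication is enabled, credentials can be supplied through the same catalog prefix using the Spark Cassandra Connector’s standard auth options (the values below are placeholders):

spark.sql.catalog.cass_tb.spark.cassandra.auth.username  cassandra
spark.sql.catalog.cass_tb.spark.cassandra.auth.password  <your-password>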

  2. Launch Spark SQL

Run the SQL shell directly:

spark-sql

No additional flags are needed because everything is defined in spark-defaults.conf.
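
The first launch resolves and downloads the packages listed in spark-defaults.conf, so it may take a minute. Once the prompt appears, listing the Cassandra catalog’s namespaces is a quick sanity check; the thingsboard keyspace should show up:

SHOW NAMESPACES IN cass_tb;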


  3. Create the Delta Lake Table

Assuming you want to mirror the table ts_kv_cf from ThingsBoard’s Cassandra schema, first create a matching database in spark_catalog (it does not exist by default), then seed the Delta table with a full copy:

CREATE DATABASE IF NOT EXISTS thingsboard;

CREATE TABLE IF NOT EXISTS thingsboard.ts_kv_cf
USING DELTA
AS
SELECT * FROM cass_tb.thingsboard.ts_kv_cf;

This creates a Delta table initialized with historical data.
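
A row-count comparison is a cheap sanity check on the initial copy. Counting the Cassandra side performs a full scan, so it can be slow on large tables, and the two numbers may differ slightly if telemetry keeps arriving during the copy:

SELECT COUNT(*) FROM cass_tb.thingsboard.ts_kv_cf;
SELECT COUNT(*) FROM thingsboard.ts_kv_cf;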


  4. Perform Incremental Upserts

Use a MERGE query that reflects Cassandra’s primary key:
((entity_type, entity_id, key, partition), ts)

MERGE INTO thingsboard.ts_kv_cf AS target
USING (
  SELECT * FROM cass_tb.thingsboard.ts_kv_cf
  WHERE ts > (SELECT COALESCE(MAX(ts), 0) FROM thingsboard.ts_kv_cf) - 3600000
) AS source
ON  target.entity_type = source.entity_type
 AND target.entity_id   = source.entity_id
 AND target.key         = source.key
 AND target.partition   = source.partition
 AND target.ts          = source.ts
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Because ts is stored as epoch milliseconds, subtracting 3600000 (one hour) re-reads the trailing hour of source data on every run. The MERGE then overwrites rows that already exist and inserts new ones, so late-arriving or re-delivered records never create duplicates.
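
To see where the incremental watermark currently sits in human-readable form, the built-in timestamp_millis function (Spark 3.1+) converts the epoch-millisecond ts directly:

SELECT timestamp_millis(MAX(ts)) AS latest_synced
FROM thingsboard.ts_kv_cf;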


  5. Create a Sync Script (as spark user)

Create the directory and script:

mkdir -p /opt/spark/scripts
vi /opt/spark/scripts/sync_ts_kv_cf.sh

Paste:

#!/bin/bash
# Incrementally merge the last hour of ThingsBoard telemetry from Cassandra into the Delta mirror
/opt/spark/bin/spark-sql <<EOF
MERGE INTO thingsboard.ts_kv_cf AS target
USING (
  SELECT * FROM cass_tb.thingsboard.ts_kv_cf
  WHERE ts > (SELECT COALESCE(MAX(ts), 0) FROM thingsboard.ts_kv_cf) - 3600000
) AS source
ON  target.entity_type = source.entity_type
 AND target.entity_id   = source.entity_id
 AND target.key         = source.key
 AND target.partition   = source.partition
 AND target.ts          = source.ts
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
EOF

Make it executable:

chmod +x /opt/spark/scripts/sync_ts_kv_cf.sh
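
Run the script once by hand before scheduling it, so any connectivity or dependency problem surfaces interactively:

/opt/spark/scripts/sync_ts_kv_cf.sh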

  6. Schedule Automatic Sync with Cron

Edit the spark user’s crontab:

crontab -e

Add this line to sync every 15 minutes:

*/15 * * * * /opt/spark/scripts/sync_ts_kv_cf.sh

This keeps the Delta Lake table continuously updated with recent telemetry.
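
Two optional hardening tweaks can go on the same cron line: flock -n skips a run while the previous one is still going (assuming the util-linux flock utility is installed), and redirecting output keeps a log for troubleshooting. The log path below is only an example; create the directory and make it writable by the spark user first:

*/15 * * * * flock -n /tmp/sync_ts_kv_cf.lock /opt/spark/scripts/sync_ts_kv_cf.sh >> /opt/spark/logs/sync_ts_kv_cf.log 2>&1

Frequent merges also accumulate many small Parquet files. Delta Lake’s OPTIMIZE command (available in open-source Delta 3.x) compacts them and can be run occasionally from spark-sql:

OPTIMIZE thingsboard.ts_kv_cf;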


  7. Query the Delta Table

Check schema:

DESCRIBE TABLE thingsboard.ts_kv_cf;
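
Because the mirror is a Delta table, you can also inspect its commit log with Delta Lake’s DESCRIBE HISTORY command; each cron run that changed data appears as a MERGE operation:

DESCRIBE HISTORY thingsboard.ts_kv_cf;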

Query average temperature values:

SELECT entity_id, key, AVG(dbl_v) AS avg_val
FROM thingsboard.ts_kv_cf
WHERE key = 'temperature' AND dbl_v IS NOT NULL
GROUP BY entity_id, key;
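
Since ts is epoch milliseconds, telemetry can also be rolled up by calendar day. A sketch, assuming as above that temperature readings land in dbl_v:

SELECT entity_id,
       to_date(timestamp_millis(ts)) AS day,
       AVG(dbl_v) AS avg_val
FROM thingsboard.ts_kv_cf
WHERE key = 'temperature' AND dbl_v IS NOT NULL
GROUP BY entity_id, to_date(timestamp_millis(ts));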

Summary

Component                       Purpose
cass_tb.thingsboard.ts_kv_cf    Source: raw time-series data in Cassandra
thingsboard.ts_kv_cf            Destination: Delta Lake mirror table
1-hour buffer                   Protects against delayed/misordered records
SQL shell + cron                Minimal, SQL-only automation pipeline

This gives you a reliable ETL flow from ThingsBoard Cassandra to Delta Lake using Spark SQL, ideal for historical analysis, BI tools, backups, or long-term archival.
