Sync ThingsBoard Time-Series Data from Cassandra to Delta Lake Using Spark SQL

This guide shows how to mirror time-series data stored in ThingsBoard’s Cassandra database into a Delta Lake table using spark-sql. It wires the Spark Cassandra Connector and the Delta Lake extensions into Spark SQL, enabling scheduled, incremental syncing with plain SQL: no ETL framework or custom application code is required, just a small shell script driven by cron.


Prerequisites

  • Apache Spark 3.5.6 installed (setup guide)
  • ThingsBoard is configured to store time-series data in Cassandra
  • Cassandra is accessible from the Spark node:
    The Spark host can connect to Cassandra on port 9042
    Authentication is disabled or properly configured in Spark
  • All commands are executed as the non-root spark user
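
Before continuing, you can verify the Cassandra port is reachable from the Spark host (assuming nc is installed; substitute your own Cassandra hostname):

nc -zv thingsboard.maksonlee.com 9042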

  1. Configure Spark Defaults

Edit the Spark configuration:

vi /opt/spark/conf/spark-defaults.conf

Add the following:

spark.jars.packages              com.github.jnr:jnr-posix:3.1.20,com.datastax.spark:spark-cassandra-connector_2.12:3.5.1,io.delta:delta-spark_2.12:3.3.2,io.delta:delta-storage:3.3.2
spark.sql.extensions             com.datastax.spark.connector.CassandraSparkExtensions,io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalog.cass_tb        com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.catalog.cass_tb.spark.cassandra.connection.host thingsboard.maksonlee.com

This enables:

  • Cassandra and Delta Lake support in Spark SQL
  • Catalog access: cass_tb for Cassandra keyspaces, spark_catalog for Delta tables
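
If Cassandra authentication is enabled, credentials can be supplied through the same catalog prefix using the Spark Cassandra Connector’s standard auth options (the values below are placeholders):

spark.sql.catalog.cass_tb.spark.cassandra.auth.username  cassandra
spark.sql.catalog.cass_tb.spark.cassandra.auth.password  <your-password>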

  2. Launch Spark SQL

Run the SQL shell directly:

spark-sql

No additional flags are needed because everything is defined in spark-defaults.conf.
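
The first launch resolves and downloads the packages listed in spark-defaults.conf, so it may take a minute. Once the prompt appears, listing the Cassandra catalog’s namespaces is a quick sanity check; the thingsboard keyspace should show up:

SHOW NAMESPACES IN cass_tb;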


  3. Create the Delta Lake Table

Assuming you want to mirror the table ts_kv_cf from ThingsBoard’s Cassandra schema, first create a matching database in spark_catalog (it does not exist by default), then seed the Delta table with a full copy:

CREATE DATABASE IF NOT EXISTS thingsboard;

CREATE TABLE IF NOT EXISTS thingsboard.ts_kv_cf
USING DELTA
AS
SELECT * FROM cass_tb.thingsboard.ts_kv_cf;

This creates a Delta table initialized with historical data.
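
A row-count comparison is a cheap sanity check on the initial copy. Counting the Cassandra side performs a full scan, so it can be slow on large tables, and the two numbers may differ slightly if telemetry keeps arriving during the copy:

SELECT COUNT(*) FROM cass_tb.thingsboard.ts_kv_cf;
SELECT COUNT(*) FROM thingsboard.ts_kv_cf;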


  4. Perform Incremental Upserts

Use a MERGE query that reflects Cassandra’s primary key:
((entity_type, entity_id, key, partition), ts)

MERGE INTO thingsboard.ts_kv_cf AS target
USING (
  SELECT * FROM cass_tb.thingsboard.ts_kv_cf
  WHERE ts > (SELECT COALESCE(MAX(ts), 0) FROM thingsboard.ts_kv_cf) - 3600000
) AS source
ON  target.entity_type = source.entity_type
 AND target.entity_id   = source.entity_id
 AND target.key         = source.key
 AND target.partition   = source.partition
 AND target.ts          = source.ts
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Because ts is stored as epoch milliseconds, subtracting 3600000 (one hour) re-reads the trailing hour of source data on every run. The MERGE then overwrites rows that already exist and inserts new ones, so late-arriving or re-delivered records never create duplicates.
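
To see where the incremental watermark currently sits in human-readable form, the built-in timestamp_millis function (Spark 3.1+) converts the epoch-millisecond ts directly:

SELECT timestamp_millis(MAX(ts)) AS latest_synced
FROM thingsboard.ts_kv_cf;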


  5. Create a Sync Script (as spark user)

Create the directory and script:

mkdir -p /opt/spark/scripts
vi /opt/spark/scripts/sync_ts_kv_cf.sh

Paste:

#!/bin/bash
# Incrementally merge the last hour of ThingsBoard telemetry from Cassandra into the Delta mirror
/opt/spark/bin/spark-sql <<EOF
MERGE INTO thingsboard.ts_kv_cf AS target
USING (
  SELECT * FROM cass_tb.thingsboard.ts_kv_cf
  WHERE ts > (SELECT COALESCE(MAX(ts), 0) FROM thingsboard.ts_kv_cf) - 3600000
) AS source
ON  target.entity_type = source.entity_type
 AND target.entity_id   = source.entity_id
 AND target.key         = source.key
 AND target.partition   = source.partition
 AND target.ts          = source.ts
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
EOF

Make it executable:

chmod +x /opt/spark/scripts/sync_ts_kv_cf.sh
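
Run the script once by hand before scheduling it, so any connectivity or dependency problem surfaces interactively:

/opt/spark/scripts/sync_ts_kv_cf.sh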

  6. Schedule Automatic Sync with Cron

Edit the spark user’s crontab:

crontab -e

Add this line to sync every 15 minutes:

*/15 * * * * /opt/spark/scripts/sync_ts_kv_cf.sh

This keeps the Delta Lake table continuously updated with recent telemetry.
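
Two optional hardening tweaks can go on the same cron line: flock -n skips a run while the previous one is still going (assuming the util-linux flock utility is installed), and redirecting output keeps a log for troubleshooting. The log path below is only an example; create the directory and make it writable by the spark user first:

*/15 * * * * flock -n /tmp/sync_ts_kv_cf.lock /opt/spark/scripts/sync_ts_kv_cf.sh >> /opt/spark/logs/sync_ts_kv_cf.log 2>&1

Frequent merges also accumulate many small Parquet files. Delta Lake’s OPTIMIZE command (available in open-source Delta 3.x) compacts them and can be run occasionally from spark-sql:

OPTIMIZE thingsboard.ts_kv_cf;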


  7. Query the Delta Table

Check schema:

DESCRIBE TABLE thingsboard.ts_kv_cf;
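
Because the mirror is a Delta table, you can also inspect its commit log with Delta Lake’s DESCRIBE HISTORY command; each cron run that changed data appears as a MERGE operation:

DESCRIBE HISTORY thingsboard.ts_kv_cf;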

Query average temperature values:

SELECT entity_id, key, AVG(dbl_v) AS avg_val
FROM thingsboard.ts_kv_cf
WHERE key = 'temperature' AND dbl_v IS NOT NULL
GROUP BY entity_id, key;
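
Since ts is epoch milliseconds, telemetry can also be rolled up by calendar day. A sketch, assuming as above that temperature readings land in dbl_v:

SELECT entity_id,
       to_date(timestamp_millis(ts)) AS day,
       AVG(dbl_v) AS avg_val
FROM thingsboard.ts_kv_cf
WHERE key = 'temperature' AND dbl_v IS NOT NULL
GROUP BY entity_id, to_date(timestamp_millis(ts));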

Summary

Component                       Purpose
cass_tb.thingsboard.ts_kv_cf    Source: raw time-series data in Cassandra
thingsboard.ts_kv_cf            Destination: Delta Lake mirror table
1-hour buffer                   Protects against delayed/misordered records
SQL shell + cron                Minimal, SQL-only automation pipeline

This gives you a reliable ETL flow from ThingsBoard Cassandra to Delta Lake using Spark SQL, ideal for historical analysis, BI tools, backups, or long-term archival.
