This guide shows how to mirror time-series data stored in ThingsBoard’s Cassandra database into a Delta Lake table using spark-sql
. It leverages Spark’s built-in support for both Cassandra and Delta Lake and enables scheduled, incremental syncing using standard SQL, with no external jobs or scripts required.
Prerequisites
- Apache Spark 3.5.6 installed (setup guide)
- ThingsBoard is configured to store time-series data in Cassandra
- Cassandra is accessible from the Spark node:
The Spark host can connect to Cassandra on port9042
Authentication is disabled or properly configured in Spark - All commands are executed as the non-root
spark
user
- Configure Spark Defaults
Edit the Spark configuration:
vi /opt/spark/conf/spark-defaults.conf
Add the following:
spark.jars.packages com.github.jnr:jnr-posix:3.1.20,com.datastax.spark:spark-cassandra-connector_2.12:3.5.1,io.delta:delta-spark_2.12:3.3.2,io.delta:delta-storage:3.3.2
spark.sql.extensions com.datastax.spark.connector.CassandraSparkExtensions,io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalog.cass_tb com.datastax.spark.connector.datasource.CassandraCatalog
spark.sql.catalog.cass_tb.spark.cassandra.connection.host thingsboard.maksonlee.com
This enables:
- Cassandra and Delta Lake support in Spark SQL
- Catalog access via
cass_tb
andspark_catalog
- Launch Spark SQL
Run the SQL shell directly:
spark-sql
No additional flags are needed because everything is defined in spark-defaults.conf
.
- Create the Delta Lake Table
Assuming you want to mirror the table ts_kv_cf
from ThingsBoard’s Cassandra schema:
CREATE TABLE IF NOT EXISTS thingsboard.ts_kv_cf
USING DELTA
AS
SELECT * FROM cass_tb.thingsboard.ts_kv_cf;
This creates a Delta table initialized with historical data.
- Perform Incremental Upserts
Use a MERGE
query that reflects Cassandra’s primary key:((entity_type, entity_id, key, partition), ts)
MERGE INTO thingsboard.ts_kv_cf AS target
USING (
SELECT * FROM cass_tb.thingsboard.ts_kv_cf
WHERE ts > (SELECT COALESCE(MAX(ts), 0) FROM thingsboard.ts_kv_cf) - 3600000
) AS source
ON target.entity_type = source.entity_type
AND target.entity_id = source.entity_id
AND target.key = source.key
AND target.partition = source.partition
AND target.ts = source.ts
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
This safely overwrites existing rows and appends new ones, avoiding duplicates.
- Create a Sync Script (as
spark
user)
Create the directory and script:
mkdir -p /opt/spark/scripts
vi /opt/spark/scripts/sync_ts_kv_cf.sh
Paste:
#!/bin/bash
/opt/spark/bin/spark-sql <<EOF
MERGE INTO thingsboard.ts_kv_cf AS target
USING (
SELECT * FROM cass_tb.thingsboard.ts_kv_cf
WHERE ts > (SELECT COALESCE(MAX(ts), 0) FROM thingsboard.ts_kv_cf) - 3600000
) AS source
ON target.entity_type = source.entity_type
AND target.entity_id = source.entity_id
AND target.key = source.key
AND target.partition = source.partition
AND target.ts = source.ts
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
EOF
Make it executable:
chmod +x /opt/spark/scripts/sync_ts_kv_cf.sh
- Schedule Automatic Sync with Cron
Edit the spark
user’s crontab:
crontab -e
Add this line to sync every 15 minutes:
*/15 * * * * /opt/spark/scripts/sync_ts_kv_cf.sh
This keeps the Delta Lake table continuously updated with recent telemetry.
- Query the Delta Table
Check schema:
DESCRIBE TABLE thingsboard.ts_kv_cf;
Query average temperature
values:
SELECT entity_id, key, AVG(dbl_v) AS avg_val
FROM thingsboard.ts_kv_cf
WHERE key = 'temperature' AND dbl_v IS NOT NULL
GROUP BY entity_id, key;
Summary
Component | Purpose |
---|---|
cass_tb.thingsboard.ts_kv_cf | Source: raw time-series data in Cassandra |
thingsboard.ts_kv_cf | Destination: Delta Lake mirror table |
1-hour buffer | Protects against delayed/misordered records |
SQL shell + cron | Minimal, SQL-only automation pipeline |
This gives you a reliable ETL flow from ThingsBoard Cassandra to Delta Lake using Spark SQL, ideal for historical analysis, BI tools, backups, or long-term archival.