Enable Spark Connect (systemd) + Set Hive Catalog on Spark 4.0

This post is a follow-up to: Upgrade Apache Spark 3.5.6 → 4.0 on Ubuntu 24.04 (Single Node, Classic Mode)

What you’ll do

  • Add a Spark Connect server as a systemd service (/opt/spark/sbin/start-connect-server.sh).
  • Set Hive as the SQL catalog implementation in spark-defaults.conf.
  • Connect from a separate Jupyter node using the Spark Connect client.

Prerequisites

  • Spark 4.0.0 installed at /opt/spark (Standalone Master/Worker already running).
  • Linux user spark owns /opt/spark.
  • We keep Classic tools (spark-sql, spark-submit) unchanged.

  1. Set Hive as the catalog implementation
sudo sed -i '/^spark\.sql\.catalogImplementation/d' /opt/spark/conf/spark-defaults.conf
echo 'spark.sql.catalogImplementation  hive' | sudo tee -a /opt/spark/conf/spark-defaults.conf
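
A quick sanity check, assuming the Classic spark-sql CLI is available on this node (it is, per the prerequisites): confirm the line landed in spark-defaults.conf and, optionally, that spark-sql now resolves databases from the Hive metastore:

grep catalogImplementation /opt/spark/conf/spark-defaults.conf
/opt/spark/bin/spark-sql -e "SHOW DATABASES;"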

  2. Create the systemd service
# /etc/systemd/system/spark-connect.service
[Unit]
Description=Apache Spark Connect Server
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
User=spark
Group=spark
EnvironmentFile=-/etc/default/spark-connect-server
WorkingDirectory=/opt/spark
ExecStart=/bin/bash -lc '/opt/spark/sbin/start-connect-server.sh'
ExecStop=/opt/spark/sbin/stop-connect-server.sh
Restart=on-failure
RestartSec=3
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
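
The EnvironmentFile line is optional (the leading - tells systemd to ignore it if the file is missing). If you want to override where the daemon writes its logs and PID file, a minimal sketch of /etc/default/spark-connect-server could look like this (paths are illustrative; both variables are read by Spark's daemon scripts):

# /etc/default/spark-connect-server (optional)
SPARK_LOG_DIR=/opt/spark/logs
SPARK_PID_DIR=/tmp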

Enable & start:

sudo systemctl daemon-reload
sudo systemctl enable --now spark-connect
systemctl --no-pager -l status spark-connect
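
If the service is healthy, the server listens on the default Spark Connect port 15002 and writes a daemon log under /opt/spark/logs (the exact file name includes the class and host name, hence the glob):

ss -ltnp | grep 15002
tail -n 50 /opt/spark/logs/*SparkConnectServer*.out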

  3. Connect from a separate Jupyter node
source jupyterlab-venv/bin/activate

# install client + kernel
pip install "pyspark[connect]==4.0.0" ipykernel
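
The notebook below assumes this venv is registered as its own Jupyter kernel; one way to register it (the kernel name and display name here are arbitrary choices):

python -m ipykernel install --user --name pyspark-connect --display-name "PySpark (Spark Connect)"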
    

Minimal test code (in a Jupyter notebook using the new kernel):

from pyspark.sql import SparkSession

# Spark Connect endpoint exposed by the systemd service
CONNECT_URL = "sc://spark.maksonlee.com:15002"
spark = SparkSession.builder.remote(CONNECT_URL).getOrCreate()

print("Spark version:", spark.version)

spark.sql("USE thingsboard")
spark.sql("SHOW TABLES").show(truncate=False)

# Silver tables
spark.sql("SELECT * FROM ts_kv_cf LIMIT 5").show(truncate=False)
spark.sql("SELECT * FROM ts_kv_partitions_cf LIMIT 5").show(truncate=False)
    
Output:

    Spark version: 4.0.0
    +-----------+-----------------------+-----------+
    |namespace  |tableName              |isTemporary|
    +-----------+-----------------------+-----------+
    |thingsboard|brz_ts_kv_cf           |false      |
    |thingsboard|brz_ts_kv_partitions_cf|false      |
    |thingsboard|device                 |false      |
    |thingsboard|device_profile         |false      |
    |thingsboard|temp_humidity_processed|false      |
    |thingsboard|ts_kv_cf               |false      |
    |thingsboard|ts_kv_partitions_cf    |false      |
    +-----------+-----------------------+-----------+
    
    +-----------+------------------------------------+-----------+-------------+-------------+-----+------+-----+------+------+
    |entity_type|entity_id                           |key        |partition    |ts           |str_v|long_v|dbl_v|bool_v|json_v|
    +-----------+------------------------------------+-----------+-------------+-------------+-----+------+-----+------+------+
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|humidity   |1754006400000|1755430255027|NULL |39    |NULL |NULL  |NULL  |
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|temperature|1754006400000|1755430255027|NULL |35    |NULL |NULL  |NULL  |
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|humidity   |1754006400000|1755430256022|NULL |39    |NULL |NULL  |NULL  |
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|temperature|1754006400000|1755430256022|NULL |35    |NULL |NULL  |NULL  |
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|humidity   |1754006400000|1755430257022|NULL |39    |NULL |NULL  |NULL  |
    +-----------+------------------------------------+-----------+-------------+-------------+-----+------+-----+------+------+
    
    +---------------+------------------------------------+--------------------------+-------------+
    |entity_type    |entity_id                           |key                       |partition    |
    +---------------+------------------------------------+--------------------------+-------------+
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|activeDevicesCount        |1754006400000|
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|activeDevicesCountHourly  |1754006400000|
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|inactiveDevicesCount      |1754006400000|
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|inactiveDevicesCountHourly|1754006400000|
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|ruleEngineExecutionCount  |1754006400000|
    +---------------+------------------------------------+--------------------------+-------------+
    

Done. Spark Connect runs under systemd, Spark SQL uses the Hive catalog, and you can connect from a separate Jupyter node via sc://spark.maksonlee.com:15002 while Classic workflows remain unchanged.
