This post is a follow-up to: Upgrade Apache Spark 3.5.6 → 4.0 on Ubuntu 24.04 (Single Node, Classic Mode)
What you’ll do
- Add a Spark Connect server as a systemd service (/opt/spark/sbin/start-connect-server.sh).
- Set Hive as the SQL catalog implementation in spark-defaults.conf.
- Connect from a separate Jupyter node using the Spark Connect client.
Prerequisites
- Spark 4.0.0 installed at /opt/spark (Standalone Master/Worker already running).
- Linux user spark owns /opt/spark.
- We keep Classic tools (spark-sql, spark-submit) unchanged.
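Before making changes, you can quickly confirm the prerequisites on the Spark host. A minimal check, assuming the standalone daemons were started with the stock scripts and the JDK's jps is on the PATH:
/opt/spark/bin/spark-submit --version   # should report 4.0.0
jps | grep -E 'Master|Worker'           # standalone Master and Worker should be listed
ls -ld /opt/spark                       # should be owned by the spark user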
- Set Hive as the catalog implementation
sudo sed -i '/^spark\.sql\.catalogImplementation/d' /opt/spark/conf/spark-defaults.conf
echo 'spark.sql.catalogImplementation hive' | sudo tee -a /opt/spark/conf/spark-defaults.conf
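To double-check the change, the file should now contain a single hive entry:
grep '^spark.sql.catalogImplementation' /opt/spark/conf/spark-defaults.conf
# expected: spark.sql.catalogImplementation hive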
- Create the systemd service
# /etc/systemd/system/spark-connect.service
[Unit]
Description=Apache Spark Connect Server
After=network-online.target
Wants=network-online.target
[Service]
Type=forking
User=spark
Group=spark
EnvironmentFile=-/etc/default/spark-connect-server
WorkingDirectory=/opt/spark
ExecStart=/bin/bash -lc '/opt/spark/sbin/start-connect-server.sh'
ExecStop=/opt/spark/sbin/stop-connect-server.sh
Restart=on-failure
RestartSec=3
LimitNOFILE=1048576
[Install]
WantedBy=multi-user.target
Enable & start:
sudo systemctl daemon-reload
sudo systemctl enable --now spark-connect
systemctl --no-pager -l status spark-connect
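If the unit started cleanly, Spark Connect listens on the default gRPC port 15002. A quick sanity check (the log file name under /opt/spark/logs depends on your user and hostname, so a wildcard is used here):
ss -lntp | grep 15002
tail -n 50 /opt/spark/logs/*SparkConnectServer*.out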
- Connect from a separate Jupyter node
source jupyterlab-venv/bin/activate
# install client + kernel
pip install "pyspark[connect]==4.0.0" ipykernel
Minimal test code (in a Jupyter notebook using the new kernel):
from pyspark.sql import SparkSession
CONNECT_URL = "sc://spark.maksonlee.com:15002"
spark = SparkSession.builder.remote(CONNECT_URL).getOrCreate()
print("Spark version:", spark.version)
spark.sql("USE thingsboard")
spark.sql("SHOW TABLES").show(truncate=False)
# Silver tables
spark.sql("SELECT * FROM ts_kv_cf LIMIT 5").show(truncate=False)
spark.sql("SELECT * FROM ts_kv_partitions_cf LIMIT 5").show(truncate=False)
Output:
Spark version: 4.0.0
+-----------+-----------------------+-----------+
|namespace |tableName |isTemporary|
+-----------+-----------------------+-----------+
|thingsboard|brz_ts_kv_cf |false |
|thingsboard|brz_ts_kv_partitions_cf|false |
|thingsboard|device |false |
|thingsboard|device_profile |false |
|thingsboard|temp_humidity_processed|false |
|thingsboard|ts_kv_cf |false |
|thingsboard|ts_kv_partitions_cf |false |
+-----------+-----------------------+-----------+
+-----------+------------------------------------+-----------+-------------+-------------+-----+------+-----+------+------+
|entity_type|entity_id |key |partition |ts |str_v|long_v|dbl_v|bool_v|json_v|
+-----------+------------------------------------+-----------+-------------+-------------+-----+------+-----+------+------+
|DEVICE |98573350-3db2-11f0-8581-b171bf77cb6a|humidity |1754006400000|1755430255027|NULL |39 |NULL |NULL |NULL |
|DEVICE |98573350-3db2-11f0-8581-b171bf77cb6a|temperature|1754006400000|1755430255027|NULL |35 |NULL |NULL |NULL |
|DEVICE |98573350-3db2-11f0-8581-b171bf77cb6a|humidity |1754006400000|1755430256022|NULL |39 |NULL |NULL |NULL |
|DEVICE |98573350-3db2-11f0-8581-b171bf77cb6a|temperature|1754006400000|1755430256022|NULL |35 |NULL |NULL |NULL |
|DEVICE |98573350-3db2-11f0-8581-b171bf77cb6a|humidity |1754006400000|1755430257022|NULL |39 |NULL |NULL |NULL |
+-----------+------------------------------------+-----------+-------------+-------------+-----+------+-----+------+------+
+---------------+------------------------------------+--------------------------+-------------+
|entity_type |entity_id |key |partition |
+---------------+------------------------------------+--------------------------+-------------+
|API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|activeDevicesCount |1754006400000|
|API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|activeDevicesCountHourly |1754006400000|
|API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|inactiveDevicesCount |1754006400000|
|API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|inactiveDevicesCountHourly|1754006400000|
|API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|ruleEngineExecutionCount |1754006400000|
+---------------+------------------------------------+--------------------------+-------------+
Done. Spark Connect runs under systemd, Spark SQL uses the Hive catalog, and you can connect from a separate Jupyter node via sc://spark.maksonlee.com:15002, while Classic workflows remain unchanged.
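If you want to confirm the Classic path still works, a quick spark-sql check on the Spark node should show the same Hive databases and tables (output depends on your metastore):
/opt/spark/bin/spark-sql -e "SHOW DATABASES;"
/opt/spark/bin/spark-sql -e "SHOW TABLES IN thingsboard;"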