Enable Spark Connect (systemd) + Set Hive Catalog on Spark 4.0

This post is a follow-up to: Upgrade Apache Spark 3.5.6 → 4.0 on Ubuntu 24.04 (Single Node, Classic Mode)

What you’ll do

  • Add a Spark Connect server as a systemd service (/opt/spark/sbin/start-connect-server.sh).
  • Set Hive as the SQL catalog implementation in spark-defaults.conf.
  • Connect from a separate Jupyter node using the Spark Connect client.

Prerequisites

  • Spark 4.0.0 installed at /opt/spark (Standalone Master/Worker already running).
  • Linux user spark owns /opt/spark.
  • We keep Classic tools (spark-sql, spark-submit) unchanged.

  1. Set Hive as the catalog implementation
sudo sed -i '/^spark\.sql\.catalogImplementation/d' /opt/spark/conf/spark-defaults.conf
echo 'spark.sql.catalogImplementation  hive' | sudo tee -a /opt/spark/conf/spark-defaults.conf
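
A quick sanity check, assuming the Classic spark-sql CLI is available on this node (it is, per the prerequisites): confirm the line landed in spark-defaults.conf and, optionally, that spark-sql now resolves databases from the Hive metastore:

grep catalogImplementation /opt/spark/conf/spark-defaults.conf
/opt/spark/bin/spark-sql -e "SHOW DATABASES;"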

  2. Create the systemd service
# /etc/systemd/system/spark-connect.service
[Unit]
Description=Apache Spark Connect Server
After=network-online.target
Wants=network-online.target

[Service]
Type=forking
User=spark
Group=spark
EnvironmentFile=-/etc/default/spark-connect-server
WorkingDirectory=/opt/spark
ExecStart=/bin/bash -lc '/opt/spark/sbin/start-connect-server.sh'
ExecStop=/opt/spark/sbin/stop-connect-server.sh
Restart=on-failure
RestartSec=3
LimitNOFILE=1048576

[Install]
WantedBy=multi-user.target
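
The EnvironmentFile line is optional (the leading - tells systemd to ignore it if the file is missing). If you want to override where the daemon writes its logs and PID file, a minimal sketch of /etc/default/spark-connect-server could look like this (paths are illustrative; both variables are read by Spark's daemon scripts):

# /etc/default/spark-connect-server (optional)
SPARK_LOG_DIR=/opt/spark/logs
SPARK_PID_DIR=/tmp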

Enable & start:

sudo systemctl daemon-reload
sudo systemctl enable --now spark-connect
systemctl --no-pager -l status spark-connect
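
If the service is healthy, the server listens on the default Spark Connect port 15002 and writes a daemon log under /opt/spark/logs (the exact file name includes the class and host name, hence the glob):

ss -ltnp | grep 15002
tail -n 50 /opt/spark/logs/*SparkConnectServer*.out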

  3. Connect from a separate Jupyter node
source jupyterlab-venv/bin/activate

# install client + kernel
pip install "pyspark[connect]==4.0.0" ipykernel
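
The notebook below assumes this venv is registered as its own Jupyter kernel; one way to register it (the kernel name and display name here are arbitrary choices):

python -m ipykernel install --user --name pyspark-connect --display-name "PySpark (Spark Connect)"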
    

Minimal test code (in a Jupyter notebook using the new kernel):

from pyspark.sql import SparkSession

# Spark Connect endpoint exposed by the systemd service
CONNECT_URL = "sc://spark.maksonlee.com:15002"
spark = SparkSession.builder.remote(CONNECT_URL).getOrCreate()

print("Spark version:", spark.version)

spark.sql("USE thingsboard")
spark.sql("SHOW TABLES").show(truncate=False)

# Silver tables
spark.sql("SELECT * FROM ts_kv_cf LIMIT 5").show(truncate=False)
spark.sql("SELECT * FROM ts_kv_partitions_cf LIMIT 5").show(truncate=False)
    
Output:

    Spark version: 4.0.0
    +-----------+-----------------------+-----------+
    |namespace  |tableName              |isTemporary|
    +-----------+-----------------------+-----------+
    |thingsboard|brz_ts_kv_cf           |false      |
    |thingsboard|brz_ts_kv_partitions_cf|false      |
    |thingsboard|device                 |false      |
    |thingsboard|device_profile         |false      |
    |thingsboard|temp_humidity_processed|false      |
    |thingsboard|ts_kv_cf               |false      |
    |thingsboard|ts_kv_partitions_cf    |false      |
    +-----------+-----------------------+-----------+
    
    +-----------+------------------------------------+-----------+-------------+-------------+-----+------+-----+------+------+
    |entity_type|entity_id                           |key        |partition    |ts           |str_v|long_v|dbl_v|bool_v|json_v|
    +-----------+------------------------------------+-----------+-------------+-------------+-----+------+-----+------+------+
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|humidity   |1754006400000|1755430255027|NULL |39    |NULL |NULL  |NULL  |
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|temperature|1754006400000|1755430255027|NULL |35    |NULL |NULL  |NULL  |
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|humidity   |1754006400000|1755430256022|NULL |39    |NULL |NULL  |NULL  |
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|temperature|1754006400000|1755430256022|NULL |35    |NULL |NULL  |NULL  |
    |DEVICE     |98573350-3db2-11f0-8581-b171bf77cb6a|humidity   |1754006400000|1755430257022|NULL |39    |NULL |NULL  |NULL  |
    +-----------+------------------------------------+-----------+-------------+-------------+-----+------+-----+------+------+
    
    +---------------+------------------------------------+--------------------------+-------------+
    |entity_type    |entity_id                           |key                       |partition    |
    +---------------+------------------------------------+--------------------------+-------------+
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|activeDevicesCount        |1754006400000|
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|activeDevicesCountHourly  |1754006400000|
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|inactiveDevicesCount      |1754006400000|
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|inactiveDevicesCountHourly|1754006400000|
    |API_USAGE_STATE|9a0145c0-369f-11f0-98b8-2ddc48f3ce2c|ruleEngineExecutionCount  |1754006400000|
    +---------------+------------------------------------+--------------------------+-------------+
    

Done. Spark Connect runs under systemd, Spark SQL uses the Hive catalog, and you can connect from a separate Jupyter node via sc://spark.maksonlee.com:15002 while Classic workflows remain unchanged.
