Upgrade Apache Spark 3.5.6 → 4.0 on Ubuntu 24.04 (Single Node, Classic Mode)

Goal: Install Spark 4.0.0 (Hadoop 3, Connect build) at /opt/spark, but keep Classic mode for now by setting SPARK_CONNECT_MODE=0.
We’ll remove legacy Spark Cassandra Connector bits (we now use Debezium → Kafka → Spark → Delta), keep the PostgreSQL Hive Metastore, and leave systemd units unchanged.


Prerequisites

  • Ubuntu 24.04 LTS
  • Java 17+ (openjdk-17-jre)
  • Existing Spark in /opt/spark (3.5.6)
  • Hive Metastore on PostgreSQL (keep hive-site.xml)
  • We submit jobs the classic way (spark-submit) and do not need a Python venv yet

Quick checks:

java -version
python3 --version
df -h / /opt
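Spark 4.0 requires Java 17 or newer, so it is worth checking the major version explicitly rather than eyeballing the banner. A small guard you can paste as-is (the `awk` parse assumes OpenJDK-style `java -version` output; the helper name is ours):

```shell
# Extract the major version from OpenJDK-style `java -version` output
java_major() { awk -F'"' '/version/ {split($2, v, "."); print v[1]; exit}'; }

# Warn rather than abort, so the check is safe to paste anywhere
maj=$(java -version 2>&1 | java_major)
if [ -n "$maj" ] && [ "$maj" -ge 17 ]; then
  echo "Java $maj OK for Spark 4.0"
else
  echo "Java 17+ required (found: ${maj:-none})"
fi
```

Note that legacy Java 8 reports itself as `1.8.0_…`, so it parses as major version 1 and correctly fails the check.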

  1. Stop Spark 3.5.6 services
sudo systemctl stop spark-worker || true
sudo systemctl stop spark-master || true
# Kill any leftover drivers/executors (pkill matches the full command line,
# and unlike ps|grep it won't match its own process)
sudo pkill -9 -f 'SparkSubmit|CoarseGrainedExecutorBackend' || true

  2. Download & install Spark 4.0.0 (Connect build)

We’ll install the Connect tarball now, but explicitly disable Connect so everything behaves like 3.5.6 Classic.

cd /tmp
curl -fLO https://dlcdn.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3-connect.tgz

sudo mv /opt/spark /opt/spark.bak-3.5.6
sudo tar -xzf spark-4.0.0-bin-hadoop3-connect.tgz -C /opt
sudo mv /opt/spark-4.0.0-bin-hadoop3-connect /opt/spark
sudo chown -R spark:spark /opt/spark

  3. Bring over necessary configuration

Seed the new conf/ with old files without clobbering new defaults; then copy the must-haves:

sudo rsync -a --ignore-existing /opt/spark.bak-3.5.6/conf/ /opt/spark/conf/
sudo cp -f /opt/spark.bak-3.5.6/conf/hive-site.xml /opt/spark/conf/ 2>/dev/null || true

spark-env.sh — force Classic mode (and keep it minimal)

sudo tee /opt/spark/conf/spark-env.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
export SPARK_MASTER_HOST=spark.maksonlee.com
# Disable Spark Connect so jobs run in Classic (driver/executor) mode
export SPARK_CONNECT_MODE=0
EOF
sudo chmod +x /opt/spark/conf/spark-env.sh
sudo chown spark:spark /opt/spark/conf/spark-env.sh

We’re not enabling Connect in this post; we’ll flip this switch in a separate migration post later.


  4. Update spark-defaults.conf

Open /opt/spark/conf/spark-defaults.conf and make sure these entries exist (adjust if already present):

# Core warehouse
spark.sql.warehouse.dir          /opt/spark/warehouse

# Runtime dependencies (Spark 4.0 / Scala 2.13)
spark.jars.packages              io.delta:delta-spark_2.13:4.0.0,io.delta:delta-storage:4.0.0,org.apache.spark:spark-sql-kafka-0-10_2.13:4.0.0

# Delta integration
spark.sql.extensions             io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.delta.catalog.DeltaCatalog

Why no venv here? We’re staying in Classic mode and not using Pandas UDFs/toPandas() today. When we migrate to Connect later, we’ll add a venv and Python lines then.


  5. Remove legacy Cassandra Connector bits

We now ingest via Debezium → Kafka, so remove SCC from config and jars.

Clean config lines from spark-defaults.conf if they exist:

  • com.datastax.spark:spark-cassandra-connector_…
  • spark.sql.extensions = com.datastax.spark.connector.CassandraSparkExtensions
  • spark.sql.catalog.cass_* = com.datastax.spark.connector.datasource.CassandraCatalog
  • Any spark.cassandra.* settings
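A sketch of that cleanup on a scratch file (the sample lines are illustrative; review `grep -iE 'cassandra|datastax'` output on the real spark-defaults.conf before running the same `sed` against it):

```shell
# Scratch file standing in for spark-defaults.conf
conf=$(mktemp)
cat > "$conf" <<'EOF'
spark.sql.extensions            com.datastax.spark.connector.CassandraSparkExtensions
spark.cassandra.connection.host 127.0.0.1
spark.sql.warehouse.dir         /opt/spark/warehouse
EOF
# GNU sed: delete any line mentioning cassandra/datastax, case-insensitively
sed -i -E '/cassandra|datastax/Id' "$conf"
cat "$conf"   # only the warehouse line remains
```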

Purge old jars (SCC & Scala 2.12 artifacts) from Spark’s classpath:

sudo find /opt/spark/jars -type f \
  \( -iname '*cassandra*' -o -iname '*datastax*' -o -iname '*_2.12*.jar' \) -print -delete

Clear user caches (prevents repulling 3.x / 2.12):

rm -rf ~/.ivy2 ~/.ivy2.5.2 ~/.m2/repository 2>/dev/null || true

Spark 4.0 is Scala 2.13 only. All our coordinates must end with _2.13.


  6. Keep PostgreSQL JDBC for Hive Metastore

Make sure we have one recent pgjdbc jar (adjust version/path if needed):

sudo cp /opt/spark.bak-3.5.6/jars/postgresql-42.7.6.jar /opt/spark/jars/ 2>/dev/null || true
sudo chown spark:spark /opt/spark/jars/postgresql-42.7.6.jar
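Duplicate driver jars are a classic source of classloader confusion, so it is worth confirming exactly one copy made it over (the helper name is ours):

```shell
# Count PostgreSQL JDBC jars in a directory; anything other than 1 deserves a look
count_pgjdbc() { find "$1" -maxdepth 1 -name 'postgresql-*.jar' 2>/dev/null | wc -l; }
count_pgjdbc /opt/spark/jars
```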

  7. systemd services (no change required)

Our existing units that call /opt/spark/sbin/start-master.sh and /opt/spark/sbin/start-worker.sh do not need changes.
They’ll pick up SPARK_CONNECT_MODE=0 from spark-env.sh.

Start them:

sudo systemctl daemon-reload   # only needed if unit files were edited
sudo systemctl start spark-master
sudo systemctl start spark-worker
systemctl --no-pager -l status spark-master spark-worker

  8. Sanity checks

Spark version & a quick SQL ping:

/opt/spark/bin/spark-submit --version
/opt/spark/bin/spark-sql -S -e 'select current_date(), version()'

Delta smoke test:

/opt/spark/bin/spark-shell <<'SCALA'
spark.range(5).write.format("delta").mode("overwrite").save("/delta/upgrade_smoke")
println("rows=" + spark.read.format("delta").load("/delta/upgrade_smoke").count)
SCALA

  9. Run jobs exactly like before
spark-submit --master spark://spark.maksonlee.com:7077 --deploy-mode client cdc_to_delta.py

Final reference

/opt/spark/conf/spark-env.sh

export SPARK_MASTER_HOST=spark.maksonlee.com
export SPARK_CONNECT_MODE=0

/opt/spark/conf/spark-defaults.conf

spark.sql.warehouse.dir          /opt/spark/warehouse
spark.jars.packages              io.delta:delta-spark_2.13:4.0.0,io.delta:delta-storage:4.0.0,org.apache.spark:spark-sql-kafka-0-10_2.13:4.0.0
spark.sql.extensions             io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.delta.catalog.DeltaCatalog

We now have Spark 4.0 (Connect build) installed, Connect disabled, Classic jobs running as before, and a clean path to enable Connect later in a separate migration post.
