Goal: Install Spark 4.0.0 (Hadoop 3, Connect build) at /opt/spark, but keep Classic mode for now by setting SPARK_CONNECT_MODE=0.
We’ll remove legacy Spark Cassandra Connector bits (we now use Debezium → Kafka → Spark → Delta), keep the PostgreSQL Hive Metastore, and leave systemd units unchanged.
Prerequisites
- Ubuntu 24.04 LTS
- Java 17+ (openjdk-17-jre)
- Existing Spark 3.5.6 in /opt/spark
- Hive Metastore on PostgreSQL (keep hive-site.xml)
- We submit jobs the classic way (spark-submit) and do not need a Python venv yet
Quick checks:
java -version
python3 --version
df -h / /opt
- Stop Spark 3.5.6 services
sudo systemctl stop spark-worker || true
sudo systemctl stop spark-master || true
# Kill any leftover drivers/executors
ps aux | grep -E '[S]parkSubmit|[C]oarseGrainedExecutorBackend' | awk '{print $2}' | xargs -r sudo kill -9
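Before moving the old install aside, it's worth confirming nothing Spark-related is still alive. A small sketch (the [S]/[C] bracket trick keeps the pattern from matching any shell that happens to carry it on its own command line):

```shell
# Nothing Spark-related should survive the stop commands above
if pgrep -f '[S]parkSubmit|[C]oarseGrainedExecutorBackend' >/dev/null; then
  echo "WARN: Spark JVMs still running, investigate before continuing"
else
  echo "ok: no Spark processes found"
fi
```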
- Download & install Spark 4.0.0 (Connect build)
We’ll install the Connect tarball now, but explicitly disable Connect so everything behaves like 3.5.6 Classic.
cd /tmp
curl -O https://dlcdn.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3-connect.tgz
sudo mv /opt/spark /opt/spark.bak-3.5.6
sudo tar -xzf spark-4.0.0-bin-hadoop3-connect.tgz -C /opt
sudo mv /opt/spark-4.0.0-bin-hadoop3-connect /opt/spark
sudo chown -R spark:spark /opt/spark
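Optional but cheap insurance: Apache publishes a .sha512 file next to each tarball, so we can catch a truncated or corrupted download before trusting the install. Note that older releases rotate from dlcdn.apache.org to archive.apache.org, so adjust the host if the URL 404s. A sketch that skips quietly if the tarball is no longer in /tmp:

```shell
cd /tmp
if [ -f spark-4.0.0-bin-hadoop3-connect.tgz ]; then
  # Fetch the published digest and check the tarball against it
  curl -sSO https://dlcdn.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3-connect.tgz.sha512
  sha512sum -c spark-4.0.0-bin-hadoop3-connect.tgz.sha512
else
  echo "tarball not found in /tmp, skipping checksum"
fi
```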
- Bring over necessary configuration
Seed the new conf/ with old files without clobbering new defaults; then copy the must-haves:
sudo rsync -a --ignore-existing /opt/spark.bak-3.5.6/conf/ /opt/spark/conf/
sudo cp -f /opt/spark.bak-3.5.6/conf/hive-site.xml /opt/spark/conf/ 2>/dev/null || true
spark-env.sh: force Classic mode (and keep it minimal)
sudo tee /opt/spark/conf/spark-env.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
export SPARK_MASTER_HOST=spark.maksonlee.com
# Disable Spark Connect so jobs run in Classic (driver/executor) mode
export SPARK_CONNECT_MODE=0
EOF
sudo chmod +x /opt/spark/conf/spark-env.sh
sudo chown spark:spark /opt/spark/conf/spark-env.sh
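A quick confirmation that the file parses and exports the flag, sourced in a subshell so your login shell isn't polluted. On this box it should print SPARK_CONNECT_MODE=0:

```shell
# Source the env file the way the daemons will see it and print the flag
( [ -r /opt/spark/conf/spark-env.sh ] && . /opt/spark/conf/spark-env.sh
  echo "SPARK_CONNECT_MODE=${SPARK_CONNECT_MODE:-unset}" )
```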
We’re not enabling Connect in this post; we’ll flip this switch later in a separate migration post.
- Update spark-defaults.conf
Open /opt/spark/conf/spark-defaults.conf and make sure these entries exist (adjust if already present):
# Core warehouse
spark.sql.warehouse.dir /opt/spark/warehouse
# Runtime dependencies (Spark 4.0 / Scala 2.13)
spark.jars.packages io.delta:delta-spark_2.13:4.0.0,io.delta:delta-storage:4.0.0,org.apache.spark:spark-sql-kafka-0-10_2.13:4.0.0
# Delta integration
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
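To double-check the edit took, a small loop like this (paths assumed from this install) prints one line per required key:

```shell
# Print ok/missing for each of the four entries we rely on
conf=/opt/spark/conf/spark-defaults.conf
for key in spark.sql.warehouse.dir spark.jars.packages \
           spark.sql.extensions spark.sql.catalog.spark_catalog; do
  # -q: quiet match, -s: no error if the file is absent
  grep -qs "^${key}[[:space:]]" "$conf" && echo "ok:      $key" || echo "missing: $key"
done
```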
Why no venv here? We’re staying in Classic mode and not using Pandas UDFs or toPandas() today. When we migrate to Connect later, we’ll add a venv and the Python config lines then.
- Remove legacy Cassandra Connector bits
We now ingest via Debezium → Kafka, so remove SCC from config and jars.
Clean config lines from spark-defaults.conf if they exist:
- com.datastax.spark:spark-cassandra-connector_…
- spark.sql.extensions = com.datastax.spark.connector.CassandraSparkExtensions
- spark.sql.catalog.cass_* = com.datastax.spark.connector.datasource.CassandraCatalog
- Any spark.cassandra.* settings
Purge old jars (SCC & Scala 2.12 artifacts) from Spark’s classpath:
sudo find /opt/spark/jars -type f \
\( -iname '*cassandra*' -o -iname '*datastax*' -o -iname '*_2.12*.jar' \) -print -delete
Clear user caches (prevents repulling 3.x / 2.12):
rm -rf ~/.ivy2 ~/.ivy2.5.2 ~/.m2/repository 2>/dev/null || true
Spark 4.0 is Scala 2.13 only. All our coordinates must end with _2.13.
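A zero here confirms the purge worked:

```shell
# Expect 0; any survivor would clash with Spark 4's Scala 2.13 runtime
find /opt/spark/jars -type f \
  \( -iname '*cassandra*' -o -iname '*datastax*' -o -iname '*_2.12*.jar' \) 2>/dev/null | wc -l
```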
- Keep PostgreSQL JDBC for Hive Metastore
Make sure we have one recent pgjdbc jar (adjust version/path if needed):
sudo cp /opt/spark.bak-3.5.6/jars/postgresql-42.7.6.jar /opt/spark/jars/ 2>/dev/null || true
sudo chown spark:spark /opt/spark/jars/postgresql-42.7.6.jar
- systemd services (no change required)
Our existing units that call /opt/spark/sbin/start-master.sh and /opt/spark/sbin/start-worker.sh do not need changes. They’ll pick up SPARK_CONNECT_MODE=0 from spark-env.sh.
Start them:
sudo systemctl daemon-reload   # only needed if unit files were edited
sudo systemctl start spark-master
sudo systemctl start spark-worker
systemctl --no-pager -l status spark-master spark-worker
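Both units should report active; a compact check (falls back gracefully on hosts without systemd):

```shell
# Print one state line per unit; "unknown" means systemctl wasn't available
for unit in spark-master spark-worker; do
  state=$(systemctl is-active "$unit" 2>/dev/null || true)
  echo "$unit: ${state:-unknown}"
done
```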
- Sanity checks
Spark version & a quick SQL ping:
/opt/spark/bin/spark-submit --version
/opt/spark/bin/spark-sql -S -e 'select current_date(), version()'
Delta smoke test:
/opt/spark/bin/spark-shell <<'SCALA'
spark.range(5).write.format("delta").mode("overwrite").save("/delta/upgrade_smoke")
println("rows=" + spark.read.format("delta").load("/delta/upgrade_smoke").count)
SCALA
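One more belt-and-braces check: with SPARK_CONNECT_MODE=0 there should be no Connect gRPC server running. Connect's default port is 15002, so a listening socket there would mean Connect snuck on anyway (assumes ss from iproute2 is installed):

```shell
# Expect the "ok" branch: nothing should be listening on Connect's default port
if ss -ltn 2>/dev/null | grep -q ':15002[[:space:]]'; then
  echo "WARN: something is listening on 15002 (Connect may be enabled)"
else
  echo "ok: no Connect server on 15002"
fi
```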
- Run jobs exactly like before
spark-submit --master spark://spark.maksonlee.com:7077 --deploy-mode client cdc_to_delta.py
Final reference
/opt/spark/conf/spark-env.sh
export SPARK_MASTER_HOST=spark.maksonlee.com
export SPARK_CONNECT_MODE=0
/opt/spark/conf/spark-defaults.conf
spark.sql.warehouse.dir /opt/spark/warehouse
spark.jars.packages io.delta:delta-spark_2.13:4.0.0,io.delta:delta-storage:4.0.0,org.apache.spark:spark-sql-kafka-0-10_2.13:4.0.0
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
We now have Spark 4.0 (Connect build) installed, Connect disabled, Classic jobs running as before, and a clean path to enable Connect later in a separate migration post.