If you’ve already followed my previous guide to install JupyterLab, you might wonder how to use it for working with Apache Spark and Delta Lake, especially if your data is stored in Delta format and managed through Hive Metastore.
The simplest and most reliable way is to install JupyterLab directly on a Spark node, typically the Spark master. This ensures you have local access to both the Spark runtime and the Delta Lake storage path.
Why Run JupyterLab on a Spark Node?
Running JupyterLab on the Spark master gives you:
- Direct access to Spark binaries and PySpark APIs
- Local access to Delta Lake files (e.g., `/opt/spark/warehouse/`)
- Built-in Hive Metastore connectivity (e.g., via `hive-site.xml`)
- Fewer issues with remote data access or environment mismatches
Note: This guide uses Apache Spark 3.5.6.
If you’re using Spark 4.0 or later, you can optionally use Spark Connect to run Spark jobs remotely from Jupyter without installing Spark locally.
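As a rough sketch of that remote option (the hostname and port below are placeholders, and the notebook machine needs a PySpark client with Spark Connect support, e.g. `pip install "pyspark[connect]"`), connecting from Jupyter could look like this:

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect endpoint instead of a local Spark install.
# "spark-master.example.com" is a placeholder; 15002 is the default Spark Connect port.
spark = SparkSession.builder \
    .remote("sc://spark-master.example.com:15002") \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
```

The rest of this guide sticks with a local installation on the Spark master.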
Prerequisites
Ensure the following are installed on the same machine:
- Apache Spark 3.5.6 (with Delta Lake support)
- Hive Metastore config (`hive-site.xml`)
- Delta Lake JARs in `jars/` or added to `spark-defaults.conf` (see the sample entries after this list)
- JupyterLab (as covered in the previous post)
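If you go the `spark-defaults.conf` route, the entries typically look like the sketch below. The Delta package version is an assumption; pick the delta-spark release that matches your Spark 3.5.x build (Delta 3.x pairs with Spark 3.5):

```
# $SPARK_HOME/conf/spark-defaults.conf (package version is an assumption; match your setup)
spark.jars.packages              io.delta:delta-spark_2.12:3.2.0
spark.sql.extensions             io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.delta.catalog.DeltaCatalog
```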
You should also verify:
echo $SPARK_HOME
If not set, add to your shell config:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
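As an extra sanity check (assuming the default layout under `/opt/spark`), you can confirm that the Delta JARs and the Hive Metastore config are where Spark expects them:

```bash
ls "$SPARK_HOME"/jars | grep -i delta      # Delta Lake JARs present?
ls "$SPARK_HOME"/conf/hive-site.xml        # Hive Metastore config present?
spark-submit --version                     # should report Spark 3.5.6
```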
Sample Notebook: Query Delta Table from JupyterLab
- Open JupyterLab in your browser
- Create a new Python notebook
- Use the following code:
from pyspark.sql import SparkSession

# Create a Spark session with Hive Metastore support
# (Delta Lake support comes from the JARs in jars/ or spark-defaults.conf)
spark = SparkSession.builder \
    .appName("DeltaLakeQuery") \
    .enableHiveSupport() \
    .getOrCreate()

# List all databases
for db in spark.catalog.listDatabases():
    print(db.name)

# Show all tables in the 'thingsboard' database
spark.catalog.setCurrentDatabase("thingsboard")
for table in spark.catalog.listTables():
    print(table.name)

# Query a Delta Lake table
df = spark.sql("SELECT * FROM device LIMIT 10")
df.show()
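Since `device` is stored in Delta format, you can also read it directly from its storage path and use Delta's time travel. The path and version number below are assumptions for illustration; adjust them to your warehouse layout:

```python
# Read the Delta table by path (the path is an assumption based on the warehouse dir above)
df_by_path = spark.read.format("delta").load("/opt/spark/warehouse/thingsboard.db/device")
df_by_path.show(5)

# Delta time travel: read the table as of an earlier version (version 0 assumed to exist)
df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/opt/spark/warehouse/thingsboard.db/device")
df_v0.show(5)
```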