If you’ve already followed my previous guide to install JupyterLab, you might wonder how to use it for working with Apache Spark and Delta Lake, especially if your data is stored in Delta format and managed through Hive Metastore.
The simplest and most reliable way is to install JupyterLab directly on a Spark node, typically the Spark master. This ensures you have local access to both the Spark runtime and the Delta Lake storage path.
Why Run JupyterLab on a Spark Node?
Running JupyterLab on the Spark master gives you:
- Direct access to Spark binaries and PySpark APIs
- Local access to Delta Lake files (e.g., `/opt/spark/warehouse/`)
- Built-in Hive Metastore connectivity (e.g., via `hive-site.xml`)
- Fewer issues with remote data access or environment mismatches
Note: This guide uses Apache Spark 3.5.6.
If you’re using Spark 4.0 or later, you can optionally use Spark Connect to run Spark jobs remotely from Jupyter without installing Spark locally.
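As a rough sketch of that remote option (the hostname and port below are placeholders, and the notebook machine needs a PySpark client with Spark Connect support, e.g. `pip install "pyspark[connect]"`), connecting from Jupyter could look like this:

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect endpoint instead of a local Spark install.
# "spark-master.example.com" is a placeholder; 15002 is the default Spark Connect port.
spark = SparkSession.builder \
    .remote("sc://spark-master.example.com:15002") \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
```

The rest of this guide sticks with a local installation on the Spark master.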
Prerequisites
Ensure the following are installed on the same machine:
- Apache Spark 3.5.6 (with Delta Lake support)
- Hive Metastore config (`hive-site.xml`)
- Delta Lake JARs in `jars/` or added to `spark-defaults.conf` (see the sample entries after this list)
- JupyterLab (as covered in the previous post)
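If you go the `spark-defaults.conf` route, the entries typically look like the sketch below. The Delta package version is an assumption; pick the delta-spark release that matches your Spark 3.5.x build (Delta 3.x pairs with Spark 3.5):

```
# $SPARK_HOME/conf/spark-defaults.conf (package version is an assumption; match your setup)
spark.jars.packages              io.delta:delta-spark_2.12:3.2.0
spark.sql.extensions             io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog  org.apache.spark.sql.delta.catalog.DeltaCatalog
```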
You should also verify:
echo $SPARK_HOME
If not set, add to your shell config:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
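As an extra sanity check (assuming the default layout under `/opt/spark`), you can confirm that the Delta JARs and the Hive Metastore config are where Spark expects them:

```bash
ls "$SPARK_HOME"/jars | grep -i delta      # Delta Lake JARs present?
ls "$SPARK_HOME"/conf/hive-site.xml        # Hive Metastore config present?
spark-submit --version                     # should report Spark 3.5.6
```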
Sample Notebook: Query Delta Table from JupyterLab
- Open JupyterLab in your browser
- Create a new Python notebook
- Use the following code:
from pyspark.sql import SparkSession

# Create a Spark session with Hive Metastore support
# (Delta Lake support comes from the JARs in jars/ or spark-defaults.conf)
spark = SparkSession.builder \
    .appName("DeltaLakeQuery") \
    .enableHiveSupport() \
    .getOrCreate()

# List all databases
for db in spark.catalog.listDatabases():
    print(db.name)

# Show all tables in the 'thingsboard' database
spark.catalog.setCurrentDatabase("thingsboard")
for table in spark.catalog.listTables():
    print(table.name)

# Query a Delta Lake table
df = spark.sql("SELECT * FROM device LIMIT 10")
df.show()
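Since `device` is stored in Delta format, you can also read it directly from its storage path and use Delta's time travel. The path and version number below are assumptions for illustration; adjust them to your warehouse layout:

```python
# Read the Delta table by path (the path is an assumption based on the warehouse dir above)
df_by_path = spark.read.format("delta").load("/opt/spark/warehouse/thingsboard.db/device")
df_by_path.show(5)

# Delta time travel: read the table as of an earlier version (version 0 assumed to exist)
df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/opt/spark/warehouse/thingsboard.db/device")
df_v0.show(5)
```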