Setting Up Trino with Hive to Query Delta Lake Data on MinIO: A Scalable Big Data Solution

David (Dudu) Zbeda
15 min read · Oct 20, 2024


Introduction

This blog continues my previous post, in which we generated data in Delta Lake format and stored it on MinIO-S3 storage. While we covered how to read the data, querying it efficiently in real-world scenarios requires a scalable SQL engine.

Trino (formerly known as Presto) is a distributed SQL query engine designed to handle large-scale queries across diverse data sources. As mentioned in the previous blog, robust big data solutions require significant computing resources. A Trino cluster is ideal for larger-scale environments, but a single Trino instance can be sufficient for smaller setups.

For reference, here is my previous blog: https://medium.com/@dudu.zbeda_13698/streamlining-big-data-with-spark-writing-and-reading-delta-lake-format-on-minio-1700060eaa72

Blog Goals

In this post, I’ll guide you through setting up a single Trino instance that acts as both the coordinator and worker, alongside a Hive Metastore and Postgres database. Additionally, I’ll demonstrate how to create a schema and table to query the partitioned and non-partitioned data generated in the previous blog. This guide will take you through the necessary installations and configurations, providing both Docker Compose and Kubernetes manifests to run the setup.

  1. Download, Install, and Configure Trino (formerly known as Presto)

We’ll set up Trino and configure it with the Hive and Delta Lake connectors to execute queries on data stored in MinIO S3 in Delta Lake format. Both partitioned and non-partitioned data will be queried using Trino.

2. Configure and Run Hive Metastore

The Hive Metastore will serve as the metadata repository, storing information about the Delta Lake tables such as schema and partitions. Trino will access this metadata to query the data in MinIO S3 efficiently.

3. Configure and Run Postgres DB

Postgres will be the backend database for the Hive Metastore, storing all of its metadata.

  4. Install the Trino CLI

We’ll use the Trino CLI to connect to the Trino instance and execute commands to create the schema, register tables, and run queries on the Delta Lake data.

Understanding Key Components and Dependencies

1. What is Trino and What are Connectors?

Trino (formerly known as Presto) is a distributed SQL query engine designed to run fast, scalable, interactive analytics against large datasets. Trino allows you to query data from multiple sources, such as object storage (like MinIO S3), relational databases, NoSQL systems, and even HDFS storage, using standard SQL.

Connectors are essential to Trino because they enable it to interact with these different data sources. Each connector provides the necessary functionality to query a specific type of data storage. For example:

  • Delta Lake Connector: Allows Trino to query Delta Lake tables stored on MinIO or any S3-compatible storage.
  • Hive Connector: Lets Trino query tables whose metadata lives in the Hive Metastore and whose underlying data files (e.g., Parquet) are stored on MinIO. In this setup, the same Hive Metastore also backs the Delta Lake connector, which relies on it to manage table metadata such as schemas and partitions; without a metastore, Trino couldn’t locate or manage Delta Lake tables.
  • Relational database connectors: Trino also ships connectors for relational databases such as MySQL, PostgreSQL, and SQL Server, allowing you to run queries against traditional databases.
  • HDFS support: The Hive connector can also read data stored in the Hadoop Distributed File System (HDFS), making it possible to integrate HDFS-stored datasets into your queries.

These connectors make Trino highly versatile, enabling it to query data from multiple sources in a seamless way, all through standard SQL queries.
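
To make this concrete, every configured connector is exposed in Trino as a catalog, and tables are always addressed as catalog.schema.table. The sketch below is illustrative only; the catalog, schema, and table names are placeholders, not objects that exist in this setup yet:

-- Query Delta Lake data through a catalog backed by the Delta Lake connector
SELECT * FROM delta.sales.orders LIMIT 10;

-- The same standard SQL works against a catalog backed by a relational database connector
SELECT * FROM postgresql.public.customers LIMIT 10;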

2. What is Hive and Why Do We Need It with Trino?

Hive is a data warehouse system built on top of Hadoop, and its primary role is to manage and query large datasets stored in distributed storage. It provides the Hive Metastore, which acts as a metadata repository storing the schemas, tables, and partitioning information about your datasets.

While Trino can query data directly from a variety of data sources, a Hive Metastore is required when working with Delta Lake: Trino’s Delta Lake connector uses the metastore to keep track of table metadata such as table names, schemas, and storage locations. Without this metadata, Trino wouldn’t know how to locate and query the data stored in Delta Lake format.

By integrating Hive with Trino:

  • Trino retrieves the metadata (such as schemas and tables) from Hive Metastore.
  • Trino then uses that information to efficiently query the data stored in Delta Lake on MinIO S3.

In summary, Hive is essential because it provides the metadata management needed for Trino to interact with the Delta Lake format.
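
To see the division of labor, the sketch below shows which statements are resolved from Hive Metastore metadata and which trigger an actual scan of the Delta Lake files on MinIO; the schema and table names match the ones created later in this guide:

-- Metadata lookups: resolved through the Hive Metastore (schemas, table names, locations)
SHOW SCHEMAS FROM delta;
SHOW TABLES FROM delta.non_partition;

-- Data scan: reads the Delta Lake files directly from MinIO S3
SELECT * FROM delta.non_partition.table_non_partition LIMIT 5;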

Prerequisites and Ingredients

Below are all the prerequisites required to run this exercise:

  1. A Linux box with an internet connection and 20 GB of memory — for this exercise, I used a Linux box running Ubuntu 22.04
  2. Preparation of the Linux box includes:
  • Python3 installation
  • Docker & Docker Compose installation - You can use the following link: https://docs.docker.com/engine/install/ubuntu/
  • /data/poc folder as our designated working directory. The folder path is recommended but can be modified.
  • /data/trino folder as Trino persistent storage. The folder path is recommended but can be modified.
  • /data/postgres folder as Postgres persistent storage. The folder path is recommended but can be modified.

3. A running MinIO S3 deployment (the one used to store the Delta Lake data in the previous blog)

Let’s start working

Note: Before diving into the setup, please note that for this exercise, I’m using specific versions of Java, Trino, and Hive to ensure compatibility. It’s essential to use the exact versions mentioned to avoid conflicts and potential failures. Changing the versions may result in errors during the setup or execution.

Create Trino configuration files

Trino requires two sets of configuration files:

  1. Platform and Resource Configuration
    These configuration files define the Trino platform’s setup and allocation of resources for the Trino instance. They include:
  • config.properties: Defines the overall configuration for the Trino server, such as the coordinator's address, port, and worker setup.
  • node.properties: Contains settings specific to each Trino node (coordinator or worker), such as node ID, environment, and data directories.
  • jvm.config: Specifies JVM (Java Virtual Machine) options for the Trino server, including memory settings and garbage collection tuning.

2. Connector Configuration Files
Connector configuration files must be created to allow Trino to query data from external sources like Hive or Delta Lake. Each connector configuration is stored in a separate .properties file, located in the /etc/trino/catalog/ directory.

  • Hive Connector (hive.properties):
    Enables Trino to integrate with the Hive Metastore, providing access to the metadata about tables and schemas stored in MinIO S3.
  • Delta Lake Connector (delta.properties):
    Provides Trino with the capability to query Delta Lake tables stored on MinIO or other object storage, with full support for the ACID transactions and time-travel features of Delta Lake.
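
One detail worth noting: Trino names each catalog after its properties file, so hive.properties is exposed as the hive catalog and delta.properties as the delta catalog. Once the instance is up, a quick sanity check from any SQL client looks roughly like this (the exact catalog list depends on your setup):

-- Catalog names are derived from the file names under /etc/trino/catalog/
SHOW CATALOGS;
-- The output should include the delta and hive catalogs defined above,
-- alongside Trino's built-in catalogs such as system.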

Create the relevant folder structure

  1. Log in to the Linux box
  2. Create an additional folder structure under /data/trino by executing the command mkdir -p /data/trino/etc/catalog

To set up the necessary configuration files for Trino, follow these steps:

  1. Navigate to /data/trino/etc by executing the command cd /data/trino/etc. This is the directory where you’ll manage all Trino configuration files.

config.properties

  1. You can create or download the config.properties from the following location:
# If enabled - both coordinator and worker will be running on a single instance
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080

# Set the IP of the server running the Trino container
discovery.uri=http://130.1.1.2:8080

2. If needed, modify the specified sections in the properties file

  • discovery.uri — Update this with the server address where the Trino container is running. In this setup, the IP address is 130.1.1.2, but you should adjust this based on your environment.

node.properties

  1. You can create or download the node.properties from the following location:
node.environment=docker

# Make sure to update the path based on your setup
node.data-dir=/data/trino

plugin.dir=/usr/lib/trino/plugin

2. If needed, modify the specified sections in the properties file

  • node.data-dir — This is the Trino persistent storage folder we have defined in the prerequisite section.

jvm.config

  1. You can create or download the jvm.config from the following location:
-server
-Xmx1G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+UseGCOverheadLimit
-XX:+ExitOnOutOfMemoryError
-XX:ReservedCodeCacheSize=256M
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000

2. If needed, modify the properties file based on your performance requirements. For a detailed explanation, follow Trino's documentation.

To set up the connector configuration files for Trino, follow these steps:

3. Navigate to /data/trino/etc/catalog by executing the command cd /data/trino/etc/catalog

delta.properties

  1. You can create or download the delta.properties from the following location:
connector.name=delta-lake

# This is the server address where the Hive container is running. In my setup, Hive runs on server IP 130.1.1.2
hive.metastore.uri=thrift://130.1.1.2:9083

# This is the access key that has permission to the bucket where the data we wish to query is stored
hive.s3.aws-access-key=N1adsdasdsadasp8gKvOa

# MINIO Secret key
hive.s3.aws-secret-key=lch9rNXRySodasdsadsadsadasdsadnwGWaCD

# This is the MinIO address. In my setup, the MinIO address and port are http://130.1.1.1:9000
hive.s3.endpoint=http://130.1.1.1:9000

hive.s3.path-style-access=true
hive.s3.ssl.enabled=false

2. If needed, modify the specified sections in the properties file

  • hive.metastore.uri — This is the server address where the Hive container is running. In my setup, Hive runs on server IP 130.1.1.2
  • hive.s3.aws-access-key — This is the access key that has permission to the bucket where the data we want to query is stored
  • hive.s3.aws-secret-key — This is the corresponding secret access key
  • hive.s3.endpoint — This is the MinIO address. In my setup, the MinIO address and port are http://130.1.1.1:9000
  • Do not change the connector.name parameter!

hive.properties

  1. You can create or download the hive.properties from the following location:
connector.name=hive

# This is the server address where the Hive container is running. In my setup, Hive runs on server IP 130.1.1.2
hive.metastore.uri=thrift://130.1.1.2:9083

# This is the access key that has permission to the bucket where the data we wish to query is stored
hive.s3.aws-access-key=N1a96fsdfdsfdsgKvOa

# MINIO Secret key
hive.s3.aws-secret-key=lch9rNXRySfdsfdsfdsfdsfdsfsfdsnwGWaCD

# This is the MinIO address. In my setup, the MinIO address and port are http://130.1.1.1:9000
hive.s3.endpoint=http://130.1.1.1:9000

hive.s3.path-style-access=true
hive.s3.ssl.enabled=false

2. If needed, modify the specified sections in the properties file

  • hive.metastore.uri — This is the server address where the Hive container is running. In my setup, Hive runs on server IP 130.1.1.2
  • hive.s3.aws-access-key — This is the access key that has permission to the bucket where the data we want to query is stored
  • hive.s3.aws-secret-key — This is the corresponding secret access key
  • hive.s3.endpoint — This is the MinIO address. In my setup, the MinIO address and port are http://130.1.1.1:9000
  • Do not change the connector.name parameter!

Verify that the files are available:

  1. Verify that config.properties, node.properties, and jvm.config files exist under /data/trino/etc folder by executing the command ll /data/trino/etc
Trino configuration files

2. Verify that delta.properties and hive.properties files exist under /data/trino/etc/catalog folder by executing the command ll /data/trino/etc/catalog

trino connector files

Create a Docker Compose file to run Hive, Postgres, and Trino containers

To run Hive, Postgres, and Trino as containers using Docker Compose, follow these steps:

  1. Log in to the Linux box
  2. Navigate to the folder /data/poc by executing the command cd /data/poc
  3. You can create or download the trino-hive-postgres-docker-compose.yaml from the following location:
version: "3.4"

services:
postgresdb:
# Postgres Image
image: postgres
# This is the Postgres server hostname
hostname: metastore_db
# ports:
# this is optional only in case you wish to expose the Postgres port
#- '5432:5432'
# In case of service/container crash, the container will restart.
restart: always
environment:
# Specify the username that will be created in the Postgres DB.
# You can change this to any username you prefer.
POSTGRES_USER: hive

# Set password for the Postgres user. Change this to a more complex password.
POSTGRES_PASSWORD: hive

# Name of the database created in Postgres. You can modify this name if desired.
POSTGRES_DB: metastore
volumes:
# Postgres DB data will be saved on the Linux box under /data/postgres.
# You can change the host path if you want to store the data in a different location.
- /data/postgres:/var/lib/postgresql/data

hive-metastore:
# This is the Hive Metastore service
image: 'starburstdata/hive:3.1.2-e.18'
# This is the hive server hostname
hostname: hive-metastore
ports:
# Expose Metastore Thrift service on port 9083.
- '9083:9083'
# In case of service/container crash, the container will restart.
restart: always
environment:
# JDBC driver for connecting Hive Metastore to Postgres
HIVE_METASTORE_DRIVER: org.postgresql.Driver

# JDBC URL for the Postgres DB. Ensure the hostname and port match your Postgres service.
HIVE_METASTORE_JDBC_URL: jdbc:postgresql://metastore_db:5432/metastore

# Username for connecting to Postgres from Hive. Change this if you modified POSTGRES_USER.
HIVE_METASTORE_USER: hive

# Password for connecting to Postgres from Hive. Change this if you modified POSTGRES_PASSWORD.
HIVE_METASTORE_PASSWORD: hive

# Location of the Hive warehouse where Delta Lake data is stored.
# You can modify this to match the bucket or path in your MinIO setup.
HIVE_METASTORE_WAREHOUSE_DIR: s3://spark-delta-lake/

# MinIO endpoint. Replace this with your MinIO address and port.
S3_ENDPOINT: http://130.1.1.1:9000

# Access key for MinIO. Change this to your actual MinIO access key.
S3_ACCESS_KEY: N1a96gdfgdfgdfgdfgKvOa

# Secret key for MinIO. Change this to your actual MinIO secret key.
S3_SECRET_KEY: lgdfgfdgdfgdfgfdgdfgdfgdfgmKnwGWaCD

# Use path-style access. Keep this set to "true" for MinIO compatibility.
S3_PATH_STYLE_ACCESS: "true"

# The region for S3 storage. For MinIO, leave this blank.
REGION: ""

# Following fields are placeholders for integration with other cloud services like Google Cloud and Azure.
# These can be left blank unless you're using one of those services.
GOOGLE_CLOUD_KEY_FILE_PATH: ""
AZURE_ADL_CLIENT_ID: ""
AZURE_ADL_CREDENTIAL: ""
AZURE_ADL_REFRESH_URL: ""
AZURE_ABFS_STORAGE_ACCOUNT: ""
AZURE_ABFS_ACCESS_KEY: ""
AZURE_WASB_STORAGE_ACCOUNT: ""
AZURE_ABFS_OAUTH: ""
AZURE_ABFS_OAUTH_TOKEN_PROVIDER: ""
AZURE_ABFS_OAUTH_CLIENT_ID: ""
AZURE_ABFS_OAUTH_SECRET: ""
AZURE_ABFS_OAUTH_ENDPOINT: ""
AZURE_WASB_ACCESS_KEY: ""

# Define Hive Metastore admin role. You can change the role if needed.
HIVE_METASTORE_USERS_IN_ADMIN_ROLE: "admin"
healthcheck:
# Health check to confirm if the Hive Metastore is running on port 9083.
test: bash -c "exec 6<> /dev/tcp/localhost/9083"

trino-coordinator:
# Trino coordinator service
image: 'trinodb/trino:400'
hostname: trino-coordinator
# In case of service/container crash, the container will restart.
restart: always
ports:
# Trino UI and coordinator service exposed on port 8080. You can change the host-side port if needed.
- '8080:8080'
volumes:
# Trino configuration files stored in /data/trino/etc.
# You can modify the host directory if you want the files stored in a different path.
- /data/trino/etc:/etc/trino

4. If needed, modify the specified sections in the Docker Compose file based on the comments provided in the file. The most important parameters you will need to update for your setup are:

  • S3_ENDPOINT — This is the MinIO address. In my setup, the MinIO address and port are http://130.1.1.1:9000
  • S3_ACCESS_KEY — This is the access key that has permission to the bucket where the data we want to query is stored
  • S3_SECRET_KEY — This is the corresponding secret access key

5. Run the containers by executing the command docker compose -f trino-hive-postgres-docker-compose.yaml up

  • Trino is running version 400 — using a different version may result in a failure
  • Trino runs both the coordinator and a worker on a single instance. Running workers separately will require additional instances with different configurations — please refer to the Trino documentation.
  • When running the containers, note that Docker will download the images, which might take some time depending on your network speed. In my setup, it took around 10 minutes
  • By default, Docker images are saved under /var — verify that you have enough free storage space (~5 GB)
Running docker compose -f trino-hive-postgres-docker-compose.yaml up

Verification

To verify that the containers are running as expected, follow these steps:

  1. In the Docker logs, make sure you can see the message indicating that Trino has started

2. Verify that the containers are up and running without restarting by executing the command docker ps

3. Browse to the Trino web UI at http://<server-ip-running trino-container>:8080. In my setup, I browse to http://130.1.1.2:8080

  • On the login screen, enter any username
  • Make sure that you can see 1 active worker
Trino UI

Download the Trino CLI

Please note that the Trino CLI version must match the version of the Trino server that is running. In our case, we are running Trino version 400, so we will download Trino CLI version 400.

  1. Log in to the Linux box
  2. Navigate to the folder /data/poc by executing the command cd /data/poc
  3. Download Trino-CLI version 400 by executing the command wget https://repo1.maven.org/maven2/io/trino/trino-cli/400/trino-cli-400-executable.jar

Alternatively, you can download the trino-cli-400-executable.jar from the following location:

4. Change the trino-cli-400-executable.jar file permissions to make it executable by executing the command chmod 777 trino-cli-400-executable.jar

Create a Schema and Table, and Run Queries

In this section, we will use the Trino CLI to create schemas and tables on the existing delta catalog and run queries on both partitioned and non-partitioned data. These steps will guide you through setting up your schemas and running SQL queries efficiently.

  1. Connect to the Trino server using the Trino CLI
  • Navigate to /data/poc by executing the command cd /data/poc
  • Establish the connection by executing the command
./trino-cli-400-executable.jar --server http://130.1.1.2:8080 --catalog delta --debug --insecure
  • Change the server IP to the one where your Trino container is running
  • As part of the command, we connect to the delta catalog

2. Verify that we can query the catalogs by executing the command SHOW CATALOGS;

Trino — show CATALOGS

Non-partitioned data

3. Create a schema named “non_partition” on the delta catalog by executing the command CREATE SCHEMA delta.non_partition;

  • Confirm that the CREATE SCHEMA message appears
Create Schema
  • List the schemas by executing the command SHOW SCHEMAS;
show SCHEMA

4. Create a table named “table_non_partition” by executing the command CREATE TABLE IF NOT EXISTS non_partition.table_non_partition (dummy bigint) WITH (location = 's3://spark-delta-lake/non-partition');

  • non_partition — is the schema that we have created
  • table_non_partition — is the table name we wish to create.
  • s3://spark-delta-lake/non-partition — This points to the non-partitioned data we created on MinIO S3 in the previous blog
Trino — Create table

5. Run the query by executing the command SELECT * FROM non_partition.table_non_partition LIMIT 5;
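
Optionally, a couple of extra queries can be run at this point to inspect the registered table; this is a hedged sketch that assumes the table was created exactly as in step 4:

-- Show how Trino registered the table (columns and the MinIO location)
SHOW CREATE TABLE non_partition.table_non_partition;

-- Count all rows in the non-partitioned Delta Lake data
SELECT count(*) FROM non_partition.table_non_partition;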

Partitioned data

3. Create a schema named “partitioned” on the delta catalog by executing the command CREATE SCHEMA delta.partitioned;

  • Confirm that the CREATE SCHEMA message appears
  • List the schemas by executing the command SHOW SCHEMAS;

4. Create a table named “table_partitioned” by executing the command CREATE TABLE IF NOT EXISTS partitioned.table_partitioned (dummy bigint) WITH (location = 's3://spark-delta-lake/partition', partitioned_by = ARRAY['ID']);

  • partitioned — is the schema that we have created
  • table_partitioned — is the table name we wish to create.
  • s3://spark-delta-lake/partition — This points to the partitioned data we created on MinIO S3 in the previous blog

5. Run the query by executing the command SELECT * FROM partitioned.table_partitioned LIMIT 5;
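
Since the table is partitioned by the ID column, adding a filter on that column lets Trino prune partitions and read only the matching folders under s3://spark-delta-lake/partition. The queries below are a hedged example; replace 1 with an ID value (of the matching type) that actually exists in the data generated in the previous blog:

-- Partition pruning: only the folder(s) for the matching ID value are scanned
SELECT * FROM partitioned.table_partitioned WHERE ID = 1 LIMIT 5;

-- Row count per partition
SELECT ID, count(*) FROM partitioned.table_partitioned GROUP BY ID;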

Congratulations!

Across these two blogs, you’ve generated data with Spark in Delta Lake format (both partitioned and non-partitioned), stored it on MinIO S3 storage, and queried it using Trino and Hive. This setup gives you scalable storage (MinIO), a scalable ingestion path for creating data on that storage (Spark), and, with Trino and Hive, a powerful and scalable SQL engine for efficiently querying large datasets, all on a cost-effective and robust infrastructure.

It’s important to note that while Delta Lake is widely used, the market is increasingly moving toward the Iceberg format. The exercise we performed here can also support Iceberg by updating the necessary JARs and Python scripts and configuring a new Trino connector that supports the Iceberg format. The next blog will show the same process with the Iceberg format — make sure to follow.

If you liked this blog, don’t forget to clap and follow me on both Medium and LinkedIn.

www.linkedin.com/in/davidzbeda

Extras

Kubernetes

Instead of Docker Compose, the same solution can run on a Kubernetes cluster. The Kubernetes manifest files can be found in the following location:

References

Streamlining Big Data with Spark: Writing and Reading Delta Lake Format on MinIO-S3 Storage — https://medium.com/@dudu.zbeda_13698/streamlining-big-data-with-spark-writing-and-reading-delta-lake-format-on-minio-1700060eaa72

Git repo: https://github.com/dzbeda/spark-delta_lake-trino-hive-minio.git

Trino web site: https://trino.io/

S3 Browser application: https://s3browser.com/
