Manage Datasets With LakeFS — Install & Get Started

David (Dudu) Zbeda
16 min read · Jun 21, 2024


This blog was created after I got a request to manage datasets for our LLM model. My first question was: why, what is wrong with Git? My second question was: what are you doing today?

For the first question, it turns out that datasets are huge files that cannot be managed in Git — believe me, I have tried. It is practically impossible to clone huge files (4 GB and more) from Git.

For the second question, today the research team manages the datasets in folders. For each change, the developers open a new folder — so no version control for you. It is hard to manage and very hard to understand what changed.

I do believe that from time to time I will update this page with more information — so don't forget to check back for changes.

Introduction

In simple words

lakeFS is Git-like version control for your machine-learning datasets. It lets you clone dataset records, track changes, revert to previous versions, and collaborate on datasets easily.

With lakeFS, you can experiment with machine learning models faster and more safely, understand your data better, and reproduce successful models for real-world use.

The same introduction in more sophisticated words

The realm of machine learning thrives on high-quality, well-managed datasets. But as your datasets grow in size and complexity, ensuring their integrity and reproducibility becomes a significant hurdle. Traditional data lake storage, while offering scalability, often lacks the version control and collaborative features essential for robust machine learning pipelines.

Enter lakeFS, an open-source platform that bridges the gap between data lakes and the rigorous version control practices of software development. By introducing Git-like functionalities to data management, lakeFS empowers you to:

  • Streamline Experimentation: Rapidly iterate on your machine learning models by creating isolated branches for testing new features or data preprocessing techniques. Revert to previous versions seamlessly if experiments go awry.
  • Maintain Data Lineage: Track changes made to your datasets meticulously, ensuring you understand the origin and transformations applied to your training data. This enhances model interpretability and facilitates debugging.
  • Boost Collaboration: Enable seamless collaboration among data scientists and engineers. Team members can work on separate branches, test modifications in isolation, and merge changes efficiently.
  • Guarantee Reproducibility: Reproducing successful machine learning models is crucial for real-world deployment. lakeFS allows you to recreate specific dataset versions used to train your models, ensuring consistent results across environments.
  • Minimize Errors and Costs: Version control mitigates the risk of accidentally corrupting or modifying crucial training data. Roll back to previous versions quickly and minimize the impact of potential errors.

In short, lakeFS empowers you to manage your machine learning datasets with the same control and precision you expect from your codebase. By adopting this approach, you can significantly accelerate your development cycles, improve the reliability of your models, and unlock the full potential of your machine learning projects.

Blog Goals

In this blog, we will install the on-premise lakeFS platform. The setup is based on Docker Compose. We will:

  • Install the lakeFS platform
  • Integrate the lakeFS platform with Postgres and MinIO
  • Integrate pgAdmin with Postgres (optional)
  • Create users on the lakeFS platform
  • Create a new repository on lakeFS
  • Make changes and commit to a lakeFS branch
  • Merge branches, and more

How it works — some insights

As I said, I'm not an expert in lakeFS, but from the short time I have spent playing with the platform, I gathered the following insights:

  • When creating a repo & branches, the metadata is saved in the Postgres DB and the content is saved on the MinIO storage
  • To interact with lakeFS when you wish to update your code, you will need the lakeFS client, named lakectl. The tool provides a Git-like command set
  • Code changes, commits, updates, etc. can be done by running the lakectl tool on the developer’s laptop. I didn’t manage to find an IDE solution that can interact with lakeFS.
  • The lakectl tool requires login credentials to access the lakeFS platform. To have the option to blame someone for code changes, please make sure to create a lakeFS user for each developer.
  • When I say that the lakectl command set is Git-like, it is because lakeFS is missing functionality such as local commits, branch checkout, and more
  • Important: since there is no branch checkout, every branch in the repository should be mapped to a local folder on the user’s laptop. I think the best approach is to have a folder that represents the repository name and a subfolder that represents the branch name — see the sketch after this list.
    For example, if we have a repository named “my-repo” with branches “main” & “dev”, you should have the following folders:
    /my-repo/main
    /my-repo/dev
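
For example, the layout above can be created with two commands (a minimal sketch — adjust the base path to wherever you keep your working copies):

# Create one local folder per branch, following the <repo>/<branch> convention
mkdir -p my-repo/main
mkdir -p my-repo/dev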

Prerequisites and Ingredients

Below are all the prerequisites required to run this exercise:

Required prerequisites

  1. A Linux box where we will run the Docker images (Postgres, lakeFS, and pgAdmin) — for this exercise I used Ubuntu 22.04
  2. Docker & Docker Compose installed on the Linux box — you can use the following link: https://docs.docker.com/engine/install/ubuntu/
  3. To enable persistent storage for Postgres and pgAdmin, create the following folders under your preferred base folder — in our exercise the base folder will be /data (for example: mkdir -p /data/postgres-volume /data/pgadmin-volume)
    postgres-volume
    pgadmin-volume
  4. Download lakectl on the Linux box by running the following steps
lakectl and lakefs
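
The screenshot shows the steps I used; as a minimal sketch, lakectl ships inside the lakeFS release archive on GitHub — the version and asset name below are assumptions, so check https://github.com/treeverse/lakeFS/releases for the current ones:

# Download a lakeFS release archive (version and asset name are examples)
curl -LO https://github.com/treeverse/lakeFS/releases/download/v1.25.0/lakeFS_1.25.0_Linux_x86_64.tar.gz
# The archive contains both the lakefs server and the lakectl client binaries
tar -xzf lakeFS_1.25.0_Linux_x86_64.tar.gz
sudo mv lakefs lakectl /usr/local/bin/
# Verify the client is installed
lakectl --version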

5. MinIO server — this exercise assumes that you already have MinIO running

  • Create a bucket on the MinIO server — in our exercise, the bucket will be named “lakefs”
  • It is highly recommended to generate a dedicated S3 access key assigned to the bucket that will be used by the lakeFS platform. This way you can ensure that no other user can write or delete data in the bucket, and that the lakeFS platform will not write data to any other location on MinIO — see the sketch below.
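
One rough way to do this with MinIO’s mc client is sketched below — the alias, keys, policy name, and policy file are illustrative, and the mc admin syntax varies between mc versions:

# Assumes an mc alias named "myminio" already points at your MinIO server
mc mb myminio/lakefs
# Create a dedicated user for lakeFS (keys shown are placeholders)
mc admin user add myminio lakefs-access-key lakefs-secret-key
# Attach a policy granting read/write on the lakefs bucket only
# (lakefs-rw-policy.json is a hypothetical policy file scoped to the lakefs bucket)
mc admin policy create myminio lakefs-rw lakefs-rw-policy.json
mc admin policy attach myminio lakefs-rw --user lakefs-access-key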

Prerequisites verification

  1. To verify that Docker & Docker Compose are installed & running, run the following commands and verify the output:
    docker --version
    docker compose version
Docker and Docker Compose version

2. Browse to your MinIO server and verify that you have a bucket named lakefs — I'm using S3 Browser, which can be downloaded from the following link: https://s3browser.com/download.aspx

S3 browser — verify lakefs folder exists

3. Verify that the folders for Postgres and pgAdmin exist

  4. To verify that lakectl is installed, run the following command and verify the output:
lakectl --version

lakectl verify version

Let’s start working

lakeFS, Postgres & pgAdmin installation

All platforms are installed using Docker Compose. Run the following steps to install them:

  1. Create a new file named docker-compose-lakefs.yml under /data by running the command: touch /data/docker-compose-lakefs.yml
  2. Edit the file and paste the following content — the file includes all relevant parameters and explanations
# Create an internal network that will be used by the different services
networks:
  # Internal network name
  lakefsnetwork:

services:
  # This is the Postgres server name
  postgresdb:
    # Postgres image
    image: postgres
    # In case of a service/container crash, the container will restart
    restart: always
    environment:
      # Specify the username that will be created in the Postgres DB. By default, a DB with the same name is created
      POSTGRES_USER: lakefs
      # Set the password for the lakefs user - I believe in you that you will use a more complex password :-)
      POSTGRES_PASSWORD: 1qaz@WSX
    volumes:
      # Postgres DB data will be saved on the Linux box under /data/postgres-volume
      - /data/postgres-volume:/var/lib/postgresql/data
    # Run the service on the lakefsnetwork internal network
    networks:
      - lakefsnetwork

  pgadmin:
    # pgAdmin image
    image: dpage/pgadmin4
    # In case of a service/container crash, the container will restart
    restart: always
    environment:
      # Specify the username that will be created in pgAdmin - must be an email address
      PGADMIN_DEFAULT_EMAIL: zbeda@zbeda.com
      # Set the password for the zbeda@zbeda.com user - I believe in you that you will use a more complex password :-)
      PGADMIN_DEFAULT_PASSWORD: 1qaz@WSX
    # The pgAdmin UI runs on port 80. To reach pgAdmin from an external browser, port 8080 is mapped to the pgAdmin UI port 80
    ports:
      - 8080:80
    # Run the service on the lakefsnetwork internal network
    networks:
      - lakefsnetwork
    volumes:
      # Map a predefined JSON file that includes the Postgres server connection configuration
      - /data/pgadmin-volume/server.json:/pgadmin4/servers.json

  lakefs:
    # lakeFS image
    image: treeverse/lakefs:latest
    # In case of a service/container crash, the container will restart
    restart: always
    # The Postgres DB must be up for the lakeFS platform to run
    depends_on:
      - postgresdb
    environment:
      # Define the type of database that lakeFS will use for metadata and configuration
      LAKEFS_DATABASE_TYPE: postgres
      # Connection string to the Postgres DB - postgres://<db-username>:<password>@<postgres-server-name>:<postgres-port>/<db-name>
      # Note: special characters in the password (such as @) must be URL-encoded, so 1qaz@WSX becomes 1qaz%40WSX
      LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING: postgres://lakefs:1qaz%40WSX@postgresdb:5432/lakefs
      # Encryption key that will be used for data encryption
      LAKEFS_AUTH_ENCRYPT_SECRET_KEY: 1qaz@WSX
      # Define the type of storage that lakeFS will use to save content. In our case we are using MinIO (s3)
      LAKEFS_BLOCKSTORE_TYPE: s3
      # This value is required when integrating with MinIO
      LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE: "true"
      # MinIO server endpoint & main bucket name. If you do not add the bucket name, lakeFS repos will be created under the main storage path
      LAKEFS_BLOCKSTORE_S3_ENDPOINT: http://10.130.1.1:9000/lakefs
      # This value is required when integrating with MinIO
      LAKEFS_BLOCKSTORE_S3_DISCOVER_BUCKET_REGION: "false"
      # MinIO access key
      LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID: GkdadsadsaovZ4pBHjdasdsa
      # MinIO secret key
      LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY: qQ3dsdssCUmjTfSFpdsds2TPtZaLfSNpgasJ
    # The lakeFS UI runs on port 8000. To reach lakeFS from an external browser, port 8000 is mapped to the lakeFS UI port 8000
    ports:
      - 8000:8000
    # Run the service on the lakefsnetwork internal network
    networks:
      - lakefsnetwork
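
Before starting the stack, you can let Docker Compose validate the file and print the resolved configuration:

# Validate the compose file syntax and print the resolved configuration
docker compose -f /data/docker-compose-lakefs.yml config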

3. To avoid manual configuration of the Postgres DB connectivity, create in advance a JSON file that includes the connection parameters — the parameters are based on the same values found in the docker-compose-lakefs.yml file. Run the following steps to define the JSON file:

  • Connect to the Linux box
  • Navigate to /data/pgadmin-volume by running the command: cd /data/pgadmin-volume
  • Create a new file named server.json by running the command: touch /data/pgadmin-volume/server.json
  • Update the file with the following content
{
  "Servers": {
    "1": {
      "Name": "Postgres Server",
      "Group": "Servers",
      "Host": "postgresdb",
      "Port": 5432,
      "MaintenanceDB": "postgres",
      "Username": "lakefs",
      "Password": "1qaz@WSX",
      "SSLMode": "prefer",
      "ConnectNow": true
    }
  }
}

4. Start downloading the images and run the platforms by running the command (from /data): docker compose -f docker-compose-lakefs.yml up

Image download
Postgres is running
LakeFS is running
Pgadmin is running
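
To confirm that all three containers are up, you can also check the service status from another terminal:

# List the services defined in the compose file and their current state
docker compose -f /data/docker-compose-lakefs.yml ps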

Running pgAdmin

pgAdmin is a DB client UI tool that lets you connect to the Postgres DB — please note that pgAdmin is not mandatory for running lakeFS

Run the following steps to connect to the pgAdmin UI:

  1. Open your browser
  2. Navigate to http://<Linux-box-ip>:8080
  3. Enter the username and password defined in the docker-compose file
pgAdmin login

4. Click on the server connection and choose the Postgres server. In the “Connect to Server” window, enter the lakefs user password — 1qaz@WSX

Server connection based on the server.json

Running lakeFS

Run the following steps to connect to the lakeFS UI:

  1. Open your browser
  2. Navigate to http://<Linux-box-ip>:8000
lakeFS — First time login

3. To generate the admin user credentials, enter a user email & click Setup

4. Copy the admin user credentials

lakefs — generated user credentials

Congrats!!! The lakeFS platform is up and running

lakeFS — Let's create your first repository

  1. Open your browser
  2. Navigate to http://<Linux-box-ip>:8000
  3. Enter the admin credentials generated in the previous step
  4. Click on create sample repository
lakeFS — create sample repository

5. Update the following parameters:

  • Repo name: zbeda-sample-repo
  • Default branch: you can use any name; the default is main
  • Storage namespace: use the following convention: s3://<repo-name>/
    Please note: since we added the http://10.130.1.1:9000/lakefs S3 endpoint under the lakeFS environment configuration (docker-compose-lakefs.yml), by default the defined repo will be created under the lakefs bucket
lakeFS — Create repo parameters
lakeFS — New repo
MINIO — Repo content on minio

Congrats!!! You have created your first repository in lakeFS

Create new user and configure lakectl

In this section, we will create a developer user on the lakeFS platform & configure the lakectl tool on the developer's laptop. We will call our developer user “duck” — why “duck”? It is the first thing I saw on my desk.

duck user — was promoted to a developer

Create a new user

  1. Open your browser
  2. Navigate to http://<Linux-box-ip>:8000
  3. Login with Admin credentials
  4. Click on Administration tab → users → Create user
Create user page

5. In the Create User window, enter the username duck & click Create

6. From the list click on user “duck”

7. Click on Add user to Group

8. Select the required roles & click Add to Group

9. Click on the Access Credentials tab and Create Access Key

10. Download the keys, and send them to user “duck”

Configure lakectl

At this stage, user “duck” is required to download the lakectl binary to their laptop — instructions for downloading and installing lakectl can be found in the prerequisites section. In this exercise, I installed lakectl on Ubuntu.

The following steps need to be performed on user “duck”’s laptop:

  1. Configure lakectl by running: lakectl config
  2. At the prompts, enter the following (a sample of the resulting config file is shown below):
    Access Key: the access key you generated for user “duck”
    Secret access key: the secret key you generated for user “duck”
    Server endpoint: http://<Linux-BOX-IP-Running-lakeFS>:<exposed-port>/api/v1

3. To verify connectivity, run the lakectl repo list command — it lists all repos available on the lakeFS platform

lakectl repo list
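
For reference, lakectl config stores your answers in ~/.lakectl.yaml, which looks roughly like this (the values below are placeholders):

credentials:
  access_key_id: AKIAIOSFODNN7EXAMPLE
  secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
server:
  endpoint_url: http://<Linux-BOX-IP-Running-lakeFS>:8000/api/v1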

User “duck” can now interact with the lakeFS platform using the lakectl command tool (Git-like)

Code management — Git-like commands

In this section, we will perform actions using the lakectl tool that simulate a developer's work. The entire section should be run on user “duck”’s laptop.

  1. Create a new folder named lakefsdata by running the command: mkdir lakefsdata. In this folder, we will clone our repo.

Create new repository

  1. Run the command: lakectl repo create lakefs://repo-1/ s3://repo-1/
  • This command creates a repo-1 repository on the lakeFS platform and a repo-1 folder in S3. By default, a main branch is created
  • Verify that the repository was created by running the command: lakectl repo list
lakectl Create repo
lakeFS platform — Repo list
main default branch

Clone repository

  1. Create the local folder for the main branch under your main folder lakefsdata by running the command: mkdir -p lakefsdata/repo-1/main
  2. Navigate to the lakefsdata/repo-1/main folder
  3. Clone the repo-1 repository from lakeFS by running the command: lakectl local clone lakefs://repo-1/main/
  • The branch name must be specified and must end with /
  • The main branch of the repo-1 repository was cloned, but since the branch doesn’t include any files, the local folder is empty
lakectl local clone

Add file to local folder and commit to destination repository

  1. Add a file to the /lakefsdata/repo-1/main folder — file name first-file.txt, file content “this is my first file” (for example: echo "this is my first file" > first-file.txt)

2. Run the lakectl local status command to see the changes between your local folder and the remote repository

  • first-file.txt was added to the local folder
  • After this step, first-file.txt is not yet available in the remote repository
lakectl local status

3. Commit by running: lakectl local commit -m "Adding my first file"

  • This command adds a commit message and uploads the first-file.txt file to the remote repository under the main branch
  • Running the command lakectl local status will show that no differences were found between the remote repository and the local folder
lakectl commit
File was added to repo
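
Putting steps 1–3 together, the whole flow from the terminal looks roughly like this:

cd lakefsdata/repo-1/main
# Create the file locally
echo "this is my first file" > first-file.txt
# Show the differences between the local folder and the remote main branch
lakectl local status
# Upload the new file to the remote main branch with a commit message
lakectl local commit -m "Adding my first file"
# Running status again should now report no differences
lakectl local status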

Create a new branch from main branch & clone it

  1. Create a new branch named branch-1 by running the command: lakectl branch create lakefs://repo-1/branch-1 -s lakefs://repo-1/main/
  • This command creates a new branch named branch-1 from the main branch
  • Running this command does not download any files from the remote repository to the local folder
lakectl create branch from source branch
new branch created from main branch

2. Create a new folder /lakefsdata/repo-1/branch-1. This folder will represent branch-1.

3. Clone branch-1 to the local folder /lakefsdata/repo-1/branch-1 by running the command: lakectl local clone lakefs://repo-1/branch-1/

  • Make sure to navigate to the branch-1 folder before running the command
  • The branch-1 branch was cloned to the /lakefsdata/repo-1/branch-1 local folder; therefore, all files from the remote branch were downloaded to the local folder
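
To summarize, the branch workflow from the terminal (following the folder convention from the insights section):

# Create branch-1 from main on the lakeFS server (no local files are touched)
lakectl branch create lakefs://repo-1/branch-1 -s lakefs://repo-1/main/
# Create a matching local folder and clone the new branch into it
mkdir -p lakefsdata/repo-1/branch-1
cd lakefsdata/repo-1/branch-1
lakectl local clone lakefs://repo-1/branch-1/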

Update file in branch-1

  1. Update first-file.txt by adding the string “but modified” to the file content

2. Run the lakectl local status command to see the changes between your local folder and the remote repository (on branch-1)

3. Upload the file from the local folder to the remote repository by running the command: lakectl local commit -m "first-file.txt was modified"

  • After running this command, we can see that the file was modified

Merge branches

  1. Add a new file “second-file.txt” to branch-1

2. Upload it to the remote repository (branch-1) by running: lakectl local commit -m "second-file.txt was modified"

3. To merge branch-1 into the main branch, run the following command: lakectl merge lakefs://repo-1/branch-1 lakefs://repo-1/main/

lakectl merge branch

main branch before merge

main before merge
First-file content on main before merge

main branch after merge

main after merge
First-file content on main after merge

Sync data from remote repository — main branch

  1. Navigate to the /lakefsdata/repo-1/main folder
  2. Run ls
  • first-file.txt is not yet updated with the new content

3. Run lakectl local status

  • The output shows that the “first-file.txt” file was modified and a new file, “second-file.txt”, was added

4. To sync the remote branch to your local folder, run the command: lakectl local pull

lakectl local pull
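
The sync flow in one place:

cd lakefsdata/repo-1/main
# Reports that first-file.txt changed and second-file.txt was added on the remote branch
lakectl local status
# Download the remote changes into the local folder
lakectl local pull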

lakeFS UI exploration

  1. Open your browser
  2. Navigate to http://<Linux-box-ip>:8000
  3. Enter the admin credentials generated earlier
  4. In the Repositories page click on repo-1 repository & select the main branch
  5. Click on the Commits tab — a list of all the commits in this branch is shown (including commits that were added from branch-1 after the merge)
  6. Click on “first-file.txt was modified” commit
lakeFS show branch commits

7. Click on the Show Object Changes link to see the changes that were made to the file

lakeFS show commit details
lakeFS show changes

8. From your branch, click on the configuration button → Blame. The output shows the last commit and the user that issued the commit

lakeFS blame
Blame output

If you liked this blog, don't forget to clap and follow me on both Medium and LinkedIn

www.linkedin.com/in/davidzbeda

More to come

lakeFS client — as part of the prerequisites we installed the lakeFS client, but we didn't use it in our exercise. At a high level, the lakeFS client is:

  • Type: Python library
  • Functionality: provides programmatic access to the lakeFS API
  • Use cases:
    - Integrate lakeFS functionality into custom applications
    - Automate tasks like uploading, downloading, and managing branches
    - Build higher-level tools on top of lakeFS

Integration

References

lakeFS Architecture: https://docs.lakefs.io/understand/architecture.html


Written by David (Dudu) Zbeda

DevOps | Infrastructure Architect | System Integration | Professional Services | Leading Teams & Training Future Experts | Linkedin: linkedin.com/in/davidzbeda
