Quickstart

Welcome to Kerblam! This page will give you a hands-on introduction. If you like what you see, you can check out the manual to learn all there is to know about Kerblam!

To follow along, make sure you are comfortable with the command line, and install Kerblam!, wget and Python. You will also need a text editor of some sort. If you want to follow along with the containerization section, also install Docker and be sure you can run it.

For this test project we will use Python to process some toy input data and make a simple plot. We will create a simple shell workflow to handle the execution, and showcase many of Kerblam!'s features.

Making a new project

Move to a directory where you want to store the new project and run:

kerblam new test-project

Kerblam! asks you some setup questions:

  • If you want to use Python;
  • If you want to use R;
  • If you want to use pre-commit;
  • If you have a GitHub account, and would like to set the origin of your repository to point to github.com.

Say 'yes' to all of these questions to follow along. Kerblam! will then:

  • Create the project directory,
  • initialise it as a new git repository,
  • create the kerblam.toml file,
  • create all the default project directories,
  • make an empty .pre-commit-config file for you,
  • create a venv environment, as well as the requirements.txt and requirements-dev.txt files (if you opted to use Python),
  • and set up the .gitignore file with appropriate ignores.

You can now start working in your new project: simply cd test-project. Since Kerblam! took care of making a virtual environment, use source env/bin/activate to start working in it.

Take a moment to look at the structure of the project. Note the kerblam.toml file, which marks this project as a Kerblam! project (akin to the .git folder for git).

tip

You could use tree . to do this. See the tree utility.
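For reference, a freshly created project (with 'yes' to every question) should look roughly like this. This is only a sketch based on the files mentioned above; the exact contents may differ slightly between Kerblam! versions:

test-project
├── kerblam.toml
├── requirements.txt
├── requirements-dev.txt
├── .pre-commit-config
├── .gitignore
├── env/
├── data/
│   ├── in/
│   └── out/
└── src/
    ├── dockerfiles/
    └── workflows/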

Get input data

The input data we will use is available online in this gist. It is the famous Iris dataset from Fisher, R. A. (1936), "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179–188, as reported by R's data(iris) command.

We can use Kerblam! to fetch input data. Open the kerblam.toml file and add this at the bottom:

[data.remote]
"https://gist.githubusercontent.com/MrHedmad/261fa39cd1402eaf222e5c1cdef18b3e/raw/0c2ad0228a1d7e7b6f01268e4ee2ee01a55c9717/iris.csv" = "iris.csv"
"https://gist.githubusercontent.com/MrHedmad/261fa39cd1402eaf222e5c1cdef18b3e/raw/0c2ad0228a1d7e7b6f01268e4ee2ee01a55c9717/test_iris.csv" = "test_iris.csv"

note

The benefit of letting Kerblam! handle data retrieval for you is that, later, it can delete this remote data to save disk space.

Save the file and run

kerblam data fetch

Kerblam! will fetch the data and save it in data/in. You can check how your disk is being used with kerblam data. You'll see a summary like this:

>> kerblam data
./data	0 B [0]
└── in	4 KiB [2]
└── out	0 B [0]
──────────────────────
Total	4 KiB [2]
└── cleanup	4 KiB [2] (100.00%)
└── remote	4 KiB [2]

Write the processing logic

We will take the input Iris data and make a simple plot. Kerblam! has already set up your repository to use the src/ folder, so we can start writing code in it.

Save this Python script in src/process_csv.py:

import sys

import pandas as pd
import matplotlib.pyplot as plt

# The script takes two arguments: the input .csv file and the output image path.
print(f"Reading data from {sys.argv[1]}")

iris = pd.read_csv(sys.argv[1])
species = iris['Species'].unique()
colors = ['blue', 'orange', 'green']

plt.figure(figsize=(14, 6))

# Overlay a petal-length histogram for each species...
for spec, color in zip(species, colors):
    subset = iris[iris['Species'] == spec]
    plt.hist(subset['Petal.Length'], bins=20, alpha=0.6, label=f'{spec} Petal Length', color=color, edgecolor='black')

# ... and a hatched petal-width histogram on the same axes.
for spec, color in zip(species, colors):
    subset = iris[iris['Species'] == spec]
    plt.hist(subset['Petal.Width'], bins=20, alpha=0.6, label=f'{spec} Petal Width', color=color, edgecolor='black', hatch='/')

plt.title('Distribution of Petal Length and Width')
plt.xlabel('Size (cm)')
plt.ylabel('Frequency')
plt.legend(title='Species and Measurement', loc='upper right')

plt.tight_layout()

print(f"Saving plot to {sys.argv[2]}")

plt.savefig(sys.argv[2], dpi=300)

Since we are in a Python virtual environment, we can use pip install pandas matplotlib to install the required packages for this script.
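If you want to be able to recreate this environment later, it is also worth recording these dependencies in the requirements.txt file that Kerblam! generated. A minimal sketch (versions left unpinned here; pin them if you prefer):

pandas
matplotlib

A fresh environment can then be restored with pip install -r requirements.txt.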

We can run the Python script with a command like this:

python src/process_csv.py data/in/iris.csv data/out/plot.png

Try it now! It should create a data/out/plot.png file which you can inspect manually.
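For example, you can check that the plot landed in data/out and open it with your usual image viewer (xdg-open works on most Linux desktops; macOS users can use open instead):

ls data/out
xdg-open data/out/plot.png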

Create and run a workflow

This is a very simple example, but a lot of day-to-day data analysis is relatively straightforward. In this case, we do not need a rich workflow manager: a bash script does the trick.

We can let Kerblam! handle the execution through Bash. Create the src/workflows/create_iris_plot.sh file and write in the command from above:

python src/process_csv.py data/in/iris.csv data/out/plot.png

Now try it out! Run kerblam run create_iris_plot and see what happens. Kerblam! has handled moving your workflow to the top level of the project (otherwise the command would not work, since it uses relative paths!) and executed bash to run the command.
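The workflow is an ordinary shell script, so the usual shell conventions apply. If it ever grows beyond a single command, it can be worth making it stop at the first failure; here is a sketch of the same workflow with the common safety flags added:

#!/usr/bin/env bash
# Stop on the first error, on undefined variables, and on failed pipes.
set -euo pipefail

python src/process_csv.py data/in/iris.csv data/out/plot.png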

Swap input files

We also downloaded a test_iris.csv dataset. We might want to use it to create the same plot. We could edit create_iris_plot.sh to change the input file, or copy-and-paste it into a new create_test_iris_plot.sh, but that would be verbose, tedious and error-prone.

Instead, we can use Kerblam! profiles to do this for us. The test profile requires no configuration, so go right ahead and run

kerblam run create_iris_plot --profile test

and see how the plot.png file changes (the test data has fewer entries, so the plot should be less dense).
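The test profile works by swapping each input file with its test_-prefixed counterpart, so test_iris.csv takes the place of iris.csv. You can also define your own profiles in kerblam.toml; the snippet below is only a sketch, assuming a hypothetical alternative_iris.csv file and the [data.profiles] table syntax (check the manual for the authoritative details):

[data.profiles.alternative]
"iris.csv" = "alternative_iris.csv"

Running kerblam run create_iris_plot --profile alternative would then run the same workflow on the swapped file.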

Run in a container

note

To run the examples in this section, you must have Docker installed and configured.

Our analysis is complete, but a bare bash script is not very reproducible. Kerblam! helps here: we can run everything in a Docker container pretty easily.

Create the src/dockerfiles/create_iris_plot.dockerfile file, and write in:

FROM python:latest

RUN pip install pandas matplotlib

COPY . .

There is no need to reference the actual workflow in the dockerfile. Kerblam! takes care of everything for you.
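Since the goal is reproducibility, you may also want to pin the base image and package versions rather than relying on latest. A sketch, where the exact versions are just placeholders for whatever your analysis actually used:

FROM python:3.12-slim

RUN pip install pandas==2.2.3 matplotlib==3.9.2

COPY . .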

To reduce the size of the image, it's a good idea to create a .dockerignore file in the root of the project (next to kerblam.toml). We can safely exclude the data and env folders from the container:

data
env

Now we can run again:

kerblam run create_iris_plot

Kerblam! automatically picks up that a dockerfile is present for this workflow, builds the image, and uses it as the runtime.

tip

You can use profiles even with Docker containers!

Packaging data

We're done, and we'd like to share our work with others. The simplest way is to send them the output. Kerblam! can help. Run the following:

kerblam data pack

Kerblam! creates the data/data_export.tar.gz file with all the output data of your project (plus any input data that cannot be fetched from the internet, although this example has none). You can share this tarball with colleagues quite easily.
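If you want to double-check what went into the archive before sharing it, you can list its contents with standard tools:

tar -tzf data/data_export.tar.gz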

Packaging execution

If you used a container, you can also have Kerblam! package the workflow proper for you. Just run:

kerblam package create_iris_plot --tag my_test_container

Kerblam! will create the my_test_container image (so you can upload it to the registry of your choice) and a tarball with the necessary input data plus other bits and bobs that are needed to replay your run: the replay package.
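If you want to confirm that the image was built before uploading it (assuming it is named exactly my_test_container, as the --tag above suggests), you can list it with Docker:

docker images my_test_container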

Speaking of replay, you can do just that from the tarball you just created by using kerblam replay:

# Let's move to an empty directory
mkdir -p test-replay && cd test-replay
kerblam replay ../

Kerblam! unpacks the tarball for you, creates a dummy project directory, fetches the remote input data and runs the pipeline in the correct Docker container, automatically.

You can use replay packages manually too: they include the kerblam binary that created them, so whoever reproduces your work does not have to leave anything to chance.

Cleaning up

We're done! The output is sent to the reviewers, together with the replay package, and we can close up shop.

If you don't want to completely delete the project, you can make it lightweight by using Kerblam!.

Run:

kerblam data clean

Kerblam! will clean out all output data, intermediate data (in data/), and input data that can be fetched remotely, saving you disk space for dormant projects.

Conclusions

Hopefully this toy example got you excited to use Kerblam!. It only showcases some of Kerblam!'s features. Read the manual to learn all there is to know about how Kerblam! can make your life easier.