If you want it, Kerblam it!


Kerblam! is a project management system for data analysis projects.

Wherever you have input data that needs to be processed to obtain some output, Kerblam! can help you out by dealing with the more tedious and repetitive parts of working with data for you, letting you concentrate on getting things done.

Kerblam! lets you work reproducibly, rigorously and quickly for big and small projects alike: from routine data analysis to large, multi-workflow projects.

Kerblam! is a Free and Open Source Software, hosted on Github at MrHedmad/kerblam. The code is licensed under the MIT License.

This website hosts the documentation of Kerblam. Use the sidebar to jump to a specific section. If you have never used Kerblam! before, you can read the documentation from start to finish to learn all there is to know about Kerblam! by clicking on the arrows on the side of the page.

Kerblam! is very opinionated. To read more about why these choices were made, you can read the Kerblam! philosophy.

About

This page aggregates a series of meta information about Kerblam!.

License

The project is licensed under the MIT License. See the Choose a License entry for the MIT License for a quick summary of its terms.

Citing

If you want or need to cite Kerblam!, provide a link to the Github repository or use the following Zenodo DOI: doi.org/10.5281/zenodo.10664806.

Naming

This project is named after the fictitious online shop/delivery company in S11E07 of Doctor Who. Kerblam! might be referred to as Kerblam!, Kerblam or Kerb!am, interchangeably, although Kerblam! is preferred. The Kerblam! logo is written in the Kwark Font by tup wanders.

About this book

This book is rendered by mdbook, and is written as a series of markdown files. Its source code is available in the Kerblam! repo under the ./docs/ folder.

The book hosted online always refers to the latest Kerblam! release. If you are looking for older or newer versions of this book, you should read the markdown files directly on Github, where you can select which tag to view from the top bar, or clone the repository locally, checkout to the commit you like, and rebuild from source. If you're interested, read the development guide to learn more.

Installation

You have a few options when installing Kerblam!.

Requirements

Currently, Kerblam! only supports macOS (both Intel and Apple silicon) and GNU/Linux. Other Unix/Linux flavours may work, but are untested. Kerblam! also relies on a few binaries that it assumes are already installed and visible in your $PATH:

If you can use git, make, bash and docker or podman from your CLI, you're good to go!

Most if not all of these tools come pre-packaged in most linux distros. Check your repositories for them.

You can find and download a Kerblam! binary for your operating system in the releases tab.

There are also helpful scripts that automatically download the correct version for your specific operating system thanks to cargo-dist. You can always install or update to the latest version with:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/MrHedmad/kerblam/releases/latest/download/kerblam-installer.sh | sh

Be warned that the above command executes a script downloaded from the internet. You can follow the URL above to download the installer script and inspect it before you run it, if you'd like.

Install from source

If you want to install the latest version from source, install Rust and cargo, then run:

cargo install kerblam

If you wish to instead use the latest development version, run:

cargo install --git https://github.com/MrHedmad/kerblam.git

The main branch should always compile on supported platforms with the above command. If it does not, please open an issue.

Adding the Kerblam! badge

You can add a Kerblam! badge in the README of your project to show that you use Kerblam! Just copy the following code and add it to the README:

![Kerblam!](https://img.shields.io/badge/Kerblam!-v0.5.1-blue?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAABlVBMVEUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADW1tYNDHwcnNLKFBQgIB/ExMS1tbWMjIufDQ3S0tLOzs6srKyioqJRUVFSS0o0MjIBARqPj48MC3pqaWkIB2MtLS1ybm3U1NS6uroXirqpqamYmJiSkpIPZ4yHh4eFhIV8fHwLWnuBe3kMC3cLCnIHBlwGBlgFBU8EBEVPRkICAi4ADRa+EhIAAAwJCQmJiYnQ0NDKysoZkMK2trYWhLOjo6MTeKMTd6KgoKCbm5uKiIaAgIAPDHhubm4JT20KCW0KCWoIS2cHBUxBQUEEAz9IQT4DAz0DKTpFPTgCAjcCASoBASAXFxcgGRa5ERG1ERGzEBCpDw+hDg4fFA2WDAyLCgouAQFaWloFO1MBHStWBATnwMkoAAAAK3RSTlMA7zRmHcOuDQYK52IwJtWZiXJWQgXw39q2jYBgE/j2187JubKjoJNLSvmSt94WZwAAAvlJREFUSMeF1GdXGkEUgOGliIgIorFH0+u7JBIChEgJamyJvWt6783eS8rvzszAusACvp88x4d7hsvsaqdU57h8oQnobGmtb6xMzwbOkV9jJdvWBRwf7e9uLyzs7B3+o7487miC+AjcvZ3rkNZyttolbKxPv2fyPVrKYKcPhp7oIpPv0FkGN5N5rmd7afAFKH0MH99DihrTK2j3RTICF/Pt0trPUr9AxXyXpkJ3xu6o97tgQJDQm+Xlt6E8vs+FfNrg6kQ1pOuREVSPoydf9YjLpg14gMW1X0IInGZ+9PWr0Xl+R43pxzgM3NgCiekvqfE50hFdT7Ly8Jbo2R/xWYNTl8Ptwk6lgsHUD+Ji2NMlBFZ8ntzZRziXW5kLZsaDom/0yH/G+CSkapS3CvfFCWTxJZgMyqbYVLtLMmzoVywrHaPrrNJX4IHCDyCmF+nXhHXRkzhtCncY+PMig3pu0FfzJG900RBNarTTxrTCEwne69miGV5k8cPst3wOHSfrmJmcCH6Y42NEzzXIX8EFXmFE/q4ZXJrKW4VsY13uzqivF74OD39CbT/0HV/1yQW9Xn8e1O0w+WAG0VJS4P4Mzc7CK+2B7jt6XtFYMhl7Kv4YWMKnsJkXZiW3NgQXxTEKamM2fL8EjzwGv1srykZveBULj6bBZX2Bwbs03cXTQ3HAb9FOGNsS4wt5fw9zv0q9oZo54Gf4UQ95PLbJj/E1HFZ9DRgTuMecPgjfUqlF7Jo1B9wX+JFxmMh7mAoGv9B1pkg2tDoVl7i3G8mjH1mUN3PaspJaqM1NH/sJq2L6QJzEZ4FTCRosuKomdxjYSofDs8DcRPZh8hQd5IbE3qt1ih+MveuVeP2DxOMJAlphgSs1mt3GVWO6yMNGUDZDi1uzJLDNqxbZDLab3mqQB5mExtLYrtU45L10qlfMeSbVQ91eFlfRmnclZyR2VcB5y7pOYhouuSvg2rxHCZG/HHZnsVkVtg7NmkdirS6LzbztTq1EPo9dXRWxqtP7D+wL5neoEOq/AAAAAElFTkSuQmCC&link=https%3A%2F%2Fgithub.com%2FMrHedmad%2Fkerblam)

The above link is very long because the Kerblam! logo is baked in as a base64 image. You can update the badge's version by editing the link directly (e.g. change v0.5.1 to v0.4.0).

Quickstart

Welcome to Kerblam! This page will give you a hands-on introduction. If you like what you see, you can check out the manual to learn all there is to know about Kerblam!

To follow along, be sure to be comfortable in the command line and install Kerblam!, as well as wget and Python. Have access to a text editor of some sort. If you want to follow along with the containerization section, also install Docker and be sure you can run it.

For this test project we will use Python to process some toy input data and make a simple plot. We will create a simple shell workflow to handle the execution, and showcase Kerblam!'s main features.

Making a new project

Move to a directory where you want to store the new project and run:

kerblam new test-project

Kerblam! asks you some setup questions:

  • If you want to use Python;
  • If you want to use R;
  • If you want to use pre-commit;
  • If you have a Github account, and would like to set up the origin of your repository to point to github.com.

Say 'yes' to all of these questions to follow along. Kerblam! will then:

  • Create the project directory,
  • initialise it as a new git repository,
  • create the kerblam.toml file,
  • create all the default project directories,
  • make an empty .pre-commit-config file for you,
  • create a venv environment, as well as the requirements.txt and requirements-dev.txt files (if you opted to use Python),
  • and set up the .gitignore file with appropriate ignores.

You can now start working in your new project: simply cd test-project. Since Kerblam! took care of making a virtual environment, use source env/bin/activate to start working in it.

Take a moment to look at the structure of the project. Note the kerblam.toml file, which marks this project as a Kerblam! project (akin to the .git folder for git).

tip

You could use tree . to do this. See the tree utility.
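
For reference, a freshly created project (answering 'yes' to everything) contains roughly the following; the exact layout may vary slightly between Kerblam! versions:

test-project
├── kerblam.toml
├── data
│   ├── in
│   └── out
├── src
│   ├── dockerfiles
│   └── workflows
├── env
├── requirements.txt
├── requirements-dev.txt
├── .pre-commit-config
└── .gitignore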

Get input data

The input data we will use is available online in this gist. It is the famous Iris data from Fisher, R. A. (1936), "The use of multiple measurements in taxonomic problems.", Annals of Eugenics, 7, Part II, 179–188., as reported by R's data(iris) command.

We can use Kerblam! to fetch input data. Open the kerblam.toml file and add at the bottom:

[data.remote]
"https://gist.githubusercontent.com/MrHedmad/261fa39cd1402eaf222e5c1cdef18b3e/raw/0c2ad0228a1d7e7b6f01268e4ee2ee01a55c9717/iris.csv" = "iris.csv"
"https://gist.githubusercontent.com/MrHedmad/261fa39cd1402eaf222e5c1cdef18b3e/raw/0c2ad0228a1d7e7b6f01268e4ee2ee01a55c9717/test_iris.csv" = "test_iris.csv"

note

The benefit of letting Kerblam! handle data retrieval for you is that, later, it can delete this remote data to save disk space.

Save the file and run

kerblam data fetch

Kerblam! will fetch the data and save it in data/in. You can check how your disk is being used with kerblam data. You'll see a summary like this:

>> kerblam data
./data	0 B [0]
└── in	4 KiB [2]
└── out	0 B [0]
──────────────────────
Total	4 KiB [2]
└── cleanup	4 KiB [2] (100.00%)
└── remote	4 KiB [2]

Write the processing logic

We will take the input Iris data and make a simple plot. Kerblam! has already set up your repository to use the src/ folder, so we can start writing code in it.

Save this Python script in src/process_csv.py:

import pandas as pd
import matplotlib.pyplot as plt
import sys

print(f"Reading data from {sys.argv[1]}")

iris = pd.read_csv(sys.argv[1])
species = iris['Species'].unique()
colors = ['blue', 'orange', 'green']

plt.figure(figsize=(14, 6))

for spec, color in zip(species, colors):
    subset = iris[iris['Species'] == spec]
    plt.hist(subset['Petal.Length'], bins=20, alpha=0.6, label=f'{spec} Petal Length', color=color, edgecolor='black')

for spec, color in zip(species, colors):
    subset = iris[iris['Species'] == spec]
    plt.hist(subset['Petal.Width'], bins=20, alpha=0.6, label=f'{spec} Petal Width', color=color, edgecolor='black', hatch='/')

plt.title('Distribution of Petal Length and Width')
plt.xlabel('Size (cm)')
plt.ylabel('Frequency')
plt.legend(title='Species and Measurement', loc='upper right')

plt.tight_layout()

print(f"Saving plot to {sys.argv[2]}")

plt.savefig(sys.argv[2], dpi = 300)

Since we are in a Python virtual environment, we can use pip install pandas matplotlib to install the required packages for this script.

We can run the Python script with a command like this:

python src/process_csv.py data/in/iris.csv data/out/plot.png

Try it now! It should create data/out/plot.png, which you can inspect manually.

Create and run a workflow

This is a very simple example, but a lot of day-to-day data analysis is relatively straightforward. In this case, we do not need a rich workflow manager: a bash script does the trick.

We can let Kerblam! handle the execution through Bash. Create the ./src/workflows/create_iris_plot.sh file and write in the command from above:

python src/process_csv.py data/in/iris.csv data/out/plot.png

Now try it out! Run kerblam run create_iris_plot and see what happens. Kerblam! handles moving your workflow to the top level of the project (otherwise the command would not work: it uses relative paths!) and executes bash to run the command.

Swap input files

We also downloaded a test_iris.csv dataset. We might want to use it to create the same plot. We could edit create_iris_plot to change the input file, or copy-and-paste it into a new create_test_iris_plot, but that would be verbose, tedious and error-prone.

Instead, we can use Kerblam! profiles, which do this for us. The test profile requires no configuration, so go right ahead and run

kerblam run create_iris_plot --profile test

and see how the plot.png file changes (the test data has fewer entries, so the plot should be less dense).
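
Under the hood, the automatic test profile is equivalent to writing something like this in your kerblam.toml (you do not need to add it yourself):

[data.profiles.test]
"iris.csv" = "test_iris.csv"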

Run in a container

note

To run the examples in this section, you must have Docker installed and configured.

Our analysis is complete, but it's not reproducible with a simple bash script. Kerblam! can help here: we can run everything in a Docker container pretty easily.

Create the src/dockerfiles/create_iris_plot.dockerfile file, and write in:

FROM python:latest

RUN pip install pandas matplotlib

COPY . .

There is no need to reference the actual workflow in the dockerfile. Kerblam! takes care of everything for you.

To reduce the size of the image, it's a good idea to create a .dockerignore file in the root of the project (next to kerblam.toml). We can safely exclude the data and env folders from the container:

data
env

Now we can run again:

kerblam run create_iris_plot

Kerblam! automatically picks up that a dockerfile is present for this workflow, builds the image and uses it as the runtime.

tip

You can use profiles even with docker containers!

Packaging data

We're done, and we'd like to share our work with others. The simplest way is to send them the output. Kerblam! can help: run the following:

kerblam data pack

Kerblam! creates the data/data_export.tar.gz file with all the output data of your project (plus any input data that cannot be downloaded from the internet, although this example has none). You can share this tarball with colleagues quite easily.

Packaging execution

If you used a container, you can also have Kerblam! package the workflow proper for you. Just run:

kerblam package create_iris_plot --tag my_test_container

Kerblam! will create the my_test_container image (so you can upload it to the registry of your choice) and a tarball with the necessary input data plus other bits and bobs that are needed to replay your run: the replay package.

Speaking of replay, you can do just that from the tarball you just created by using kerblam replay:

# Let's replay the analysis in a fresh directory
kerblam replay create_iris_plot.kerblam.tar ./test-replay

Kerblam! unpacks the tarball for you, creates a dummy project directory, fetches the remote input data and runs the pipeline for you in the correct docker container, automatically.

You can use replay packages manually too - they include the kerblam binary that created them, so whoever reproduces your work does not have to leave anything to chance.

Cleaning up

We're done! The output is sent to the reviewers, together with the replay package, and we can close up shop.

If you don't want to completely delete the project, you can make it lightweight by using Kerblam!.

Run:

kerblam data clean

Kerblam! will clean out all output data, intermediate data (in data/) and input data that can be fetched remotely, saving you disk space for dormant projects.

Conclusions

Hopefully this toy example got you excited to use Kerblam!. It only showcases some of Kerblam!'s features. Read the manual to learn all there is to know about how Kerblam! can make your life easier.

Tutorial

Welcome to Kerblam! This introductory chapter will give you a general overview of Kerblam!: what it does and how it does it.

Kerblam! is a project manager. It helps you write clean, concise data analysis pipelines, and takes care of chores for you.

Every Kerblam! project has a kerblam.toml file in its root. When Kerblam! looks for files, it does so relative to the position of the kerblam.toml file and in specific, pre-determined folders. This helps you keep everything in its place, so that others who are unfamiliar with your project can understand it if they ever need to look at it.

tip

Akin to git, Kerblam! will look in parent directories for a kerblam.toml file and run there if you call it from a project sub-folder.

These folders, relative to where the kerblam.toml file is, are:

  • ./data/: Where all the project's data is saved. Intermediate data files are specifically saved here.
  • ./data/in/: Input data files are saved and should be looked for in here.
  • ./data/out/: Output data files are saved and should be looked for in here.
  • ./src/: Code you want to be executed should be saved here.
  • ./src/workflows/: Makefiles and bash build scripts should be saved here. They have to be written as if they were saved in ./.
  • ./src/dockerfiles/: Container build scripts should be saved here.

tip

Any sub-folder of one of these specific folders (with the exception of src/workflows and src/dockerfiles) is treated as containing the same type of files as its parent directory. For instance, data/in/fastq is treated by Kerblam! as containing input data, just like the data/in directory itself.

You can configure almost all of these paths in the kerblam.toml file, if you so desire. This is mostly done for compatibility reasons with non-kerblam! projects. New projects that wish to use Kerblam! are strongly encouraged to follow the standard folder structure, however.

important

The rest of these docs are written as if you are using the standard folder structure. If you are not, don't worry! All Kerblam! commands respect your choices in the kerblam.toml file.

If you want to convert an existing project to use Kerblam!, you can take a look at the kerblam.toml section of the documentation to learn how to configure these paths.

If you follow this standard (or you write proper configuration), you can use Kerblam! to do a bunch of things:

  • You can run pipelines written in make or arbitrary shell files in src/workflows/ as if you ran them from the root directory of your project by simply using kerblam run <pipe>;
  • You can wrap your pipelines in docker containers by just writing new dockerfiles in src/dockerfiles, with essentially just the installation of the dependencies, letting Kerblam! take care of the rest;
  • If you have wrapped up pipelines, you can export them for later execution (or to send them to a reviewer) with kerblam package <pipe> without needing to edit your dockerfiles;
  • If you have a package from someone else, you can run it with kerblam replay.
  • You can fetch remote data from the internet with kerblam data fetch, see how much disk space your project's data is using with kerblam data, and safely clean up all the files that are not needed to re-run your project with kerblam data clean.
  • You can show others your work by packing up the data with kerblam data pack and share the .tar.gz file around.
  • And more!

The rest of this tutorial walks you through every feature.

I hope you enjoy Kerblam! and that it makes your projects easier to understand, run and reproduce!

info

If you like Kerblam!, please consider leaving a star on Github. Thank you for supporting Kerblam!

Creating new projects - kerblam new

You can quickly create new kerblam! projects by using kerblam new.

Kerblam! asks you some questions when setting up your project, and uses your answers to pick sensible defaults. For instance, if you say that you will use Python, it will create a virtual environment for you (using python-venv) and pull the Github recommended Python .gitignore for you.

Since Kerblam! assumes you use git, a new git repository will be initialized for you.

important

Kerblam! will NOT do an initial commit for you! You still need to do that manually once you've finished setting up.

Managing multiple Workflows

Kerblam! can be used as a workflow manager manager. It makes it easier to write multiple workflows for your project, keeping them simple and ordered, and it executes your workflow managers for you.

what are workflows?

When analysing data, you want to take an input file, apply some transformations to it (programmatically), and obtain some output. If this is done in one small and easy step, you could run a single command on the command line and get it done.

For more complicated operations, where inputs and outputs "flow" into various programs for additional processing, you might want to describe the process of creating the output you want, and let the computer handle the execution itself. This is a workflow: a series of instructions that are executed to obtain some output. The program that reads the workflow and executes it is a workflow manager.

The simplest workflow manager is your shell: you can write a shell script with the commands that should be executed in order to obtain your output.

More feature-rich workflow managers exist. For instance, make can be used to execute workflows1. The workflows written in make are called makefiles. You'd generally place this file in the root of the repository and run make to execute it.

When your project grows in complexity and separate workflows emerge, they become increasingly hard to work with. Having a single file that is responsible for running all the different workflows your project requires adds complexity and makes running them harder than it needs to be.

Kerblam! manages your workflows for you.

Kerblam! supports make out of the box, and all other workflow managers through thin Bash wrappers.

You can write different makefiles and/or shell files for different types of runs of your project and save them in ./src/workflows/. When you kerblam run, Kerblam! looks into that folder, finds (by name) the workflow you asked for, and brings it to the top level of the project (i.e. ./) for execution. In this way, you can write your workflows as if they were in the root of the repository, cutting down on a lot of boilerplate paths.

For instance, you could write a ./src/workflows/process_csv.makefile and you could invoke it with kerblam run process_csv.
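
In general, every file in ./src/workflows/ maps to a runnable workflow name, for example (the file names here are just illustrative):

./src/workflows/process_csv.makefile  ->  kerblam run process_csv
./src/workflows/make_figures.sh       ->  kerblam run make_figures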

This lets you write separate workflows and keep your project compact, non-redundant and less complex.

The next sections outline the specifics of how Kerblam! does this, as well as other chores that you can let Kerblam! handle instead of doing them manually yourself.

1

Make is not a workflow manager per se. It was created to handle the compilation of programs, where many different files have to be compiled and combined together. While workflows are not what make was created for, it can be used to write them. In fact, an extended make-like workflow manager exists: makeflow.

Running workflow managers - kerblam run

The kerblam run command is used to execute workflow managers for you.

Kerblam! looks in the workflows directory (by default src/workflows/) for makefiles ending in the .makefile extension and shell files ending in .sh. It automatically uses the proper execution strategy based on the file's extension: either make or bash.

important

Shell scripts are always executed in bash.

You can use any workflow manager that is installed on your system through Kerblam! (e.g. snakemake or nextflow) by writing thin shell wrappers with the execution command in the src/workflows/ folder. Make has a special execution policy to allow it to work with as little boilerplate as possible.
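
As a sketch, a thin wrapper that delegates to Snakemake could look like this (the Snakefile path and options are hypothetical examples, not something Kerblam! prescribes):

# ./src/workflows/align_reads.sh
# Kerblam! runs this with bash from the project root, so relative paths work.
snakemake --cores 4 --snakefile ./src/align_reads.smk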

kerblam run supports the following flags:

  • --profile <profile>: Execute this workflow with a profile. Read more about profiles in the section below.
  • --desc (-d): Show the description of the workflow, then exit.
  • --local (-l): Skip running in a container, if a container is available, preferring a local run.

In short, kerblam run does something similar to this:

  • Move your workflow.sh or workflow.makefile file to the root of the project, under the name executor;
  • Launch make -f executor or bash executor for you.

This is why workflows are written as if they are executed in the root of the project, because they are.

Listing out workflows

If you just want a list of the workflows that Kerblam! can see, use kerblam run with no workflow specified. Kerblam! will reply with something like this:

Error: No runtime specified. Available runtimes:
    ◾📜 process_csv :: Calculate the sums of the input metrics
    🐋◾ save_plots
    ◾◾ generate_metrics

Available profiles: No profiles defined.

Workflows with a 📜 have an associated description, and those with a 🐋 have an associated docker container. You also get a list of available data profiles, which are detailed just below.

Data Profiles - Running the same workflows on different data

You can run your same workflows, as-is, on different data thanks to data profiles.

By default, Kerblam! will leave ./data/in/ untouched when running workflow managers. If you want the same workflows to run on different sets of input data, Kerblam! can temporarily swap out your real data with 'substitute' data during execution.

For example, a process_csv.makefile requires an input ./data/in/input.csv file. However, you might want to run the same workflow on another, different_input.csv file. You could copy and paste the first workflow and change the paths from the first file to this alternative one, or you might group variables into configuration files for your workflow. However, you then have to maintain two essentially identical workflows (or several different configuration files), and you are prone to introducing errors while you modify them (what if you forget to change one reference to the original file?).

You can let Kerblam! handle temporarily swapping input files for you, without touching your workflows. Define a new section under data.profiles in your kerblam.toml file:

# You can use any ASCII name in place of 'alternate'.
[data.profiles.alternate]
# The quotes are important!
"input.csv" = "different_input.csv"

You can then run the same makefile with the new data with:

kerblam run process_csv --profile alternate

tip

Profiles work on directories too! If you specify a directory as a target of a profile, Kerblam! will move the whole directory to the new location.

important

Paths under every profile section are relative to the input data directory, by default data/in.

Under the hood, Kerblam! will:

  • Move input.csv to a temporary directory in the root of the project named .kerblam/scratch, adding a very small salt string to its name (to avoid potential name collisions);
  • Move different_input.csv to input.csv;
  • Run the analysis as normal;
  • When the run ends (it finishes, it crashes or you kill it), Kerblam! will restore the original state: it moves both different_input.csv and input.csv.<salt> back to their original places.

This effectively causes the workflow to run with different input data.

warning

Be careful: the output data will (most likely) be saved under the same file names as in a "normal" run!

Kerblam! does not look into where the output files are saved or what they are saved as. If you really want to, use the KERBLAM_PROFILE environment variable described below and change the output paths accordingly.

Profiles are most commonly useful to run the workflows on test data that is faster to process or that produces pre-defined outputs. For example, you could define something similar to:

[data.profiles.test]
"input.csv" = "test_input.csv"
"configs/config_file.yaml" = "configs/test_config_file.yaml"

And execute your test run with kerblam run workflow --profile test.

The profiles feature is used so commonly for test data that Kerblam! will automatically make a test profile for you, swapping every input file xxx in the ./data/in folder with its test_xxx counterpart, if one exists. For example, the profile above is redundant!

If you write a [data.profiles.test] profile yourself, Kerblam! will not modify it in any way, effectively disabling the automatic test profile feature.

Kerblam! tries its best to clean up after itself (e.g. undo profiles, delete temporary files, etc...) when you use kerblam run, even if the workflow fails, and even if you kill your workflow with CTRL-C.

tip

If your workflow is unresponsive to a CTRL-C, pressing it twice (two SIGINT signals in a row) will kill Kerblam! instead, leaving the child process to be cleaned up by the OS and any active profile not restored.

This is to allow you to stop whatever Kerblam! or the workflow is doing in case of emergency.

Detecting if you are in a profiled run

Kerblam! will run the workflows with the environment variable KERBLAM_PROFILE set to whatever the name of the profile is. In this way, you can detect from inside the workflow if you are in a profile or not. This is useful if you want to keep the outputs of different profiles separate, for instance.
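
For instance, a shell workflow could branch on this variable to keep outputs separate (the script and file names below are purely illustrative):

# ./src/workflows/process_csv.sh
if [ "${KERBLAM_PROFILE:-}" = "test" ]; then
    # Profiled run: write to a separate output file
    python src/process_csv.py data/in/input.csv data/out/plot_test.png
else
    python src/process_csv.py data/in/input.csv data/out/plot.png
fi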

File modification times when using profiles

make tracks file modification times to determine whether it has to re-run workflows. This means that if you move files around, like Kerblam! does when it applies profiles, make will always re-run your workflows, even if you run the same workflow with the same profile back-to-back.

To avoid this, Kerblam! will keep track of the last-run profile in your projects and will update the timestamps of the moved files only when strictly necessary.

This means that the profile files will get updated timestamps only when they actually need to be updated, which is:

  • When you use a profile for the first time;
  • When you switch from one profile to a different one;
  • When you don't use a profile, but you just used one the previous run;

To track what was the last profile used, Kerblam! creates a file in $HOME/.cache/kerblam/ for each of your projects.

Sending additional arguments to the worker process

You can send additional arguments to either make or bash after what Kerblam! sets by default by specifying them after kerblam's own run arguments:

kerblam run my_workflow -- extra_arg1 extra_arg_2 ...

Everything after the -- will be passed as-is to the make or bash worker after Kerblam!'s own arguments.

For example, you can tell make to build a different target with this syntax:

kerblam run make_workflow -- other_target

As if you had run make other_target yourself.

Containerized Execution of workflows

Kerblam! can ergonomically run workflow managers inside containers for you, making it easier to be reproducible.

This is very useful for those cases where you just need a small bash script to run your analysis, but still wish to be reproducible. If you use more feature-rich workflow managers, they can handle containers for you, making Kerblam-handled containers less useful.

If Kerblam! finds a container recipe (such as a Dockerfile) of the same name as one of your workflows in the ./src/dockerfiles/ folder (e.g. ./src/dockerfiles/process_csv.dockerfile for the ./src/workflows/process_csv.makefile workflow), it will use it automatically when you execute the workflow manager for that workflow (e.g. kerblam run process_csv) to run it inside a container.

Kerblam! will do something similar to this (for an example makefile):

  • Copy the workflow file to the root of the directory (as it does normally when you launch kerblam run), as ./executor;
  • Run docker build -f ./src/dockerfiles/process_csv.dockerfile --tag process_csv_kerblam_runtime . to build the container;
  • Run docker run --rm -it -v ./data:/data --entrypoint make process_csv_kerblam_runtime -f /executor.

This last command runs the container, telling it to execute make on the /executor file (via the -f flag). Note that this is not exactly what Kerblam! does - it has additional features to correctly mount your paths, capture stdin and stdout, etc., meaning that it works transparently with your other settings and profiles.

If you have your docker container COPY . ., you can then effectively have Kerblam! run your projects in docker environments, so you can tweak your dependencies and tooling (which might be different from your dev environment) and execute even small analyses in a protected, reproducible environment.

Kerblam! will build the container images without moving the recipes around (this is what the -f flag does). The .dockerignore in the build context (next to the kerblam.toml) is shared by all pipes. See the 'using a dockerignore' section of the Docker documentation for more.

You can write dockerfiles for all types of workflows. Kerblam! automatically configures the correct entrypoint and arguments to run the pipe in the container for you.

Read the "writing dockerfiles for Kerblam!" section to learn more about how to write dockerfiles that work nicely with Kerblam! (spoiler: it's easier than writing canonical dockerfiles!).

Listing out available workflows

If you run kerblam run without a workflow (or with a non-existent workflow), you will get the list of available workflows. You can see at a glance which workflows have an associated dockerfile, as they are prepended with a little whale (🐋):

Error: No runtime specified. Available runtimes:
    🐋◾ my_workflow :: Generate the output data in a docker container
    ◾◾ local_workflow :: Run some code locally

Available profiles: No profiles defined.

Default dockerfile

Kerblam! will look for a default.dockerfile if it cannot find a container recipe for the specific pipe (e.g. pipe.dockerfile), and use that instead. You can use this to write a generic dockerfile that works for your simplest workflows. The whale (🐋) emoji in the list of pipes will be replaced by a fish (🐟) for pipes that use the default container, so you can identify them easily:

Error: No runtime specified. Available runtimes:
    🐋◾ my_workflow :: Generate the output data in a docker container
    🐟◾ another :: Run in the default container

Available profiles: No profiles defined.
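
A default.dockerfile can be as small as a base image plus your common dependencies. A minimal sketch (the package choices are just an example):

# ./src/dockerfiles/default.dockerfile
FROM python:3.12-slim

RUN pip install --no-cache-dir pandas matplotlib

COPY . .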

Switching backends

Kerblam! runs containers by default with Docker, but you can tell it to use Podman instead by setting the execution > backend option in your kerblam.toml:

[execution]
backend = "podman" # by default "docker"

Podman is slightly harder to set up but has a few benefits, mainly not having to run in root mode, and being a FOSS program. For 90% of use cases, you can use Podman instead of Docker and it will work exactly the same. Podman and Docker images are interchangeable, so you can use Podman with Docker Hub with no issues.

Setting the container working directory

Kerblam! does not parse your dockerfile or add any magic to the calls that it makes based on heuristics. This means that if you wish to save your code somewhere other than the root of the container, you must tell Kerblam! about it.

For instance, this recipe copies the contents of the analysis in a folder called "/app":

COPY . /app/

This one does the same by using the WORKDIR directive:

WORKDIR /app
COPY . .

If you change the working directory, let Kerblam! know by setting the execution > workdir option in kerblam.toml:

[execution]
workdir = "/app"

In this way, Kerblam! will run the containers with the proper paths.

important

This option applies to ALL containers managed by Kerblam!

There is currently no way to configure a different working directory for every specific dockerfile.

Skipping using cache

Sometimes, you want to skip the build cache when executing a workflow that runs in a container.

Using kerblam run my_workflow --no-build-cache will do just that: the build backend will be told not to use the cached layers for that build (with the --no-cache flag).

Example

For example, consider a makefile named process_csv.makefile that uses a Python script to process CSV files. You could have the following Dockerfile:

# ./src/dockerfiles/process_csv.dockerfile

FROM ubuntu:latest

RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    pip3 install --break-system-packages pandas

COPY . .

and this dockerignore file:

# ./.dockerignore
.git
data
venv

and simply run kerblam run process_csv to build the container and run your code inside it.

Writing Dockerfiles for Kerblam!

When you write dockerfiles for use with Kerblam! there are a few things you should keep in mind:

  • Kerblam! will automatically set the proper entrypoints for you;
  • The build context of the dockerfile will always be the place where the kerblam.toml file is.
  • Kerblam! will not ignore any file for you.
  • The behaviour of kerblam package is slightly different from kerblam run, in that the context of kerblam package is an isolated "restarted" project, as if kerblam data clean --yes was run on it, while the context of kerblam run is the current project, as-is.

This means a few things, detailed below.

COPY directives are executed in the root of the repository

This is exactly what you want, usually. This makes it possible to copy the whole project over to the container by just using COPY . ..

The data directory is excluded from packages

If you have a COPY . . directive in the dockerfile, it will behave differently when you kerblam run versus when you kerblam package.

When you run kerblam package, Kerblam! will create a temporary build context with no input data. This is what you want: Kerblam! needs to separately package your (precious) input data on the side, and copy in the container only code and other execution-specific files.

In a run, the current local project directory is used as-is as a build context. This means that the data directory will be copied over. At the same time, Kerblam! will also mount the same directory to the running container, so the copied files will be "overwritten" by the live mountpoint while the container is running.

This generally means that copying the whole data directory is useless in a run, and that it cannot be done during packaging.

Therefore, a best practice is to ignore the contents of the data folders in the .dockerignore file. This makes no difference while packaging containers but a big difference when running them, as docker skips copying the useless data files.

To do this in a standard Kerblam! project, simply add this to your .dockerignore in the root of the project directory:

# Ignore the intermediate/output directory
data

You might also want to add any files that you know are not useful in the docker environment, such as local python virtual environments.
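
For example, a .dockerignore for a typical Kerblam! project with a Python virtual environment could look like this (adjust it to your own setup):

# ./.dockerignore
.git
data
env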

Your dockerfiles can be very small

Since the configuration is handled by Kerblam!, the main reason to write dockerfiles is to install dependencies.

This makes your dockerfiles generally very small:

FROM ubuntu:latest

RUN apt-get update && apt-get install -y # a list of packages

COPY . .

You might also be interested in the article 'best practices while writing dockerfiles' by Docker.

Docker images are named based on the workflow name

If you run kerblam run my_workflow twice, the same container image is built both times, meaning that the build cache will make your execution quite fast if you place the COPY . . directive near the bottom of the dockerfile.

This way, you can essentially work exclusively in docker and never install anything locally.

Kerblam! will name the containers for the workflows as <workflow name>_kerblam_runtime. For example, the container for my_workflow.sh will be my_workflow_kerblam_runtime.

Describing workflows

If you execute kerblam run without specifying a pipe (or you try to run a pipe that does not exist), you will get a message like this:

Error: No runtime specified. Available runtimes:
    ◾◾ process_csv
    🐋◾ save_plots
    ◾◾ generate_metrics

Available profiles: No profiles defined.

The whale emoji (🐋) represents pipes that have an associated Docker container.

If you wish, you can add additional information to this list by writing a section in the makefile/shellfile itself. Using the same example as above:

#? Calculate the sums of the input metrics
#?
#? The script takes the input metrics, then calculates the row-wise sums.
#? These are useful since we can refer to this calculation later.

./data/out/output.csv: ./data/in/input.csv ./src/calc_sum.py
    cat $< | ./src/calc_sum.py > $@

If you add this block of lines starting with #? , Kerblam! will use them as descriptions (note that the space after the ? is important!), and it will treat them as markdown. The first paragraph of text (#? lines not separated by an empty #? line) will be the title of the workflow. Try to keep this short and to the point. The rest of the lines will be the long description.

Kerblam! will parse all lines starting with #? , although it's preferable to have a single contiguous description block in each file.

The output of kerblam run will now read:

Error: No runtime specified. Available runtimes:
    ◾📜 process_csv :: Calculate the sums of the input metrics
    🐋◾ save_plots
    ◾◾ generate_metrics

Available profiles: No profiles defined.

The scroll (📜) emoji appears when Kerblam! notices a long description. You can show the full description for such pipes with kerblam run process_csv --desc.

With workflow docstrings, you can have a record of what the workflow does for both yourself and others who review your work.
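
The same #? syntax works in shell workflows too. For instance (the script and file names are hypothetical):

#? Render the final figures
#?
#? Reads the cleaned data and saves the plots used in the report.

python src/make_plots.py data/in/clean.csv data/out/figures.png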

You cannot write docstrings inside docker containers1.

1

You actually can. I can't stop you. But Kerblam! ignores them.

Packaging workflows for later

The kerblam package command is one of the most useful features of Kerblam! It allows you to package everything needed to execute a workflow in a docker container and export it for execution later.

As with kerblam run, this is chiefly useful for those times where the workflow manager of your choice does not support such features, or you do not wish to use a workflow manager.

You must have a matching dockerfile for every workflow that you want to package, or Kerblam! won't know what to package your workflow into.

For example, say that you have a process pipe that uses make to run, and requires both a remotely-downloaded remote.txt file and a local-only precious.txt file.

If you execute:

kerblam package process --tag my_process_package

Kerblam! will:

  • Create a temporary build context;
  • Copy all non-data files to the temporary context;
  • Build the specified dockerfile as normal, but using this temporary context;
  • Create a new Dockerfile that:
    • Inherits from the image built before;
    • Copies the Kerblam! executable to the root of the container;
    • Configures the default execution command to something suitable for execution (just like kerblam run does, but "baked in").
  • Build the docker container and tag it with my_process_package;
  • Export all precious data, the kerblam.toml and the --tag of the container to a process.kerblam.tar tarball.

The --tag parameter is a docker tag. You can specify a remote repository with it (e.g. my_repo/my_container) and push it with docker push ... (or podman) as you would normally do.
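
For example, to package a pipe and push the resulting image to a registry you control (the registry and names here are hypothetical):

kerblam package process --tag ghcr.io/my_user/my_process_package
docker push ghcr.io/my_user/my_process_package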

tip

If you don't specify a --tag, Kerblam! will name the result as <pipe>_exec.

Replaying packaged projects

After Kerblam! packages your project, you can re-run the analysis with kerblam replay by using the process.kerblam.tar file:

kerblam replay process.kerblam.tar ./replay_directory

Kerblam! reads the .kerblam.tar file, recreates the execution environment from it by unpacking the packed data, and executes the exported docker container with the proper mountpoints (as described in the kerblam.toml file).

In the container, Kerblam! fetches remote files (i.e. runs kerblam data fetch) and then the workflow is triggered via kerblam run. Since the container's output folder is mounted to the output directory on disk, the final output of the workflow is saved locally.

These packages are meant to make workflows reproducible in the long-term. For day-to-day runs, kerblam run is much faster.

important

The responsibility of having the resulting docker image work in the long term is up to you, not Kerblam! For most cases, just having kerblam run work is enough for the package made by kerblam package to work, but depending on your dockerfiles this might not be the case. Kerblam! does not test the resulting package - it's up to you to do that. It's best to try your packaged workflow once before shipping it off.

However, even a broken kerblam package is still useful! You can always enter the container with --entrypoint bash and work inside it interactively, manually fixing any issues that time or a wrong setup might have introduced.

Kerblam! respects your choices of execution options when it packages, changing backend or working directory as you'd expect. See the kerblam.toml specification to learn more.

Managing Data

Kerblam! has a bunch of utilities to help you manage the local data for your project. If you follow open science guidelines, chances are that a lot of your data is FAIR, and you can fetch it remotely.

Kerblam! is perfect to work with such data. The next tutorial sections outline what Kerblam! can do to help you work with data.

Remember that Kerblam! recognizes what data is what by the location where you save the data in. If you need a refresher, read this section of the book.

kerblam data will give you an overview of the status of local data:

> kerblam data
./data       500 KiB [2]
└── in       1.2 MiB [8]
└── out      823 KiB [2]
──────────────────────
Total        2.5 MiB [12]
└── cleanup  2.3 MiB [9] (92.0%)
└── remote   1.0 MiB [5]
! There are 3 undownloaded files.   

The first lines show the size and the number of files (in square brackets) of the ./data (intermediate), ./data/in (input) and ./data/out (output) folders.

The total size of all the files in the ./data/ folder is then broken down into categories: the Total data size, how much data can be removed with kerblam data clean or kerblam data pack, and how many files are specified to be downloaded but are not yet present locally.

Fetching remote data

If you define the data.remote section in kerblam.toml, you can have Kerblam! automatically fetch remote data for you:

[data.remote]
# This follows the form "url_to_download" = "save_as_file"
"https://raw.githubusercontent.com/MrHedmad/kerblam/main/README.md" = "some_readme.md"

When you run kerblam data fetch, Kerblam! will attempt to download some_readme.md by following the URL you provided and save it in the input data directory (e.g. data/in).

Most importantly, some_readme.md is treated as a file that is remotely available and therefore locally expendable for the sake of saving disk size (see the data clean and data pack commands).

You can specify any number of URLs and file names in [data.remote], one for each file that you wish to be downloaded.

danger

The download directory for all fetched data is your input directory, so if you specify some/nested/dir/file.txt, Kerblam! will save the file in ./data/in/some/nested/dir/file.txt. This also means that if you write an absolute path (e.g. /some_file.txt), Kerblam! will take the path literally, creating some_file.txt in the root of the filesystem (and most likely failing to do so).

Kerblam! will, however, warn you before acting, telling you that it is about to do something potentially unwanted, and giving you the chance to abort.

Unfetcheable data

Sometimes, a simple GET request is not enough to fetch your data. Perhaps you need some complicated login, or you use specific software to fetch your remote data. You can still tell Kerblam! that a file is remote, but that Kerblam! cannot directly fetch it: this way you can use all other Kerblam! features but "opt out" of the fetching one.

To do this, simply specify "_" as the remote URL in the kerblam.toml file:

[data.remote]
"https://example.com/" = "remote_file.txt"
"_" = "unfetcheable_file.txt"

If you run kerblam data fetch with the above configuration, you'll fetch remote_file.txt, but not unfetcheable_file.txt (and Kerblam! will remind you of that).

note

Remember that Kerblam! replay packages will fetch remote data for you before running the packaged workflow. If an unfetcheable file is needed by the packaged workflow, be sure to fetch it inside the workflow itself before running the computation proper.
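
For instance, a workflow could start by fetching the protected file itself before doing any real work (the URL and credentials below are purely illustrative):

# ./src/workflows/full_analysis.sh
# Fetch the file that Kerblam! cannot download for us, then run the analysis.
curl --fail -u "$MY_USER:$MY_TOKEN" -o data/in/unfetcheable_file.txt \
    https://example.com/protected/data.txt
python src/analysis.py data/in/unfetcheable_file.txt data/out/result.csv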

Package and distribute data

Say that you wish to send all your data folder to a colleague for inspection. You can tar -czvf exported_data.tar.gz ./data/ and send your whole data folder, but you might want to only pick the output and non-remotely available inputs, and leave re-downloading the (potentially bulky) remote data to your colleague.

failure

It is widely known that remembering tar commands is impossible.

If you run kerblam data pack you can do just that. Kerblam! will create an exported_data.tar.gz file and save it locally with the non-remotely-available ./data/in files and the files in ./data/out. You can also pass the --cleanup flag to delete them after packing.

You can then share the data pack with others.

Omit input data

If you only want to package your output data, simply pass the --output-only flag to kerblam data pack. The resulting tarball will just contain the data/out folder.

Cleanup data

If you want to cleanup your data (perhaps you have finished your work, and would like to save some disk space), you can run kerblam data clean.

Kerblam! will remove:

  • All temporary files in ./data/;
  • All output files in ./data/out;
  • All empty (even nested) folders in ./data/ and ./data/out. This essentially only leaves input data on the disk.

To additionally clean remotely available data (to really put a project in cold storage), pass the --include-remote flag.

Kerblam! will consider as "remotely available" files that are present in the data.remote section of kerblam.toml. See this chapter of the book to learn more about remote data.

If you want to preserve the empty folders left behind after cleaning, pass the --keep-dirs flag to do just that.

Kerblam! will ask for your confirmation before deleting the files. If you're feeling bold, skip it with the --yes flag.
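
For example, to put a project in cold storage without being asked for confirmation:

kerblam data clean --include-remote --yes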

With the --preserve-output flag, Kerblam! will skip deleting the output files.

Dry runs

With the --dry-run option, Kerblam! will just show the list of files to be deleted, without actually deleting anything:

> kerblam data clean --dry-run
Files to clean:
data/temp.csv
data/out/finala.txt

Other utilities

Kerblam! has a few other utilities to deal with the most tedious steps of working with projects.

kerblam ignore - Add items to your .gitignore quickly

Oops! You forgot to add your preferred language to your .gitignore. You now need to google for the template .gitignore, open the file and copy-paste it in.

With Kerblam! you can do that in just one command. For example:

kerblam ignore Rust

will fetch Rust.gitignore from the Github gitignore repository and append it to your .gitignore for you. Be careful that this command is case sensitive (e.g. Rust works, rust does not).

You can also add specific files or folders this way:

kerblam ignore ./src/something_useless.txt

Kerblam! will add the proper pattern to the .gitignore file to filter out that specific file.

The optional --compress flag makes Kerblam! check the .gitignore file for duplicated entries, and only retain one copy of each pattern. This also cleans up comments and whitespace in a sensible way.

The --compress flag also lets you fix having ignored the same thing twice: e.g. kerblam ignore Rust && kerblam ignore Rust --compress is the same as running kerblam ignore Rust just once.

Getting help

You can get help with Kerblam! through the project's Github repository, for example by opening an issue.

Thank you so much for giving Kerblam! a go.

Usage examples

There are a bunch of examples in the MrHedmad/kerblam-examples repository, ready for your perusal.

The latest development version of Kerblam! is tested against these examples, so you can be sure they are as fresh as they can be.

The Kerblam.toml file

The kerblam.toml file is the control center of Kerblam! All of its configuration is found there. Here are the available fields and what they do.

warning

Extra fields not found here are silently ignored. This means that you must be careful of typos!

The fields are annotated where possible with the default value.

[meta] # Metadata regarding kerblam!
version = "0.4.0"
# Kerblam! will check this version and give you a warning
# if you are not running the same executable.
# To save you headaches!

# The [data] section has options regarding... well, data.
[data.paths]
input = "./data/in"
output = "./data/out"
intermediate = "./data"

[data.profiles] # Specify profiles here
# Each profile maps original input file names to their substitutes.
profile_name = { "original_file" = "substitute_file", "other_file" = "other_substitute_file" }

# Or, alternatively
[data.profiles.profile_name]
"original_file" = "substitute_file"
"other_file" = "other_substitute_file"
# Any number of profiles can be specified, but stick to just one of these
# two methods of defining them.

[data.remote] # Specify how to fetch remote data
"url_to_fetch" = "file_to_save_to"
# there can be any number of "url" = "file" entries here.
# Files are saved inside `[data.paths.input]`

##### --- #####
[code] # Where to look for containers and pipes
env_dir = "./src/dockerfiles"
pipes_dir = "./src/workflows"

[execution] # How to execute the pipelines
backend = "docker" # or "podman", the backend to use to build and run containers
workdir = "/" # The working directory inside all built containers

Note that this is not meant to be valid TOML, just a reference. Don't expect to copy-paste it and obtain a valid Kerblam! configuration.

Contributing to Kerblam!

Thank you for wanting to contribute!

The developer guide changes more often than this book, so you can read it directly on Github.

The Kerblam! philosophy

Hello! This is the maintainer. This article covers the design principles behind how Kerblam! functions. It is both targeted at myself - to remind me why I did what I did - and to anyone who is interested in the topic of managing data analysis projects.

Reading this is not at all necessary to start using Kerblam!. Perhaps you want to read the tutorial instead.

I am an advocate of open science, open software and of sharing your work as soon and as openly as possible. I also believe that documenting your code is even more important than the code itself. Keep this in mind when reading this article, as it is strongly opinionated.

The first time I use an acronym I'll try to make it bold italics so you can have an easier time finding it if you forget what it means. However, I try to keep acronyms to a minimum.

Introduction

After three years doing bioinformatics work as my actual job, I think I have come across many of the different types of projects that one encounters as a bioinformatician:

  1. You need to analyse some data either directly from someone or from some online repository. This requires the usage of both pre-established tools and new code and/or some configuration.
    • For example, someone in your research group performed RNA-Seq, and you are tasked with the data analysis.
  2. You wish to create a new tool/pipeline/method of analysis and apply it to some data to both test its performance and/or functionality, before releasing the software package to the public.

The first point is data analysis. The second point is software development. Both require writing software, but they are not exactly the same.

You'd generally work on point 2 like a generalist programmer would. In terms of how you work, there are many different workflow mental schemas you can choose from, each with its following, pros, and cons. Simply search for "coding workflow" to find a plethora of different styles, methods and ways to manage what to do and when while you code.

In any case, while working with a specific programming language, you usually have only one possible way to lay out your files. A Python project uses a quite specific structure: you have a pyproject.toml/setup.py, a module directory1... Similarly, when you work on a Rust project, you use cargo, and therefore have a Cargo.toml file, a /src directory...

note

The topic of structuring the code itself is even deeper, with different ways to think of your coding problem: object oriented vs functional vs procedural, monolithic vs microservices, etcetera, but it's out of the scope of this piece.

At its core, software is a collection of text files written in a way that the computer can understand. The process of laying out these files in a logical way in the filesystem is what I mean when I say project layout (PL). A project layout system (PLS) is a pre-established way to layout these files. Kerblam! is a tool that can help you with general tasks if you follow the Kerblam! project layout system.

note

There are also project management systems, that are tasked with managing what has to be done while writing code. They are not the subject of this piece, however.

Since we are talking about code, there are a few characteristics in common between all code-centric projects:

  • The changes between different versions of the text files are important. We need to be able to go back to a previous version if we need to. This can be necessary for a number of reasons: if we realize that we changed something that we shouldn't have, if we just want to see a previous version of the code, or if we need to run a previous version of the program for reproducibility purposes.
  • Code must be documented to be useful. While it is often sufficient to read a piece of code to understand what it does, the why is often unclear. This is even more important when creating new tools: a tool without clear documentation is unusable, and an unusable tool might as well not exist.
  • Often, code has to be edited by multiple people simultaneously. It's important to have a way to coordinate between people as you add your edits in.
  • Code layout is often driven by convention or by the requirements of build systems/interpreters/external tools that need to read your code. Each language is unique in this respect.

From these initial observations we can start to think about a generic PLS. Version control takes care of - well - version control and is essential for collaboration. Version control generally does not affect the PL meaningfully. However, version control often does not work well with large files, especially binary files.

Design principle A: We must use a version control system.

Design principle B: Big binary blobs bad2!

2

I'm very proud of this pun. Please don't take it from me.

quote

I assume that the reader knows how vital version control is when writing software. In case you do not, I want to briefly outline why you'd want to use a version control system in your work:

  • It takes care of tracking what you did on your project;
  • You can quickly turn back time if you mess up and change something that should not have been changed.
  • It allows you to collaborate both in your own team (if any) and with the public (in the case of open-source codebases). Collaboration is nigh impossible without a version control system.
  • It allows you to categorize and compartmentalize your work, so you can keep track of every different project neatly.
  • It makes the analysis (or tool) accessible - and, if you are careful, also reproducible - to others, which is an essential part of the scientific process.

These are just some of the advantages of using a version control system. One of the most popular version control systems is git. With git, you can progressively add changes to your code over time, with git taking care of recording what you did and managing different versions made by others.

If you are not familiar with version control systems and specifically with git, I suggest you stop reading and look up the git user manual.

Design principle A makes it so that the basic unit of our PLS is the repository. Our project is therefore a repository of code.

As we said, documentation is important. It should be versioned together with the code, as that is what it describes, and it should change at the same pace.

Design principle C: Documentation is good. We should do more of that.

Code is read more times than it is written, therefore, it's important for a PLS to be logical and obvious. To be logical, one should categorize files based on their content, and logically arrange them in a way that makes sense when you or a stranger looks through them. To be obvious, the categorization and the choice of folder and file names should make sense at a glance (e.g. the 'scripts' directory is for scripts, not for data).

Design principle D: Be logical, obvious and predictable.

Scientific computing needs to be reproducible by others. The best kind of reproducibility is computational reproducibility, by which the same output is generated given the same input. There are a lot of things that you can do while writing code to achieve computational reproducibility, but one of the main contributors to reproducibility is still containerization.

Additionally, being easily reproducible is - in my mind - as important as being reproducible in the first place. The easier it is to reproduce your work, the more "morally upright" you will be in the eyes of the reader. This has a lot of benefits, of course, with the main one being that you are more resilient to backlash in the inevitable case that you commit an error.

Design principle E: Be (easily) reproducible.

Structuring data analysis

While structuring single programs is relatively straightforward, doing the same for a data analysis project is less set in stone. However, given the design principles that we have established in the previous section, we can try to find a way to fulfill all of them for the broadest scope of application possible.

To design such a system, it's important to find the points in common between all types of data analysis projects. In essence, a data analysis project encompasses:

  • Input data that must be analysed in order to answer some question.
  • Output data that is created as a result of analysing the input data.
  • Code that analyses that data.
  • Many different external tools, each with its own set of requirements, which add to the requirements of your own code and scripts.

"Data analysis" code is not "tool" code: it usually uses more than one programming language, it is not monolithic (i.e builds up to just one "thing") and can differ wildly in structure (from just one script, to external tool, to complex pieces of code that run many steps of the analysis).

This complexity results in a plethora of different ways to structure the code and the data during the project.

I will not say that the Kerblam! way is the one-and-only, cover-all way to structure your project, but I will say that it is a sensible default.

Kerblam!

The Kerblam! way to structure a project is based on the design principles that we have seen, the characteristics shared by all data analysis projects, and some additional fundamental observations, which I list below:

  1. All projects deal with input and output data.
  2. Some projects have intermediate data that can be stored to speed up the execution, but can be regenerated if lost (or the pipeline changes).
  3. Some projects generate temporary data that is needed during the pipeline but then becomes obsolete when the execution ends.
  4. Projects may deal with very large data files.
  5. Projects may use different programming languages.
  6. Projects, especially exploratory data analyses, require a record of all the trials that were made during the exploratory phase. Often, one last run is the definitive one, and its output is the one that gets presented.

Having these in mind, we can start to outline how Kerblam! deals with each of them.

Data

Points 1, 2, 3 and 4 deal with data. A Kerblam! project has a dedicated data directory, as you'd expect. However, Kerblam! actually differentiates between the different data types. In addition to input, output, temporary and intermediate data, Kerblam! also considers:

  • Remote data is data that can be downloaded at runtime from a (static) remote source.
  • Input data that is not remote is called precious, since it cannot be substituted if it is lost.
  • All data that is not precious is fragile, since it can be deleted with little repercussion (i.e. you can just re-download it or re-run the pipeline to obtain it again).

note

Practically, data can be input/output/temp/intermediate, either fragile or precious and either local or remote.

To make the distinction between these different data types, we could either keep a separate configuration that points at each file (a git-like system), or specify directories where each type of file is stored.

Kerblam! takes both of these approaches. The distinction between input/output/temp/intermediate data is given by directories: it's up to the user to save each file in the appropriate one. The distinction between remote and local files is instead given by a config file, kerblam.toml, so that Kerblam! can fetch the remote files for you on demand3. Whether data is fragile or precious can then be computed from the other two variables.
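As a hypothetical sketch of how this looks in practice (the exact kerblam.toml schema and folder names are documented elsewhere in this book, so treat the snippet below as illustrative):

```sh
# Illustrative only: declare a remote input file in kerblam.toml...
cat >> kerblam.toml <<'EOF'
[data.remote]
# remote URL on the left, local file name on the right
"https://example.com/counts.csv" = "counts.csv"
EOF

# ...then let Kerblam! download whatever is missing locally.
kerblam data fetch
```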

3

Two birds with one stone, or so they say.

The only data that needs to be manually shared with others is precious data. Everything else can be downloaded or regenerated by the code. This means that the only data that needs to be committed to version control is the precious kind. If you strive to keep precious data to a minimum - as should already be the case - the analysis repository can be kept tiny, size-wise. This makes Kerblam! compliant with principle B4 and makes it easier (or, in some cases, possible at all) to be compliant with principle A5.
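In practice, keeping fragile data out of version control can be as simple as ignoring the relevant folders. The paths below are illustrative and should be adapted to your own layout:

```sh
# Ignore fragile data (outputs, temporary files); precious inputs stay tracked.
cat >> .gitignore <<'EOF'
data/out/
data/*.tmp
EOF
```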

Execution

Points 5 and 6 are generally covered by pipeline managers. A pipeline manager, like snakemake or nextflow, executes code in a controlled way in order to obtain output files. While both of these were made with data analysis in mind, they are very powerful but also quite "complex"6 and unwieldy for most projects.

Kerblam! natively supports simple shell scripts (which in theory can be used to run anything, even pipeline managers like nextflow or snakemake) and makefiles. make is a venerable GNU utility that is mainly used to build packages and compile C/C++ projects. However, it supports the creation of any file with any recipe. It is easy to learn and quick to write, and for most analyses it sits at the sweet spot between a simple shell script and a full-fledged pipeline manager.
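As a minimal sketch (the paths, names and extension are illustrative, and assume the layout described below), a one-rule make pipe can be as small as this; note that make recipe lines must start with a tab character:

```sh
# A hypothetical one-rule pipe: rebuild the summary only when its inputs change.
mkdir -p src/workflows
cat > src/workflows/summarise.makefile <<'EOF'
data/out/summary.csv: data/in/counts.csv src/calc_sums.py
	python src/calc_sums.py data/in/counts.csv > data/out/summary.csv
EOF
```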

Kerblam! considers these executable scripts and makefiles as "pipes", where each pipe can be executed to obtain some output. Each pipe should call external tools and internal code. If the code is structured following the Unix philosophy, each different piece of code ("program") can be reused across different pipes and combined with the others inside them.

With these considerations, point 6 can be addressed by making different pipes with sensible names and saving them in version control. Point 5 is easy if each program is independent of the others and developed in its own folder. Kerblam! appoints the ./src directory to contain the program code (e.g. scripts, directories with programs, etc...) and the ./src/workflows directory to contain the shell-script and makefile pipes.
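Putting these pieces together, a small project following this layout might look something like the sketch below; the data folder names are illustrative, and the rest of this book documents the exact defaults:

```sh
# Illustrative skeleton of a Kerblam! project:
#
# ├── kerblam.toml
# ├── data/                       # intermediate data
# │   ├── in/                     # input data (precious or remote)
# │   └── out/                    # output data
# └── src/
#     ├── calc_sums.py            # reusable programs
#     └── workflows/
#         ├── summarise.makefile  # pipes: one file per workflow
#         └── quick_look.sh
#
# A pipe is then run by name, e.g.:
kerblam run summarise
```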

These steps fulfill design principle D7: makefiles and shell scripts are easy to read, and having separate folders for the pipes and for the actual code that runs makes it easy to know what is what. Having the rest of the code be sensibly managed is up to the programmer.

Principle E8 can be messed up very easily, and the reproducibility crisis is a symptom of this. A very common way to make any analysis reproducible is to package the execution environment into containers, executable bundles that can be configured to do basically anything in an isolated, controlled environment.

Kerblam! projects leverage docker containers to make the analysis as easily reproducible as possible. Using docker for the most basic tasks is relatively straightforward:

  • Start with an image;
  • Add dependencies;
  • Copy the current environment;
  • Set up the proper entrypoint;
  • Execute the container with a directory mounted to the local file system in order to extract the output files as needed.

Kerblam! automatically detects dockerfiles in the ./src/dockerfiles directory and builds and executes the containers following this simple schema. To give the user as much freedom as possible, Kerblam! does not edit or check these dockerfiles; it just executes them in the proper environment and with the correct mount points.

The output of a locally-run pipeline cannot be trusted, as it is not reproducible. Having Kerblam! natively run all pipelines in containers allows development runs to be exactly the same as the final runs that produce the published output once development ends.

Knowing which dockerfile is needed for which pipe could be challenging, which would clash with principle D7. For this reason, Kerblam! requires that each pipe and its respective dockerfile have the same name.
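For instance, continuing the hypothetical pipe from before, the matching dockerfile could look like the sketch below. The base image and dependencies are placeholders, and the exact requirements for these dockerfiles are described in the rest of this book:

```sh
# A dockerfile named after the pipe it containerizes (illustrative content).
mkdir -p src/dockerfiles
cat > src/dockerfiles/summarise.dockerfile <<'EOF'
FROM python:3.12-slim
# Add the dependencies the pipe needs
RUN pip install --no-cache-dir pandas
# Copy the current environment into the image
COPY . /app
WORKDIR /app
EOF

# With a dockerfile of the same name present, the pipe should run in its container:
kerblam run summarise
```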

Documentation

Documentation is essential, as we said in principle C9. However, documentation is for humans, and it is generally well established how to lay out documentation files in a repository:

  • Add a README file.
  • Add a LICENSE, so it's clear how others may use your code.
  • Create a /docs folder with other documentation, such as CONTRIBUTING guides, tutorials and generally any human-readable text needed to understand your project.

There is little that an automated tool can do to help with documentation. There are plenty of guides online that deal with the task of documenting a project, so I will not cover it further.

1

Python packaging is a bit weird, since there are so many packaging engines that create Python packages. Most online guides use setuptools, but modern Python (as of Dec 2023) works with the build tool and a pyproject.toml file, which supports different build backends. See this PEP for more info.

6

I cannot find a better adjective than "complex". These tools are not hard to use or particularly difficult to learn, but they do have an initial learning curve. The thing that I want to highlight is that they are so formal, and require such careful specification of inputs, outputs, channels and pipelines, that they become a bit unwieldy to use as a default. For large projects with many moving parts and a lot of computing (e.g. the need to run on a cluster), using programs such as these can be very important and useful. However, bringing a tank to a fist fight can be a bit too much.

4

Big binary blobs bad.

5

We must use a version control system.

7

Be logical, obvious and predictable.

8

Be (easily) reproducible.

9

Documentation is good. We should do more of that.