DVC and Backblaze B2 for Reliable & Reproducible Data Science

Dr James Ravenscroft
8 min read · Nov 27, 2020

Introduction

When you’re working with large datasets, storing them in git alongside your source code is usually not an optimal solution. Git is famously ill-suited to large files, and whilst general-purpose solutions exist (Git LFS being perhaps the most widely adopted), DVC is a powerful alternative that does not require a dedicated LFS server and can be used directly with a range of cloud storage systems, as well as traditional NFS- and SFTP-backed filestores, all listed in the DVC remote storage documentation.

Another point in DVC’s favour is its powerful dependency system, which lets you precisely recreate data science projects down to the command-line flag. That is particularly desirable in academic and commercial R&D settings.

I use data buckets like S3 and Google Cloud Storage at work frequently, and they’re very useful as an off-site backup for large quantities of training data. However, in my personal life my favourite S3-like vendor is Backblaze, who offer a professional, reliable service with cheaper data access rates than Amazon and Google, plus an S3-compatible API that you can use in many places, including DVC. If you’re new to remote storage buckets or you want to try before you buy, Backblaze offer 10GB of remote storage for free: plenty of room for a few hundred thousand pictures of dogs and chicken nuggets to train your classifier with.

Setting up your DVC Project

Configuring DVC to use B2 instead of S3 is actually a breeze once you find the right incantation in the documentation. Our first step, if you haven’t done so already, is to install DVC. You can download an installer bundle/Debian package/RPM package from the DVC website, or if you prefer you can install it into Python via pip install dvc[all]. The [all] on the end pulls in all the various DVC remote storage libraries; you could swap this for [s3] if you only need S3-compatible storage, which is all we need for Backblaze B2.
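For example, either of the following should work (the quotes stop shells like zsh from trying to expand the square brackets):

pip install 'dvc[all]'
pip install 'dvc[s3]'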

Next you will want to create your data science project — I usually set mine up like this:

- README.md
- .gitignore <-- prefilled with pythonic ignore rules
- environment.yml <-- my conda environment yaml
- data/
  - raw/ <-- raw unprocessed data assets go here
  - processed/ <-- partially processed and pre-processed data assets go here
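If you prefer to scaffold this from the terminal, a minimal sketch looks like this (the folder name my-project is just a placeholder, and the conda environment file is optional):

mkdir -p my-project/data/raw my-project/data/processed
cd my-project
touch README.md .gitignore environment.yml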

Now we can initialize git and dvc:

git init
dvc init

Setting up your Backblaze Bucket and Credentials

Now we’re going to create our bucket in Backblaze. Assuming you’ve registered an account, you’ll want to go to “My Account” in the top right-hand corner, then click “Create a new bucket”.

Enter a bucket name (little gotcha: the name must be unique across the whole of Backblaze, not just your account) and click “Create a Bucket”, taking the default options on the rest of the fields.

Once your bucket is created you’ll also need to copy down the “endpoint” value that shows up in the information box — we’ll need this later when we set up DVC.

We’re also going to need to create credentials for accessing the bucket. Go back to “My Account”, then “App Keys”, and click “Add a New Application Key”.

Here you can enter a memorable name for this key — by convention I normally use the name of the experiment or model that I’m training.

You can leave all of the remaining options with default/empty values, or you can use them to lock down your security if you have multiple users accessing your account (or in the event that your key gets committed to a public GitHub repo). For example, we could limit this key to only the bucket we just created, or to folders with a certain prefix within that bucket. For this tutorial I’m assuming you left these as they were; if you change them, your mileage may vary.

Once you create the key you will need to copy down the keyID and applicationKey values. Heed the warning: they only appear once, and as soon as you move off this page they are gone forever unless you have copied them somewhere safe. It’s not the end of the world, since we can create more keys, but it is still a bit annoying to have to go through it again.

If you’ve got the name of your bucket, your endpoint, your keyID and applicationKey values stored somewhere safe then we’re done here and we can move on to the next step.

Configuring your DVC ‘remote’

With our bucket all set up, we can configure DVC to talk to Backblaze. First we add a new remote to DVC. The -d flag sets this as the default remote, so that when we push, the data will be sent to this location without DVC having to be told explicitly.

dvc remote add -d b2 s3://your-bucket-name/

Now DVC knows about our bucket, but unless we tell it otherwise it will assume that it’s an Amazon S3 bucket rather than a B2 bucket. We need to tell it our endpoint value:

dvc remote modify b2 endpointurl https://s3.us-west-002.backblazeb2.com

You’ll see that I’ve taken the endpoint value I copied down when I set up my bucket and stuck “https://” on the front, which DVC needs in order to form a valid URL.
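If both commands succeeded, the .dvc/config file in your project should now contain something roughly along these lines (your bucket name and endpoint will differ):

[core]
    remote = b2
['remote "b2"']
    url = s3://your-bucket-name/
    endpointurl = https://s3.us-west-002.backblazeb2.com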

Authenticating DVC

Next we need to tell DVC about our auth keys. In the DVC manual they show you that you can use the dvc remote modify command to permanently store your access credentials in the DVC config file. However this stores your super-duper secret credentials in plain text in a file called .dvc/config which gets stored in your git repository meaning that if you're storing your work on GitHub then Joe Public could come along and start messing with your private bucket.

Instead, I advocate the following approach. Firstly, in our .gitignore file at the top level of our project (create one if it doesn’t exist), add a line that says .env
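One way to do that from the terminal (this simply appends the line to the file):

echo '.env' >> .gitignore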

Now we’re going to create a new file, again in the top level of our project directory, called .env and paste in the following:

export AWS_ACCESS_KEY_ID='<keyID>'
export AWS_SECRET_ACCESS_KEY='<applicationKey>'

Replace <keyID> and <applicationKey> with the values from the BackBlaze web UI that we copied earlier.

What we’ve just done is create a local file containing our credentials that git is not permitted to store in your repository, and it’s easy enough to use these credentials with DVC from the terminal by running source .env first. Don’t worry, I’ll show you how in a moment.
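As a quick (and entirely optional) sanity check, you can load the file and confirm that the two variables are set in your shell:

source .env
env | grep AWS_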

Finally we can run git add .dvc followed by a git commit to lock in our dvc configuration in this git repository.

Adding files to DVC

OK, so imagine you have a folder full of images for your neural model to train on, stored in data/raw/training-data. We’re going to add this to DVC with:

dvc add data/raw/training-data

After you run this, you’ll get a message along these lines:

100% Add|████████████████████████████████████████████████████████████|1/1 [00:01,  1.36s/file]

To track the changes with git, run:

git add data/raw/.gitignore data/raw/training-data/001.jpg

Go ahead and execute the git command now. This will update your git repository so that the actual data (the pictures of dogs and chicken nuggets) is gitignored, but the .dvc files, which contain metadata about those files and where to find them, are added to the repository. When you’re ready you can now git commit to save the metadata about the data to git permanently.
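For reference, the generated data/raw/training-data.dvc metafile is a small, human-readable YAML document along these lines (the hash below is made up, and the exact fields vary between DVC versions):

outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
  path: training-data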

Storing DVC data in backblaze

Now we have the acid test: this next step will push your data to your Backblaze bucket if we have everything configured correctly. Simply run:

source .env
dvc push

At this point you’ll either get error messages or a bunch of progress bars that populate as the images in your folder are uploaded. Once the process is finished you’ll see a summary that says N files pushed, where N is the number of pictures you had in your folder. If that happened, then congratulations: you’ve successfully configured DVC and Backblaze.

Getting the data back on another machine

If you want to work on this project with your friends, or you want to check out the project on your other laptop, then you (or they) will need to install git and DVC before checking out your project from GitHub (or wherever your project is hosted). Once they have a local copy, they will find the .dvc metafiles (such as data/raw/training-data.dvc) describing where the training data lives, but not the data itself.

Your git repository should have all of your DVC configuration in it already, including the endpoint URL for your bucket. However, in order to check out this data they will first need to create a .env file of their own containing a key pair (ideally one that you’ve generated for them and locked down as much as possible to just the project you’d like to collaborate on). Then they will need to run:

source .env
dvc pull

This should begin the process of downloading your files from Backblaze and making local copies of them in data/raw/training-data.
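Putting it all together for a collaborator, the end-to-end sequence looks roughly like this (the repository URL and the key values are placeholders for your own):

git clone https://github.com/your-username/your-project.git
cd your-project
echo "export AWS_ACCESS_KEY_ID='<keyID>'" > .env
echo "export AWS_SECRET_ACCESS_KEY='<applicationKey>'" >> .env
source .env
dvc pull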

Streamlining Workflows

One final tip I’d offer is using dvc install which will add hooks to git so that every time you push and pull, dvc push and pull are also automatically triggered - saving you from manually running those steps. It will also hook up dvc checkout and git checkout in case you're working with different data assets on different project branches.
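It’s a one-off command, run from the root of the repository after dvc init:

dvc install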

Final Thoughts

Congratulations: if you got this far it means you’ve configured DVC and Backblaze B2 and have a perfectly reproducible data science workflow at your fingertips. This workflow is well optimised for teams of people working on data science experiments that need to be repeatable, or that have large volumes of unwieldy data in need of a better home than git.

If you found this post useful please leave claps and comments or follow me on twitter @jamesravey for more.

Originally published at Brainsteam.
