Large Files in GitHub

Original Goal

To generate plots from an SQLite database when the pages render. But SQLite databases are binary files, which Git and GitHub handle poorly, so we went looking for a large-file storage option.

DVC

Originally, Liz thought of DVC because she had used it at Axiom. It is designed to version large datasets as part of a machine learning workflow: the files are stored in Google Drive, and DVC records a pointer to them in the GitHub repo. This idea got no traction in DevOps.

Git Large File Storage (LFS)

Will suggested https://git-lfs.com/. Jody concurred.

So we do have LFS available through the university GitHub Enterprise account, and it looks like a few repos are already using it. There is apparently an additional cost that comes with using LFS, but I'm not sure how much, how it gets charged, or whether the university just absorbs it.

One such repo is https://github.com/acep-uaf/thearcticprogram.net. It has a .gitattributes that looks like:

*.pdf filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
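
For reference, a .gitattributes like that is normally generated with git lfs track rather than written by hand. A minimal sketch, assuming we wanted to track the SQLite database the same way:

git lfs install                 # one-time LFS setup per machine
git lfs track "*.db"            # writes the filter/diff/merge line to .gitattributes
git add .gitattributes aetr.db  # the .gitattributes change must be committed too
git commit -m "Track SQLite databases with LFS"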

Limitations

General notes:

  • GitHub warns when a file exceeds 50 MB and blocks files larger than 100 MB.
    • aetr.db with prices in it is 400 KB
    • Neil's workbooks are ~20 MB each
  • Git is not designed to handle large SQL files. To share large databases with other developers, we recommend using a file sharing service.
  • repos should be under 1 GB, and definitely under 5 GB
  • could package the database as a release asset instead of committing it
  • could remove already-committed files from the repo with BFG Repo-Cleaner or the git filter-repo command, as sketched below (see GitHub's docs on removing sensitive files)
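
A minimal sketch of the git filter-repo route, using aetr.db as the example file and a placeholder repo name. Note that this rewrites history, so everyone would have to re-clone:

git filter-repo --path aetr.db --invert-paths  # drop every version of aetr.db from history
# filter-repo removes the origin remote as a safety measure, so re-add it before force-pushing
git remote add origin git@github.com:acep-uaf/<repo>.git
git push --force origin main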

Costs: one data pack costs $5 per month and provides a monthly quota of 50 GiB of bandwidth and 50 GiB of storage, and there is 1 GB free per account (UAF? ACEP? individual?).

From GitHub's About Git Large File Storage docs: Git LFS cannot be used with GitHub Pages sites.

Why it didn't work

The incompatibility between GitHub Pages and LFS kills this for our use case. The LFS-tracked database is present in the repo, but our GitHub Pages site doesn't seem to recognize it, presumably because the Pages build gets the small LFS pointer file rather than the actual database.
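
For context on what Pages likely sees: in the Git repo itself, an LFS-tracked file is replaced by a small text pointer in this format (the oid and size below are made-up values):

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 409600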

Final Solution

Eh. We put off thinking about it and used small CSV files for figure generation.

A possible workaround in the future is to make a separate repo that builds the DB and pushes it to a public GCS storage bucket. Then we can set up an action in this repo to pull the data from the bucket and stash it as CSVs. That way our DB build will be nice and separate (for permissions etc.), and this repo will have local CSVs, so it will be fast and responsive. The data in the repo will stay up to date, and we'll still be interfacing with an actual database. A sketch of the pull step is below.
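
A minimal sketch of what that action's pull step might run. The bucket name is hypothetical, and the prices table is just the example from above:

# fetch the latest database from the public bucket (public GCS objects are plain HTTPS URLs)
curl -fsSL -o aetr.db "https://storage.googleapis.com/acep-aetr-data/aetr.db"
# export just the tables the figures need as small CSVs
sqlite3 -header -csv aetr.db "SELECT * FROM prices;" > data/prices.csv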