Several of our courses reference .zip archives stored on the website. The reference code for our courses is kept in a git repository. Whenever changes are made, the CI/CD infrastructure checks the code, packages it, and uploads the files to the server.
The Problem
We only want to upload archives to the server if their contents have changed. We chose rsync for this, but every archive was re-uploaded each time the upload step ran.
The cause was that the file hashes changed every time we created an archive, even when the files inside it had not changed.
md5 module_001_basics.zip
(out)MD5 (module_001_basics.zip) = 81f97561721eb9fdca96f31af79c2f90
make package # no source changes in between
md5 module_001_basics.zip
(out)MD5 (module_001_basics.zip) = e999800745e9929774e316d706be4ee4
Goals
We should be able to deterministically package files into a .zip archive, such that the hash of a .zip archive made of unchanged files stays the same. This way, when we rsync the set of all archives to the web server, archives with no changes are recognized and left alone.
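This goal can be checked mechanically. The following is a minimal sketch rather than part of our build scripts: the helper name `archives_identical` and the paths in the comments are hypothetical. It compares two archives byte for byte with `cmp`, which sidesteps the difference between `md5` on macOS and `md5sum` on Linux.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit status 0) only if the two archives
# are byte-for-byte identical, which implies identical hashes.
archives_identical() {
    cmp -s "$1" "$2"
}

# Illustrative use (paths are assumptions):
#   make package && cp module_001_basics.zip /tmp/first.zip
#   make package && archives_identical /tmp/first.zip module_001_basics.zip \
#       && echo "deterministic" || echo "archive changed"
```

A check like this could run as part of a CI build to catch a regression in packaging determinism before the nightly upload.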
Investigation
The first attempt used a plain zip -r to package the files. After some research, we learned that zip archives record more than file contents: each entry can carry metadata such as modification, creation, and access times, file owner, and permissions. A change in any of these attributes, even with no change in file contents, can therefore change the bytes of the archive.
zip provides a --no-extra (-X) flag to remove extra file attributes. However, this does not strip all attributes: permissions and modification timestamps are still stored. So, if file contents remain unchanged but a timestamp or permission changes, zip will produce two different archives.
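As a rough illustration of the flag in use, here is a sketch of a packaging helper. The function name `package_dir` and its arguments are hypothetical; our actual packaging commands live in the build system and are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: package a directory without extra file attributes.
# Usage: package_dir <source-dir> <output.zip>
package_dir() {
    local src=$1 out=$2
    # zip updates an existing archive in place, so start from a clean slate.
    rm -f "$out"
    # -q: quiet, -r: recurse into directories, -X: same as --no-extra
    zip -q -r -X "$out" "$src"
}

# Illustrative use (names are assumptions):
#   package_dir module_001_basics module_001_basics.zip
```

Removing the old archive first is a deliberate choice in this sketch: otherwise entries for files deleted from the source directory would linger in the archive.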
Once we added the -X flag to our zip commands, checksums of the archives were the same:
md5 module_001_basics.zip
(out)MD5 (module_001_basics.zip) = 1116ece88719fbe5fc07713959ddbcb3
make package # no source changes in between
md5 module_001_basics.zip
(out)MD5 (module_001_basics.zip) = 1116ece88719fbe5fc07713959ddbcb3
This worked well enough, until we ran the script on our build machine. After some investigation, we learned that the build machine was producing different MD5 hashes because the timestamps of the files were different. Git does not record file modification times, so when the files are freshly cloned for a new “nightly” build, each file receives the time of the checkout rather than the timestamp it had in the previous build. Every nightly build produced archives with different hashes and resulted in a full upload of each file.
The Solution
In order to create a deterministic zip file, we need to make sure that the timestamp of each included file stays the same across machines and packaging runs.
The solution we (and others) arrived at was to set the timestamp of each file to the timestamp of the last commit that touched it. The script below uses git ls-tree to generate a list of tracked files and directories. We iterate over each entry in a loop, asking git log for the committer date of its most recent commit and using touch to apply that date to the file.
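cd "${MESON_SOURCE_ROOT}"
rev=HEAD
# Set each tracked path's mtime to the committer date of its last commit.
if [[ "$OSTYPE" == "darwin"* ]]; then
    # macOS (BSD) touch: -t expects [[CC]YY]MMDDhhmm[.SS]; %M is minutes, %m is the month
    git ls-tree -r -t --full-name --name-only "$rev" | while IFS= read -r filename; do
        touch -t "$(git log --pretty=format:%cd --date=format:%Y%m%d%H%M.%S -1 "$rev" -- "$filename")" "$filename"
    done
else
    # GNU touch: -d accepts the strict ISO 8601 date printed by %cI
    git ls-tree -r -t --full-name --name-only "$rev" | while IFS= read -r filename; do
        touch -d "$(git log --pretty=format:%cI -1 "$rev" -- "$filename")" "$filename"
    done
fi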
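The script has two branches because macOS and Linux ship versions of the touch command that take the timestamp in different forms.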
This can be a time-consuming process for large repositories, so we mapped it to a separate target in our build system (update_timestamps). CI builds only do a dry run of the packaging process and never upload, so they skip update_timestamps and avoid its cost. Nightly builds, which run the “upload” operation in the CI pipeline, update timestamps before generating packages.
.PHONY: upload
upload: buildall
	$(Q)cd $(BUILDRESULTS); ninja update_timestamps
	$(Q)cd $(BUILDRESULTS); ninja package_course_001
	$(Q)cd $(BUILDRESULTS); ninja upload_course_001
For our purposes, this is a sufficient solution. We transfer files using rsync --checksum, which ignores times and examines file sizes. If file sizes differ, there’s a transfer. If sizes match, then they are checksummed (md5), and those that have differing sums are also transferred.
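The transfer step can be sketched as follows. This is an illustration under assumptions, not our actual upload target: the helper name `sync_archives` and the example destination are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sync a directory of archives, deciding what to send
# by checksum instead of by modification time.
# Usage: sync_archives <local-dir> <destination>
sync_archives() {
    local src=$1 dest=$2
    # --recursive:       descend into the source directory
    # --checksum:        compare size, then checksum, instead of size and mtime
    # --itemize-changes: print one line per file that is actually updated
    rsync --recursive --checksum --itemize-changes "$src"/ "$dest"/
}

# Illustrative use (host and paths are placeholders):
#   sync_archives buildresults/archives user@example.com:/var/www/archives
```

With these options, an archive whose bytes match the copy on the server produces no output line and no transfer, regardless of its timestamp.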
If your transfer method does compare timestamps (as rsync does by default, without --checksum), then you need one additional step: setting the timestamp of the generated .zip archive deterministically. We recommend the same general approach as above. Use git to get the timestamp of the most recent commit affecting the relevant folder(s), then use touch to set the timestamp of the generated .zip archive to that value. For example:
ls -l libc-skeleton.zip
(out)-rw-r--r-- 1 phillip staff 126 Feb 20 08:56 libc-skeleton.zip
touch -t "$(git log --pretty=format:%cd --date=format:%Y%m%d%H%M.%S -1 HEAD -- libc-skeleton)" libc-skeleton.zip
ls -l libc-skeleton.zip
(out)-rw-r--r-- 1 phillip staff 126 Dec 3 11:12 libc-skeleton.zip
References
- Embedded Artistry’s Continuous Integration Process
- A Look at CI/CD for Embedded Artistry Course Code (Member’s Only) – original publication, with the context motivating this problem
- Building Deterministic Zip Files with Built-In Commands | Medium by Ezri
