Fingerprinting data changes

I have a suggestion about monitoring data changes. Saving a copy of an experiment's data in Guild is usually not realistic, but it would be very helpful to know whether the data was the same or different between experiments. The usage scenario is like this:

  1. Run the data creation operation and create a fingerprint of the result (for example, an MD5 hash of the resulting data files, or something similar).
  2. When running the experiment operation, fingerprint the input data.

When comparing experiments, one then knows for certain not only how the results and code differ, but also whether the input data differed. This does not immediately show what exactly changed in the data, but combined with step 1 it gives a relatively easy way to trace the source of a difference: one can look up a specific fingerprint among the data creation operations.
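To make the idea concrete, here is a minimal sketch of what such a fingerprint could look like in plain Python. It is not part of Guild; the function name `fingerprint_dir` and the choice of SHA-256 as the default are my own assumptions (passing `algorithm="md5"` would give the MD5 variant mentioned above, where the platform allows it).

```python
import hashlib
from pathlib import Path


def fingerprint_dir(root, algorithm="sha256", chunk_size=1 << 20):
    """Return one hex digest covering every file under `root`.

    Hypothetical helper for illustration only, not a Guild API.
    """
    root = Path(root)
    # Sort by relative POSIX path so the result does not depend on
    # directory listing order or on the platform's path separator.
    files = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    total = hashlib.new(algorithm)
    for rel, path in files:
        # Hash each file's content in chunks to keep memory use flat.
        file_digest = hashlib.new(algorithm)
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                file_digest.update(chunk)
        # Mix the relative path and the file digest into the overall
        # digest, so renames and content changes both alter the result.
        line = f"{rel}\0{file_digest.hexdigest()}\n"
        total.update(line.encode("utf-8"))
    return total.hexdigest()
```

Calling `fingerprint_dir("data/")` after the data creation run and again before the experiment run would give two strings that can be compared directly.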

One requirement is that the user can define exactly how the data is fingerprinted. This should be configurable, as there is no commonly accepted way of fingerprinting. For example, even the md5sum utility does not ship with Windows.
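One possible shape for that configurability, sketched in Python so it does not depend on any external utility: a lookup table from a method name to a function. The names `FINGERPRINTERS`, `content_hash`, `size_only` and `fingerprint` are all hypothetical and only illustrate the idea; none of them exist in Guild.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict


def content_hash(path: Path) -> str:
    """Strict method: SHA-256 of the file content, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def size_only(path: Path) -> str:
    """Cheap method: file size in bytes (misses same-size edits)."""
    return str(path.stat().st_size)


# Hypothetical registry: a user could add their own methods here.
FINGERPRINTERS: Dict[str, Callable[[Path], str]] = {
    "sha256": content_hash,
    "size": size_only,
}


def fingerprint(path, method="sha256"):
    """Fingerprint `path` with the method selected by name."""
    try:
        fn = FINGERPRINTERS[method]
    except KeyError:
        raise ValueError(f"unknown fingerprint method: {method!r}") from None
    return fn(Path(path))
```

A user with very large files might register a cheaper method, while someone who needs strict guarantees would stay with a content hash.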

Another issue is the timing of the fingerprinting. For the data creation operation, fingerprinting should happen after the operation, because the data only exists once the operation has run. For the experiment operation, it is preferable to fingerprint before the operation starts. So this might eventually become "pre-run hook" and "post-run hook" actions.
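As a rough illustration of the timing, the sketch below wraps an operation so that inputs are fingerprinted before it runs and outputs afterwards. This is only how I imagine the hooks behaving; `run_with_fingerprints` and `sha256_file` are made-up names, not existing Guild functionality.

```python
import hashlib
from pathlib import Path


def sha256_file(path):
    """Default fingerprint for this sketch: SHA-256 of the file content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_with_fingerprints(operation, input_paths=(), output_paths=(),
                          fingerprint=sha256_file):
    """Run `operation()` and record input and output fingerprints.

    Hypothetical wrapper for illustration only.
    """
    record = {"inputs": {}, "outputs": {}}
    # "Pre-run hook": inputs must already exist, so hash them first.
    for p in input_paths:
        record["inputs"][str(p)] = fingerprint(p)
    operation()
    # "Post-run hook": outputs only exist once the operation is done.
    for p in output_paths:
        record["outputs"][str(p)] = fingerprint(p)
    return record
```

A data creation run would pass only `output_paths`, an experiment run only `input_paths`; matching an experiment's input fingerprint against the recorded outputs then identifies which data creation run it consumed.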

Awesome suggestion. This is on the roadmap and will appear in a near-term release. It is an important feature that any tool must have before it can claim strict reproducibility.