Provenance, sometimes also referred to as lineage, is the documentation of a dataset's origin. It includes how the data was collected or generated, and which methodologies, instruments and/or software were used for its creation. You can think of it as the data workflow from the start of a project to the point where a dataset is published and you have the project's final product. Provenance is complicated because data usually goes through many steps and iterations. Given the nature of research, the objectives and methods of your analysis might change several times; you might keep some of the steps and change others. This is why a good provenance record is so important for reproducibility and for being able to share your data. There are tools that help record some of the steps automatically, but ultimately none of them will produce a good provenance record without regular manual intervention. It is not enough to track the changes; you also need to know why they happened.

Important things to keep track of:
- What data you used as input, if any. It helps if the dataset has been published and has a DOI you can refer to, or is at least well documented and properly versioned. There are situations where you do not have a choice, but when you do, a well-documented dataset is the safer option for your analysis.
- Use a version control system for your analysis code. As with data, if you are using someone else's code, make sure it is properly documented and versioned. (link) Use all the options given to you by a version control system; GitHub, for example, has README files, issues, project plans and commit messages, which all help you track not only the changes but also why they happened.
- Use good coding practices and metadata conventions for your dataset. This is necessary when you eventually want to share the data or code, and it will also make it easier for you to remember what the data is and what the code is doing.
- Review often, before you forget. You could make it a habit at the end of each working day to check that your previous notes, metadata, etc. are all still relevant.
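
The points above can be sketched as a small log that records each processing step together with the reason for it. This is only a minimal illustration, not an established provenance tool: the function names (`file_checksum`, `current_commit`, `make_record`, `append_record`), the record fields and the one-JSON-object-per-line log format are all assumptions made for this example.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def file_checksum(path):
    """Return the SHA-256 hex digest of a file, to identify an exact data version."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit():
    """Return the current git commit hash, or "unknown" if git is unavailable
    or the working directory is not a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def make_record(step, reason, inputs, outputs, code_version):
    """Build one provenance entry (hypothetical layout).

    step         -- short name of the processing step
    reason       -- why the step was run or changed (the manual part)
    inputs       -- identifiers of the input data, e.g. DOIs or version labels
    outputs      -- paths of files produced; each is stored with its checksum
    code_version -- version of the analysis code, e.g. a commit hash
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "reason": reason,
        "inputs": list(inputs),
        "outputs": [{"path": str(p), "sha256": file_checksum(p)} for p in outputs],
        "code_version": code_version,
    }


def append_record(log_path, record):
    """Append a record as one JSON line, so earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
```

A typical call would pass the DOI or version label of each input dataset as `inputs`, the result of `current_commit()` as `code_version`, and a one-sentence `reason` written by hand. The automatic fields only say what changed; the `reason` field is where you note why.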
In conclusion, provenance is a progressive account of your research. Part of the provenance will be directly attached to the data or the code you used, but it is good to have one document that collects all the other sources. A data management plan is a good template for such a document: if you create one at the start of your project and update it regularly, your work will already be done when you want to publish the data, when you need to describe your research in a paper, or when you leave an institution at the end of your PhD or postdoc.