Documenting your code
Documenting the code you use to process, clean and analyse your data serves three main purposes:
- It helps you to remember what you did when you come back to the code after a break.
- It enables collaborators, supervisors and others who are familiar with the data and methods to understand your ‘working’.
- It assures future readers, who may be unfamiliar with the data and may not be specialists, that the computations were carried out rigorously.
This section provides some general guidance on documenting code, including what should be documented and how.
Commenting
It is typically best to describe a block of code in a short two- or three-line comment, instead of adding a comment before each command or line of code, primarily because it is more readable. Where a step relies on a specific assumption, state the assumption and explain why you made it.
Comments that only restate what the code does are usually unnecessary. Comments that explain why something was done are often more useful, both for your future self and for other users.
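As a minimal sketch of this advice in Python, the block below uses one short comment to record the assumption behind a cleaning step and the reason for it, instead of commenting every line. The function name, the -9 missing-value code and the 0 to 120 age range are hypothetical and chosen only for illustration.

```python
from statistics import mean


def mean_valid_age(ages):
    """Return the mean age after dropping missing and implausible values."""
    # Ages are self-reported. In this hypothetical codebook, -9 means
    # "not answered", and values above 120 are assumed to be entry errors.
    # Both are excluded rather than recoded, to avoid inventing data.
    valid = [age for age in ages if age is not None and 0 <= age <= 120]
    return mean(valid)


print(mean_valid_age([34, 51, -9, None, 430, 29]))
```

The comment says nothing about how the list is filtered, which the code already shows. It records the decision that a later reader could not recover from the code alone.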
Sourcing your data
It is good practice to state the source of your data in the code. If possible, include the full citation and a persistent link, such as a Digital Object Identifier (DOI) or URL, for the dataset you have used. If you are using ‘raw data’, provide as much information as possible, e.g., the date it was extracted from a database and the parameters used.
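One possible way to keep this information with the code is a small provenance record near the top of the script. The sketch below is only an illustration: every value is a placeholder to be replaced with the details of the real dataset, and the names DATA_SOURCE and describe_source are hypothetical.

```python
# Placeholder provenance record. Every value below is illustrative and
# does not refer to a real dataset.
DATA_SOURCE = {
    "citation": "Author, A. (Year). Title of dataset [data collection]. Publisher.",
    "doi": "10.xxxx/placeholder",
    "extracted_on": "YYYY-MM-DD",
    "query_parameters": "placeholder, e.g. region and years selected",
}


def describe_source(source):
    """Format the provenance record as comment-style lines for a log or output header."""
    return "\n".join(f"# {key}: {value}" for key, value in source.items())


print(describe_source(DATA_SOURCE))
```

Printing the record at the start of a run means the same details also appear in any saved output or log.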
Version of the software
State the version of the software and of any libraries or packages you use. The default behaviour of commands is not always backward compatible, meaning that an older version may do something differently from a newer one, so the same code can give different results.
Some commands may also be deprecated, which means that their use is no longer recommended, alternatives are encouraged, and they may be removed in a later version.
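In Python, for example, the versions in use can be recorded with the standard library alone. This is a minimal sketch: the helper names package_version and version_report are hypothetical, and the package names passed in are only examples that may or may not be installed on a given machine.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a note if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def version_report(packages):
    """Build a short text report of the interpreter and package versions."""
    lines = [f"Python {platform.python_version()} on {platform.system()}"]
    for name in packages:
        lines.append(f"{name} {package_version(name)}")
    return "\n".join(lines)


# Example package names only; list the packages your own analysis imports.
print(version_report(["pandas", "numpy"]))
```

Saving this report alongside the results gives a later reader the information needed to recreate a comparable environment.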
Variable naming and labelling
Use meaningful names for your variables, add meaningful labels and declare the measurement unit (if any). Be consistent with your variable names within a dataset.
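A short Python illustration of this guidance follows. The variables, values and labels are invented for the example. The point is only that the unit is part of each name and that a separate label records what each variable means.

```python
# Hard to follow: what are x1 and x2, and in what units?
# x1 = 72.5
# x2 = 1.78

# Clearer: each name says what is measured and in which unit.
weight_kg = 72.5
height_m = 1.78
bmi_kg_per_m2 = weight_kg / height_m ** 2

# Hypothetical labels, kept in one place so they can be reused in
# tables, plots and a codebook.
VARIABLE_LABELS = {
    "weight_kg": "Body weight at baseline (kilograms)",
    "height_m": "Standing height at baseline (metres)",
    "bmi_kg_per_m2": "Body mass index (kilograms per square metre)",
}

for name, label in VARIABLE_LABELS.items():
    print(f"{name}: {label}")
```

Using the same pattern throughout a dataset (here, a lowercase name followed by the unit) keeps names predictable.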
Checking your code
It is always useful to check your data cleaning steps and any computations using descriptive statistics (e.g., crosstabs, frequencies, minimums and maximums, and the amount of missing data) to make sure they have been carried out correctly. Checks of this kind are related to what programmers call unit testing, in which small pieces of code are tested against expected results.
You will probably carry out these checks as you go along when you are writing the code, but moving the checks to the end of the code will make it easier to read for others. It will also provide a final check that the preceding code has done what was expected and that nothing has been overwritten or altered. Some languages also have dedicated packages for unit testing, e.g., testthat in R.
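The checks described above could be gathered at the end of a script along the following lines. This is a sketch under stated assumptions, not a prescribed method: the records, the field names, the allowed categories and the thresholds are all hypothetical and would need to match the real data.

```python
from collections import Counter


def check_cleaned(records, max_missing_age=1):
    """Run basic descriptive checks on cleaned records.

    Raises AssertionError if a check fails, otherwise returns a summary.
    """
    ages = [record["age_years"] for record in records]
    n_missing = sum(age is None for age in ages)
    observed = [age for age in ages if age is not None]

    # Amount of missing data (hypothetical tolerance).
    assert n_missing <= max_missing_age, f"{n_missing} ages are missing"
    # Minimum and maximum within a plausible range (assumed 0 to 120).
    assert min(observed) >= 0 and max(observed) <= 120, "age out of range"
    # Frequencies: only the expected categories should remain.
    sex_counts = Counter(record["sex"] for record in records)
    assert set(sex_counts) <= {"female", "male"}, "unexpected category"

    return {
        "n": len(records),
        "n_missing_age": n_missing,
        "min_age": min(observed),
        "max_age": max(observed),
        "sex_counts": dict(sex_counts),
    }


# Invented example records standing in for the cleaned dataset.
cleaned = [
    {"age_years": 34, "sex": "female"},
    {"age_years": 51, "sex": "male"},
    {"age_years": None, "sex": "female"},
    {"age_years": 29, "sex": "male"},
]
print(check_cleaned(cleaned))
```

If an earlier step had overwritten or altered a variable, one of the assertions would stop the script with a message instead of letting the error pass silently. The same assertions could later be moved into test functions for a testing framework.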