Documenting your code

Documenting the code you use to process, clean and analyse your data has three main uses:

It helps you to recall what you did when you come back to the code after a break.
It enables collaborators, supervisors and others who are familiar with the data and methods to understand your ‘working’.
It assures anyone who reads the code in the future, who may be unfamiliar with the data and may not be a specialist, that the computations were conducted rigorously.

This section provides some general guidance on documenting code, including how and what should be documented.

Commenting

It is typically best to describe a block of code in a short two- or three-line comment, instead of adding a comment before each command or line of code, because this is more readable. Where a step relies on a specific assumption, state the assumption and why you made it.

Comments in code often explain what was carried out, but this is usually unnecessary because the code itself shows it. Comments explaining why something was done are often more useful, both for your future self and for other users.
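As a minimal sketch of a block comment that records the reason for a step, the Python example below recodes two values to missing. The data, the codes -9 and -8, and the name MISSING_CODES are all hypothetical and only there for illustration.

```python
# Hypothetical survey records: (participant_id, weekly_income_gbp)
records = [(1, 420.0), (2, -9.0), (3, 515.5), (4, -8.0)]

# Recode -9 and -8 to missing (None) rather than dropping the rows.
# Why: in this hypothetical dataset these values are codes for "refused"
# and "don't know", and later steps need every participant to be kept.
MISSING_CODES = {-9.0, -8.0}
cleaned = [
    (participant_id, None if income in MISSING_CODES else income)
    for participant_id, income in records
]
```

A comment such as "loop over records" would add nothing here, whereas the assumption about what -9 and -8 mean cannot be recovered from the code alone.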

Guidance on writing comments in code

Code documentation

Code documentation guidance (with examples in R and Python) by the UK Government Analysis Function

Best practices for writing code comments

by Ellen Spertus on Stack Overflow

Version of the software

The version of the software and of any libraries or packages being used should be stated. The default behaviour of commands is not always backward compatible, meaning older versions may do something differently from newer versions.

Some commands may also be deprecated, which means that using the command is no longer recommended and alternatives are encouraged.
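One possible way to state versions is to have the code report them itself. The sketch below uses Python's standard library to look up the interpreter version and the installed version of each named package. The function name describe_environment is an assumption made for this example, and "pip" is used only because it is likely to be installed.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return the Python version and the installed version of each named package."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the report is still produced.
            versions[name] = "not installed"
    return versions


# Print the versions at the top of a run so they are saved with the output.
print(describe_environment(["pip"]))
```

Listing the packages your analysis actually imports, and keeping the printed result with your outputs, gives a record of what the code was run with.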

Source data

It is good practice to cite the source of your data in the code. If possible, include the full citation and Digital Object Identifier (DOI) or URL for the dataset you have used. If you are using ‘raw data’, provide as much information as possible, e.g., the date it was extracted from a database, the parameters used, etc.
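A simple way to do this is to keep the source details in one place at the top of the script. The Python sketch below is illustrative only: every value is a placeholder to be replaced with your own details, and the names SOURCE and format_source_note are assumptions made for this example.

```python
# Source data: every value below is a placeholder, not a real reference.
SOURCE = {
    "citation": "<full citation of the dataset>",
    "identifier": "<DOI or URL>",
    "extracted_on": "<date the data were extracted from the database>",
    "extraction_parameters": "<filters or query parameters used>",
}


def format_source_note(source):
    """Return the source details as one 'field: value' line per entry."""
    return "\n".join(f"{field}: {value}" for field, value in source.items())


# Print the note at the start of a run so it is saved with the output.
print(format_source_note(SOURCE))
```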

Variable naming and labelling

Use meaningful names for your variables, add meaningful labels and declare the measurement unit (if any). Be consistent with your variable names within a dataset.
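The Python sketch below illustrates one way to apply this. The variables, labels and units are hypothetical, and the names VARIABLE_LABELS and describe are assumptions made for this example.

```python
# Unclear: x1 = 72.5
# Clearer: the name says what was measured and in which unit.
resting_heart_rate_bpm = 72.5

# A small, hypothetical codebook that keeps labels and units next to the code.
# The names follow one pattern throughout: measurement, then unit.
VARIABLE_LABELS = {
    "resting_heart_rate_bpm": {"label": "Resting heart rate", "unit": "beats per minute"},
    "height_cm": {"label": "Standing height", "unit": "centimetres"},
}


def describe(variable):
    """Return a readable label with its unit, e.g. for a table or plot axis."""
    info = VARIABLE_LABELS[variable]
    return f"{info['label']} ({info['unit']})"
```

For example, describe("height_cm") returns "Standing height (centimetres)".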

Checking your code

It is always useful to check your data cleaning steps and any computations using descriptive statistics (e.g., crosstabs, frequencies, minimums and maximums, checking the amount of missing data) to make sure they have been carried out correctly. Checks of this kind are closely related to unit testing, in which small pieces of code are tested automatically against expected results.

You will probably carry out these checks as you go along when you are writing the code, but moving the checks to the end of the code will make it easier for others to read. It will also provide a final check that the preceding code has done what was expected and that nothing has been overwritten or altered. Some languages also have dedicated packages for unit testing, e.g., testthat in R.
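As a rough sketch of final checks placed at the end of a script, the Python example below summarises a cleaned variable and asserts that the results are as expected. The data, the plausible age range of 0 to 120, and the function name summarise are assumptions made for this example.

```python
def summarise(values):
    """Return the count, number of missing values, minimum and maximum."""
    observed = [value for value in values if value is not None]
    return {
        "n": len(values),
        "n_missing": len(values) - len(observed),
        "min": min(observed) if observed else None,
        "max": max(observed) if observed else None,
    }


# Hypothetical cleaned variable; None marks a missing value.
ages = [34, 51, None, 28, 47]

# Final checks: the script stops with an error if any expectation is not met.
summary = summarise(ages)
assert summary["n"] == 5, "rows were added or dropped during cleaning"
assert summary["n_missing"] == 1, "unexpected amount of missing data"
assert 0 <= summary["min"] and summary["max"] <= 120, "age outside plausible range"
```

Because each assertion carries a message, a failed check says which expectation was broken rather than only that something went wrong.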

Learn more about documenting your code

Quality assurance of code for analysis and research