Spreadsheets are a quick way of entering tabular data. Usually, it is done as a “temporary” measure, until someone comes and turns it into a proper database. Sometimes, it is used as an intermediate structure for future extra processing.
Whatever the scenario, there are some mistakes that occur so consistently, that they cause a lot of problems later on, and will need a lot of rework before they can be made machine readable.
Here are some basic tips, and what not to do when entering data on a spreadsheet.
Do not enter the full name in one column
Possibly the most common issue. The person’s name, last name, middle name, father name, all entered in the same cell.
- Why is this a problem?
Even if they are entered correctly, names are very non standard, some people have multiple first names or last names, some names are multi word. Sometimes parts of it may be missing, like father for some, first name for others. Overall, it is near impossible to separate these later, without manual intervention.
- Solution:
Have separate columns for first name, family name, father name, mother name, or any other name information that you have. If the information does not exist for an entry, simply leave it blank.
As a general rule, each column/cell should contain only one piece of information.
Do not enter the date of birth in one column
Despite the name, in practice, a date of birth is not really a full proper date. Mostly because some piece of information may be missing. For example, with personal data, it is very common for some people to have missing day and/or month values.
- Why is this a problem?
Assuming missing days or months to be 1 will be introducing wrong information. The person may actually be born on day 31 of month 12, causing an error of one full year. Moreover, it will then be impossible to differentiate them with people actually born on 1/1.
- Solution:
The solution is to have separate columns for the three, and keep them empty or 0 in case the information is missing.
This applies to any date information, that may not have all three components for every case.
Enter dates in ISO format
The ISO-8601 standard defines date and time formats that are unambiguous, and understood by humans and machines alike.
For dates, it comes down to the following format: YYYY-MM-DD. For example, 2022-12-01 corresponds to the first of December 2022.
- Why is this a problem?
There are two mainstream date formats, that are directly contradictory to each other. These are DD/MM/YYYY and the infamous MM/DD/YYYY. To make matters worse, spreadsheet programs try to assume the formats of the date columns, by relying on the localization settings of the operating system where the file is opened.
For example, if a file was written on a system set to MM/DD/YYYY, opening it on another one set to DD/MM/YYYY will interpret the dates incorrectly, some of them not even registering as dates.
Related issue, is that people operating on the data will not know which format is used, especially when dealing internationally.
- Solution:
Use the YYYY-MM-DD format described above. It will be understood by programs, and can be exported and imported safely.
- Related tip:
Always use the ISO-8601 format in CSV or other text files, that you need to import into spreadsheets. It is the only format programs will correctly convert to dates.
Do not enter explanations and elaborations
Usually happens with a combination of the first mistake, and a typical example is something like: “Name (couldn’t read the last name)”.
- Solution:
If you need to convey information, that some data was illegible, put it in a dedicated column, if it is obvious and happens systemically. Or, if you need to write a lot, have a notes column. In both cases, this will be a column made obviously for manual processing.
Do not highlight by colors
A lot of times, after data entry is completed, people come back to it and start coloring some lines in red, green, blue, yellow, etc, indicating some status the data of that line has achieved.
- Why is this a problem?
While some programs may be capable of filtering or sorting by color, it is an information that does not get exported, and it cannot be post-processed without extra steps. Moreover, the meaning of the different colors is not inherent and could be lost over time.
- Solution:
Add columns for status, action, etc, and enter the information there as codes. You can still use colors as an additional visual indicator, or use the conditional formatting capabilities available in most programs, that will let you color based on the values of the cell.