gargoyle image Collected works and current thoughts

Non-Printable Characters and How We Got Them

We do a bit of scripting. We often get input files that have non-printable characters. Sometimes we make them ourselves. How can you identify when that's an issue and how can you fix it?

Overview

This follows Sed Breakdown. We start with a “text-only” csv file. But, spoiler alert, but it’s not text-only. When this happens, it’s non-obvious.

Here is an example file: Example.csv

,,,,,,
,AA03-123-12,F,2021-02-03T12:13:14.567z,,,

Notice I just created a new spreadsheet in Numbers and then exported as CSV. It added several extra commas and such, but that isn’t the issue we’re going to solve… Once you save this file, try the following:

wc Example.csv 
  2   2   52 /Users/bschuch2/Documents/Example.csv

That’s 2 lines, 2 words, 52 characters. We only care about the 52…

Now, we’ll use sed to remove all non-printable characters directly in-place (updates file, does not echo updates lines to standard out):

sed -i 's/[^[:print:]]//' Example.csv

Has the file changed? Example.csv

,,,,,,
,AA03-123-12,F,2021-02-03T12:13:14.567z,,,

While the file looks the same, what were those 2 characters? I think ASCII 0 (null) inserted at the beginning of the file to indicate it’s a rich text file rather than an ASCII-based text file.

Here’s a breakdown of the sed command:

Token Description
-i In-place. Update the file directly rather than output a new file. OK if you have the original.
s Search
first / Beginning of what to search for
[ Start of group of characters
^ Not beginning of line, in [] it inverts the search (this means find a line that has non-printable characters)
[:print:] Refer to the posix name for printable characters, e.g., letters, numbers, space, tab.
] End of group
/ What to replace it with, or the start of the “to” section
/ end of search, so replace non-printable characters with nothing.

As written, this will find any line that has at least one non-printable character and replace the first one. If the line has only printable characters, the regex doesn’t match, and the original line remains unchanged. If the line has a non-printable character, it is removed, but only the first one. That is fine, because the problem characters are at the beginning of the file, and apparently one per line. If you want to be more aggressive, add “g” after the final /. It will replace each non-printable character in the line. In this case, the results are the same.

Conclusion

Don’t trust “text” files from outside. Verify they don’t have unwanted characters. Here’s a quick way to check using the commands from above:

wc -c < Example.csv && sed 's/[^[:print:]]//' < Example.csv | wc -c
52
50

While the formatting is different, you can see if anything would change by looking at the last value.

Using a bit more bash, you can compare the results numerically:

if [[ $(wc -c < e) -ne $(sed 's/[^[:print:]]//' < e | wc -c) ]]; then echo "different";fi

Published 15 February 2021

" Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.