Saturday, December 13, 2008

Fix Broken Delimiter Separated Values Files

We get broken delimiter-separated values files that are not properly quoted. In fact, they use no quoting at all and rely on the delimiter (a pipe character) never appearing in the field values. Invariably, these files end up with a handful of rows whose fields contain carriage returns, and those rows get dropped during later processing. Below is a simple script to fix such files. The default delimiter is a pipe symbol, and the default number of separators comes from counting the separators in the first row (typically the header row).
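The core repair idea can be sketched in a few lines. This is only a minimal illustration and not the full script: the function name `repair_rows` and its parameters are hypothetical, and it assumes the first row is intact so that its separator count can serve as the expected count. Physical lines are glued together until the running row has enough separators.

```python
def repair_rows(lines, delimiter="|", expected=None):
    """Yield logical rows, joining physical lines split by stray line breaks.

    Hypothetical helper: 'expected' is the number of separators per row.
    If it is None, it is taken from the first non-blank line (assumed to
    be an intact header row).
    """
    pending = ""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line and not pending:
            continue  # skip blank lines that fall between complete rows
        if expected is None:
            expected = line.count(delimiter)
        pending += line
        if pending.count(delimiter) >= expected:
            # The row now has all of its separators, so emit it.
            yield pending
            pending = ""
        else:
            # Row is still short: a field was split by a line break.
            # Replace the break with a space and keep accumulating.
            pending += " "
    if pending.strip():
        # Emit whatever is left rather than silently dropping it.
        yield pending.rstrip()


broken = ["id|note|qty\n", "1|hello\r\n", "world|3\n", "2|fine|5\n"]
for row in repair_rows(broken):
    print(row)
```

One limitation of this sketch: a line break inside the last field of a row cannot be detected by counting separators alone, because the row already looks complete when the break occurs.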

This simple script also demonstrates optparse and shows how to write a script that either uses standard input and output or opens files, depending on its arguments.
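As a rough sketch of that pattern, the snippet below parses options with optparse and falls back to standard input and output when no file names are given. The option names (`-d`/`--delimiter`, `-n`/`--num-separators`) and the helper names `parse_args` and `open_streams` are assumptions made for illustration, not necessarily the ones the script uses.

```python
import sys
from optparse import OptionParser


def parse_args(argv):
    """Parse command-line arguments (hypothetical option names)."""
    parser = OptionParser(usage="usage: %prog [options] [infile [outfile]]")
    parser.add_option("-d", "--delimiter", dest="delimiter", default="|",
                      help="field delimiter (default: pipe)")
    parser.add_option("-n", "--num-separators", dest="count", type="int",
                      default=None,
                      help="separators per row (default: count the first row)")
    # Returns (options, positional_args); argv excludes the program name.
    return parser.parse_args(argv)


def open_streams(args):
    """Use the named files if given, else fall back to stdin / stdout."""
    infile = open(args[0]) if len(args) > 0 else sys.stdin
    outfile = open(args[1], "w") if len(args) > 1 else sys.stdout
    return infile, outfile


if __name__ == "__main__":
    options, args = parse_args(sys.argv[1:])
    infile, outfile = open_streams(args)
    for line in infile:
        outfile.write(line)  # placeholder: the repair step would go here
```

With this layout the same script works in a pipeline (`cat broken.txt | script.py > fixed.txt`) or with explicit file names (`script.py broken.txt fixed.txt`); `script.py` here stands in for whatever the script is actually called.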