We all know that we should be writing file extracts as XML but if we really need to get a CSV file then here is how to do it

Introduction

Comma Separated Values (CSV) files have been used since data first had to be exchanged between two applications. CSV files are an imperfect format that often loses data when not properly constructed and whilst XML is far superior it often raises concern over the size of files produced both for storage and transmission becuase of the overhead of the tags. Therefore if we have to move data using CSV files then it is better to know how to do it properly and here it is.

The definition

  • Each record is one line
    …but: a record separator may consist of a line feed (ASCII/LF=0x0A), or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A).
    …but: fields may contain embedded line-breaks (see below) so a record may span more than one line.
  • Fields are separated with commas
    Example: John,Doe,120 any st.,”Anytown, WW”,08123
  • Leading and trailing space-characters adjacent to comma field separators are ignored.
    So John , Doe ,… resolves to “John” and “Doe”, etc. Space characters can be spaces, or tabs.
  • Fields with embedded commas must be delimited with double-quote characters
    In the above example. “Anytown, WW” had to be delimited in double quotes because it had an embedded comma.
  • Fields that contain double quote characters must be surrounded by double-quotes, and the embedded double-quotes must each be represented by a pair of consecutive double quotes
    So, John “Da Man” Doe would convert to “John “”Da Man”””,Doe, 120 any st.,…
  • A field that contains embedded line-breaks must be surrounded by double-quotes
    So:Field 1: Conference room 1
    Field 2:
    John,
    Please bring the M. Mathers file for review
    -J.L.
    Field 3: 10/18/2002
    …would convert to:Conference room 1, “John,
    Please bring the M. Mathers file for review
    -J.L.
    “,10/18/2002,… 

    Note that this is a single CSV record, even though it takes up more than one line in the CSV file. This works because the line breaks are embedded inside the double quotes of the field.

  • Fields with leading or trailing spaces must be delimited with double-quote characters
    So to preserve the leading and trailing spaces around the last name above: John ,” Doe “,…
  • Fields may always be delimited with double quotes.
    The delimiters will always be discarded.
  • The first record in a CSV file may be a header record containing column (field) names
    There is no mechanism for automatically discerning if the first record is a header row, so in the general case, this will have to be provided by an outside process (such as prompting the user). The header row is encoded just like any other CSV record in accordance with the rules above. A header row for the multi-line example above, might be:
    Location, Notes, “Start Date”, …

Sources

The text above is not original work, we found it on the internet some years ago and have used it ever since, unfortunately we have lost the original reference.

There is also a document at Creativyst that claims copyright from 2002 onwards however we had seen this material prior to that date (circa 1998). The article also discusses CSV and Unicode and conversion to XML. Creativyst also have a number of tools including a CSV to XML that look useful but we have not tried.

If there is an original author who can prove copyright we are more than willing to show appropriate accreditation.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.