| Ticket UUID: | 606141 |
| Title: | csv bug: empty field in middle of row |
| Type: | Bug | Version: | None |
| Submitter: | todolson | Created on: | 2002-09-07 21:39:21 |
| Subsystem: | csv | Assigned To: | andreas_kupries |
| Priority: | 8 | Severity: | |
| Status: | Closed | Last Modified: | 2003-04-26 01:12:48 |
| Resolution: | Fixed | Closed By: | andreas_kupries |
| Closed on: | 2003-04-25 18:12:48 |
| Description: |
csv has a bug parsing an empty field in the middle of a row. In the attached tarball, see sample.data (taken from an MS Access export file). The fourth field in the last three rows is empty, but csv parses it as containing a single double-quote character. The fix is a single line added to csv.tcl; see csv.diff in the tarball, or csv-fixed.tcl. Run the sample program csv-test.tcl to see the difference in action. I'd be grateful if you could include this fix in the next release, as I am relying on the csv package in one of my projects.
-Tod Olson <tod@uchicago.edu>
| User Comments: |
andreas_kupries added on 2003-04-24 07:15:23:
Logged In: YES
user_id=75003
Did a complete rewrite of the parser for the alternate syntax.
The one I committed last was a derivative of the original parser
and simply could not handle the nested "". The new parser
just splits into the primary tokens (", sepchar, remainder) and
then converts the token sequence through a tcl-coded DFA
(state-machine). This is able to detect an embedded ""
sequence correctly.
Committed now. No known bugs in the extended testsuite.
Full pass.
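For readers following along, the tokenize-then-state-machine approach described in this comment can be sketched roughly as below. This is an illustration in Python, not the Tcl code committed to tcllib; the function name `split_csv_line`, the state names, and the single-character separator are all assumptions made for the sketch.

```python
import re

def split_csv_line(line, sep=","):
    """Split one CSV line using primary tokens and a small state machine.

    Hypothetical sketch: assumes `sep` is a single character.
    """
    # Primary tokens: a double quote, the separator, or a run of other text.
    esc = re.escape(sep)
    token_re = re.compile(r'"|' + esc + r'|[^"' + esc + r']+')

    fields, buf, state = [], [], "start"
    for tok in token_re.findall(line):
        if state == "start":            # at the beginning of a field
            if tok == '"':
                state = "quoted"
            elif tok == sep:
                fields.append("")       # empty unquoted field
            else:
                buf.append(tok)
                state = "plain"
        elif state == "plain":          # inside an unquoted field
            if tok == sep:
                fields.append("".join(buf))
                buf, state = [], "start"
            else:
                buf.append(tok)
        elif state == "quoted":         # inside a quoted field
            if tok == '"':
                state = "quote_seen"    # closing quote, or first half of ""
            else:
                buf.append(tok)         # text or separator, taken literally
        else:                           # state == "quote_seen"
            if tok == '"':
                buf.append('"')         # "" inside quotes is a literal quote
                state = "quoted"
            elif tok == sep:
                fields.append("".join(buf))
                buf, state = [], "start"
            else:
                buf.append(tok)
                state = "plain"
    fields.append("".join(buf))         # last field has no trailing separator
    return fields

# An empty quoted field in the middle of a row stays empty.
print(split_csv_line('"a","b","c","","e"'))
# An embedded "" inside a quoted value becomes one quote character.
print(split_csv_line('"say ""hi""",x'))
```

The extra `quote_seen` state is what lets the machine tell a closing quote from the first half of an embedded `""`, which is the case the earlier parser could not handle.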
andreas_kupries added on 2003-04-24 06:19:51:
Logged In: YES
user_id=75003
Committed changes to head. Please test. The testsuite has one of the new cases marked as knownBug. I.e. this code is not completely correct, but handles the majority of cases. It dislikes "" inside of a value and handles that incorrectly.
andreas_kupries added on 2003-04-24 00:39:39:
Logged In: YES
user_id=75003
Actually there is a sample data file in the attached tarball. I am looking into this now.
todolson added on 2003-04-23 20:56:25:
Logged In: YES
user_id=450877
The package certainly works as advertised. However, there are many applications that generate the ill-defined CSV format. The most common CSV files that I see are exported from Microsoft Access, such as what was provided in the original bug report. The utility of this package would be greatly improved if it could parse files exported from these programs. Otherwise, those of us who deal with such data have to roll our own CSV parsers, as some of my colleagues do, or patch every release of tcllib. I could easily provide a small number of test files which include the awkward cases.
lvirden added on 2003-04-23 19:57:37:
Logged In: YES
user_id=15949
The man page I am seeing says this:
FORMAT
Each record of a csv file (comma-separated values, as exported e.g. by Excel) is a set of ASCII values separated by ",". For other languages it may be ";" however, although this is not important for this case (The functions provided here allow any separator character).
If a value contains itself the separator ",", then it (the value) is put between "".
If a value contains ", it is replaced by "".
----
1. the format is a bit off in the man page - it probably should be
(the functions ...)
 ^
2. The "it" in the third point is a bit vague - probably should say something like "If a value needs to contain the " character, the character must be represented as ""."
3. I don't see anything here that references missing data. And that was the point of at least 2 or more bug reports.
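As an illustration of the format rules quoted above, including the missing-data case the man page leaves out, here is a minimal Python sketch. It is not the tcllib implementation; the name `join_csv_record` is hypothetical, and quoting any value that contains a `"` (rather than only doubling it) is an assumption of this sketch.

```python
def join_csv_record(values, sep=","):
    """Join string values into one CSV record (hypothetical sketch)."""
    out = []
    for value in values:
        if sep in value or '"' in value:
            # Double each embedded quote, then wrap the value in quotes.
            # Wrapping values that contain only a quote is an assumption.
            value = '"' + value.replace('"', '""') + '"'
        out.append(value)
    # A missing (empty) value is written as nothing between two separators.
    return sep.join(out)

print(join_csv_record(["a", "", "c"]))
print(join_csv_record(["x,y", 'say "hi"']))
```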
todolson added on 2002-09-08 04:39:21:
File Added - 30645: csvpatch.tar.gz
Attachments:
- csvpatch.tar.gz added by todolson on 2002-09-08 04:39:21.
