Forums | Admin

Discussion Forums: open-discussion

Start New Thread Start New Thread

 

By: Yves-Eric Martin
RE: :row_sep => :auto and mixed line endings [ reply ]  
2010-03-03 03:14
Thank you for your precious advice. In fact, the patch does @io.gets(@quote_char), so we read until the next quote_char, not line ending, but the problem remains: if we are fed bad data, or if @quote_char is not set properly, we could end up slurping the whole file into memory.

Anyway, I will take things to Github as you suggested, and rewrite the patch to use @io.read.

Thanks again for your help.

By: James Gray
RE: :row_sep => :auto and mixed line endings [ reply ]  
2010-03-02 16:31
> I am sure you are much more familiar with CSV processing than myself, please let me know if you think the above logic is flawed.

Well, your current patch uses @io.gets() and that's one problem. We can't assume we know a line ending before we figure out what it is. If the document used a \r as line endings (legal and actually used in older Mac Operating Systems), you would slurp the entire contents into memory. That might require a lot of memory.

I can't currently see a flaw with your quote logic though, no. I guess, if we get all of these issues sorted out, we can look at adding this to FasterCSV. I think it's time we take this discussion to Github. Can you fork the project, build up a patch that doesn't use gets() (see how I'm using read() instead), and send me a pull request? Thanks.

By: Yves-Eric Martin
RE: :row_sep => :auto and mixed line endings [ reply ]  
2010-03-02 05:19
Hi James, and thank you for your reply.


I do not control user input, so when you say "specify the line ending manually", what happens is that I find myself back at catching MalformedCSVError often and retrying it with a different line ending... which was what the :auto feature was supposed to alleviate (see http://rubyforge.org/forum/message.php?msg_id=6609 ).

I would gladly ignore this problem if such mixed-ending file was really a border case scenario. But:

1) we are talking about Excel on Windows, so mixed-ending files may actually be more common than one may think at first.

2) as the proposed patch shows, the fix does not seem that hard. (I am not 100% sure yet, but I think the current patch may be all that is needed)

About your concerns of duplication, if we assume that the proposed patch is all that is needed, would you consider it be acceptable? Please note that only the first added line is a duplicate. The regexp used is not found in the existing parsers. Also, I do not think duplicating the full parser is necessary (see logic below that explains the current patch).

Regarding escaped quotes, I think the current patch does happen to deal with them properly without any additional specific code. Sorry the comments may have been misleading (I should have wrote: "if unbalanced quote_chars, we are inside a quoted field on the first character of
a quoted (doubled) quote_char"). The relevant code is:



# if unbalanced quote_chars, we are inside a quoted field,
# so finish reading it.
sample += @io.gets(@quote_char) if sample.count(@quote_char)[0] > 0 and not @io.eof?

# Now remove quoted parts from sample
sample.gsub!(quoted_part_regexp, '')



The logic is this:

(assumption: at the start of the file, or data read until now had
an even number of quote_chars)
If we have an odd number of quote_chars in the current sample, then we are either
A) inside a quoted field; or
B) on the first quote of a quote_char pair (quoted quote_char).

So, reading until the next quote_char will:
A1) complete the quoted field if it does not contain more quote_chars; or
A2) get to the first character of a quote_char pair inside the quoted field; or
B1) complete the quote_char pair.

Then removing contents between quote_chars with a non-greedy regexp will remove all quoted contents from the sample. Please note that a quoted field is not necessarily removed all at once (this part is significantly different from the CSV parsing, and explains why I think we do not need to duplicate the parser logic).


An example to help follow the logic:

if sample is:
1,2,"This is ""quot

There is 3 quote_chars, odd number so we continue reading until the next quote_char and get:
1,2,"This is ""quoted"

Now the gsub! will remove
1) "This is "
2) "quoted"
effectively producing the following sample
1,2,

No newlines found yet, so we continue processing and read more data, getting the new sample:
"goodness!",3,4,5\r\n1,2

2 quote_chars, even number so we apply the gsub! directly, removing
3) "goodness!"
leaving us with the sample:
,3,4,5\r\n1,2

"\r\n" found, autodetection complete. The original data was:
1,2,"This is ""quoted""goodness!",3,4,5\r\n1,2

and the actual sample looked at was
1,2,,3,4,5\r\n1,2

Note that the quoted field was removed in 3 parts (that is why I called the regexp quoted_part_regexp and not quoted_field_regexp), and that any newline separator inside is would have been properly ignored by the autodetection.

I am sure you are much more familiar with CSV processing than myself, please let me know if you think the above logic is flawed.

Thank you.

By: James Gray
RE: :row_sep => :auto and mixed line endings [ reply ]  
2010-03-01 15:14
My concern is that we are copying the parser code to attempt this. We would still need to do more too, because your patch does not take into account escaped quotes yet.

The :auto mode is just intended as a simple helper. I think it's reasonable to say, if you have a non-simple case like mixed line endings, that you need to specify the line ending manually.

Does that make sense?

By: Yves-Eric Martin
RE: :row_sep => :auto and mixed line endings [ reply ]  
2010-03-01 08:54
Sorry for replying to myself. I just finished a quick patch that implements the "skip quoted fields in row separator autodetection" suggestion. Pasting it below for your review.


--- faster_csv_ORIG.rb 2010-03-01 17:47:33.000000000 +0900
+++ faster_csv.rb 2010-03-01 17:50:38.000000000 +0900
@@ -1721,6 +1721,10 @@
else
begin
saved_pos = @io.pos # remember where we were
+ esc_quote = Regexp.escape(@quote_char)
+ quoted_part_regexp = Regexp.new("#{esc_quote}.*?#{esc_quote}",
+ Regexp::MULTILINE,
+ @encoding)
while @row_sep == :auto
#
# if we run out of data, it's probably a single line
@@ -1735,6 +1739,13 @@
sample = @io.read(1024)
sample += @io.read(1) if sample[-1..-1] == "\r" and not @io.eof?

+ # if unbalanced quote_chars, we are inside a quoted field,
+ # so finish reading it.
+ sample += @io.gets(@quote_char) if sample.count(@quote_char)[0] > 0 and not @io.eof?
+
+ # Now remove quoted parts from sample
+ sample.gsub!(quoted_part_regexp, '')
+
# try to find a standard separator
if sample =~ /\r\n?|\n/
@row_sep = $&

By: Yves-Eric Martin
:row_sep => :auto and mixed line endings [ reply ]  
2010-03-01 06:55
Hi all,


■ Problem: FasterCSV autodetects row separator inside quoted fields, causes improper MalformedCSVError

I am facing an issue with FasterCSV, that kind of defeats the purpose of the :row_sep => :auto option. It looks like FasterCSV considers the first "\n" or "\r\n" that it finds in the file, regardless of whether is is inside a quoted field or not, to be the row separator.

At first, the above seem like a reasonable assumption. Who would be mixing line endings in the same file? Well unfortunately, it seems like Excel does. A CSV file generated by Excel on Windows seems to have "\n" for newlines inside the data, while "\r\n" is the row separator.

Of course, this causes a MalformedCSVError if a newline is present inside some data on the first row: FasterCSV will wrongly auto-detect it ("\n") as the row separator.


■ Proposed solution: when scanning the file for row separators, ignore quoted data.

To prevent the above issue from happening, quoted data should be ignored when auto-detecting the row separator.


Can anyone comment on the validity of this problem, and feasibility (if appropriate) of the proposed solution? Should I fill in an entry in the tracker?


Thank you,

--
Yves-Eric