
[#18968] Parsing of anything on unicode non-ASCII lines is broken

Date:
2008-03-20 18:29
Priority:
3
Submitted By:
Oleg Ivanov (morhekil)
Assigned To:
Nobody (None)
Category:
None
State:
Open
Summary:
Parsing of anything on unicode non-ASCII lines is broken

Detailed description
If a line starts with a non-ASCII Unicode character, all Maruku markup that follows on that line is not parsed. I ran
into this when working with Maruku in a Rails environment, because Rails sets a Unicode locale, but it should be a
general bug: anyone with a Unicode locale will be hit by it. Here is an example:

---

Тестовый стринг <mail@mail.com>

Test string <mail@mail.com>

---

It will be parsed to

---

<p>Тестовый стринг &lt;mail@mail.com&gt;</p>

<p>Test string <a href='mailto:mail@mail.com'>&#109;&#097;&#105;&#108;&#064;&#109;&#097;&#105;&#108;&#046;&#099;&#111;&#109;</a></p>

---


The problem lies in the next_matches method of CharSourceManual (lib/maruku/input/charsource.rb). It uses the regexp
/.{#{@buffer_index}}#{r}/m to match the given regexp against part of the input string, but for a Unicode string
@buffer_index can't be used as the repetition counter. Instead, we can take the relevant part of the input string and
match it against the original regexp from the method's arguments. See the attached patch.
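
The idea can be sketched in isolation. This is only an illustration, not the attached patch and not Maruku's actual code: the two helper names are hypothetical, and the offset is assumed to be a byte offset that falls on a character boundary. On current Ruby versions strings are indexed by character, so the sketch passes an explicit byte offset to reproduce the mismatch between the offset and the number of characters that `.` skips.

```ruby
# Hypothetical helpers for illustration only; not Maruku's real methods.

# Counter-based check: skip `index` characters with `.{index}`, then try `r`.
# If `index` counts bytes and the text contains multi-byte characters,
# `.{index}` skips too far and the match fails.
def next_matches_with_counter(buffer, index, r)
  !!(/\A.{#{index}}#{r}/m =~ buffer)
end

# Slice-based check: cut off the already consumed bytes, then match the
# original regexp at the start of what is left.
# Assumes `byte_index` falls on a character boundary.
def next_matches_with_slice(buffer, byte_index, r)
  rest = buffer.byteslice(byte_index, buffer.bytesize - byte_index)
  !!(/\A#{r}/m =~ rest)
end

line   = "Тестовый стринг <mail@mail.com>"
offset = "Тестовый стринг ".bytesize  # byte offset of the "<"

puts next_matches_with_counter(line, offset, /</)  # prints false
puts next_matches_with_slice(line, offset, /</)    # prints true
```

For a pure ASCII line such as "Test string <mail@mail.com>" the byte offset and the character count coincide, so both helpers agree; that matches the report, where only the line with Cyrillic text is parsed incorrectly.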



Followup

Message
Date: 2008-03-20 23:45
Sender: Andrea Censi

Thanks for the patch; I'm afraid this Unicode problem also
shows up somewhere else.

Date: 2008-03-20 18:32
Sender: Oleg Ivanov

Oops, I forgot to add the multiline flag to the regexp. Please
use this patch instead.

Attached Files:

Name                        Description
charsource-unicode.patch    Patch for the regexp used under Unicode locale
charsource-unicode-m.patch  Proper patch for the regexp used under Unicode locale

Changes:

Field       Old Value                         Date              By
File Added  3482: charsource-unicode-m.patch  2008-03-20 18:32  morhekil
File Added  3481: charsource-unicode.patch    2008-03-20 18:29  morhekil