From Gregory.Murphy at Sun.COM Mon Nov 5 18:20:47 2007
From: Gregory.Murphy at Sun.COM (Gregory Murphy)
Date: Mon, 05 Nov 2007 15:20:47 -0800
Subject: [Mediacloth-devel] Fixing treatment of paragraphs, and other fixes
Message-ID: <472FA54F.2040505@sun.com>

I have checked into my private branch (./branches/gjmurphy) a new
version of the lexer and parser, which include some major changes to the
lexer.

I have normalized the treatment of paragraphs, to follow as closely as
possible what Mediawiki actually does. Consecutive spans of "\n\n" will
now generate consecutive paragraphs.

I have also added complete support for dictionary-style lists, and fixed
the handling of pre-formatted spans of text.

As I mentioned in an earlier mail, the lexer was beginning to get very
complicated. So I have taken a pass at simplifying it, by modeling more
closely the design of the GNU lex utility. Mediawiki markup is
especially context-sensitive, and we were dealing with this by adding
conditional logic to the character-handling methods. I have moved all of
this conditional logic into a set of context-specific lexer tables.
Whenever the lexing context changes, an appropriate table of handler
methods is chosen. When the context ends, the previous table of handler
methods is restored. For example, when "[" is encountered, the
link_lexer_table is made current, and when "]" is encountered, the
previous lexer table is restored.

Moving the logic into lexer tables means that the lexer does much less
work. I ran a test on 100 random pages downloaded from Mediawiki, and
the total parse time is now about 30% faster using native Ruby, and
almost 150% faster using JRuby! Part of the speed-up is also due to
changes in the way lists are parsed, which no longer requires that a
separate input buffer be filled with the list contents.
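
The table-switching scheme described in the message can be sketched roughly as follows. This is an illustration only, not the MediaCloth code: apart from link_lexer_table, which the message names, every class, method, and token name here is hypothetical, and the sketch follows the message's single-bracket example rather than full Mediawiki link syntax. Each context owns a table that maps a character to a handler method, and a stack of tables tracks the current context.

```ruby
# Illustrative sketch only. TableLexer, the handler names, and the token
# symbols are assumptions; they are not taken from the MediaCloth source.
class TableLexer
  def initialize
    # Default context: "[" opens a link, every other character is text.
    @default_lexer_table = Hash.new(:match_text)
    @default_lexer_table["["] = :match_link_start

    # Link context: "]" closes the link and "|" acts as a separator.
    @link_lexer_table = Hash.new(:match_text)
    @link_lexer_table["]"] = :match_link_end
    @link_lexer_table["|"] = :match_separator
  end

  def tokenize(input)
    @tokens = []
    @text = String.new
    @table_stack = [@default_lexer_table]
    input.each_char do |char|
      # Dispatch through the current table; handlers hold no context checks.
      send(@table_stack.last[char], char)
    end
    flush_text
    @tokens
  end

  private

  def match_text(char)
    @text << char
  end

  def match_link_start(_char)
    flush_text
    @tokens << [:LINK_START]
    @table_stack.push(@link_lexer_table) # the link table becomes current
  end

  def match_link_end(_char)
    flush_text
    @tokens << [:LINK_END]
    @table_stack.pop # restore the previous table
  end

  def match_separator(_char)
    flush_text
    @tokens << [:SEPARATOR]
  end

  # Emit any buffered plain text as a single TEXT token.
  def flush_text
    return if @text.empty?
    @tokens << [:TEXT, @text]
    @text = String.new
  end
end

p TableLexer.new.tokenize("a|b [c|d] e]")
```

In this sketch, "|" and "]" are ordinary text in the default context and only become markup while the link table is on top of the stack, so the context dependence lives in the choice of table and not in conditionals inside the handlers.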

Because these changes will alter the way paragraphs are generated, I
would like to let others try out the new version in the private branch
before I think about committing this to the trunk.

// Gregory

From Gregory.Murphy at Sun.COM Fri Nov 9 11:25:38 2007
From: Gregory.Murphy at Sun.COM (Gregory Murphy)
Date: Fri, 09 Nov 2007 08:25:38 -0800
Subject: [Mediacloth-devel] Fixing treatment of paragraphs, and other fixes
In-Reply-To: <472FA54F.2040505@sun.com>
References: <472FA54F.2040505@sun.com>
Message-ID: <47348A02.9030609@sun.com>

I have merged these changes into the trunk.

// Gregory

Gregory Murphy wrote:
> I have checked into my private branch (./branches/gjmurphy) a new
> version of the lexer and parser, which include some major changes to the
> lexer.
>
> I have normalized the treatment of paragraphs, to follow as closely as
> possible what Mediawiki actually does. Consecutive spans of "\n\n" will
> now generate consecutive paragraphs.
>
> I have also added complete support for dictionary-style lists, and fixed
> the handling of pre-formatted spans of text.
>
> As I mentioned in an earlier mail, the lexer was beginning to get very
> complicated. So I have taken a pass at simplifying it, by modeling more
> closely the design of the GNU lex utility. Mediawiki markup is
> especially context-sensitive, and we were dealing with this by adding
> conditional logic to the character-handling methods. I have moved all of
> this conditional logic into a set of context-specific lexer tables.
> Whenever the lexing context changes, an appropriate table of handler
> methods is chosen. When the context ends, the previous table of handler
> methods is restored. For example, when "[" is encountered, the
> link_lexer_table is made current, and when "]" is encountered, the
> previous lexer table is restored.
>
> Moving the logic into lexer tables means that the lexer does much less
> work. I ran a test on 100 random pages downloaded from Mediawiki, and
> the total parse time is now about 30% faster using native Ruby, and
> almost 150% faster using JRuby! Part of the speed-up is also due to
> changes in the way lists are parsed, which no longer requires that a
> separate input buffer be filled with the list contents.
>
> Because these changes will alter the way paragraphs are generated, I
> would like to let others try out the new version in the private branch
> before I think about committing this to the trunk.
>
> // Gregory
> _______________________________________________
> Mediacloth-devel mailing list
> Mediacloth-devel at rubyforge.org
> http://rubyforge.org/mailman/listinfo/mediacloth-devel
>