entities, was: Test cleanup: code.yml

Stephen Bannasch stephen.bannasch at deanbrook.org
Tue Aug 28 00:25:24 EDT 2007


At 11:55 AM -0400 8/27/07, Jason Garber wrote:
>Stephen, I was referring to the translation of "Barnes & Noble" into 
>"Barnes &amp; Noble" in regular text.  Textile2 does it as "Barnes 
>&#38; Noble" so I prefer the SuperRedCloth way of converting certain 
>entities into the named entities instead of numerical ones (see
>superredcloth_inline.rl about line 209).

I agree; however, I'm wrestling with whether SRC should use named entities only for the first four of the five standard XML character entity references:

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entities_in_XML

  &amp;  & (ampersand, U+0026)
  &lt;   < (left angle bracket, less-than sign, U+003C)
  &gt;   > (right angle bracket, greater-than sign, U+003E)
  &quot; " (quotation mark, U+0022)
  &apos; ' (apostrophe, U+0027)

These are the only named entities that XML documents can contain unless additional character entities are declared. The apostrophe entity isn't actually a standard HTML entity, so only the first four should be used.
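
To make that concrete, here is a minimal Ruby sketch of escaping only the four characters that have a named entity safe in both XML and HTML. The names XML_SAFE_ENTITIES and escape_markup are made up for illustration; this is not the code in superredcloth_inline.rl.

```ruby
# Hypothetical helper, not SuperRedCloth's actual implementation.
# Maps each markup-significant character to one of the four named
# entities that are valid in both XML and HTML.
XML_SAFE_ENTITIES = {
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;",
  '"' => "&quot;"
}.freeze

# Replace every occurrence of the four characters in one pass.
# The apostrophe is deliberately left alone because &apos; is not
# a standard HTML entity.
def escape_markup(text)
  text.gsub(/[&<>"]/, XML_SAFE_ENTITIES)
end

puts escape_markup("Barnes & Noble")  # prints: Barnes &amp; Noble
```

Note that this naive version would also escape the ampersand of an entity that is already present in the input, so it only suits raw text.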

If the output of SRC is going to be processed by any other general XML processor, any other special character references should be converted to numerical Unicode character references.
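
As a rough sketch of that conversion, assuming a lookup table of entity names and code points (the three entries below are just a small sample of the XHTML entity set, and numeric_references is a hypothetical name):

```ruby
# Illustrative subset of XHTML named entities and their Unicode code
# points. A real table would cover the full XHTML entity files.
NAMED_CODE_POINTS = {
  "rarr"  => 8594, # U+2192 rightwards arrow
  "nbsp"  => 160,  # U+00A0 no-break space
  "pound" => 163   # U+00A3 pound sign
}.freeze

# Rewrite named references found in the table as decimal numeric
# references. Names not in the table (including amp, lt, gt and quot)
# are left untouched.
def numeric_references(text)
  text.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    code = NAMED_CODE_POINTS[Regexp.last_match(1)]
    code ? format("&#%d;", code) : match
  end
end

puts numeric_references("a &rarr; b &amp; c")  # prints: a &#8594; b &amp; c
```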

We are using an XML processing path that chokes on entity references like &rarr;. However, we are specifically including XHTML content in our XML documents, and right now we are not including the XHTML DTD character entity declarations. I'm now thinking we should include these declarations.

See: http://www.w3.org/TR/xhtml1/dtds.html#h-A2

Which references these three DTD entity files:

  http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
  http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
  http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent

You applied my patch which removes &rarr; (which fixes my current problem).  The only other character entity in SRC is in test/table.yml in the output around line 180. The nbsp is the output for '|.|'.  However, I don't even see any code for producing the nbsp entity anymore, so you could probably replace the test with '| |' and look for <td> </td>.

>It still doesn't convert
>all entities though, so if I write a pound sign (£), it is not 
>converted to &pound; or &#163;.  Should it be?

I'm not sure what is best. When should a 'special' character like £ be converted to an entity?

It's not part of the document structure like '<' so we don't need to convert it for that reason.

I don't think it's valid in a URL so it shouldn't appear there at all.

In a document:

If you convert it to a named character entity it will be human-readable, but it will cause a generic XML processor to choke unless the entity is declared.

If you convert it to a numerical character reference I think it will work everywhere, but it won't be easy for a human to read.

If you leave it un-translated AND transport protocols don't mess with it AND it's displayed properly to users then it's probably best not to translate it at all.
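
For comparison, the numerical option could look something like this in Ruby. It is only a sketch under the assumption that the text is UTF-8, and numeric_escape_non_ascii is a made-up name:

```ruby
# Hypothetical helper: replace every non-ASCII character with a decimal
# numeric character reference, which any XML processor accepts without
# extra entity declarations.
def numeric_escape_non_ascii(text)
  text.gsub(/[^[:ascii:]]/) { |char| format("&#%d;", char.ord) }
end

# "\u00A3" is the pound sign (U+00A3).
puts numeric_escape_non_ascii("Price: \u00A35")  # prints: Price: &#163;5
```

The output is plain ASCII, so it survives transports that mangle encodings, at the cost of being harder to read in the source.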
