Some issues in the HTTP parser (and suggestions)

Eric Wong normalperson at yhbt.net
Sat May 8 12:17:06 EDT 2010


Iñaki Baz Castillo <ibc at aliax.net> wrote:
> 2010/5/7 Eric Wong <normalperson at yhbt.net>:
> > Underscore isn't valid for hostnames, but it is allowed in domain names
> > and most DNS servers will resolve them.  I've personally seen websites
> > with underscores in their domain names in the wild[1].
> 
> Hi Eric, could you point me to the spec stating that underscore is
> allowed in a domain name? In the past I wrote a SIP parser [*] with
> Ragel that is 100% strict about the BNF grammar, and note that SIP
> reuses 80% of the HTTP grammar. I'm pretty sure that "_" is not valid
> in a domain (host, hostname or whatever). Anyhow, it's better just to
> allow it at the parsing level :)

RFC 2782 (DNS SRV records) uses underscore-prefixed labels such as
_http._tcp.example.com:

  http://www.ietf.org/rfc/rfc2782.txt

Even if it's not part of the RFC, our parser will match reality and
accommodate broken things we see in the wild, as it has done in the
past:

  http://mid.gmane.org/20080327215027.GA14531@untitled
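
To make the difference between the strict rule and "matching reality"
concrete, here is a minimal Python sketch. It is not the Ragel code the
parser uses. The names STRICT_LABEL, LENIENT_LABEL and is_hostname are
made up for this example, and accepting "_" anywhere in a label is an
assumption about how lenient one wants to be.

```python
import re

# Strict label: letters and digits, with "-" allowed only in the middle.
STRICT_LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"

# Lenient label (assumption for this sketch): additionally accept "_"
# anywhere, so "_sip._tcp.example.com" and "my_site.example.com" match.
LENIENT_LABEL = r"[A-Za-z0-9_](?:[A-Za-z0-9_-]*[A-Za-z0-9_])?"


def is_hostname(name: str, lenient: bool = True) -> bool:
    """Return True if every dot-separated label matches the chosen rule."""
    label = LENIENT_LABEL if lenient else STRICT_LABEL
    # One or more labels separated by ".", with an optional trailing dot.
    pattern = rf"{label}(?:\.{label})*\.?"
    return re.fullmatch(pattern, name) is not None
```

With this sketch, "my_site.example.com" is rejected when lenient=False
and accepted by default, while "-bad.example.com" is rejected either way.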

> > We'll have to test the IPv6 addresses and probably split that out into a
> > separate regexp since ":" would raise issues with the port number in
> > existing cases.  This is probably something for post-1.0.
> 
> There is an IETF draft to improve and *fix* the existing BNF grammar for IPv6.
> It also improves the grammar for IPv4 (by disallowing values greater than 255):
> 
>     http://tools.ietf.org/html/draft-ietf-sip-ipv6-abnf-fix
> 
> 
> I've already implemented it in Ragel and I can assure you that it's
> 100% valid and strict (I've done lots of tests):
> 
> alphanum = ALPHA | DIGIT;
> domainlabel = alphanum | ( alphanum ( alphanum | "-" )* alphanum );
> toplabel = ALPHA | ( ALPHA ( alphanum | "-" )* alphanum );
> hostname = ( domainlabel "." )* toplabel "."?;
> dec_octet = DIGIT | ( 0x31..0x39 DIGIT ) | ( "1" DIGIT{2} ) |
>   ( "2" 0x30..0x34 DIGIT ) | ( "25" 0x30..0x35 );
> IPv4address = dec_octet "." dec_octet "." dec_octet "." dec_octet;
> h16 = HEXDIG{1,4};
> ls32 = ( h16 ":" h16 ) | IPv4address;
> IPv6address = ( ( h16 ":" ){6} ls32 ) |
>   ( "::" ( h16 ":" ){5} ls32 ) |
>   ( h16? "::" ( h16 ":" ){4} ls32 ) |
>   ( ( ( h16 ":" )? h16 )? "::" ( h16 ":" ){3} ls32 ) |
>   ( ( ( h16 ":" ){,2} h16 )? "::" ( h16 ":" ){2} ls32 ) |
>   ( ( ( h16 ":" ){,3} h16 )? "::" h16 ":" ls32 ) |
>   ( ( ( h16 ":" ){,4} h16 )? "::" ls32 ) |
>   ( ( ( h16 ":" ){,5} h16 )? "::" h16 ) |
>   ( ( ( h16 ":" ){,6} h16 )? "::" );
> IPv6reference = "[" IPv6address "]";
> host = hostname | IPv4address | IPv6reference;
> port = DIGIT{1,5};
> hostport = host ( ":" port )?;
> 
> 
> This is much better than the deprecated and bogus grammar in RFC 2396 ;)
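
For readers who don't use Ragel, the quoted grammar can be transcribed
almost rule for rule into Python regular expressions. This is only a
sketch: Python's backtracking `re` engine stands in for Ragel's state
machine, and HOSTPORT and parse_hostport are names invented for the
example. Like the grammar, it allows ports up to 99999 and does not
accept "_" in hostnames.

```python
import re

# Character classes used by the grammar.
ALPHA = "A-Za-z"
ALNUM = "A-Za-z0-9"
HEXDIG = "0-9A-Fa-f"

# hostname: "-" only in the middle of a label; the last label (toplabel)
# must start with a letter, which keeps "1.2.3.4" from being a hostname.
domainlabel = rf"[{ALNUM}](?:[{ALNUM}-]*[{ALNUM}])?"
toplabel = rf"[{ALPHA}](?:[{ALNUM}-]*[{ALNUM}])?"
hostname = rf"(?:{domainlabel}\.)*{toplabel}\.?"

# dec_octet: 250-255, 200-249, 100-199, 10-99, 0-9 (no leading zeros).
dec_octet = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
ipv4 = rf"{dec_octet}(?:\.{dec_octet}){{3}}"

h16 = rf"[{HEXDIG}]{{1,4}}"
ls32 = rf"(?:{h16}:{h16}|{ipv4})"
h16c = rf"(?:{h16}:)"  # one 'h16 ":"' group

# The nine IPv6address alternatives, in the same order as the grammar.
ipv6 = "|".join([
    rf"{h16c}{{6}}{ls32}",
    rf"::{h16c}{{5}}{ls32}",
    rf"(?:{h16})?::{h16c}{{4}}{ls32}",
    rf"(?:{h16c}?{h16})?::{h16c}{{3}}{ls32}",
    rf"(?:{h16c}{{0,2}}{h16})?::{h16c}{{2}}{ls32}",
    rf"(?:{h16c}{{0,3}}{h16})?::{h16c}{ls32}",
    rf"(?:{h16c}{{0,4}}{h16})?::{ls32}",
    rf"(?:{h16c}{{0,5}}{h16})?::{h16}",
    rf"(?:{h16c}{{0,6}}{h16})?::",
])

# host = hostname | IPv4address | IPv6reference; hostport = host (":" port)?
host = rf"(?:{hostname}|{ipv4}|\[(?:{ipv6})\])"
port = r"[0-9]{1,5}"
HOSTPORT = re.compile(rf"(?P<host>{host})(?::(?P<port>{port}))?")


def parse_hostport(value):
    """Return (host, port or None), or None if value is not a hostport."""
    m = HOSTPORT.fullmatch(value)
    if m is None:
        return None
    p = m.group("port")
    return m.group("host"), (int(p) if p is not None else None)
```

Because IPv6 addresses are only accepted inside "[" and "]", the ":"
before the port stays unambiguous: "[::1]:80" splits into "[::1]" and
80, while a bare "::1" is rejected.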

Thanks. It might be worth simplifying the grammar a bit for readability
(and possibly performance) at the expense of 100% conformance.
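
As one possible reading of that trade-off, the Python sketch below uses a
deliberately loose pattern. It is hypothetical and not what the parser
does: LOOSE_HOSTPORT and split_hostport are invented names, and the
character classes are assumptions chosen to stay short.

```python
import re

# Loose pattern (assumption): either a bracketed run of hex digits, ":"
# and "." (IPv6), or a run of letters, digits, ".", "_" and "-" for
# everything else, followed by an optional all-digit port.
LOOSE_HOSTPORT = re.compile(
    r"(?P<host>\[[0-9A-Fa-f:.]+\]|[A-Za-z0-9._-]+)(?::(?P<port>[0-9]+))?"
)


def split_hostport(value):
    """Return (host, port string or None), or None if nothing matches."""
    m = LOOSE_HOSTPORT.fullmatch(value)
    if m is None:
        return None
    return m.group("host"), m.group("port")
```

This is much easier to read and still keeps the ":" of a bracketed IPv6
address apart from the port separator, but it no longer rejects values
such as "999.999.999.999" or "[:::]".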

-- 
Eric Wong

