Some issues in the HTTP parser (and suggestions)

Iñaki Baz Castillo ibc at aliax.net
Sat May 8 09:11:05 EDT 2010


2010/5/7 Eric Wong <normalperson at yhbt.net>:
> Underscore isn't valid for hostnames, but it is allowed in domain names
> and most DNS servers will resolve them.  I've personally seen websites
> with underscores in their domain names in the wild[1].

Hi Eric, could you point me to the spec stating that underscore is
allowed in a domain name? In the past I wrote a SIP parser [*] with
Ragel that follows the BNF grammar strictly, and note that SIP reuses
about 80% of the HTTP grammar. I'm pretty sure that "_" is not valid
in a domain (host, hostname or whatever). Anyhow, it's better just to
allow it at the parsing level :)
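
To make the strict-versus-lenient distinction concrete, here is a minimal Python sketch (not Ragel, and not code from either parser). The names STRICT_LABEL, LENIENT_LABEL and all_labels_match are made up for illustration, and the lenient rule is only an assumption about what "allow it at the parsing level" could look like:

```python
import re

# Strict label: letters, digits and "-", never starting or ending with "-".
STRICT_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

# Lenient label (assumption for illustration): additionally accepts "_".
LENIENT_LABEL = re.compile(r"[A-Za-z0-9_](?:[A-Za-z0-9_-]*[A-Za-z0-9_])?")


def all_labels_match(pattern, name):
    """Return True if every dot-separated label of `name` matches `pattern`."""
    return all(pattern.fullmatch(label) for label in name.split("."))


print(all_labels_match(STRICT_LABEL, "my_site.example.org"))   # False
print(all_labels_match(LENIENT_LABEL, "my_site.example.org"))  # True
```

A parser could accept the lenient form and leave it to the application to decide whether such a name should be rejected.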



> We'll have to test the IPv6 addresses and probably split that out into a
> separate regexp since ":" would raise issues with the port number in
> existing cases.  This is probably something for post-1.0.

There is an IETF draft to improve and *fix* the existing BNF grammar for IPv6.
It also improves the grammar for IPv4 (by disallowing values greater than 255):

    http://tools.ietf.org/html/draft-ietf-sip-ipv6-abnf-fix


I've already implemented it in Ragel and I can assure you that it's
100% valid and strict (I've done lots of tests):

alphanum = ALPHA | DIGIT;
domainlabel = alphanum | ( alphanum ( alphanum | "-" )* alphanum );
toplabel = ALPHA | ( ALPHA ( alphanum | "-" )* alphanum );
hostname = ( domainlabel "." )* toplabel "."?;
dec_octet = DIGIT | ( 0x31..0x39 DIGIT ) | ( "1" DIGIT{2} ) |
            ( "2" 0x30..0x34 DIGIT ) | ( "25" 0x30..0x35 );
IPv4address = dec_octet "." dec_octet "." dec_octet "." dec_octet;
h16 = HEXDIG{1,4};
ls32 = ( h16 ":" h16 ) | IPv4address;
IPv6address = ( ( h16 ":" ){6} ls32 ) |
              ( "::" ( h16 ":" ){5} ls32 ) |
              ( h16? "::" ( h16 ":" ){4} ls32 ) |
              ( ( ( h16 ":" )? h16 )? "::" ( h16 ":" ){3} ls32 ) |
              ( ( ( h16 ":" ){,2} h16 )? "::" ( h16 ":" ){2} ls32 ) |
              ( ( ( h16 ":" ){,3} h16 )? "::" h16 ":" ls32 ) |
              ( ( ( h16 ":" ){,4} h16 )? "::" ls32 ) |
              ( ( ( h16 ":" ){,5} h16 )? "::" h16 ) |
              ( ( ( h16 ":" ){,6} h16 )? "::" );
IPv6reference = "[" IPv6address "]";
host = hostname | IPv4address | IPv6reference;
port = DIGIT{1,5};
hostport = host ( ":" port )?;
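
For readers without Ragel at hand, the grammar above can be approximated with Python regular expressions. This is only a sketch under the assumption that ALPHA, DIGIT and HEXDIG mean ASCII letters, decimal digits and hexadecimal digits. The helper names (_pre, is_hostport) are hypothetical, and the result has not been checked against the Ragel machine itself:

```python
import re

ALNUM = "[A-Za-z0-9]"
DOMAINLABEL = f"(?:{ALNUM}(?:[A-Za-z0-9-]*{ALNUM})?)"
TOPLABEL = f"(?:[A-Za-z](?:[A-Za-z0-9-]*{ALNUM})?)"
HOSTNAME = f"(?:(?:{DOMAINLABEL}\\.)*{TOPLABEL}\\.?)"

# dec_octet: 0-9, 10-99, 100-199, 200-249, 250-255 (nothing above 255).
DEC_OCTET = "(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4 = f"(?:{DEC_OCTET}(?:\\.{DEC_OCTET}){{3}})"

H16 = "[0-9A-Fa-f]{1,4}"
LS32 = f"(?:{H16}:{H16}|{IPV4})"


def _pre(n):
    """Optional prefix before "::", i.e. ( ( h16 ":" ){,n} h16 )?"""
    return f"(?:(?:{H16}:){{0,{n}}}{H16})?"


# One regex alternative per alternative of the IPv6address rule, same order.
IPV6 = "(?:" + "|".join([
    f"(?:{H16}:){{6}}{LS32}",
    f"::(?:{H16}:){{5}}{LS32}",
    f"(?:{H16})?::(?:{H16}:){{4}}{LS32}",
    f"{_pre(1)}::(?:{H16}:){{3}}{LS32}",
    f"{_pre(2)}::(?:{H16}:){{2}}{LS32}",
    f"{_pre(3)}::{H16}:{LS32}",
    f"{_pre(4)}::{LS32}",
    f"{_pre(5)}::{H16}",
    f"{_pre(6)}::",
]) + ")"

HOST = f"(?:{HOSTNAME}|{IPV4}|\\[{IPV6}\\])"
PORT = "[0-9]{1,5}"
HOSTPORT = re.compile(f"{HOST}(?::{PORT})?")


def is_hostport(text):
    """Return True if the whole string matches the hostport rule."""
    return HOSTPORT.fullmatch(text) is not None


print(is_hostport("example.org:8080"))  # True
print(is_hostport("[::1]:8080"))        # True
print(is_hostport("256.1.1.1"))         # False: octet above 255
print(is_hostport("example.org:"))      # False: ":" without a port
```

Because the IPv6 form is only accepted inside "[" and "]", the ":" separators of the address cannot be confused with the ":" that introduces the port.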


This is much better than the deprecated and bogus grammar in RFC 2396 ;)





>> ------------------
>>
>> host_with_port = (hostname (":" digit*)?) >mark %host;
>>
>> - It allows something ugly as "mydomain.org:"
>>
>> I suggest:
>>   host_with_port = (hostname (":" digit{1,5})?) >mark %host;
>
> It's ugly, but section 3.2.2 of RFC 2396 appears to allow it.

Sometimes there are bugs in the parsing rules and BNF grammars of
RFCs. I know of several cases. Unfortunately, RFCs cannot be fixed;
instead the errors are reported and a new draft or RFC "xxx-fix"
appears some years later.
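
The practical difference between the two host_with_port rules quoted above can be seen in a small Python sketch. The HOSTNAME pattern here is a deliberately simplified placeholder (an assumption, not the parser's real hostname rule), and ORIGINAL and SUGGESTED are made-up names:

```python
import re

# Simplified placeholder for "hostname" (assumption for illustration only).
HOSTNAME = r"[A-Za-z0-9.-]+"

# Original rule: hostname (":" digit*)?
ORIGINAL = re.compile(rf"{HOSTNAME}(?::[0-9]*)?")

# Suggested rule: hostname (":" digit{1,5})?
SUGGESTED = re.compile(rf"{HOSTNAME}(?::[0-9]{{1,5}})?")

for candidate in ("mydomain.org", "mydomain.org:80", "mydomain.org:"):
    print(candidate,
          ORIGINAL.fullmatch(candidate) is not None,
          SUGGESTED.fullmatch(candidate) is not None)
```

Only the last candidate is treated differently: the original rule accepts the dangling ":", the suggested one rejects it.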



>> message_header = ((field_name ":" " "* field_value)|value_cont) :> CRLF;
>>
>> - It doesn't allow valid spaces before ":" as:
>>      Host : mydomain.org
>
> Spaces before the ":" aren't allowed in rfc2616, and I have yet to see
> evidence of clients sending headers like this in ~4 years of using this
> parser.

In the SIP protocol, spaces and tabs before ":" are allowed. I
expected the same to hold in HTTP, since the SIP grammar is based on
the HTTP grammar, but the two can of course differ in some aspects.


>> - Tabulators are also allowed.
>>
>> I suggest:
>>   message_header = ((field_name [ \t]* ":" [ \t]* field_value)|value_cont) :> CRLF;
>
> I just pushed this out to unicorn.git to allow horizontal tabs:

Thanks.
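
For illustration, the two message_header variants discussed above behave roughly like the following Python sketch. It is not the unicorn parser: TOKEN is my approximation of field_name as an RFC 2616 token, value_cont (continuation lines) is ignored, and parse_header is a hypothetical helper:

```python
import re

# Approximation of field_name as an RFC 2616 "token" (assumption).
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"

# Original rule: field_name ":" " "* field_value
STRICT = re.compile(rf"({TOKEN}):[ ]*(.*)")

# Suggested rule: field_name [ \t]* ":" [ \t]* field_value
LENIENT = re.compile(rf"({TOKEN})[ \t]*:[ \t]*(.*)")


def parse_header(line, pattern):
    """Return (name, value) for one header line without CRLF, or None."""
    match = pattern.fullmatch(line)
    return (match.group(1), match.group(2)) if match else None


print(parse_header("Host: mydomain.org", STRICT))    # ('Host', 'mydomain.org')
print(parse_header("Host : mydomain.org", STRICT))   # None
print(parse_header("Host : mydomain.org", LENIENT))  # ('Host', 'mydomain.org')
```

Whether whitespace before ":" should be accepted at all is the open point of this thread; the sketch only shows what each rule would do with such a line.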


[*] http://dev.sipdoc.net/projects/ragel-sip-parser/wiki/Phase1


-- 
Iñaki Baz Castillo
<ibc at aliax.net>

