Some issues in the HTTP parser (and suggestions)

Iñaki Baz Castillo ibc at aliax.net
Sat May 8 12:50:54 EDT 2010


2010/5/8 Eric Wong <normalperson at yhbt.net>:
> Iñaki Baz Castillo <ibc at aliax.net> wrote:
>> 2010/5/7 Eric Wong <normalperson at yhbt.net>:
>> > Underscore isn't valid for hostnames, but it is allowed in domain names
>> > and most DNS servers will resolve them.  I've personally seen websites
>> > with underscores in their domain names in the wild[1].
>>
>> Hi Eric, could you point me to the spec stating that underscore is
>> allowed for a domain? In the past I've done a SIP parser [*] with
>> Ragel, being 100% strict at BNF grammar, and note that SIP reuses 80%
>> of the grammar of HTTP. I'm pretty sure that "_" is not valid in a
>> domain (host, hostname or whatever). Anyhow it's better just to allow
>> it at parsing level :)
>
> http://www.ietf.org/rfc/rfc2782.txt

Hi Eric. DNS SRV names are not domain names, but the format of a DNS
query. For example, in SIP (and also in XMPP) if a phone is configured
to use "myproxy.org" as its proxy then it must perform a DNS SRV query
for these entries:

  _sip._udp.myproxy.org
  _sip._tcp.myproxy.org

The SRV query then returns one or more target hostnames, and the client
resolves their DNS A records to obtain the IP addresses. After that, when
the client generates a SIP request it uses "myproxy.org" rather than
"_sip._udp.myproxy.org". That is, "_sip._udp.myproxy.org" is not a
domain/hostname, but a format for querying DNS SRV records.
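
As a rough illustration of the point above, here is a minimal Python
sketch (the function name and its defaults are hypothetical, not taken
from any SIP stack). It only builds the names used for the SRV lookup;
the domain that ends up in the request stays untouched:

```python
def srv_query_names(domain, service="sip", transports=("udp", "tcp")):
    """Build the DNS SRV query names for a configured proxy domain.

    The underscore-prefixed labels exist only in the query name; the
    configured domain itself is what goes into the generated request.
    """
    return [f"_{service}._{transport}.{domain}" for transport in transports]


configured_proxy = "myproxy.org"
query_names = srv_query_names(configured_proxy)
# The request is still addressed to the bare domain, never to a query name.
request_domain = configured_proxy
```

No DNS lookup is done here; the sketch only shows that the underscores
live in the query name and never in the host used by the request.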


> Even if it's not part of the RFC, our parser will match reality and
> accommodate broken things we see in the wild, as it has done in the
> past:
>
>  http://mid.gmane.org/20080327215027.GA14531@untitled

Yes, I agree.



>> alphanum  =  ALPHA / DIGIT
>> domainlabel = alphanum | ( alphanum ( alphanum | "-" )* alphanum );
>> toplabel = ALPHA | ( ALPHA ( alphanum | "-" )* alphanum );
>> hostname = ( domainlabel "." )* toplabel "."?;
>> dec_octet = DIGIT | ( 0x31..0x39 DIGIT ) | ( "1" DIGIT{2} ) | ( "2"
>> 0x30..0x34 DIGIT ) | ( "25" 0x30..0x35 );
>> IPv4address = dec_octet "." dec_octet "." dec_octet "." dec_octet;
>> h16 = HEXDIG{1,4};
>> ls32 = ( h16 ":" h16 ) | IPv4address;
>> IPv6address = ( ( h16 ":" ){6} ls32 ) | ( "::" ( h16 ":" ){5} ls32 ) |
>> ( h16? "::" ( h16 ":" ){4} ls32 ) | ( ( ( h16 ":" )? h16 )? "::" ( h16
>> ":" ){3} ls32 ) | ( ( ( h16 ":" ){,2} h16 )? "::" ( h16 ":" ){2} ls32
>> ) | ( ( ( h16 ":" ){,3} h16 )? "::" h16 ":" ls32 ) | ( ( ( h16 ":"
>> ){,4} h16 )? "::" ls32 ) | ( ( ( h16 ":" ){,5} h16 )? "::" h16 ) | ( (
>> ( h16 ":" ){,6} h16 )? "::" );
>> IPv6reference = "[" IPv6address "]";
>> host = hostname | IPv4address | IPv6reference;
>> port = DIGIT{1,5};
>> hostport = host ( ":" port )?;
>>
>>
>> This is much better than the deprecated and bogus grammar in RFC 2396 ;)
>
> Thanks, it might be worth simplifying a bit for readability, simplicity
> (and possibly performance) at the expense of 100% conformance.

You can try to simplify it, but note that some earlier IPv6 BNF
grammars are bogus because they fail to cover all the valid cases. For
example, the original IPv6 BNF grammar in RFC 3261 (the SIP
protocol) is buggy (even if it looks simpler), and the specification
has been fixed by the draft I linked in my previous mail.
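
One way to check a simplified grammar before trusting it is to run it
against addresses that are known to be valid. The Python sketch below is
only an illustration under my own assumptions: NAIVE_IPV6 is a
deliberately over-simplified, hypothetical pattern (it is not the
RFC 3261 grammar), and Python's standard ipaddress module is used as the
reference validator:

```python
import ipaddress
import re

# Hypothetical over-simplified pattern: eight full groups, no "::" compression.
NAIVE_IPV6 = re.compile(r"(?:[0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}")

# A few valid textual forms that any IPv6 grammar has to accept.
VALID_SAMPLES = [
    "1:2:3:4:5:6:7:8",
    "::",
    "::1",
    "1::",
    "2001:db8::1",
    "::ffff:192.0.2.1",
]


def is_valid_ipv6(text):
    """Return True if the standard library accepts text as an IPv6 address."""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ValueError:
        return False


def missed_by_naive(samples):
    """List the valid addresses that the simplified pattern fails to match."""
    return [s for s in samples
            if is_valid_ipv6(s) and not NAIVE_IPV6.fullmatch(s)]
```

Calling missed_by_naive(VALID_SAMPLES) lists every compressed or
IPv4-embedded form that the simpler pattern rejects, which is the kind
of gap to look for when trimming the grammar.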


Regards.


-- 
Iñaki Baz Castillo
<ibc at aliax.net>

