Rob 'Commander' Pike
2012-07-16 23:08:44 UTC
While looking into issue 3785 (http://code.google.com/p/go/issues/detail?id=3785), which is easy to fix (CL coming), I decided to look into rejecting the surrogate range of Unicode in UTF-8 encodings. Rejecting surrogates is correct, since those code points (U+D800 through U+DFFF) are not part of any valid UTF-8-encoded string.
The compiler, gc anyway, will not let you define a surrogate constant: http://play.golang.org/p/G5Dmhn0JXy. But the implementation is inconsistent because a surrogate half does not generate an error in a range statement or in a conversion: http://play.golang.org/p/HPwv7_YaYp.
I think we should be consistent, which requires changes:
1) Surrogate halves are illegal inside string constants when represented with \u or \U notation (already true).
2) unicode/utf8 should reject surrogate halves in either direction, treating them as encoding errors (not done).
3) range over a string with a surrogate half should treat it as an encoding error (not done).
4) conversion of a surrogate half as an integer to a string should yield a replacement rune for each byte (not done).
This is a significant API and spec change but it's already 25% there and should affect no Unicode-correct programs.
-rob