Surrogate halves and UTF-8

Discussion:

Rob 'Commander' Pike

2012-07-16 23:08:44 UTC

Looking into issue 3785 (http://code.google.com/p/go/issues/detail?id=3785), which is easy to fix (CL coming), I decided to look into rejecting the surrogate range of Unicode in a UTF-8 encoding. It's correct to reject them, since those code points are not part of any valid UTF-8-encoded string.

The compiler, gc anyway, will not let you define a surrogate constant: http://play.golang.org/p/G5Dmhn0JXy. But the implementation is inconsistent because a surrogate half does not generate an error in a range statement or in a conversion: http://play.golang.org/p/HPwv7_YaYp.

I think we should be consistent, which requires changes:

1) Surrogate halves are illegal inside string constants when represented with \u or \U notation (already true).
2) unicode/utf8 should reject surrogate halves in either direction, treating them as encoding errors (not done).
3) range over a string with a surrogate half should treat it as an encoding error (not done).
4) conversion of a surrogate half as an integer to a string should yield a replacement rune for each byte (not done).

This is a significant API and spec change but it's already 25% there and should affect no Unicode-correct programs.

-rob

roger peppe

2012-07-17 14:03:58 UTC

Post by Rob 'Commander' Pike
Looking into issue 3785 (http://code.google.com/p/go/issues/detail?id=3785), which is easy to fix (CL coming), I decided to look into rejecting the surrogate range of Unicode in a UTF-8 encoding. It's correct to reject them, since those code points are not part of any valid UTF-8-encoded string.
The compiler, gc anyway, will not let you define a surrogate constant: http://play.golang.org/p/G5Dmhn0JXy. But the implementation is inconsistent because a surrogate half does not generate an error in a range statement or in a conversion: http://play.golang.org/p/HPwv7_YaYp.
1) Surrogate halves are illegal inside string constants when represented with \u or \U notation (already true).
2) unicode/utf8 should reject surrogate halves in either direction, treating them as encoding errors (not done).
3) range over a string with a surrogate half should treat them as encoding errors (not done)
4) conversion of a surrogate half as an integer to a string should yield a replacement rune for each byte (not done).

All this sounds good to me, except I'm not quite sure what you mean by point 4.
It sounds like it might break the invariant that
strings.RuneCount(string(i)) == 1
which I wouldn't be keen on.

Rob 'Commander' Pike

2012-07-17 14:43:01 UTC

Post by roger peppe

All this sounds good to me, except I'm not quite sure what you mean by point 4.
It sounds like it might break the invariant that
strings.RuneCount(string(i)) == 1
which I wouldn't be keen on.

Fair point, but we'd need to decide what string(surrogateHalfInt) and []rune(surrogateHalfUtf8) should yield.

-rob

roger peppe

2012-07-18 08:39:46 UTC

Post by Rob 'Commander' Pike

Post by roger peppe

All this sounds good to me, except I'm not quite sure what you mean by point 4.
It sounds like it might break the invariant that
strings.RuneCount(string(i)) == 1
which I wouldn't be keen on.

Fair point, but we'd need to decide what string(surrogateHalfInt) and []rune(surrogateHalfUtf8) should yield.

I wouldn't be unhappy if string(surrogateHalfInt) == "\ufffd";
[]rune(surrogateHalfUtf8) == [0xfffd, 0xfffd, 0xfffd]

That would be in line with current behaviour for an out of range
rune, I think.
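
The two conversions suggested here can be tried directly. The following is a minimal sketch, assuming a Go release newer than this thread; intToString and bytesToRunes are hypothetical helper names, and U+DC00 with its naive three-byte encoding is just one sample surrogate half.

```go
package main

import "fmt"

// intToString mirrors string(surrogateHalfInt) from the message above.
func intToString(r rune) string { return string(r) }

// bytesToRunes mirrors []rune(surrogateHalfUtf8) from the message above.
func bytesToRunes(s string) []rune { return []rune(s) }

func main() {
	surrogateHalfInt := rune(0xDC00)    // sample low surrogate
	surrogateHalfUtf8 := "\xed\xb0\x80" // naive UTF-8 encoding of U+DC00

	fmt.Printf("%q\n", intToString(surrogateHalfInt))
	fmt.Printf("%#x\n", bytesToRunes(surrogateHalfUtf8))
}
```

With a current toolchain the first conversion should give a single replacement rune and the second should give three of them, one per byte, which matches the design proposed above.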

Rob Pike

2012-07-18 13:46:09 UTC

Post by roger peppe
I wouldn't be unhappy if string(surrogateHalfInt) == "\ufffd";
[]rune(surrogateHalfUtf8) == [0xfffd, 0xfffd, 0xfffd]
That would be in line with current behaviour for an out of range
rune, I think.

That's probably the right design. Still thinking.

-rob

Maxim Khitrov

2012-07-17 15:16:17 UTC

Post by roger peppe

All this sounds good to me, except I'm not quite sure what you mean by point 4.
It sounds like it might break the invariant that
strings.RuneCount(string(i)) == 1
which I wouldn't be keen on.

Did you mean utf8.RuneCount? It would still be 1, just as if you
passed some other invalid code point to string(). For example:

http://play.golang.org/p/zj0iAhKC9m

As I understand, point 4 would make line 3 == line 2, which is
probably the right thing to do.

- Max
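
The rune-count invariant discussed above can be checked with utf8.RuneCountInString. This is a sketch under the assumption of a Go release newer than this thread; runeCountOf is a hypothetical helper and the sample values are arbitrary.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCountOf converts a single integer code point to a string and
// counts the runes in the result.
func runeCountOf(r rune) int {
	return utf8.RuneCountInString(string(r))
}

func main() {
	// A valid rune, a surrogate half, a value past MaxRune, and a negative value.
	for _, r := range []rune{'a', 0xD800, 0x110000, -1} {
		fmt.Printf("%#x -> %d rune(s)\n", r, runeCountOf(r))
	}
}
```

Each conversion should produce exactly one rune, because every invalid code point is replaced by a single U+FFFD rather than by one replacement per byte.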

Marc-Antoine Ruel

2012-07-18 14:56:33 UTC

Post by Rob 'Commander' Pike
Looking into issue 3785 (http://code.google.com/p/go/issues/detail?id=3785),
which is easy to fix (CL coming), I decided to look into rejecting the
surrogate range of Unicode in a UTF-8 encoding. It's correct to reject
them, since those code points are not part of any valid UTF-8-encoded
string.
The compiler, gc anyway, will not let you define a surrogate constant:
http://play.golang.org/p/G5Dmhn0JXy. But the implementation is
inconsistent because a surrogate half does not generate an error in a range
statement or in a conversion: http://play.golang.org/p/HPwv7_YaYp.
1) Surrogate halves are illegal inside string constants when represented
with \u or \U notation (already true).
2) unicode/utf8 should reject surrogate halves in either direction,
treating them as encoding errors (not done).

I wonder about interaction with javascript libraries (ab)using utf-8 to
encode binary data, especially for games.

I have the impression this would make it impossible to code a Go web
backend serving binary encoded as json for a web client front end like
http://code.google.com/p/webgl-loader/.

In particular, the "caveat" in
http://code.google.com/p/webgl-loader/wiki/UtfEight. :)

I don't mind breaking this use case for coherent surrogate handling
throughout the compiler and libraries, it's just good to know that this
(somewhat edgy) use case is potentially being broken.

M-A

Maxim Khitrov

2012-07-18 15:09:51 UTC

Post by Marc-Antoine Ruel

Post by Rob 'Commander' Pike
Looking into issue 3785
(http://code.google.com/p/go/issues/detail?id=3785), which is easy to fix
(CL coming), I decided to look into rejecting the surrogate range of Unicode
in a UTF-8 encoding. It's correct to reject them, since those code points
are not part of any valid UTF-8-encoded string.
The compiler, gc anyway, will not let you define a surrogate constant:
http://play.golang.org/p/G5Dmhn0JXy. But the implementation is inconsistent
because a surrogate half does not generate an error in a range statement or
in a conversion: http://play.golang.org/p/HPwv7_YaYp.
1) Surrogate halves are illegal inside string constants when represented
with \u or \U notation (already true).
2) unicode/utf8 should reject surrogate halves in either direction,
treating them as encoding errors (not done).

I wonder about interaction with javascript libraries (ab)using utf-8 to
encode binary data, especially for games.
I have the impression this would make it impossible to code a Go web backend
serving binary encoded as json for a web client front end like
http://code.google.com/p/webgl-loader/.
In particular, the "caveat" in
http://code.google.com/p/webgl-loader/wiki/UtfEight. :)
I don't mind breaking this use case for coherent surrogate handling
throughout the compiler and libraries, it's just good to know that this
(somewhat edgy) use case is potentially being broken.

You should still be able to store binary data in strings. In
particular, []byte(string([]byte)) will never change any data in the
original byte slice. As my previous example shows, there are already
cases where conversion of an int to string results in the replacement
rune. The proposed fix would just add the surrogate pair range to the
list of code points that cannot be encoded in UTF-8.

- Max