Java Programming [Archive] - UTF-16: better than UTF-8? How to use?
This topic has 81 replies on 6 pages.    « Previous | 1 | 2 | 3 | 4 | 5 | 6 | Next »

Posts:3,258
Registered: 00-08-28
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 2:00 PM (reply 60 of 81)



 
And by the way, the ISO-8859-X are both encodings and
character sets. The standard defines an encoding (a
mapping from sequences of bytes to sequences of
characters) and the set of characters that encoding
works on (a character set).

I said the same thing in reply 19.
 

Posts:11,200
Registered: 7/22/99
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 2:06 PM (reply 61 of 81)



 
DrClap is one of the most revered members of this
forum.

uh-huh and so am I. Check out what I've said on this subject.
 

Posts:3,258
Registered: 00-08-28
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 2:06 PM (reply 62 of 81)



 
DrClap is one of the most revered members of this
forum. Here are some of his posts dealing with String
and byte array encodings:

http://forum.java.sun.com/thread.jsp?forum=16&thread=17945

http://forum.java.sun.com/thread.jsp?forum=4&thread=29866
http://forum.java.sun.com/thread.jsp?forum=31&thread=43320
http://forum.java.sun.com/thread.jsp?forum=4&thread=37535

Also refer to DrClap's reply in post

http://forum.java.sun.com/thread.jsp?forum=31&thread=532791

reply#6.

As far as I am concerned DrClap is the final authority in this forum.
 

Posts:11,200
Registered: 7/22/99
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 2:29 PM (reply 63 of 81)



 
Your reply #19:

A mapping of binary values to code positions and
back; generally a 1:1 (bijective) mapping.

You do realise this is not a very good definition?

Do "code positions" and "binary values" mean that multiple code positions map to multiple binary values, or that one "code position" maps to possibly several "binary values", or what? And is this mapping 1:1 or bijective? One-to-one mappings are called injections; a one-to-one mapping is called a bijection only if it is also "onto" (a surjection). Not to mention that the term "code position" is left undefined.

In the case of ASCII, this is generally a f(x)=x
mapping: code point 65 maps to the byte value 65, and
vice versa. This is possible because ASCII uses only
code positions representable as single bytes, i.e.,
values between 0 and 255, at most. (US-ASCII only uses
values 0 to 127, in fact.)

But this is only a mapping from a subset of integers to a subset of integers. What do characters like A or B have to do with this?

[...]

So, as in my previous post, different values mean
different characters in different encodings. This is
because there is a mapping between these values and
characters, and that's what the encoding is.

Since Strings in Java are arrays of Unicode characters
and do not need a mapping from binary values to
code positions, that's why I said they do not have
an encoding.

What are these "code points" you talk about?

In an ideal world things would be like that. In an ideal world computer memory could hold "characters" themselves but alas, all a computer "understands" is numbers.

So to be able to deal with characters in computers you have to have some kind of a conversion between them and numbers; something that says that A for instance is 65. That conversion defines a character encoding.
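
To make the "A is 65" point concrete, here is a minimal Java sketch. The class and method names (CharToNumberDemo, codeUnit, asciiBytes, utf16beBytes) are made up for illustration and do not come from the thread. It shows the number Java stores for the char 'A', and how that one character becomes different byte sequences depending on which charset is asked to encode it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharToNumberDemo {
    // The integer Java stores for a char (a 16-bit UTF-16 code unit).
    static int codeUnit(char c) {
        return c;
    }

    // Encode the string as US-ASCII bytes.
    static byte[] asciiBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Encode the string as big-endian UTF-16 bytes (no byte order mark).
    static byte[] utf16beBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(codeUnit('A'));                       // 65
        System.out.println(Arrays.toString(asciiBytes("A")));    // [65]
        System.out.println(Arrays.toString(utf16beBytes("A")));  // [0, 65]
    }
}
```

The same abstract character ends up as one byte in US-ASCII and two bytes in UTF-16BE, which is the sense in which the conversion, not the character, defines the encoding.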
 

Posts:3,258
Registered: 00-08-28
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 2:31 PM (reply 64 of 81)



 
Your reply #19:

A mapping of binary values to code positions and
back; generally a 1:1 (bijective) mapping.

You do realise this is not a very good definition?

Do "code positions" and "binary values" mean that
multiple code positions map to multiple binary values,
or that one "code position" maps to possibly several
"binary values", or what? And is this mapping 1:1 or
bijective? One-to-one mappings are called injections;
a one-to-one mapping is called a bijection only if it
is also "onto" (a surjection). Not to mention that the
term "code position" is left undefined.

In the case of ASCII, this is generally a f(x)=x
mapping: code point 65 maps to the byte value 65, and
vice versa. This is possible because ASCII uses only
code positions representable as single bytes, i.e.,
values between 0 and 255, at most. (US-ASCII only uses
values 0 to 127, in fact.)

But this is only a mapping from a subset of integers
to a subset of integers. What do characters like A or
B have to do with this?

[...]

So, as in my previous post, different values mean
different characters in different encodings. This is
because there is a mapping between these values and
characters, and that's what the encoding is.

Since Strings in Java are arrays of Unicode characters
and do not need a mapping from binary values to
code positions, that's why I said they do not have
an encoding.

What are these "code points" you talk about?

In an ideal world things would be like that. In an
ideal world computer memory could hold "characters"
themselves but alas, all a computer "understands" is
numbers.

So to be able to deal with characters in computers you
have to have some kind of a conversion between them
and numbers; something that says that A for instance
is 65. That conversion defines a character encoding.

In reply 19 I also gave a URL. If their definition is not good enough, I couldn't have provided a better one.
 

Posts:11,200
Registered: 7/22/99
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 2:32 PM (reply 65 of 81)



 
Also refer to DrClap's reply in post

http://forum.java.sun.com/thread.jsp?forum=31&thread=532791

reply#6.

Paul writes: "With certain obscure exceptions you can say they ARE Unicode characters."

Do you know what these "obscure exceptions" are? They happen to arise from Java strings being encoded specifically in UTF-16. And they also happen to imply that Java chars are not really abstract characters that belong to the Unicode character set.
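
One such exception can be sketched in Java as follows. This assumes Java 5 or later, where the code point methods on String and Character exist; the class name SurrogateDemo and its helpers are hypothetical. The character U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16-bit range, so a Java String has to hold it as two chars, a surrogate pair, neither of which is a character on its own.

```java
class SurrogateDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF, a code point above U+FFFF.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Number of Java chars (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount(CLEF));       // 2
        System.out.println(codePointCount(CLEF));  // 1
        // Each half of the pair is a surrogate, not an abstract character.
        System.out.println(Character.isHighSurrogate(CLEF.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(CLEF.charAt(1)));  // true
        System.out.printf("%04X %04X%n",
                (int) CLEF.charAt(0), (int) CLEF.charAt(1));           // D834 DD1E
    }
}
```

One character, two chars: that mismatch only makes sense if the String is understood as a UTF-16 encoded sequence.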
 

Posts:11,200
Registered: 7/22/99
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 2:38 PM (reply 66 of 81)



 
In reply 19 I also gave a URL. If their definition is not good
enough, I couldn't have provided a better one.

When the question has arisen, I have usually defined a "character set" as a set of 'characters' (what you call a character is up to you) and a "character encoding" as a function that maps a sequence of integers to a sequence of elements of a character set. How you decide to word the definition does not really matter at all.

For instance A := {a, b, c} is a "character set." The function f: {1, 2, 3} -> A defined so that f(1) = b, f(2) = a, f(3) = c is a "character encoding."
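
The toy example above can be written out as a small Java sketch. It assumes Java 9 or later for Map.of; the names ToyEncodingDemo, F and decode are illustrative only. The map plays the role of f, and decode applies it element by element to turn a sequence of integers into a sequence of characters from A.

```java
import java.util.Map;

class ToyEncodingDemo {
    // f: {1, 2, 3} -> A with f(1) = b, f(2) = a, f(3) = c, as in the post.
    static final Map<Integer, Character> F = Map.of(1, 'b', 2, 'a', 3, 'c');

    // Map a sequence of integers to the corresponding sequence of characters.
    static String decode(int... codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Character ch = F.get(code);
            if (ch == null) {
                // The integer is outside the domain of f.
                throw new IllegalArgumentException("no character for " + code);
            }
            sb.append(ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(1, 2, 3)); // bac
        System.out.println(decode(3, 2, 1)); // cab
    }
}
```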
 

Posts:3,258
Registered: 00-08-28
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 2:44 PM (reply 67 of 81)



 
http://www.hyperdictionary.com/dictionary/character+encoding
http://character encodings.bluerider.com/wordsearch/character%20encodings
http://burks.brighton.ac.uk/burks/foldoc/95/18.htm
http://www.nightflight.com/foldoc-bin/foldoc.cgi?character+encoding
http://www.wkonline.com/d/character_encoding.html
http://character.encoding.word.sytes.org/
http://www.liokalos.com/default.asp?pid=12&ts=character%20encoding
http://www.indwes.edu/Faculty/bcupp/things/Characters/chars.html

define it the same way as well.
 

Posts:11,200
Registered: 7/22/99
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 3:11 PM (reply 68 of 81)



 
Repeating a bad definition does not make it a good one; none of them define what is meant by "code position" (excluding Jukka Korpela's excellent text). But does the actual definition really matter?
The definitions in the Unicode standard seem pretty good to me if you are interested:
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf

However, it doesn't change the fact that Java's Strings are encoded, unless you have built a new kind of computer memory that is able to store abstract characters.
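
As a rough illustration of the thread's original question, the following Java sketch encodes one String into bytes with two different charsets. The class and method names are invented for this example. The String itself holds four 16-bit chars; asking for UTF-8 or UTF-16BE bytes simply re-encodes the same characters with a different mapping, and decoding with the matching charset gets the original text back.

```java
import java.nio.charset.StandardCharsets;

class Utf8VsUtf16Demo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16BE (no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Encode to UTF-8 and decode again; the text should survive unchanged.
    static String roundTripUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "abc\u20AC"; // "abc" followed by U+20AC EURO SIGN
        System.out.println(s.length());                  // 4
        System.out.println(utf8Length(s));               // 6 (1 + 1 + 1 + 3)
        System.out.println(utf16Length(s));              // 8 (2 per char)
        System.out.println(roundTripUtf8(s).equals(s));  // true
    }
}
```

Neither encoding is "better" in general: UTF-8 is more compact for mostly-ASCII text, while UTF-16 matches what a Java String already holds in memory.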
 

Posts:6,750
Registered: 1/25/04
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 3:19 PM (reply 69 of 81)



 
Who thinks this thread is the forum's dumbest argument of the year?
 

Posts:11,200
Registered: 7/22/99
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 3:23 PM (reply 70 of 81)



 
I think the one last week about initializing subclass fields before calling the superclass constructor was worse :)
 

Posts:6,750
Registered: 1/25/04
Re: UTF-16: better than UTF-8? How to use?  
Jun 23, 2004 3:48 PM (reply 71 of 81)



 
Hm, I found that one to be somewhat more educational. :-)
 

Posts:2,909
Registered: 13.8.2003
Re: UTF-16: better than UTF-8? How to use?  
Jun 24, 2004 12:27 AM (reply 72 of 81)



 
Blast it, late again. This was one of my favorite threads.

Incidentally my favorite threads tend to be the ones where I'm right and the other person thinks he's right.
 

Posts:1,125
Registered: 5/4/01
Re: UTF-16: better than UTF-8? How to use?  
Jun 24, 2004 1:16 AM (reply 73 of 81)



 
I'd like to say I feel responsible for keeping this thread going so long. If I had just stayed quiet and let kilyas think he's right with his ****-a-mani stories of 'code points' and 'byte positions' and his constant contradictions.....

<sigh>

Well. The one thing he actually points to correctly is DrClap <bow to the master>.

DrClap says that technically speaking a String is kind of encoded, and that the resulting byte array is not encoded as such; the encoding is in the mind of the programmer. I've known all along it's the semantics that are causing the problem, but kilyas acts like such a troll I just couldn't help myself.

Thank you DrClap for hopefully capping kilyas's argument and thank you kilyas for such entertainment.

I thank you.

Ted.
 

Posts:11,200
Registered: 7/22/99
Re: UTF-16: better than UTF-8? How to use?  
Jun 24, 2004 1:27 AM (reply 74 of 81)



 
Blast it, late again. This was one of my favorite
threads.

Incidentally my favorite threads tend to be the ones
where I'm right and the other person thinks he's
right.

No, the best threads are the ones where one person is talking about X and the other one about Y, and both think they are talking about the same thing. "The elephant has two ears." --"No! It has only ONE tail!!!"

Threads like that make the juiciest and most fruitless arguments :)
 