Java Programming [Archive] - UTF-16: better than UTF-8? How to use?
This topic has 81 replies on 6 pages.    1 | 2 | 3 | 4 | 5 | 6 | Next »

Posts:3,369
Registered: 24.10.97
UTF-16: better than UTF-8? How to use?  
Jun 17, 2004 5:15 AM



 
I read about UTF-8/UTF-16 and found out that UTF-16 characters all have the same encoded length, while UTF-8 uses a more compact representation. But why does Eclipse show just one line of strange characters (mostly little squares) in my .java file when I switch Eclipse's encoding from UTF-8 to UTF-16? I thought Java uses UTF-16 internally all the time?
 

Posts:2,909
Registered: 13.8.2003
Re: UTF-16: better than UTF-8? How to use?  
Jun 17, 2004 5:42 AM (reply 1 of 81)



 
I don't know how you define which encoding is better...
UTF-8 takes less space than UTF-16, mainly when using "normal" alphabets.

Java does use UTF-16 (can't remember if it was BE or LE, I think LE) internally, but that has nothing to do with Eclipse. You tell Eclipse which encoding you wish to use for your files, and if you load a UTF-8 encoded file with Eclipse expecting UTF-16, you can be sure that you won't get any decent output.

So, if you don't know what you're doing, don't play with Eclipse's encodings.
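
The mismatch described above can be reproduced outside Eclipse. This is only a sketch: the class and method names are made up for illustration, and it assumes the file bytes are plain UTF-8 without a byte-order mark. Each pair of UTF-8 bytes is read as one 16-bit unit, so line-break bytes get absorbed into unrelated characters, which would explain the single line of odd glyphs.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatch {
    // Encodes text as UTF-8, then (wrongly) decodes those bytes as UTF-16.
    static String readUtf8AsUtf16(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String source = "class A {}";
        String garbled = readUtf8AsUtf16(source);
        System.out.println(garbled.equals(source)); // false
        System.out.println(garbled.length());       // 5: every two bytes became one char
    }
}
```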
 

Posts:6,487
Registered: 5/5/04
Re: UTF-16: better than UTF-8? How to use?  
Jun 17, 2004 8:37 AM (reply 2 of 81)



 
Note: UTF-16 is only required to
- specify the byte order
- support 32-bit characters. (UTF-8 also supports 32-bit characters)

UTF-16 typically uses twice the data size, whereas UTF-8 is compatible with ASCII text for standard characters.
Basically, I have never found a good use for UTF-16.
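
The size claim is easy to check for plain ASCII text. The sketch below uses illustrative names (EncodedSize and its helpers are not from this thread) and encodes with UTF-16BE so that no byte-order mark is counted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodedSize {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used so that no byte-order mark is included in the count.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // True when the UTF-8 bytes are identical to the US-ASCII bytes.
    static boolean sameAsAscii(String s) {
        return Arrays.equals(s.getBytes(StandardCharsets.US_ASCII),
                             s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String text = "hello";
        System.out.println(utf8Bytes(text));   // 5
        System.out.println(utf16Bytes(text));  // 10
        System.out.println(sameAsAscii(text)); // true
    }
}
```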
 

Posts:3,258
Registered: 00-08-28
Re: UTF-16: better than UTF-8? How to use?  
Jun 17, 2004 8:58 AM (reply 3 of 81)



 
Even the CJK character sets are supported by UTF-8, so that leaves UTF-16 high and dry. The only place where UTF-16 could be helpful is, for example, the SMPP protocol, where UTF-8 is not supported, so for international character set support you would need UTF-16.
 

Posts:6,147
Registered: 11/9/00
Re: UTF-16: better than UTF-8? How to use?  
Jun 17, 2004 9:25 AM (reply 4 of 81)



 
UTF-16 might prove more compact if almost all the characters being handled were non-Latin and towards the higher end of the Unicode range. Characters with more than 11 significant bits (from \u0800) will encode as three-byte sequences in UTF-8.

As to whether Java uses UTF-16 internally for strings - that's up to the implementors. Certainly string constants in class files are stored as UTF-8.
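
A quick way to see where UTF-8 moves from one to two to three bytes is to encode a few boundary code points and compare against UTF-16. The class name and the chosen code points are only illustrative; U+4E2D is a common CJK ideograph.

```java
import java.nio.charset.StandardCharsets;

class Utf8Thresholds {
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE avoids counting a byte-order mark.
    static int utf16Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x7F, 0x80, 0x7FF, 0x800, 0x4E2D};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes%n",
                    cp, utf8Bytes(cp), utf16Bytes(cp));
        }
    }
}
```

For text made up mostly of characters at or above U+0800, that is three bytes per character in UTF-8 against two in UTF-16.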
 

Posts:6,487
Registered: 5/5/04
Re: UTF-16: better than UTF-8? How to use?  
Jun 17, 2004 9:59 AM (reply 5 of 81)



 
My understanding is that Java Strings are just 16 bits per character. As such, 32-bit characters are not supported. It is sometimes called UTF-16, but only when it doesn't make any difference (which is most of the time).
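
Where it does make a difference is with characters outside the 16-bit range. The sketch below assumes Java 5 or later, where such a character is held as a surrogate pair of two char values; the class name is illustrative and U+1F600 is just an example code point.

```java
import java.nio.charset.StandardCharsets;

class SupplementaryChars {
    // Builds a String holding exactly one Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F600);
        System.out.println(s.length());                             // 2 char units
        System.out.println(s.codePointCount(0, s.length()));        // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```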
 

Posts:3,258
Registered: 00-08-28
Re: UTF-16: better than UTF-8? How to use?  
Jun 17, 2004 10:27 AM (reply 6 of 81)



 

As to whether Java uses UTF-16 internally for strings
- that's up to the implementors. Certainly string
constants in class files are stored as UTF-8.

Well, as per my info, all the Strings in Java are simply Unicode. Strings do not have an encoding. Byte arrays and byte streams do.

Secondly, just to add: only the CJK languages have multi-byte characters. None other.

 

Posts:2,909
Registered: 13.8.2003
Re: UTF-16: better than UTF-8? How to use?  
Jun 17, 2004 11:09 PM (reply 7 of 81)



 
As to whether Java uses UTF-16 internally for
strings
- that's up to the implementors. Certainly string
constants in class files are stored as UTF-8.

Hmm...you're right the JLS doesn't mention anything about it. Well, that was second hand information anyway.

Well as per my info all the Strings in java are simply
unicode. Strings do not have an encoding. Byte arrays
and byte streams do.

Strings must have an encoding, whereas byte arrays don't need an encoding. They are just bytes, Strings are characters (i.e. the in memory and on disk representation of a String is a sequence of bytes in a certain encoding).
 

Posts:14,142
Registered: 99-04-02
Re: UTF-16: better than UTF-8? How to use?  
Jun 18, 2004 9:03 AM (reply 8 of 81)



 
Java uses Unicode characters (16-bits, unsigned) in memory to hold characters. How they are written out to a stream of bytes (a file, or whatever) is where the encoding comes into play. Generally, there is no 100% sure way to know a file's encoding just from its bytes. You can make some inferences; there are programs/code out there to do this. But generally you have to know which encoding was used to write the file when reading it. If Eclipse is writing the file, it should read it okay, unless you are not using the same settings on both sides of the process.
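
One of the inferences mentioned above is to look for a byte-order mark at the start of the data. This is a rough sketch only: BomSniffer and guessFromBom are made-up names, many files carry no mark at all, and the check ignores UTF-32 (whose little-endian mark begins with the same two bytes as UTF-16LE).

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Returns a charset name if a known byte-order mark is present, else null.
    static String guessFromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no mark: the encoding has to be known some other way
    }

    public static void main(String[] args) {
        // Java's UTF_16 encoder writes a big-endian byte-order mark first.
        System.out.println(guessFromBom("x".getBytes(StandardCharsets.UTF_16))); // UTF-16BE
        // Java's UTF_8 encoder writes no mark, so nothing can be inferred.
        System.out.println(guessFromBom("x".getBytes(StandardCharsets.UTF_8)));  // null
    }
}
```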
 

Posts:3,258
Registered: 00-08-28
Re: UTF-16: better than UTF-8? How to use?  
Jun 18, 2004 9:06 AM (reply 9 of 81)



 

Well as per my info all the Strings in java are
simply
unicode. Strings do not have an encoding. Byte
arrays
and byte streams do.

Strings must have an encoding, whereas byte arrays
don't need an encoding. They are just bytes, Strings
are characters (i.e. the in memory and on disk
representation of a String is a sequence of bytes in a
certain encoding).

well you need some revision of concepts then
 

Posts:27,518
Registered: 11/3/97
Re: UTF-16: better than UTF-8? How to use?  
Jun 18, 2004 9:35 AM (reply 10 of 81)



 

Well as per my info all the Strings in java are simply
unicode. Strings do not have an encoding. Byte arrays
and byte streams do.

Strings must have an encoding, whereas byte arrays
don't need an encoding. They are just bytes, Strings
are characters (i.e. the in memory and on disk
representation of a String is a sequence of bytes in
a
certain encoding).

well you need some revision of concepts then

Huh?

You already said that Strings use unicode. What do you think unicode is if not an encoding?

And what encoding does a jpg file use? And if I read that file into a byte array what encoding is the byte array using?
 

Posts:3,258
Registered: 00-08-28
Re: UTF-16: better than UTF-8? How to use?  
Jun 18, 2004 9:49 AM (reply 11 of 81)



 
String aString = "jschell";
byte[] whateverEncodingBytes = aString.getBytes("whateverEncoding");

aString is in UTF-16, no matter what. The encoding specified in the getBytes() method is the target encoding. So whateverEncodingBytes is an array of bytes stored in a whatever encoding, rather than UTF-16. If you just use the method getBytes() the array of bytes would be stored in the default encoding which is iso-8859-1 unless otherwise specified.

String aNewString = new String(whateverEncodingBytes, "whateverEncoding");

aNewString is UTF-16 again, no matter what. The encoding specified in the constructor is the source encoding. So whateverEncodingBytes are still in an encoding. But building a new String out of them requires an encoding conversion to occur, as the String must be in UTF-16. So we need to let the constructor know which encoding to convert from. It would be like an interpreter needing to know which language to interpret from.
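
The two conversions above can be put together as a round trip. This is a sketch with assumed names (RoundTrip, roundTrip); it uses the Charset overloads of getBytes() and the String constructor, available from Java 6, instead of a charset name string. It also prints the platform default charset, which is what the no-argument getBytes() uses and which varies by platform and Java version.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // String -> bytes in the target encoding -> String again.
    static String roundTrip(String s, Charset charset) {
        byte[] encoded = s.getBytes(charset);  // target encoding
        return new String(encoded, charset);   // source encoding
    }

    public static void main(String[] args) {
        String text = "jschell \u00E9";
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));      // true
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1).equals(text)); // true
        // A character the target encoding cannot represent is replaced by '?'.
        System.out.println(roundTrip("\u4E2D", StandardCharsets.ISO_8859_1));          // ?
        System.out.println(Charset.defaultCharset()); // platform dependent
    }
}
```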

A Java byte can be in any encoding it likes - it's simply eight raw bits of data stored in an arbitrary order. Anything extending a Java object is stored in UTF-8, except for String, which keeps its internal characters in UTF-16 encoding.

In short, if you pull data into Java from an external source, the encoding it sits in within Java sticks to the rules above. If it first hits Java as a byte stream (say from an InputStream), it's going to be in whatever encoding it was already in. If it first hits Java as a String (say from request.getParameter() in a servlet or JSP), it will be in UTF-16, as Java will convert its encoding immediately.

The same holds when writing back out. If you write out a series of bytes (such as with OutputStream), the resulting output is in the encoding the bytes were in to begin with. If you're outputting a String (although this is a lot rarer), it will be in UTF-16. Or to be more specific as somebody pointed out Java uses Unicode characters (16-bits, unsigned) in memory to hold characters.
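
When a String is written out through a Writer, the bytes that result depend on the encoding given to the Writer. The following is a minimal sketch; WriteWithEncoding and its write helper are illustrative names, and an in-memory stream stands in for a file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriteWithEncoding {
    // Writes text through a Writer that encodes with the given charset.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "h\u00E9"; // two characters in memory
        System.out.println(write(text, StandardCharsets.ISO_8859_1).length); // 2
        System.out.println(write(text, StandardCharsets.UTF_8).length);      // 3
        System.out.println(write(text, StandardCharsets.UTF_16BE).length);   // 4
    }
}
```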

 

Posts:97
Registered: 5/26/04
Re: UTF-16: better than UTF-8? How to use?  
Jun 21, 2004 7:47 AM (reply 12 of 81)



 
so what was decided?
 

Posts:3,258
Registered: 00-08-28
Re: UTF-16: better than UTF-8? How to use?  
Jun 21, 2004 8:33 AM (reply 13 of 81)



 
so what was decided?

I think I couldn't have been more clear than in reply 11. So what's the question now?
 

Posts:27,518
Registered: 11/3/97
Re: UTF-16: better than UTF-8? How to use?  
Jun 21, 2004 9:27 AM (reply 14 of 81)



 
aString is in UTF-16, no matter what.

Which means it has an encoding.

The encoding
specified in the getBytes() method is the target
encoding. So whateverEncodingBytes is an array of
bytes stored in a whatever encoding, rather than
UTF-16. If you just use the method getBytes() the
array of bytes would be stored in the default encoding
which is iso-8859-1 unless otherwise specified.

Yes I understand that.

But a byte array in and of itself does not have an encoding. First, because it might have nothing to do with text at all, and second, because the byte array object itself does not encapsulate that encoding.

If I put a char value of 'a' into an int variable, does that int variable have an encoding?
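
Both points can be tried directly. In this sketch (names are illustrative) the int simply holds the number 97, and the same two bytes turn into different text depending on which encoding the reader chooses to apply; nothing in the array itself records one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BytesHaveNoEncoding {
    // The caller, not the array, decides how the bytes are interpreted.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        int number = 'a';
        System.out.println(number); // 97

        byte[] raw = {(byte) 0xC3, (byte) 0xA9};
        String asUtf8 = decode(raw, StandardCharsets.UTF_8);        // one character, U+00E9
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1); // two characters
        System.out.println(asUtf8.length());   // 1
        System.out.println(asLatin1.length()); // 2
    }
}
```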
 