UTF-8 vs. UTF-16 in Propeller Tool



darco
05-16-2008, 08:32 AM
I noticed that when I insert certain special characters into a Spin source file and try to save it, the Propeller Tool warns me that it will be saving it as a UTF-16 file.

I was just curious as to why UTF-16 was used instead of UTF-8. Is there any possibility of using UTF-8 in the future? Could the prop tool be used with UTF-8 now?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔

darco
www.deepdarc.com/ (http://www.deepdarc.com/)
xmpp:darco@deepdarc.com (mailto:xmpp:darco@deepdarc.com)

rokicki
05-17-2008, 01:59 AM
I agree with this; dealing with Spin files in UTF-16 is a pain. UTF-8 is, for most purposes, the preferred
encoding with better interoperability with most tools.

Jeff Martin
05-17-2008, 12:04 PM
Hi,

While I understand your feelings on this, we had very good reasons for designing it the way we did, and it wasn't anything we decided in haste. If you're interested, here are the details:

The Propeller chip's built-in font uses every character cell for printable characters (although 8 are special-purpose characters meant to be printed in overlaid pairs). When we decided we needed to create a like-designed font on the computer, to make use of the special characters for diagramming right in source code, we realized we had a problem: the legacy issues with ANSI, where the $00-$1F and $80-$9F characters are typically "special purpose" and not printable.

The solution was to make the Parallax font a Unicode-based font that uses standard ANSI glyph positions wherever it can, to be compatible with typical ANSI-only character sets (US ASCII + Latin Extension), and uses higher code points (above 255) for the glyphs that are mapped into the otherwise non-printable character codes to complete the set. This allows Unicode-enabled software (like Notepad) to display all the glyphs we wanted.

The Propeller Tool generates ANSI-encoded or Unicode (UTF-16) files. It maintains ANSI encoding for as long as it can; once at least one character from one of the higher locations (above 255) is inserted, it automatically switches to UTF-16 for all internal operations on that file, and it will warn the user when trying to save over a file that is currently saved in ANSI encoding. This is simply a courtesy, in case that file was really meant as a data file that is supposed to maintain a 1-byte-per-character encoding.
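
The switching rule described above can be sketched roughly as follows. This is only an illustration, not the Propeller Tool's actual code: the function names are hypothetical, and Python's `latin-1` codec is used as a stand-in for "ANSI" on the assumption that every code point up to 255 maps to a single byte.

```python
def choose_encoding(text: str) -> str:
    """Return 'ansi' while every code point fits in one byte, else 'utf-16'.

    Hypothetical helper: mirrors the rule "stay ANSI until a character
    above 255 appears".
    """
    return "ansi" if all(ord(ch) <= 0xFF for ch in text) else "utf-16"


def encode_source(text: str) -> bytes:
    """Encode text the way the rule above would pick (illustrative only)."""
    if choose_encoding(text) == "ansi":
        # Assumption: latin-1 stands in for the one-byte ANSI encoding.
        return text.encode("latin-1")
    # Python's "utf-16" codec prepends a byte order mark automatically.
    return text.encode("utf-16")
```

Under these assumptions, a file containing only `PUB main` stays at one byte per character, while adding a single box-drawing character such as U+2500 flips the whole file to UTF-16.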

I'm assuming there's a little misunderstanding behind the opinions of whether UTF-8 or UTF-16 is the better encoding scheme... UTF-16 is a very simple encoding scheme, especially considering that for the Basic Multilingual Plane (into which nearly every commonly used character is mapped) UTF-16 maintains a deterministic 2-byte-per-character encoding. And EVERY character in the Parallax font set is within the Basic Multilingual Plane, so UTF-16 for us means a fixed 2-byte-per-character encoding scheme... simple. UTF-8, however, is a variable-length encoding scheme in practice: only the first 128 ANSI characters can be encoded in a single byte, so the Latin Extension (the upper 128 characters) requires 2 bytes per character, and other higher characters still within the BMP require 3 bytes. Just to be clear, ANSI encoding is a fixed single-byte-per-character encoding scheme, but is limited to 256 code points. UTF-8 is not a fixed single-byte-per-character encoding scheme for that set of 256 characters.
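
The byte counts in the paragraph above can be checked with a short Python sketch. The helper name `encoded_sizes` is made up for illustration, and `latin-1` again stands in for a one-byte ANSI encoding.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character takes in each encoding."""
    return {
        # latin-1 covers only code points 0..255; None means "not encodable".
        "latin-1": len(ch.encode("latin-1")) if ord(ch) <= 0xFF else None,
        "utf-8": len(ch.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf-16": len(ch.encode("utf-16-le")),
    }


for ch in ("A", "\u00e9", "\u2500"):  # ASCII, Latin Extension, higher BMP
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The three sample characters take 1, 2, and 3 bytes in UTF-8 but always 2 bytes in UTF-16, which is the fixed-width property being described for BMP characters.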

All this being said, another major reason we chose UTF-16 is that even simple programs like Windows Notepad naturally support it in Win2K and above, and that is a great indicator that many more editors will follow as time goes on and internationalization is seen more and more as a must rather than an option. So using UTF-16 for Propeller source files containing at least one character that is higher than 255 allows those files to be properly viewed in Notepad, so long as the Parallax font is used (and many characters map to "similar" characters in other sets, like the Courier font).

Take care,




▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Jeff Martin

Sr. Software Engineer
Parallax, Inc.

darco
05-18-2008, 02:05 AM
Jeff, thanks for explaining the details of your decision! It is quite refreshing to hear a company representative elaborate on design decisions for their products. I wish other companies did the same.

That being said, I wasn't making a point that a Unicode encoding shouldn't be used. In fact, I absolutely agree that it should be. Unicode is good. If everyone used a Unicode encoding (be it UTF-8, UTF-16, or even UTF-32) for everything, the world would be a better place.

The point I was making is that I think UTF-8 will have much wider adoption than UTF-16 in the future. The fact that it is variable length doesn't really matter, because it would be fairly easy to have the editor use a UTF-16 representation internally and just convert to and from UTF-8 when saving to a file.
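
As a rough illustration of how small that conversion step is, here is a Python sketch that rewrites a UTF-16 file as UTF-8. The function name is hypothetical, and Python's own string type plays the role of the editor's internal representation; nothing here reflects how the Propeller Tool is actually built.

```python
def utf16_file_to_utf8(src_path: str, dst_path: str) -> None:
    """Read a UTF-16 text file and write the same text back out as UTF-8."""
    # The "utf-16" codec uses the byte order mark, if present, to pick the
    # byte order. newline="" keeps line endings exactly as they are.
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()  # decoded text: the "internal" form
    # Plain "utf-8" writes no byte order mark.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

The reverse direction is the same two steps with the encodings swapped.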

UTF-8 is much easier to manipulate and work with than UTF-16 (at least as far as file editing with multiple editors is concerned), as it was designed to be an easy migration path to Unicode compliance. I think that you are greatly overestimating the difficulty of implementing it as a supported encoding for the Propeller Tool.

But that's just my opinion.


Jeff Martin
05-18-2008, 06:33 AM
Hi darco,

I see what you're saying, but I still don't understand why UTF-8 would have wider adoption than UTF-16. To me, the space savings are insignificant, and in some cases it would take more file space, so it can't be that. And manipulating variable-byte-width string encodings in an editor when doing copy/paste, page up/down, cursor movements, etc., is not impossible for sure, but certainly a nuisance, so I agree that it'd be best to convert it to UTF-16 internally for easy handling. Which again leads me to the same conclusion... why go through the trouble? Why wouldn't UTF-16 be naturally widely accepted as time goes on?

By the way, I'm not trying to be argumentative, I'm just trying to understand something that I may not have realized or considered.

Thanks.


rokicki
05-19-2008, 01:46 AM
UTF-8 is far, far more common and commonly supported by tools than UTF-16. A 16-bit representation is fine
for internal manipulation, and (for instance) both Java and Windows use this internally, but external text
files are almost always stored as UTF-8. Some examples:

On Linux, the default character encoding (these days) is UTF-8.

On the internet, the most common encoding is ISO-8859-1 (ughh) followed by UTF-8.

Even text editors that don't "know" UTF-8 can usually deal reasonably well with UTF-8 files, since all the
non-ASCII characters are represented by bytes in the range 0x80-0xff, and thus out of the ASCII control
character range. They just show the non-ASCII characters in the local character set. This is most assuredly
not true for UTF-16 files.

This is the main objection to UTF-16; almost all tools that are not Unicode-aware, absolutely
choke and die a miserable death for UTF-16, but they actually work reasonably well (if not perfectly)
for UTF-8. (A simple test: does the Windows find command:

find "PUB" *.spin

do the right thing on UTF-16 files? I bet it does for UTF-8 files.)
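
To see why a byte-oriented search behaves differently on the two encodings, here is a small Python sketch. It only models a tool that compares raw bytes; it says nothing about what the Windows find command itself actually does, and the sample line is made up.

```python
line = "PUB main"   # hypothetical line from a Spin source file
needle = b"PUB"     # what a byte-oriented tool would look for

utf8_bytes = line.encode("utf-8")
utf16_bytes = line.encode("utf-16-le")

# In UTF-8 the ASCII letters are stored unchanged, so the bytes match.
found_in_utf8 = needle in utf8_bytes

# In UTF-16 every ASCII letter is followed by a 0x00 byte
# (b"P\x00U\x00B\x00..."), so the three bytes "PUB" are never adjacent.
found_in_utf16 = needle in utf16_bytes

print(found_in_utf8, found_in_utf16)
```

A tool that is not Unicode-aware therefore needs no changes to find ASCII keywords in a UTF-8 file, but misses them in a UTF-16 file.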

On the other hand, notepad (yuck) may or may not deal well with UTF-8 files. And then there's the
matter of the byte order mark; many tools generate or want a byte order mark to indicate the file
is UTF-8, but then other tools do not generate it, or choke on it. So all is not well.

(The same problem exists for UTF-16; there's no way to tell if a file is UTF-16 or random binary
junk, or just a normal text file that just happens to have a lot of control characters in it.)
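
For reference, checking for a byte order mark comes down to comparing the first few bytes of the file. The sketch below uses the constants from Python's standard `codecs` module; the function name is hypothetical, UTF-32 is deliberately ignored, and a result of `None` only means that no mark was found, not that the file is not Unicode.

```python
import codecs


def sniff_bom(data: bytes):
    """Guess an encoding from a leading byte order mark, or return None."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None  # no mark: could be ASCII, ANSI, BOM-less UTF-8, or binary
```

This also shows the weakness mentioned above: without the mark, the function has nothing to go on.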

darco
05-19-2008, 04:10 AM
The big benefit of UTF-8 over other Unicode encodings is that UTF-8 is a superset of the ubiquitous ASCII encoding, meaning that it is backward compatible. Any ASCII-encoded file is, by definition, also a valid UTF-8 file. UTF-8 recently superseded Western European encodings as the most commonly used encoding on the internet.

The primary benefit of UTF-8 over UTF-16 is not file size, it is editing convenience. UTF-8 files are much easier to edit and work with in general, because you can still generally read and edit them even if you don't have a UTF-8 compliant editor—as long as the editor doesn't end up stripping out codes larger than 127.

I do not recommend using UTF-8 as the internal program representation. How you represent the file internally at runtime is of no consequence, and should be done in the most convenient way possible.

See en.wikipedia.org/wiki/UTF-8 (http://en.wikipedia.org/wiki/UTF-8) for more info.


Ariba
05-19-2008, 08:20 PM
Another problem is that the Parallax font does not work correctly on Linux. Even if you can load a UTF-16 file (for example with OpenOffice), you have to use the Parallax font for all the special electronics characters. But the characters of the Parallax font don't display at the right positions; I think something is wrong with the kerning.
Does one of the Linux guys know a free TrueType font editor, or have a corrected Parallax font? (I am trying to make a Linux version of PropTerminal.)

Andy

Sapieha
05-19-2008, 11:07 PM
Hi Andy.

In one of my posts I have posted an edited Parallax font.


Try it.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.

Sapieha

Ariba
05-20-2008, 12:59 AM
Sapieha
I've found this: http://forums.parallax.com/showthread.php?p=684568
and tried it, but this font does not work either.

Thank you anyway.

Andy

Sapieha
05-20-2008, 01:33 AM
Hi Ariba.

You must rename it, copy it to the system fonts, and mark it read-only.

Otherwise the Tool automatically reinstalls its own font.

P.S. Do you have a screen dump showing the problem?


darco
05-20-2008, 02:02 AM
Here are a few more references on the ubiquity of UTF-8 vs UTF-16:

This post (http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html) on the official Google blog (http://googleblog.blogspot.com/) describes how December of 2007 marked a milestone on the web: for the first time ever, UTF-8 became the most commonly used encoding on the internet. They even made a fancy graph:

http://forums.parallax.com/attachment.php?attachmentid=53698

Note that UTF-16 isn't even on the list.

BTW: Notepad, while not what I would remotely consider a "good citizen" with respect to text file editing, does indeed support UTF-8 encoded files (http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx).


Post Edited (darco) : 5/19/2008 6:15:17 PM GMT

Ariba
05-20-2008, 02:35 AM
Sapieha
I am trying to use the Parallax font on Linux. On Windows I have no problem with the original font. I don't know how to make a screenshot in Linux, but the result is very similar to this picture from the thread in which you posted your edited font (see attachment).

Andy

darco
05-20-2008, 03:06 AM
Ariba: Sounds like a separate issue from what I was talking about in this thread. You might want to start a new thread.


Ale
05-20-2008, 03:20 AM
Ariba: I do screenshots with GIMP's acquire function. I'm sure there is a key combination or a menu option for GNOME ;) I use Xfce.

A good editor is of course FontForge. I can even post my partial Parallax-like font that I drew for pPropellerSim if someone wants to finish it ;)

I prefer UTF-8; anything Windows "enforced" is a bad idea for me :)

Ariba
05-20-2008, 03:51 AM
Yes darco, sorry for hijacking.

But when we have UTF-8 and can use other tools and OSes with the Spin files, then the font problem will become an issue. ;)

(Ale: I will have a look at FontForge).

Cluso99
05-21-2008, 07:22 AM
Can anyone tell me how to make a UTF-16 file?

I want to store some results of pin samples in a format that the Propeller IDE can read, so that I can use it to display a dataset of pin samples using the waveform character set.

Thanks in advance.

kuroneko
05-21-2008, 07:29 AM
Cluso99 said...
Can anyone tell me how to make a UTF-16 file?


Write the UTF-16 little-endian byte order mark (0xFF 0xFE), followed by the characters in little-endian order, e.g. 'A' becomes (0x41 0x00).
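
On a PC, the same byte layout can be produced with a few lines of Python. This is just a sketch of the recipe above; the function name and the output file name are made up for illustration.

```python
import codecs


def utf16le_with_bom(text: str) -> bytes:
    """Return text as UTF-16 little-endian bytes, preceded by the FF FE mark."""
    # "utf-16-le" emits two bytes per BMP character, low byte first,
    # and does not add a byte order mark on its own.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


if __name__ == "__main__":
    # Hypothetical output file; any name ending in .spin would do.
    with open("samples.spin", "wb") as f:
        f.write(utf16le_with_bom("A"))
```

For the single character 'A' this yields the four bytes FF FE 41 00.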

Cluso99
05-23-2008, 05:23 PM
Thanks kuroneko,

I now have my DataLogger outputting in UTF-16 so that it can be captured by HyperTerminal in a *.spin file and read into the Propeller IDE as waveforms :)