OK - who is this guy and what has he done with the REAL Heater???
As far as I recall, that is Arabic. There are similar letters in Hebrew.
Thing is, I did not "write code like that". You can enter that into your editor in the normal western left-to-right manner. The editor then proceeds to swap things around and make gibberish on the screen. I consider this a bug in my editors. Worse still, a deliberate malfunction, introduced because, well, Unicode.
If you check the source code of this web page you will see that everything in the JS snippet is in the correct order. It is your browser that mangles it.
I agree that Unicode is as Heater says. There have been alternative 8-bit systems for other alphabets for a long time, most of which are based on ASCII and work just fine. Applications that wish to mix languages need only keep the codepage information for the alphabet being used and use standard ASCII-style character sets; there is even one for traditional Mandarin (using 3 characters for one glyph, though it still works).
On Win32, we will all have our own views; if you want to choke your potential market, great, let's leave it at that.
@Coder Kid:
Why do you have a couple of hundred blank lines in your sig?
Well your computer likely can, provided the correct text editor. On the other hand, the forum software would not allow a couple of million.
It makes any thread you post in exceedingly difficult to read.
Did someone named Unicode press and hold your Carriage Return key down for a good long while? If not, then no, you should not blame Unicode.
Can you not ruin this thread with your silly signature?
Having to scroll down for ages to get to the next post is not appreciated or funny.
Mods, can you nuke his several-thousand-line blank signature please?
There are plenty of places to do that outside of the forums. You just made this thread completely unreadable, meaning that you likely lost the attention you are attempting to get.
I agree that Unicode is as Heater says. There have been alternative 8-bit systems for other alphabets for a long time, most of which are based on ASCII and work just fine. Applications that wish to mix languages need only keep the codepage information for the alphabet being used and use standard ASCII-style character sets; there is even one for traditional Mandarin (using 3 characters for one glyph, though it still works).
Nope, doesn't work, was used for years, so we do have the experience. I could be in ISO-8859-1, but wouldn't be able to (as I have had to do for decades) combine multiple European languages in the same email (or document), unless they all fit in ISO-8859-1. Not to mention if I need to include (as I now do) Asian glyphs. Unicode actually solves that. Nothing did before. It's now very annoying when I run into software which doesn't handle utf8. I typeset music, but I can't (in the software I prefer for the actual typesetting) include Japanese lyrics.
Incorrect. There existed many well known and widely used formats for data that was in multiple languages, including languages that follow different rules from the Western European ones, such as the right-to-left direction of Hebrew, Aramaic, Arabic, etc., and those using a completely different structure such as traditional Mandarin or Nihonese. Thus the old system did work, and it worked very well, even when combining many diverse languages into one document.
Yes, I would also like to know... certainly none of the ISO-8859-x variants I know about can do that. Was there ever something I could set Content-Type: to in my emails that would have solved the issues I used to have (before utf8)?
https://books.google.fi/books?id=Zk2qCAAAQBAJ&pg=PA97&lpg=PA97&dq=multi-language+computer+systems&source=bl&ots=8_NrtN9b6k&sig=cG4G7htUak4B5MgJqVSBYJWpMy8&hl=en&sa=X&redir_esc=y#v=onepage&q=multi-language computer systems&f=false
If we can call that "well known and widely used".
Interesting book. The method described for the input / transformation of Japanese is not totally unlike the IME system used for input.
I don't think I ever ran into TRON in real life though.
That's because Microsoft lobbied the US government to impose an import ban on any TRON-based products, resulting in many Japanese companies dropping it.
Gotta love Microsoft.
While I do not have the links (and cannot find the old NNTP threads that were common with the information), the common method was to use ASCII as the base and use characters 128-255 to switch character sets (and direction, etc.). In cases of character sets that had more than 96 symbols, multiple subsets were switched between as needed.
This was a method used by many of the old DOS, TOS, Amiga, CP/M, and other multi-language programs, and there were a couple of standards for the assignments that caught on quite well.
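None of those standards are named here, but a scheme of that general shape (plain ASCII, with bytes above 127 acting as markers that switch the active character set) can be sketched in a few lines of C. The marker values and set names below are invented purely for illustration:

#include <stdio.h>

/* Hypothetical byte assignments, invented for illustration only. */
#define SWITCH_LATIN   0x80   /* "interpret following bytes as Latin"  */
#define SWITCH_HEBREW  0x81   /* "interpret following bytes as Hebrew" */
#define SWITCH_KANJI   0x82   /* "interpret following bytes as Kanji"  */

enum charset { LATIN, HEBREW, KANJI };

/* Walk a byte string, tracking which character set is active.
 * The meaning of every ordinary byte depends on the markers before it. */
static void decode(const unsigned char *text, size_t len)
{
    enum charset set = LATIN;                 /* implicit starting state */
    for (size_t i = 0; i < len; i++) {
        switch (text[i]) {
        case SWITCH_LATIN:  set = LATIN;  break;
        case SWITCH_HEBREW: set = HEBREW; break;
        case SWITCH_KANJI:  set = KANJI;  break;
        default:
            printf("byte 0x%02X read as character set %d\n", text[i], (int)set);
        }
    }
}

int main(void)
{
    /* "Hi" in the Latin set, then two bytes meant to be Hebrew symbols. */
    const unsigned char sample[] = { 'H', 'i', SWITCH_HEBREW, 0x41, 0x42 };
    decode(sample, sizeof sample);
    return 0;
}

That per-stream state is exactly what breaks when files move between machines, as the reply below describes: start reading mid-stream, or lose a single marker, and everything after it is read in the wrong set.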
That system worked fine for single languages, or small groups of languages that shared an alphabet, but broke down dramatically if you ever tried to move files between computers that used different languages. I have some experience here, as I was one of the architects of TOS, and multi-language support was always a headache.
I think many of the criticisms here of Unicode are a bit unfair. The problems that Unicode is trying to solve are simply hard, and there are no easy solutions. Human languages are messy and there are a lot of them! The UTF-8 encoding of Unicode is probably the best of a bad lot, and if it had been available "back in the day" I think it would have really helped with compatibility issues. UTF-8 is definitely superior to most of the previous multi-byte encodings, such as the one proposed by TRON, because it's not state-dependent: a fragment of (well formed) UTF-8 text can always be processed in the same way, and it's easy to detect whether UTF-8 characters are well formed, whereas in TRON you have to know a global context (the current language) and treat the bytes differently depending on this (and change behavior when a language change marker is seen).
I will not, though, under any circumstances, attempt to defend the inclusion of emojis in Unicode.
Eric
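On the point above about detecting whether UTF-8 is well formed: a structural check needs nothing beyond the byte patterns themselves. A minimal sketch (it deliberately ignores the extra rules about overlong encodings and surrogate ranges, which a full validator must also reject):

#include <stdbool.h>
#include <stddef.h>

/* Structural UTF-8 check: each lead byte announces how many continuation
 * bytes (10xxxxxx) must follow. Overlong forms and surrogates are not
 * rejected here; a complete validator has to handle those as well. */
static bool utf8_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        int extra;
        if      (b < 0x80)           extra = 0;  /* plain ASCII      */
        else if ((b & 0xE0) == 0xC0) extra = 1;  /* 2-byte character */
        else if ((b & 0xF0) == 0xE0) extra = 2;  /* 3-byte character */
        else if ((b & 0xF8) == 0xF0) extra = 3;  /* 4-byte character */
        else return false;                       /* stray continuation or invalid lead */

        if (len - i < (size_t)extra + 1)
            return false;                        /* truncated sequence */
        for (int k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                    /* missing continuation byte */
        i += (size_t)extra + 1;
    }
    return true;
}

No table of languages or code pages is consulted anywhere; that is the state-independence being claimed.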
I was there in the CP/M, DOS etc days. I don't recall any such multilingual encoding that "just worked".
Please be specific.
Really, it worked well for mixing US English, Hebrew, Hindi, and Nihonese (phonetic spelling, the language miscalled Japanese); no trouble switching between character sets, or even directional styles. Those are only a few of the languages that could be mixed with the old method.
It also did well between CP/M, DOS, XENIX, TOS, Amiga, Mac OS, etc, when sending between different people.
Your experience evidently differs from that of most people.
As I mentioned, UTF-8 has significant advantages over most other coding systems. Ken Thompson and Rob Pike (the inventors of UTF-8) are very smart guys. For some examples:
- It's self-synchronizing (you can start reading a stream anywhere and quickly find the start of the next character, and know exactly which language you're in).
- UTF-8 strings can be sorted just like ASCII strings, and manipulated using standard C string functions. There's no possibility of confusion or aliasing of characters (a big problem in Shift-JIS, for example).
- To find a string in files you can use regular tools like grep. Doing that with code page schemes is a lot harder, since you have to have specialized tools that can keep track of the code page each file is encoded with.
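The self-synchronizing property in the first point follows from the byte layout alone: continuation bytes always look like 10xxxxxx, so from any offset you step back at most three bytes to reach the start of a character. A minimal sketch:

#include <stdio.h>
#include <stddef.h>

/* Given any byte offset into valid UTF-8, step backwards over continuation
 * bytes (10xxxxxx) to the first byte of the character containing it.
 * For valid input this never needs more than three steps. */
static size_t utf8_char_start(const unsigned char *s, size_t pos)
{
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}

int main(void)
{
    const unsigned char text[] = "caf\xC3\xA9 au lait";              /* 'é' occupies bytes 3 and 4 */
    printf("character starts at byte %zu\n", utf8_char_start(text, 4)); /* prints 3 */
    return 0;
}

The "know which language you're in" part is really the absence of a question: unlike the set-switching schemes discussed earlier, no earlier marker has to be found before the bytes can be interpreted.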
There were three of them that were used for the exchanges; the one that had been shared as source code (K&R C) was used by everyone I dealt with, and had been ported to quite a few systems, including those mentioned. The source file name was langedit.c. It had a minimum of dependencies, mostly to display graphical fonts (which some systems did not have as part of the OS in those days). It did not work on systems where the graphical output could not be satisfied (some text-only CP/M systems and others). For portability it required only a function to initialize graphics, a function to plot a pixel in either black or white, and a function to close graphics.
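Taking that description at face value, the entire porting surface would have been on the order of these three declarations; the names are hypothetical, since only the functions' purposes are described:

/* Hypothetical porting layer, one implementation per target system.
 * Only these three entry points are assumed by the editor described. */
int  gfx_init(void);                  /* bring up graphics; nonzero means not available   */
void gfx_plot(int x, int y, int on);  /* draw one pixel, black (on = 1) or white (on = 0) */
void gfx_close(void);                 /* shut graphics down and restore the text screen   */

Systems that could not supply the graphics initialization (the text-only CP/M machines mentioned) simply had no port.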
Perhaps you will find it while I am offline. If not I will do a deeper search when I get back on line.
To correct my omission, one notable thing about langedit was that it was a huge monolithic source file that really should have been broken up into multiple files.
It seemed to take forever the first time I read it (already modified to compile in DOS with Turbo C, I think, perhaps something else). When I got it I did the normal command type langedit.c | more to see what was in it before I compiled it, and it seemed like it took over an hour to read.
- It's self-synchronizing (you can start reading a stream anywhere and quickly find the start of the next character, and know exactly which language you're in).
UTF-8 does have that magical property of being stateless. You always know whether you are at the beginning of the sequence of bytes for a single code point or not. Also there are no endianness issues.
I can't help thinking that there might be higher-level synchronization issues when a character can be written either as a single precomposed code point or as a base character plus combining code points. Have to look into this.
- UTF-8 strings can be sorted just like ASCII strings...
Does this really work with precomposed characters?
...manipulated using standard C string functions
Probably not a good idea. strlen won't tell you how many characters are in a UTF-8 string.
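For comparison, counting code points instead of bytes only requires skipping the continuation bytes; a minimal sketch (and note that even this counts code points, not the characters a reader sees, once combining marks are involved):

#include <stddef.h>

/* Count code points in a NUL-terminated, valid UTF-8 string by counting
 * every byte that is not a continuation byte (10xxxxxx). Combining marks
 * still count separately, so this is not "characters as displayed". */
static size_t utf8_codepoint_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}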
To find a string in files you can use regular tools like grep.
Nope, does not work. Consider a Unicode text file where the same visible word is stored once with a precomposed character and once as a base letter plus combining mark: two greps for the "same" string give different results. Why? Those pesky precomposed symbols again.
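The effect is easy to reproduce with a plain byte-level substring search, which is essentially what grep does with a fixed pattern; the file contents below are made up for illustration:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The same visible word, encoded two different ways. */
    const char *line_nfc = "caf\xC3\xA9";    /* 'é' as the single code point U+00E9    */
    const char *line_nfd = "cafe\xCC\x81";   /* 'e' followed by combining acute U+0301 */
    const char *pattern  = "caf\xC3\xA9";    /* what the user types into the search    */

    printf("precomposed line: %s\n", strstr(line_nfc, pattern) ? "match" : "no match");
    printf("decomposed line:  %s\n", strstr(line_nfd, pattern) ? "match" : "no match");
    return 0;
}

Both lines render identically, yet only the first one matches; grep over a file behaves the same way unless the text is normalized first.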
So that's "no" then. There were no "well known and widely used formats for data that was in multiple languages" back in the day.
langedit sounds kind of cool. But as nobody has ever heard of it, Google cannot find it, and it was not used by any commonly used operating systems or tools, I'm not going to count it.
Everyone on earth across multiple countries that I was aware of at the time used these three tools, which were largely interchangeable. So I would call it well known and widely used.
Also it was repeatedly published (with updates for more OS's) on the NNTP threads, and everyone talked about it when exchanging multiple language documents.
@Heater:
As much as I enjoy what I learn in our debates, I will be offline very soon (as soon as the sun goes down so it is cool enough to get moving), and will not be back for a few months.
So I think we should put this one off until I return.
The only format I ever found and had success with was actually the bitmap.
Code page type schemes gave me no end of headaches. Communicating technical requirements via those things was pretty much hell.
The bitmap (just one bit per pixel is fine) and fax, or a file transport, actually worked.
And that is what a ton of people in my industry niche did. The data being exchanged was engineering drawings, various instructions and requirements.
From that point of view today, as opposed to 80s and 90s era troubles, unicode largely just works. Making that happen is painful, but the end products of it all today, for people just looking to communicate, are useful and an improvement.
Lol, I like the emoji. Completely frivolous, largely useless, inane, all apply. But ordinary people like em.
Good. Some of this stuff can still be silly and as long as it is, ordinary people won't feel completely left out.
It's culture, a lot like the animated gif is.