ruiz: I agree with jazzed (Steve). Start a new thread please, and place a link here for anyone that wishes to follow it later. This is exciting.
BTW Brad (who wrote bst) has a large test suite for validation. He has been absent for some time. Perhaps he can help.
I've submitted an update (r32) that fixes the two bugs mentioned here and also adds the -c feature to match the BSTC -c feature (output a dat file with only the DAT sections compiled in).
ruiz, that's pretty awesome that you made another version of the compiler in straight C. Several people around here will like that once it's working fully. Although, I agree with others, it would be best to start your own thread for it, so as not to confuse and mix things up with my thread here.
As for testing, my testing so far just involves compiling a bunch of stuff with PropTool 1.3 and saving the binary files, and then compiling the same stuff with my code and doing binary compares (with Beyond Compare). Once Jeff Martin finishes what he is currently working on, the plan is to make a "test harness" or something that does something similar to what you described, including both valid and invalid code snippets, to verify that the compiler both compiles good code correctly and reports its error conditions properly.
Sorry to be perceived as 'hijacking' Roy's thread, my sincerest apologies. Below is a wrap-up/changeover post.
<soapbox>I can't quite get my head around this community's preference for spreading related information thinly over many threads, especially as there is only the feeblest of attempts to consolidate know-how in wikis etc. I know from recent experience that the resulting digging for info is not a pleasant new user experience. I've started a new thread "Spin development tool chain" that anyone can use for even vaguely related posts, so that all insights are in one thread. The new thread is currently sitting in the moderation queue.</soapbox>
@jazzed
'group project' in a limited sense. I figure that the group of significant contributors in the tool chain arena consists of about 10 people, with perhaps 5 having good C/C++ skills. Those 5 are already busy with stuff that is more important to the Propeller and Parallax than this, so distracting them with collaborative coding would be a waste. However, having a thread to discuss architecture, related/previous work, and synergies (such as a shared file format for debug info) would be helpful. In other areas a group effort might be much more beneficial; see below.
@rayman
In the new thread I've posted a version with con.c renamed and the clock mode thing I mentioned in the first post fixed. If you let me know your mileage with that in the new thread we'll work on the build with visual studio and the cause of the 4 byte difference -- perhaps structure packing is different. If you send a PM, I'll mail you the new zip right away.
@cluso
Good to hear that Brad already has a test suite -- but I fear that if he had been willing to share it, he would have done so already. Would you have contact details for Brad, and perhaps also for 'Hippy'?
@roy
Once again sorry for misposting. Can I ask you about some of the design choices in your code base in this thread, or would you prefer another thread for that too?
Once Jeff Martin finishes what he is currently working on, the plan is to make a "test harness" or something that does something similar to what you described.
I have both the handicap and the benefit of having an outsider's perspective. My best guess is that Jeff will be busy for several months with Prop II related stuff and after that can make one-third max of his time available for this. Considering the size of the task (writing a validation suite is typically a bigger job than writing the software), I think that it might be ready one year from now. Hopefully I am totally mistaken, but what if I'm right?
I respectfully suggest that creating test cases is a community effort par excellence: it consists of lots of small tasks that can be done in parallel and the talent pool for this is probably an order of magnitude larger than what I mentioned to jazzed above. Just my 2c worth.
Looking at the unicode handling. The way I understand your code base, it copies every file to memory as UCS-16 and then converts the entire source to the Propeller's special 8 bit char set; this converted source is then used in further compilation. This looks odd to me. I've only quickly read your code, so perhaps I misunderstood the algorithms used.
In my understanding there are five contexts to consider:
[1] comments - these can be any unicode code point and don't need to be translated as the lexer throws them out anyway (and should be preserved as-is in a list file)
[2] object names - these are (ascii) identifiers and don't need translation (translating is a no-op); if paths are allowed here, as in bst, see [3]
[3] dat section file includes - here the path name should be left as-is; it seems to me your code corrupts non-ascii paths. (In the backend some translation will be necessary though, as the Windows API wants UTF-16 paths and the POSIX API wants UTF-8 paths, perhaps most easily implemented as a wrapper around "fopen" and friends.)
[4] literal strings in methods and dat sections - these need translation from unicode to the Propeller char set.
[5] everything else - this must be ascii so no translation is necessary.
So, I would say that the easiest algorithm is to only translate the strings of type [4] above. This would typically be a few hundred bytes, if even that. That in turn enables using reverse lookup, i.e. a table with 160 shorts mapping Propeller code points to UCS-16 code points. The whole thing fits in 50 lines or so.
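To make the size claim concrete, here is a C sketch of that reverse lookup. The table contents below are invented placeholders, not the actual Parallax font mapping, and it is sized at 128 high-half entries rather than the 160 mentioned above:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical reverse-lookup table: UCS-2 code points for the 128
   Propeller-font characters 0x80..0xFF.  Only a couple of
   illustrative entries are shown; 0 marks an unmapped slot. */
static const uint16_t prop_to_ucs2[128] = {
    [0x00] = 0x2190,  /* 0x80: leftwards arrow (assumed) */
    [0x01] = 0x2192,  /* 0x81: rightwards arrow (assumed) */
    /* ... remaining entries would fill in the real font table ... */
};

/* Translate one UCS-2 code point from a literal string to a Propeller
   code point; returns -1 if the character has no Propeller equivalent. */
int ucs2_to_prop(uint16_t cp)
{
    if (cp < 0x80)               /* plain ASCII passes through */
        return (int)cp;
    for (int i = 0; i < 128; i++)
        if (prop_to_ucs2[i] == cp)
            return 0x80 + i;     /* reverse lookup */
    return -1;                   /* untranslatable */
}
```

Since only a few hundred bytes of literal strings go through this, the linear scan is perfectly adequate.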
The x86 compiler code I started with expected the entire file to be converted to PASCII when passed into it. My code that does that is a port of the Delphi code from Propellent.
In order to avoid converting to PASCII for most of the code, I would have to fix up all of the compiler code to deal with Unicode16, and then add code to convert only strings into PASCII. On top of that, what happens when the input file is not in Unicode16 form (which happens)? Now I have to make the compiler code handle multiple character sizes/formats? I think it's easier to just convert any input file from whatever form it's in to PASCII and then pass that to the compiler. My current code handles Unicode16 and ASCII source files; I will be adding support for UTF-8 as well.
PASCII = Propeller ASCII. It's a 1-byte character set with ASCII in the first 128 chars, and the Propeller font characters in the second 128. The name PASCII comes from the Propellent source.
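A minimal C sketch of that up-front conversion, assuming UTF-16LE input with the BOM already stripped. The ucs2_to_pascii helper here is a placeholder for the real font-table lookup; it only passes ASCII through and maps everything else to 0xFF:

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder for the real font-table lookup: ASCII maps to itself,
   everything else falls back to 0xFF. */
static uint8_t ucs2_to_pascii(uint16_t cp)
{
    if (cp < 0x80)
        return (uint8_t)cp;
    return 0xFF;   /* real code would consult the Parallax font table */
}

/* Convert a UTF-16LE buffer (BOM already stripped) to one PASCII byte
   per input character; returns the number of bytes written to dst. */
size_t utf16le_to_pascii(const uint8_t *src, size_t nbytes, uint8_t *dst)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < nbytes; i += 2) {
        uint16_t cp = (uint16_t)(src[i] | (src[i + 1] << 8)); /* little-endian */
        dst[n++] = ucs2_to_pascii(cp);
    }
    return n;
}
```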
In Unix circles it's called 7-bit ASCII and is ISO 646. It would appear that ASCII only defines the 7 bits; the high bit is implemented by code page support and isn't standardized.
Roy
When one uses the -d option to compile docs, is what comes at me from stdout PASCII and not Unicode? I ask because I have been playing around with capturing it with a THandleStream, and the only thing that works in all cases is if I say the encoding is good old plain ASCII and let the special chars print whatever goofy char it wants to replace them with.
Do you know of a function to convert PASCII to UTF-16?
Correct me if I am wrong, but I did not see the command window on my Win7 PC show the Parallax drawing chars.
The x86 compiler code I started with expected the entire file to be converted to PASCII when passed into it. My code that does that is a port of the Delphi code from Propellent.
I appreciate that you had to start with a poorly architected code base, but if now is not the time to fix it, when is? The alternative is to document the broken behaviours.
Now I have to make the compiler code handle multiple character sizes/formats?
No. What I see most of the time (and leads to the cleanest code, imho) is to use 32-bit ("UTF-32") for single chars and UTF-8 for strings throughout the internals and convert to UTF-8/16/32 in the input/output routines as may be needed. Ten years ago it was popular to do all internals in UCS-2 and ignore China (where plane 1 code points are required).
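A sketch of the decode half of that convention in C: UTF-8 bytes in, a 32-bit code point out. Error handling is reduced to returning U+FFFD (the replacement character) for a malformed lead byte, which a real implementation would do more carefully:

```c
#include <stdint.h>
#include <stddef.h>

/* Decode one UTF-8 sequence into a 32-bit code point; *len receives
   the number of bytes consumed. */
uint32_t utf8_decode(const uint8_t *s, size_t *len)
{
    if (s[0] < 0x80) { *len = 1; return s[0]; }          /* ASCII */
    if ((s[0] & 0xE0) == 0xC0) {                         /* 2-byte */
        *len = 2;
        return ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {                         /* 3-byte */
        *len = 3;
        return ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    }
    if ((s[0] & 0xF8) == 0xF0) {                         /* 4-byte: plane 1+ */
        *len = 4;
        return ((uint32_t)(s[0] & 0x07) << 18)
             | ((uint32_t)(s[1] & 0x3F) << 12)
             | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    }
    *len = 1;
    return 0xFFFD;                                       /* malformed */
}
```

The 4-byte branch is exactly the part that the old UCS-2-everywhere approach could not represent.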
My current code handles Unicode16 and ASCII source files.
I may have read your code too superficially, but I think it handles little-endian UCS-2 and ASCII. (sorry for calling UCS-2 'UCS-16' in the earlier post). See: http://en.wikipedia.org/wiki/UTF-16
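Distinguishing the input forms usually comes down to sniffing the first bytes. A C sketch, assuming (as this thread suggests but doesn't confirm) that the UTF-16LE files carry the 0xFF 0xFE byte-order mark:

```c
#include <stdint.h>
#include <stddef.h>

enum src_encoding { ENC_ASCII, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF8 };

/* Guess a source file's encoding from its leading bytes.  Anything
   without a recognizable BOM is treated as plain ASCII here. */
enum src_encoding sniff_encoding(const uint8_t *buf, size_t n)
{
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return ENC_UTF16LE;
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return ENC_UTF16BE;
    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;                  /* UTF-8 BOM */
    return ENC_ASCII;
}
```

BOM-less UTF-16 would defeat this and need a heuristic (e.g. counting zero bytes in even/odd positions), which is omitted here.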
ruiz,
I still don't see what is wrong with converting the incoming files to PASCII and having the compiler code work with PASCII. That's how I plan to leave it. I also wouldn't call the original code "poorly architected".
tdlivings,
The doc output is in PASCII. I suspect that the Proptool converts it back to Unicode form for display. I don't have a function to do the reverse conversion yet. You won't see the Parallax font characters in a command window; you need to be using the Parallax font to see those.
If this were a parser + tokenizer compiler, UTF would be handled trivially at the tokenizing stage. You would simply smash all non-7-bit data as whitespace. The tokenizer makes tokens from the input stream and the parser handles structure.
@Roy
"Poorly architected" is indeed a bit strong, sorry -- just was trying to cut you some slack. I should have said "aged architecture": it looks to me that the PropTool builds on a long heritage and that some architectural choices that made sense at the time became less than convenient as the code base developed. It is not unusual for long lived projects to go through major re-architecting (think mozilla => firefox, sqlite2 => sqlite3) and the best projects refactor and re-architect continuously. Note that in re-architecting usually 90% or more of the code doesn't change.
It seems to me that, at some point in time, the PropTool internationalization code was layered on top of an existing code base, and that this is not the best way to do it. I think there are similar structural/algorithmic things deeper down in the code base that will make it unnecessarily hard to match BST feature for feature, and that time invested now in some re-architecting will pay itself back twice over later on, but that is just a gut feel.
@Roy/Phil:
The problem is that the current code jumbles/corrupts non-ascii file paths. That may not be an issue to you, but it sure is a pain if you live in Brazil, Germany, Russia, etc.
Less important, the comments also get jumbled. The jumbling cannot be reversed fully, as many Unicode code points all get mapped to 'pascii' 0xff.
I guess Roy's current goal is to match PropTool, even where it comes to less convenient "features" -- and this is a very understandable approach.
The problem is that the current code jumbles/corrupts non-ascii file paths.
No, the real problem is that file paths are OS-specific and should not be part of the Spin language, per se. The PASM file pragma, for example, should really be a pre-processor directive. The compiler should never have to see it. Same goes for "file names" in the OBJ section, which we've already discussed to exhaustion. That does leave open the question of whether the language ought to accommodate non-ASCII object, variable, and method names. ASCII-centrism may seem chauvinistic; but, being a common denominator, it does have the advantage of forcing cross-cultural compatibility.
Roy,
Thanks for verifying what my experimenting showed. In hindsight one could have figured that in the first place, but I learned far more going the long way around trying all the Unicode encodings. The memo component I am using does have the Parallax fonts and will load the Parallax UTF-16 LE files and display the Parallax drawing chars.
I think converting PASCII back to UTF-16 LE would be the best way, as PASCII was created by going the other way in the first place. As you say, the PropTool does it, so there is an algorithm, but that one is not part of their open source.
The problem is that the current code jumbles/corrupts non-ascii file paths. That may not be an issue to you, but it sure is a pain if you live in Brazil, Germany, Russia, etc.
Maybe I'm mistaken, but doesn't the Parallax character set use the same codes as ISO 8859 for the common non-English characters? So it seems like it shouldn't be a problem in Brazil and Germany, but maybe in Russia, Greece, Japan and China. I would guess that the latter set of countries have to contend with this problem with many other software apps.
No, the real problem is that file paths are OS-specific and should not be part of the Spin language, per se. The PASM file pragma, for example, should really be a pre-processor directive.
The problem isn't file paths, but absolute file paths. Those should be avoided. There's nothing wrong with a relative path such as "cogcode/widget.dat" or "../include/header". Yes, it does use characters such as ".." and "/", but those are understood on many operating systems without translation. The "/" character just needs to be translated to "\" for Windows.
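The separator translation is trivial. A C sketch that takes the target separator as a parameter; real code would pass '\\' when building Windows paths:

```c
#include <stddef.h>

/* Rewrite the '/' separators of a relative path to a target
   separator, in place.  ASCII-only paths assumed. */
void translate_sep(char *path, char sep)
{
    for (char *p = path; *p != '\0'; p++)
        if (*p == '/')
            *p = sep;
}
```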
The problem isn't file paths, but absolute file paths.
Even relative file paths may contain characters foreign to PASCII and, hence, to Spin. These should be processed and resolved ahead of the compilation phase, so that all the compiler sees is the included text/data and a linked array of object source. By enforcing a partition between the language, which the compiler deals with, and the local implementation, which is the domain of a pre-processor, the compiler can be kept more portable.
One idea of how to handle this, without majorly changing the compiler, is to handle filenames before they are handed to the compiler proper.
Before 7bit conversion:
You scan the Spin file for filenames, copy the filename to a structure, then replace the filename with a program generated filename. When the compiler wants to open a file, the open is wrapped by a function that looks up the generated filename and substitutes the actual filename.
It's a thin translation layer that allows you to have 7bit clean filenames, while retaining i18n support at the filesystem level.
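A C sketch of such an aliasing layer. The fixed-size tables and the "f0000.dat" naming scheme are illustrative choices, not from any existing tool:

```c
#include <stdio.h>
#include <string.h>

#define MAX_ALIASES 64

static char real_name[MAX_ALIASES][260];   /* original filenames  */
static char alias_name[MAX_ALIASES][16];   /* generated 7-bit names */
static int  alias_count = 0;

/* Register a real (possibly non-ASCII) filename found during the
   pre-scan; returns the generated 7-bit-clean alias that replaces
   it in the source handed to the compiler. */
const char *register_file(const char *real)
{
    snprintf(alias_name[alias_count], sizeof alias_name[0],
             "f%04d.dat", alias_count);            /* generated name */
    strncpy(real_name[alias_count], real, sizeof real_name[0] - 1);
    return alias_name[alias_count++];
}

/* fopen() wrapper the compiler calls: maps an alias back to the
   real filename before touching the filesystem. */
FILE *aliased_fopen(const char *name, const char *mode)
{
    for (int i = 0; i < alias_count; i++)
        if (strcmp(name, alias_name[i]) == 0)
            return fopen(real_name[i], mode);
    return fopen(name, mode);   /* not an alias: open directly */
}
```

The compiler itself never sees anything but the generated names, which keeps its internals 7-bit clean.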
It would be better to copy the pre-processed objects under their generated names to a temporary directory and compile them from there.
One thing that needs to be resolved, when a pre-processor is used to modify the source, is to preserve line numbers for error reporting. When a compiler works with a pre-processor that can modify the source, there needs to be a communication protocol involving generated meta-comments that maintains synchrony between the original source and the processed source.
Better still would be for the pre-processor to run the show, maintain the line correspondences internally, and call the compiler as a sub-function. That way the compiler doesn't have to mess with that stuff.
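For illustration, here is one shape those meta-comments could take, borrowed from the C pre-processor's #line directives. The '{line N "file"} comment form is invented here for the sketch; Spin has no standard syntax for this:

```c
#include <stdio.h>
#include <string.h>

/* Parse a marker of the form:  '{line 42 "blinker.spin"}
   The pre-processor would emit one whenever the correspondence
   between processed and original lines breaks; the error reporter
   parses it back.  Returns 1 on a valid marker, 0 otherwise. */
int parse_line_marker(const char *s, int *line, char *file, size_t filecap)
{
    char buf[260];
    if (sscanf(s, "'{line %d \"%259[^\"]\"}", line, buf) != 2)
        return 0;                        /* not a marker */
    snprintf(file, filecap, "%s", buf);  /* copy, bounded by filecap */
    return 1;
}
```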
Even relative file paths may contain characters foreign to PASCII and, hence, to Spin.
Phil, do you have a concrete example of this, or is this just a concern about a hypothetical case on some hypothetical OS? I don't understand the problem. Why doesn't this break C when it contains file paths in its include statements?
It doesn't break C because it uses UTF-8, and converts it to the system native charset when necessary. Most POSIX filesystem calls take UTF-8 filenames, so no conversion is necessary. By forcing code through a to/from PASCII conversion, you will lose localized filenames that have characters that may not exist in PASCII.
I don't have a concrete example, and you're right that PASCII includes the extended Latin character set. But there are also localizations that use Cyrillic characters and those from other languages completely foreign to extended ASCII. I think Circuitsoft answered your question about C, which I'm not familiar enough with to have done so.
OK, that makes things a bit clearer. I suppose this could be handled in Spin by requiring that path names must be specified using the standard ASCII characters. For the most part, relative path names would reference just a few directory levels up or down from where the current source file is located.
I still think it's better to handle it by not letting the compiler see any path or file names, except those names temporarily produced by the pre-processor. Such OS- and localization-dependent matters should not be part of the Spin language specification.
Roy
spin.exe has an issue with these files. One seems to be an issue of having a space in the file name; if I rename it QE.spin, it works fine. In the Walking Ring file I do not see a space; maybe the name is too long. Both files show no error message from spin.exe, but they do return a status of 1 indicating a problem and write the help to stderr, as shown in my capture of Quad Encoder. In the pic, first I just ran spin; the second part shows the failure but no indication of why. The PropTool has no issue with either file. I had one other file produce status of 1 from spin.exe, and it also had a space in the file name.
It seems to be in the area of building the filename in your spin.exe code.
Tom
tdlivings,
On the command line if you want to specify a name with a space in it you have to wrap it in quotes. Try spin "Quadrature Encoder.spin". That's just a standard thing with command line stuff.
I'll debug the other long filename tonight after work.
Having 'con' in a filename is not valid on Windows, as it is a reserved name (a legacy from the DOS days).
Cheers