The speedup of the new register load/save methods is more dramatic on the other benchmarks. This is for the simplespi.c benchmark:
1. Using individual rdlong/wrlong, but only on registers that are used: 572413 cycles. Again, this is the method Catalina currently uses.
2. Using instruction skipping, but only on registers that are used: 432125 cycles
3. Using "Fast Block Moves", but saving all registers every time: 416765 cycles
That's almost a 30% improvement. Clearly, I should pay a bit more attention to optimizing Catalina for speed!
1. Using individual rdlong/wrlong, but only on registers that are used: 17 iterations/sec
2. Using instruction skipping, but only on registers that are used: 21 iterations/sec
3. Using "Fast Block Moves", but saving all registers every time: 23 iterations/sec
Again, almost a 30% improvement.
The thing is, if you are running hub exec AND accessing stack variables one at a time, this is not going to be very efficient: you are competing with the FIFO that is fetching instructions from hub at the same time, and you also incur the variable latency of every individual rdlong/wrlong. So bursting them all in one go is much better.
Also, increasing the number of registers available to act as local "register variables" can help reduce the number of stack accesses in general. Certainly p2gcc's 16 registers were too few for high performance on the P2: in some cases with the MicroPython codebase I noticed it had to shuffle things around too much between its few working registers and write them back to the stack more often than was ideal. Maybe 24-32 registers is okay, or even more in cases where we are not so concerned about interrupt latency. I do realize that was a concern for you with Catalina's ISR response time.
Are you sure you can't use interrupts with skipping? I read this from the P2 manual, which seems to indicate it might be possible at least in PASM ISRs, but I guess it may still have problems if you try to start a new skip sequence in C coded ISRs. Not sure there.
"Within a skipping sequence, a CALL/CALLPA/CALLPB that is not skipped will execute all its nested subroutines normally, with the skipping sequence resuming after the returning RET/_RET_. This allows subroutines to be skipped or entirely executed without affecting the top-level skip sequence. As well, an interrupt service routine will execute normally during a skipping sequence, with the skipping sequence resuming upon its completion."
That says you can interrupt a skip, which is not the same thing as using skip in an interrupt. I'm looking at the top of page 18 in the Parallax Google Doc:
Skipping only works outside of interrupt service routines; i.e. in main code.
Catalina allows arbitrary C functions as interrupt service routines. I believe it may be unique in doing so. So, unless I am reading that wrong, I can't use skip instructions in C code without an awful lot of messing about to detect whether the function is used as an interrupt routine or not. And if it is a library function, you may not know this at compile time.
Nested skipping:
As a slight aside, I think it would be good for a future enhanced P2 or P3 to have a 64-bit wide hardware stack (or possibly 63 bits). The extra 32 (or 31) stack bits would be loaded with the skip bits remaining when a CALL instruction is executed and they would be loaded back into the skip bit register by a RET. This would allow ISRs to use skipping and, perhaps more importantly, permit nested skipping in main code. A new CALL+SKIPF instruction would be handy, similar to EXECF (which is really JMP+SKIPF).
One current issue is remembering not to use skipping in routines called from a skip sequence and sometimes this means duplicating code: one version with skipping and one without.
Way too complex for a compiler to handle
It would all happen automatically with CALL/RET and all skip sequences would be 'interruptable'.
That would be the linker's job. Arduino does a very smart thing with "gcc -ffunction-sections -fdata-sections" and "gcc -Wl,--gc-sections": every function goes into its own section, called .text.my_function_name, and the linker throws away any unreferenced sections.
So, for this, the compiler duplicates every function as my_function and my_function.noskips. Obviously, the .noskips version of each function calls the .noskips version of each internal call. If a function doesn't need skips and doesn't make any calls, the .noskips symbol can simply be an alias with the same section and location. The _crt_start function calls main, and every interrupt registration call uses the .noskips version of the passed-in function.
Finally, if you use -ffunction-sections and --gc-sections, or a similar mechanism in your compiler, the linker will automatically take care of what is and isn't run in an interrupt context.
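For the alias case, a minimal sketch in GCC-style C (my_function_noskips here is only a stand-in for the compiler-generated my_function.noskips symbol, which isn't a legal C identifier):

```c
/* Hypothetical sketch of the duplication scheme, assuming a GCC-style
   toolchain built with -ffunction-sections so each function gets its
   own section. A function that contains no skip sequences and makes no
   calls doesn't need a second copy: its .noskips variant can just be a
   linker alias for the original. */
#include <stdio.h>

void my_function(void)
{
    printf("hello\n");
}

/* Same code, same section, second symbol for the interrupt-safe name. */
void my_function_noskips(void) __attribute__((alias("my_function")));

int main(void)
{
    my_function();           /* path used by main code        */
    my_function_noskips();   /* path an ISR would be wired to */
    return 0;
}
```

Whichever variants end up unreferenced are then discarded at link time by --gc-sections.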
This may be of interest. I have taken version 1.2 of the FFT benchmark and modified it so that it should now compile on all the compilers (although I have only tried fastspin and Catalina). The files are attached.
The interesting thing about this version is that it includes the option of using OMP for parallelism, so I have also modified it to use Catalina's multi-processing capabilities (when compiled with Catalina). This is a fairly mechanical process, described in my "sieve" example (which I have also attached).
I will have to see if I can make Catalina's multi-processing as easy to use as OMP.
Here are the results:
Fastspin:
Command (using -O1, because -O2 gives an error):
fastspin -2 -D__P2__ -O1 fftbench.c
Result = 16626 us
Catalina:
Command to use 1 slice (i.e. 1 cog only, no parallelism):
catalina -p2 -lci -ltiny -lthreads -O5 -C P2_EVAL -C NATIVE -D __P2__ fftbench.c thread_factory.c -D LOG2_SLICES=0
Result = 34195 us
Command to use 2 slices (i.e. 2 worker cogs):
catalina -p2 -lci -ltiny -lthreads -O5 -C P2_EVAL -C NATIVE -D __P2__ fftbench.c thread_factory.c -D LOG2_SLICES=1
Result = 20593 us
Command to use 4 slices (i.e. 4 worker cogs):
catalina -p2 -lci -ltiny -lthreads -O5 -C P2_EVAL -C NATIVE -D __P2__ fftbench.c thread_factory.c -D LOG2_SLICES=2
Result = 14342 us
As you would expect, this shows that you can get very significant speed improvements by using multiple cogs on the Propeller. It also makes Catalina competitive again!
You may want to try propgcc and riscvp2 and add the results. I know gcc actually implements OMP, but I don't know if anyone has got that working on the Propeller version.
Ross.
EDIT: It just occurred to me that I can also use the new FAST_SAVE_RESTORE option (although doing so requires the thread library to be recompiled). If you do so, the 4 slices case comes down slightly, to 13168 us.
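(As an aside, here is a hypothetical sketch, not Ross's actual code, of how a LOG2_SLICES compile-time option like the one in the commands above might be used: split the work into 1 << LOG2_SLICES chunks, one per worker cog.)

```c
#include <stdio.h>

#ifndef LOG2_SLICES
#define LOG2_SLICES 2            /* e.g. -D LOG2_SLICES=2 -> 4 slices */
#endif
#define SLICES (1 << LOG2_SLICES)
#define N      1024

static float data[N];

/* Work done by one slice; on the Propeller each slice would run in its
   own cog, started via whatever thread/cog-start call the library offers. */
static void do_slice(int slice)
{
    int chunk = N / SLICES;
    int start = slice * chunk;
    for (int i = start; i < start + chunk; i++)
        data[i] = data[i] * 0.5f;  /* stand-in for the real FFT work */
}

int main(void)
{
    for (int s = 0; s < SLICES; s++)
        do_slice(s);               /* run sequentially here for illustration */
    printf("done with %d slices\n", SLICES);
    return 0;
}
```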
Now somebody needs to get some basic Posix supported and I might get GnuCobol running...
Mike
Catalina implements enough multi-tasking primitives to build a complete Posix threads implementation, but this is left as an exercise for the reader!
Further to my previous post ...
I've now completed this process - it turned out to be easier than I expected. So (similar to OMP) you can now just insert a few pragmas to "parallelize" the FFT program (or the Sieve program, or any other program) rather than having to manually rewrite them (as I originally had to do).
I looked at the OMP pragmas, but (similar to what I found with the Posix thread libraries) they seemed hideously complex and unnecessarily cumbersome, so I came up with some much simpler ones that are more suited to the Propeller.
The new pragmas, and the new version of the fftbench.c program, are described in detail here. I would be interested to hear if you wanted to pursue a similar method in your compilers.
Ross.
I think it's better to use some existing standard, or at least a subset of it. A lot of the complexity for OpenMP is probably necessary for the general case. The stuff that isn't could be left out. The same holds for Posix threads; it looks like your thread API is not that different from a subset of Posix, so why not go all the way and just make it a subset so programs could compile on both?
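For reference, the create/join core of such a Posix subset is small; a minimal sketch, assuming nothing beyond basic thread creation and joining:

```c
#include <pthread.h>
#include <stdio.h>

/* The smallest useful slice of the Posix threads API: start a thread,
   wait for it to finish. A Propeller thread library exposing just this
   (plus a mutex) would let the same source build on both. */
static void *worker(void *arg)
{
    printf("worker %d running\n", *(int *)arg);
    return NULL;
}

int main(void)
{
    pthread_t t;
    int id = 1;

    pthread_create(&t, NULL, worker, &id);
    pthread_join(t, NULL);
    return 0;
}
```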
Making "benchmarks" that only run on a specific platform and with a specific compiler seems a bit counterproductive to me...
Making "benchmarks" that only run on a specific platform and with a specific compiler seems a bit counterproductive to me...
Which is precisely why we shouldn't use either OpenMP or Posix threads - at least not directly.
OpenMP requires implementation by the compiler manufacturer. It is not a "bolt on" like a shared library - every compiler maker has to implement it within the compiler. You could perhaps get it working on gcc on the Propeller, but probably not on any other compiler, so it is not a very good option. However, there is nothing to stop you using OpenMP as well. The fftbench example I use demonstrates this - it will use either OpenMP or Catalina parallelization - whichever is available - and neither one requires any change to the actual algorithm.
At least there are some portable thread library implementations, so we are certainly better off if we use threads of some kind. But even if we agree on a specific thread library, using it directly is also not a good option, since programs would typically have to be modified to add parallel processing, and then they would no longer work on compilers that didn't implement that particular library.
This is why I ended up using pragmas, just as OpenMP does, and for the same reasons - pragmas allow you to hide the implementation details, and also allow the same programs to continue to work on compilers that don't implement them - so we now can have the same program that will run on all compilers. Unfortunately, just adopting the OpenMP pragmas (or a subset of them) is not practical - they have been defined assuming they will be implemented within the compiler, and this would be difficult for most (possibly all!) of the current Propeller compilers.
So, the answer is that we do use threads - but we don't use them directly. We use pragmas instead, which allows us to use whatever threads library the compiler supports (posix or otherwise). This allows us to parallelize programs in a manner which is independent of both the compiler and whatever threads library it implements (or how much of it is implemented, over and above the basic ability to create a thread), and doesn't require us to modify existing algorithms just to add parallel processing.
It also becomes possible to add parallel processing to existing compilers without the compilers themselves having to implement anything at all. There are open source thread libraries that require only a single platform-specific function to be implemented.
Ross.
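As a concrete illustration of why the pragma approach keeps programs portable (shown here with the standard OpenMP form, since the Catalina pragma names are not reproduced in this thread): a compiler that doesn't implement the pragma simply ignores it, so the same source still builds and runs sequentially.

```c
#include <stdio.h>

#define N 1024
static float data[N];

int main(void)
{
    /* A compiler with OpenMP support spreads this loop across several
       threads (or cogs); any other C compiler ignores the unknown
       pragma and the loop just runs sequentially - same source code. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] = i * 0.5f;

    printf("data[%d] = %f\n", N - 1, data[N - 1]);
    return 0;
}
```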
I am currently using SPIN2 to create a menu-driven, 54 channel Pulse Width Modulator (using Smartpins). This is for 1/4 wave radio/Tesla experiments for controlling plasma.
I have added duty cycle, syncing & phasing functionality - by using arrays that hold frequency, period, Dutycycle, Totalticks, Onticks, Offticks, Ontime, Offtime, Phase, Phasetime and Phaseticks.
However, my "tick" calculations are rounded - and don't match up with my "period' calculations after divisions are made. This causes sweeping of some frequencies (not all) - relative to adjacent frequencies (which breaks syncing & functional phasing).
Is there a working floating point library for SPIN2 - that could help with my rounding errors?
If not, which C compilers come with an IDE - that might be best for handling Smart-Pin signalling of 54 channels? How difficult is it - to convert SPIN2 code to C?
My working code is attached. It was compiled with Propeller Tool Alpha 2.3.0.0
fastspin can compile spin2 and can use C or BASIC subroutines/objects together with the spin2 code, and the C and BASIC supported by fastspin can do float math.
Not sure if it helps you, but that would be one way to use floating point math with spin2.
Eric also has a spin2cpp tool that converts Spin to C (it is basically the origin of fastspin) if you want to move your code completely to C.
You should also try compiling your source with fastspin: it does not use Chip's bytecode interpreter but compiles to native code. This runs faster than code compiled with PropTool, but needs more memory. Trying that may help with your phasing issue.
Enjoy!
Mike
Floating point isn't a magic bullet, and 32-bit floating point numbers only provide about 23 bits of accuracy. You can do much better by careful use of fixed point numbers, so personally I would avoid floating point if accuracy is important.
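For example, a minimal sketch of doing the tick arithmetic with 64-bit integers and round-to-nearest (the clock and channel values here are made up), so each channel's error stays below one tick instead of drifting:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t clk_hz   = 200000000u;  /* assumed system clock      */
    uint32_t freq_hz  = 437000u;     /* desired channel frequency */
    uint32_t duty_pct = 25u;         /* duty cycle in percent     */

    /* Divide once, in 64 bits, with rounding - rather than chaining
       several truncating 32-bit divisions whose errors accumulate. */
    uint32_t total_ticks = (uint32_t)(((uint64_t)clk_hz + freq_hz / 2) / freq_hz);
    uint32_t on_ticks    = (uint32_t)(((uint64_t)total_ticks * duty_pct + 50) / 100);
    uint32_t off_ticks   = total_ticks - on_ticks;

    printf("total=%u on=%u off=%u\n",
           (unsigned)total_ticks, (unsigned)on_ticks, (unsigned)off_ticks);
    return 0;
}
```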