I think you just explored the magic of the GCC optimizer with the 48 cycle results. This is about the time to load a constant in LMM mode, so I guess you have the test values hardcoded as constants, and the compiler realizes that and optimizes the whole procedure away to a single constant load.
Andy
I agree, I'm also seeing that using PASM inline has very little gain for tiny razor-sharp code. All the conversions and frame references, or whatever, are getting in the way. I made Phil's code, and I think that's the reason for these timing results:
unsigned int a = num_in;
start = CNT;
__asm__ volatile (
//move count start to here
"rev %[x],#0 \n\t" //Reverse the bits in x.
"neg %[y],%[x] \n\t" //Isolate least-significant "one" bit.
"and %[x],%[y] wz \n\t"
"mov %[y],#31 \n\t"
"if_z sub %[y],#1 \n\t" //Locate that "one" bit, if any.
"test %[x],%[_0000ffff] wz \n\t"
"if_z sub %[y],#16 \n\t"
"test %[x],%[_00ff00ff] wz \n\t"
"if_z sub %[y],#8 \n\t"
"test %[x],%[_0f0f0f0f] wz \n\t"
"if_z sub %[y],#4 \n\t"
"test %[x],%[_33333333] wz \n\t"
"if_z sub %[y],#2 \n\t"
"test %[x],%[_55555555] wz \n\t"
"if_z sub %[y],#1 \n\t"
//move count end to here
: /*outputs (+inputs) */
[y] "+r" (pow2),
[x] "+r" (a) /* the asm overwrites x (rev/and), so it must be an in/out operand, not a plain input */
: /*inputs */
[_0000ffff] "r" (_0000ffff),
[_00ff00ff] "r" (_00ff00ff),
[_0f0f0f0f] "r" (_0f0f0f0f),
[_33333333] "r" (_33333333),
[_55555555] "r" (_55555555)
);
end = CNT;
print("inline PASM) Arg: %d Characteristic: %d time: %d\n", num_in, pow2, (end-start)); // display the results
I misread your last comment and have to disagree.
There are no hard coded values except those used in the inline PASM. It can't be caching the values; your code and my tweak just run that fast.
Here is the entire file:
Okay, I have tried your code in SimpleIDE and got the same results as you. Have you noticed that you can show the generated Assembly code with a right click on the C file?
If you do this, you can see that it's really the optimizer that allows this result, but not in the way I thought. What happens is that the read of the start CNT value is delayed until after a big part of the AND and SHIFT calculations, just 2 instructions before the end CNT value is read. For the optimizer, the CNT reads and the bit calculations are two different things and can be moved to the best positions to get the smallest or fastest code.
Here is the generated Assembly code:
Andy
No, I don't know how to do that, but I do now.
I'm going to build a proper library for these different approaches and try to time them as method calls instead.
You, heater, and Phil really gave me some new tools in this thread, thank you
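On the reordering Andy describes: the "//move count start to here" and "//move count end to here" comments in the snippet above hint at one hedged fix, which is to take both CNT timestamps inside the asm block itself, so the optimizer cannot slide the measured work outside the two reads. A minimal sketch, assuming the assembler resolves CNT as the system counter register (which is how propeller.h binds it); variable names are illustrative:

unsigned int start, end;

__asm__ volatile (
    "mov %[start], CNT \n\t"    /* first timestamp, taken inside the asm     */
    /* ... the instructions under test go here ... */
    "mov %[end], CNT \n\t"      /* second timestamp, still inside the asm    */
    : [start] "=r" (start), [end] "=r" (end)
    :
    : "memory"                  /* keep surrounding memory accesses in place */
);
print("time: %d\n", end - start);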
If you want to burn some memory, you can do it in about 8 instructions
Make a 256 value lookup table starting at memory location 0 in Cog-Ram that gives you the value of the msb. Call our value under test x
shift x right 24 bits
if prev result was not zero then add 24 to result
if previous result was zero, shift x right by 16 bits
if previous result was not zero then add 16 to result
if previous result was zero shift x right by 8 bits
if previous result was not zero then add 8 to result
move the source field of the previous result to the source field of the following instruction
add the value in the location specified by the source field to the result.
and we are done in 8 instructions, but we had to use 256 words of ram.
Why 256 values when my desired results are 0-31 & -1? Is it a sparse table?
Thanks for the cool contribution
This should be a bit clearer. Let x be the long word we are testing. y is a variable, a 32-bit long word we use in the algorithm. R is the result; it's probably stored in a 32-bit long word, but it will only range between 0 and 32. R will be the position of the most significant 1 when we are finished (unless I have made some sort of mistake). z is the zero flag. It's a processor flag that indicates when a result is zero. It is only updated if the wz effect is specified in the instruction, so we can protect its value.
shift x right 24 bits and store it as y /this moves the most significant 8 bits into the least significant 8 bits. Bits 8-31 are set to zero.
if y was not zero then add 24 to R (do not affect z) /If any bit was 1 in positions 24-31 then the previous result was NOT 0 and 24 is added to the result.
if y was zero, shift x right by 16 bits /if the first shift returned zero, we move bits 16-23 to locations 0-7.
if y was not zero then add 16 to result (do not affect z) /We have a new y! If any bit was 1 in positions 16-23, then 16 is added to the result.
if y was zero shift x right by 8 bits /if the first two shifts were zero then we move bits 8-15 into locations 0-7.
if previous result was not zero then add 8 to result (do not affect z) /if any bit was 1 in positions 8-15, then 8 is added to the result.
move the source field of y to the source field of the following instruction. /All possibilities have been covered, so no matter what, the block of 8 bits that contained the most significant 1 is now in the lowest 8 bits of y. The result R so far is either 24, 16, 8 or 0, depending on which block of 8 bits the most significant 1 was found in. So now we are ready to use the lookup table. We move the lowest 9 bits of y, which is where the source field sits in every instruction, to the source field of the NEXT instruction, which is an ADD.
add the value in the location specified by the source field to the result. /The source field of the add instruction is now y, which is simply x shifted right by either 24, 16, 8 or 0 bits (whichever shift found the most significant 1). The 9th bit (bit 8) is always 0 because that's the way the shift instruction we chose works. (That's why we locate the lookup table starting at 0.) We just take the table entry referenced by those lower 9 bits and add it to the result. There are actually 33 possible values that we can find in the lookup table: 1-32, and 0. (Or if you want, you can return bits 0-31 and some arbitrary negative number for "no bits set".) An entry in the table simply contains the position of the leading one of its address.
Here is an example of some of the table entries.
Address Value
00000000 00000000
00000001 00000001
00000010 00000010
00000011 00000010
00000100 00000011
00000101 00000011
00000110 00000011
00000111 00000011
00001000 00000100
00001001 00000100
.
.
.
01111110 00000111
10000000 00001000
10000001 00001000
.
.
.
11111110 00001000
11111111 00001000
We just add that value from the table to the result. So for instance, if X was F4:65:CE:01, then the first bit set is clearly #32. That means that the very first shift would NOT have returned zero, so we add 24 and SKIP the second shift. And now we have a problem. I see that there is a flaw in my algorithm. While we will skip the remaining shifts, we will add 16 and 8 as well, which will get the wrong answer. The problem is that we only skip the add instructions BEFORE we find the most significant 1; all the adds after it run. So we should add 8 each time. Then 3x8 is 24, 2x8 is 16, and 1x8 is 8, and it all works.
So now we have:
shift x right 24 bits and store it as y
if y was not zero then add 8 to R (do not affect z)
if y was zero, shift x right by 16 bits
if y was not zero then add 8 to result (do not affect z)
if y was zero shift x right by 8 bits
if previous result was not zero then add 8 to result (do not affect z)
move the source field of y to the source field of the following instruction.
add the value in the location specified by the source field to the result.
So let's look at this again for F4:65:CE:01
The first bit was clearly 32. That means the first shift does NOT return zero.
So we add 8 but do not update the Z flag.
since the Z flag is NOT set we skip the instruction shifting right by 16 bits
since the Z flag is NOT set we add 8 (again) but do not update the z flag
since the z flag is not set we SKIP the shift right by 8 instruction
since the z flag is not set we add 8 (again) but do not update the z flag.
now we move the source field to the following add instruction
our add instruction now adds the value of memory location 0F4 to the result. 0F4 will contain 8 so we add that to R and get 32.
But what if it was 00:65:CE:01
The first bit is now 23
So the first shift returns zero, so we DON'T add 8.
The next shift does NOT return zero
so we DO add 8.
we skip the third shift
but the zero flag is still NOT set so we add another 8
we then move the source field to the add instruction
then we add the value at memory location 001100101 to the result. The value at that location will be 7; added to R, which is 16, we get 23.
This is, btw, very wasteful of memory. You really use 4 bytes for each value in your lookup table: because cog memory is not byte addressable, you end up having to store an entire 32-bit word when you only need 8 bits.
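A hedged C model of the corrected table method above may help when working it through on paper. In real PASM the last two steps are self-modifying code (movs copies the low 9 bits of y into the source field of the following add); here that becomes a plain array index. One extra fall-through is added for values below 256, which the step list leaves implicit, and all names are illustrative:

static unsigned char msb_table[256];      /* msb_table[v] = 1-based position of the
                                             highest 1 in v (1..8), 0 for v == 0     */
static void build_msb_table(void)
{
    int v, p;
    msb_table[0] = 0;
    for (v = 1; v < 256; v++) {
        for (p = 7; (v >> p) == 0; p--)   /* find the highest set bit of v           */
            ;
        msb_table[v] = (unsigned char)(p + 1);
    }
}

static int msb_1based(unsigned int x)
{
    unsigned int y = x >> 24;             /* "shift x right 24 bits, store as y" wz  */
    int R = 0, z = (y == 0);
    if (!z) R += 8;                       /* if_nz add R,#8  (does not affect z)     */
    if (z) { y = x >> 16; z = (y == 0); } /* if_z shift, wz                          */
    if (!z) R += 8;
    if (z) { y = x >> 8;  z = (y == 0); }
    if (!z) R += 8;
    if (z) y = x;                         /* x < 256: look the low byte up directly  */
    return R + msb_table[y];              /* movs + add, via the table               */
}

For F4:65:CE:01 this gives 8 + 8 + 8 + msb_table[0xF4] = 32, and for 00:65:CE:01 it gives 8 + 8 + msb_table[0x65] = 23, matching the walk-through above.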
Thank you for the detailed explanation. I'm going to have to write it out on paper to fully understand it, just follow some cases through until it becomes completely clear.
Create a 2048 byte long lookup table in hub memory starting at memory location 0. That's an 11-bit address.
shift x right 22 bits and store it as y (you could even shift using the carry bit, giving you a handy 33 bits)
if y was not zero then add 11 to R (do not affect z)
if y was zero, shift x right by 11 bits
if y was not zero then add 11 to result (do not affect z)
read a byte from hub memory at the address in y
add the value you just read to the result.
This takes 6 instructions. The hub instruction will take 8-23 clocks, but with careful coding you can make sure that after the first access you can access the hub in only 8. Since we are worried about how long this algorithm takes, presumably we are going to be doing it repeatedly, so I assume that the execution time will normally be the 8 cycles rather than the larger number. Therefore this is roughly equivalent to doing it in 7 instructions (because of the extra 4 cycles); however, you get an extra bit of resolution by using the carry bit, and you don't waste all that cog RAM. (You don't waste any, actually, because hub RAM is byte accessible.)
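For comparison, a hedged C model of this hub-table variant, built the same way as the 256-entry table but indexed by 11 bits. As before, one extra fall-through is added for values below 2^11, which the step list leaves implicit; names are illustrative:

static unsigned char hub_table[2048];     /* hub_table[v] = 1..11, 0 for v == 0   */

static void build_hub_table(void)
{
    int v, p;
    hub_table[0] = 0;
    for (v = 1; v < 2048; v++) {
        for (p = 10; (v >> p) == 0; p--)
            ;
        hub_table[v] = (unsigned char)(p + 1);
    }
}

static int msb_1based_hub(unsigned int x)
{
    unsigned int y = x >> 22;             /* 10-bit index when x >= 2^22          */
    int R = 0, z = (y == 0);
    if (!z) R += 11;
    if (z) { y = x >> 11; z = (y == 0); } /* 11-bit index when 2^11 <= x < 2^22   */
    if (!z) R += 11;
    if (z) y = x;                         /* x < 2^11: index with x directly      */
    return R + hub_table[y];              /* rdbyte from hub, then add            */
}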
Are the port B registers actually there, just not connected to the outside world? I've never tried it. If they are, you could try two cogs. (I guess if you didn't need I/O during the routine you could use port A.)
Cog 1:
inst 0: shift x right 22 bits and store it in the port B output register (you could even shift using the carry bit, giving you a handy 33 bits)
inst 1: if y was not zero then add 11 to the port B direction register (do not affect z)
inst 3: if y was zero, shift x right by 11 bits and store it in the port B output register.
inst 4: if y was not zero then add 11 to the port B direction register (do not affect z)
Cog 2:
inst 5: read a byte from hub memory at the address in the port B direction register
inst 6: complete the hub memory operation
inst 7: add the value you just read to the port B direction register.
inst 8: free to do anything you want, or just put a NOP here. You could, for instance, copy the value from the direction register to cog RAM.
Using the two cogs, once you have the timing aligned you could do it in 4 cycles each if you're doing many in repetition. I can imagine having all of cog RAM filled with unrolled loops so you'd get everything synchronized, then turn it loose, and it would calculate 100 or so values at a time in around 20us. A third cog could grab the values and do whatever you wanted with them, for example look up logarithms and copy the results to hub RAM.
That is my understanding: the port B I/O locations are just registers with no pins, because pins 32-63 don't exist yet. I've barely scratched the surface in PASM so I haven't used them, but I did read about them in the manual.
Thanks for all this. Regardless of which option I employ I will be sure to understand, build, and share the different methods that you've taken the time to provide here.
The registers are not there. You cannot use the phantom port B for inter-COG communication.
Does Spin do something else with the data in outb because the register is available in Spin?
OBJ
  system : "Propeller Board of Education"
  pst    : "Parallax Serial Terminal Plus"

VAR
  long i

PUB go
  system.Clock(80_000_000)
  repeat i from 0 to 15
    outb[i] := i//2
    outa[i] := 0
  repeat i from 16 to 31
    outa[i] := 0
    if i//3 == 1
      outb[i] := 1
  repeat i from 0 to 31
    pst.Dec(outa[i])
  pst.NewLine
  repeat i from 0 to 31
    pst.Dec(outb[i])

produces this output:
admittedly this says nothing about intercog comms
Well, alright, you are writing to and reading from port B from the same COG.
If I recall correctly there is RAM at the port B location so why not?
As an I/O register, even connected internally, not so much.
I figured you'd know. I tested it on 2 cogs and you're correct that it doesn't work.

OBJ
  system : "Propeller Board of Education"
  pst    : "Parallax Serial Terminal Plus"

VAR
  long i, stack[32]

PUB go
  system.Clock(80_000_000)
  repeat i from 0 to 15
    outb[i] := i//2
    outa[i] := 0
  repeat i from 16 to 31
    outa[i] := 0
    if i//3 == 1
      outb[i] := 1
  pst.Dec(COGNEW(copyab, @stack))
  printoutab
  copyab
  printoutab

PUB copyab
  repeat i from 0 to 15
    outa[i] := outb[i]

PUB printoutab
  pst.NewLine
  repeat i from 0 to 31
    pst.Dec(outa[i])
  pst.NewLine
  repeat i from 0 to 31
    pst.Dec(outb[i])

produces the output:
So, is outb ram just there as a placeholder?
Thanks
Appreciate it.
Glad you helped me rule that out before I invested any real time in it. I will have to employ the cog RAM instead, as I cannot count on my pins being blank.
Do you know how outa works by comparison?
Is there a register in each cog or is there a single outa register that all cogs can access?
I do plan on having multiple cogs running math routines, but without outb, with my project's restrictions on outa, and given that this will be used regularly but sporadically, this method will not work for my regular library. I bet it will kick some serious butt in my vector library, though.
Thanks a bunch.
All the details of the registers and logic involved in the Propeller I/O are clearly described in the Propeller Manual.
You are not going to use port A for inter-cog communications either. That is a precious I/O resource you will need.
COGS can share data through HUB. Which is slow. Which is why the Prop does not make a great parallel processing number cruncher. Which of course it was not designed to be.
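A hedged sketch of the hub sharing heater describes, for completeness: a volatile long in hub RAM acts as a mailbox between two cogs. It assumes PropGCC's cogstart() from propeller.h; the names, the posted value, and the stack size are illustrative.

#include <propeller.h>
#include "simpletools.h"

static volatile unsigned int mailbox;     /* lives in hub RAM, visible to every cog */
static unsigned int worker_stack[64];     /* stack size is a guess for this tiny task */

static void worker(void *arg)
{
    (void)arg;
    mailbox = 12345;                      /* a hub write: the slow, shared path     */
}

int main(void)
{
    cogstart(worker, NULL, worker_stack, sizeof(worker_stack));
    while (mailbox == 0)                  /* poll hub RAM until the other cog posts */
        ;
    print("got %d from the other cog\n", mailbox);
    return 0;
}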
Since port B is not connected outside the cogs, that unfortunately is out. Port A depends on whether you have a break where you don't need I/O. Disabling I/O would be the hard part. And it's only useful in particular cases, like some old devices where all sorts of stuff would happen during vblank but you didn't care.
That's the idea with the parallel SRAM bank. I want to add a few pins to identify which cog is responding and a lock bit, and I can use it for RAM, persistent memory, and inter-cog communication. I figured out a memory shortcut for a cogmem array-like thing today. I still have some work to do on it, but it increased my sequential 8-bit read from 1.9 MB/s to nearly 5 MB/s with the 8-bit RAM; it should double with the 16-bit RAM. With all that bandwidth and flag bits I can copy the data to as many cogs as will listen and to the RAM at the same time. I'll just read OUTA instead of INA for input from other cogs. When data does not need to be written to RAM, such as hypervisor instructions, it can still be read on OUTA and triggered with and by the same flag bits.
If I add a bus disconnect then I can have the memory controller page the SRAM to the SPI flash independently of the Propeller while it uses the same pins. This way the concepts are integrated by design and use the same helper methods.
I'm trying it with 3 different controllers.
Still rereading the manual.