As there is no cache, it would have to wait for pc+4 to come around, so only one LMM op per 16 clock cycles... 200/16 = 12.5
next: RDLONG ins, pc ' 1/16 clock cycles, 1/8 instruction cycles to next window
add pc,#4 ' not needed if ptra++ is used for PC
... spacers
ins nop
jmp #next ' same time as reps, and any JMP/CALL to primitive would break out of reps
Adding 16 longs of cache would help quite a bit.
Then the maximum LMM performance would be:
REPS #511,#ins_in_loop ' depends on spacers, ptra usage
RDLONG ins, pc
add pc,#4
.. minimum number of spacers needed before ins is executable, I suspect minimum of two nops
ins nop
Absolute best case would be about 4 instruction cycles (8 clock cycles) for simple cog instructions, about 24 clock cycles for hub instructions, and over 40 clock cycles for jumps, due to primitives and having to re-enter the REPS loop.
With one FJMP and six cog instructions (8 longs), it would run at approx. 8*8+40 cycles, or 104 cycles for eight longs, just under 20MIPS.
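Here is a minimal Python sketch of the arithmetic above, assuming a 200 MHz clock and a 16-clock hub rotation; the function names and the cost model are mine, not part of any proposed instruction set:

CLOCK_MHZ = 200
HUB_ROTATION = 16   # one usable hub window per 16 clocks with no cache

def lmm_no_cache_mips():
    # one LMM long fetched and executed per hub rotation
    return CLOCK_MHZ / HUB_ROTATION

def lmm_cached_block_mips(longs=8, clocks_per_long=8, branch_overhead=40):
    # Bill's 8-long example: 8*8 + 40 = 104 clocks for eight longs
    clocks = longs * clocks_per_long + branch_overhead
    return CLOCK_MHZ * longs / clocks

print(lmm_no_cache_mips())        # 12.5
print(lmm_cached_block_mips())    # ~15.4 over 104 clocks; the exact MIPS figure depends on what is charged to the FJMP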
hubexec may not be worth it if we can transfer from HUB to get some decent LMM going. I'm okay with that. In the end, I suspect we will be happier if there is some kind of hubexec to improve speed for C and other compiled code. It's probably worth the pain of developing and trying it out on FPGA.
Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
As there is no cache, it would have to wait for pc+4 to come around, so only one LMM op per 16 clock cycles... 200/16 = 12.5
next: RDLONG ins, pc ' 1/16 clock cycles, 1/8 instruction cycles to next window
add pc,#4 ' not needed if ptra++ is used for PC
... spacers
ins nop
jmp #next ' same time as reps, and any JMP/CALL to primitive would break out of reps
Details:
Because of the Rotate details, I think an incrementing fetch has to wait 17 clocks between valid slots.
( 16 is for the same LSN)
If Chip reverses the spin-order, so LSN decrements with Slot-Scan, then an incrementing fetch can be 15 clocks between valid slots.
I believe rdlong is 2 clocks when hitting the sweet spot of hub cycles and using PTRx for the address instead of an arbitrary register.
This is based on info Chip shared with me while we were going over the pipeline.
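A small model (my own, under the assumptions jmg describes: 16 hub banks served round-robin, one per clock, and a cog that cannot turn around on the very next clock) illustrates the slot spacing:

def next_slot(t_now, bank, ascending=True, min_gap=2):
    # first clock >= t_now + min_gap at which 'bank' is on the hub mux
    t = t_now + min_gap
    while (t % 16 if ascending else (-t) % 16) != bank % 16:
        t += 1
    return t

print(next_slot(0, 1, ascending=True))    # 17 - incrementing fetch, ascending slot scan
print(next_slot(0, 1, ascending=False))   # 15 - incrementing fetch, reversed (decrementing) scan
print(next_slot(0, 0, ascending=True))    # 16 - same LSN again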
Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
However, such a hub address firing order breaks Video Linear DMA, and it also stalls whenever an opcode does not land on the assumed cycle number.
The LMM tools also have to re-scramble code to exactly match the chosen 07E5C3A18F6D4B29 hub address firing order.
The results will be the same, whether it's two or three clocks, since the firing order is based on seven clocks. (7 is relatively prime to 16; 6 will not work.) And let's just assume I renamed ptrx.
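A quick check (my own sketch) that the 07E5C3A18F6D4B29 firing order is just the banks visited by stepping 7 clocks at a time modulo 16 - since 7 is coprime to 16, all 16 banks are hit before repeating, consecutive hub addresses become ready 7 clocks apart, and one LMM long per 7 clocks at an assumed 200 MHz gives the quoted ~28.5 MIPS:

order = [(7 * i) % 16 for i in range(16)]      # bank served on each of the 16 clocks
print(''.join('%X' % b for b in order))        # 07E5C3A18F6D4B29
print(200 / 7)                                 # 28.57... MIPS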
Six clocks: it is two clocks per instruction. You assume ptra++ exists; I thought you were against PTRA/PTRB.
Cannot hit the sweet spot, as it would be far past the correct address to read from the multiplexer.
12.5MIPS
$xxxxx0 first lmm instruction that syncs - say it hits rdlong, let's say ptr was xxxx0
$xxxxx1 ' end of rdlong, second clock
$xxxxx2 start of jmp instruction
$xxxxx3 ' end of jump instruction
$xxxxx4 start of 2 cycle lmm instruction
$xxxxx5 end of 2 cycle lmm instruction
$xxxxx6 second instance of rdlong, wants xxxx1.. not available
$xxxxx7 ' second cycle of rdlong... but it is waiting for $xxxxx1
$xxxxx8
$xxxxx9
$xxxxxA
$xxxxxB
$xxxxxC
$xxxxxD
$xxxxxE
$xxxxxF
$xxxxx0
$xxxxx1 ' rdlong completes
$xxxxx2 start of jmp instruction
$xxxxx3 ' end of jump instruction
$xxxxx4 start of 2 cycle lmm instruction
$xxxxx5 end of 2 cycle lmm instruction
$xxxxx6 third instance of rdlong, wants xxxx1.. not available
$xxxxx7 ' second cycle of rdlong... but it is waiting for $xxxxx1
$xxxxx8
$xxxxx9
$xxxxxA
$xxxxxB
$xxxxxC
$xxxxxD
$xxxxxE
$xxxxxF
$xxxxx0
$xxxxx1 ' rdlong completes
The cycle-by-cycle analysis above shows 16 cycles between rdlongs, due to the nature of the hub.
12.5 MIPS max, before subtracting primitive overhead.
FYI, my best guess is that 2-3 spacer instructions will be required after an RDLONG before the value read can be executed, but that is only a guess until Chip tells us what the Verilog needs.
Phil,
Believe it or not, I greatly respect your work - I am especially impressed with your Backpack and your other NTSC and radio work.
I know you can do analysis like I did above.
Please do the analysis - like I did above - before posting rebuttals. Your rebuttal was factually incorrect, as proven above.
Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
However, such a hub address firing order breaks Video Linear DMA, and it also stalls whenever an opcode does not land on the assumed cycle number.
So? The video data order can be changed in the hub to come out linearly. And no matter what the firing order is, there will be LMM opcodes, such as branches, that stall the process while the hub resynchronizes.
jmg,
What if the fifo wasn't a fifo so much as a 16 long buffer that could be filled out of order, but then read in order? (or filled out of order and read out of order?) Maybe it needs to be longer than 16 in order to satisfy full speed video streaming....
So? The video data order can be changed in the hub to come out linearly. And no matter what the firing order is, there will be LMM opcodes, such as branches, that stall the process while the hub resynchronizes.
the Nibble-adder design I looked into, http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate,
does exactly what your hub address firing order change does, but those opposing it claimed linear addressing was important - and that was just for doing it inside the Adder (where it was optional), not in the Rotate Engine, where it affects ALL COGs, so all tools need to apply the magic scramble.
jmg,
What if the fifo wasn't a fifo so much as a 16 long buffer that could be filled out of order, but then read in order? (or filled out of order and read out of order?) Maybe it needs to be longer than 16 in order to satisfy full speed video streaming....
If everyone is happy with scrambling order, then look at the Nibble-Adder I looked into earlier
Thanks for pointing out that I missed the firing order.
0 7 E 5 C 3 A 1 8 F 6 D 4 B 2 9
Ok, I see what you are doing. You are scrambling the banks by offsetting each bank by 7 clock cycles.
If all cogs scramble the same way - including FIFO fills and every other possible access - such a scramble can work; however, you are inducing jitter by using 7 instead of 8, as cog instructions take two cycles.
If any type of access does not scramble exactly the same way, it is a HUGE headache, far worse than any slot scheme, mooch, or other objected-to feature.
Now, assuming that everything scrambles the same way, we still have a problem: you are only allowing one spacer instruction after the read before trying to execute the instruction just fetched, and that problem is NOT addressed by the scramble.
Assuming we would only execute two-cycle, cog-only instructions:
1 spacer (like your loop), once the hub is in lock step with your scramble, would result in 200/6 = 33 MIPS; so if we have eight longs, consisting of 6 cog instructions, a CALL #FJMP, and the long containing the address, we would need approximately 8*2+40 cycles, that is 56 cycles for 7 instructions - about 8 cycles per instruction on average, or 25 MIPS ... very close to your 28.5 MIPS.
2 spacers is 8*3+40 = 64 cycles, or 9.1 cycles average per instruction, or 22 MIPS.
3 spacers is 8*4+40 = 72 cycles, or 10.3 cycles average per instruction, which is 19.4 MIPS.
My apologies for missing the firing order; however, if not every access from every cog, for every use, has the same firing order, it would be a disaster.
So if the firing order is not visible to the software - i.e. the same hub location has the same address as far as every cog, video circuit, FIFO, etc. is concerned - your firing order is definitely an improvement for LMM. How much of an improvement depends on the number of required spacer instructions.
It would slow down video streaming by a factor of 7, though, unless the FIFO was filled out of order, potentially causing a 7-cycle delay between longs (unless the FIFO was, say, 32 levels deep).
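A short sketch of the spacer arithmetic above, assuming a 200 MHz clock, 8-long blocks (6 cog instructions + CALL #FJMP + the address long, i.e. 7 executed instructions) and roughly 40 clocks of branch/primitive overhead; the helper name is mine:

def scrambled_lmm_mips(spacers, longs=8, executed=7, branch_overhead=40):
    clocks = longs * (1 + spacers) + branch_overhead
    return 200.0 * executed / clocks

for s in (1, 2, 3):
    print(s, round(scrambled_lmm_mips(s), 1))   # 1: 25.0, 2: 21.9, 3: 19.4 MIPS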
Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
For quite a while I've been sitting patiently, trying to ignore all the noise, wishing that Chip could just throw the bit-file design over the fence for us to evaluate (fingers crossed that it will happen soon).
Can we just assume a moment that there is some way to take advantage of that FIFO Chip is so happy about? Has this been explored at all? If so, what was the result?
I see some LMM analysis was written while I typed .... LOL
The basic question is: Can instructions be fetched and executed using a "FIFO sized" overlay area?
I haven't fully explored this and, honestly, don't care to, but it seems from 10,000 feet that no one else has tried to have the conversation. Reading this thread and others like it really makes me nauseous ....
Assumption:
Using Chip's FIFO, 16 or so instructions get read into a block from hub RAM by RDBLOC, and instructions can be executed by the cog until a jump happens. At the time of any jump, the PC gets adjusted, and RDBLOC runs again.
Considerations:
1. Any branch will waste cycles.
2. Maybe 5 instructions on average get run without a jump. Need more data....
3. For relative branches (+/-256), the PC can be adjusted by a tiny 3-ish instruction macro. Possible at all?
4. Any jump macro (like LMM) adjusts the PC based on the address value stored in the long following the jump (a la P8X32A)
5. Other macros (ala LMM) would be used.
6. Other?
:fetch
RDBLOC :dst, pc ' fill the 16-long block at :dst from hub RAM at pc
:dst ' 16-long overlay area, overwritten by each RDBLOC
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
' special PC adjustment here.
' then.
jmp #:fetch ' immediate jump back to the fetch loop
Is it a complete waste of time to think about this? If so, please give at least a brief explanation.
It's based on getting one long every 16 clocks, so at 200MHz you get 12.5 MIPS. It's worse than 12.5 MIPS when you factor in branching and other LMM stuff.
Although, I think LMM with the FIFO would give quite a bit more performance. Bill's earlier post shows that.
Actually it's 1 in 17 clocks, because the hub address also needs to advance, so 200MHz -> 11.76 MIPS without any caching/FIFO. Branching, of course, is going to cause another great hit. And when you add in the hub stack for GCC, that is really going to bite.
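The arithmetic behind those two ceilings, assuming a 200 MHz clock:

print(200 / 16)   # 12.5 MIPS if a new long were ready every 16 clocks
print(200 / 17)   # ~11.76 MIPS when the advancing hub address adds one clock per fetch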
I've been sitting patiently trying to ignore all the noise wishing that Chip could just throw the bit-file design over the fence for us to evaluate for quite a while (fingers crossed that it will happen soon).
Can we just assume a moment that there is some way to take advantage of that FIFO Chip is so happy about? Has this been explored at all? If so, what was the result?
I see some LMM analysis was written while I typed .... LOL
The basic question is: Can instructions be fetched and executed using a "FIFO sized" overlay area?
I haven't fully explored this and don't care to do it honestly, but it seems from 10000 feet that no one else has tried to have the conversation. Reading this thread and others like it really make me nauseous ....
Assumption:
Using Chip's FIFO, 16 or so instructions get read to a block from hub ram by RDBLOC, and instructions can be executed by the cog until a jump happens. At time of any jump, the PC gets adjusted, and RDBLOC runs again.
Considerations:
1. Any branch will waste cycles.
2. Maybe 5 instructions on average get run without a jump. Need more data....
3. Relative branches (+/- 256) PC can be adjusted by a tiny 3-ish instruction macro. Possible at all?
4. Any jump macro (like LMM) adjusts the PC based on the value stored in the next address register of the jump (ala P8x32a)
5. Other macros (ala LMM) would be used.
6. Other?
:fetch
RDBLOC :dst, pc ' pc
:dst
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
' special PC adjustment here.
' then.
jmp :fetch
Is it a complete waste of time to think about this? If so, please give at least a brief explanation.
jazzed,
I think it is valid, except maybe needing a spacer between the RDBLOC and the execution (move it to the end of the loop, just before the JMP).
RDBLOC takes 18 clocks to complete and then 2 clocks for your loop jmp, then you run N instructions until a jump/branch. So you take 20 clocks + 2*N clocks + jump/branch overhead (Bill has been using 40 clocks).
if N = 5, then it's 20+10+40 = 70 clocks for 6 instructions (5+jump/branch) = about 17.14 MIPS
if N = 8, then it's 20+16+40 = 76 clocks for 9 instructions (8+jump/branch) = about 23.68 MIPS
in the rare case that you get to run all 16 instructions and have no jump/branch, then you get 20+32 = 52 clocks to run 16 instructions = about 61.54 MIPS
So in typical cases, I believe it compares favorably, but in worst case it's much slower (due to 18 clock RDBLOC overhead). In the best case it's much faster, but I dunno how common the best case is...
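A sketch of that RDBLOC arithmetic (assuming a 200 MHz clock, an 18-clock RDBLOC, a 2-clock loop jmp, 2 clocks per executed instruction, and ~40 clocks of jump/branch overhead per block, as stated above; the helper names are mine):

def rdbloc_mips(n, branch_overhead=40):
    clocks = 18 + 2 + 2 * n + branch_overhead
    return 200.0 * (n + 1) / clocks        # n straight-line instructions plus the branch

def rdbloc_best_case_mips():
    clocks = 18 + 2 + 2 * 16               # all 16 longs execute, no branch taken
    return 200.0 * 16 / clocks

print(round(rdbloc_mips(5), 2))            # 17.14
print(round(rdbloc_mips(8), 2))            # 23.68
print(round(rdbloc_best_case_mips(), 2))   # 61.54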
RDBLOC would have other problems (i.e., a different code generator for 16-long blocks of code, jumping into them, fillers, etc.). My guess is that Parallax does not want to pay for such a back end.
Ariba's trick for Quad LMM may help some.
Why don't you guys work up a proper analysis for RDBLOC? I am going to bed now, or wifey will kill me.
Bill,
I am pretty sure worst case will be one spacer needed for hub read to execution of that read long in cog memory. Look at Chip's discussions of the clock cycles to pipeline stages.
Bill,
I am pretty sure worst case will be one spacer needed for hub read to execution of that read long in cog memory. Look at Chip's discussions of the clock cycles to pipeline stages.
RDBLOC would have other problems (i.e., a different code generator for 16-long blocks of code, jumping into them, fillers, etc.). My guess is that Parallax does not want to pay for such a back end.
Ariba's trick for Quad LMM may help some.
Why don't you guys work up a proper analysis for RDBLOC? I am going to bed now, or wifey will kill me.
Bill
Also, every "call" to an LMM macro is a branch. How would the LDI macro work that is supposed to load the 32 bits following the JMP to #_LMM_LDI? There is no PC that is keeping pace with instruction execution. Would LDI waste the remainder of the 16 long block and get its immediate value from the following block?
...I loved having sdram on the P2, tons of memory for 1080p 24bpp, large capture buffers, tons of xmm code space... sniff.
We can have SDRAM on the new chip, too. It would just clock at 100MHz (Fsys/2). It takes, as I recall, 22 pins for control and address and then 16 pins for data. That's 38, total, leaving 26 for other stuff. That's workable. With the hub FIFO's, you could stream words at Fsys/2 by setting the NCO to 1/2. Use a smart pin to output Fsys/2. There's everything you need. One cog should handle it just fine.
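The pin budget Chip describes, as a quick check (64 total I/O, per the proposed chip):

sdram_pins = 22 + 16            # control/address plus data
print(sdram_pins)               # 38
print(64 - sdram_pins)          # 26 pins left for other stuff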
Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
loop long 0-0
rdlong loop,pc++
jmp #loop
That's faster than a P1 running from the cog.
-Phil
That's a pretty neat idea for getting fast LMM. We'll probably need to stick with a simple ascending hub slot order to make the FIFO work smoothly.
We will get hub exec, but not on the first FPGA release. Hub exec makes big code mindless to write. And it's fast. I've been struggling trying to get a bite on various aspects of this design. It's turning out to be a new ground-up effort.
Tonight I hope to get the simplex FIFO done. It will be a few more days before I can actually test it, though, because I have to create a working cog, still. Pretty much, all the pieces are ready.
Using the 256-long LUT for cog code, ranging from addresses $200..$2FF would probably only need a few mux's to make work. No promises, but we'll investigate that down the line. It would be like internal-speed hub exec.
Comments
As there is no cache, it would have to wait for pc+4 to come around, so only one LMM op per 16 clock cycles... 200/16 = 12.5
Adding 16 longs of cache would help quite a bit.
Then the maximum LMM performance would be:
Absolute best case would be about 4 instruction cycles (8 clock cycles) for simple cog instructions, about 24 clock cycles for hub instructions, and over 40 clock cycles for jumps, due to primitives and having to re-enter the REPS loop.
With one FJMP and six cog instructions (8 longs), it would run at approx. 8*8+40 cycles, or 104 cycles for eight longs, just under 20MIPS.
Regards,
Bill
hubexec may not be worth it if we can transfer from HUB to get some decent LMM going. I'm okay with that. In the end, I suspect we will be happier if there is some kind of hubexec to improve speed for C and other compiled code. It's probably worth the pain of developing and trying it out on FPGA.
That's faster than a P1 running from the cog.
-Phil
Because of the Rotate details, I think an incrementing fetch has to wait 17 clocks between valid slots.
( 16 is for the same LSN)
If Chip reverses the spin-order, so LSN decrements with Slot-Scan, then an incrementing fetch can be 15 clocks between valid slots.
This is based on info Chip shared with me while we were going over the pipeline.
The LMM tools also have to re-scramble code to exactly match the chosen 07E5C3A18F6D4B29 hub address firing order.
The results will be the same, whether it's two or three clocks, since the firing order is based on seven clocks. (7 is relatively prime to 16; 6 will not work.) And let's just assume I renamed ptrx.
-Phil
Cannot hit the sweet spot, as it would be far past the correct address to read from the multiplexer.
12.5MIPS
$xxxxx0 first lmm instruction that syncs - say it hits rdlong, let's say ptr was xxxx0
$xxxxx1 ' end of rdlong, second clock
$xxxxx2 start of jmp instruction
$xxxxx3 ' end of jump instruction
$xxxxx4 start of 2 cycle lmm instruction
$xxxxx5 end of 2 cycle lmm instruction
$xxxxx6 second instance of rdlong, wants xxxx1.. not available
$xxxxx7 ' second cycle of rdlong... but it is waiting for $xxxxx1
$xxxxx8
$xxxxx9
$xxxxxA
$xxxxxB
$xxxxxC
$xxxxxD
$xxxxxE
$xxxxxF
$xxxxx0
$xxxxx1 ' rdlong completes
$xxxxx2 start of jmp instruction
$xxxxx3 ' end of jump instruction
$xxxxx4 start of 2 cycle lmm instruction
$xxxxx5 end of 2 cycle lmm instruction
$xxxxx6 third instance of rdlong, wants xxxx1.. not available
$xxxxx7 ' second cycle of rdlong... but it is waiting for $xxxxx1
$xxxxx8
$xxxxx9
$xxxxxA
$xxxxxB
$xxxxxC
$xxxxxD
$xxxxxE
$xxxxxF
$xxxxx0
$xxxxx1 ' rdlong completes
The cycle-by-cycle analysis above shows 16 cycles between rdlongs, due to the nature of the hub.
12.5 MIPS max, before subtracting primitive overhead.
FYI, my best guess is that 2-3 spacer instructions will be required after an RDLONG before the value read can be executed, but that is only a guess until Chip tells us what the Verilog needs.
Phil,
Believe it or not, I greatly respect your work - I am especially impressed with your Backpack and your other NTSC and radio work.
I know you can do analysis like I did above.
Please do the analysis - like I did above - before posting rebuttals. Your rebuttal was factually incorrect, as proven above.
Best Wishes,
Bill
-Phil
What if the fifo wasn't a fifo so much as a 16 long buffer that could be filled out of order, but then read in order? (or filled out of order and read out of order?) Maybe it needs to be longer than 16 in order to satisfy full speed video streaming....
the Nibble-adder design I looked into
http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate
does exactly what your hub address firing order change does, but those opposing it claimed linear addressing was important - and that was just for doing it inside the Adder (where it was optional), not in the Rotate Engine, where it affects ALL COGs, so all tools need to apply the magic scramble.
If everyone is happy with scrambling order, then look at the Nibble-Adder I looked into earlier
http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate
This applies out-of-linear-order pointers, and actually replaces the FIFO, by having the change in address match the desired feed rates.
For Odd-N it allows matching streaming rates with SW-loops, for best bandwidth use.
Even-N is a little more complex, but it does have a solution.
The FIFO design allows HW linear streaming, so it largely replaced the need for Nibble-Adder.
I quoted your message; I did not see the firing order.
I was replying to your rebuttal as written, and as written - with me not noticing the firing-order string - my analysis stands.
Sorry about not noticing it; I will analyze its impact now.
Bill
0 7 E 5 C 3 A 1 8 F 6 D 4 B 2 9
Ok, I see what you are doing. You are scrambling the banks by offsetting each bank by 7 clock cycles.
If all cogs scramble the same way - including FIFO fills and every other possible access - such a scramble can work; however, you are inducing jitter by using 7 instead of 8, as cog instructions take two cycles.
If any type of access does not scramble exactly the same way, it is a HUGE headache, far worse than any slot scheme, mooch, or other objected-to feature.
Now, assuming that everything scrambles the same way, we still have a problem: you are only allowing one spacer instruction after the read before trying to execute the instruction just fetched, and that problem is NOT addressed by the scramble.
Assuming we would only execute two-cycle, cog-only instructions:
1 spacer (like your loop), once the hub is in lock step with your scramble, would result in 200/6 = 33 MIPS; so if we have eight longs, consisting of 6 cog instructions, a CALL #FJMP, and the long containing the address, we would need approximately 8*2+40 cycles, that is 56 cycles for 7 instructions - about 8 cycles per instruction on average, or 25 MIPS ... very close to your 28.5 MIPS.
2 spacers is 8*3+40 = 64 cycles, or 9.1 cycles average per instruction, or 22 MIPS.
3 spacers is 8*4+40 = 72 cycles, or 10.3 cycles average per instruction, which is 19.4 MIPS.
My apologies for missing the firing order; however, if not every access from every cog, for every use, has the same firing order, it would be a disaster.
So if the firing order is not visible to the software - i.e. the same hub location has the same address as far as every cog, video circuit, FIFO, etc. is concerned - your firing order is definitely an improvement for LMM. How much of an improvement depends on the number of required spacer instructions.
It would slow down video streaming by a factor of 7, though, unless the FIFO was filled out of order, potentially causing a 7-cycle delay between longs (unless the FIFO was, say, 32 levels deep).
Regards,
Bill
Can we just assume a moment that there is some way to take advantage of that FIFO Chip is so happy about? Has this been explored at all? If so, what was the result?
I see some LMM analysis was written while I typed .... LOL
The basic question is: Can instructions be fetched and executed using a "FIFO sized" overlay area?
I haven't fully explored this and don't care to do it honestly, but it seems from 10000 feet that no one else has tried to have the conversation. Reading this thread and others like it really make me nauseous ....
Assumption:
Using Chip's FIFO, 16 or so instructions get read to a block from hub ram by RDBLOC, and instructions can be executed by the cog until a jump happens. At time of any jump, the PC gets adjusted, and RDBLOC runs again.
Considerations:
1. Any branch will waste cycles.
2. Maybe 5 instructions on average get run without a jump. Need more data....
3. Relative branches (+/- 256) PC can be adjusted by a tiny 3-ish instruction macro. Possible at all?
4. Any jump macro (like LMM) adjusts the PC based on the value stored in the next address register of the jump (ala P8x32a)
5. Other macros (ala LMM) would be used.
6. Other?
Is it a complete waste of time to think about this? If so, please give at least a brief explanation.
Please see
http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1269950&viewfull=1#post1269950
I analyzed that, showing two cases each for cog code, hubexec+fifo, LMM, and LMM+fifo, with 1/2/3 spacers needed.
Tables of results given at the link above.
Simplified mathematical model used to obtain the results above can be found in
http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1269942&viewfull=1#post1269942
Summary:
Assuming two delay slots are needed between RDLONG and execution, fifo'd LMM would be 2x faster than simple LMM.
Fifo'd hubexec would be 2.5x+ faster than fifo'd LMM.
Hope that helps!
Bill
Relative DJNZ & Friends is +/-256. I forgot the operands were 9 bits, not 8.
Yes, I would just like any FPGA image. Don't care what it is missing. We need to start re-testing.
I think it is valid, except maybe needing a spacer between the RDBLOC and the execution (move it to the end of the loop, just before the JMP).
RDBLOC takes 18 clocks to complete and then 2 clocks for your loop jmp, then you run N instructions until a jump/branch. So you take 20 clocks + 2*N clocks + jump/branch overhead (Bill has been using 40 clocks).
if N = 5, then it's 20+10+40 = 70 clocks for 6 instructions (5+jump/branch) = about 17.14 MIPS
if N = 8, then it's 20+16+40 = 76 clocks for 9 instructions (8+jump/branch) = about 23.68 MIPS
in the rare case that you get to run all 16 instructions and have no jump/branch, then you get 20+32 = 52 clocks to run 16 instructions = about 61.54 MIPS
So in typical cases, I believe it compares favorably, but in worst case it's much slower (due to 18 clock RDBLOC overhead). In the best case it's much faster, but I dunno how common the best case is...
Where is your pseudocode, Bill?
FIFO LMM is not the same as RDBLOC LMM; the FIFO doesn't have the 18-clock overhead of reading the 16 longs into cog memory.
Basically, for the best case, my analysis showed two cases:
5 longs (5 hubexec instructions including hubexec jump, 4 cog + LMM jump)
8 longs (8 hubexec instructions including hubexec jump, 6 cog + LMM jump)
hub with mux as described by Chip/Roy
The results in the posting I previously linked to were based on primitives (FJMP) completing in time for the next hub cycle, which is optimistic.
Thanks.
RDBLOC would have other problems (i.e., a different code generator for 16-long blocks of code, jumping into them, fillers, etc.). My guess is that Parallax does not want to pay for such a back end.
Ariba's trick for Quad LMM may help some.
Why don't you guys work up a proper analysis for RDBLOC? I am going to bed now, or wifey will kill me.
Bill
I am pretty sure worst case will be one spacer needed for hub read to execution of that read long in cog memory. Look at Chip's discussions of the clock cycles to pipeline stages.
Chip's pipeline stages post: http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261426&viewfull=1#post1261426
I gave results for one spacer.
We can have SDRAM on the new chip, too. It would just clock at 100MHz (Fsys/2). It takes, as I recall, 22 pins for control and address and then 16 pins for data. That's 38, total, leaving 26 for other stuff. That's workable. With the hub FIFO's, you could stream words at Fsys/2 by setting the NCO to 1/2. Use a smart pin to output Fsys/2. There's everything you need. One cog should handle it just fine.
That's a pretty neat idea for getting fast LMM. We'll probably need to stick with a simple ascending hub slot order to make the FIFO work smoothly.
We will get hub exec, but not on the first FPGA release. Hub exec makes big code mindless to write. And it's fast. I've been struggling trying to get a bite on various aspects of this design. It's turning out to be a new ground-up effort.
Tonight I hope to get the simplex FIFO done. It will be a few more days before I can actually test it, though, because I have to create a working cog, still. Pretty much, all the pieces are ready.
Using the 256-long LUT for cog code, ranging from addresses $200..$2FF would probably only need a few mux's to make work. No promises, but we'll investigate that down the line. It would be like internal-speed hub exec.