
New Hub Scheme For Next Chip


Comments

  • Bill HenningBill Henning Posts: 6,445
    edited 2014-05-21 20:58
    Happy to.

    As there is no cache, it would have to wait for pc+4 to come around, so only one LMM op per 16 clock cycles... 200/16 = 12.5
    next:  RDLONG ins, pc  ' 1/16 clock cycles, 1/8 instruction cycles to next window
             add pc,#4 ' not needed if ptra++ is used for PC
             ... spacers
    ins     nop
             jmp   #next    ' same time as reps, and any JMP/CALL to primitive would break out of reps
    


    Adding 16 longs of cache would help quite a bit.

    Then the maximum LMM performance would be:
            REPS #511,#ins_in_loop ' depends on spacers, ptra usage
            RDLONG ins, pc
            add pc,#4
    .. minimum number of spacers needed before ins is executable, I suspect minimum of two nops
    ins   nop
    

    Absolute best case would be about 4 instruction cycles (8 clock cycles) for simple cog instructions, about 24 clock cycles for hub instructions, and over 40 clock cycles for jumps, due to primitives and having to re-enter the REPS loop.

    With one FJMP and six cog instructions (8 longs), it would run at approx. 8*8+40 cycles - 104 cycles for eight longs - just under 20 MIPS.

    Regards,

    Bill
    Bill,

    How did you come up with 12.5 MIPS? Please show an example of the LMM loop that operates at that speed.

    Thanks,
    -Phil
  • Invent-O-DocInvent-O-Doc Posts: 768
    edited 2014-05-21 21:00
    I'd like to echo Roy.

    Hubexec may not be worth it if we can transfer from HUB to get some decent LMM going. I'm okay with that. In the end, I suspect we will be happier if there is some kind of hubexec to improve speed for C and other compiled code. It's probably worth the pain of developing and trying it out on FPGA.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-21 21:04
    Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
    loop    long    0-0
            rdlong  loop,pc++
            jmp     #loop
    

    That's faster than a P1 running from the cog.
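    A quick sanity check of those numbers (not in the original post - a rough Python sketch assuming a 16-bank hub, one hub slot per clock, a 200 MHz system clock, 2-clock cog instructions, and the 3-clock rdlong stated above):

        # Stepping the serviced bank by 7 each clock (7 is coprime to 16) yields the quoted order.
        order = [(7 * t) % 16 for t in range(16)]
        print(''.join('%X' % b for b in order))    # -> 07E5C3A18F6D4B29

        # With this order, bank N+1 fires 7 clocks after bank N, so a 7-clock LMM loop
        # (rdlong 3 + fetched instruction 2 + jmp 2) never has to wait for its slot.
        print(200 / 7)                             # -> 28.57... MIPS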

    -Phil
  • jmgjmg Posts: 15,144
    edited 2014-05-21 21:07

    As there is no cache, it would have to wait for pc+4 to come around, so only one LMM op per 16 clock cycles... 200/16 = 12.5
    next:  RDLONG ins, pc  ' 1/16 clock cycles, 1/8 instruction cycles to next window
             add pc,#4 ' not needed if ptra++ is used for PC
             ... spacers
    ins     nop
             jmp   #next    ' same time as reps, and any JMP/CALL to primitive would break out of reps
    
    Details:
    Because of the Rotate details, I think an incrementing fetch has to wait 17 clocks between valid slots.
    (16 is for returning to the same LSN.)
    If Chip reverses the spin order, so the LSN decrements with the slot scan, then an incrementing fetch can be 15 clocks between valid slots.
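    A rough way to see the 17-vs-15 clock difference (not from the thread - a sketch that assumes a 16-slot scan, one slot per clock, and that the cog cannot issue its next RDLONG until at least 2 clocks after the previous slot was hit):

        def clocks_to_next_slot(step, issue_delay=2, banks=16):
            # The previous fetch hit bank 0 at clock 0; the next long lives in bank 1.
            # Find the first clock >= issue_delay at which the scan points at bank 1.
            t = issue_delay
            while (step * t) % banks != 1:
                t += 1
            return t

        print(clocks_to_next_slot(step=+1))   # ascending scan  -> 17
        print(clocks_to_next_slot(step=-1))   # descending scan -> 15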
  • Roy ElthamRoy Eltham Posts: 2,996
    edited 2014-05-21 21:10
    I believe rdlong is 2 clocks when hitting the sweet spot of hub cycles and using PTRx for the address instead of an arbitrary register.
    This is based on info Chip shared with me while we were going over the pipeline.
  • jmgjmg Posts: 15,144
    edited 2014-05-21 21:12
    Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
    However, such a hub address firing order breaks Video Linear DMA, and also stalls on opcodes <> assumed cycle#

    The LMM tools also have to re-scramble code to exactly match the chosen 07E5C3A18F6D4B29 hub address firing order.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-21 21:12
    Roy,

    The results will be the same, whether it's two or three clocks, since the firing order is based on seven clocks. (7 is relatively prime to 16; 6 will not work.) And let's just assume I renamed ptrx.
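    For what it's worth, the coprime requirement is easy to check (a sketch, not from the post):

        # A stride of 7 visits all 16 banks; a stride of 6 (gcd 2 with 16) visits only 8 of them.
        for stride in (7, 6):
            print(stride, len({(stride * t) % 16 for t in range(16)}))   # -> 7 16, then 6 8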

    -Phil
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-05-21 21:15
    Six clocks; it is two clocks per instruction. You assume ptra++ exists - I thought you were against PTRA/PTRB.

    It cannot hit the sweet spot, as it would be far past the correct address to read from the multiplexer.

    12.5MIPS


    $xxxxx0 first lmm instruction that syncs - say it hits rdlong, let's say ptr was xxxx0
    $xxxxx1 ' end of rdlong, second clock
    $xxxxx2 start of jmp instruction
    $xxxxx3 ' end of jump instruction
    $xxxxx4 start of 2 cycle lmm instruction
    $xxxxx5 end of 2 cycle lmm instruction

    $xxxxx6 second instance of rdlong, wants xxxx1.. not available
    $xxxxx7 ' second cycle of rdlong... but it is waiting for $xxxxxx1
    $xxxxx8
    $xxxxx9
    $xxxxxA
    $xxxxxB
    $xxxxxC
    $xxxxxD
    $xxxxxE
    $xxxxxF
    $xxxxx0
    $xxxxx1 ' rdlong completes

    $xxxxx2 start of jmp instruction
    $xxxxx3 ' end of jump instruction
    $xxxxx4 start of 2 cycle lmm instruction
    $xxxxx5 end of 2 cycle lmm instruction

    $xxxxx6 third instance of rdlong, wants xxxx1.. not available
    $xxxxx7 ' second cycle of rdlong... but it is waiting for $xxxxxx1
    $xxxxx8
    $xxxxx9
    $xxxxxA
    $xxxxxB
    $xxxxxC
    $xxxxxD
    $xxxxxE
    $xxxxxF
    $xxxxx0
    $xxxxx1 ' rdlong completes

    The cycle-by-cycle analysis above shows 16 cycles between rdlongs, due to the nature of the hub.

    12.5 MIPS max, before subtracting primitive overhead.
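    The timeline can be replayed in a few lines (not from the post - a rough model assuming a plain ascending 16-slot hub with slot = clock mod 16, an RDLONG that must wait for the slot matching its address nibble, and 4 clocks of loop overhead for the fetched instruction plus the jmp). Under this model consecutive fetches land 17 clocks apart rather than 16 - the same 1-in-17 point Cluso makes further down - but either way it is roughly one LMM op per hub revolution:

        clock, nibble, starts = 0, 0, []
        for _ in range(4):                       # four consecutive LMM fetches
            while clock % 16 != nibble:          # stall until the wanted bank's slot comes around
                clock += 1
            starts.append(clock)
            clock += 2 + 4                       # rdlong (2) + fetched ins (2) + jmp (2)
            nibble = (nibble + 1) % 16           # the next long lives in the next bank
        print([b - a for a, b in zip(starts, starts[1:])])   # -> [17, 17, 17]
        print(200 / 16, 200 / 17)                # -> 12.5 and ~11.76 MIPS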

    FYI, my best guess is that 2-3 spacer instructions will be required after an RDLONG before the value read can be executed, but that is only a guess, until Chip tells us what the verilog needs.

    Phil,

    Believe it or not, I greatly respect your work - I am especially impressed with your Backpack and other NTSC & radio work.

    I know you can do analysis like I did above.

    Please do an analysis - like the one above - before posting rebuttals. Your rebuttal was factually incorrect, as shown above.

    Best Wishes,

    Bill
    Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
    loop    long    0-0
            rdlong  loop,pc++
            jmp     #loop
    

    That's faster than a P1 running from the cog.

    -Phil
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-21 21:15
    jmg wrote:
    However, such a hub address firing order breaks Video Linear DMA, and also stalls on opcodes <> assumed cycle#
    So? The video data order can be changed in the hub to come out linearly. And no matter what the firing order is, there will be LMM opcodes, such as branches, that stall the process while the hub resynchronizes.

    -Phil
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-21 21:18
    Bill, if you took the time to analyze my example using the address firing order I proposed, you would see that my analysis is correct.

    -Phil
  • Roy ElthamRoy Eltham Posts: 2,996
    edited 2014-05-21 21:18
    jmg,
    What if the fifo wasn't a fifo so much as a 16 long buffer that could be filled out of order, but then read in order? (or filled out of order and read out of order?) Maybe it needs to be longer than 16 in order to satisfy full speed video streaming....
  • jmgjmg Posts: 15,144
    edited 2014-05-21 21:21
    So? The video data order can be changed in the hub to come out linearly. And no matter what the firing order is, there will be LMM opcodes, such as branches, that stall the process while the hub resynchronizes.

    the Nibble-adder design I looked into
    http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate
    does exactly what your hub address firing-order change does, but those opposing it claimed linear addressing was important - and that was just for doing it inside the Adder (where it was optional), not in the Rotate Engine, where it affects ALL COGs, so all tools would need to apply the magic scramble.
  • jmgjmg Posts: 15,144
    edited 2014-05-21 21:28
    Roy Eltham wrote: »
    jmg,
    What if the fifo wasn't a fifo so much as a 16 long buffer that could be filled out of order, but then read in order? (or filled out of order and read out of order?) Maybe it needs to be longer than 16 in order to satisfy full speed video streaming....

    If everyone is happy with scrambling order, then look at the Nibble-Adder I looked into earlier

    http://forums.parallax.com/showthread.php/155692-Nibble-Carry-Higher-speed-Buffers-FIFOs-using-new-HUB-Rotate
    This applies out-of-linear-order pointers, and actually replaces the FIFO, by having the change in address match the desired feed rates.

    For Odd-N it allows matching streaming rates with SW-loops, for best bandwidth use.
    Even-N is a little more complex, but it does have a solution.

    The FIFO design allows HW linear streaming, so it largely replaced the need for Nibble-Adder.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-05-21 21:31
    Phil,

    I quoted your message, I did not see the firing order.

    I was replying to your rebuttal as written, and as written - with me not noticing the firing-order string - my analysis stands.

    Sorry about not noticing it, I will analyze its impact now.

    Bill
    Bill, if you took the time to analyze my example using the address firing order I proposed, you would see that my analysis is correct.

    -Phil
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-05-21 21:49
    Thanks for pointing out that I missed the firing order.

    0
    7
    E
    5
    C
    3
    A
    1
    8
    F
    6
    D
    4
    B
    2
    9

    Ok, I see what you are doing. You are scrambling the banks by offsetting each bank by 7 clock cycles.

    If all cogs scramble the same way, including FIFO fills and all other possible access, such a scramble can work; however, you are inducing jitter by using 7 instead of 8, since cog instructions take two cycles.

    If any type of access does not scramble exactly the same way, it is a HUGE headache, far worse than any slot scheme or mooch or other objected-to features.

    Now, assuming that everything scrambles the same way, we still have a problem - you are only allowing one spacer instruction after the read before trying to execute the instruction just fetched, a problem that is NOT addressed by the scramble.

    Assuming we would only execute two-cycle, cog-only instructions:

    1 spacer (like your loop), once the hub is in lock step with your scramble, would result in 200/6 = 33 MIPS. So if we have eight longs, consisting of 6 cog instructions, a CALL #FJMP, and the long containing the address, we would need approximately 8*2+40 cycles - that is, 56 cycles for 7 instructions, about 8 cycles per instruction on average, or 25 MIPS ... very close to your 28.5 MIPS.

    2 spacers is 8*3+40 = 64 cycles, or 9.1 cycles per instruction on average, or 22 MIPS.

    3 spacers is 8*4+40 = 72 cycles, or 10.3 cycles per instruction on average, which is 19.4 MIPS.
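    The spacer arithmetic above can be reproduced directly (a sketch using the same assumptions: an 8-long fragment of 6 cog instructions plus a CALL #FJMP and the address long, 40 clocks of jump-primitive overhead, 7 executed instructions, 200 MHz):

        for spacers in (1, 2, 3):
            clocks = 8 * (spacers + 1) + 40          # the 8*2+40, 8*3+40, 8*4+40 above
            per_ins = clocks / 7                     # 7 instructions actually execute
            print(spacers, clocks, round(per_ins, 1), round(200 / per_ins, 1))
        # -> 1 56  8.0 25.0
        #    2 64  9.1 21.9
        #    3 72 10.3 19.4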

    My apologies for missing the firing order; however, if not every access from every cog, for all uses, has the same firing order, it would be a disaster.

    So if the firing order is not visible to the software - i.e., the same hub location has the same address as far as every cog, video circuit, FIFO, etc. is concerned - your firing order is definitely an improvement for LMM. How much of an improvement depends on the number of required spacer instructions.

    It would slow down video streaming by a factor of 7, though, unless the FIFO was filled out of order, potentially causing a 7-cycle delay between longs (unless the FIFO was, say, 32 levels deep).

    Regards,

    Bill
    Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
    loop    long    0-0
            rdlong  loop,pc++
            jmp     #loop
    

    That's faster than a P1 running from the cog.

    -Phil
  • jazzedjazzed Posts: 11,803
    edited 2014-05-21 21:53
    For quite a while I've been sitting patiently, trying to ignore all the noise, wishing that Chip could just throw the bit-file design over the fence for us to evaluate (fingers crossed that it will happen soon).

    Can we just assume a moment that there is some way to take advantage of that FIFO Chip is so happy about? Has this been explored at all? If so, what was the result?

    I see some LMM analysis was written while I typed .... LOL

    The basic question is: Can instructions be fetched and executed using a "FIFO sized" overlay area?

    I haven't fully explored this and don't care to do it honestly, but it seems from 10000 feet that no one else has tried to have the conversation. Reading this thread and others like it really make me nauseous ....

    Assumption:

    Using Chip's FIFO, 16 or so instructions get read from hub RAM into a cog block by RDBLOC, and instructions can be executed by the cog until a jump happens. At the time of any jump, the PC gets adjusted, and RDBLOC runs again.

    Considerations:

    1. Any branch will waste cycles.
    2. Maybe 5 instructions on average get run without a jump. Need more data....
    3. Relative branches (+/- 256) PC can be adjusted by a tiny 3-ish instruction macro. Possible at all?
    4. Any jump macro (like LMM) adjusts the PC based on the value stored in the next address register of the jump (ala P8x32a)
    5. Other macros (ala LMM) would be used.
    6. Other?
    :fetch
    RDBLOC :dst, pc ' pc 
    :dst
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop
    ' special PC adjustment here.
    ' then.
    jmp :fetch
    


    Is it a complete waste of time to think about this? If so, please give at least a brief explanation.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-05-21 21:56
    Roy Eltham wrote: »
    It's based on getting one long every 16 clocks, so at 200MHz you get 12.5 MIPS. It's worse than 12.5 MIPS when you factor in branching and other LMM stuff.

    Although, I think LMM with the FIFO would give quite a bit more performance. Bill's earlier post shows that.
    Actually it's 1 in 17 clocks, because the hub address also needs to advance, so 200MHz -> 11.76 MIPS without any caching/FIFO. Branching, of course, is going to cause another big hit. And when you add in the hub stack for GCC, that is really going to bite.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-05-21 22:04
    Hi Steve,

    Please see

    http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1269950&viewfull=1#post1269950

    I analyzed that, showing two cases, for cog code, hubexec+FIFO, LMM, and LMM+FIFO, with 1/2/3 spacers needed.

    Tables of results are given at the link above.

    The simplified mathematical model used to obtain those results can be found in

    http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip?p=1269942&viewfull=1#post1269942

    Summary:

    Assuming two delay slots are needed between RDLONG and execution, FIFO'd LMM would be 2x faster than simple LMM.

    FIFO'd hubexec would be 2.5x+ faster than FIFO'd LMM.

    Hope that helps!

    Bill


    jazzed wrote: »
    For quite a while I've been sitting patiently, trying to ignore all the noise, wishing that Chip could just throw the bit-file design over the fence for us to evaluate (fingers crossed that it will happen soon).

    Can we just assume a moment that there is some way to take advantage of that FIFO Chip is so happy about? Has this been explored at all? If so, what was the result?

    I see some LMM analysis was written while I typed .... LOL

    The basic question is: Can instructions be fetched and executed using a "FIFO sized" overlay area?

    I haven't fully explored this and don't care to do it honestly, but it seems from 10000 feet that no one else has tried to have the conversation. Reading this thread and others like it really make me nauseous ....

    Assumption:

    Using Chip's FIFO, 16 or so instructions get read from hub RAM into a cog block by RDBLOC, and instructions can be executed by the cog until a jump happens. At the time of any jump, the PC gets adjusted, and RDBLOC runs again.

    Considerations:

    1. Any branch will waste cycles.
    2. Maybe 5 instructions on average get run without a jump. Need more data....
    3. Relative branches (+/- 256) PC can be adjusted by a tiny 3-ish instruction macro. Possible at all?
    4. Any jump macro (like LMM) adjusts the PC based on the value stored in the next address register of the jump (ala P8x32a)
    5. Other macros (ala LMM) would be used.
    6. Other?
    :fetch
    RDBLOC :dst, pc ' pc 
    :dst
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop 
    nop
    ' special PC adjustment here.
    ' then.
    jmp :fetch
    


    Is it a complete waste of time to think about this? If so, please give at least a brief explanation.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-05-21 22:06
    Thanks jazzed,
    Relative DJNZ & Friends is +/-256. I forgot the operands were 9 bits, not 8.

    Yes, I would just like any FPGA image. Don't care what it is missing. We need to start re-testing.
  • Roy ElthamRoy Eltham Posts: 2,996
    edited 2014-05-21 22:10
    jazzed,
    I think it is valid, except for maybe needing a spacer between the RDBLOC and the execution (move it to the end of the loop, just before the JMP).

    RDBLOC takes 18 clocks to complete and then 2 clocks for your loop jmp, then you run N instructions until a jump/branch. So you take 20 clocks + 2*N clocks + jump/branch overhead (Bill has been using 40 clocks).

    if N = 5, then it's 20+10+40 = 70 clocks for 6 instructions (5+jump/branch) = about 17.14 MIPS
    if N = 8, then it's 20+16+40 = 76 clocks for 9 instructions (8+jump/branch) = about 23.68 MIPS
    in the rare case that you get to run all 16 instructions and have no jump/branch, then you get 20+32 = 52 clocks to run 16 instructions = about 61.54 MIPS

    So in typical cases, I believe it compares favorably, but in worst case it's much slower (due to 18 clock RDBLOC overhead). In the best case it's much faster, but I dunno how common the best case is...
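    Those figures can be reproduced directly (a sketch with the same assumptions: an 18-clock RDBLOC, a 2-clock loop jmp, 2 clocks per executed instruction, 40 clocks of branch overhead, 200 MHz):

        def rdbloc_mips(n, branch_overhead=40):
            clocks = 18 + 2 + 2 * n + branch_overhead
            return 200 * (n + 1) / clocks            # n straight-line instructions + the branch
        print(round(rdbloc_mips(5), 2))              # -> 17.14
        print(round(rdbloc_mips(8), 2))              # -> 23.68
        print(round(200 * 16 / (20 + 32), 2))        # best case, no branch -> 61.54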
  • jazzedjazzed Posts: 11,803
    edited 2014-05-21 22:10
    Roy, your analysis is interesting. How do you account for changing the PC?


    Where is your pseudo code Bill?
  • Roy ElthamRoy Eltham Posts: 2,996
    edited 2014-05-21 22:12
    Bill,
    FIFO LMM is not the same as RDBLOC LMM. The FIFO doesn't have the 18-clock overhead of reading the 16 longs into cog memory.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-05-21 22:17
    On my lab computer, downstairs - I am in the living room.

    Basically, best case is
    top    rep   #511,#4 ' assuming 2 spacers
            ' I am assuming no spacer is needed here
             rdflong    ins
             nop ' spacer 1
             nop ' spacer 2
             'nop spacer 3
    ins     nop
             jmp   #top
    

    My analysis showed two cases:

    5 longs (5 hubexec instructions including hubexec jump, 4 cog + LMM jump)
    8 longs (8 hubexec instructions including hubexec jump, 6 cog + LMM jump)

    hub with mux as described by Chip/Roy

    The results in the posting I previously linked to were based on primitives (FJMP) completing in time for the next hub cycle, which is optimistic.
    jazzed wrote: »
    Where is your pseudo code Bill?
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-05-21 22:23
    Roy,

    Thanks.

    RDBLOC would have other problems (i.e., a different code generator for 16-long blocks of code, jumping into them, fillers, etc.). My guess is that Parallax does not want to pay for such a back end.

    Ariba's trick for Quad LMM may help some.

    Why don't you guys work up a proper analysis for RDBLOC? I am going to bed now, or wifey will kill me.

    Bill
    Roy Eltham wrote: »
    Bill,
    FIFO LMM is not the same as RDBLOC LMM. The FIFO doesn't have the 18-clock overhead of reading the 16 longs into cog memory.
  • Roy ElthamRoy Eltham Posts: 2,996
    edited 2014-05-21 22:24
    Bill,
    I am pretty sure the worst case will be one spacer needed between a hub read and execution of that read long from cog memory. Look at Chip's discussions of the clock cycles to pipeline stages.

    Chip's pipeline stages post: http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261426&viewfull=1#post1261426
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-05-21 22:27
    That would be great!

    I gave results for one spacer.
    Roy Eltham wrote: »
    Bill,
    I am pretty sure the worst case will be one spacer needed between a hub read and execution of that read long from cog memory. Look at Chip's discussions of the clock cycles to pipeline stages.

    Chip's pipeline stages post: http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1261426&viewfull=1#post1261426
  • David BetzDavid Betz Posts: 14,511
    edited 2014-05-21 22:28
    Roy,

    Thanks.

    RDBLOC would have other problems (i.e., a different code generator for 16-long blocks of code, jumping into them, fillers, etc.). My guess is that Parallax does not want to pay for such a back end.

    Ariba's trick for Quad LMM may help some.

    Why don't you guys work up a proper analysis for RDBLOC? I am going to bed now, or wifey will kill me.

    Bill
    Also, every "call" to an LMM macro is a branch. How would the LDI macro work that is supposed to load the 32 bits following the JMP to #_LMM_LDI? There is no PC that is keeping pace with instruction execution. Would LDI waste the remainder of the 16 long block and get its immediate value from the following block?
  • cgraceycgracey Posts: 14,133
    edited 2014-05-21 22:37
    ...I loved having sdram on the P2, tons of memory for 1080p 24bpp, large capture buffers, tons of xmm code space... sniff.


    We can have SDRAM on the new chip, too. It would just clock at 100MHz (Fsys/2). It takes, as I recall, 22 pins for control and address and then 16 pins for data. That's 38 total, leaving 26 for other stuff. That's workable. With the hub FIFOs, you could stream words at Fsys/2 by setting the NCO to 1/2. Use a smart pin to output Fsys/2. There's everything you need. One cog should handle it just fine.
  • cgraceycgracey Posts: 14,133
    edited 2014-05-21 23:04
    Assuming a rdlong executes in three clocks if it hits the hub sweet spot, and with a hub address firing order of 07E5C3A18F6D4B29, this LMM loop executes at 28.5 MIPS:
    loop    long    0-0
            rdlong  loop,pc++
            jmp     #loop
    

    That's faster than a P1 running from the cog.

    -Phil


    That's a pretty neat idea for getting fast LMM. We'll probably need to stick with a simple ascending hub slot order to make the FIFO work smoothly.

    We will get hub exec, but not on the first FPGA release. Hub exec makes big code mindless to write. And it's fast. I've been struggling trying to get a bite on various aspects of this design. It's turning out to be a new, ground-up effort.

    Tonight I hope to get the simplex FIFO done. It will be a few more days before I can actually test it, though, because I have to create a working cog, still. Pretty much, all the pieces are ready.
  • cgraceycgracey Posts: 14,133
    edited 2014-05-21 23:08
    Cluso,

    Using the 256-long LUT for cog code, ranging from addresses $200..$2FF would probably only need a few mux's to make work. No promises, but we'll investigate that down the line. It would be like internal-speed hub exec.