New Hub Scheme For Next Chip

Bill Henning · 2014-05-21 13:39

Not apples to apples.

Hand crafted small driver vs. compiler generated code.

Write it in C, look at compiled code, analyze that - I did such analysis.

Heater. wrote: »

David.

I guess fetching a stream of instructions helps a lot when you are executing linear code.

That is not my code. It takes a turn this way or that every handful of instructions. Look at FullDuplexSerial. Can you even find four instructions in a "row" there?

My prediction is that such a FIFO only has minimal impact on execution speed.

Unless it is more than a FIFO but more like a cache and you can keep your code in cache and keep your data locality high.

jmg · 2014-05-21 13:45

David Betz wrote: »

What are the HW FIFO use cases? Is the FIFO essentially a generalization of the video FIFO in P1? Would it be used to stream out video to the DACs?

Yes, the HW FIFO is used to stream video (optionally via LUT) to the pins or DACs, at fSys/N (any N, NCO derived).
Simple full fSys speed transfer from HUB to DACs is a defining feature.

It can also stream capture Pins directly to HUB, for Logic Analyser/ Instrumentation type apps.

The SW-FIFO use, is a bonus of having the HW already there.

David Betz · 2014-05-21 13:47

jmg wrote: »

Yes, the HW FIFO is used to stream video (optionally via LUT) to the pins or DACs, at fSys/N (any N, NCO derived).
Simple full fSys speed transfer from HUB to DACs is a defining feature.

It can also stream capture Pins directly to HUB, for Logic Analyser/ Instrumentation type apps.

The SW-FIFO use, is a bonus of having the HW already there.

Nice! I like the idea of more general HW.

Kerry S · 2014-05-21 13:49

Bill Henning wrote: »

On the P1 it is four ported, it was also four port on the P2 design.

The LMM interpreter, including FCACHE and FLIB areas is 2KB

What is large is the generated C code, and all the required C libraries, and data structures.

Hope this helps.

Bill,

Thank you. That explains a lot. The only C I have done on the propeller are things with my son for his P1 powered robot. All of my stuff has been assembly getting my drivers figured out. Have not gotten to the main program which, unfortunately, will require C or Spin since it will not fit into a cog. With gigs of ram it is easy to forget how much of a hog C is... Ahh for Delphi on the Propeller!

Thanks to everyone for the Prop design lessons.

RossH · 2014-05-21 13:50

Phil Pilgrim (PhiPi) wrote: »

My ears are starting to ring with the echoes of creeping featuritis, with the same people pushing for more, more, more. I fear that the FIFO is making things way too complicated. Why not just adjust the hub's address "firing order" to optimize software-mediated streaming and call it good? And forget hubexec; LMM is good enough, especially with the new hub bandwidth. Otherwise, this thing will never get finished. Sorry, Chip, but it's deja vu all over again.

-Phil

+1

Heater. · 2014-05-21 13:51

Bill,

I was comparing one cog to one LPC.
So if you want to say 200MIPS across 16 cores, I'll respond with 768MIPS across 16 LPC1114's.

You can respond like that. I think it's nuts. Who is going to build a system with 16 LPC1114's on it? And then get them tightly integrated with whatever communications they need? Which would kill your performance immediately anyway. Not to mention practical issues of PCB space etc.

A single chip solution like the Propeller wins hands down.

Statistically about one in eight to ten instructions will be a jump taken (or call taken)

Perhaps. I don't believe it. I'll have to compile my FullDuplexSerial and FFT C source to assembler and check before I can refute it though.

I crunch numbers.

Let's do that. See above.

I also don't believe anyone is going to write a compiler that makes use of any FIFO if it requires special instructions to deal with. Yes, that is just a belief as I'm not a compiler writer.

RossH · 2014-05-21 13:53

Heater. wrote: »

David.

I guess fetching a stream of instructions helps a lot when you are executing linear code.

That is not my code. It takes a turn this way or that every handful of instructions. Look at FullDuplexSerial. Can you even find four instructions in a "row" there?

My prediction is that such a FIFO only has minimal impact on execution speed.

Unless it is more than a FIFO but more like a cache and you can keep your code in cache and keep your data locality high.

Yes, this is true of compiled code as well. That's why I believe that the FIFO will turn out to be of only limited benefit for code execution. It will have to be dumped frequently. However, it will be useful for streaming applications.

Phil Pilgrim (PhiPi) · 2014-05-21 13:55

Bill Henning wrote:

On the P1 it is four ported,

The P1 cog RAM is single-ported. There is only one memory access per clock cycle.

-Phil

Bill Henning · 2014-05-21 13:57

FullDuplexSerial is a poor choice, as it will be replaced by smart pins, and it is a small driver. Not as much gain there.

FFT should show a good speedup, so should Dhrystone, Whetstone, inverse kinematics, anything computation heavy etc.

A nice large cache would be far better, but FIFO will give a decent boost.

Here is another example - assume that one in five instructions is a branch taken.

4 would execute in 8 cycles, and the branch (on average) in 2+8 cycles (8 for fifo reload) for a total of 18 cycles. cog code would do it in 10.

FIFO'd LMM in about 40 cycles, non-fifo'd LMM in about 80 cycles.

Heater. wrote: »

Bill,

You can respond like that. I think it's nuts. Who is going to build a system with 16 LPC1114's on it? And then get them tightly integrated with whatever communications they need? Which would kill your performance immediately anyway. Not to mention practical issues of PCB space etc.

A single chip solution like the Propeller wins hands down.

Perhaps. I don't believe it. I'll have to compile my FullDuplexSerial and FFT C source to assembler and check before I can refute it though.

Let's do that. See above.

Bill Henning · 2014-05-21 13:58

Thanks. Mixed it up with P2 original design.

Phil Pilgrim (PhiPi) wrote: »

No, the P1 cog RAM is single-ported. That's why it takes four system clocks for each instruction: one to fetch the instruction, two to fetch the operands, one to write the result.

-Phil

Heater. · 2014-05-21 14:01

Ross,

Yes, this is true of compiled code as well.

Sorry I did not specify, I was thinking of my FullDuplexSerial_ht that is written in C and runs native in a COG.
I have to check but I think heater_fft spends more time jumping than running in a straight line:)

David Betz · 2014-05-21 14:07

Heater. wrote: »

Ross,

Sorry I did not specify, I was thinking of my FullDuplexSerial_ht that is written in C and runs native in a COG.
I have to check but I think heater_fft spends more time jumping than running in a straight line:)

It would be interesting to know how frequently a program needs to branch before the FIFO actually hurts more than it helps because of the RDINIT required at each branch.

Bill Henning · 2014-05-21 14:07

This may help y'all:

For compiled code (assumes 2 cycles per instruction, for every random hub reference add 8 cycle average)

if for every X instructions there is a FJMP or FCALL taken (jumps not taken don't count)

cog only code

X * 2

simple lmm estimate is

X * 16 cycles

lmm with fifo estimate is

X * 4 + 8 cycles (zero spacers needed for RDFLONG) to X * 8 + 8 cycles (two spacers needed)

hubexec with fifo estimate is

X * 2 + 8 cycles
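
As a rough sketch (not from the thread), the estimates above can be written as a small Python function. The function and key names are made up for illustration, and the per-spacer cost of 2 cycles is an assumption inferred from the two endpoints given (X * 4 with zero spacers, X * 8 with two).

```python
def cycle_estimates(x, cycles_per_instr=2, hub_penalty=8):
    """Estimated clocks for a run of x instructions ending in one taken FJMP/FCALL.

    The formulas are the ones quoted in the post; the names are illustrative only.
    """
    def fifo_lmm(spacers):
        # Assumption: each spacer adds 2 cycles per instruction, which
        # reproduces X*4+8 (0 spacers) and X*8+8 (2 spacers) from the post.
        return x * (4 + 2 * spacers) + hub_penalty

    return {
        "cog": x * cycles_per_instr,
        "simple lmm": x * 16,
        "fifo lmm 0": fifo_lmm(0),
        "fifo lmm 2": fifo_lmm(2),
        "fifo hubexec": x * cycles_per_instr + hub_penalty,
    }


if __name__ == "__main__":
    # One taken jump/call in every 5 and in every 8 instructions.
    for x in (5, 8):
        print(x, cycle_estimates(x))
```

Plugging in a measured X (instructions per taken jump or call) from real compiler output would give a first-order comparison of the four execution modes.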

Heater. · 2014-05-21 14:07

Bill,

FullDuplexSerial is a poor choice, as it will be replaced by smart pins

Sounds good.

Now, if those "smart pins" are taking care of protocols like UART, SPI, I2C etc why on earth do we need 16 COGs?

Why not just use that 1 dollar ARM which is cheaper, smaller, faster and supports all those peripherals anyway?

You have convinced me. The P2 is doomed.

Bill Henning · 2014-05-21 14:09

Without the RDINIT, each hub reference would take 8 clock cycles... which is the same as the average RDINIT time, so I don't think it is possible for the FIFO to hurt.
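
A minimal sketch of that argument in Python, under a stated assumption: without the FIFO, every instruction fetched from hub is modelled as costing its 2 execution cycles plus the 8 cycle average hub wait. That no-FIFO cost model and the function names are illustrative, not something confirmed in the thread.

```python
def cycles_with_fifo(x, cycles_per_instr=2, rdinit=8):
    # x instructions stream from the FIFO, then one RDINIT reload at the branch.
    return x * cycles_per_instr + rdinit


def cycles_without_fifo(x, cycles_per_instr=2, hub_wait=8):
    # Assumed model: every instruction pays the average hub wait on its own.
    return x * (cycles_per_instr + hub_wait)


if __name__ == "__main__":
    for x in range(1, 9):
        print(x, cycles_with_fifo(x), cycles_without_fifo(x))
```

Under this model the two are equal only when every single instruction is a taken branch (x = 1); for any longer straight-line run the FIFO version comes out ahead.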

David Betz wrote: »

It would be interesting to know how frequently a program needs to branch before the FIFO actually hurts more than it helps because of the RDINIT required at each branch.

Bill Henning · 2014-05-21 14:11

Sarcasm much?

I do use ARMs, PICs, AVRs etc for projects where they fit better than a Prop (usually for cost reasons).

We need 16 cogs to load random Obex objects of course

Actually I'd have been quite happy with a 4 cog 4 task per cog P2 optimized design, 16 cogs is a nice luxury "just in case".

I doubt the smart pins will absorb SPI, I2C, certainly they will help, but not absorb.

FYI, I bit bang uarts in pasm, not C, for higher baud rates

Heater. wrote: »

Bill,

Sounds good.

Now, if those "smart pins" are taking care of protocols like UART, SPI, I2C etc why on earth do we need 16 COGs?

Why not just use that 1 dollar ARM which is cheaper, smaller, faster and supports all those peripherals anyway?

You have convinced me. The P2 is doomed.

jmg · 2014-05-21 14:19

Phil Pilgrim (PhiPi) wrote: »

My ears are starting to ring with the echoes of creeping featuritis, with the same people pushing for more, more, more. I fear that the FIFO is making things way too complicated.

No, the FIFO allows users access to the full HUB bandwidth.
HW-FIFO especially gives performance SW cannot reach. Chip has fSys/N Video streaming from HUB, with optional LUT, via this HW-FIFO.

Phil Pilgrim (PhiPi) wrote: »

Why not just adjust the hub's address "firing order" to optimize software-mediated streaming and call it good?

You mean shuffle the spokes? That would break video streaming, and besides, who decides the "new firing order"? It sounds like a lot of interaction and fish-hooks.

Phil Pilgrim (PhiPi) · 2014-05-21 14:24

jmg wrote:

You mean shuffle the spokes? That would break video streaming ...

Would it? You could still do full-speed block transfers and stream video out of cog RAM.

-Phil

jmg · 2014-05-21 14:31

Phil Pilgrim (PhiPi) wrote: »

Would it? You could still do full-speed block transfers and stream video out of cog RAM.

You forgot the LUT?

I'm not seeing how shuffling the spokes does anything for all COGs, apart from adding another variable.
Spoke order is rather locked to the memory LSBs, and any shuffle of that might help one COG, but harm others.

Bill Henning · 2014-05-21 14:33

Ok, here are the results of a spreadsheet model:

NOTES:

- will have to correct model, as it does not currently take into account the time taken by the FJMP and/or FCALL primitive, so LMM results (all versions) will be slower in real life compared to hubexec.
- 0 spacer FIFO LMM extremely unlikely, 1 or 2 will be required, all figures shown for completeness

Conclusions:

fifo'd hubexec is 2.5x-3x+ faster than fifo'd LMM

fifo'd hubexec is 6x+ faster than non-fifo'd LMM

fifo'd LMM would be 2x-3x faster than non-fifo'd LMM

One jump/call taken in every 8 longs

clock freq      200
cycl per ins    2
avg hub cycl    8

frq jmp/call    0.125
in N instr      8

                clock cycles    x faster LMM    % of cog spd
cog cycl        16              9.0             100.00%
fifo hubexec    24              6.0             66.67%
fifo lmm 0      40              3.6             40.00%
fifo lmm 1      56              2.6             28.57%
fifo lmm 2      72              2.0             22.22%
simpl lmm       144             1.0             11.11%

One jump/call taken in every 5 longs

cycl per ins    2
avg hub cycl    8

frq jmp/call    0.2
in N instr      5

                clock cycles    x faster LMM    % of cog spd
cog cycl        10              9.6             100.00%
fifo hubexec    18              5.3             55.56%
fifo lmm 0      28              3.4             35.71%
fifo lmm 1      38              2.5             26.32%
fifo lmm 2      48              2.0             20.83%
simpl lmm       96              1.0             10.42%
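
For anyone who wants to check the derived columns, here is a small Python sketch (not part of the original spreadsheet) that takes the clock cycle counts exactly as listed and recomputes "x faster LMM" as simple LMM cycles divided by the row's cycles, and "% of cog spd" as cog cycles divided by the row's cycles. The dictionary and function names are illustrative.

```python
# Clock cycle counts copied from the two result tables.
CYCLES_1_IN_8 = {"cog": 16, "fifo hubexec": 24, "fifo lmm 0": 40,
                 "fifo lmm 1": 56, "fifo lmm 2": 72, "simple lmm": 144}
CYCLES_1_IN_5 = {"cog": 10, "fifo hubexec": 18, "fifo lmm 0": 28,
                 "fifo lmm 1": 38, "fifo lmm 2": 48, "simple lmm": 96}


def derive_columns(cycles):
    """Return {row: (speedup over simple LMM, percent of cog speed)}."""
    lmm = cycles["simple lmm"]
    cog = cycles["cog"]
    return {name: (round(lmm / c, 1), round(100 * cog / c, 2))
            for name, c in cycles.items()}


if __name__ == "__main__":
    for table in (CYCLES_1_IN_8, CYCLES_1_IN_5):
        for name, (speedup, pct) in derive_columns(table).items():
            print(f"{name:14s}{speedup:6.1f}{pct:10.2f}%")
        print()
```

One thing this makes visible: the simple LMM rows (144 and 96) equal 16 * (N + 1), which is one 16-cycle slot more than the X * 16 estimate given earlier in the thread. The reason for the extra slot is not stated.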

David Betz · 2014-05-21 14:43

Bill Henning wrote: »

Ok, here are the results of a spreadsheet model:

What code are you using? Which compiler and what settings?

Edit: Never mind. This is a spreadsheet model not based on actual code. Sorry.

Bill Henning · 2014-05-21 14:48

No worries.

The idea is we look at code, and plug in the numbers, and the spreadsheet will make a prediction that will be close "on average".

One example above assumes one jump/call taken in every 8 longs,

the other assumes one in every five longs.

The one in five is pretty much worst case for code that does something useful.

integer compute heavy code will have far fewer jumps taken.

jumps not taken do not count as jumps.

There will be outlying degenerate cases that perform better - and some that perform worse - but this model can give a good indication.

Last time I looked at gcc output, I think it was about one FJMP in every eight longs "on average"

I'd be very interested in loop body results (dhrystone, fft, etc) for how many jumps taken in N longs, as it would provide more data for the model.

David Betz wrote: »

What code are you using? Which compiler and what settings?

Edit: Never mind. This is a spreadsheet model not based on actual code. Sorry.

dMajo · 2014-05-21 15:27

Bill Henning wrote: »

On the P1 it is four ported, it was also four port on the P2 design.

For the P16X64 Chip went to two-cycle instructions, and dual ports.

To execute an instruction, four cog memory accesses are required:

- read instruction
- read source value
- read destination value
- write result

Phil Pilgrim (PhiPi) wrote: »

The P1 cog RAM is single-ported. There is only one memory access per clock cycle.

-Phil

I believe the P1 design is with a dual port cog ram (1 read and 1 write, perhaps I am wrong but I remember Chip saying that in chip design the ports are not bidirectional).

As far as I know/remember, each instruction takes 6 stages

1) fetch instruction (read)
2) decode instruction
3) fetch source value (read)
4) fetch destination value (read)
5) execute instruction
6) optionally write result to destination (write, depending on NR)

Stages 5/6 are overlapped with stages 1/2 of the next instruction to execute, hence the 4 clock execution rate (except for the very first instruction, which I presume needs 6 clocks).
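
A quick Python sketch of the timing this description implies, assuming one clock per stage and a two-stage overlap between consecutive instructions. This only models the recollection in this post, not a confirmed description of the P1 pipeline, and the function name is made up.

```python
def p1_clocks(n_instructions, stages=6, overlapped=2):
    """Total clocks for a straight run of instructions in the overlapped model."""
    if n_instructions <= 0:
        return 0
    # First instruction pays all stages; each later one hides `overlapped`
    # of its stages behind the tail of the previous instruction.
    return stages + (n_instructions - 1) * (stages - overlapped)


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, p1_clocks(n))
```

For long runs the cost per instruction approaches 4 clocks, which matches the 4 clock execution rate mentioned above.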

Cluso99 · 2014-05-21 16:09

To correct some misconceptions...

On P1, all memory (cog & hub) is single port.
On old P2, hub was single port, cog was quad port.
On current P16X64A, hub is single port, cog is dual port, LUT is single port.

BUT, additional cog ram only needs to be single port !

See my "Case for more cog ram". Near the end of that thread I explain to Chip how to use the LUT as additional cog ram.
The sole reason cog ram is currently dual port is because it is register space.
For instance, if we reduced the registers from 512 (2KB) to 256, then the 256 (1KB) could be converted to 2KB cog instruction/data single port cog ram, and use precisely the same silicon area.
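
The arithmetic behind that trade can be sketched in Python. The 2:1 area ratio between dual port and single port RAM is an assumption taken from the claim above that 1KB of dual port converts to 2KB of single port in the same silicon area; the names are illustrative.

```python
BYTES_PER_LONG = 4
DUAL_TO_SINGLE_AREA_RATIO = 2  # assumption implied by the post, not a measured figure


def single_port_bytes_gained(registers_removed):
    """Single port bytes that fit in the area freed by removing dual port registers."""
    return registers_removed * BYTES_PER_LONG * DUAL_TO_SINGLE_AREA_RATIO


def total_cog_bytes(registers_kept, registers_removed):
    """Dual port register bytes kept plus the single port bytes gained."""
    return registers_kept * BYTES_PER_LONG + single_port_bytes_gained(registers_removed)


if __name__ == "__main__":
    print(single_port_bytes_gained(256))   # area of 256 registers reused as single port
    print(total_cog_bytes(256, 256))       # registers kept plus the new single port RAM
```

So going from 512 to 256 registers would, under this assumption, leave a cog with 1KB of register space plus 2KB of single port instruction/data RAM.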

Correct me if I am wrong, but isn't C CMM mode for tiny C code residing in cog only, ie no hub ?

Maybe it's time again to...
* Consider larger cog ram
* Not all cogs are equal - give a couple a lot more ram

Programs running from cog run at full speed, are fully deterministic, have no latency, have no jitter. This is only introduced by the hub.

Bill Henning · 2014-05-21 16:26

Cluso99 wrote: »

To correct some misconceptions...

On P1, all memory (cog & hub) is single port.
On old P2, hub was single port, cog was quad port.
On current P16X64A, hub is single port, cog is dual port, LUT is single port.

As wifey is fond of telling me "you are old and grey" (ie I am old and grey)

I had a brain fart, forgot P1 cog ram was single ported.

Cluso99 wrote: »

BUT, additional cog ram only needs to be single port !

See my "Case for more cog ram". Near the end of that thread I explain to Chip how to use the LUT as additional cog ram.
The sole reason cog ram is currently dual port is because it is register space.
For instance, if we reduced the registers from 512 (2KB) to 256, then the 256 (1KB) could be converted to 2KB cog instruction/data single port cog ram, and use precisely the same silicon area.

That would work well if we got PTRX & PTRY back to the LUT - however that would involve much screaming from some forum members.

Without PTRX & PTRY that code memory would not be able to have self modifying code in it - which also means no subroutines.

Cluso99 wrote: »

Correct me if I am wrong, but isn't C CMM mode for tiny C code residing in cog only, ie no hub ?

Your turn for a grey hair moment

CMM = Compressed Memory Model

Uses 16 bit opcodes in the hub, the kernel "expands" them before executing. Meant to allow more code in 32KB.

Cluso99 wrote: »

Maybe it's time again to...
* Consider larger cog ram
* Not all cogs are equal - give a couple a lot more ram

Programs running from cog run at full speed, are fully deterministic, have no latency, have no jitter. This is only introduced by the hub.

Given that we lost the SDRAM interface (and a lot of pins) I'd like to keep the hub as big as possible.

David Betz · 2014-05-21 16:30

Cluso99 wrote: »

Correct me if I am wrong, but isn't C CMM mode for tiny C code residing in cog only, ie no hub ?

In PropGCC, the CMM VM runs in a COG but CMM code runs out of hub memory. COG mode code runs in the COG but uses a stack in hub memory.

Cluso99 · 2014-05-21 16:42

Bill,
Why do you need self-modifying code in the additional cog memory when it is acceptable to not have it in LMM and hubexec modes? Seems an illogical argument.

PTRA/B and INDA/B can be achieved other ways if we cannot have them. While the workaround may use more instructions, it gives superior performance.

jazzed · 2014-05-21 16:42

Eric's CMM definition:

https://code.google.com/p/propgcc/source/browse/doc/CMM.txt

CMM will still be important with the new chip even with 512KB HUB RAM.

Cluso99 · 2014-05-21 16:44

Thanks for the CMM info.
So more cog memory would no doubt aid CMM mode too.

Bill Henning · 2014-05-21 16:50

Nope. Very logical argument.

JMPRET is used for subroutine calls for cog mode (remember, no LIFO stack, no CLUT stack)

Single port cog memory would require an additional cycle for every JMP/branch to write the destination address, so all branches would be 50% slower. Not worth it, better keep cog memory dual ported.
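
To put numbers on the 50% figure, here is a minimal Python sketch assuming 2 cycle instructions and one extra cycle per taken branch on single port memory. The whole-program estimate additionally assumes one taken branch in every X instructions, as in the spreadsheet model earlier in the thread; the function names are made up.

```python
def branch_slowdown(cycles_per_instr=2, extra_cycles=1):
    """Fractional slowdown of a single branch (0.5 means 50% slower)."""
    return extra_cycles / cycles_per_instr


def overall_slowdown(x, cycles_per_instr=2, extra_cycles=1):
    """Fractional slowdown of cog code with one taken branch per x instructions."""
    return extra_cycles / (x * cycles_per_instr)


if __name__ == "__main__":
    print(branch_slowdown())
    for x in (5, 8):
        print(x, overall_slowdown(x))
```

So each branch is 50% slower, while the effect on a whole program depends on how often branches are taken.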

FJMP/FCALL are horribly expensive in LMM modes (compared to hubexec), but do not need self modifying code for subroutines.

CALL in hubexec does not need self modifying code.

Using INDA/B (or X/Y) for stack with additional instructions slows things down a lot (compared to dual ported cog memory).

With hubexec (as long as there is a FIFO or some caching) we can have huge programs with good performance.

For drivers, 512 cog locations is fine, and besides, with FIFO drivers can stream to/from hub REALLY fast.

Cluso99 wrote: »

Bill,
Why do you need self-modifying code in the additional cog memory when it is acceptable to not have it in LMM and hubexec modes? Seems an illogical argument.

PTRA/B and INDA/B can be achieved other ways if we cannot have them. While the workaround may use more instructions, it gives superior performance.
