I guess fetching a stream of instructions helps a lot when you are executing linear code.
That is not my code. It takes a turn this way or that every handful of instructions. Look at FullDuplexSerial. Can you even find four instructions in a "row" there?
My prediction is that such a FIFO only has minimal impact on execution speed.
Unless it is more than a FIFO but more like a cache and you can keep your code in cache and keep your data locality high.
What are the HW FIFO use cases? Is the FIFO essentially a generalization of the video FIFO in P1? Would it be used to stream out video to the DACs?
Yes, the HW FIFO is used to stream video (optionally via the LUT) to the pins or DACs at fSys/N (any N, NCO derived).
Simple full fSys speed transfer from HUB to DACs is a defining feature.
It can also stream captured pin states directly to HUB, for logic analyser and instrumentation type apps.
The SW-FIFO use is a bonus of having the HW already there.
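For anyone unsure what "fSys/N, NCO derived" means in practice, here is a rough Python sketch of a generic numerically controlled oscillator. It is only an illustration of the general technique: the 32-bit accumulator width and the names `nco_increment` and `count_transfers` are my assumptions, not the actual P2 register layout.

```python
# Generic NCO sketch (illustrative only; the 32-bit width is an assumption,
# not a statement about the real P2 hardware).
ACC_BITS = 32
ACC_MOD = 1 << ACC_BITS


def nco_increment(n):
    """Frequency word that makes the accumulator overflow about once every n clocks."""
    return ACC_MOD // n


def count_transfers(n, clocks):
    """Simulate `clocks` system clocks; each accumulator overflow is one FIFO transfer."""
    acc = 0
    inc = nco_increment(n)
    transfers = 0
    for _ in range(clocks):
        acc += inc
        if acc >= ACC_MOD:   # carry out of the accumulator
            acc -= ACC_MOD
            transfers += 1
    return transfers


if __name__ == "__main__":
    for n in (1, 3, 4, 7):
        print(n, count_transfers(n, 8400))
```

When N is not a power of two the increment is truncated, so the long-run rate is very slightly below fSys/N in this sketch.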
On the P1 it is four-ported; it was also four-ported on the P2 design.
The LMM interpreter, including the FCACHE and FLIB areas, is 2KB.
What is large is the generated C code, and all the required C libraries, and data structures.
Hope this helps.
Bill,
Thank you. That explains a lot. The only C I have done on the propeller are things with my son for his P1 powered robot. All of my stuff has been assembly getting my drivers figured out. Have not gotten to the main program which, unfortunately, will require C or Spin since it will not fit into a cog. With gigs of ram it is easy to forget how much of a hog C is... Ahh for Delphi on the Propeller!
My ears are starting to ring with the echoes of creeping featuritis, with the same people pushing for more, more, more. I fear that the FIFO is making things way too complicated. Why not just adjust the hub's address "firing order" to optimize software-mediated streaming and call it good? And forget hubexec; LMM is good enough, especially with the new hub bandwidth. Otherwise, this thing will never get finished. Sorry, Chip, but it's deja vu all over again.
I was comparing one cog to one LPC.
So if you want to say 200MIPS across 16 cores, I'll respond with 768MIPS across 16 LPC1114's.
You can respond like that. I think it's nuts. Who is going to build a system with 16 LPC1114's on it? And then get them tightly integrated with whatever communications they need? That would kill your performance immediately anyway. Not to mention practical issues of PCB space etc.
A single chip solution like the Propeller wins hands down.
Statistically, about one in eight to ten instructions will be a jump taken (or call taken).
Perhaps. I don't believe it. I'll have to compile my FullDuplexSerial and FFT C source to assembler and check before I can refute it though.
I crunch numbers.
Let's do that. See above.
I also don't believe anyone is going to write a compiler that makes use of any FIFO if it requires special instructions to deal with. Yes, that is just a belief as I'm not a compiler writer.
Yes, this is true of compiled code as well. That's why I believe that the FIFO will turn out to be of only limited benefit for code execution. It will have to be dumped frequently. However, it will be useful for streaming applications.
No, the P1 cog RAM is single-ported. That's why it takes four system clocks for each instruction: one to fetch the instruction, two to fetch the operands, one to write the result.
Sorry I did not specify, I was thinking of my FullDuplexSerial_ht that is written in C and runs native in a COG.
I have to check but I think heater_fft spends more time jumping than running in a straight line:)
It would be interesting to know how frequently a program needs to branch before the FIFO actually hurts more than it helps because of the RDINIT required at each branch.
Without the RDINIT, each hub reference would take 8 clock cycles... which is the same as the average RDINIT time, so I don't think it is possible for the FIFO to hurt.
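A quick sketch of that break-even argument in Python. The assumption (mine, not stated in the post) is that without the FIFO every instruction pays the 8-cycle average hub wait plus the 2-cycle execute, while with the FIFO only the branch pays the 8-cycle RDINIT reload. The function names are made up for illustration.

```python
EXEC_CYCLES = 2    # cycles per instruction, as used elsewhere in the thread
AVG_HUB_WAIT = 8   # average hub / RDINIT latency in cycles


def with_fifo(x):
    """x instructions between taken branches, one FIFO reload per branch."""
    return x * EXEC_CYCLES + AVG_HUB_WAIT


def without_fifo(x):
    """Assumption: every instruction fetch waits on the hub individually."""
    return x * (AVG_HUB_WAIT + EXEC_CYCLES)


if __name__ == "__main__":
    for x in (1, 2, 5, 8):
        print(x, with_fifo(x), without_fifo(x))
```

Under these assumptions the two are equal only when every single instruction is a taken branch, and the FIFO wins for any longer run.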
My ears are starting to ring with the echoes of creeping featuritis, with the same people pushing for more, more, more. I fear that the FIFO is making things way too complicated.
No, the FIFO allows users access to the full HUB bandwidth.
HW-FIFO especially gives performance SW cannot reach. Chip has fSys/N Video streaming from HUB, with optional LUT, via this HW-FIFO.
Why not just adjust the hub's address "firing order" to optimize software-mediated streaming and call it good?
You mean shuffle the spokes? That would break video streaming, and besides, who decides the "new firing order"? Sounds like a lot of interaction and fish-hooks.
Would it? You could still do full-speed block transfers and stream video out of cog RAM.
You forgot the LUT?
I'm not seeing how shuffling the spokes does anything for all COGS, apart from adding another variable.
Spoke order is rather locked to the memory LSB, and any shuffle of that might help one COG, but harm others.
NOTES:
- The model still needs correcting: it does not currently account for the time taken by the FJMP and/or FCALL primitive, so LMM results (all versions) will be slower in real life compared to hubexec.
- A 0 spacer FIFO LMM is extremely unlikely; 1 or 2 spacers will be required. All figures are shown for completeness.
Conclusions:
fifo'd hubexec is 2.5x-3x+ faster than fifo'd LMM
fifo'd hubexec is 6x+ faster than non-fifo'd LMM
fifo'd LMM would be 2x-3x faster than non-fifo'd LMM
One jump/call taken in every 8 longs

clock freq      200
cycl per ins    2
avg hub cycl    8
frq jmp/call    0.125
in N instr      8

                clock cycles   x faster than simpl lmm   % of cog speed
cog cycl        16             9.0                       100.00%
fifo hubexec    24             6.0                       66.67%
fifo lmm 0      40             3.6                       40.00%
fifo lmm 1      56             2.6                       28.57%
fifo lmm 2      72             2.0                       22.22%
simpl lmm       144            1.0                       11.11%

One jump/call taken in every 5 longs

cycl per ins    2
avg hub cycl    8
frq jmp/call    0.2
in N instr      5

                clock cycles   x faster than simpl lmm   % of cog speed
cog cycl        10             9.6                       100.00%
fifo hubexec    18             5.3                       55.56%
fifo lmm 0      28             3.4                       35.71%
fifo lmm 1      38             2.5                       26.32%
fifo lmm 2      48             2.0                       20.83%
simpl lmm       96             1.0                       10.42%
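
The two ratio columns in those tables can be re-derived from the clock cycle column alone. A small Python sketch, taking the cycle counts exactly as posted: the speedup is the simpl lmm cycle count divided by the row's cycle count, and the percentage is the cog cycle count divided by the row's cycle count. The helper name `derived_columns` is just for illustration.

```python
# Clock cycle column copied from the two tables in the post.
CYCLES_1_IN_8 = {"cog cycl": 16, "fifo hubexec": 24, "fifo lmm 0": 40,
                 "fifo lmm 1": 56, "fifo lmm 2": 72, "simpl lmm": 144}
CYCLES_1_IN_5 = {"cog cycl": 10, "fifo hubexec": 18, "fifo lmm 0": 28,
                 "fifo lmm 1": 38, "fifo lmm 2": 48, "simpl lmm": 96}


def derived_columns(cycles):
    """Return {row: (x faster than simpl lmm, % of cog speed)}."""
    baseline_lmm = cycles["simpl lmm"]
    baseline_cog = cycles["cog cycl"]
    return {name: (round(baseline_lmm / c, 1), round(100.0 * baseline_cog / c, 2))
            for name, c in cycles.items()}


if __name__ == "__main__":
    for table in (CYCLES_1_IN_8, CYCLES_1_IN_5):
        for name, (speedup, pct) in derived_columns(table).items():
            print(f"{name:14s} {speedup:4.1f} {pct:7.2f}%")
        print()
```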
The P1 cog RAM is single-ported. There is only one memory access per clock cycle.
-Phil
I believe the P1 design is with a dual port cog ram (1 read and 1 write, perhaps I am wrong but I remember Chip saying that in chip design the ports are not bidirectional).
As far as I know/remember, each instruction takes 6 stages:
1) fetch instruction (read)
2) decode instruction
3) fetch source value (read)
4) fetch destination value (read)
5) execute instruction
6) eventually write result to destination (write - based on nr)
Stages 5/6 are overlapped with stages 1/2 of the next instruction, hence the 4 clock execution rate (except for the very first instruction, which I presume needs 6 clocks).
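To make the overlap arithmetic concrete, here is a minimal Python sketch of the timing as described in this post (6 stages, a new instruction issued every 4 clocks). It only models what the poster describes and says nothing about how the P1 silicon is really organised; the names are illustrative.

```python
STAGES = 6          # stages per instruction, per the description above
ISSUE_INTERVAL = 4  # stages 5/6 overlap stages 1/2 of the next instruction


def total_clocks(n_instructions):
    """Clocks to run n back-to-back instructions under the described overlap."""
    if n_instructions <= 0:
        return 0
    # The first instruction takes all 6 clocks; each later one adds 4.
    return STAGES + (n_instructions - 1) * ISSUE_INTERVAL


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, total_clocks(n))
```

For long runs the cost approaches 4 clocks per instruction, which is the familiar P1 figure.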
On P1, all memory (cog & hub) is single port.
On old P2, hub was single port, cog was quad port.
On current P16X64A, hub is single port, cog is dual port, LUT is single port.
BUT, additional cog ram only needs to be single port!
See my "Case for more cog ram". Near the end of that thread I explain to Chip how to use the LUT as additional cog ram.
The sole reason cog ram is currently dual port is because it is register space.
For instance, if we reduced the registers from 512 (2KB) to 256, then the 256 (1KB) could be converted to 2KB cog instruction/data single port cog ram, and use precisely the same silicon area.
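The arithmetic behind that trade, as a small Python sketch. It assumes, as the post implies, that a dual port bit costs about twice the area of a single port bit; that 2:1 ratio and the function name `repartition` are assumptions for illustration, not measured figures.

```python
BYTES_PER_LONG = 4


def repartition(total_regs=512, kept_regs=256, dual_to_single_area_ratio=2):
    """Return (dual port registers kept, single port longs gained) for equal area.

    Assumption: one dual port bit occupies `dual_to_single_area_ratio` times
    the area of one single port bit.
    """
    freed_dual_bytes = (total_regs - kept_regs) * BYTES_PER_LONG
    single_port_bytes = freed_dual_bytes * dual_to_single_area_ratio
    return kept_regs, single_port_bytes // BYTES_PER_LONG


if __name__ == "__main__":
    regs, extra_longs = repartition()
    print(regs, "registers +", extra_longs, "single port longs =", regs + extra_longs)
```

With the defaults that is 256 registers plus 512 single port longs, i.e. 768 longs in the area that currently holds 512.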
Correct me if I am wrong, but isn't C CMM mode for tiny C code residing in cog only, i.e. no hub?
Maybe it's time again to...
* Consider larger cog ram
* Not all cogs are equal - give a couple a lot more ram
Programs running from cog run at full speed, are fully deterministic, have no latency, have no jitter. Latency and jitter are only introduced by the hub.
As wifey is fond of telling me "you are old and grey" (ie I am old and grey)
I had a brain fart, forgot P1 cog ram was single ported.
That would work well if we got PTRX & PTRY back to the LUT - however that would involve much screaming from some forum members.
Without PTRX & PTRY that code memory would not be able to have self modifying code in it - which also means no subroutines.
Bill,
Why do you need self-modifying code in the additional cog memory when it is acceptable to not have it in LMM and hubexec modes? That seems an illogical argument.
PTRA/B and INDA/B can be achieved other ways if we cannot have them. While the workaround may use more instructions, it still gives superior performance.
JMPRET is used for subroutine calls for cog mode (remember, no LIFO stack, no CLUT stack)
Single port cog memory would require an additional cycle for every JMP/branch to write the destination address, so all branches would be 50% slower. Not worth it, better keep cog memory dual ported.
FJMP/FCALL are horribly expensive in LMM modes (compared to hubexec), but do not need self modifying code for subroutines.
CALL in hubexec does not need self modifying code.
Using INDA/B (or X/Y) for stack with additional instructions slows things down a lot (compared to dual ported cog memory).
With hubexec (as long as there is a FIFO or some caching) we can have huge programs with good performance.
For drivers, 512 cog locations is fine, and besides, with FIFO drivers can stream to/from hub REALLY fast.
Hand crafted small driver vs. compiler generated code.
Write it in C, look at compiled code, analyze that - I did such analysis.
Thanks to everyone for the Prop design lessons.
+1
FFT should show a good speedup, so should Dhrystone, Whetstone, inverse kinematics, anything computation heavy, etc.
A nice large cache would be far better, but FIFO will give a decent boost.
Here is another example - assume that one in five instructions is a branch taken.
Four would execute in 8 cycles, and the branch (on average) in 2+8 cycles (8 for the FIFO reload), for a total of 18 cycles. Cog code would do it in 10.
FIFO'd LMM in about 40 cycles, non-fifo'd LMM in about 80 cycles.
For compiled code (assumes 2 cycles per instruction, for every random hub reference add 8 cycle average)
if for every X instructions there is a FJMP or FCALL taken (jumps not taken don't count)
cog only code
X * 2
simple lmm estimate is
X * 16 cycles
lmm with fifo estimate is
X * 4 + 8 cycles (zero spacers needed for RDFLONG) .. X * 8 + 8 (two spacers needed)
hubexec with fifo estimate is
X * 2 + 8 cycles
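Those estimates are easy to put into a few lines of Python for anyone who wants to plug in their own numbers. This is only a sketch of the formulas as posted. The one spacer case (X * 6 + 8) is my interpolation between the zero and two spacer formulas, although it does match the 56 and 38 cycle rows in the tables. Note also that the tabulated simpl lmm figures (144 and 96) are 16 cycles higher than X * 16, so the spreadsheet presumably includes a term that is not listed here. Function names are illustrative.

```python
EXEC_CYCLES = 2   # cycles per instruction
AVG_HUB_WAIT = 8  # average cost of a random hub reference


def cog_cycles(x):
    """Cog only code: X * 2."""
    return x * EXEC_CYCLES


def simple_lmm_cycles(x):
    """Simple LMM estimate as posted: X * 16."""
    return x * 16


def fifo_lmm_cycles(x, spacers):
    """LMM with FIFO: X * 4 + 8 (0 spacers) up to X * 8 + 8 (2 spacers).

    The 1 spacer value is interpolated (an assumption).
    """
    if spacers not in (0, 1, 2):
        raise ValueError("spacers must be 0, 1 or 2")
    return x * (4 + EXEC_CYCLES * spacers) + AVG_HUB_WAIT


def hubexec_cycles(x):
    """Hubexec with FIFO: X * 2 + 8."""
    return x * EXEC_CYCLES + AVG_HUB_WAIT


if __name__ == "__main__":
    for x in (5, 8, 16, 32):
        print(x, cog_cycles(x), hubexec_cycles(x),
              [fifo_lmm_cycles(x, s) for s in (0, 1, 2)], simple_lmm_cycles(x))
```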
Sounds good.
Now, if those "smart pins" are taking care of protocols like UART, SPI, I2C etc why on earth do we need 16 COGs?
Why not just use that 1 dollar ARM which is cheaper, smaller, faster and supports all those peripherals anyway?
You have convinced me. The P2 is doomed.
I do use ARM's, PIC's, AVR's etc for projects where they fit better than a prop (usually for cost reasons).
We need 16 cogs to load random Obex objects of course
Actually I'd have been quite happy with a 4 cog 4 task per cog P2 optimized design, 16 cogs is a nice luxury "just in case".
I doubt the smart pins will absorb SPI, I2C, certainly they will help, but not absorb.
FYI, I bit bang uarts in pasm, not C, for higher baud rates
Edit: Never mind. This is a spreadsheet model not based on actual code. Sorry.
The idea is we look at code, and plug in the numbers, and the spreadsheet will make a prediction that will be close "on average".
One example above assumes one jump/call taken in every 8 longs,
the other assumes one in every five longs.
The one in five is pretty much the worst case for code that does something useful.
Integer compute heavy code will have far fewer jumps taken.
Jumps not taken do not count as jumps.
There will be outlying degenerate cases that perform better - and some that perform worse - but this model can give a good indication.
Last time I looked at gcc output, I think it was about one FJMP in every eight longs "on average"
I'd be very interested in loop body results (dhrystone, fft, etc) for how many jumps taken in N longs, as it would provide more data for the model.
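If someone has an execution trace of a loop body, the "one jump taken in N longs" figure could be measured with something like the Python sketch below. The trace format (a list of program counter values in long units, one per executed instruction) and the function name are hypothetical. A jump counts as taken whenever the next PC is not the current PC plus one, so jumps not taken are ignored, as the model requires.

```python
def instructions_per_taken_jump(pc_trace):
    """Average number of executed instructions per taken jump/call.

    pc_trace: executed program counter values in long units (hypothetical format).
    Returns None if no taken jump is observed.
    Caveat: a jump whose target happens to be PC + 1 is not detected.
    """
    taken = sum(1 for cur, nxt in zip(pc_trace, pc_trace[1:]) if nxt != cur + 1)
    if taken == 0:
        return None
    return len(pc_trace) / taken


if __name__ == "__main__":
    # An 8 instruction loop body executed four times (three observed jumps back).
    trace = list(range(8)) * 4
    n = instructions_per_taken_jump(trace)
    print("instructions per taken jump:", n, " frq jmp/call:", 1.0 / n)
```

The reciprocal of the result is the "frq jmp/call" input of the spreadsheet model.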
Your turn for a grey hair moment
CMM = Compressed Memory Model
Uses 16 bit opcodes in the hub, the kernel "expands" them before executing. Meant to allow more code in 32KB.
Given that we lost the SDRAM interface (and a lot of pins) I'd like to keep the hub as big as possible.
Eric's CMM definition:
https://code.google.com/p/propgcc/source/browse/doc/CMM.txt
CMM will still be important with the new chip even with 512KB HUB RAM.
So more cog memory would no doubt aid CMM mode too.