Ruminations while awaiting an FPGA image (was "Hello...... Anyone out there?")

jmg · 2014-07-30 18:35

Heater. wrote: »

Did we?

The current Prop II design may not be a drop in replacement for the Prop 1 software wise. Code will need "porting". The last time I expressed concerns about code compatibility I was assured that Propeller 1 PASM could be tweaked to run on a Propeller II without a total rewrite and re-architecting. This is very important in terms of reusing much existing code, the Propeller is particularly in need of this as it has no hardware peripherals and requires that code in order to be useful to people out of the box.

Thinking a little about this code-portability, which to me seems very important.

Spin & C hide the ASM level differences, and there may be a case for a P2 Spin directive, that gives a choice of
* Slower and designed to be largely timing compatible with P1
( could be used in conjunction with a COG clock scaler ? )
* As fast as possible.

For ASM level porting, some macros may make this easier, and a script or two could automate some aspects.
It will not be 100% drop-in-line-for-line, but it should be an aspiration to allow one ASM source file build for P1 or P2 using conditional directives.

Dave Hein · 2014-07-30 18:51

jmg wrote: »

Spin & C hide the ASM level differences, and there may be a case for a P2 Spin directive, that gives a choice of
* Slower and designed to be largely timing compatible with P1
( could be used in conjunction with a COG clock scaler ? )
* As fast as possible.

I don't see the reason for a slow mode just to provide timing compatibility. Some code changes will be required to port P1 Spin code to P2, and any timing requirements should also be handled by source code changes. Given that P1 Spin executes slowly there are few objects that depend on the slow speed. The only objects I can think of that depend on the slow speed are the I2C objects. However, these can be accommodated by adding delays using waitcnt similar to the way the PASM I2C drivers do.

jmg · 2014-07-30 23:52

Dave Hein wrote: »

I don't see the reason for a slow mode just to provide timing compatibility. Some code changes will be required to port P1 Spin code to P2, and any timing requirements should also be handled by source code changes. Given that P1 Spin executes slowly there are few objects that depend on the slow speed. The only objects I can think of that depend on the slow speed are the I2C objects. However, these can be accommodated by adding delays using waitcnt similar to the way the PASM I2C drivers do.

The idea is to avoid those "handled by source code changes" and "adding delays using waitcnt" caveats, so that code 'just works'.
Quite a few subtle errors can creep into old code that was only every tested at slower speeds.

Heater. · 2014-07-31 02:42

Bah, spin code that is clock speed or interpreter efficiency dependent and breaks if it is run faster is broken code and should be fixed. What if that code you write for your 80MHz Propeller board does not work on my boards clocked at 104MHz ? waitcnt is there for a reason, it allows you to hit timing requirements despite variations in clock rate or the unkown nature of the byte code interpreter run times.

Properly written Spin or other high level language code should "just work" on the P2 save for that which actually accesses pins or counters etc And Perhaps even that can be abstracted out by the compiler so that Spin 1 code can be compiled or translated to Spin 2 code.

PASM code on the other hand is totally different. We can expect breakage there due to instruction set and architectural changes.

By analogy: Back in the day Intel had the PL/M language for 8 bit 8080/8085. That code could then be run on the new 16 bit 8086 when it came. And even 8080 assembler programs could be translated to 8086 assembler code by the conv86 utility. This all worked so well that I managed to get tens of thousands of lines of 8080 assembler up and running on 8086. It was even 8086 plugged into the 8085 sockets of 8 bit boards via a little adapter circuit.

Hmmm...I think I have to make a PII board in a DIP 40 form factor as a plug in replacement for DIP Props on some boards!

Cluso99 · 2014-07-31 06:53

I think a number of people are forgetting that the P2 is not a replacement for the P1. There will be a lot of designs that will require the P1 because a P2 is not suitable - eg low power, small package (QFP / QFN).

MJB · 2014-07-31 13:58

Heater. wrote: »

Hmmm...I think I have to make a PII board in a DIP 40 form factor as a plug in replacement for DIP Props on some boards!

and an optional top connector for the rest of the IO-pins
great

Heater. · 2014-07-31 15:20

MJB,

Yes, great. In the same way that they make a Propeller Mini board that is a plug in replacement for Stamps I believe or at least usable like a Stamp. It does not bring out all the pins. Not every application needs all the pins.

For a cheap and cheerful board to get familiar with the P2 a 40 pin DIP might be just fine. It could even have more I/O pins available than the Propeller 1 as we can use the XTAL, BOE, programming, EEPROM pins. Perhaps even two of the power pins. Not exactly plug compatible in that case but good enough.

I did used to dream of a Prop II in 64 pin DIP package like the old 68000!

jmg · 2014-07-31 17:06

Heater. wrote: »

I did used to dream of a Prop II in 64 pin DIP package like the old 68000!

Can you still buy sockets that size ?

Bill Henning · 2014-07-31 18:28

http://www.digikey.com/product-detail/en/SA649000/ED3036-ND/3313528

jmg wrote: »

Can you still buy sockets that size ?

Peter Jakacki · 2014-07-31 21:10

Heater. wrote: »

I did used to dream of a Prop II in 64 pin DIP package like the old 68000!

Wouldn't you want it octal socket compatible too!? (6.3VAC compatiblity too would probably help.)

Cluso99 · 2014-07-31 23:04

Peter Jakacki wrote: »

Wouldn't you want it octal socket compatible too!? (6.3VAC compatiblity too would probably help.)

Yes, and he would like a pilot light on as well

Heater. · 2014-07-31 23:43

jmg,

Can you still buy sockets that size ?

Sure: http://www.jameco.com/1/1/3013-6100-64d-socket-ic-64-pin-machine-tooled-low-profile.html

Which is good to know as I acquired a couple of 68000 microprocessors recently. They are crying out to me to get them working.

Peter,

Wouldn't you want it octal socket compatible too!? (6.3VAC compatiblity too would probably help.)

...
Cluso,

Yes, and he would like a pilot light on as well

Chuckle. Hey, are you guys ripping on me:)

I did once see that cosy warm orange tube heater like glow coming out of a semiconductor chip. It was an EPROM that I had managed to plug in backwards glowing at me brightly through it's quartz window!

I have yet to make a Propeller glow...

evanh · 2014-08-01 02:45

Heater. wrote: »

I'm no expert but you saying that does not convince me.

I used your very own example of extra ports. Not that it needs to but doubling the number of ports for double the number of Cogs entirely matches bandwidth on a per Cog basis.

If it were so straight forward how come my quad core Intel does not have four buses out to RAM? How come Amdahl's law exists? How come the world speaks so much of the "von neumann bottleneck". How come XMOS don't do this?

Well, both Intel and AMD do nominally use at least two way switching, ie: "Dual channel DDR3". The top end Xeons do have 4 ports I believe. GPUs are commonly using 4 way switches. It's a very good question as to why bigger crosspoint switches aren't more generally used. I guess, aside from the pin count on the multi-processor chip, it wouldn't be cheap having to have 16 or 32 or 64 banks of DDR DRAM.

The Prop has the advantage of RAM being internal so the huge pin count vanishes.

Heater. · 2014-08-01 05:38

evanh,

...doubling the number of ports for double the number of Cogs entirely matches bandwidth on a per Cog basis.

True. Well, almost true. There are still the cases where processors are fighting over the same address location at the same time. At which point they really have to take turns.

Anyway, how does the number of transistors and amount of silicon real estate scale with COGS in your plan? Or scale with memory size? Or both? Linearly, in a square law?

Lawson · 2014-08-01 10:54

Heater. wrote: »

evanh,

True. Well, almost true. There are still the cases where processors are fighting over the same address location at the same time. At which point they really have to take turns.

Anyway, how does the number of transistors and amount of silicon real estate scale with COGS in your plan? Or scale with memory size? Or both? Linearly, in a square law?

I expect the cross-point switch will scale N*M*B^2 in area, where N is cogs, M is ram blocks and B is buss width. If it ever changes to a multi-level cross-point, I it'll scale slower, but still at least N*log(M)*B^2 or something. Thankfully I expect the cross-point switch to be sparse on the transistor layer. If the chip has 4 or more routing layers, the "big math" logic should be able to hide in the gaps.

Marty

evanh · 2014-08-01 16:07

Lawson wrote: »

I expect the cross-point switch will scale N*M*B^2 in area, where N is cogs, M is ram blocks and B is buss width....

I'll take that as given. B is something like 50 bits ... Chip's current design will be something like 16 x 16 x 50^2 = 640000 transistors just for the central switch, plus a decent amount of control logic around that also. I believe Chip was saying it's roughly the same size as a whole Cog in area.

Both N and M double together in our example - going from 16 Cogs and RAMs up to 32 of each ... 32 x 32 x 50^2 = 2560000 transistors would then be needed.

evanh · 2014-08-01 16:10

Heater. wrote: »

There are still the cases where processors are fighting over the same address location at the same time. At which point they really have to take turns.

Prop2 Hub is sequenced to always take turns so the Cogs never clash. Very much like the Prop1 but multiplied. You could say the Prop1 and Prop2 are sequencing just the same but the difference is an 8x1 switch vs 16x16.

I don't know what is the norm for AMD/Intel/nVidia. It's possible they've taken the simpler approach and functionally combined all addressing to make the channels run in parallel, ie: They're not really crosspoints at all. It probably doesn't make a great deal of difference in terms of power consumption when driving large external devices like that. So, the question of would a crosspoint help is more down to how serialised is the task being executed? Or how many tasks are running concurrently?

PS: I guess, technically, 8x1 cannot be called a crosspoint. That's just a plain 8-way mux.

Heater. · 2014-08-01 22:58

It is possible that I'm totally misunderstanding the new HUB scheme for the P2. Please someone put me out of my misery if so.

Let me summarise my problem:

1) evanh expressed a desire for a 32 COG Propeller in the future.

2) I suggested that would not be a good idea as it halves HUB bandwidth so COG performance would suck.

3) evanh responded that was not so as more "ports" would restore HUB bandwidth. Sounds reasonable, more ports/pathways more switches, must boost performance right?

BUT here is my issue. Looking at the "egg beater" diagram that Chip posted here: http://forums.parallax.com/showthread.php/155675-New-Hub-Scheme-For-Next-Chip we see that:

1) When a COG makes a HUB access it may gets it immediately if the low 4 bits of the address match the block of HUB RAM available to it at that moment. If not it has to wait for the HUB mechanism to "tick" around until the required HUB block is available. The maximum time it has to wait is 15 of those "hub tick" periods. The average wait time is 7.5 "hub ticks". Clear enough.

2) If I were to redraw that picture with 32 COGs it would be apparent that a COG gets access immediately if the low 5 bits of the address match the block of HUB RAM available at the time. If not it has to wait, as above, but the maximum wait time is now 31 "hub ticks" and the average is 15.5 "hub ticks"

So, I conclude that doubling the number of COGs doubles the average HUB access latency and COG performance would indeed suck in such a system.

This seems paradoxical because when going from 16 to 32 COGs we have added a corresponding number of HUB blocks and ports and those little HUB planetary gear things in the egg beater diagram.

So despite at least doubling the amount of hardware we have still halved the COG-HUB performance.

Am I missing something here? What?

Note:

I didn't like to say "HUB bandwidth" above. Seems to me that theoretical peak COG-HUB bandwidth is the same in the 16 or 32 COG case. That is when your accesses are sequential and you can always hit the required HUB block.

During normal random HUB access it's that latency that is slugging everything.

evanh · 2014-08-01 23:21

That all sounds right. Though, I'll point out this is the first time you've indicated latency as the concern. Other examples are buffering and caching being mechanisms added to up the bandwidth while having no effect or detrimental effect on latency. Indeed, these will be needed in conjunction with the crosspoint switch to achieve the increased bandwidth in the Prop2.

On that note it should also be pointed out that to make use of the higher bandwidth of the Prop2 over the Prop1 will require different processing models and uses different instructions. Extending this new model to handle even longer latency seems perfectly reasonable to me.

Heater. · 2014-08-01 23:52

evanh,

That is true, it is the first time I raised "latency" as the issue. I seem to recall that in discussions following the egg beater announcement we just talked about random access vs sequential access or some such.

Anyway I had not really thought through your arguments and had to go back an review that diagram to straighten things out in my mind.

And yes, you are right to achieve the higher bandwidth one would have to try and arrange block transfers to and from HUB rather than just random access to code and data.

To complicate matters Chip seems to be adding some FIFO's into those "planetary gear things" or some caching mechanism at least to help speed block transfers, sequential access, along a bit. I have not really followed the details of that much.

Then there are those that point out that code access tends to have rather more sequential addressing than random addressing. So code execution will be able to hit higher HUB access rates.

At that point things become rather more fuzzy and I at least cannot reason about it very well. Like modern Intel processors with their branch prediction, out of order execution, multiple execution units, multiple cache levels and whatever else actually determining how long any piece of code will take to execute becomes impossible.

My gut tells me that no matter how careful you are to arrange your code and data access for egg beater friendliness that latency is still going to be dominant. Code execution does have a lot of jumps and calls to non-sequential addresses, any processing will be using variables in random locations at random times.

Ergo doubling the number of COGs will halve the performance per COG.

evanh · 2014-08-02 01:05

We're gonna find out soon enough. Purely random accesses will suffer for sure.

Heater. · 2014-08-02 01:34

I'm having a hard time figuring out if the egg beater has any benefit over the old "mono-pole" HUB for random access to code and data at all.

At a quick glance it looks like you have 16 times as many ways into HUB than than before. Must be better right?

But wait, with the mono-pole hub we have an average latency of 7.5 hub ticks for any given HUB access. That's the same as for the egg-beater as I said above.

There is no gain.

What about more sequential access?

Chip is putting in some FIFO's or some such. Presumably that hoovers up data as it flies past no matter if the COG has asked for it or not. Presumably the addresses it hoovers up from are nearby the last access the HUB made.

So whilst stepping through HUB addresses sequentially, even if not actually in sync with the egg beater, things will be a lot faster.

But wait, that will work nicely for code executing from COG and sequentially accessing data in HUB, but what about code executing from HUB, Spin byte codes and LMM? In that case we have code and data accesses going on at the same time. The code may be straight line and the data may be sequentially accessed but when you interleave them both together you have effectively randomised HUB access again.

Looks to be like there is not going to be any benefit for the egg beater over the old mono-pole HUB for most code. Only for those few drivers and such that can run from COG and shift blocks of data in and out of HUB.

evanh · 2014-08-02 02:00

If I'm not mistaken Chip is intending to use the I/O DMA buffer for instruction cache when processing Hubexec. Which means streaming I/O disables Hubexec.

The data FIFO is a separate buffer, it will share Hub bursts with the instruction cache(DMA buffer). So, with Hubexec there is likely a certain level Harvard architecture at play.

Heater. · 2014-08-02 03:35

All of which I have not caught up with yet or did not understand as it flew by.

The way you describe it it all sounds far more complex than I might ever want to bother with.

I guess we will see.

Bill Henning · 2014-08-02 06:18

Actually, there is a huge performance boost for the following cases:

1) cog drivers with HUGE hub read/write bandwidth (display, logic analyzer etc)

2) hubexec code in general (C, etc) which will use the non-FIFO RDxxx/WRxxx that do not disturb the fifo (code access becomes essentially cog speed, random non-code hub access avg. 4 cycles)

3) VM's (Spin, Zog) using the FIFO for code acess, regular hub acess avg 4 cycles, but code fetches esentially free

The degenerate "slow as without eggbeater" case you refer to would only happen if:

- hubexec video driver trying to use FIFO for hub reads

- hubexec logic analyzer trying to use FIFO

So for everyone who is not silly enough to try to write HD video drivers in hubexec code (or logic analyzer sampling code in hubexec) the FIFO is a huge win.

So as someone who tends to write extremely high performance drivers, vm's etc... I LOVE THE EGGBEATER WITH FIFO!

Heater. wrote: »

I'm having a hard time figuring out if the egg beater has any benefit over the old "mono-pole" HUB for random access to code and data at all.

At a quick glance it looks like you have 16 times as many ways into HUB than than before. Must be better right?

But wait, with the mono-pole hub we have an average latency of 7.5 hub ticks for any given HUB access. That's the same as for the egg-beater as I said above.

There is no gain.

What about more sequential access?

Chip is putting in some FIFO's or some such. Presumably that hoovers up data as it flies past no matter if the COG has asked for it or not. Presumably the addresses it hoovers up from are nearby the last access the HUB made.

So whilst stepping through HUB addresses sequentially, even if not actually in sync with the egg beater, things will be a lot faster.

But wait, that will work nicely for code executing from COG and sequentially accessing data in HUB, but what about code executing from HUB, Spin byte codes and LMM? In that case we have code and data accesses going on at the same time. The code may be straight line and the data may be sequentially accessed but when you interleave them both together you have effectively randomised HUB access again.

Looks to be like there is not going to be any benefit for the egg beater over the old mono-pole HUB for most code. Only for those few drivers and such that can run from COG and shift blocks of data in and out of HUB.

cgracey · 2014-08-02 09:25

The cog FIFO is always either in RDxxxx or WRxxxx mode. It smooths out the transfers to and from hub memory so that the cog can continuously transfer a byte, word, or long per clock. As you all know, there is an initial latency in getting the FIFO established so that, starting on the next clock, it can convey data continuously, in any sequence of bytes, words, and/or longs. There are no more alignment issues for words and longs. The entire memory is addressed using a 19-bit byte-level address. Words and longs can start anywhere.

This last week, I had to make a detour to accomplish some Prop1 work which will be announced next week. It's going to open doors to people innovating on the current design.

Heater. · 2014-08-02 10:04

Chip,

Call be a dumb *** but I'm still not following. Can I presume that the FIFO is sucking up data ahead of the point that was just accessed by the COG on the assumption that the next access will likely be the next thing further up in memory?

That is to say that truly random access sees very little benefit from the FIFO? Yes, no, maybe?

"Prop1 work" - What a tease!

cgracey · 2014-08-02 10:29

Heater. wrote: »

Chip,

Call be a dumb *** but I'm still not following. Can I presume that the FIFO is sucking up data ahead of the point that was just accessed by the COG on the assumption that the next access will likely be the next thing further up in memory?

That is to say that truly random access sees very little benefit from the FIFO? Yes, no, maybe?

"Prop1 work" - What a tease!

That's right. The FIFO begins loading from hub on a RDFAST (after writing any lingering WRFAST data). Then, when you do RFBYTE/RFWORD/RFLONG's, it just gives you the next byte/word/long out of the FIFO, stepping the FIFO when a whole long is consumed. The WRFAST instruction writes any lingering WRFAST data and then takes byte/words/longs that you give it via WFBYTE/WFLONG/WFWORD instructions, writing as many as are in the FIFO on the next sync. It's pretty mindless to use.

And yes, random accesses get no benefit from the FIFO. For those, we still have the old RDxxxx/WRxxxx instructions, which now have screwy timing.

The great part about RDFAST/WRFAST is that it enables a potential fire hydrant of data to be conveyed via simple hardware, without the ongoing attendance of instructions.

Rayman · 2014-08-02 14:43

Heater. wrote: »

"Prop1 work" - What a tease!

I hope it's a P1+ in same package. But, that's very unlikely... Maybe PropGCC release?

Bill Henning · 2014-08-02 15:34

I am guessing it will be a 64 I/O P1B, perhaps 200MHz/50MIPS per cog.

Ruminations while awaiting an FPGA image (was "Hello...... Anyone out there?")

Comments