New Hub Scheme For Next Chip

potatohead · 2014-05-21 23:28

We will get hub exec, but not on the first FPGA release. Hub exec makes big code mindless to write. And it's fast.

Ok this is good for me. I trust the outcome will be refined. It has to date. Honestly, I just want it sane more than I want the very peak performance possible. So that's my vote while I'm outta here.

Re: Scrambled order, linear addressing.

If we can avoid this, we really should. What happens is downstream code gets slower and/or larger. We trade code size and efficiency in many cases for a few specific cases. IMHO, this isn't a smart trade. Or, if we keep code size reasonably close to what it would have been for linear addressing, then the data will have to be messy. This costs something no matter what.

Long ago, on the Apple ][ computer, this was actually done to save a little silicon. I think it was two or one chip. By funking up the addressing for the video buffers, a couple of chips could be removed from the BOM. The product of that was very bizarre, large, table driven and complex code needed to present graphics to the user. Recently, I just ended up writing one of those and compared to a linear addressing scheme, I was kind of amazed at what it took to work, and actually go fast. Never quite as fast as linear would have been, but plenty fast.

Non-linear access of this kind will have a lot of ripple effects that will negate the performance in a whole lot of cases.

Frankly, I don't want to answer those questions all the time either. Better things to do.

ozpropdev · 2014-05-21 23:47

Sounds great Chip!

Looking forward to "clicking" on my favourite button in Quartus.

Roy Eltham · 2014-05-21 23:53

jazzed,
Regarding advancing the PC: I was assuming that the rdbloc with ptrx would advance ptrx properly. Then branches would need to account for it.

May not be good enough, but I don't think it matters in the long run...

Phil Pilgrim (PhiPi) · 2014-05-21 23:54

Bill Henning wrote:

... however you are inducing a jitter by using 7 instead of 8 as cog instructions take two cycles.

Actually, 8 introduces jitter; 7 does not. Any n that's relatively prime to 16 will produce a jitter-free sequence.
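
The relatively-prime claim can be checked with a little modular arithmetic. The Python sketch below is only an illustration: `slot_sequence` is a made-up helper, and it models just the sequence of hub slots (mod 16) hit when each access lands `step` slots after the previous one, not real cog timing.

```python
from math import gcd

def slot_sequence(step, slots=16):
    """Slots visited when each access lands `step` slots after the last one."""
    return [(k * step) % slots for k in range(slots)]

seq7 = slot_sequence(7)
seq8 = slot_sequence(8)

# 7 is relatively prime to 16, so every slot is visited exactly once per 16 accesses.
print(gcd(7, 16), seq7)
# 8 shares a factor with 16, so only two slots are ever visited.
print(gcd(8, 16), sorted(set(seq8)))
```

Any odd step behaves like 7 here, since the odd numbers are exactly the values relatively prime to 16.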

potatohead wrote:

Re: Scrambled order, linear addressing. If we can avoid this, we really should.

I do not see why. The software will be totally order-agnostic. One order is no different from another as far as the code is concerned: The data is either there when you want it, or you have to wait for it, regardless of what the firing order is. The only difference is that performance peaks with certain firing orders in certain situations. I identified one possible order in what is likely to be a common situation and demonstrated its advantages there. What makes it nice in that case is that no FIFO is necessary, and the performance boost could quite likely reduce the need for Hubexec.

-Phil

jmg · 2014-05-22 00:03

cgracey wrote: »

That's a pretty neat idea for getting fast LMM. We'll probably need to stick with a simple ascending hub slot order to make the FIFO work smoothly..

If you flip to descending hub slot order, the most-common case of PTR++ next read, can be ready in 15 Cycles, instead of 17 Cycles.

Addit: oops - that helps LMM a few %, but breaks DMA, so a simple ascending hub slot order is important.
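
The 15 versus 17 cycle figures can be reproduced with a toy model. This Python sketch assumes 16 hub slots, one slot per clock, and that a cog cannot issue its next hub read sooner than 2 clocks after the previous one (the two-cycle instruction time mentioned earlier in the thread). `next_read_delay` is a hypothetical helper, not anything from the actual design.

```python
def next_read_delay(addr_delta, direction, min_gap=2, slots=16):
    """Clocks from one hub read until the slot for address+addr_delta comes around.

    direction=+1 models ascending slot order (the served address rises each
    clock), direction=-1 models descending order. min_gap is the assumed
    earliest clock at which the cog could issue another read.
    """
    delay = min_gap
    # Walk forward one clock at a time until the rotating window matches.
    while (direction * delay - addr_delta) % slots != 0:
        delay += 1
    return delay

print(next_read_delay(1, +1))  # ascending order, PTR++ read
print(next_read_delay(1, -1))  # descending order, PTR++ read
```

Under these assumptions the ascending case just misses the next slot and waits a full rotation plus one clock, while the descending case catches it one clock short of a full rotation.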

cgracey wrote: »

We will get hub exec, but not on the first FPGA release. Hub exec makes big code mindless to write. And it's fast.

Sounds good

cgracey wrote: »

Tonight I hope to get the simplex FIFO done.

How does it look in LUT and MHz on the FPGA ?

jmg · 2014-05-22 00:08

Phil Pilgrim (PhiPi) wrote: »

I do not see why. The software will be totally order-agnostic. One order is no different from another as far as the code is concerned: The data is either there when you want it, or you have to wait for it, regardless of what the firing order is.

Wow, I think you have totally missed that the LSB of the Address IS the firing order.
- so far from being agnostic, there will need to be (mind bending) order-shuffles performed by all the upstream tools, and this applies to ALL COGs.

In my Nibble-adder approach, so long as the memory WRITE pointer follows the same adder-rules as the read, this is hidden, but it is not hidden from any linear addressing.

Roy Eltham · 2014-05-22 00:15

Phil,
firing order matters when you do block read or fifo reads because those hit every window and read them all. So you would get the longs out of order and not be able to feed them in order to the pins/dacs at full clock speed (or even lower ones).

Phil Pilgrim (PhiPi) · 2014-05-22 00:19

jmg wrote:

Wow, I think you have totally missed that the LSB of the Address IS the firing order.

Wow. I think you have totally missed that it doesn't have to be.

Regardless of the firing order, the program results will be the same. Except for speed issues, the programmer does not have to know what the firing order is. The only thing that changes is how long the program has to wait for a slot in certain circumstances. The worst case will always be 16 clocks. The best case could be much less than that, depending upon circumstances and firing-order optimization. I identified a very common circumstance and a firing order that optimized speed in that circumstance.

-Phil

Phil Pilgrim (PhiPi) · 2014-05-22 00:23

Roy,

My comments are predicated on the notion that a FIFO will not be necessary or present. The idea is to optimize performance without one, because I think the FIFO has become an unnecessary complication and distraction. As far as block reads are concerned, there's no reason to assume that cog RAM has to be filled in linear order if the hub RAM is being read out of order. It could simply be filled in the same scrambled order in which the hub is read, resulting in perfectly-ordered data when the operation is complete. The same would apply to block writes to the hub.

-Phil
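
Roy's objection and Phil's reply can both be seen in a small simulation. This is a sketch only: the firing order used here (stepping by 7 modulo 16) is just an example permutation, and `hub`, `FIRING_ORDER`, and the two fill strategies are illustrative names rather than anything from the actual design.

```python
SLOTS = 16
# Example scrambled firing order: the slot served on each successive clock.
FIRING_ORDER = [(clock * 7) % SLOTS for clock in range(SLOTS)]

hub = [100 + addr for addr in range(SLOTS)]  # stand-in data, one long per slot

# Streaming or a naive block read: longs are consumed in arrival order.
arrival_order = [hub[slot] for slot in FIRING_ORDER]

# Address-matched block read: each long is stored at the cog index
# that matches the hub slot it came from.
cog = [None] * SLOTS
for slot in FIRING_ORDER:
    cog[slot] = hub[slot]

print(arrival_order == hub)  # False: the stream comes out scrambled
print(cog == hub)            # True: the filled block ends up in order
```

The block read can hide the scrambling because it has somewhere to put each long; a stream going straight to pins or DACs has no such reordering step.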

jmg · 2014-05-22 00:25

Phil Pilgrim (PhiPi) wrote: »

My comments are predicated on the notion that a FIFO will not be necessary or present. The idea is to optimize performance without one.

but that misses that the FIFO solves problems that SW cannot - like the LUT streaming of video.
So it is rather a leap backwards.

Phil Pilgrim (PhiPi) · 2014-05-22 00:31

jmg wrote:

but that misses that the FIFO solves problems that SW cannot

Show me an example where a FIFO is necessary, and I will try -- but may fail -- to show how it could be done without the FIFO.

-Phil

Roy Eltham · 2014-05-22 00:37

Phil,
Streaming longs from HUB to pins/dacs at 200MHz (or 100MHz) in order. The in-order firing is required for that to work, even if there is no FIFO.

Phil Pilgrim (PhiPi) · 2014-05-22 00:41

Roy,

There's no reason the order can't be changed in the hub to compensate pre-emptively.

-Phil

Roy Eltham · 2014-05-22 01:03

Phil, that's silly. You are suggesting that wav data, bitmap data, signal data all be scrambled in hub memory before sending? That's really super fun and easy... :P

It's completely impractical and downright painful to inflict on people. Seriously?!

cgracey · 2014-05-22 01:13

Phil Pilgrim (PhiPi) wrote: »

Roy,

There's no reason the order can't be changed in the hub to compensate pre-emptively.

-Phil

Change the data positions instead of the hub order. Neat!

Then, you'd read the LMM code in a funny pattern, probably in some straight-lined LMM executor that would cycle through the proper address sequence to hit the hub cycles on the nose. That would enable the hub order to remain ascending. You would only scramble LMM code positions. Not a big deal, if you wanted some fast LMM.

As long as there's a FIFO, though, there's no need to do this.

Roy Eltham · 2014-05-22 01:19

Chip, nice flip!

It would be a lot easier and feasible to have LMM code be reordered into the correct pattern than to change the hardware and require all data to be scrambled. Of course, all the branches have to be fixed up accordingly, but that's easy too.
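
As a rough illustration of what reordering the LMM code and fixing up the branches could look like on the tool side, here is a Python sketch. Everything in it is an assumption made for the example: the stride of 7, the 16-long page, the `(opcode, target)` tuples, and the helper names `physical` and `reorder`.

```python
SLOTS = 16
STEP = 7  # assumed fetch stride; any odd value permutes a 16-long page

def physical(logical):
    """Hub long address at which logical instruction number `logical` is stored."""
    page, offset = divmod(logical, SLOTS)
    return page * SLOTS + (offset * STEP) % SLOTS

def reorder(program):
    """Scatter a list of (opcode, branch_target_or_None) into fetch order.

    Branch targets are given as logical instruction numbers and are rewritten
    to the physical address the target instruction ends up at.
    """
    padded = len(program) + (-len(program)) % SLOTS  # round up to whole pages
    image = [("NOP", None)] * padded
    for logical, (opcode, target) in enumerate(program):
        fixed = None if target is None else physical(target)
        image[physical(logical)] = (opcode, fixed)
    return image

program = [("MOV", None), ("ADD", None), ("JMP", 1)] + [("NOP", None)] * 13
image = reorder(program)
print(image[0], image[7], image[14])
```

A real executor would also have to resume its address sequence at the right point after a branch, which this sketch does not model.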

cgracey · 2014-05-22 01:24

Roy Eltham wrote: »

Chip, nice flip!

It would be a lot easier and feasible to have LMM code be reordered into the correct pattern than to change the hardware and require all data to be scrambled. Of course, all the branches have to be fixed up accordingly, but that's easy too.

I think that's actually what Phil was trying to convey, but your response made me think otherwise, until I thought about it some more.

jmg · 2014-05-22 01:28

cgracey wrote: »

Change the data positions instead of the hub order. Neat!

Then, you'd read the LMM code in a funny pattern, probably in some straight-lined LMM executor that would cycle through the proper address sequence to hit the hub cycles on the nose. That would enable the hub order to remain ascending. You would only scramble LMM code positions. Not a big deal, if you wanted some fast LMM.

To do that the fastest, you would need a nibble adder option on the Incrementing pointer. (LMM-Add?)
Odd-N is relatively simple to do, and that would allow the FIFO to be used for Data or video.

Roy Eltham · 2014-05-22 01:31

The way I read Phil's posts, he wanted the order of HUB reads into cogs to change to allow for faster LMM without the FIFO.

His reply to me about re-ordering stuff in HUB memory was in regards to dealing with the hub cog access order being non-linear in his idea, and the re-ordering would allow full speed feeding of the data from HUB to pins/dacs.

However, re-ordering the code in HUB first accomplishes the same thing with a little compiler work to output the code in the alternate order, and keeps the actual hub access order linear.

Cluso99 · 2014-05-22 01:34

cgracey wrote: »

Cluso,

Using the 256-long LUT for cog code, ranging from addresses $200..$2FF would probably only need a few mux's to make work. No promises, but we'll investigate that down the line. It would be like internal-speed hub exec.

Thanks Chip.

In fact, as soon as you have DJZ, DJNZ, TJZ, TJNZ, TJS, TJNS, JP and JNP D,S/@ working, then...

If you add JMP and CALL #abs/@rel where call saves the return in a fixed register (say $1EF) where there is up to 17 immediate address bits (9 will do for now) (this is the GCC LR call required version), then...

We can test the hubexec mode running from the cog ram (at full cog speed).
We don't even need the LUT as extended cog ram to test it

We don't even need the PC to be increased from 9 bits to test it

And please don't worry about implementing this until after you have done an FPGA. It doesn't matter what the FPGA is missing.

jmg · 2014-05-22 02:07

Roy Eltham wrote: »

However, re-ordering the code in HUB first accomplishes the same thing with a little compiler work to output the code in the alternate order, and keeps the actual hub access order linear.

Yes, keeping memory linear is easiest, and to go with your 'a little compiler work', you also need something in HW to 'serve up the addresses' INC'd by (eg) 7, designed to only give a carry from the Nibble when all 16 addresses in a page have been scanned - Hence the nibble-adder. Can be part of an LMM-fetch opcode.
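
One possible reading of the nibble-adder, sketched in Python. This is an interpretation for illustration, not a description of real hardware: it assumes long-granular pointers, a walk that starts at nibble 0, and a carry into the upper bits at the moment the nibble wraps back to 0. `nibble_add` is a hypothetical name.

```python
def nibble_add(pointer, step=7):
    """Advance a long pointer by `step` within its 16-long page.

    Only the low nibble is added to; the carry into the upper bits is
    generated when the nibble returns to 0, i.e. after all 16 addresses
    in the page have been visited (true for any odd step starting at 0).
    """
    page, nibble = pointer >> 4, pointer & 0xF
    nibble = (nibble + step) & 0xF
    if nibble == 0:
        page += 1
    return (page << 4) | nibble

pointer = 0
visited = []
for _ in range(32):
    visited.append(pointer)
    pointer = nibble_add(pointer)

print(visited[:16])  # every address in page 0, in stride-7 order
print(visited[16:])  # then every address in page 1
```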

cgracey · 2014-05-22 03:33

jmg wrote: »

Yes, keeping memory linear is easiest, and to go with your 'a little compiler work', you also need something in HW to 'serve up the addresses' INC'd by (eg) 7, designed to only give a carry from the Nibble when all 16 addresses in a page have been scanned - Hence the nibble-adder. Can be part of an LMM-fetch opcode.

Maybe you could just code one LMM loop like this:

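' Unrolled LMM loop sketch: each RDLONG pair fetches two hub longs into the
' placeholder slots (i0, i1, ...) that execute right after it.
' step0..step15 are placeholders for the per-fetch pointer steps.
loop	RDLONG	i0,PTRA[step0++/--]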
	RDLONG	i1,PTRA[step1++/--]
i0	NOP
i1	NOP
	RDLONG	i2,PTRA[step2++/--]
	RDLONG	i3,PTRA[step3++/--]
i2	NOP
i3	NOP
	RDLONG	i4,PTRA[step4++/--]
	RDLONG	i5,PTRA[step5++/--]
i4	NOP
i5	NOP
	RDLONG	i6,PTRA[step6++/--]
	RDLONG	i7,PTRA[step7++/--]
i6	NOP
i7	NOP
	RDLONG	i8,PTRA[step8++/--]
	RDLONG	i9,PTRA[step9++/--]
i8	NOP
i9	NOP
	RDLONG	i10,PTRA[step10++/--]
	RDLONG	i11,PTRA[step11++/--]
i10	NOP
i11	NOP
	RDLONG	i12,PTRA[step12++/--]
	RDLONG	i13,PTRA[step13++/--]
i12	NOP
i13	NOP
	RDLONG	i14,PTRA[step14++/--]
	RDLONG	i15,PTRA[step15++/--]
i14	NOP
i15	NOP
	JMP	#loop

dMajo · 2014-05-22 05:01

Cluso99 wrote: »

Bill,
I am proposing the additional cog ram would run precisely the same as hubexec. The only difference is that the instruction/data is on cog and so it does not require the hub slot to run.
So no, there is no need for dual port (besides, that also requires additional S & D bits >9).

So, extended cog ram is better than hubexec and LMM in any implementation. There are absolutely no deficiencies over hubexec or LMM. Period !

Except that with hubexec/LMM you can run multiple instances of the same program/routine on multiple cogs from a single copy, whereas local code requires a copy per cog (and perhaps one more if the loader first has to load it into the hub so that it can then be copied to the cogs), so the same code occupies memory several times over. I also think that increasing local RAM will decrease hub RAM. IMHO this is only useful as a way of reusing the LUT (if at low/zero cost) where it is not otherwise needed.

Roy Eltham wrote: »

Phil, that's silly. You are suggesting that wav data, bitmap data, signal data all be scrambled in hub memory before sending? That's really super fun and easy... :P

It's completely impractical and downright painful to inflict on people. Seriously?!

cgracey wrote: »

Change the data positions instead of the hub order. Neat!

Then, you'd read the LMM code in a funny pattern, probably in some straight-lined LMM executor that would cycle through the proper address sequence to hit the hub cycles on the nose. That would enable the hub order to remain ascending. You would only scramble LMM code positions. Not a big deal, if you wanted some fast LMM.

As long as there's a FIFO, though, there's no need to do this.

If you consider the hub as just a virtualized/abstracted storage (like a cloud: you put in the data, you don't know which server/SAN it is stored on, but you are able to get it back identically), and if the FIFO is not strictly a FIFO (in the literal sense), and rd/wr-block is also non-linear on the hub interface while being linear on the cog/pin side ... in other words, if everything that interfaces to the hub (including, of course, binary generation by the tools) keeps the same order on the hub side of the tunnel, then this is transparent to the programmer. It's only a (clever) hardware trick, and I do not see anything wrong with it, especially if it saves some other complexity in the design.

BTW: I remember Phil proposing it long ago (and perhaps jmg also at different stage/functional block, but with same targets/results), now he has also proven it.

Chip's reversal of data order (and patterned indexing) to preserve the rotation slot order is the same as Phil's idea, viewed from the opposite side, thus IMHO the same rules should apply: as long as all this is transparent to the programmer, everything is good, whichever way it is done.

Bill Henning · 2014-05-22 06:49

True!

(great news about hubexec/fifo! THANKS!)

VGA/component video   3..5 pins
Kb, mouse             4 pins
serial, boot flash    6 pins
uSD card              4 pins (or 1 if we share boot flash SCK/MOSI/MISO)
                      ----
                      14..19 pins

26-14 = 12

Single chip self-hosted with a lot of memory can still happen! With 12 user I/O's readily available.

cgracey wrote: »

We can have SDRAM on the new chip, too. It would just clock at 100MHz (Fsys/2). It takes, as I recall, 22 pins for control and address and then 16 pins for data. That's 38, total, leaving 26 for other stuff. That's workable. With the hub FIFO's, you could stream words at Fsys/2 by setting the NCO to 1/2. Use a smart pin to output Fsys/2. There's everything you need. One cog should handle it just fine.
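
The pin budget in this exchange can be re-added in a few lines of Python. The 64-pin total is inferred from Chip's figures (38 SDRAM pins plus 26 left over), and the dictionary simply restates the peripheral list above, so treat this as a sanity check rather than a pinout.

```python
TOTAL_PINS = 64          # implied by 38 SDRAM pins + 26 left over
SDRAM_PINS = 22 + 16     # control/address + data

# (low, high) pin counts for each peripheral in the list above
peripherals = {
    "VGA/component video": (3, 5),
    "keyboard, mouse": (4, 4),
    "serial, boot flash": (6, 6),
    "uSD card": (1, 4),  # 1 if SCK/MOSI/MISO are shared with boot flash
}

after_sdram = TOTAL_PINS - SDRAM_PINS
low = sum(lo for lo, _ in peripherals.values())
high = sum(hi for _, hi in peripherals.values())

print(after_sdram)                            # pins left after SDRAM
print(low, high)                              # peripheral pin range
print(after_sdram - high, after_sdram - low)  # user I/O range
```

The 12 user I/Os quoted above correspond to the low end of the peripheral range; at the high end the same arithmetic leaves 7.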

Bill Henning · 2014-05-22 07:16

Very interesting! Branches would have to go to 8 byte boundaries, which is not a major problem, but may throw off the memory stride, depending on step size and pattern.

cgracey wrote: »

Maybe you could just code one LMM loop like this:

loop	RDLONG	i0,PTRA[step0++/--]
	RDLONG	i1,PTRA[step1++/--]
i0	NOP
i1	NOP
	RDLONG	i2,PTRA[step2++/--]
	RDLONG	i3,PTRA[step3++/--]
i2	NOP
i3	NOP
	RDLONG	i4,PTRA[step4++/--]
	RDLONG	i5,PTRA[step5++/--]
i4	NOP
i5	NOP
	RDLONG	i6,PTRA[step6++/--]
	RDLONG	i7,PTRA[step7++/--]
i6	NOP
i7	NOP
	RDLONG	i8,PTRA[step8++/--]
	RDLONG	i9,PTRA[step9++/--]
i8	NOP
i9	NOP
	RDLONG	i10,PTRA[step10++/--]
	RDLONG	i11,PTRA[step11++/--]
i10	NOP
i11	NOP
	RDLONG	i12,PTRA[step12++/--]
	RDLONG	i13,PTRA[step13++/--]
i12	NOP
i13	NOP
	RDLONG	i14,PTRA[step14++/--]
	RDLONG	i15,PTRA[step15++/--]
i14	NOP
i15	NOP
	JMP	#loop

jazzed · 2014-05-22 08:17

You know what?

Everything we do requires a cost analysis. There are real costs and opportunity costs.

Personally, I'm not interested in wasting time (opportunity costs) trying to optimize something until kingdom come.

All of these potential quirks are really putting me off.

Kerry S · 2014-05-22 08:28

jazzed wrote: »

All of these potential quirks are really putting me off.

+1

The simplicity of the P1 is going away quickly. Soon it will be easier to grab 16 PIC32 and program them to talk on a SPI bus than to deal with the ever growing quirks and hoops to jump through just to write a program.

Please... 8 cores, single channel memory (4 clocks per instruction) w/ 64K each + 64K+ of shared simple access hub ram. Bingo 50MIPS each @ 200MHz which is a 2.5x improvement over the P1, plenty of CORE memory for programs, plus Prop style easy data sharing via a tried and true HUB...

All of these wild, amazing, off the wall things can go into the P3...
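
The throughput figures in this post follow from simple division. In the Python sketch below, the P1 baseline of 80 MHz at 4 clocks per instruction is an assumption added here for comparison; it is chosen because it reproduces the 2.5x figure quoted above.

```python
def mips_per_core(clock_mhz, clocks_per_instruction):
    """Instruction rate of one core in MIPS, ignoring hub waits and stalls."""
    return clock_mhz / clocks_per_instruction

proposed = mips_per_core(200, 4)  # the proposal: 200MHz, 4 clocks per instruction
p1 = mips_per_core(80, 4)         # assumed P1 baseline: 80 MHz, 4 clocks per instruction

print(proposed, p1, proposed / p1)
```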

jazzed · 2014-05-22 09:14

jazzed wrote: »

All of these potential quirks are really putting me off.

Thinking about it after another cup of joe, that statement was probably a bit harsh or simply premature.

I think that good non-quirky solutions can be found eventually ... before kingdom come of course.

A very simple preliminary bit file is needed before anything else quirks or no quirks.

Phil Pilgrim (PhiPi) · 2014-05-22 11:00

I think it's always a good exercise to explore the software consequences of any hardware decision, especially when the question is, "What can we remove, and how much can we do with less?" Sometimes with less, there's also less to get in the way. The P1 is like a basic Lego Technics set: you can build almost anything with it. What we don't want to end up with in the P2 is the Lego Star Wars Edition, where the prefabbed pieces leave less to the imagination and creativity. Just give us the most basic hardware building blocks and the freedom to create with them, and we programmers will make the P2 soar to new heights.

-Phil

Baggers · 2014-05-22 11:51

Chip, don't forget when you do the next FPGA update, to hook up the PS/2 pins to the emulated prop so we can start adding keyboard input too
