We will get hub exec, but not on the first FPGA release. Hub exec makes big code mindless to write. And it's fast.
Ok this is good for me. I trust the outcome will be refined. It has to date. Honestly, I just want it sane more than I want the very peak performance possible. So that's my vote while I'm outta here.
Re: Scrambled order, linear addressing.
If we can avoid this, we really should. What happens is downstream code gets slower and/or larger. We trade code size and efficiency in many cases for a few specific cases. IMHO, this isn't a smart trade. Or, if we keep code size reasonably close to what it would have been with linear addressing, then the data will have to be messy. This costs something no matter what.
Long ago, on the Apple ][ computer, this was actually done to save a little silicon. I think it was one or two chips. By funking up the addressing for the video buffers, a couple of chips could be removed from the BOM. The product of that was the very bizarre, large, table-driven and complex code needed to present graphics to the user. Recently I ended up writing one of those myself, and compared to a linear addressing scheme, I was kind of amazed at what it took to make it work, and actually go fast. Never quite as fast as linear would have been, but plenty fast.
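For a concrete sense of what that table-driven code looks like, here is the classic Apple ][ hi-res row-address calculation as a small C sketch (written from memory, so treat the constants as illustrative):

    #include <stdio.h>

    /* Base address of hi-res row y (0..191) on page 1 at $2000.  Rows are NOT
       simply 40 bytes apart; they interleave across the page in three nested
       strides, which is why table-driven code was the norm. */
    static unsigned hires_row_base(unsigned y)
    {
        return 0x2000
             + (y & 0x07) * 0x400             /* line within a group of 8    */
             + ((y >> 3) & 0x07) * 0x80       /* group within a 64-line band */
             + (y >> 6) * 0x28;               /* band (top/middle/bottom)    */
    }

    int main(void)
    {
        for (unsigned y = 0; y < 8; y++)
            printf("row %u -> $%04X\n", y, hires_row_base(y));
        return 0;
    }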
Non-linear access of this kind will have a lot of ripple effects that will negate the performance in a whole lot of cases.
Frankly, I don't want to answer those questions all the time either. Better things to do.
Re: Scrambled order, linear addressing. If we can avoid this, we really should.
I do not see why. The software will be totally order-agnostic. One order is no different from another as far as the code is concerned: The data is either there when you want it, or you have to wait for it, regardless of what the firing order is. The only difference is that performance peaks with certain firing orders in certain situations. I identified one possible order in what is likely to be a common situation and demonstrated its advantages there. What makes it nice in that case is that no FIFO is necessary, and the performance boost could quite likely reduce the need for Hubexec.
I do not see why. The software will be totally order-agnostic. One order is no different from another as far as the code is concerned: The data is either there when you want it, or you have to wait for it, regardless of what the firing order is.
Wow, I think you have totally missed that the LSB of the Address IS the firing order.
- so far from being agnostic, there will need to be (mind bending) order-shuffles performed by all the upstream tools, and this applies to ALL COGs.
In my Nibble-adder approach, so long as the memory WRITE pointer follows the same adder-rules as the read, this is hidden, but it is not hidden from any linear addressing.
Phil,
Firing order matters when you do block reads or FIFO reads, because those hit every window and read them all. So you would get the longs out of order and not be able to feed them in order to the pins/dacs at full clock speed (or even lower ones).
Wow, I think you have totally missed that the LSB of the Address IS the firing order.
Wow. I think you have totally missed that it doesn't have to be.
Regardless of the firing order, the program results will be the same. Except for speed issues, the programmer does not have to know what the firing order is. The only thing that changes is how long the program has to wait for a slot in certain circumstances. The worst case will always be 16 clocks. The best case could be much less than that, depending upon circumstances and firing-order optimization. I identified a very common circumstance and a firing order that optimized speed in that circumstance.
My comments are predicated on the notion that a FIFO will not be necessary or present. The idea is to optimize performance without one, because I think the FIFO has become an unnecessary complication and distraction. As far as block reads are concerned, there's no reason to assume that cog RAM has to be filled in linear order if the hub RAM is being read out of order. It could simply be filled in the same scrambled order in which the hub is read, resulting in perfectly-ordered data when the operation is complete. The same would apply to block writes to the hub.
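To make that concrete, here is a minimal C sketch (the firing_order permutation is invented for illustration, not the real slot sequence): reading the hub in scrambled order and writing cog RAM with the same index leaves the block in perfect linear order when the rotation completes.

    #include <stdio.h>

    #define SLOTS 16

    int main(void)
    {
        /* hypothetical firing order; any permutation of 0..15 works */
        const int firing_order[SLOTS] = { 0, 7, 14, 5, 12, 3, 10, 1,
                                          8, 15, 6, 13, 4, 11, 2, 9 };
        int hub[SLOTS], cog[SLOTS];

        for (int i = 0; i < SLOTS; i++)
            hub[i] = 100 + i;                 /* ordered data sitting in hub RAM */

        for (int i = 0; i < SLOTS; i++) {     /* one rotation of the hub */
            int slot = firing_order[i];       /* hub long that fires on this tick */
            cog[slot] = hub[slot];            /* write cog RAM with the SAME index */
        }

        for (int i = 0; i < SLOTS; i++)       /* cog RAM ends up in linear order */
            printf("%d ", cog[i]);
        printf("\n");
        return 0;
    }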
Phil,
Streaming longs from HUB to pins/dacs at 200 MHz (or 100 MHz) in order. In-order firing is required for that to work, even if there is no FIFO.
Phil, that's silly. You are suggesting that wav data, bitmap data, signal data all be scrambled in hub memory before sending? That's really super fun and easy... :P
It's completely impractical and downright painful to inflict on people. Seriously?!
There's no reason the order can't be changed in the hub to compensate pre-emptively.
-Phil
Change the data positions instead of the hub order. Neat!
Then, you'd read the LMM code in a funny pattern, probably in some straight-lined LMM executor that would cycle through the proper address sequence to hit the hub cycles on the nose. That would enable the hub order to remain ascending. You would only scramble LMM code positions. Not a big deal, if you wanted some fast LMM.
As long as there's a FIFO, though, there's no need to do this.
It would be a lot easier and more feasible to have LMM code be reordered into the correct pattern than to change the hardware and require all data to be scrambled. Of course, all the branches have to be fixed up accordingly, but that's easy too.
I think that's actually what Phil was trying to convey, but your response made me think otherwise, until I thought about it some more.
Change the data positions instead of the hub order. Neat!
Then, you'd read the LMM code in a funny pattern, probably in some straight-lined LMM executor that would cycle through the proper address sequence to hit the hub cycles on the nose. That would enable the hub order to remain ascending. You would only scramble LMM code positions. Not a big deal, if you wanted some fast LMM.
To do that the fastest, you would need a nibble adder option on the Incrementing pointer. (LMM-Add?)
Odd-N is relatively simple to do, and that would allow the FIFO to be used for Data or video.
The way I read Phil's posts, he wanted the order of HUB reads into cogs to change to allow for faster LMM without the FIFO.
His reply to me about re-ordering stuff in HUB memory was in regards to dealing with the hub cog access order being non-linear in his idea, and the re-ordering would allow full speed feeding of the data from HUB to pins/dacs.
However, re-ordering the code in HUB first accomplishes the same thing with a little compiler work to output the code in the alternate order, and keeps the actual hub access order linear.
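A rough sketch of that "little compiler work" (the seq permutation below is purely illustrative, not the actual hub firing order): the tool stores instruction i of the linear program at the hub slot the i-th fetch will hit, and pushes branch targets through the same map.

    #include <stdio.h>

    #define N 16

    int main(void)
    {
        /* order in which the cog will fetch longs from a 16-long page
           (illustrative permutation only) */
        const int seq[N] = { 0, 7, 14, 5, 12, 3, 10, 1,
                             8, 15, 6, 13, 4, 11, 2, 9 };
        int program[N], hub[N], remap[N];

        for (int i = 0; i < N; i++)
            program[i] = 1000 + i;            /* "instructions" in source order */

        for (int i = 0; i < N; i++) {
            hub[seq[i]] = program[i];         /* store the i-th instruction where
                                                 the i-th fetch will land */
            remap[i] = seq[i];                /* old branch target i -> new hub address */
        }

        /* the cog fetches hub[seq[0]], hub[seq[1]], ... and sees the program in
           source order; a branch to old address t is emitted as a branch to remap[t] */
        for (int i = 0; i < N; i++)
            printf("fetch %2d -> %d (branch target map %2d -> %2d)\n",
                   i, hub[seq[i]], i, remap[i]);
        return 0;
    }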
Using the 256-long LUT for cog code, ranging from addresses $200..$2FF, would probably only need a few muxes to make it work. No promises, but we'll investigate that down the line. It would be like internal-speed hub exec.
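As a toy model of what those few muxes would amount to (my sketch, not Chip's actual design), instruction fetch would simply select its source by address range:

    #include <stdio.h>
    #include <stdint.h>

    #define COG_TOP 0x200                     /* cog RAM: $000..$1FF */
    #define LUT_TOP 0x300                     /* LUT:     $200..$2FF */

    static uint32_t cog_ram[COG_TOP];
    static uint32_t lut_ram[LUT_TOP - COG_TOP];

    static uint32_t fetch(uint32_t pc)        /* pc is a 10-bit cog address */
    {
        return (pc < COG_TOP) ? cog_ram[pc]               /* normal cog execution */
                              : lut_ram[pc - COG_TOP];    /* execute from the LUT */
    }

    int main(void)
    {
        cog_ram[0x1FF] = 0x11111111u;
        lut_ram[0x000] = 0x22222222u;
        printf("$1FF -> %08X, $200 -> %08X\n",
               (unsigned)fetch(0x1FF), (unsigned)fetch(0x200));
        return 0;
    }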
Thanks Chip.
In fact, as soon as you have DJZ, DJNZ, TJZ, TJNZ, TJS, TJNS, JP and JNP D,S/@ working, then...
If you add JMP and CALL #abs/@rel, where CALL saves the return address in a fixed register (say $1EF) and allows up to 17 immediate address bits (9 will do for now; this is the LR call version that GCC requires), then...
We can test the hubexec mode running from the cog ram (at full cog speed).
We don't even need the LUT as extended cog ram to test it.
We don't even need the PC to be increased from 9 bits to test it.
And please don't worry about implementing this until after you have done an FPGA. It doesn't matter what the FPGA is missing.
However, re-ordering the code in HUB first accomplishes the same thing with a little compiler work to output the code in the alternate order, and keeps the actual hub access order linear.
Yes, keeping memory linear is easiest, and to go with your 'a little compiler work', you also need something in HW to 'serve up the addresses' INC'd by (eg) 7, designed to give a carry out of the nibble only when all 16 addresses in a page have been scanned - hence the nibble-adder. It can be part of an LMM-fetch opcode.
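A rough C model of that stepping (the carry detection below is my reading of the post, not a spec): the low nibble of the long address advances by a fixed odd step, so all 16 longs of a page are visited exactly once before a carry moves on to the next page.

    #include <stdio.h>

    int main(void)
    {
        const unsigned step = 7;              /* any odd step visits all 16 slots */
        unsigned page = 0, nib = 0;

        for (int i = 0; i < 48; i++) {        /* three pages worth of fetches */
            unsigned addr = (page << 4) | nib;
            printf("%02X ", addr);            /* long address presented to the hub */

            unsigned next = (nib + step) & 0xF;
            if (next == 0)                    /* wrapped back around: all 16 */
                page++;                       /* addresses scanned, so carry */
            nib = next;                       /* into the page bits          */
        }
        printf("\n");
        return 0;
    }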
Bill,
I am proposing the additional cog ram would run precisely the same as hubexec. The only difference is that the instruction/data is on cog and so it does not require the hub slot to run.
So no, there is no need for dual port (besides, that also requires additional S & D bits >9).
So, extended cog ram is better than hubexec and LMM in any implementation. There are absolutely no deficiencies over hubexec or LMM. Period !
Except that with hubexec/LMM you can run two instances of the same program/routine (code) on multiple cogs, while keeping it local will require multiple copies of the code (and perhaps more than that, if the loader first has to load it into the hub so it can be copied to the cogs), thus using a multiple of the memory.... And I think that increasing local RAM will decrease the hub RAM. IMHO this can be useful only by re-using the LUT this way (if at low/zero cost) where it is not otherwise needed.
Phil, that's silly. You are suggesting that wav data, bitmap data, signal data all be scrambled in hub memory before sending? That's really super fun and easy... :P
It's completely impractical and down right painful to inflict on people. Seriously?!
Change the data positions instead of the hub order. Neat!
Then, you'd read the LMM code in a funny pattern, probably in some straight-lined LMM executor that would cycle through the proper address sequence to hit the hub cycles on the nose. That would enable the hub order to remain ascending. You would only scramble LMM code positions. Not a big deal, if you wanted some fast LMM.
As long as there's a FIFO, though, there's no need to do this.
If you consider the hub just as virtualized/abstracted storage (like a cloud: you put in the data, you don't know which server/SAN they are stored on, but you can get them back identically), and if the FIFO is not strictly a FIFO (in the literal sense), and rd/wr-block is also not linear on the hub interface while it can be to the cog/pins... in other words, if everything that interfaces to the hub (and of course the binaries generated by the tools) keeps the same order on the hub side of the tunnel, then this is transparent to the programmer. It's only a (clever) hardware trick, and I see nothing wrong in it... especially if it saves some other complexity in the design.
BTW: I remember Phil proposing it long ago (and perhaps jmg too, at a different stage/functional block, but with the same targets/results), and now he has also proven it.
Chip's reversal of data order (and patterned indexing) to preserve the rotation slot order is the same as Phil's idea, viewed from the opposite side, so IMHO the same rules should apply: as long as all this is transparent to the programmer, everything is good... whichever way it is done.
We can have SDRAM on the new chip, too. It would just clock at 100MHz (Fsys/2). It takes, as I recall, 22 pins for control and address and then 16 pins for data. That's 38, total, leaving 26 for other stuff. That's workable. With the hub FIFO's, you could stream words at Fsys/2 by setting the NCO to 1/2. Use a smart pin to output Fsys/2. There's everything you need. One cog should handle it just fine.
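A quick check of the pin arithmetic and of the "NCO set to 1/2" streaming rate; the accumulator-with-rollover model is my assumption of how the NCO gates transfers, and the 64 I/O total is implied by 38 used plus 26 left:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int ctrl_addr = 22, data = 16, total_io = 64;     /* 64 = 38 used + 26 left */
        printf("SDRAM pins: %d, left over: %d\n",
               ctrl_addr + data, total_io - (ctrl_addr + data));    /* 38, 26 */

        uint32_t acc = 0, inc = 0x80000000u;  /* "NCO set to 1/2" */
        for (int clk = 1; clk <= 8; clk++) {
            uint32_t prev = acc;
            acc += inc;
            if (acc < prev)                   /* rollover => one word transferred */
                printf("transfer on clock %d\n", clk);    /* every 2nd clock: Fsys/2 */
        }
        return 0;
    }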
Very interesting! Branches would have to go to 8 byte boundaries, which is not a major problem, but may throw off the memory stride, depending on step size and pattern.
All of these potential quirks are really putting me off.
+1
The simplicity of the P1 is going away quickly. Soon it will be easier to grab 16 PIC32s and program them to talk on a SPI bus than to deal with the ever growing quirks and hoops to jump through just to write a program.
Please... 8 cores, single-channel memory (4 clocks per instruction) w/ 64K each + 64K+ of shared simple-access hub RAM. Bingo: 50 MIPS each @ 200 MHz, which is a 2.5x improvement over the P1, plenty of CORE memory for programs, plus Prop-style easy data sharing via a tried and true HUB...
All of these wild, amazing, off the wall things can go into the P3...
I think it's always a good exercise to explore the software consequences of any hardware decision, especially when the question is, "What can we remove, and how much can we do with less?" Sometimes with less, there's also less to get in the way. The P1 is like a basic Lego Technic set: you can build almost anything with it. What we don't want to end up with in the P2 is the Lego Star Wars edition, where the prefabbed pieces leave less to the imagination and creativity. Just give us the most basic hardware building blocks and the freedom to create with them, and we programmers will make the P2 soar to new heights.
Comments
Looking forward to "clicking" on my favourite button in Quartus.
Regarding advancing the PC: I was assuming that rdbloc with ptrx would advance ptrx properly. Then branches would need to account for it.
May not be good enough, but I don't think it matters in the long run...
If you flip to a descending hub slot order, the most common case of a PTR++ next read can be ready in 15 cycles instead of 17 cycles.
Addit: oops - that helps LMM a few %, but breaks DMA, so a simple ascending hub slot order is important.
Sounds good
How does it look in LUTs and MHz on the FPGA?
That misses the fact that the FIFO solves problems that SW cannot - like the LUT streaming of video.
So it is rather a leap backwards.
Maybe you could just code one LMM loop like this:
(great news about hubexec/fifo! THANKS!)
VGA/component video 3..5 pins
Kb, mouse 4 pins
serial, boot flash 6 pins
uSD card 4 pins (or 1 if we share boot flash SCK/MOSI/MISO)
----
14..19 pins
26-14 = 12
Single chip self-hosted with lots of memory can still happen! With 12 user I/Os readily available.
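A quick tally of the numbers above; the low end assumes the shared SCK/MOSI/MISO uSD option, and the 26 spare pins come from the SDRAM post earlier in the thread:

    #include <stdio.h>

    int main(void)
    {
        int video_lo = 3, video_hi = 5;
        int kb_mouse = 4, serial_flash = 6;
        int usd_lo = 1, usd_hi = 4;           /* 1 pin if SCK/MOSI/MISO are shared */
        int free_pins = 26;                   /* left over after SDRAM (see above) */

        int lo = video_lo + kb_mouse + serial_flash + usd_lo;   /* 14 */
        int hi = video_hi + kb_mouse + serial_flash + usd_hi;   /* 19 */
        printf("pins needed: %d..%d, user I/Os left: %d..%d\n",
               lo, hi, free_pins - hi, free_pins - lo);         /* 7..12 */
        return 0;
    }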
Everything we do requires a cost analysis. There are real costs and opportunity costs.
Personally, I'm not interested in wasting time (opportunity costs) trying to optimize something until kingdom come.
Thinking about it after another cup of joe, that statement was probably a bit harsh or simply premature.
I think that good non-quirky solutions can be found eventually ... before kingdom come of course.
A very simple preliminary bit file is needed before anything else, quirks or no quirks.
-Phil