I understand this kind of memory interface is common in GPU land where throughput per watt is a really big deal.
If we go back to the ugly hub discussion, we all struggled with speed, determinism, etc...
You mentioned a random scheme and its implications for code. The summary conclusion is that people can code for the worst case where determinism is very important, and not worry about it where code only needs to be fast.
P2 has a timer interrupt which can resolve this nicely.
None of that is as easy as P1, but done the way it currently is, cog portability is still very much intact, meaning the parts we like most remain possible.
A COG is still very much a COG, which was not true of the various slot-access schemes, and it does not have the problems associated with things like mooch.
We do have a COG access interrupt that does depend on a COG number, so a minor edit will be needed there, but that is likely true for any hub code needed anyway.
For known and consistent tasks, like a video driver, this scheme will deliver very high throughput without the drawbacks inherent in the slow P1 scheme and the various access-slot schemes.
A quick sync to the egg beater leaves the program aware of its access addresses and access times. From there it is all about how data is arranged.
Or, ignore that, use the FIFO and/or streamer, and you will know your max times. Those are by nature more conservative than real times, and that will indicate when to use another cog.
Nobody got what they wanted on this.
But it is fast, consistent, and preserves cog portability.
Can't wait to explore it some. I suspect how we arrange data will play out in interesting ways.
And I am holding judgement back, until we try it. Roy has a lot of experience in related areas and would not have suggested it to Chip as a solution if it were horrible.
What I think and expect is that we will think a bit differently than we did on P1. As that plays out, we will do the same thing we did on P1: find the sweet-spot cases and publish those, along with objects that will not differ too much from P1 objects in terms of ease of use.
They will differ a bit from P1 in that creating them may be tougher. But we get a lot in return for that being true.
As long as we get a very high degree of cog portability, I believe the end result will be worth it.
It was your arguments related to the Tasker, and how the cog is the object target, that drive most of these opinions.
I have no idea about GPUs, but looking at the shelves full of graphics cards in the local PC superstore, I see that they have huge heat sinks and fans. They don't seem to worry about watts much. Different story on mobile devices, of course.
I seem to remember I did once say something about a random HUB access scheme. I was comparing it to the early networking world, where "token ring" was supposed to be an organized and scheduled way for nodes to access the network medium in an orderly manner, versus Ethernet, which was a free-for-all random bash at the net until you get in. It turns out the random, non-deterministic chaos of Ethernet could achieve larger overall throughput and was easier to manage and more robust, at the cost of determinism.
With the "eggbeater" scheme we don't need to add extra randomness. That is provided by the impossibility of knowing what all the peer COGS are wanting to do with HUB RAM at any given time.
At the end of the day, if you optimize for overall greatest execution rate or throughput with multiple competing actors, COGS, then you have to give up all hope of determinism. Like Ethernet.
Also, at the end of the day, I think I'm all for it. If code in a COG wants to be deterministic on an instruction-by-instruction basis, then it just has to hide itself away in its COG for the duration.
The timer and other interrupts will of course help in hitting the timing targets you have, but at a much coarser grain size than the instruction cycle counting of bit-banging from a COG.
Something doesn't look right to me with the hub timing. Suppose a cog calls:
RDLONG reg0, addr0    ' addr0 => $200
RDLONG reg1, addr1    ' addr1 => $201
The first instruction will stall until the cog is aligned with bank 0 (low nibble matches).
But what does the second instruction do? If the hub is rotating every clock cycle, then by the time that the second instruction executes (two clock cycles later), the cog is aligned with bank 2, causing the instruction to stall for 15 more clock cycles.
It seems that the only way that memory HUBOPs can take advantage of the eggbeater is if cog memory banks are rotated every two clock cycles. But Chip clearly says per-clock access is occurring, and several others seem to be under the same impression (at least in the context of the instruction streamer).
So, is the hub switching banks twice as fast as the instructions effectively execute, meaning that optimized HUBOPs would have to do things like:
RDLONG reg0, addr0    ' Read 5 longs. There might be a stall
RDLONG reg2, addr2    ' on the first RDLONG, and there will be
RDLONG reg4, addr4    ' a 10 clock delay on the fourth RDLONG.
RDLONG reg1, addr1    ' All other RDLONGs will be aligned to the
RDLONG reg3, addr3    ' correct bank and incur no stalls.
Or, is the rotation actually occurring every two clock cycles, with the instruction streamer using one cycle and any pending HUBOP using the other cycle?
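The alignment effect behind this question can be sketched with a toy timing model. The assumptions here are mine, not confirmed specs: 16 banks selected by the low nibble of the long address, the bank a cog can reach advancing by one each clock, and back-to-back hub ops issuing two clocks apart. Real pipeline offsets may shift the exact counts.

```python
# Toy model of "egg beater" stall behavior under the assumptions above.

def stalls(addresses, issue_gap=2):
    """Per-access stall counts for a sequence of hub long addresses."""
    t = 0                               # clock; bank (t mod 16) reachable at t
    waits = []
    for a in addresses:
        need = a & 0xF                  # bank holding this long
        wait = (need - t) % 16          # clocks until that bank comes around
        waits.append(wait)
        t += wait + issue_gap           # do the access, then next op issues
    return waits

# Sequential longs: the second access waits nearly a full rotation.
print(stalls([0x200, 0x201]))           # [0, 15]
# Stride-2 ordering tracks the rotation, so no stalls after the first.
print(stalls([0x200, 0x202, 0x204]))    # [0, 0, 0]
```

Under this model, sequential addresses each miss the rotating bank by one slot and pay almost a full turn, while a stride-2 ordering stays aligned, which is the pattern the reordered RDLONG listing above exploits.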
Is there any mechanism by which a program can read or write hubram to cogram at a rate of one long per clock?
From what I understand, when using hubexec it will generally be necessary to have a function that always lives in cogram and uses the hub streamer to load a block of hubram into cogram (or cogram into hubram), in order to quickly read and write blocks of data. This function would either use block-loading instructions, if they exist, or just do it manually. It would probably have to exist even if it just called a hardware block-loader instruction, in order to restore the hub streamer from data mode back to hubexec mode via a return instruction.
So, is the hub switching banks twice as fast as the instructions effectively execute, meaning that optimized HUBOPs would have to do things like:
RDLONG reg0, addr0    ' Read 5 longs. There might be a stall
RDLONG reg2, addr2    ' on the first RDLONG, and there will be
RDLONG reg4, addr4    ' a 10 clock delay on the fourth RDLONG.
RDLONG reg1, addr1    ' All other RDLONGs will be aligned to the
RDLONG reg3, addr3    ' correct bank and incur no stalls.
I think that is correct.
However, I also see opcodes tagged

RDFAST D/#,S/PTRx    mem (waits for mem, setup fast read)
FBLOCK D/#,S/PTRx    mem (update RDFAST/WRFAST size,start)

which may address your issue? I don't think all the options for reading a hub block into cog RAM have been detailed fully, but Chip has certainly said we can read at full speed.
Just be prepared to wait for the precise details.
Agreed. Sequential accesses get smoothed out, and the burst read/write instructions work at full speed.
I know we had a brief discussion about this and the DACs and signals. One complication was on-the-fly changes, which the Tasker in the hot design helped with. This time, interrupts are needed to deal with that.
Won't be too long now. Chip has the overall timing to optimize, cleanups and the boot stuff to complete, and maybe we get a first image!
Just to clarify, the only difference between cog exec and hub exec is that in hub exec, instructions are fetched from hub RAM via the streaming FIFO, instead of from cog RAM. D and S registers are still in cog RAM.
It's true that in hub exec mode, a near branch backwards would cause a hub FIFO reload, even though the instructions might be in the FIFO. The problem is that a whole mess of muxes would be required to get those instructions out of the FIFO. I think those cases are not that common compared to normal branches, and would just grow the hardware for a mediocre return.
Hi Chip
How did the FIFO design finally get implemented?
I had thought a small block of dual-ported RAM was used in its construction, provided with some address-boundary checks to infer whether it needs to be reloaded or not.
Does a RAM-based design preclude using it for single versus multiple data moves between COG and HUB memories?
I can't help remembering Chip's slick text-processing instructions for P2-Hot:
CHKDEC D {WC} 'if D is "0".."9", convert to 0..9 and set C, otherwise leave D the same and clear C.
CHKHEX D {WC} 'if D is "0".."9"/"A".."F"/"a".."f", convert to 0..15 and set C, otherwise leave D the same and clear C.
CHKLET D {WC} 'if D is "A".."Z"/"a".."z", convert to "A".."Z" and set C, otherwise leave D the same and clear C.
CHKSYM D {WC} 'if D is "A".."Z"/"a".."z"/"_", convert to "A".."Z"/"_" and set C, otherwise leave D the same and clear C.
Link: http://forums.parallax.com/discussion/125543/propeller-ii-update-blog/p212#15
He later mentioned, "Of course, these can be done in software, but having instructions handy in PASM will make textual interfaces really simple to code. The logic expense to implement these is less than 0.1% of the silicon."
I know we're trying to avoid a hot chip, but text is so important in our world that making it easier to process makes sense where possible. It's easy to imagine the P2 handling a lot of text.
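For illustration, the quoted CHKHEX semantics can be modeled in software. This is only a sketch of the described behavior, with D treated here as an 8-bit character code rather than a register:

```python
# Software sketch of CHKHEX as quoted: if D is "0".."9"/"A".."F"/"a".."f",
# convert to 0..15 and "set C"; otherwise leave D the same and "clear C".

def chkhex(d):
    ch = chr(d & 0xFF)
    if '0' <= ch <= '9':
        return ord(ch) - ord('0'), 1          # converted, C set
    if 'A' <= ch <= 'F':
        return ord(ch) - ord('A') + 10, 1
    if 'a' <= ch <= 'f':
        return ord(ch) - ord('a') + 10, 1
    return d, 0                               # D unchanged, C cleared

print(chkhex(ord('b')))   # (11, 1)
print(chkhex(ord('g')))   # (103, 0)
```

The convert-and-flag shape is what makes these attractive for lexers: one operation both classifies the character and yields its numeric value.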
Please, no text-processing instructions; what a pointless complication.
Yes, text is important, but saving a few cycles every time you display a few strings on a user interface or parse user input is never going to be noticeable in use.
Yes, they may make programming such things in PASM a tiny bit easier. Meh.
The upper-case conversions don't even work for öäå and other commonly found letters.
It's just a bunch of transistors and wires that will sit there doing nothing 99.999% of the time. Like the useless DAA instruction in your PC.
Chip has named those text instructions "CHKxxx". Not only do they convert characters but they return an indication of what the character was.
There is one place where this may be a worthwhile speed boost and/or make code easier to write: lexical analysis, i.e., when compiling source code on the Prop itself. Lexers are a notorious bottleneck in compiler writing; every character has to be read and evaluated, one by one.
Perhaps Chip is being sneaky: while selling those instructions as "user interface" helpers, what he has in mind is that Spin compiler that is self-hosted on the PII.
Of course such instructions may also be a boost when reading human readable ASCII protocols over serial lines and whatever.
Are they really a worthwhile gain? I'm not convinced.
They are not listed in the latest instruction list, which is generated from the Verilog. Therefore I expect they are not in the new P2.
For the record, there are much more important things so leave them out.
Is there such an animal ?
I thought the P2 ROM (in the true meaning of the word) was very, very small, just enough for a boot stub.
IIRC it was going to be mask-patched RAM, but that may not be compatible with the new design flow, so a serial loader/shifter state engine from another ROM may be the present plan.
Does all the above make the P2 Monitor RAM-based, not in ROM?
I think Chip said the New Debug interrupt can even branch to HUBEXEC code, which makes compact size less critical (now it's %/512K, not %/2K)
Hi Chip
How did the FIFO design finally get implemented?
I had thought a small block of dual-ported RAM was used in its construction, provided with some address-boundary checks to infer whether it needs to be reloaded or not.
Does a RAM-based design preclude using it for single versus multiple data moves between COG and HUB memories?
Henrique
It's a stack of registers, where each can be loaded from either the main data input (push) or from the register above (pop). I never thought of using a dual-port RAM, which would have made random accesses possible for caching. At this point, I think we'd better stick with the logic approach, considering development time.
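Chip's description can be sketched as a shift-register FIFO: each register can load from the main data input (push) or from the register above it (pop). The depth and the fill-level bookkeeping below are my guesses for illustration, not the actual Verilog:

```python
# Sketch of a register-stack FIFO per the description above.

class ShiftFifo:
    def __init__(self, depth=16):
        self.regs = [0] * depth
        self.count = 0                  # current fill level

    def push(self, value):
        assert self.count < len(self.regs), "FIFO full"
        self.regs[self.count] = value   # load the register at the fill point
        self.count += 1

    def pop(self):
        assert self.count > 0, "FIFO empty"
        head = self.regs[0]
        self.regs[:-1] = self.regs[1:]  # each register loads from the one above
        self.count -= 1
        return head

f = ShiftFifo()
for v in (1, 2, 3):
    f.push(v)
print(f.pop(), f.pop(), f.pop())        # 1 2 3: first in, first out
```

Note how this structure gives FIFO order with no address decoding at all, which is why random access (and hence caching) is not possible without the dual-port RAM Chip mentions.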
Something doesn't look right to me with the hub timing. Suppose a cog calls:
RDLONG reg0, addr0    ' addr0 => $200
RDLONG reg1, addr1    ' addr1 => $201
The first instruction will stall until the cog is aligned with bank 0 (low nibble matches).
But what does the second instruction do? If the hub is rotating every clock cycle, then by the time that the second instruction executes (two clock cycles later), the cog is aligned with bank 2, causing the instruction to stall for 15 more clock cycles.
It seems that the only way that memory HUBOPs can take advantage of the eggbeater is if cog memory banks are rotated every two clock cycles. But Chip clearly says per-clock access is occurring, and several others seem to be under the same impression (at least in the context of the instruction streamer).
So, is the hub switching banks twice as fast as the instructions effectively execute, meaning that optimized HUBOPs would have to do things like:
RDLONG reg0, addr0    ' Read 5 longs. There might be a stall
RDLONG reg2, addr2    ' on the first RDLONG, and there will be
RDLONG reg4, addr4    ' a 10 clock delay on the fourth RDLONG.
RDLONG reg1, addr1    ' All other RDLONGs will be aligned to the
RDLONG reg3, addr3    ' correct bank and incur no stalls.
Or, is the rotation actually occurring every two clock cycles, with the instruction streamer using one cycle and any pending HUBOP using the other cycle?
You're right about the delays and I've been thinking that a block read which plays along with the hub ram rotation is badly needed.
I'm adding two instructions:
RDBLOCK D/#,S/#
WRBLOCK D/#,S/#
D is %rrrrrbbbbb, where %rrrrr is the x16 register base address and %bbbbb + 1 is the number of 16-long blocks to read/write from/to hub RAM starting at {S[19:6],5'b00000}. For example:
RDBLOCK %00000_11110,#$00000
...would read locations $00000..$007BF from the hub into registers $000..$1EF in the cog.
These instructions are fast and deterministic, taking 5+blocks*16 clocks to get the job done. They start at whatever position the hub is in and read/write 16 longs from/to the same hub page (16 longs on a 16-long boundary) before moving to the next page and the next 16-register block in the cog.
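Working the example above through in numbers: the field split and the 5+blocks*16 clock formula come from the post; masking S down to a 16-long (64-byte) boundary is my reading of the {S[19:6],5'b00000} notation:

```python
# Decode an RDBLOCK/WRBLOCK D field and compute the transfer extents.

def decode_rdblock(d, s):
    reg_base  = ((d >> 5) & 0x1F) * 16   # %rrrrr: x16 cog register base
    blocks    = (d & 0x1F) + 1           # %bbbbb + 1 blocks of 16 longs
    hub_start = s & ~0x3F                # start of the 16-long hub page
    hub_end   = hub_start + blocks * 16 * 4 - 1
    clocks    = 5 + blocks * 16          # "5 + blocks*16 clocks"
    return reg_base, blocks, hub_start, hub_end, clocks

base, blocks, lo, hi, clk = decode_rdblock(0b00000_11110, 0x00000)
print(hex(lo), hex(hi))                       # hub $00000..$007BF
print(hex(base), hex(base + blocks*16 - 1))   # cog regs $000..$1EF
print(clk)                                    # 501 clocks for 31 blocks
```

This reproduces the posted example: 31 blocks of 16 longs span hub $00000..$007BF, land in cog registers $000..$1EF, and take 501 clocks.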
RDBLOCK/WRBLOCK don't use the FIFO streamer like hub exec does, but use the same conduit as the random RDxxxx/WRxxxx instructions do, without waiting for a particular hub position. This means that hub exec code can quickly load cog exec code and execute it, or just read/write data blocks.
I think this totally rounds out the Prop 2 instruction set.
I believe I have the interrupt stuff all done, with the breakpoint and single-step functions, but I'm still on the road, so I haven't been able to test anything. I'll probably have this RDBLOCK/WRBLOCK implemented before we get back home.
Sorry I haven't been more responsive to this thread in the last few days. We've been doing lots of things with our kids on this trip and whole days have gone by without any computer time.
Thanks Chip, for the explanation.
Now I understand its construction and behavior.
Sure, stick with what you had designed and tested.
Henrique
P.S.
Only for completeness, this is not a claim nor even a suggestion:
During the time I was thinking about using a dual-ported design, I was partially blocked by the assumption that some kind of CAM would also have to be used to deal with address compares during read and write operations. It would have been a huge solution, so better to forget it.
But, some time later in the process of thinking about it, I realized that, by its own nature, the egg-beater scheme is self-addressing: if a string of 16 x (32 data bits + 16 address bits + 4 pending-ops bits + 4 ops-select bits) DPRAM were crafted, addressable from %0000 through %1111, then when the beater pointer matched those bits, the Hub-to-cache interface would be enabled and the address+tag part would be operated on.
The combination of pending and select ops bits would command the rest of the operation, if any.
To speed up the whole process, the addresses+tags could be implemented as addressable registers, although this would involve muxes; I'm still unsure whether that is not totally mandatory, due to timing constraints. (See P.S. II.)
On exit, the pending-ops bits would be reset to signal operation completeness, enabling the Cog to catch the data that was read, or to place new data to be written in another cycle of the beater.
Since Hub memory is a single-port instance, the individualization of the pending+select ops bits was meant to enable individual bytes, words, or longs to be operated on, but I'm unsure whether simultaneous reads and writes of individual bytes within the same long could be issued.
To be absolutely clear, I'm not asking for anything. I'm only explaining my thoughts.
Henrique
P. S. II
I saw the cat, and totally missed the rat's tail popping out of its mouth!
The egg beater is fully deterministic.
So, to avoid timing constraints from the Hub-side perspective, it's enough to enable the tag part of the RAM in the preceding cycle, i.e., the one just before the Cog's time slot happens.
Then reading and writing from/to the Hub ram will be a just-in-time event.
I think this totally rounds out the Prop 2 instruction set.
Chip, could you post an updated document that contains all of the P2 instructions? It would be good to have the binary values for the new instructions and brief descriptions about what they do in one document.
Chip, not to derail you any, but since you have some work to do yet on smart pins I thought I'd ask.
Way back when I said it would be cool if the P2 had a 256LE FPGA on chip, to do the heavy lifting of some hardware functions. Perhaps 256LE is a bit optimistic, but how hard do you think it would be to integrate a small CPLD module with smart pins? The module would be programmed at runtime by a cog, so no config needs to be saved on chip.
Perhaps 32LEs attached to 8 pins, with 8 registers that can be interfaced?
This sounds like something that would be an amusing diversion for a few days, but I think it would make your chip more useful. Now you've got 512KB of RAM, 16 COGs, and low-latency access to HUB RAM; those are all great features above and beyond the P1.
Comments?
Thanks for the link to the HUB memory "eggbeater" diagrams. I could not find them.
I think if the PII ever sees the light of day, that picture should be a poster on the wall: a reminder to PASM programmers and compiler writers as to how to optimize code for the PII.
It's the "Story of Mel" all over again! http://the.linuxd.org/the-story-of-mel/
(quote hard on mobile)
It shouldn't matter at all. Unused slots are still unused. There are now a lot more slots.
The scheme is consistent and continuous.
Doing that with a fan improves what is possible overall.
And it improves what a mobile chip can do too.
A GPU running too hot for its process physics will need a fan to perform, and won't work well at all in mobile.
That is the difference.
Are they really a worthwhile gain? I'm not convinced.
If it comes down to it, we don't need them. However, I do think they would be a gain, meaning I'm not opposed. They are cheap.
Fast streaming of human readable data may actually be a feature of merit.
For the record, there are much more important things so leave them out.
Is there such an animal ?
I thought the P2 ROM (in the true meaning of the word) was very, very small, just enough for a boot stub.
IIRC it was going to be mask-patched RAM, but that may not be compatible with the new design flow, so a serial loader/shifter state engine from other-rom may be the present plan.
All the above makes P2 Monitor RAM based, not in ROM ?
I think Chip said the New Debug interrupt can even branch to HUBEXEC code, which makes compact size less critical (now it's %/512K, not %/2K)
Just to clarify, the only difference between cog exec and hub exec is that in hub exec, instructions are fetched from hub RAM via the streaming FIFO, instead of from cog RAM. D and S registers are still in cog RAM.
It's true that in hub exec mode, a near branch backwards would cause a hub FIFO reload, even though the instructions might be in the FIFO. The problem is that a whole mess of mux's would be required to get those instructions out of the FIFO. I think those cases are not that common compared to normal branches and would just grow the hardware for a mediocre return.
Hi Chip
How does the fifo design got finally implemented?
I had thought a small block of dual ported ram was used in its construct, provided with some address boundaries check, to infer the need to reload it or not.
Does some ram based design precludes using it for single versus multiple data moves, between COG and HUB memories?
Henrique
It's a stack of registers, where each can be loaded from either the main data input (push) or from the register above (pop). I never thought of using a dual-port RAM, which would have made random accesses possible for caching. At this point, I think we'd better stick with the logic approach, considering development time.
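Chip's description can be sketched as a toy model. Everything here beyond his one-sentence description is my assumption for illustration (the depth of 16, the Python representation); it is not the actual hardware:

```python
class ShiftRegFIFO:
    """Toy model of the FIFO described above: a fixed stack of
    registers, each loadable from either the main data input (push)
    or from the register above it (pop). The depth of 16 is an
    assumed value for illustration only."""

    def __init__(self, depth=16):
        self.regs = [0] * depth
        self.count = 0                  # number of occupied registers

    def push(self, value):
        if self.count == len(self.regs):
            raise OverflowError("FIFO full")
        self.regs[self.count] = value   # load the next free register
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("FIFO empty")
        out = self.regs[0]              # output is the bottom register
        for i in range(len(self.regs) - 1):
            self.regs[i] = self.regs[i + 1]  # each loads from the one above
        self.count -= 1
        return out

fifo = ShiftRegFIFO()
for v in (10, 20, 30):
    fifo.push(v)
print(fifo.pop(), fifo.pop())   # -> 10 20
```

Note how, because a pop physically shifts every register down rather than moving a read pointer, there is no natural way to randomly address the middle of the structure, which is why caching would have required the dual-port RAM Chip mentions.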
Something doesn't look right to me with the hub timing. Suppose a cog calls:
RDLONG reg0, addr0   ' addr0 => $200
RDLONG reg1, addr1   ' addr1 => $201
The first instruction will stall until the cog is aligned with bank 0 (low nibble matches).
But what does the second instruction do? If the hub is rotating every clock cycle, then by the time that the second instruction executes (two clock cycles later), the cog is aligned with bank 2, causing the instruction to stall for 15 more clock cycles.
It seems that the only way that memory HUBOPs can take advantage of the eggbeater is if hub memory banks are rotated every two clock cycles. But Chip clearly says per-clock access is occurring, and several others seem to be under the same impression (at least in the context of the instruction streamer).
So, is the hub switching banks twice as fast as the instructions effectively execute, meaning that optimized HUBOPs would have to do things like:
RDLONG reg0, addr0   ' Read 5 longs. There might be a stall
RDLONG reg2, addr2   ' on the first RDLONG, and there will be
RDLONG reg4, addr4   ' a 10 clock delay on the fourth RDLONG.
RDLONG reg1, addr1   ' All other RDLONGs will be aligned to the
RDLONG reg3, addr3   ' correct bank and incur no stalls.
Or, is the rotation actually occurring every two clock cycles, with the instruction streamer using one cycle and any pending HUBOP using the other cycle?
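To make the question concrete, here is a toy Python model of the one-bank-per-clock scenario. The zero phase offset and the flat 2-clock cost per instruction are my assumptions, so the exact stall counts depend on them; the shape of the result (sequential addresses stall, interleaved addresses mostly don't) is the point:

```python
BANKS = 16
INSTR_CLOCKS = 2   # assumed base cost of one instruction

def run(addresses, start_clock=0):
    """Simulate back-to-back RDLONGs under a one-bank-per-clock
    egg beater. At clock t this cog is aligned with hub bank
    t % BANKS (phase offset assumed zero). A long at address A
    lives in bank A % BANKS; the access stalls until aligned."""
    t = start_clock
    stalls = []
    for a in addresses:
        wait = (a % BANKS - t) % BANKS   # clocks until banks line up
        stalls.append(wait)
        t += wait + INSTR_CLOCKS
    return stalls

# Sequential longs $200, $201: the second read just missed its bank.
print(run([0x200, 0x201]))                       # -> [0, 15]
# Interleaved ordering like the 5-long example: only one mid-sequence
# stall (11 here; the precise number tracks the assumed timing).
print(run([0x200, 0x202, 0x204, 0x201, 0x203]))  # -> [0, 0, 0, 11, 0]
```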
You're right about the delays and I've been thinking that a block read which plays along with the hub ram rotation is badly needed.
I'm adding two instructions:
RDBLOCK D/#,S/#
WRBLOCK D/#,S/#
D is %rrrrrbbbbb, where %rrrrr is the x16 register base address and %bbbbb + 1 is the number of 16-long blocks to read/write from/to hub RAM starting at {S[19:6],5'b00000}. For example:
RDBLOCK %00000_11110,#$00000
...would read locations $00000..$007BF from the hub into registers $000..$1EF in the cog.
These instructions are fast and deterministic, taking 5+blocks*16 clocks to get the job done. They start at whatever position the hub is in and read/write 16 longs from/to the same hub page (16 longs on a 16-long boundary) before moving to the next page and the next 16-register block in the cog.
RDBLOCK/WRBLOCK don't use the FIFO streamer like hub exec does, but use the same conduit as the random RDxxxx/WRxxxx instructions do, without waiting for a particular hub position. This means that hub exec code can quickly load cog exec code and execute it, or just read/write data blocks.
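A small decoding helper (hypothetical Python, not anything official) reproduces the numbers in Chip's example. It assumes hub byte addresses and 64-byte (16-long) block alignment, which is what the $00000..$007BF worked example implies:

```python
def rdblock_fields(d, s):
    """Decode the D = %rrrrr_bbbbb operand of RDBLOCK/WRBLOCK as
    described above. Returns (cog register base, block count,
    hub start, hub end, clocks). Byte hub addresses and 64-byte
    block alignment are assumptions inferred from the example."""
    rrrrr = (d >> 5) & 0b11111
    bbbbb = d & 0b11111
    blocks = bbbbb + 1              # %bbbbb + 1 blocks of 16 longs
    reg_base = rrrrr * 16           # x16 register base address
    hub_start = s & ~0x3F           # round down to a 16-long boundary
    hub_end = hub_start + blocks * 64 - 1
    clocks = 5 + blocks * 16        # "5 + blocks*16 clocks"
    return reg_base, blocks, hub_start, hub_end, clocks

# Chip's example: RDBLOCK %00000_11110, #$00000
print(rdblock_fields(0b00000_11110, 0x00000))
# -> (0, 31, 0, 1983, 501), i.e. registers $000..$1EF,
#    hub $00000..$007BF, in 501 clocks
```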
I think this totally rounds out the Prop 2 instruction set.
I believe I have the interrupt stuff all done, with the breakpoint and single-step functions, but I'm still on the road, so I haven't been able to test anything. I'll probably have this RDBLOCK/WRBLOCK implemented before we get back home.
Sorry I haven't been more responsive to this thread in the last few days. We've been doing lots of things with our kids on this trip and whole days have gone by without any computer time.
Block read while also still able to do hubexec on the streamer is deluxe!
Thanks Chip, for the explanation.
Now I understand its construct and behavior.
Sure, stick with what you had designed and tested.
Henrique
P.S.
Only for completeness, this is not a claim nor even a suggestion:
During the time I was thinking about using a dual ported design, I was partially blocked by the assumption that some kind of CAM should also be used, to deal with address compares, during read and write operations. It would be a huge solution, so better forget it.
But, some time later in the process of thinking about it, I realized that, by its own nature, the egg beater scheme is self-addressing, i.e., if a string of 16 x (32 (data bits) + 16 (address bits) + 4 (pending-ops bits) + 4 (ops-select bits)) DPRAM were crafted, addressable from 0000b through 1111b, then when the beater pointer matched those bits, the Hub-to-cache interface would be enabled, and the address+tag part would be operated on.
The combination of (pending and select) ops bits would command the rest of the operation, if any.
To speed up the whole process, the addresses+tags could be implemented as addressable registers, although this would involve muxes; I'm still unsure whether that isn't totally mandatory, due to timing constraints. (Look at P.S. II)
On exit, pending ops bits should be reset, to signal operation completeness, enabling the Cog to catch the data that was read or to place new data to be written in another cycle of the beater.
Since Hub memory is a single port instance, the individualization of pending+select ops bits was meant to enable individual bytes, words or longs to be operated, but I'm unsure if simultaneous reading and writing of individual bytes, within the same long, couldn't be issued.
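For what it's worth, the self-addressing idea can be sketched as a toy model. Everything here is my own illustrative assumption (the flat hub array, the op encoding, collapsing the bit fields into Python attributes), not real P2 hardware or Henrique's actual proposal in detail:

```python
READ, WRITE = 1, 2                 # stand-ins for the pending-op bits

class Slot:
    """One of 16 DPRAM entries: data + address + pending-op state."""
    def __init__(self):
        self.data = 0
        self.addr = 0
        self.pending = 0           # 0 = idle / operation complete

hub = [0] * 256                    # flat stand-in for hub RAM longs
slots = [Slot() for _ in range(16)]

def beater_tick(pointer):
    """Service only the slot whose index matches the egg-beater
    pointer -- the 'self-addressing' property described above."""
    s = slots[pointer & 0xF]
    if s.pending == WRITE:
        hub[s.addr] = s.data
    elif s.pending == READ:
        s.data = hub[s.addr]
    s.pending = 0                  # completion signal back to the cog

# A cog posts a write into slot 5; it lands when the beater reaches 5.
slots[5].addr, slots[5].data, slots[5].pending = 40, 0xCAFE, WRITE
for t in range(16):
    beater_tick(t)
print(hub[40] == 0xCAFE, slots[5].pending == 0)   # -> True True
```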
To be absolutely clear, I'm not asking for anything. I'm only explaining my thoughts.
Henrique
P. S. II
I saw the cat, and totally missed the rat's tail popping out of its mouth!
The egg beater is fully deterministic.
So, to avoid timing constraints, from the Hub-side perspective, it's enough to enable the tag part of the RAM in the preceding cycle, i.e., the one just before the Cog's time slot happens.
Then reading and writing from/to the Hub ram will be a just-in-time event.
My fault! Sorry.
Henrique
Chip, could you post an updated document that contains all of the P2 instructions? It would be good to have the binary values for the new instructions and brief descriptions about what they do in one document.
The interrupts are done. The P2 instruction set is completed. P2 Day is almost here!
Way back when, I said it would be cool if the P2 had a 256LE FPGA on chip, to do the heavy lifting of some hardware functions. Perhaps 256LE is a bit optimistic, but how hard do you think it would be to integrate a small CPLD module with smart pins? The module would be programmed at runtime by a cog, so no config needs to be saved on chip.
Perhaps 32LEs attached to 8 pins, with 8 registers that can be interfaced?
This sounds like something that would be an amusing diversion for a few days, but I think it would make your chip more useful. Now you've got 512KB of RAM, 16 COGs, and low-latency access to HUB RAM; those are all great features above and beyond the P1.
No more frikken new features please. Get me the chip already!