- Hardware memory caches
- Larger cog memory
- More efficient and flexible hub access mode
- 1080p60 display capability with external DRAM
- What!? Caches are not sensible, let alone viable. All fetching is timed to keep the pipes filled already. Where do you envisage a hardware cache being used?
- The hub access flexibility issue is a side effect of Hub space not being part of native Cog addressing and also being hugely beyond the address range of Cog space. Extended (64-bit) instructions would be needed, which is no better than using two 32-bit instructions.
- The 1080p60 thing is a bit meaningless. There are far too many interpretations of that spec, not to mention you are jumping the gun somewhat given the Prop2 is not out yet.
- Regarding the larger Cog address space issue, it would need a change in architecture: either going to a regular von Neumann design with a much smaller register set, or, what might be closer to the Prop, separating out the instruction fetches, a la Harvard but not quite that either, and fetching instructions directly from Hub space instead.
The only hiccup with the second approach, apart from the obvious loss of any self-modifying options, is that Hub bandwidth would be entirely consumed by instruction fetches. So any instructions that access hub space would have to stall the instruction fetches. It is still predictable, but obviously the available data bandwidth to Hub space takes a knock.
The good part is you get the large execution space of native code without losing the deterministic eight core setup or adding any more bloat.
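The trade-off above can be put into rough numbers. This is a toy model, not real Prop timing: assume each hub slot carries either one instruction fetch or one hub data access, so every data-accessing instruction stalls fetching for one slot.

```python
def effective_rates(slots_per_second, data_access_fraction):
    """Toy model of a cog executing straight from hub RAM.

    Each hub slot serves either one instruction fetch or one data
    access, so a hub data instruction stalls fetching for one slot.
    data_access_fraction = fraction of executed instructions that
    themselves read/write hub data. (Made-up model, not P1/P2 timing.)
    """
    # i instructions/sec consume i fetch slots + i*f data slots:
    #   i * (1 + f) = slots_per_second
    instr_rate = slots_per_second / (1.0 + data_access_fraction)
    data_rate = instr_rate * data_access_fraction
    return instr_rate, data_rate

# With 10M slots/s: no hub data gives 10 MIPS; one hub access per
# four instructions drops that to 8 MIPS plus 2M data accesses/s.
```

The point the model makes is the one in the post: execution stays predictable, but data bandwidth to hub and instruction rate now trade off directly against each other.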
Hardware memory caches
LMM implements a level 1 cache in software. XMM implements a level 2 cache in software. These execution modes would be faster and more efficient if implemented in hardware. P2 will have a very limited hardware cache. Perhaps P3 would improve on that.
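For readers who haven't met the idea: an FCACHE-style LMM kernel copies hot code into cog RAM and reuses it on later passes, which is cache behaviour implemented in software. A minimal direct-mapped cache model (the geometry here is arbitrary, not any real Propeller configuration) behaves like this:

```python
class DirectMappedCache:
    """Toy direct-mapped cache: nlines lines of line_size words each.
    Arbitrary geometry for illustration, not a real P1/P2 design."""

    def __init__(self, nlines=8, line_size=8):
        self.nlines = nlines
        self.line_size = line_size
        self.tags = [None] * nlines        # which line each slot holds
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_size      # which cache line addr falls in
        index = line % self.nlines         # slot that line must occupy
        if self.tags[index] == line:
            self.hits += 1
            return True                    # served from cog RAM: fast path
        self.tags[index] = line            # fill; a hub burst in hardware
        self.misses += 1
        return False

# A 16-word loop body executed twice: the second pass is all hits,
# which is exactly the payoff an FCACHE-style kernel chases.
cache = DirectMappedCache()
for _ in range(2):
    for addr in range(16):
        cache.access(addr)
```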
Larger cog memory
Cog memory addressing could be increased without architecture changes by implementing relative-jump instructions and indirect cog data addressing using an index register. A 4KB cog memory would be a vast improvement over the 2KB memory size. Even a 3KB memory space would allow a 2X speed improvement over the current Spin interpreter.
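To make the relative-jump point concrete: a 9-bit instruction field can only name absolute addresses 0..511, but treated as a signed PC-relative offset it reaches code anywhere in a larger cog RAM. A sketch of the decode (a hypothetical encoding, not real PASM):

```python
def rel_target(pc, offset9):
    """Resolve a signed 9-bit PC-relative jump field.

    Hypothetical encoding: target = pc + 1 + offset, offset in
    -256..255. The field stays 9 bits wide, yet code can sit
    anywhere in, say, a 1024-long (4KB) cog RAM.
    """
    assert -256 <= offset9 <= 255, "offset must fit a signed 9-bit field"
    return pc + 1 + offset9

def indexed_read(cog_ram, index_reg, displacement=0):
    """Indirect cog-RAM data access through an index register: the
    9-bit source field names the register, not the datum itself."""
    return cog_ram[index_reg + displacement]

# Code living at address 600 - beyond the 511 limit of an absolute
# 9-bit field - can still jump back 51 longs to address 550.
```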
More efficient and flexible hub access mode
The current fixed-slot hub access algorithm is very inefficient. Most applications do not require cycle-accurate precision. It would be much more efficient to use a round-robin hub access arbiter than the current fixed-slot method. If cycle-accuracy is required, a cog could set a bit to dedicate an access slot to it.
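A toy comparison of the two policies (nothing here models real hub hardware): under fixed slots a lone requesting cog can wait up to 7 cycles for its turn, while a rotating-priority arbiter grants any free slot to whoever is asking.

```python
def fixed_slot_wait(cog, cycle, ncogs=8):
    """Cycles until this cog's dedicated slot comes around
    (its slot occurs when cycle % ncogs == cog)."""
    return (cog - cycle) % ncogs

def round_robin_grant(requests, last_granted, ncogs=8):
    """Rotating-priority arbiter: scan from the cog after the last
    winner and grant the first requester found. Toy model only."""
    for step in range(1, ncogs + 1):
        candidate = (last_granted + step) % ncogs
        if candidate in requests:
            return candidate
    return None  # no cog wants the slot

# A lone requester (cog 3) at cycle 4: fixed slots make it wait 7
# cycles for its turn; round robin serves it on the very next slot.
```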
1080p60 display capability with external DRAM
Maybe it doesn't have to be 1080p60, but I would like to see a full 24-bit high resolution display capability by using external RAM. Low-res or tiled graphics and 8-bit dithered images just aren't cutting edge. It's really not that hard to do.
Larger cog memory
Cog memory addressing could be increased without architecture changes by implementing relative-jump instructions and indirect cog data addressing using an index register. A 4KB cog memory would be a vast improvement over the 2KB memory size. Even a 3KB memory space would allow a 2X speed improvement over the current Spin interpreter.
Interesting stat.
How does the current Spin interpreter split between use of COG ram as Data, as Code, and as Modifiable Code?
4-port register memory is certainly more costly than R/W SRAM, so perhaps a low-cost means would be to double COG ram and allow a Harvard split: half is register/data 4-port, and half is R/W, nominally for code (not single-line self-modifying). Or you can flip this and run code from the 4-port half unchanged, and use the R/W block for data fifos/arrays. No backward compatibility impact.
Perhaps that R/W RAM could include the sibling port swap option, allowing cross-cog data sharing.
A high cost in the Prop is how much of the valuable 4-port memory is used merely as code ROM.
Larger cog memory
Cog memory addressing could be increased without architecture changes by implementing relative-jump instructions and indirect cog data addressing using an index register. A 4KB cog memory would be a vast improvement over the 2KB memory size. Even a 3KB memory space would allow a 2X speed improvement over the current Spin interpreter.
Neat idea. The spin interpreter can be sped up 20-25% by using 1KB hub for a vector table. The additional fifos used as stacks in P2 will improve the interpreter.
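The vector-table idea in a nutshell: instead of a compare-and-branch chain per bytecode, the opcode indexes straight into a table of handler addresses. A toy interpreter (invented opcodes, nothing like the real Spin bytecode set) shows the shape:

```python
def op_push1(vm):
    vm.stack.append(1)

def op_add(vm):
    b, a = vm.stack.pop(), vm.stack.pop()
    vm.stack.append(a + b)

class ToyVM:
    """Bytecode dispatch via a vector table: one indexed lookup per
    opcode instead of a chain of compares. Opcodes are made up."""

    def __init__(self):
        self.stack = []
        self.vectors = {0x01: op_push1, 0x02: op_add}  # the 'table'

    def run(self, code):
        for opcode in code:
            self.vectors[opcode](self)  # single indexed dispatch
        return self.stack[-1]

# push 1, push 1, add
result = ToyVM().run([0x01, 0x01, 0x02])
```

The speedup comes from every opcode paying the same small dispatch cost, rather than the later opcodes in a compare chain paying for all the compares before them.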
More efficient and flexible hub access mode
The current fixed-slot hub access algorithm is very inefficient. Most applications do not require cycle-accurate precision. It would be much more efficient to use a round-robin hub access arbiter than the current fixed-slot method. If cycle-accuracy is required, a cog could set a bit to dedicate an access slot to it.
Agreed, but implement it such that a cog can set a flag to permit it to utilise unused slots. This then allows one or more cogs to share the unused slots, permitting the concept of super cogs.
One of the great features of the prop, against other micros, is that it is RAM based (besides the ROM tables and spin interpreter).
With the P2 come quadlong hub accesses, autoincrement/autodecrement, the repeat instruction, and 1-in-8 slots. Loading cog memory will be much faster!!! It even has an effective cache for delivering bytes from quad fetches.
I have thought of (even proposed) an additional 2KB block of cog ram shared between each set of cog pairs (as in 4 blocks: cogs 0-1, 2-3, 4-5, 6-7). It could be used by one cog or the other, or with 1-in-2 slot access.
Hardware memory caches
LMM implements a level 1 cache in software. XMM implements a level 2 cache in software. These execution modes would be faster and more efficient if implemented in hardware. P2 will have a very limited hardware cache. Perhaps P3 would improve on that.
Adding support for certain software methods is not quite what I was thinking when you said hardware caching.
I wouldn't call the quad word accesses a hardware caching system. Just a couple of wide buffers per Cog.
Larger cog memory
Cog memory addressing could be increased without architecture changes by implementing relative-jump instructions and indirect cog data addressing using an index register. A 4KB cog memory would be a vast improvement over the 2KB memory size. Even a 3KB memory space would allow a 2X speed improvement over the current Spin interpreter.
Oh, is that all. I was thinking much bigger addressing ranges. There is the new CLUT in the Prop2 that could fill the job.
More efficient and flexible hub access mode
The current fixed-slot hub access algorithm is very inefficient. Most applications do not require cycle-accurate precision. It would be much more efficient to use a round-robin hub access arbiter than the current fixed-slot method. If cycle-accuracy is required, a cog could set a bit to dedicate an access slot to it.
Hehe, didn't twig to what you were after on that one.
Yep, this was proposed a number of times in the very first Prop2 thread all those years back. I remember Chip saying it wasn't desirable but not sure if I actually saw a reason why.
1080p60 display capability with external DRAM
Maybe it doesn't have to be 1080p60, but I would like to see a full 24-bit high resolution display capability by using external RAM. Low-res or tiled graphics and 8-bit dithered images just aren't cutting edge. It's really not that hard to do.
The big problem with 24/32-bit colour modes is the sheer number of pins needed for the bandwidth. The Prop2 will likely be able to handle it, but at the cost of using virtually all its I/O.
A few megabytes of embedded ram, keeping the whole framebuffer internal, would make the approach much more useful. But, obviously, that makes the silicon bigger again.
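The arithmetic behind both points is easy to check: a 24-bit 1080p framebuffer alone is about 6.2 MB, and refreshing it at 60 Hz needs roughly 373 MB/s of raw pixel data (more once blanking is included), which is what drives the pin count.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Size of one full frame in bytes."""
    return width * height * bits_per_pixel // 8

def pixel_data_mb_per_s(width, height, fps, bits_per_pixel):
    """Raw visible-pixel bandwidth in MB/s; a real link needs more
    because of horizontal/vertical blanking intervals."""
    return width * height * fps * bits_per_pixel / 8 / 1e6

fb = framebuffer_bytes(1920, 1080, 24)        # about 6.2 MB per frame
bw = pixel_data_mb_per_s(1920, 1080, 60, 24)  # about 373 MB/s raw
```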
I wouldn't call the quad word accesses a hardware caching system. Just a couple of wide buffers per Cog.
I guess the test for whether it's a cache or not is: does it speed up raw throughput, i.e. by facilitating bursting, width translation, or synchronising? If so, then it's a buffer. If, however, it reduces the need for throughput, then it's a cache.
How so? No matter how fast the HUB RAM is, we have 8 cogs accessing it. They have to wait for each other.
Yes, they would still have to wait for each other, but it would take only one instruction cycle regardless of which cog is doing the reading or writing. Of course it's really just silly nonsense as I don't see memory ever being that much faster than a processor core unless it had some very large and complex architecture. But if it was possible, you might as well get rid of cog memory and extend the opcodes to address the larger amount of ram.
A 32-bit counter register is tiny, and the adder can be mux'd - the real cost in video generation is at the pins.
It's not so much the register, but the counter logic. Since it's my understanding that the counters can operate faster than instruction execution, I don't see how they can be multiplexed, at least without slowing them down some. Also, P2 video generation is still done in the cog, only the DAC is in the pins.
@Dave Hein
I never looked at the interpreter code (and if I did, I might as well have stared at hieroglyphs), but am wondering how much code space might be saved by eliminating the math routines that the P2 will support in hardware?
24/32-bit color for a P3 would be nice, but 16-bit color for a P1.5 would be wonderful, as I feel there is a much more noticeable difference between 6/8-bit color and 16-bit than between 16-bit and 24-bit.
It seems that we're about evenly split on whether we would like the next chip after the P2 to be a P1.5 or a P3. At the very least, a 64-GPIO P1 would be nice, and from what Chip has said, it is practically complete.
I still maintain that a P1 with more refined peripherals (the few that it has), a higher clock, more GPIO, and more hub ram would be hugely desirable. If hub speed could be increased from 1/16 to 1/8 clock, would you rather keep 8 cogs, or increase it to 16? Obviously, this would bite into hub ram as well.
For those that want a smaller P1 I hope to have DIP-32, DIP-24, and DIP-20 available soon. All will be 0.6" wide. The DIP-32 has P0-12 and P20-31, the DIP-24 has P0-7 and P24-31, and the DIP-20 has P2-7 and P24-29. Note the DIP-20 does not have download capability so it must boot from a programmed EEPROM.
I have not tested them yet, but believe the yield will be fine - just waiting to get my Dremel cutting blade.
I expect pricing to be about $15 ea for the DIP-24 & -32 and $25 for the DIP-20. No volume discounts, and quantities are limited as I have to cut each chip down from a DIP-40.
P2 will allow inter-prop comms to add more cogs by adding another P2. It would be nice to add that com-link to a P1.5.
Need more pins and cogs? Just add another chip. (OK, and a couple more things.)
Port B would be nice, but how to get that in a DIP?
Ahh, and +1 for the second counter for a 64-bit CNT.
Enjoy!
Mike