Hey, I'm just throwing this oddball hub-on-steroids scheme out there to add some variety while we wait for Chip to resurface for air and shed some further light on various things (but the longer he's down there at depth, the better for all from a progress standpoint). In a nutshell, the scheme involves dividing the hub into 16 banks, with each cog accessing one bank at a time on a rotating (hub) basis, and all of the cogs doing so simultaneously. It also requires accessing memory in a spaced-apart manner to keep accesses flowing. That's the gist.
...
But going in the opposite direction, how about a hub-sharing mechanism on steroids involving memory banks? The idea is that the hub arbitrator would consist of 16 sub-arbitrators (one for each current bank-cog pairing), such that each cog got access to one 32KB bank at a time (512KB/16), followed immediately by access privilege to the next sequential 32KB bank and so on. All cogs would have potential access to a bank at the same time, but no two cogs would have access to the same bank at the same time. All 16 cogs would be circularly following each other in lock-step in terms of bank access (each cog with simultaneous access to a separate 32KB bank per 1/16th hub cycle). Maybe each cog would have a 16-bit circular shift register with only 1 bit set to feed the bus assignment logic of the 16 sub-arbitrators, the shift registers of adjacent cogs being offset by one bit in each direction from each other in terms of the set bit.
When making a memory request, a first thought was that perhaps one would specify the bank number and the cog would block until that bank became available to that cog. But preferably the bank number would not have to be specified, as the hardware could automatically span consecutive addresses (not sure if the addresses point to longs or quads, a critical detail) across the 16 32KB banks, letting the memory be treated as one continuous strip, even though the memory was broken up into longs (or quads) that were actually separated by 32KB (though they would seem to be adjacent from the user's perspective). In such usage, a cog's data would be spread across the banks of memory (perhaps kind of like data in a RAID hard drive system). But in this way, every cog could access "chunks" of data at full speed, with each cog accessing a separate bank at the same time. Such a scheme could fly if "sequential" access to data were needed, sequential in the sense of from bank-to-bank, that is. But if needing to access the hub/banks in a random access way, then things would slow down to 1/8th speed (not counting other overhead), on average, as a cog would block until it got its shot at the desired bank. And sustained access within the same bank (if needed for some reason, though I'm not sure what that would be) would slow to 1/16th speed (but such access would only be possible if the programmer specified data addresses spaced apart by 16 to overcome (perhaps "defeat" is more correct) the way the logic would automatically spread access across hub banks).
Take the case of multiple cogs executing code directly from the hub (if that does get implemented): each cog doing so could read in the code at full hub speed (presuming, of course, that the data were spread across the hub banks). In a sane world, that would require the cooperation of the compiler to automatically spread instructions (and data) across memory just right (a key requirement of this oddball scheme). And for coding, machine instructions would be implemented in such a way as to automate bank spreading. For example, if we do have indexing, then the index could automatically point to the same long (or quad) of the next bank (with wrap-around-plus-one) instead of the next actual long (or quad).
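To make the rotation half of this concrete, here's a rough model in C (just a thought-experiment sketch of the scheme as described above, not anything anyone has specified): on any given clock, cog c gets bank (c + clock) mod 16, so no two cogs ever land on the same bank and every cog's window advances by one bank per clock.

#include <stdio.h>

/* Thought-experiment model of the 16-bank rotation described above:
   on clock t, cog c is paired with bank (c + t) % 16, so all 16 cogs
   hit 16 different banks at once, and each cog's window advances by
   one bank per clock. */
static int bank_for_cog(int cog, int clock)
{
    return (cog + clock) % 16;
}

int main(void)
{
    for (int t = 0; t < 3; t++) {
        printf("clock %d:", t);
        for (int cog = 0; cog < 16; cog++)
            printf(" cog%2d->bank%2d", cog, bank_for_cog(cog, t));
        printf("\n");
    }
    return 0;
}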
The whole of memory
And I think this can be achieved, with some control over bandwidth as well, using the core rotate engine of #1.
It levers off my Sparse idea.
It's just not possible if you're accessing random addresses (even at fixed intervals, like in the example jazzed provided) when they're spread across RAM pages. There's no other way to do it but to restrict your data to one page in such a scenario. It's the biggest drawback of this scheme, and it's kind of a major one. Other than that, it's fantastic.
Rayman, this involves you... I've decided to get rid of the PLLs in the CTRs, since they were only being used for video - and video, having another clock domain to deal with, was complicating everything. Video, a special case of fast DAC output, will now be tied to the system clock with a special NCO/divide-by-N circuit. The CTRs are gone, too! I actually talked with Phil Pilgrim about this and he agreed that smart I/O could completely compensate for lack of CTRs. What this means for you is that video will now be able to go out over the pins again, as straight bits, so you can drive LCDs.
One of the hub memories would have all $xxxx0 long addresses, while the next would have all $xxxx1 long addresses, and so on. Any cog that wanted to read $xxxxD, for example, would have to wait for that memory to come to his window. All longs at long addresses $xxxxD would be in the same physical memory block. Does that make sense?
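To restate that mapping in code (a hedged sketch only, mirroring the description above, not silicon behavior): the low nibble of a long address selects which physical RAM block it lives in, the upper bits index within that block, and a cog stalls until its rotating window reaches the wanted block. It also shows why consecutive long addresses can stream at a long per clock: they fall in consecutive blocks, so each one is ready just as the window gets there.

#include <stdint.h>
#include <stdio.h>

/* Illustrative model of the mapping described above:
   low nibble of the long address = physical RAM block,
   upper bits = location within that block. */
static int bank_of(uint32_t long_addr)        { return long_addr & 0xF; }
static int offset_in_bank(uint32_t long_addr) { return long_addr >> 4;  }

/* Clocks a cog waits before the wanted block rotates into its window. */
static int wait_clocks(int window_bank, uint32_t long_addr)
{
    return (bank_of(long_addr) - window_bank + 16) % 16;
}

int main(void)
{
    uint32_t addr = 0xD;   /* the $xxxxD example */
    printf("long $%X -> block %d, offset %d, wait %d clocks from window 3\n",
           addr, bank_of(addr), offset_in_bank(addr), wait_clocks(3, addr));
    return 0;
}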
Will we still be doing rdlong, etc... the same as before to get deterministic timing or do we have to worry about the hub address with that too?
I.E.
rdlong d, s
instruction
instruction
rdlong d, s
instruction
instruction
etc....
The hub address will change your timing. You could arrange your RDLONG/WRLONG sequence to take advantage of the hub order, though, for much improved performance in reading and writing records.
What if the existing RDLONG/WRLONG used addresses that didn't include the low nibble, which was instead added with the value of COGID? This would give each cog its own 32K HUB space. Then, add new RDLONGX/WRLONGX instructions that worked as you are describing it now.
RDLONG/WRLONG (etc.)
Drivers would most likely use this because the addresses would work for any cog. In other words, addresses are local to the 32K block, so the driver is cog-agnostic (or rather, COGID-agnostic).
It provides a sort of isolation for drivers (WRLONG for one driver would *never* clobber data in another cog's hub space).
The 1:16 timing allows code to be deterministic in the same manner that it currently is.
RDLONGX/WRLONGX (etc.)
Used by code that needs to interact with the drivers. In this case, the low nibble is the COGID of the driver.
The "X" helps explicitly indicate that the code may be accessing any of the 16 32K blocks.
RDBLOCX/WRBLOCX
Same as the proposed RDBLOCK/WRBLOCK, but with the last character changed to X to be consistent with the "X" meaning "across 32K block boundaries."
Looking at the hub memory with this point of view, I suspect that RDLONGX/WRLONGX would be used primarily for accessing low hub memory (like interacting with drivers and passing messages between various cogs), while RDBLOCX/WRBLOCX would primarily be using high hub memory (primarily to avoid clobbering driver data). Knowing that both of these modes are likely to be mixed within the same application, drivers would generally limit themselves to the lower 16K of their cog's hub block, while the upper 16K of each block would be used in the other mode (for a total of 256K) to provide support for LMM (and possibly hubexec). In this case, the cog that was running a large program (out of the upper 256K) would also have the lower 16K of its own hub block for data usage.
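If it helps, here's how I read the addressing in that proposal (a hedged C sketch of the idea, not a spec): the plain RDLONG/WRLONG form supplies a 13-bit long offset within the cog's own 32K block and the hardware splices COGID in as the low nibble, so a cog only ever touches its "own" physical block and hits it once every 16 clocks (hence the 1:16 determinism); the "X" forms take a full 17-bit long address that can land in any of the 16 blocks.

#include <stdint.h>
#include <stdio.h>

/* Hedged sketch of the proposed addressing, as I understand it:
   RDLONG/WRLONG   - 13-bit long offset within the cog's own 32K block;
                     COGID supplies the low nibble (the block select).
   RDLONGX/WRLONGX - full 17-bit long address, any of the 16 blocks. */
static uint32_t local_to_physical(uint32_t offset13, int cogid)
{
    return (offset13 << 4) | (cogid & 0xF);   /* block is always == cogid */
}

static int block_of(uint32_t long_addr17)
{
    return long_addr17 & 0xF;
}

int main(void)
{
    uint32_t p = local_to_physical(73, 5);
    printf("cog 5, local long 73 -> physical long %u (block %d)\n",
           p, block_of(p));
    printf("RDLONGX of long $1234D -> block %d\n", block_of(0x1234D));
    return 0;
}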
You only need to be aware if you want to optimize timing. I will have to make some 16-long alignment directives, as you noted.
Can you make it so you can pass a parameter to this directive to tell it to offset it by some amount? So (address % 16) == offset? Better yet, can you also make the 16 overridable, so we can align things however we want? So (address % x) == y?
Roy Eltham came by and we were talking about all this concern over hub sharing. He had a great idea of having the individual memories act as hubs, instead of just one hub controlling all the memories and serving just one cog at a time.
We happen to have 16 instances of hub RAM, anyway, that make up the total 512KB. By having them each be 8192 locations x 32 bits, and distributing contiguous long addresses among all of them, we could make it so that every cog could read or write a subsequent long on every clock - which is 4x faster than the RDQUAD/WRQUAD scheme. This also allows all memories to be 32 bits wide, instead of 128, which cuts memory power by 65%. Initial latency varies now by the long address, but once locked, longs can be read or written on every clock.
I know this will sound crazy, but could there be a hardware feature that automatically copies a long or group of longs (simultaneously with, or immediately following, a write to the cog's own RAM) to another address belonging to another cog? This would allow another cog to use the updated data immediately, or a clock later than it was written. Sort of a dynamic memory.
If you can live with 16 SysCLKS for Copy of 16 Longs, it is already there
Hmmm, sounds a bit like banking. I'm not entirely sure I get this system. If I've got a long in hub ram that I want to access do I address it as "hub 5 byte 73" or do I get a flat address space up to 512KB and then the cog "hangs" until that particular ram hub ( or whatever it gets called) comes around?
["... drivers would generally limit themselves to the lower 16K of their cog's hub block, while the upper 16K of each block would be used in the other mode (for a total of 256K) to provide support for LMM (and possibly hubexec). ..."]
Reading through this I was initially intrigued by the ideas presented above but I think you just identified the main pitfall here. For non-OBEX code it can be dealt with internally by the coder, but for OBEX you'd definitely need some rather strict convention to limit the RAM ranges used by the COG drivers, otherwise you'd have a hard time carving up the RAM when you combine objects and also when you want to make use of burst block transfers with RDBLOCX/WRBLOCX. However having strict upper/lower conventions like this put in place will limit what the OBEX code will be able to do and how it can use its memory. I just feel that's a bit of a problem.
["One of the hub memories would have all $xxxx0 long addresses, while the next would have all $xxxx1 long addresses, and so on. Any cog that wanted to read $xxxxD, for example, would have to wait for that memory to come to his window."]
I guess this answers my question. Should have finished the thread before asking questions.
It seems like this scheme will require tweaking loops to be multiple of 16 cycles for efficient execution. Others have already commented on the random-access problem. The linear-execution feature is nice, but a lot of code executes short linear pieces, and then jumps, calls or returns. Also, LMM code that accesses data will be accessing instructions and data from different areas of memory. Maybe the use of instruction and data caches would help.
OK, so 16x may not be accurate, but loops will need tweaking. Doing efficient loops in C will be difficult. It looks like a PITA to program efficiently.
I was more wondering if routing all of those busses was going to cause congestion.
There are gains and losses, Chip said this
["We happen to have 16 instances of hub RAM, anyway, that make up the total 512KB. .... This also allows all memories to be 32 bits wide, instead of 128, which cuts memory power by 65%"]
So there were already 16 memories; they are now just merged differently.
Before, it was 128 bits wide; now it is a 16 x 32 data bus.
Address bus handling is similar.
I see it more like a lazy Susan in a Chinese restaurant. If you want to fill your plate quickly, you do a block read, starting with whatever is in front of you as the thing rotates, grabbing each dish in turn until your plate is full. Seconds on a particular item might happen right away, or you may have to wait awhile for it to come around again.
["We happen to have 16 instances of hub RAM, anyway, that make up the total 512KB. .... This also allows all memories to be 32 bits wide, instead of 128, which cuts memory power by 65%"]
So there were already 16 memories, they are now merged differently.
Before there was 128b wide, now it is16 x 32 Data Bus.
Address Bus handling is similar
Yeah but 1 x 128 = 128 but 16 x 32 = 512. Doesn't that mean that there are four times as many wires in this new scheme?
I think it's not necessarily 16; see #72, as it depends on how your code changes the address.
You could take advantage of this for large memory (matrix, imaging, etc.) operations. It would be a kind of abuse, but would really pay off.
If your tight loop needs say 14 cycles, you might decrement the hub address by 2 after each loop, to perform operations on all the even memory addresses. Then repeat to take care of all the odd memory locations. If your loop needs 12 cycles, decrement by 4, etc.
Similarly for loops that need a little more than 16 cycles - increment accordingly.
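A quick sanity check of that trick (a hedged timing model, assuming the window advances one bank per clock, which is how I read the scheme): with a 14-clock loop the window has moved 14 banks ahead by the next access, which is the same as 2 banks back, so stepping the long address back by 2 each iteration means the target bank is always in the window; one pass covers the "even" half of the banks and a second pass, offset by one, covers the "odd" half.

#include <stdio.h>

/* Hedged timing model of the "decrement by 2" trick for a 14-clock loop.
   Assumption: this cog's window is at bank (clock % 16), and an access
   completes on the clock its target bank is in the window. */
int main(void)
{
    int clock = 0;
    int bank  = 0;                       /* bank of the next long we touch  */
    for (int i = 0; i < 8; i++) {        /* one pass: the 8 "even" banks    */
        int window = clock % 16;         /* bank reachable right now        */
        int stall  = (bank - window + 16) % 16;
        printf("iteration %d: bank %2d, stall %d clocks\n", i, bank, stall);
        clock += stall + 1;              /* the hub access itself           */
        clock += 13;                     /* remaining 13 clocks of the loop */
        bank   = (bank - 2 + 16) % 16;   /* next target: two banks back     */
    }
    return 0;
}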
We're going to need all the train, lazy Susan, and postman analogies we can get to help explain this.
Wouldn't this require a 32-bit data and 19-bit address bus for each of the 16 RAM blocks, along with 16 x 51-bit (or maybe only 47-bit) multiplexers? IOW, an 816-bit bus between the hub and cogs? Or is there some other way of doing this?
The more I think about this scheme, the more I like it. How else could you get as many as 16 simultaneous reads or writes without a 16-port RAM? In terms of efficiency and throughput, it's hard to top! It's like 16 keyholes into 16 small rooms, instead of one keyhole into one large room.
Comments
I'm not convinced that anyone is an expert on this yet ;-)
Right. The sequence is the same for all cogs.
However, I am quite pleased with the out-of-the-box thinking. Good job Chip and Roy.
["What if the existing RDLONG/WRLONG used addresses that didn't include the low nibble, which was instead added with the value of COGID? This would give each cog its own 32K HUB space. ..."]
["Can you make it so you can pass a parameter to this directive to tell it to offset it by some amount? So (address % 16) == offset? ..."]
Sure.
Yes. I'll have to be careful about only switching the mux's when they are needed.
["I see it more like a lazy Susan in a Chinese restaurant. If you want to fill your plate quickly, you do a block read, starting with whatever is in front of you as the thing rotates, grabbing each dish in turn until your plate is full. ..."]
-Phil
I'm feeling "hungry" now... Pardon the pun.