Get the difference? It makes perfect sense, but I sure wasn't anticipating it.
Those look different to me. Your original code captures old_value to qx, while the new code captures new_value.
By trying to capture the old value while writing a new one, on two different ports connected to the same array, the compiler thought something like a 4-port memory was needed. It never occurred to me to cut the read path short on a write, but that helps the compiler to infer a two-port memory, instead of some logical monstrosity.
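For anyone following the Verilog details, here's a minimal sketch of the two behaviors being discussed (signal names are invented for illustration; this is not the actual P2 source). The first style asks for the old contents of the array while the other port writes it, which can push the synthesizer toward a multi-port structure; the second bypasses the read on a write, so a plain two-port block RAM can be inferred:

```verilog
// Hypothetical names, for illustration only.
reg [31:0] mem [0:511];
reg [31:0] qx;

// Style 1: read-during-write captures the OLD data. Reading the
// pre-write contents on one port while another port writes the
// same array can make the tool build something like a 4-port memory.
always @(posedge clk) begin
    if (we)
        mem[waddr] <= new_value;
    qx <= mem[raddr];              // old data when raddr == waddr
end

// Style 2 (an alternative, not both in one module): cut the read
// path short on a write. The read port never needs the pre-write
// contents, so a simple two-port RAM can be inferred.
always @(posedge clk) begin
    if (we)
        mem[waddr] <= new_value;
    if (we && raddr == waddr)
        qx <= new_value;           // bypass: capture the new data
    else
        qx <= mem[raddr];
end
```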
...So, for those of us that aren't familiar with FPGA design, if the synthesis process is the step that takes the Verilog code from written form to gates (logical connections), is the "fitter" process the part that lays out all the circuitry?...
That's right.
So, here are the results from the full-chip compile on the Cyclone V -A9:
Logic utilization (in ALMs): 65,479 / 113,560 ( 58 % )
Total registers: 53,531
Total block memory bits: 2,621,440 / 12,492,800 ( 21 % )
Total DSP Blocks: 16 / 342 ( 5 % )
Fmax: 82.56 MHz, which could be improved by more aggressive Fmax-conscious compilation.
Compile time on my laptop: I don't know because the computer shut down after the compile.
By the time we add smart pins, we'll be maybe 20% bigger.
The whole chip will not fit in an -A7 device, but needs an -A9. For those with -A7 boards, you'll get ~12 cogs with ~32 smart pins, which is plenty to play with.
The -A9 could support a whole 1MB of hub RAM (the compile called for only 256KB), while the -A7 could do 512KB, which is what the chip will have.
Right now, Treehouse has the Verilog from the current design and they are going to run some early synthesis tests to determine what our silicon area will be at 180nm. With 16 cogs, this is going to be about half the area that the 8-cog P2-Hot was for its synthesized logic, and about 1/4 the power.
@Chip,
From your perspective, do you think the pace of development has improved now you're back on the forums? The reason I ask is that I can imagine taking on a project like this as the sole architect must be a struggle at times and take some real tenacity to get through. I know in my profession it is better to be able to bounce ideas off others when I can't get around the challenges that I'm faced with. That often leads to much better solutions in the end....
Just curious.
Regards,
Coley
Chip,
"...about half the area that the 8-cog P2-Hot was for its synthesized logic, and about 1/4 the power."
That is not only cool, literally, but sounds like it means twice as many chips per wafer and hence higher yield and lower cost per chip.
That's really koool.
"...It never occurred to me to cut the read path short on a write, but that helps the compiler to infer a two-port memory..."
Yeah, I experienced that too when trying to mod P1V. Mostly, I was just pointing out the change in behavior. If you didn't rely on the read-first behavior, then great!
"From your perspective do you think the pace of the development has improved now you're back on the forums?..."
Yes! I'll work a long time on my own and pretty much get done what I had planned, but when we discuss things on the forum, there are blasts of productivity that really surprise me.
I'm just one person with limited thoughts, but all you guys have your own wealths of experience and ideas that are foreign to me, but enrich the heck out of the Propeller effort.
Prop2-Hot was a Colossus of awesome ideas that would never have occurred to me, working alone. Your contributions amounted to probably 80% of the overall design. My job has been implementer and refiner, which has been really exciting. In fact, much of the refining came from you guys in the form of suggestions and incidental discussion.
Nobody could hire a group of idea people that could better your casual efforts here.
"...sounds like it means twice as many chips per wafer and hence higher yield and lower cost per chip."
Remember, though, that we doubled the hub RAM to 512k. That ate up a lot of silicon.
re: long-aligned instructions in hub ram
I don't think there's inherently a performance benefit to having your instructions long-aligned. Sure, it could take two hub reads for one instruction otherwise, but the instruction fetch window is luck of the draw either way. Unaligned instructions could go both ways. For example: part of your instruction may be in a hub ram "slice" ahead of where it would otherwise be if it was aligned, so there's effectively no latency in terms of instruction fetch. And as Chip said, once it starts streaming, it executes at full speed until the next jump. Unaligned instructions may seem like a terrible idea at first glance, but I think an actualized performance hit is non-existent.
I don't think misaligned instructions will ever cost more than a clock after each jump, if that. My problem with misaligned instructions is mainly that it complicates the assembler by making all addresses byte-aligned, even for cogexec code, unnecessarily complicating every compile-time calculated register address.
In practice, you will probably never do anything like '$1F4*4' because your register symbols will be symbolic and will step by 4 when you declare them.
There will probably be some development in the assembler's semantics that will further simplify cog vs hub address reckoning.
I don't feel like the hardware address paradigm is flawed, myself. It feels cleaner to me than Prop2-Hot. And we have constant JMP, CALL, and locating instructions which are 20-bit-range and byte-address-granular. I hope we've got everything covered, anyway.
All the world is byte addressable. I see no issue with the P II being byte addressable, in fact we require it for accessing byte data in HUB. It makes sense to use the same addressing in COG and HUB.
Not all the world can execute code from non-aligned byte addresses. Intel x86 can, older ARMs could not. I don't see it as an issue; compilers will not generate non-aligned code, and assemblers presumably won't do it unless you go out of your way to make them.
All sounds good to me.
I agree. I only mentioned this issue in case there was considerable cost associated with allowing instructions to be fetched from non-aligned addresses. From the discussion that followed, it seems that this falls out of allowing data to be accessed at non-aligned addresses and that is a useful feature. I certainly don't see any problem with allowing non-aligned instructions if there is no cost to doing so.
Chip (or anybody else that knows), does hubexec use an instruction cache? I don't see how it could achieve full speed execution without an instruction cache. It seems like you would actually need 2 16-long caches, where one is prefilling while the cog is executing out of the other one. Is that how it works?
If you do use instruction caches, are they filled at a rate of one long per cycle, or is it one long every 2 cycles?
Dave, it uses the "streamer" FIFO hardware for hubexec. So there is an initial stall as it fills (or a stall on branch for refill). The streamer fills a long per clock, I believe, so it stays ahead of the execution easily once it has the first long required.
How many longs is the streamer FIFO? If the FIFO retains some history it would allow for small low overhead loops where it wouldn't have to refill the last N instructions.
Actually, I'd love it if someone could clarify all of this.
In my simple mind I see things like this:
1) A COG dispatches instructions at x MIPS (this may not be the XTAL frequency or the system clock frequency; whatever it is, call it x MIPS).
2) An instruction could need 4 accesses to RAM: Read instruction, read two operands, write a result. Think "ADD x, y"
3) In order for 16 COGs to be doing this at full speed at the same time requires a shared memory (HUB) bandwidth of 16 * 4 * x accesses per second.
This seems impossible!
What am I missing here?
I can imagine there might be a wide bus to HUB so that a COG can actually read 2, 4, 8 or more longs at a time. Perhaps fine for code fetching but what about the data that is spread around all over the place?
I can imagine the access rate to HUB is faster than that x MIPs rate.
Someone please do tell.
In the "hot" chip, we had some discussion about pointers, accessing HUB memory, etc... Early on, HUBEXEC used the COG registers.
Basically, one could do that "add a, b" and have it end up in COG registers and run at a full clip. To get the data into or out of the HUB, an additional instruction is needed. So the instruction comes from the HUB, and the data from the COG.
That is the simple case.
Have we gone beyond that on this chip like "hot" did?
The eggbeater hub gives a cog access to 1/16 of the hub RAM on each cycle on a rotating basis. The hub RAM segments are interleaved so that once a cog can access a hub RAM segment it can access the next segment on the next cycle, and so on. This allows for fast access of sequential LONG addresses.
Roy or Chip, is there a separate data FIFO, or do data and instruction accesses share the same FIFO? And how big are the FIFOs?
Dave,
Last I read about "eggbeater" (old P II Hot days) was that it would allow a COG to access RAM at any time, provided the COG that would normally have access at that time is not actually needing access.
This does not really answer my question, see above, that involves 16 COGs running code from HUB at the same time.
I'm just fishing to get the current picture of how this all works and what the limits are. My example may well be an unrealistic extreme.
On cycle 0, cog 0 gets access to hub addresses ending in 0, cog 1 gets addresses ending in 1 and so on.
On cycle 1, cog 0 gets access to hub addresses ending in 1, cog 1 gets addresses ending in 2 and so on.
The FIFO smooths sequential access, random access could take up to 15 access cycle times to complete.
That's how I understood it when Roy suggested the division of HUB into 16 memories and suggested the lower nibble be used to map cog access to hub access.
All 16 cogs could access the hub together provided their access address nibbles are unique and aligned with the round-robin cycle scheme.
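To make the rotation concrete, here's a sketch of how that slice selection might look in Verilog (signal names invented for illustration; this is a model of the scheme described above, not the actual P2 source):

```verilog
// Hypothetical eggbeater slice selection, for illustration only.
// Hub RAM is split into 16 interleaved banks, selected by the low
// nibble of the long address. A rotating window gives cog N access
// to bank (N + cycle) mod 16 on each clock.
wire [3:0] slice   = cog_id + cycle[3:0];       // 4-bit add wraps mod 16
wire       granted = (long_addr[3:0] == slice); // this cog's turn for that bank
// Sequential longs fall in consecutive banks, so once a cog's window
// lines up it can stream one long per clock; a random address may
// have to wait up to 15 clocks for its bank to come around.
```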
Heater
potatohead has answered your question: there are not 4 hub accesses, only one - the instruction fetch. The source and destination operand access and the write back all happen in cog ram.
If you need data from the hubram in hubexec mode, I think you use RDLONG, RDWORD, RDBYTE to load it first into a register. This just works like other RISC processors.
Andy
potatohead.
"On cycle 0, cog 0 gets access to hub addresses ending in 0, cog 1 gets addresses ending in 1 and so on."
That implies that the HUB can deliver 16 longs per "cycle" to 16 COGs. That seems to imply a pretty wide bus to HUB.
Which is great when fetching instructions sequentially.
So we are still short of the other three long accesses per instruction required to get full-speed execution. Which is perhaps asking too much.
Just to clarify, the only difference between cog exec and hub exec is that in hub exec, instructions are fetched from hub RAM via the streaming FIFO, instead of from cog RAM. D and S registers are still in cog RAM.
It's true that in hub exec mode, a near branch backwards would cause a hub FIFO reload, even though the instructions might be in the FIFO. The problem is that a whole mess of mux's would be required to get those instructions out of the FIFO. I think those cases are not that common compared to normal branches and would just grow the hardware for a mediocre return.
@Chip
How is ALTDS finally implemented? Does it support post increment/decrement of the S and D fields in the D register? And does it support the redirection of the result write to another register than D?
Andy
Chip.
"...the only difference between cog exec and hub exec is that in hub exec, instructions are fetched from hub RAM via the streaming FIFO, instead of from cog RAM. D and S registers are still in cog RAM"
Ah, OK. So in order to add A to B in HUB variables we need to do the normal RDLONG, RDLONG, ADD, WRLONG thing.
Fair enough.
I can appreciate that jumps put a spanner in the works, also fair enough.
I guess I'm just trying to get a handle on the expected execution speed of C code, with all code and data in HUB. Kind of un-knowable really until the prop-gcc guys have done their magic with registers in COG and so on.
Heater, if data is in hub memory then straight cog pasm would need to use RDLONG/etc. too. So hub exec would be equivalent in speed to cog exec, except on jumps or when trying to use the streamer for data and code (hub exec loses badly here).
I think C code execution speed will be near native pasm speed in most cases. Sure you can hand make pasm to beat it, but it will be similar to other CPUs.