Why are even 5 levels necessary for the two cog version?
I don't know if you've figured this out already, but that's simply the fetch latency of the HubRAM mechanism, irrespective of any additional eggbeater slot delays. It's likely possible to reduce it further for a single-core design, but that would mean throwing away the Hub structure entirely.
This puts a direct RDLONG instruction for a 16 Cog version in a wide variable range of 6 to 21 clocks - two clocks quicker than the Prop1. For a 2 Cog version that drops to the narrow range of 6 to 7 clocks. Note that the minimum time doesn't change.
For the FIFO to work without a glitch it has to be able to request data that far ahead and therefore be able to store that data too. This then is all effectively pre-fetch.
NOTE: Don't take those numbers as gospel. There might be a +2 for instruction execution on top of them. In which case, for example, the Prop2 16 Cog version will be 8 to 23 clocks for a RDLONG - Same as the Prop1.
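To make the arithmetic above concrete, here is a minimal sketch that reproduces the quoted ranges. It assumes the maximum is simply the 6-clock minimum plus one extra clock per additional cog, which fits the 16 Cog and 2 Cog figures quoted but is not confirmed anywhere in the thread. The function name and the `exec_overhead` parameter are made up for illustration.

```python
def rdlong_clocks(cogs, base=6, exec_overhead=0):
    """Return (min, max) clocks for a RDLONG under the figures quoted above.

    Assumption: the worst case adds one clock of slot wait per extra cog.
    exec_overhead models the possible "+2 for instruction execution".
    """
    fastest = base + exec_overhead
    slowest = fastest + (cogs - 1)
    return fastest, slowest


print(rdlong_clocks(16))                   # (6, 21)
print(rdlong_clocks(2))                    # (6, 7)
print(rdlong_clocks(16, exec_overhead=2))  # (8, 23)
```

The minimum stays at `base` regardless of cog count, which matches the remark that the minimum time doesn't change.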
These FIFOs have grown pretty big, and on the 16-cog version they are as big as smart pins. I contacted OnSemi this morning about getting a small dual-port SRAM instance from them that would handle this requirement a lot more gracefully. We need a 32-location by 36-bit dual-port SRAM. This would save a lot of logic on the FPGA, too, as there are plenty of unused RAM resources we could tap instead of the logic fabric.
Would that also reduce the latency from 5 clocks to 1 clock?
No. It wouldn't change that. That latency could be reduced on the ASIC, but I added levels of flops on the FPGA to overcome long routing delays going between muxes, RAMs, and muxes.
You would hope their tools can infer Dual Port (1 x WR, 1 x RD) RAM with some helper lines in the code, and then compile the memory using a memory compiler?
D-FFs should compile more compactly in ASIC than they do in FPGA, but it would be a good idea to match the underlying structure in FPGA to ASIC.
I started doing the switchover to dual-port RAM today and I quickly realized that the latencies involved in RAM make the idea unworkable. We need instantaneous logic for this FIFO. It has to react each clock cycle. The problem with RAM is that it takes 1 clock to issue the read, then another for the data to come out and be latched. We'd need a mini logic FIFO around the dual-port RAM FIFO. No thanks! I think we'll just have to stick with the logic.
Altera has some FIFO modes for its memory resources that probably get around this problem, but I can't introduce Altera-specific blocks into the Verilog. We'll have to live with the logic, I guess.
I'm not quite following. Isn't there a 5-clock budget for this already?
Write has to burst-sync with HUB rotate anyway, and read has to follow, slower.
So you have two interacting paused-sawtooths on address terms.
Getting data into the DP RAM is easy. Getting data out on zero notice is impossible. This matter is outside of the 5-clock latency. You just don't know when the streamer is going to want a blast of data and, therefore, you'd need a mini logic-based FIFO to smooth out the 2-clock latency for reading data from the DP RAM.
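The 2-clock read cost described above can be sketched with a toy model. This is only an illustration, assuming a synchronous-read RAM that captures the address on one clock edge and latches the data on the next; the class and function names are hypothetical and have nothing to do with the actual Verilog.

```python
class SyncReadRam:
    """Toy RAM: address captured on one edge, data latched on the next."""

    def __init__(self, contents):
        self.mem = list(contents)
        self.addr_reg = None  # address captured on the issue clock
        self.data_reg = None  # data latched one clock after that

    def clock(self, addr=None):
        # Data latched on this edge belongs to the address captured last edge.
        self.data_reg = None if self.addr_reg is None else self.mem[self.addr_reg]
        self.addr_reg = addr
        return self.data_reg


def cycles_until_data(ram, addr):
    """Count clock edges from presenting addr until its data is latched."""
    edges = 1
    out = ram.clock(addr)      # edge 1: the read is issued
    while out is None:
        out = ram.clock()      # following edge(s): data comes out and is latched
        edges += 1
    return edges, out


print(cycles_until_data(SyncReadRam([10, 20, 30]), 1))  # (2, 20)
```

A flop-based FIFO, by contrast, already has its head word sitting on a register output, so in this picture it costs zero edges when the streamer suddenly asks for data.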
If you needed the fastest possible 'go', you could queue by a sort of pre-read, i.e. you know the address you need next well in advance, so it can be assigned and stable, and you can have the data ready too, but you don't yet signal to the DP-FIFO that you have actually read the data.
That's probably much the same as what you mean by a "mini logic-based FIFO", but it can be quite mini?
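As a rough way to explore how "mini" such a buffer could be, here is a cycle-level sketch of the pre-read idea. Everything in it is an assumption for illustration: the function name and parameters are hypothetical, the RAM accepts one read per clock, and a word arriving from the RAM is usable in the same cycle it lands in the register buffer.

```python
from collections import deque


def run_stream(words, ram_latency, buf_cap, demand):
    """Model a small register buffer in front of a slow-read RAM FIFO.

    words       : data already sitting in the RAM FIFO
    ram_latency : clocks from issuing a read to the word being usable
    buf_cap     : entries in the hypothetical mini register buffer
    demand      : per-cycle booleans, True when the consumer wants a word
    Returns (delivered_words, stall_cycles).
    """
    in_flight = deque()  # (arrival_cycle, word) for reads already issued
    buf = deque()        # words ready for zero-notice delivery
    next_read = 0
    delivered, stalls = [], 0
    for cycle, want in enumerate(demand):
        # Reads issued ram_latency clocks ago land in the register buffer.
        while in_flight and in_flight[0][0] <= cycle:
            buf.append(in_flight.popleft()[1])
        # The consumer takes the head word in the same cycle it asks.
        if want:
            if buf:
                delivered.append(buf.popleft())
            else:
                stalls += 1
        # Pre-read: keep buffered plus in-flight words topped up to buf_cap.
        if next_read < len(words) and len(buf) + len(in_flight) < buf_cap:
            in_flight.append((cycle + ram_latency, words[next_read]))
            next_read += 1
    return delivered, stalls


burst = [False] * 4 + [True] * 8           # idle, then an 8-word blast
print(run_stream(list(range(8)), 2, 2, burst))  # ([0, 1, 2, 3, 4, 5, 6, 7], 0)
print(run_stream(list(range(8)), 2, 1, burst))  # ([0, 1, 2, 3], 4)
```

In this toy model two entries are enough to hide a 2-clock read once the buffer has been primed, while a single entry stalls every other clock. Real hardware without the same-cycle bypass assumed here would likely need more, so treat the numbers as a starting point rather than a result.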
It does sound like an issue worth pursuing: see what the Altera resources can produce and, if they do a good job, enquire about an equivalent OnSemi resource.
If my model is right, which it seems to be, then I believe #cogs+11 is sufficient.
Good idea. I'll check that.
Could you please try this out and see if it works?
I will throw some more code at it to verify.
Cheers
Brian
A2 is running hubexec nicely now.
So far all tests passed Ok.
Vital signs looking good.
Super! Thanks for testing that. I'm recompiling everything now.