2 Cog DE0-Nano/CV-A2 Hubexec fifo broken

jmg · 2016-06-07 02:34

cgracey wrote: »

The streamer could stress it in such a way that all those FIFO levels are needed.

I was really asking, are you 100% certain it can never need #cogs + 12 ?

cgracey · 2016-06-07 02:49

jmg wrote: »

cgracey wrote: »

The streamer could stress it in such a way that all those FIFO levels are needed.

I was really asking, are you 100% certain it can never need #cogs + 12 ?

If my model is right, which it seems to be, then I believe #cogs+11 is sufficient.

evanh · 2016-06-07 02:59

Only corner case I can think of is a possible extra clock needed when not long aligned.

cgracey · 2016-06-07 03:14

evanh wrote: »

Only corner case I can think of is a possible extra clock needed when not long aligned.

Good idea. I'll check that.

cgracey · 2016-06-07 06:41

Ozpropdev,

Could you please try this out and see if it works?

BeMicro_A2_Prop2_v9b.zip

ozpropdev · 2016-06-07 06:55

cgracey wrote: »

Ozpropdev,

Could you please try this out and see if it works?

Looks Ok Chip.

I will throw some more code at it to verify.
Cheers
Brian

ozpropdev · 2016-06-07 07:16

Chip
A2 is running hubexec nicely now.
So far all tests passed Ok.
Vital signs looking good.

cgracey · 2016-06-07 07:17

ozpropdev wrote: »

Chip
A2 is running hubexec nicely now.
So far all tests passed Ok.
Vital signs looking good.

Super! Thanks for testing that. I'm recompiling everything now.

evanh · 2016-06-07 09:22

Electrodude wrote: »

Why are even 5 levels necessary for the two cog version?

I don't know if you've figured this out already, but that's the fetch latency of the HubRAM mechanism, irrespective of any additional eggbeater slot delays. It's likely possible to reduce this further for a single-core design, but that would mean throwing away the Hub structure entirely.

This puts a direct RDLONG instruction on a 16 Cog version at a wide range of 6 to 21 clocks, two clocks quicker than the Prop1. For a 2 Cog version that drops to the narrow range of 6 to 7 clocks. Note the minimum time doesn't change.

For the FIFO to work without a glitch it has to be able to request data that far ahead and therefore be able to store that data too. This then is all effectively pre-fetch.

evanh · 2016-06-07 09:50

NOTE: Don't take those numbers as gospel. There might be a +2 for instruction execution on top of them. In that case, for example, the Prop2 16 Cog version would be 8 to 23 clocks for a RDLONG, the same as the Prop1.
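
The ranges quoted above can be reproduced with a little arithmetic. This is only a sketch: it assumes a fixed base latency of 6 clocks, plus a hub-slot wait of 0 to cogs-1 clocks, plus an optional instruction-execution overhead. The function name and parameters are made up for illustration and are not taken from the Prop2 design.

```python
def rdlong_clocks(cogs, base=6, exec_overhead=0):
    """Return (best, worst) clock counts for a RDLONG.

    Assumption: the wait for the cog's hub slot adds between 0 and
    cogs - 1 clocks on top of a fixed base latency.
    """
    best = base + exec_overhead
    worst = base + (cogs - 1) + exec_overhead
    return best, worst


for cogs in (2, 16):
    # Second tuple includes the possible +2 for instruction execution.
    print(cogs, rdlong_clocks(cogs), rdlong_clocks(cogs, exec_overhead=2))
```

With the defaults this gives 6 to 7 clocks for 2 cogs and 6 to 21 for 16 cogs; with the possible +2 it gives 8 to 23 for 16 cogs, matching the figures in the posts above.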

Seairth · 2016-06-07 18:16

cgracey wrote: »

These FIFOs have grown pretty big and on the 16-cog version, they are as big as smart pins. I contacted OnSemi this morning about getting a small dual-port SRAM instance from them that would handle this requirement a lot more gracefully. We need a 32-location by 36-bit dual-port SRAM. This would save a lot of logic on the FPGA, too, as there are plenty of unused RAM resources we could tap, instead of the logic fabric.

Would that also reduce the latency from 5 clocks to 1 clock?

cgracey · 2016-06-07 20:06

Seairth wrote: »

cgracey wrote: »

These FIFOs have grown pretty big and on the 16-cog version, they are as big as smart pins. I contacted OnSemi this morning about getting a small dual-port SRAM instance from them that would handle this requirement a lot more gracefully. We need a 32-location by 36-bit dual-port SRAM. This would save a lot of logic on the FPGA, too, as there are plenty of unused RAM resources we could tap, instead of the logic fabric.

Would that also reduce the latency from 5 clocks to 1 clock?

No. It wouldn't change that. That latency could be reduced on the ASIC, but I added levels of flops on the FPGA to overcome long routing delays, going between muxes, RAMs, and muxes.

jmg · 2016-06-07 23:11

cgracey wrote: »

These FIFOs have grown pretty big and on the 16-cog version, they are as big as smart pins. I contacted OnSemi this morning about getting a small dual-port SRAM instance from them that would handle this requirement a lot more gracefully. We need a 32-location by 36-bit dual-port SRAM. This would save a lot of logic on the FPGA, too, as there are plenty of unused RAM resources we could tap, instead of the logic fabric.

You would hope their tools can infer dual-port (1 x WR, 1 x RD) RAM from some helper lines in the code, and then compile the memory using a memory compiler?
D-FFs should compile more compactly in an ASIC than they do in an FPGA, but it would be a good idea to match the underlying FPGA structure to the ASIC.

cgracey · 2016-06-07 23:28

jmg wrote: »

cgracey wrote: »

These FIFOs have grown pretty big and on the 16-cog version, they are as big as smart pins. I contacted OnSemi this morning about getting a small dual-port SRAM instance from them that would handle this requirement a lot more gracefully. We need a 32-location by 36-bit dual-port SRAM. This would save a lot of logic on the FPGA, too, as there are plenty of unused RAM resources we could tap, instead of the logic fabric.

You would hope their tools can infer dual-port (1 x WR, 1 x RD) RAM from some helper lines in the code, and then compile the memory using a memory compiler?
D-FFs should compile more compactly in an ASIC than they do in an FPGA, but it would be a good idea to match the underlying FPGA structure to the ASIC.

I started doing the switchover to dual-port RAM today and I quickly realized that the latencies involved in RAM make the idea unworkable. We need instantaneous logic for this FIFO. It has to react each clock cycle. The problem with RAM is that it takes 1 clock to issue the read, then another for the data to come out and be latched. We'd need a mini logic FIFO around the dual-port RAM FIFO. No thanks! I think we'll just have to stick with the logic.
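
To make the two-clock read concrete, here is a toy clocked model. It is a sketch, not the actual Verilog: it assumes a RAM with a registered address and a registered output, and compares it with a flop array whose contents can be muxed out in the same clock. The class and variable names are hypothetical.

```python
class SyncReadRam:
    """Toy model of a RAM with a registered address and a registered output."""

    def __init__(self, words):
        self.mem = list(words)
        self.addr_reg = None  # address captured on the first clock edge
        self.data_reg = None  # data latched on the second clock edge

    def clock(self, addr=None):
        """Advance one clock edge; addr is the address presented before that edge."""
        # The output register latches the word picked by the address captured earlier.
        self.data_reg = None if self.addr_reg is None else self.mem[self.addr_reg]
        self.addr_reg = addr
        return self.data_reg


words = [10, 11, 12, 13]
ram = SyncReadRam(words)
after_edge_1 = ram.clock(addr=3)  # read issued, nothing usable yet
after_edge_2 = ram.clock()        # data comes out and is latched
flop_read = words[3]              # flop array: available without waiting for a clock
print(after_edge_1, after_edge_2, flop_read)
```

In this model the RAM only delivers the word after the second edge, while the flop-based storage can present it immediately, which is the gap described above.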

Altera has some FIFO modes for its memory resources that probably get around this problem, but I can't introduce Altera-specific blocks into the Verilog. We'll have to live with the logic, I guess.

jmg · 2016-06-07 23:40

cgracey wrote: »

The problem with RAM is that it takes 1 clock to issue the read, then another for the data to come out and be latched. We'd need a mini logic FIFO around the dual-port RAM FIFO. No thanks! I think we'll just have to stick with the logic.

I'm not quite following. Isn't there a 5-clock budget for this already?
Write has to burst-sync with HUB rotate anyway, and read has to follow, slower.
So you have two interacting paused-sawtooths on address terms.

cgracey · 2016-06-07 23:53

jmg wrote: »

cgracey wrote: »

The problem with RAM is that it takes 1 clock to issue the read, then another for the data to come out and be latched. We'd need a mini logic FIFO around the dual-port RAM FIFO. No thanks! I think we'll just have to stick with the logic.

I'm not quite following. Isn't there a 5-clock budget for this already?
Write has to burst-sync with HUB rotate anyway, and read has to follow, slower.
So you have two interacting paused-sawtooths on address terms.

Getting data into the DP RAM is easy. Getting data out on zero notice is impossible. This matter is outside of the 5-clock latency. You just don't know when the streamer is going to want a blast of data and, therefore, you'd need a mini logic-based FIFO to smooth out the 2-clock latency for reading data from the DP RAM.
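
A rough way to see how big such a mini logic FIFO might need to be is to simulate it. The sketch below rests on simplifying assumptions: the DP RAM is never empty, it accepts one pipelined read per clock, each read takes 2 clocks to deliver, and the refill read can be issued in the same clock as the word is taken. The function name, the burst pattern, and the counting of underruns are all invented for illustration.

```python
from collections import deque


def count_underruns(buffer_depth, demand, read_latency=2, clocks=40):
    """Count clocks on which a word was wanted but the logic buffer was empty.

    demand(t) says whether the consumer takes a word on clock t.
    The logic buffer starts full; the backing RAM is assumed never empty.
    """
    buffered = buffer_depth  # words ready in the logic buffer
    in_flight = deque()      # arrival clocks of reads already issued to the RAM
    underruns = 0
    for t in range(clocks):
        # 1. Reads issued read_latency clocks ago land in the buffer.
        while in_flight and in_flight[0] == t:
            in_flight.popleft()
            buffered += 1
        # 2. The consumer takes a word with zero notice.
        if demand(t):
            if buffered:
                buffered -= 1
            else:
                underruns += 1
        # 3. Issue one RAM read per clock while there is room for its result.
        if buffered + len(in_flight) < buffer_depth:
            in_flight.append(t + read_latency)
    return underruns


def burst(t):
    """Hypothetical streamer burst: one word per clock for 20 clocks."""
    return 10 <= t < 30


for depth in (1, 2, 3):
    print(depth, count_underruns(depth, burst))
```

Under these assumptions a one-entry buffer underruns on every other clock of the burst, while a buffer as deep as the read latency keeps up. If the refill decision were itself registered, the buffer would presumably need to be deeper.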

jmg · 2016-06-08 00:05

cgracey wrote: »

Getting data into the DP RAM is easy. Getting data out on zero notice is impossible. This matter is outside of the 5-clock latency. You just don't know when the streamer is going to want a blast of data and, therefore, you'd need a mini logic-based FIFO to smooth out the 2-clock latency for reading data from the DP RAM.

If you needed the fastest possible 'go', then you could queue by a sort of pre-read, i.e. you know the address you need next well in advance, so that can be assigned and stable, and you can have the data ready too, but you do not signal to the DP-FIFO that you have actually read the data just yet.

That's probably much the same as what you are saying with a "mini logic-based FIFO", but that can be quite mini?

evanh · 2016-06-08 00:50

cgracey wrote: »

Altera has some FIFO modes for its memory resources that probably get around this problem, but I can't introduce Altera-specific blocks into the Verilog. We'll have to live with the logic, I guess.

It does sound like an issue worth pursuing: see what the Altera resources can produce and, if that does a good job, then enquire about an equivalent OnSemi resource.
