Fast hub RAM timing
TonyB_
Posts: 2,178
in Propeller 2
Why does RDFAST wait until the FIFO contains read data before completing? Couldn't any wait, if necessary, be added to RFBYTE/RFWORD/etc. instead?
It seems a waste of clock cycles to me. I'd like to do a RDFAST on a certain clock tick, then execute some other code, then do a RFBYTE on the next clock tick. The tick period would be long enough for the FIFO to contain data. In effect, the RDFAST is the start of an internal memory read request and the RFBYTE latches the read data.
Do we simply subtract 8 from the various timings in the Instructions v31, now that there won't be 16 cogs in the finished chip? Wouldn't it be better for the timings to be changed to eight cogs?
Comments
The timings need to be updated. In some cases, the improvement is more than 8 clocks.
RDBYTE/RDWORD/RDLONG
WRBYTE/WRWORD/WRLONG
RDFAST
Is the maximum now eight fewer?
I need to run some tests and determine a formula to adjust all numbers by. At least 8 clocks will be saved on each worst-case clock count.
Each cog outputs 4 DAC channels. The streamer in each cog can drive new values into all 4 channels on each clock.
Each pin can select which cog's DAC channel it receives, in such a way that pin %xxxxCC can pick DAC channel %CC from any cog.
Oh, wow, I hadn't noticed that at all. I thought the only way to set a DAC by hand was to set up a Smartpin.
And: Okay, that's clear enough. Each channel of each Cog can output to 16 different DACs. And as you've said, all DACs can be reached from every Cog. I hadn't realised that that much routing still existed. That'll be 1/4 of the first Prop2-Hot incarnation, I presume?
It's been programmable for a long time now. For a while, cog DAC channels were fixed to sets of 4 pins. Now, any pin can select which cog it gets its DAC channel data from.
What IS fixed is that:
Pins 0/4/8/12/...60 can select any cog's DAC channel 0
Pins 1/5/9/13/...61 can select any cog's DAC channel 1
Pins 2/6/10/14/...62 can select any cog's DAC channel 2
Pins 3/7/11/15/...63 can select any cog's DAC channel 3
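The fixed pin-to-channel rule listed above can be modelled in a few lines. This is only an illustrative sketch: the function names are made up for this example, and the only thing it relies on is the statement that pin %xxxxCC picks DAC channel %CC.

```python
def dac_channel_for_pin(pin):
    """Return the cog DAC channel (0..3) that a pin can select.

    Assumes the rule from the post: the channel is the low two bits
    of the pin number (the %CC in %xxxxCC).
    """
    if not 0 <= pin <= 63:
        raise ValueError("pin must be in 0..63")
    return pin & 0b11


def pins_for_dac_channel(channel):
    """Return every pin (0..63) that can select the given DAC channel."""
    if not 0 <= channel <= 3:
        raise ValueError("channel must be in 0..3")
    return list(range(channel, 64, 4))
```

For example, `dac_channel_for_pin(61)` gives 1, which matches the "1/5/9/13/...61" row.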
Special RDFAST and WRFAST that take only TWO clocks (no waiting). You just have to make sure you elapse enough clocks before doing another hub memory operation, to ensure that the RDFAST/WRFAST is done.
Two-clock RDFAST and WRFAST, as TonyB_ pointed out, will enable timing determinism, so that we can better approach FPGA-replacement apps, where timing must be exact.
This alternate behavior is triggered when D[31] is high on RDFAST/WRFAST. The block count is always in D[13:0] and D[31] is normally low.
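As a rough illustration of that D layout, here is a small helper that packs and unpacks the value. The helper and constant names are hypothetical; the only facts taken from the post are that D[31] selects the two-clock behavior and D[13:0] holds the block count. What the remaining bits do is not covered here, so the sketch leaves them at zero.

```python
NO_WAIT_BIT = 1 << 31        # D[31]: high selects the two-clock (no waiting) form
BLOCK_COUNT_MASK = 0x3FFF    # D[13:0]: block count


def encode_fast_d(block_count, no_wait=False):
    """Build a D value for RDFAST/WRFAST (hypothetical helper)."""
    if not 0 <= block_count <= BLOCK_COUNT_MASK:
        raise ValueError("block count must fit in 14 bits")
    d = block_count
    if no_wait:
        d |= NO_WAIT_BIT
    return d


def decode_fast_d(d):
    """Split a D value into (block_count, no_wait)."""
    return d & BLOCK_COUNT_MASK, bool(d & NO_WAIT_BIT)
```

With this, `encode_fast_d(4, no_wait=True)` yields `0x80000004`, and leaving `no_wait` at its default keeps D[31] low as in the normal case.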
We could actually have two-clock WRxxxx, as well, but we only have two '{#}D,{#}S' instruction slots. I suppose a WXLONG instruction would be most valuable. Maybe a WXBYTE, also. 'X' is for 'exit'. There's no room for three of them, though. I think long- and byte-writes would be most valuable. No, that actually won't work because the cog must be waiting, after all. RDFAST/WRFAST can be sped up, though.
However, I'm all for removing instruction stalls ... and have suggested similar strategies a couple of years back. It came to a head when Cluso was trying to streamline his serial routines and wanted to pair up Cogs so he didn't have to pass any data via HubRAM. And the only reason he didn't like HubRAM was because of the hard to predict instruction stalls.
You may remember me trying to sell you on having a second FIFO per Cog. That was to remove stalls while providing both HubRAM reads and writes.
RDxxxx and WRxxxx are just too sticky to free, but RDFAST/WRFAST can be sped up.
Here is the original code:
And here is the modified code to realize 2-clock RDFAST/WRFAST when D[31] is set:
I'm compiling it now to test it.
EDIT: Hmm, that link isn't lining up correctly. Here's the quote:
EDIT2: It would seem I spelt programmatically way wrong. I suspect the spell-checker must have intervened.
Maybe because things are now more settled, I can think about it more clearly. When you brought it up, it just seemed all too messy.
Although late in the day, it is such a simple enhancement, trivial in terms of logic, but the benefits could be massive. As well as 100% predictable timing, otherwise wasted wait states can now be used productively, as described in the top post.
The modified worst-case timings with only eight cogs would be useful to know, sometime. Anyway, the fog of uncertainty has lifted and it's clear blue skies from now on!
If you get the timing wrong during development, what happens? (e.g. does the read return a copy of the previous read data?)
i.e. how does a developer confirm/prove they have this correct?
If you do have the timing exact, and flip the opcode, does it still work exactly the same?
It would be good to have a table of the cycle trade-offs here: code simplicity vs hand crafting, to allow informed decision making.
What happens if WFxxxx follows a RFxxxx without a WRFAST and vice-versa?
I didn't mention WRFAST because I was concerned mainly with fast hub reads. It's interesting that a fast WRFAST takes two cycles (keeping the cycle count even, which might be important) compared to three for the slow version (both assuming previous WRFAST has finished). Therefore one instruction must come between fast WRFAST and WFxxxx, as I understand it.
a one clock NOP.
I can imagine this being an interesting idea for fine synchronization of loops, or even cog2cog.
Mike
It would be good to have the 8-cog timings for the different hub RAM reads and writes or formulae so that we can calculate them, while this matter is fresh in our minds. Also the answer to some other points, such as what happens:
(a) if RFxxxx occurs too soon after fast RDFAST?
(b) if WFxxxx follows RFxxxx without intervening WRFAST?
(c) if RFxxxx follows WFxxxx without intervening RDFAST?
I posted the old timings so they could be copied and replaced with the new ones here.
64 bytes is 16 longs, or 16 cogs * one long per cog, which is what determined the minimum size of the hub FIFO? Now that there are only 8 cogs, is the FIFO smaller?
The FIFO is smaller, yes.
The block size could be smaller, too.
I keep the block size at 16 longs, though, so that we can have software compatibility across a family of chips, with up to 16 cogs.
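The block-size arithmetic from the last few posts, written out as a sketch. It assumes only what is stated in the thread (a long is 4 bytes, a block stays at 16 longs); the helper names are invented for the example.

```python
BYTES_PER_LONG = 4
LONGS_PER_BLOCK = 16                              # kept at 16 even on an 8-cog chip
BLOCK_BYTES = BYTES_PER_LONG * LONGS_PER_BLOCK    # 64 bytes


def blocks_to_bytes(blocks):
    """Size in bytes of a whole number of blocks."""
    return blocks * BLOCK_BYTES


def bytes_to_blocks(nbytes):
    """Number of blocks needed to cover nbytes, rounded up."""
    return -(-nbytes // BLOCK_BYTES)
```

So a 65-byte buffer would need `bytes_to_blocks(65)`, i.e. 2 blocks, regardless of how many cogs a particular family member has.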
a) It returns errant data. Ideally it should ignore all WFxxxx and RFxxxx operations if 'not ready'.
b and c) I need to look into this. Ideally, it should ignore WFxxxx in RDFAST mode and RFxxxx in WRFAST mode.
Thanks for bringing these things up. I've been really tied up with Treehouse and OnSemi the past few days. I intend to get into these other matters soon.