The idea is to change the offer order for each slot, so that no slot has a perpetual advantage in receiving second-hand slots.
Sounds good, and can still work on a paired basis if the designer wants to allocate by not using.
Only issue is this gives a minimum BW, so avoids starving, but the upper limit is less defined, and some code uses the HUB timing for pacing.
That could cause some hard-to-diagnose bugs and interactions?
BTW, When you get a chance ... any luck with SERDES ?
It's on the list with a few augmented pin instruction pairs (complementary pair I/O pins), single-bit general-purpose atomic CRC calculation, and a special read-USB-pin (NRZI bit-banging read) instruction.
The issue I see with item 2 being ANY slot is the loss of determinism.
C.W.
How about an instruction SETSLOT D/# that writes an 8-bit value to a slot-okay register, where %00000001 is the default and means "my slot". %00010001 would mean "my slot or my slot plus 4". You could set all 8 bits to take any available slot (offered in some equitable order). The LSB is the hard slot, with upward bits representing slots upward, relatively.
This would address any case, but doesn't cover the order in which slots are offered.
(I spend a day with family... another P2 revolution happens... maybe I should spend more time with family)
20% is HUGE.
Could you give us an idea of what that 10 mm² would correspond to in:
- hub memory (kbytes, single ported, total, not per cog)
- aux memory (kbytes, dual ported, total, not per cog)
- cog memory (kbytes, quad ported, total, not per cog)
And for the sake of interest:
- octal ported memory (kbytes)
The reason for all these questions is so we can all chew on the best way to allocate these new riches (of transistors).
I did read ahead, and more aux memory is a good default, as it would help compiled code greatly (by holding the stack), but there may be better uses.
Ray mentioned the 8 ported memory, as 16 longs, as well.
No matter what, every pin will have a 9-bit DAC and a delta-sigma ADC. Those are built-in and all fit underneath the power rings.
What is taking 20% of the chip area, or 10 square mm, is the huge DAC bus for DAC signals that come out of cogs and update on every system clock or video clock. If we eliminate that bus, we'll have double the space we have now for the core, which would be a huge deal.
The limitation caused by getting rid of that huge DAC bus would be that certain pins would now be tied to certain cogs for outputting new DAC data on every clock. It wouldn't affect anything else, like static DAC updates or ADC's. It would just mean that for outputting video or analog CTR signals, certain pins would relate to certain cogs. Is that a limitation that we can live with?
Such a change would allow all 1.8V pin logic to be synthesized within the core and drastically reduce timing complexities for I/O. Also, we could put things like parallel DAC updating and DAC dither into the core-side logic for each pin. We could even have ADC tallies computed per pin, or PWM output. This stuff is all very simple to do, actually. I'm also aware that what we currently have for the core logic is bigger than our last fab attempt, and probably won't fit into the old space. We're going to have to eliminate at least part of that huge DAC bus. I'm thinking of getting rid of it, altogether. What do you guys think?
We really are out of room. We can't grow the die anymore, because it is at the limit of what will go into the biggest-die-pad TQFP-128 package. Even the pad frame is max'd out and can't accommodate any more I/O's or power pins. It just seems like an exorbitant expense to have 20% of the die spent on a giant data bus of 288 signals that circle the chip, that are likely to be only partially used in even a very busy application.
About more hub RAM: the next step would be to double what we've got, and that would eat up the entire 10 square mm. We could double or quadruple the AUX RAMs, though.
Leave it up to the sw to determine priority. Determinism is up to the user - if he specifically allocates another cog's slot that only this cog has access to, then this can be deterministic depending on what the other cog is doing which is user determined anyway.
If I understood Ray correctly, he statically mapped cog+4 to use the hub slot of cog if it did not need it.
How about this variation:
- Each cog gets its slot if it has a request pending
- there is a four bit 'LEND' register that holds the number of the cog to lend the hub cycle to if the cog does not need it
The encoding for LEND can be:
HNNN - h = 0 is a 'hard' lend, so Cluso's suggestion would be coded as
0 (%0100)
1 (%0101)
...
7 (%0111)
But as the requester is programmed in a register, a single cog could take all the unclaimed hub cycles, or half of them, in 1/8th increments,
and %1xxx could mean 'next requesting cog' for a round robin mode.
This allows great flexibility, and allows assigning hub bandwidth programmatically.
Proposed new instruction (probably needs to be hub oriented):
SETHUB cog,#%Hnnn
I'm thinking that there is a general-case rule we could come up with, for which your proposal is just a subset (perhaps the first tier of the rule).
The criteria are:
1) Every cog gets 1st priority to its own slot
2) If a cog doesn't use its slot, it must be equitably offered to all other cogs, in such a way that no particular cog has an unfair advantage (i.e. no simple 0..7 priority list, but one that is distributed, based on slot number).
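One way to picture criterion 2 is a plain rotation: an unused slot is offered to the other cogs starting from the one just after the slot's owner. This is only a sketch of one possible "distributed" order, not anything Chip has committed to. The names `offer_order` and `position_counts` are made up for illustration, and eight cogs with one slot each is assumed.

```python
NUM_COGS = 8  # assumption: eight cogs, one hub slot each


def offer_order(slot: int) -> list[int]:
    """Cogs an unused `slot` is offered to, in order (the owner is excluded)."""
    return [(slot + k) % NUM_COGS for k in range(1, NUM_COGS)]


def position_counts(position: int) -> list[int]:
    """How often each cog sits at `position` in the offer list over all slots."""
    counts = [0] * NUM_COGS
    for slot in range(NUM_COGS):
        counts[offer_order(slot)[position]] += 1
    return counts


print(offer_order(0))  # [1, 2, 3, 4, 5, 6, 7]
print(offer_order(6))  # [7, 0, 1, 2, 3, 4, 5]
```

With this rotation every cog is first in line for exactly one slot, second in line for exactly one slot, and so on, so there is no fixed 0..7 pecking order. Whether that counts as equitable enough when some cogs rarely give up their slots is a separate question.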
As I understand it, the outer ring of logic and I/O is hand laid by Beau and is fixed. That defines the package and is too big a job to change.
Yes the pace is frenetic!
What about the second-hand slot offer order? Is it that critical? It could be established by each cog (8 x 3-bit fields determine which cog is offered the slot first, then the lowest 8 bits can be what we've already discussed for SETSLOT).
That's right. The die would have to grow to even accommodate more power/ground pins. I don't know how big of a die pad is available for the TQFP-144 leadframe. We are at the limit now for the TQFP-128.
There is a huge advantage to getting a bigger die pad on the lead frame... If it's big enough, and especially if it's a big metal square that is exposed on the bottom, all the ground connections can be "down bonds", where they go from the die right down to the die pad, bypassing the pins. This would free up all the current ground pins, leaving a lot more space for I/O signals and power pins. This would be ideal for a DDR2 version, especially.
RDQUAD/WRQUAD can take 1 clock. WRBYTE/WRWORD/WRLONG can take 1 clock. RDBYTE/RDWORD/RDLONG can take 3 clocks.
So if cog 0 executed the instructions RDQUAD/WRQUAD or WRBYTE/WRWORD/WRLONG and the current slot was free (its own cog slot or another to which it was permitted access/priority) then this instruction would execute. The next instruction could be the same and if that next slot was free, it would also execute. Same applies in a RET w/rxxx loop.
However, if RDBYTE/RDWORD/RDLONG is executed, then the setup takes 2 clocks before the read is completed. So these instructions can only utilise every 3rd real hub slot???
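Taking the clock counts quoted above at face value, the arithmetic looks like this. It is a rough sketch that assumes every slot in the window is free for the cog, and `slots_used` is just an illustrative helper name.

```python
# Clocks per hub instruction, as quoted in the thread.
HUB_CLOCKS = {
    "RDQUAD": 1, "WRQUAD": 1,
    "WRBYTE": 1, "WRWORD": 1, "WRLONG": 1,
    "RDBYTE": 3, "RDWORD": 3, "RDLONG": 3,
}


def slots_used(instr: str, window: int) -> int:
    """Back-to-back `instr` that complete in `window` clocks, all slots free."""
    return window // HUB_CLOCKS[instr]


print(slots_used("WRLONG", 24))  # 24
print(slots_used("RDLONG", 24))  # 8
```

So in a 24-clock window a run of WRLONGs could use every slot, while a run of RDLONGs would land on every third one.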
How about an instruction SETSLOT D/# that writes an 8-bit value to a slot-okay register, where %00000001 is the default and means "my slot". %00010001 would mean "my slot or my slot plus 4". You could set all 8 bits to take any available slot (offered in some equitable order). The LSB is the hard slot, with upward bits representing slots upward, relatively.
This would address any case, but doesn't cover the order in which slots are offered.
Could the LSB be set to 0? This would be like in my earlier option 3.
From what I've seen, RDQUAD could only use every 4th slot.
However I think consecutive RDLONGC's (to consecutive hub addresses) would effectively be 1 clock cycle each, as long as a cog got 1/4th of the hub slots.
Chip,
You are having problems fitting a single cog into the DE0.
Does making the DAC changes reduce the FPGA size enough?
If not, what block(s) (if any) can you remove to make your life easier?
ozpropdev (and others):
What could be left out for your work?
Brian, you are pushing the envelope the most here, so your input is likely the most valuable.
I am not using video atm (other than to see what others are doing, and for that I can load a new fpga design). I suspect I am in the minority here.
Wow! There are a lot of good ideas here.
It's taken me a while to get through it all and absorb it!
Ray, in the case of leaving something out in the Nano FPGA, I now use a DE2 so that wouldn't affect me.
On the subject of video I am currently experimenting with a 1 COG VGA driver that only uses 1/16 time slot (6.25% cog time @ 80 MHz)
In order to achieve this I obviously use QUAD transfer. This works well, but it creates some problems in multi-tasking code.
The first problem is that this locks out my other task from using RD/WRQUAD for its own use.
I had to do quite a bit of tuning to my other task to avoid pipeline stall causing sync issues.
The HUB is always the villain here. Giving my VID task an extra HUB slot would definitely solve this.
I can hear people saying "Just up the VID to 2 time slots and/or @200MHz this won't be a problem".
That may be so, but seeing as my other task (graphics engine, snippet handler) requires maximum performance,
pushing the limits now equates to sizzling speed in silicon!
The ideas bouncing around about HUB time slots are encouraging. I'm scribbling on my whiteboard now trying to get my head around it all.
Here's a universal solution, but perhaps it's overly verbose:
SETSLOT D/#
The eight LSBs determine which slots may be used by this cog, with bit0 being the native slot, and upward bits being the next slots, in order.
%00000001 in this field means that only the native slot will be used.
%11111111 means that any offered slot will be used.
%10101010 means that any relatively-odd slot will be used.
%00000000 just don't try to execute a hub instruction after setting this value - you'll hang.
Upward from the 8 LSBs, you have eight 3-bit fields (value %000 being you) which determine which relative cog is offered your slot, in order from bottom 3-bit field to top 3-bit field.
This would cover every possibility posited here, so far. The eight 3-bit field thing is a little awkward to set up, though, so maybe some single formula would suffice, instead.
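To make the bit layout concrete, here is a small decoder for the 32-bit value described above (8 mask bits plus eight 3-bit fields). It is a sketch under assumptions: eight cogs, relative offsets that wrap modulo 8, and helper names (`usable_slots`, `offer_fields`) that are invented for illustration.

```python
NUM_COGS = 8  # assumption: eight cogs, one hub slot each


def usable_slots(cog_id: int, setslot_value: int) -> list[int]:
    """Absolute slot numbers this cog may use, from the eight LSBs (bit 0 = own slot)."""
    mask = setslot_value & 0xFF
    return [(cog_id + i) % NUM_COGS for i in range(NUM_COGS) if (mask >> i) & 1]


def offer_fields(cog_id: int, setslot_value: int) -> list[int]:
    """Absolute cog numbers from the eight 3-bit relative fields above the mask."""
    fields = setslot_value >> 8
    return [(cog_id + ((fields >> (3 * n)) & 0b111)) % NUM_COGS
            for n in range(NUM_COGS)]


print(usable_slots(2, 0b00010001))  # [2, 6]
print(usable_slots(0, 0b10101010))  # [1, 3, 5, 7]
print(usable_slots(3, 0b00000000))  # []
```

The empty list for %00000000 is the "you'll hang" case: no slot is ever acceptable, so a hub instruction would wait forever.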
I think the SETSLOT D/# makes the most sense. Leave it to the user to determine which cogs can access which free slots.
[code]
SETSLOT D/#
where D/# is (bit 0 is this cog, 1 the next, etc)
76543210:
xxxxxxxp
where
p: 1 bit for this cog: 0 = cannot use any other cog's slot, 1 = can use another cog's slot - see xxxxxxx for priority slots
x: 1 bit per each other cog slot. Provided p=1 then, 0 = use if not required by higher priority, 1 = use if cog x slot available (high priority)
eg
px: 00 = cannot use this cog x slot (default)
01 = cannot use this cog x slot
10 = use this cog x slot if available and not required by high priority cog (low priority)
11 = use this cog x slot if available (high priority cog)
[/code]
SETSLOT D/# sets a combined hub set of registers as follows...
Each cog has a 1 bit slot register and is set by the "p" bit. This denies/enables additional time slots.
Each hub slot has a 3-bit register. If the "p" bit = 1 and the respective "x" bit = 1, this register is set to the cog# of the cog executing the SETSLOT D/# command, i.e. this allocates the respective slot's priority cog. Each SETSLOT D/# (by any cog) can overwrite this priority cog, as there will be only 1 priority cog.
For each slot that becomes available in the round-robin sequence, if the cog has a hub instruction waiting, it will be executed, and the slot increments.
If the cog does not have a hub instruction, the slot is available, and the hub will see if its priority cog has a hub instruction waiting, and if so will execute it.
Otherwise, the hub will determine if another low priority cog wants the slot, in the order of this cog++. If no cog wants it, the slot is lost.
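The three-step arbitration described above (owner, then priority cog, then low priority cogs in cog++ order) can be modelled in a few lines. This is a toy model only: the `Hub` class and its method names are hypothetical, eight cogs are assumed, and the check that a priority cog must still have p=1 is my own assumption, since the post does not say what happens when a cog later clears its p bit.

```python
NUM_COGS = 8  # assumption: eight cogs, one hub slot each


class Hub:
    """Toy model of the p/x SETSLOT scheme; all names here are hypothetical."""

    def __init__(self):
        self.may_borrow = [False] * NUM_COGS   # per-cog "p" bit
        self.priority_cog = [None] * NUM_COGS  # per-slot 3-bit register (None = unset)

    def setslot(self, cog, value):
        """Apply SETSLOT D/# for `cog`; bit 0 is p, bit i is x for slot cog+i."""
        self.may_borrow[cog] = bool(value & 1)
        if not self.may_borrow[cog]:
            return  # with p=0 the x bits are ignored
        for i in range(1, NUM_COGS):
            if (value >> i) & 1:
                # The most recent SETSLOT wins: only one priority cog per slot.
                self.priority_cog[(cog + i) % NUM_COGS] = cog

    def grant(self, slot, waiting):
        """Return the cog that gets `slot`, or None if the slot is lost."""
        if slot in waiting:                      # the owner always comes first
            return slot
        pri = self.priority_cog[slot]
        if pri is not None and self.may_borrow[pri] and pri in waiting:
            return pri
        for k in range(1, NUM_COGS):             # low priority, in cog++ order
            cog = (slot + k) % NUM_COGS
            if self.may_borrow[cog] and cog in waiting:
                return cog
        return None


hub = Hub()
hub.setslot(4, 0b00010001)   # cog 4: p=1, high priority on slot 4+4 = slot 0
hub.setslot(1, 0b00000001)   # cog 1: p=1, low priority everywhere
print(hub.grant(0, {1, 4}))  # 4
```

Here `waiting` is the set of cogs with a hub instruction pending in that clock. With cog 0 idle, cog 4 beats cog 1 to slot 0 because it is the slot's priority cog, even though cog 1 comes first in cog++ order.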
Looks good - but I wonder if instead of the eight 3 bit fields another eight bits might not suffice?
ie lowest 8 bits = slots current cog can use
next lowest 8 bits = slots of cogs that can "steal" this cogs slot.
I am wondering if using relative cog indications will be difficult to document
maybe it would be easier if COGID returned a cog bit mask instead ... ie %00000001 ... %10000000, a return of %00000000 would mean could not allocate cog
then the cog references could be an absolute bit mask, allowing easier planning?
other cog | this cog
can use   | claims
76543210  | 76543210   cog id
11111111  | 00000001   cog 1 says any cog can use its slot
Comments
Chip,
Are back-to-back hub reads and/or writes possible? Quads too?
Sounds good, and can still work on a paired basis if the designer wants to allocate by not using.
Only issue is this gives a minimum BW, so avoids starving, but the upper limit is less defined, and some code uses the HUB timing for pacing.
That could cause some hard-to-diagnose bugs and interactions?
That is OK, if the other modes can cover it? Too much BW is less of a brick wall than too little.
RDQUAD/WRQUAD can take 1 clock. WRBYTE/WRWORD/WRLONG can take 1 clock. RDBYTE/RDWORD/RDLONG can take 3 clocks.
The TQFP-144 body is 16x16mm, whereas the TQFP-128 is 14x14mm. It would be good to have a lot more 1.8V/GND pins.
How about this variation:
- Each cog gets its slot if it has a request pending
- there is a four bit 'LEND' register that holds the number of the cog to lend the hub cycle to if the cog does not need it
The encoding for LEND can be:
HNNN - h = 0 is a 'hard' lend, so Cluso's suggestion would be coded as
0 (%0100)
1 (%0101)
...
7 (%0111)
But as the requester is programmed in a register, a single cog could take all the unclaimed hub cycles, or half of them, in 1/8th increments,
and %1xxx could mean 'next requesting cog' for a round robin mode.
This allows great flexibility, and allows assigning hub bandwidth programmatically.
Proposed new instruction (probably needs to be hub oriented)
SETHUB cog,#%Hnnn
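A rough model of the LEND idea, to show how one register per cog gives both the fixed pairing and the round robin mode. Everything here is an assumption layered on the post: `resolve_slot` is a made-up name, round robin is taken to start at the cog after the slot's owner, and a hard lend to a cog that is not waiting simply loses the slot.

```python
NUM_COGS = 8  # assumption: eight cogs, one hub slot each


def resolve_slot(slot, lend, waiting):
    """Return the cog that gets `slot`, or None if nobody takes it.

    lend[c] is cog c's 4-bit %HNNN register; waiting is the set of cogs
    with a hub request pending.
    """
    if slot in waiting:                 # a cog always gets its own slot first
        return slot
    reg = lend[slot] & 0b1111
    if reg & 0b1000:                    # %1xxx: next requesting cog, round robin
        for k in range(1, NUM_COGS):
            cog = (slot + k) % NUM_COGS
            if cog in waiting:
                return cog
        return None
    target = reg & 0b0111               # %0NNN: hard lend to cog NNN only
    return target if target in waiting else None


paired = [(c + 4) % NUM_COGS for c in range(NUM_COGS)]  # each cog lends to cog+4
print(resolve_slot(0, paired, {4}))      # 4
print(resolve_slot(0, paired, {5}))      # None

greedy = [3] * NUM_COGS                  # every cog hard-lends to cog 3
print(sum(resolve_slot(s, greedy, {3}) == 3 for s in range(NUM_COGS)))  # 8
```

The last line is the "single cog could take all the unclaimed hub cycles" case: if every other cog is idle and lends to cog 3, cog 3 ends up with all eight slots.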
What about the second-hand slot offer order? Is it that critical? It could be established by each cog (8 x 3-bit fields determine which cog is offered the slot first, then the lowest 8 bits can be what we've already discussed for SETSLOT).
I posted something in #3102 that would allow specific allocation ... but I like your posting better.
Good point about the package size.
So if cog 0 executed the instructions RDQUAD/WRQUAD or WRBYTE/WRWORD/WRLONG and the current slot was free (its own cog slot or another to which it was permitted access/priority) then this instruction would execute. The next instruction could be the same and if that next slot was free, it would also execute. Same applies in a RET w/rxxx loop.
However, if RDBYTE/RDWORD/RDLONG is executed, then the setup takes 2 clocks before the read is completed. So these instructions can only utilise every 3rd real hub slot???
That's all correct.
Could the LSB be set to 0? This would be like in my earlier option 3.
C.W.
SETSLOT D/#
The eight LSBs determine which slots may be used by this cog, with bit0 being the native slot, and upward bits being the next slots, in order.
%00000001 in this field means that only the native slot will be used.
%11111111 means that any offered slot will be used.
%10101010 means that any relatively-odd slot will be used.
%00000000 just don't try to execute a hub instruction after setting this value - you'll hang.
Upward from the 8 LSBs, you have eight 3-bit fields (value %000 being you) which determine which relative cog is offered your slot, in order from bottom 3-bit field to top 3-bit field.
This would cover every possibility posited here, so far. The eight 3-bit field thing is a little awkward to set up, though, so maybe some single formula would suffice, instead.
That would be very nice!
C.W.
Improvements at the pace of a blinking eye!
And the best part of it: there are no pals, trying to kill each other, simply by reading something about changes on the round robin scheme.
Yanomani
ie lowest 8 bits = slots current cog can use
next lowest 8 bits = slots of cogs that can "steal" this cogs slot.
I am wondering if using relative cog indications will be difficult to document
maybe it would be easier if COGID returned a cog bit mask instead ... ie %00000001 ... %10000000, a return of %00000000 would mean could not allocate cog
then the cog references could be an absolute bit mask, allowing easier planning?
Yes, that's right. We could even make it so a cog could offer its own slot to another cog before it takes the slot, itself.