A 32-slot Approach (was: An interleaved hub approach)

Seairth · 2014-05-11 17:23

Todd Marshall wrote: »

Ok. I've looked there. #45 sort of summed it up for me (I hope) and all the other stuff is implementation details. Seems to answer all my questions except (4) which evidently SuperCOG0 answers.

Is this "32 slot approach" slot sharing? slot assignment? or something else? These are terms I keep seeing in my reading and they are ambiguous (COIK - Clear only if known ... but a reflection of my lack of standing).

In answer to your question

Todd Marshall wrote: »

4) Does the HUB idle through or skip unoccupied slots ... or is that optional?

The hub gives access to a cog on every cycle (whichever cog is next in the table). As the table contains 4-bit values and there are 16 cogs, every slot is always assigned a cog. Whether the cog chooses to use its slot(s), or is even running, is another matter. ReLoad changes this rule somewhat, of course. And because you asked the question in the context of the SuperCog thread: this approach (with only the table and ReLoad) does not give an unused time slot to another cog that may happen to be waiting. This is, I believe, where Cluso (or someone) proposed a second table to provide such functionality (you'll have to read the thread to get those details).
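A rough Python sketch of that single-table scan may help. It is only a behavioral toy, not anyone's actual design: the function names are made up for illustration, and the default contents (0..15 repeated twice) follow the reset values described elsewhere in the thread.

```python
# Toy model of the single-table hub scan (all names are illustrative assumptions).
NUM_COGS = 16
TABLE_SLOTS = 32

def default_table():
    """Each 4-bit entry names a cog; the assumed default repeats 0..15 twice."""
    return [slot % NUM_COGS for slot in range(TABLE_SLOTS)]

def scan(table, cycles):
    """Return the cog offered the hub on each clock cycle (circular scan)."""
    return [table[cycle % len(table)] for cycle in range(cycles)]

def slots_per_pass(table):
    """Count how many slots each cog owns in one pass of the table."""
    counts = [0] * NUM_COGS
    for cog in table:
        counts[cog] += 1
    return counts

table = default_table()
print(scan(table, 18))        # cogs 0..15, then 0 and 1 again
print(slots_per_pass(table))  # two slots per cog in the default layout
```

Because every entry is a valid 4-bit cog number, there is no way to describe an "empty" slot in this model: a slot whose cog is idle is simply not used on that pass.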

jmg · 2014-05-11 17:28

Todd Marshall wrote: »

And I can see where its (the so-called SuperCog's) behavior could be obtained with some dialect of a shared slot/assigned slot model which makes my question valid here.

Correct, there is a Table Design that can be easily configured to give SuperCogN, and when N=0, that covers this thread's narrow use case.

Todd Marshall wrote: »

Ok. I've looked there. #45 sort of summed it up for me (I hope) and all the other stuff is implementation details. Seems to answer all my questions except (4) which evidently SuperCOG0 answers.

#45 is the basic Table Allocate, with no COG signals.(no conditional decisions). Other posts describe the conditional options.

In order to allocate conditionally, some 'Want-Hub' signals from COGS are assumed in all Conditional designs.

Todd Marshall wrote: »

Is this "32 slot approach" slot sharing? slot assignment? or something else? These are terms I keep seeing in my reading and they are ambiguous (COIK - Clear only if known ... but a reflection of my lack of standing).

Different posters use different words. The hardware used can often be easier to follow.

kwinn summarised it as "a 5 bit register, a 5 bit counter, and a 32x8 look up table"

I prefer to use Allocate and Primary and Secondary, as slots cannot actually be shared, and assignment is conditional, so Allocate is a better fit.

In the most-flexible (conditional) form, the Table design assigns primary slots (using any modulus) to COGid's and then IF that Primary COG does not need that slot at that instant, the COGid is Allocated from the Secondary Slot value.

You are right, that to cover the narrow, single-use case of OP, simply Load Primary Table Slots 0..15, and Secondary Table Slots 0..0

FPGA P&R reports of some variants can be found here (these use a single Boolean to default to P1 like scanning )

http://forums.parallax.com/showthread.php/155561-A-32-slot-Approach-(was-An-interleaved-hub-approach)?p=1266305&viewfull=1#post1266305

jmg · 2014-05-11 17:38

Seairth wrote: »

This is, I believe, where Cluso (or someone) proposed a second table to provide such functionality (you'll have to read the thread to get those details).

Yes, the secondary table works like this :

Table design chooses any Modulus, and Allocates primary slots to COGid's and then IF* that Primary COG does not need that slot at that instant, then the COGid is Allocated from the Secondary Slot value.
(* needs a signal from each COG to say 'This Opcode will use HUB' )

The Allocate table is scanned in a linear, circular manner, but the COGid's read are variables, so can have any order/sequence. This solves the COG-DAC-PIN constraint issues present in the P1+.

RossH · 2014-05-11 17:42

jmg wrote: »

Correct, there is a Table Design that can be easily configured to give SuperCogN, and when N=0, that covers this thread's narrow use case

Is it, though?

The SuperCog only ever "mooches" slots, whereas my understanding is that all the table schemes have slots permanently allocated.

Or do you have a "mooch" option as well?

If not, then these solutions are in fact quite different, and one is not a special case of the other.

Ross.

RossH · 2014-05-11 17:42

RossH wrote: »

Is it, though?

The SuperCog only ever "mooches" slots, whereas my understanding is that all the table schemes have slots permanently allocated.

Or do you have a "mooch" option as well?

If not, then these solutions are in fact quite different, and one is not a special case of the other.

Ross.

EDIT: I see you have now amended your proposal to include this - more complexity!

jmg · 2014-05-11 17:49

RossH wrote: »

EDIT: I see you have now amended your proposal to include this - more complexity!

Nope, not amended, that has been there for a long time - you somehow overlooked it.

Yes, it does add a Signal from each COG, but then so does the more limited SuperCog0, so the complexity claim fails there.

RossH · 2014-05-11 18:03

jmg wrote: »

Nope, not amended, that has been there for a long time - you somehow overlooked it.

Yes, it does add a Signal from each COG, but then so does the more limited SuperCog0, so the complexity claim fails there.

Nope - it just means the complexity of your scheme is actually much worse than I originally thought!

jmg · 2014-05-11 18:19

RossH wrote: »

Nope - it just means the complexity of your scheme is actually much worse than I originally thought!

It's not really 'my' scheme, I simply coded a number of ideas in Verilog, to properly check how complex they actually are.
You see, I always prefer hard numbers to hyperbole claims.
I was more focused on speed, than size, as there is only one of these cells needed.

A few lines of verilog is certainly not complex to anyone used to doing this.

Still, we do have progress, as you admit SuperCog0 is a subset, and your level of hyperbole claim is even reducing

mark · 2014-05-11 18:31

Bask in my MSpaint skills!

To keep it a bit simple, I only included enough for 4 cogs, but it should get the point across. I really don't know much about counters, but this might in effect be a divide-by-n counter with a seedable count value.

Anyway, let's go through an example and see if the logic holds up... In the seed register we place a value which loads 1 into the first counter, 2 into the second... 4 into the fourth. In this example, we'll load the value 4 into all of the 6-bit registers. Now, since both the 4th counter and its register value are 4, the output of the 4th comparator is "true" (the yellow line coming out the bottom), which would turn on the mux circuitry in the hub for a specific cog. The line also runs to a "reset" input in the counter, so on the next clock, it'll reset to one (let's just say 1 instead of 0 for the sake of simplicity). Well, since everything is attached to the same clock, the values in each counter increase by one, so now the value in our 3rd counter is 4, which makes the output of its respective comparator "true"... and the cycle goes on and on.

Of course, that is the least "interesting" scenario, so what can we accomplish by loading different values in the seed and 6-bit registers? Let's try loading the value 3 in the top two 6-bit registers, and the value 6 in the bottom two. In the counters, starting from the top down, we'll load 1, 2, 3, 6.

This is how it looks as the counters increment (each row starts with the counter number, followed by its incrementing values; values in square brackets indicate hub access).

1:  1..2..[3]..1..2..[3]..1..2..[3]..1..2..[3]..1..2..[3]
2:  2..[3]..1..2..[3]..1..2..[3]..1..2..[3]..1..2..[3]..1
3:  3..4..5..[6]..1..2..3..4..5..[6]..1..2..3..4..5
4:  [6]..1..2..3..4..5..[6]..1..2..3..4..5..[6]..1..2

As I mentioned in my previous post, my math skills are pretty poor, so I don't know how to figure out an algorithm that would let me come up with various possible combinations, but there should be a handful.

Is this better or more efficient than using tables? I have no idea, but it's just one other possible option to add to the list. Depending on how much access we give the programmer to the seed and the registers feeding into the comparators, the possibility for Very Bad Things may very well be there. Either the circuitry should be designed to prevent the magic black smoke from escaping, or the programmer shouldn't have direct access to these, but can select a handful of different values from a rom.
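One low-effort way to explore combinations is to brute-force them in software before worrying about the circuit. The sketch below is a behavioral guess at the counter/comparator idea as described above, not a model of any real hardware: `simulate` and `has_collision` are made-up names, counters reset to 1 on the clock after a match, and it assumes every seed is no larger than its compare value.

```python
# Behavioral toy of the seeded counter + comparator scheme (illustrative only).
def simulate(seeds, limits, cycles):
    """Return, per clock, the list of counter indexes whose comparator is true."""
    counters = list(seeds)
    grants = []
    for _ in range(cycles):
        hits = [i for i, (c, lim) in enumerate(zip(counters, limits)) if c == lim]
        grants.append(hits)
        # A matching counter resets to 1 on the next clock; the others increment.
        counters = [1 if c == lim else c + 1 for c, lim in zip(counters, limits)]
    return grants

def has_collision(grants):
    """True if two or more comparators ever fire on the same clock."""
    return any(len(hits) > 1 for hits in grants)

# Second example from the post: compare values 3,3,6,6 and seeds 1,2,3,6.
good = simulate([1, 2, 3, 6], [3, 3, 6, 6], 12)
print(good[:6])             # one grant per clock, indexes are 0-based
print(has_collision(good))  # False

# A deliberately bad seed choice: counters 1 and 2 now fire together.
bad = simulate([1, 1, 3, 6], [3, 3, 6, 6], 12)
print(has_collision(bad))   # True
```

Looping `simulate` over candidate seed and compare values, and keeping only the collision-free ones, would be one way to build the "handful" of safe presets for a ROM.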

jmg · 2014-05-11 18:39

[QUOTE=mark

mark · 2014-05-11 19:13

jmg wrote: »

You still have to load all the values, so it is no gain there, over a table.

Indeed. I don't see how that's really a detriment either way. Of course, they're only loaded initially (unless you intend to dynamically change at run-time)

jmg wrote: »

However it does use many compares, and many counters, which consume more logic than simple table RAMs.

Yeah, there is quite a bit of logic, but I'd imagine that at some point, depending on how granular you want to make it, it ends up being more efficient

jmg wrote: »

The biggest issue tho, is what #61 asks, of collision/contention, as it is important only one cog is ever allowed HUB access at a time.

Collisions should never occur if you have the proper values loaded in the registers. The best way to make sure valid values are loaded into them is open to debate.

I had another thought on the design several minutes ago, and while I haven't hashed it out to know for sure if it would be viable, I think the number of counters (and therefore the size of the seed register) could be lowered. For example, if we stuck with 4 counters, but 16 comparators (or whatever the number of cogs is), we could break down the hub window timings to 4 groups, with 4 cogs in each group, by attaching 4 comparators to the output of each counter. Of course, only one comparator in each group would be in charge of sending the reset signal to its respective counter.

jmg · 2014-05-11 19:19

[QUOTE=mark

RossH · 2014-05-11 19:41

jmg wrote: »

Still, we do have progress, as you admit SuperCog0 is a subset, and your level of hyperbole claim is even reducing

No, don't mistake me - I still think all the table-based schemes would be insanely complex for users, and will seriously degrade the attraction of the Propeller if anyone is mad enough to adopt them.

There are only two schemes that I think are even remotely feasible:

1. Heater's SuperCog notion
2. Cluso's "Paired cog" notion

I support these because they maintain determinism, are simple to understand, are simple to use, and allow plug and play.

I have yet to see any other schemes that can achieve these things.

Ross.

mark · 2014-05-11 19:57

jmg wrote: »

With a RAM based table, there simply are no invalid values to avoid, and a table more closely matches what physically happens in allocating access to the HUB, right down to 1 cycle granularity.

I can't argue with that. While a 4 counter, 16 comparator system might only need 15 bytes of register memory compared to at least 32 for a LUT, the simplicity of the rest of the circuit in a table-based system likely makes up for it.
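For anyone checking those numbers, here is one way they could be arrived at. This is an assumption about the breakdown, not something spelled out in the post: 16 six-bit compare registers plus a seed register holding four six-bit values on one side, and the "32x8 look up table" quoted earlier in the thread on the other.

```python
# Storage arithmetic under the assumed breakdown (illustrative, not a spec).
REG_BITS = 6        # width of each compare register and each seed value
COMPARATORS = 16    # one compare register per cog
COUNTERS = 4        # one seed value per counter

counter_scheme_bits = COMPARATORS * REG_BITS + COUNTERS * REG_BITS
table_bits = 32 * 8  # 32 entries of 8 bits

print(counter_scheme_bits // 8)  # 15 bytes
print(table_bits // 8)           # 32 bytes
```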

jmg · 2014-05-11 20:00

RossH wrote: »

... I still think all the table-based schemes would be insanely complex for users, and will seriously degrade the attraction of the Propeller ....

Yes, I'll admit I spoke too soon - the bemusing hyperbole is certainly back!

Like many, I am less fussed about plug-and-play, than I am about being able to design what I want.
This is an Embedded Microcontroller, not a cellphone running Apps.

RossH · 2014-05-11 20:03

jmg wrote: »

Like many, I am less fussed about plug-and-play, than I am about being able to design what I want.
This is an Embedded Microcontroller, not a cellphone running Apps.

So why are you using a Propeller, if you are not interested in its key selling features?

Ross.

jmg · 2014-05-11 20:04

[QUOTE=mark

jmg · 2014-05-11 20:09

RossH wrote: »

So why are you using a Propeller, if you are not interested in its key selling features?

You seem to confuse plug and play as a key selling feature ?

Libraries are great used as coding examples, but it is very rare to take someone else's PASM code, and drop it in untouched.

Setting up a tiny table does not frighten me, and I know the FPGA overhead is very low, as P&R shows me.

So I can see what is easily possible, what surprises me more, is the aversion some have, to designer control.

RossH · 2014-05-11 20:16

jmg wrote: »

You seem to confuse plug and play as a key selling feature ?

Yes, of course it is. Coupled with determinism, it means you never have to worry about whether two objects will work together - they always will.

These properties are unique to the Propeller, and these properties are the ones that are lost with a table scheme.

Ross.

Cluso99 · 2014-05-11 20:17

The original expanded and fleshed out idea (discussed many times Ross) is a two level table scheme. Each table has 32 time slots.

Table#1
* 32 x 4bit slots
* Each slot is a hub slot, so 32 slots for the round-robin hub
* Each 4bits represents the cog# that will be offered the slot first (priority cog)
* Default (reset) Table#1 will be loaded with 0,1,2,...15,0,1,2,...15 (which means each cog will get access 1:16 slots = same as without tables)

Table#2
* Same as Table#1 (32 x 4bit slots) except:
* If the slot offered to cog# in Table#1 is not required, then it is offered to the Cog# (in the same slot) stored in Table#2. (mooch/secondary cog)
* If that slot is still not used, it is wasted.
* Default (reset) is 0,0,0,....0
* The use of Table#2 is enabled by a MOOCH instruction (default = not used)

Summary
In default/reset mode, the hub operates precisely as it does without slot-sharing and each cog gets its turn as 1 slot every 16 slots.
Table#1 can be changed to alter the order of cogs getting slots, or give more slots to one cog over another. The default is each cog gets 2 slots out of 32, separated by 16 slots.
Table#2 The default/reset is OFF, operating precisely as it does using only Table#1. When enabled, it permits 1 or more cogs to mooch by writing the values in Table#2. So, for Heater's suggestion of Cog#0 mooching, just enable Table#2 (because its default values are all 0's). For Cogs 0 & 1 to mooch alternate slots, set Table#2 to 0,1,0,1,0,1,...0,1.
This also caters for the case where you want co-operating cogs to share their slots. For example, for Cogs 1 & 9 to share with priority to cog 1, set Table#1 Slot#1=1, Slot#9=1 and Table#2 Slot#1=9, Slot#9=9: cog 1 gets offered both slots 1 & 9, and if it does not require them they are offered to cog 9. Alternatively, set Table#1 Slot#1=1, Slot#9=9 and Table#2 Slot#1=9, Slot#9=1: cog 1 gets its own slot, plus cog 9's slot if cog 9 doesn't require it, and vice versa.
In this method, any cog requiring its original 1:16 slot (for determinism etc) can still keep its own original values in Table#1 (its own cog# in its 2 slots), and it will still get its original 2:32=1:16 slots.

So here, we have the best of all worlds. It is not very complex, and the silicon cost is anyway shared between 16 cogs.
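To make the two-level lookup concrete, here is a small Python sketch of the allocation rule as summarised above. It is a behavioral toy under stated assumptions: `allocate` and `default_tables` are invented names, `wants_hub` stands in for the per-cog "this opcode will use HUB" signal, and `mooch_enabled` stands in for whatever the MOOCH instruction would switch on.

```python
# Toy model of the two-table (priority + mooch) slot allocation (illustrative).
NUM_COGS = 16
SLOTS = 32

def default_tables():
    """Reset values: Table#1 = 0..15,0..15 and Table#2 = all zeros."""
    primary = [slot % NUM_COGS for slot in range(SLOTS)]
    secondary = [0] * SLOTS
    return primary, secondary

def allocate(slot, primary, secondary, wants_hub, mooch_enabled):
    """Return the cog granted this slot, or None if the slot is wasted."""
    first = primary[slot]
    if wants_hub[first]:
        return first                 # priority cog takes its own slot
    if mooch_enabled:
        second = secondary[slot]
        if wants_hub[second]:
            return second            # secondary cog mooches the unused slot
    return None

# Cog 0 mooching: only cog 0 wants the hub, Table#2 left at its defaults.
primary, secondary = default_tables()
wants = [cog == 0 for cog in range(NUM_COGS)]
print(sum(allocate(s, primary, secondary, wants, False) == 0 for s in range(SLOTS)))
print(sum(allocate(s, primary, secondary, wants, True) == 0 for s in range(SLOTS)))

# Cogs 1 and 9 lending each other their slots 1 and 9.
secondary[1], secondary[9] = 9, 1
wants = [cog == 9 for cog in range(NUM_COGS)]
print(allocate(1, primary, secondary, wants, True))  # cog 9 picks up slot 1
```

With Table#2 disabled the first count is 2 (the normal 2:32 share); with it enabled and every other cog idle, cog 0 collects all 32 slots.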

An extension to this was proposed to set the Table Slot Size to 32 or less by writing a value to Slot Size. In this case, everything would be as above, except that the table counter would reset back to slot 0 when the count was reached. The only thing I dislike about this, is that it is likely to break existing determinism of 1:16 for all cogs. Therefore, the user has taken total control of the slot mechanism and is therefore responsible for any objects breaking down.

jmg: Could you post your Verilog code for 2 table 32 slots? I think it might help others to see how simple it is to code. (I already know it is easy although my Verilog is quite limited).

Cluso99 · 2014-05-11 20:20

RossH wrote: »

Yes, of course it is. Coupled with determinism, it means you never have to worry about whether two objects will work together - they always will.

These properties are unique to the Propeller, and these properties are the ones that are lost with a table scheme.

Ross.

Ross,
They are not lost with a table scheme. They still remain, but by being able to modify the table values, we can also do many other things such as slot-sharing between specific cogs, mooching for more than one cog, etc. What is being offered is a super-set !

Cluso99 · 2014-05-11 20:23

jmg wrote: »

Libraries are great used as coding examples, but it is very rare to take someone else's PASM code, and drop it in untouched.

With the P1, I have to disagree with this point. I have used many objects from OBEX and elsewhere, totally untouched. I have even written some.

But it doesn't lessen the advantages to have tables.

potatohead · 2014-05-11 20:24

Seconded Cluso. Well stated, sans tables.

Now, we know how to write them such that not touching them is a realistic goal. I particularly like Dave Hein's method of putting storage right at the beginning of the COG, so that a mailbox address is the COGSTART address (it's a JMP, until replaced at runtime), the old, early, easy and somewhat painful "poke it in, then run it" method still works, and the whole thing can be packaged up and called from C with a simple struct definition aligned with the variables.

And yes, this is a big deal, particularly when combined with some mention of having defined objects in the IDE... Pick 'em, run 'em, with only some minor code needed to communicate with them.

Sweet!

RossH · 2014-05-11 20:29

Cluso99 wrote: »

Ross,
They are not lost with a table scheme. They still remain, but by being able to modify the table values, we can also do many other things such as slot-sharing between specific cogs, mooching for more than one cog, etc. What is being offered is a super-set !

You are proposing a scheme where users can no longer load any object into any cog. Madness.

Ross.

jmg · 2014-05-11 20:31

Cluso99 wrote: »

jmg: Could you post your Verilog code for 2 table 32 slots? I think it might help others to see how simple it is to code. (I already know it is easy although my Verilog is quite limited).

Sure, it really is this simple :

First, there is the 32x or 64x table scanner ReLoad counter

// 6 Bit Table scan, with ReLoad, 'Default', 'RST' Control bits.
always @(posedge CLK)             // Modulus 6b Table Scan
begin
    if (RST) begin
        SC <= 0;
    end else if (SC == 6'b000000) begin
        if (Default)
            SC <= 6'd15;          // 16:1 scan
        else
            SC <= ReLoad;         // Modulus ReLoad 6b User Set Value.
    end else begin
        SC <= SC - 1;
    end
end

then, the MUX to choose table entry, (and most of this is Config Booleans for PairEn, Default choices.)

always @(posedge CLK) begin       // sync output
    F_NeedsHUB <= (OPCisHUB[MapCog]);  // THIS opcode will use HUB (adds one more pipeline delay)
end

always @(*) begin                 // MUX output (357.910MHz, with Block RAM and distributed RAM: 6 (12 LUT4s))
    if (Default) begin
        bAllocCOG = SC[3:0];              // force just 1:16 scanner
        ramSC = SC;                       // Not used, but map as below, helps Logic reduction
    end else begin
        bAllocCOG = MapCog;               // 64x index MUX choice
        if (PairEn) begin                 // dual 32x
            if (F_NeedsHUB) begin         // If F needs, it wins, (SC is ReLoad limited to < 32 in this case.)
                ramSC = {1'b0, SC[4:0]};  // Lower Half 32x
            end else begin
                ramSC = {1'b1, SC[4:0]};  // Upper Half 32x
            end
        end else begin                    // Single 64x
            ramSC = SC;                   // No addr MUX 64x
        end
    end
end

To avoid needing ANY table pre-loading in default cases, I used config booleans, and the above includes option for 64x or 32x/32x or Default.

Table RAM is dual port, so update on the fly is legal, and scan continues.

The FPGA RAM blocks code faster, and smaller than LUT4 logic, but do impose a small restriction on how those can be written.
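For readers who do not speak Verilog, the two blocks can be mirrored in a few lines of Python. This is only a reading aid written from the snippets above, with assumed function names; it models the SC down-counter and the table address selection (ramSC), and leaves out the RAM itself and the registered F_NeedsHUB delay.

```python
# Python mirror of the posted scan counter and address MUX (reading aid only).
def next_sc(sc, rst, default, reload_value):
    """One clock of the 6-bit down-counter SC."""
    if rst:
        return 0
    if sc == 0:
        return 15 if default else (reload_value & 0x3F)
    return sc - 1

def ram_address(sc, pair_en, f_needs_hub):
    """Table RAM address (ramSC) in the non-Default branch."""
    if pair_en:
        low5 = sc & 0x1F
        # Lower half of the RAM when the flag is set, upper half otherwise.
        return low5 if f_needs_hub else (0x20 | low5)
    return sc & 0x3F                 # single 64-entry table

def scan_sequence(reload_value, default, cycles):
    """SC values seen on successive clocks, starting from the reset value 0."""
    sc = 0
    seq = []
    for _ in range(cycles):
        seq.append(sc)
        sc = next_sc(sc, False, default, reload_value)
    return seq

print(scan_sequence(9, False, 12))             # [0, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 9]
print(len(set(scan_sequence(9, False, 40))))   # 10 distinct values: ReLoad + 1
print(ram_address(0b000101, True, False))      # 37, i.e. entry 5 of the upper half
```

One thing this makes visible is that the scan length is ReLoad + 1 and that the counter runs downward, so table entries are visited from the highest index to 0.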

jmg · 2014-05-11 20:45

RossH wrote: »

You are proposing a scheme where users can no longer load any object into any cog. Madness.

This hyperbole just gets simply sillier.
Nowhere in the verilog snippets I posted, does it know, or care, what object a user loads. The user has total control.

potatohead · 2014-05-11 20:55

No, it's valid.

Sure, the user has total control. Of course, having cog code requiring a lot of hub access means requiring other cog code to work with less than nominal hub access. So the user has the control to load them, and the control needed to go and fix them when the table allocation doesn't mesh with the cog code requirements when combined.

And code written for a specific hub allocation may perform very differently on a different allocation scheme, due to how that normally gets done in PASM. The amount of instructions and the order of tasks completed gets optimized for the hub access period. Change that period and the cog code isn't going to play along a lot of the time.
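A toy calculation shows the effect. The model below is deliberately crude and all of it is assumption: a loop does exactly one hub access per pass, then `work_cycles` of other instructions, and the next hub access stalls until the cog's next slot comes around. Real hub instruction timing is ignored; `iteration_lengths` is an invented helper.

```python
# Toy model: how long one loop pass takes under a given slot table (illustrative).
def iteration_lengths(table, cog, work_cycles, iterations):
    """Cycles per loop pass for a cog doing one hub access per pass."""
    own = [i for i, c in enumerate(table) if c == cog]
    if not own:
        raise ValueError("cog has no slots in this table")
    n = len(table)
    t = own[0]                        # first hub access lands on the first own slot
    lengths = []
    for _ in range(iterations):
        nxt = t + work_cycles         # earliest clock the next hub access is ready
        while table[nxt % n] != cog:  # stall until this cog's next slot
            nxt += 1
        lengths.append(nxt - t)
        t = nxt
    return lengths

default = [slot % 16 for slot in range(32)]

# Same number of slots for cog 3, but bunched together at slots 3 and 7.
bunched = list(default)
bunched[7], bunched[19] = 3, 7

print(iteration_lengths(default, 3, 10, 3))  # [16, 16, 16]
print(iteration_lengths(bunched, 3, 10, 3))  # [32, 32, 32]
```

Cog 3 still owns 2 of 32 slots in both tables, yet a loop tuned for the evenly spaced 1:16 window runs at half the rate when its slots are bunched, which is the kind of surprise being described here.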

Meaning, users will be in the more difficult portion of PASM attempting to make it work on a different scheme. And they will be there a lot too, where they only go there for specific reasons on P1.

Or, they could write their own.

Or, we might have objects that are written lots of different ways to match more schemes. (I'm not doing that, BTW)

So yes, you are technically right. They can load into any cog, just as you say. The question is whether or not doing that results in something that works.

Notice Kurenko's WHOP scheme? That is a great way to get more out of a video driver, and it takes advantage of how the waitvid works from the D&S buses in the P1. Those solutions always work on a P1 at 80MHz no matter what they are combined with. And those problems are solved for all time too. Once done, always done, drop it in, go.

We no longer get to say that with a scheme like this, but we do get to say we have a lot of control, don't we?

Cluso99 · 2014-05-11 21:11

I just had a thought about the instructions needed to write to these tables...

We have some hub ROM from approx. $0-EFF. By utilising a few addresses, a WRBYTE to say hub $0 would store 2 nibbles. So we don't require any special cog instructions. Note we cannot read but I don't think this is a problem.

Perhaps this method could even be used for some of the other hub instructions???

jmg · 2014-05-11 21:21

Cluso99 wrote: »

We have some hub ROM from approx. $0-EFF. By utilising a few addresses, a WRBYTE to say hub $0 would store 2 nibbles. So we don't require any special cog instructions. Note we cannot read but I don't think this is a problem.

Perhaps this method could even be used for some of the other hub instructions???

Sounds valid, but if you are decoding an address, it could be any address ? Read would be useful for Debug.
Present code is for nibble RAM, but BYTE (paired Primary-Secondary) is a minor change.
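As a sketch of what a byte-wide write could carry, here is one possible packing of a slot's Primary and Secondary cog numbers into a single byte. The nibble order (Primary in the low nibble) and the helper names are purely assumptions for illustration; nothing in the thread fixes them.

```python
# Hypothetical packing of one slot's Primary/Secondary cog ids into a byte.
def pack_slot(primary_cog, secondary_cog):
    """Return a byte with primary in bits 3..0 and secondary in bits 7..4."""
    if not (0 <= primary_cog < 16 and 0 <= secondary_cog < 16):
        raise ValueError("cog numbers must fit in 4 bits")
    return (secondary_cog << 4) | primary_cog

def unpack_slot(byte):
    """Return (primary_cog, secondary_cog) from a packed byte."""
    return byte & 0x0F, (byte >> 4) & 0x0F

packed = pack_slot(1, 9)
print(hex(packed))          # 0x91
print(unpack_slot(packed))  # (1, 9)
```

With a layout like this, one WRBYTE per slot would update both tables for that slot at once.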

RossH · 2014-05-11 22:08

jmg wrote: »

This hyperbole just gets simply sillier.
Nowhere in the verilog snippets I posted, does it know, or care, what object a user loads. The user has total control.

Your user now needs to know what the cog table looks like, to figure out if his object will run in the cog he intends to load it in.

Are you proposing to add any automated warnings if it will not, or just let them find out for themselves?

Ross.

EDIT: I see potatohead has raised the same issue. I agree with his post.
