Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i

jmg · 2015-10-28 20:55

mindrobots wrote: »

Instead <snip>You have a system ClockTick register (CT) and the three user Clock Tick registers CT1, CT2 and CT3 and your instructions are all nice again:

GETCT (get clock ticks)
ADDCT1, ADDCT2, ADDCT3 - add S/# to D (and store result in CT1/2/3)
WAITCT1, WAITCT2, WAITCT3
POLLCT1, POLLCT2, POLLCT3

CNT just ceases to be.

That works too...
ClockTick register (CT) is easy to follow, and the primary opcode action is not the 'squashed' word.

cgracey · 2015-10-28 20:55

evanh wrote: »

cgracey wrote: »

Everything but hub and WAITxxx instructions are two clocks, if I'm not missing anything.

Cool. Weren't all the branch instructions four clocks? I must have missed a change somewhere.

You're right. My mind is a steel sieve.

Seairth · 2015-10-28 21:36

Out of curiosity, when you call one of the event-configuration SETxxx instructions, does that automatically clear the pending event flag?

Put another way:

SETEDG #cfg1
' some time passes and the edge event flag is set
SETEDG #cfg2
POLLEDG

In the above code, is POLLEDG necessary to clear the pending event flag from "cfg1", or does the second SETEDG do that automatically?

Edit: never mind. I just read Chip's documentation, which clearly states that it is cleared by the SETxxx instruction.
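A minimal Python sketch of the behaviour described above, assuming one pending flag per event. The class and method names are hypothetical stand-ins for the SETEDG/POLLEDG instructions, and the "poll reports the flag and then clears it" behaviour is an assumption implied by the question rather than something quoted from the documentation.

```python
class EdgeEventModel:
    """Toy model of one event: a configuration plus a pending flag."""

    def __init__(self):
        self.config = None
        self.pending = False

    def setedg(self, config):
        # Per the documentation cited in the post: reconfiguring the
        # event clears any pending flag.
        self.config = config
        self.pending = False

    def edge_detected(self):
        # Stand-in for the hardware noticing the configured edge.
        self.pending = True

    def polledg(self):
        # Assumption: polling reports the flag, then clears it.
        was_pending = self.pending
        self.pending = False
        return was_pending


ev = EdgeEventModel()
ev.setedg("cfg1")
ev.edge_detected()      # some time passes and the edge event flag is set
ev.setedg("cfg2")       # second SETEDG clears the flag left over from cfg1
stale = ev.polledg()    # nothing pending, so no extra POLLEDG was needed
```

In this model `stale` is `False`, which matches the conclusion in the edit: no extra POLLEDG is needed after reconfiguring.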

cgracey · 2015-10-28 23:03

jmg wrote: »

mindrobots wrote: »

Instead <snip>You have a system ClockTick register (CT) and the three user Clock Tick registers CT1, CT2 and CT3 and your instructions are all nice again:

GETCT (get clock ticks)
ADDCT1, ADDCT2, ADDCT3 - add S/# to D (and store result in CT1/2/3)
WAITCT1, WAITCT2, WAITCT3
POLLCT1, POLLCT2, POLLCT3

CNT just ceases to be.

That works too...
ClockTick register (CT) is easy to follow, and the primary opcode action is not the 'squashed' word.

This smells the best!

I will make this change in the next release. GETCNT will become GETCT.
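A rough Python model of the proposed naming, for illustration only. It covers CT and a single user register CT1. The class, the 32-bit wraparound, the equality compare, and the choice to clear the event when CT1 is re-armed are all assumptions, not details confirmed in this thread.

```python
MASK32 = 0xFFFFFFFF  # assumption: the tick counter wraps at 32 bits


class CogTimerModel:
    """Toy model of the system CT register and one user register, CT1."""

    def __init__(self):
        self.ct = 0          # free-running system clock-tick counter
        self.ct1 = 0         # user compare register
        self.event1 = False  # pending CT1 event

    def tick(self, n=1):
        for _ in range(n):
            self.ct = (self.ct + 1) & MASK32
            if self.ct == self.ct1:   # assumption: equality match
                self.event1 = True

    def getct(self):
        # GETCT: get clock ticks
        return self.ct

    def addct1(self, d, s):
        # ADDCT1: add S/# to D and store the result in CT1
        result = (d + s) & MASK32
        self.ct1 = result
        self.event1 = False   # assumption: re-arming clears the event
        return result

    def pollct1(self):
        fired = self.event1
        self.event1 = False
        return fired

    def waitct1(self):
        # WAITCT1: block until the CT1 event fires; returns ticks waited
        waited = 0
        while not self.event1:
            self.tick()
            waited += 1
        self.event1 = False
        return waited


cog = CogTimerModel()
cog.tick(5)
t = cog.getct()             # current tick count
t = cog.addct1(t, 100)      # CT1 = CT + 100
elapsed = cog.waitct1()     # waits until CT reaches CT1
```

In this model `elapsed` comes out as 100 ticks, which is the usual fixed-period pattern that GETCT / ADDCT1 / WAITCT1 would express.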

cgracey · 2015-10-28 23:05

I just finished documenting the events (POLLxxx/WAITxxx) and interrupts. It's in the Google Doc linked at the top of this thread.

cgracey · 2015-10-28 23:34

I set up the Google Doc so that you guys can comment.

Yanomani · 2015-10-29 02:10

Hi Chip

Has something changed in the interim, or does RDLUT remain the only instruction that takes three clock cycles to execute?

And if this is true, by latching (or extending) the destination's D address and the data value grabbed from LUT RAM for one more cycle, couldn't you use the unused WR slot in the next instruction's GET phase to move the data to its final destination?

Henrique

cgracey wrote: »

evanh wrote: »

cgracey wrote: »

Everything but hub and WAITxxx instructions are two clocks, if I'm not missing anything.

Cool. Weren't all the branch instructions four clocks? I must have missed a change somewhere.

You're right. My mind is a steel sieve.

cgracey · 2015-10-29 03:28

Yanomani wrote: »

Hi Chip

Has something changed in the interim, or does RDLUT remain the only instruction that takes three clock cycles to execute?

And if this is true, by latching (or extending) the destination's D address and the data value grabbed from LUT RAM for one more cycle, couldn't you use the unused WR slot in the next instruction's GET phase to move the data to its final destination?

Henrique

cgracey wrote: »

evanh wrote: »

cgracey wrote: »

Everything but hub and WAITxxx instructions are two clocks, if I'm not missing anything.

Cool. Weren't all the branch instructions four clocks? I must have missed a change somewhere.

You're right. My mind is a steel sieve.

Both cog RAM access slots are used at the end of RDLUT, but even if there was one free, the data forwarding circuitry still needs to have the result in case the next instruction were to read the location that received the RDLUT value.

I forgot that RDLUT takes three clocks.

Yanomani · 2015-10-29 04:25

Thanks Chip

Sorry, I forgot that the data forwarding circuit needs to have it ready in case the next instruction uses it.

Either way, the ability to span linearly from two to four cycles per resolved instruction makes it useful for timing and syncing purposes (at the cost of using some available scratchpad dummy destination address).

As a side note, it'll also ease the coding for two threads (or cogs) to do cooperative work on an odd/even short cycle count basis, e.g., early/late data sampling.

Rayman · 2015-10-29 17:47

RDLUT takes 3 clocks but WRLUT only takes 2? Wish it was the reverse (at least, I think so now before actually using LUT)...

LUT exec does run at regular speed though, I assume.

cgracey · 2015-10-29 23:45

Rayman wrote: »

RDLUT takes 3 clocks but WRLUT only takes 2? Wish it was the reverse (at least, I think so now before actually using LUT)...

LUT exec does run at regular speed though, I assume.

LUT exec runs at full speed, yes.

The reason RDLUT takes three clocks:

1) issue LUT read
2) capture LUT data
3) done, mux data through result mux, fan out to all result captures

The reason WRLUT takes two clocks:

1) issue LUT write
2) done
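The two stage lists above can be turned into a back-of-envelope clock count. This is a small Python sketch; the helper names are hypothetical, and it ignores loop overhead and every other instruction in the program.

```python
# Stage lists as given in the post, one entry per clock.
RDLUT_STAGES = (
    "issue LUT read",
    "capture LUT data",
    "mux data through result mux, fan out to all result captures",
)
WRLUT_STAGES = (
    "issue LUT write",
    "done",
)


def clocks_for(stages):
    """One clock per listed stage."""
    return len(stages)


def lut_traffic_clocks(reads, writes):
    """Total clocks spent in RDLUT/WRLUT alone (hypothetical helper)."""
    return reads * clocks_for(RDLUT_STAGES) + writes * clocks_for(WRLUT_STAGES)


total = lut_traffic_clocks(16, 16)   # 16 * 3 + 16 * 2 = 80 clocks
```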

cgracey · 2015-10-29 23:51

NEW ZIP FILE POSTED AT TOP OF THREAD!!

The hub exec FIFO-level bug was fixed and GETCNT was renamed to GETCT.

mindrobots · 2015-10-30 00:49

cgracey wrote: »

I just finished documenting the events (POLLxxx/WAITxxx) and interrupts. It's in the Google Doc linked at the top of this thread.

This section was a big help!! (ok, it certainly gave me plenty to play with this afternoon....I don't think anybody at work noticed!)

EDIT: Only 10 COGS on the 1-2-3! What's this world coming to!!!

Time for those A9 boards!!

Rayman · 2015-10-30 13:56

Just scanned the documentation...

Saw this typo:

$1F8        PTRA            pointer B to hub RAM
$1F9        PTRB            pointer A to hub RAM

Also, I think I'd prefer "SETCNT1" to "ADDCNT1".
"add" seems to me to imply adding something to CNT1 and I didn't see any way to set its initial value.
But, after I read it, I see that you are setting CNT1 to sum of S and D.
So, you are doing an "ADD", but to me, the "SET" is the important part there...

Question:
Do all cogs toggle output pins at the exact same time still? I remember a long time ago that this was one of the initial features of P2-Hot. Is it still in there?

cgracey · 2015-10-30 16:09

Rayman wrote: »
Just scanned the documentation...

Saw this typo:
$1F8        PTRA            pointer B to hub RAM
$1F9        PTRB            pointer A to hub RAM
Also, I think I'd prefer "SETCNT1" to "ADDCNT1".
"add" seems to me to imply adding something to CNT1 and I didn't see any way to set its initial value.
But, after I read it, I see that you are setting CNT1 to sum of S and D.
So, you are doing an "ADD", but to me, the "SET" is the important part there...

Question:
Do all cogs toggle output pins at the exact same time still? I remember a long time ago that this was one of the initial features of P2-Hot. Is it still in there?

I'll fix that typo. Thanks.

Because instructions take two cycles (at least), cogs will not always update pins at the same time, unless you coordinated them to do so.

Rayman · 2015-10-30 17:40

I guess I'm asking about the coordinated case...

I think I recall that in P1 there are different propagation delays between different cogs and I/O pins...

Think I remember that was fixed for P2 at one point...

cgracey · 2015-10-30 21:44

Rayman wrote: »

I guess I'm asking about the coordinated case...

I think I recall that in P1 there are different propagation delays between different cogs and I/O pins...

Think I remember that was fixed for P2 at one point...

On the P1, those are sub-clock differences that come from propagation delays through the OR gates.

On the P2, there are flops after the OR gates to align everything.

ozpropdev · 2015-11-01 06:27

Chip
The "Prop123_A7_Prop2_v3.rbf" still has 11 working cogs not 10.

cgracey · 2015-11-01 07:09

ozpropdev wrote: »

Chip
The "Prop123_A7_Prop2_v3.rbf" still has 11 working cogs not 10.

That sounds right. It's the DE2-115 that has 10.

ozpropdev · 2015-11-01 07:13

Your comment in the top post says otherwise.

* The Prop123-A7 board now has 10 cogs, not 11.

Peter Jakacki · 2015-11-01 07:22

cgracey wrote: »

Rayman wrote: »

I guess I'm asking about the coordinated case...

I think I recall that in P1 there are different propagation delays between different cogs and I/O pins...

Think I remember that was fixed for P2 at one point...

On the P1, those are sub-clock differences that come from propagation delays through the OR gates.

On the P2, there are flops after the OR gates to align everything.

I know we are waiting on smart pins, but the smart thing to do when combining the outputs would be to AND rather than OR. That is, any low would produce a low. Many signals are active low, so multiple cogs could then leave the pin high and another could take it low. Same for serial transmit too.

How about having AND instead of OR?

evanh · 2015-11-01 08:03

What!? Peter, you're asking for something, on a chip wide basis, that is just trading one problem for another. You're not getting anything improved by doing it.

And systems like that use negative logic or default to a high-output, both are counter-intuitive.

Peter Jakacki · 2015-11-01 08:14

You have two cogs calling the same routines at different times to activate a chip select. That chip select is active low but the OR outputs mean that the line needs to idle high but that blocks the other cog. Two or more cogs call a serial transmit, say for debugging, and this is of course "active low" as it needs to idle high but the transmit is blocked because it is OR'ed. There is nothing counter-intuitive about this, if only one cog wants to set an output it works the same since the inputs of the "AND" (or negated input NOR) that aren't being driven would be "float high" just as the current OR is "float low" if you want to think of it that way.

I've had plenty of experience with the Prop and countless hardware designs and this one thing would have made such a difference to how I can have multiple cogs share I/O. It's a very important point and surely anyone who has had much to do with Prop design must know this.
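The shared chip-select case from this post can be sketched in a few lines of Python. The function names are hypothetical, and the model only covers the combining logic for one pin, with one output bit per cog.

```python
def pin_level_or(cog_outputs):
    """Current scheme: cog outputs are ORed; an undriven cog contributes 0."""
    return int(any(cog_outputs))


def pin_level_and(cog_outputs):
    """Proposed scheme: outputs are ANDed; an undriven cog contributes 1."""
    return int(all(cog_outputs))


# Cog A idles the active-low chip select high (1); cog B tries to pull it low (0).
cog_a, cog_b = 1, 0

or_result = pin_level_or([cog_a, cog_b])    # stays 1: cog B is blocked
and_result = pin_level_and([cog_a, cog_b])  # goes 0: any low wins
```

With a single cog driving the pin, both functions return that cog's bit unchanged, which is the "works the same" case described in the post.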

jmg · 2015-11-01 08:35

Peter Jakacki wrote: »

You have two cogs calling the same routines at different times to activate a chip select. That chip select is active low but the OR outputs mean that the line needs to idle high but that blocks the other cog. Two or more cogs call a serial transmit, say for debugging, and this is of course "active low" as it needs to idle high but the transmit is blocked because it is OR'ed.

Yes, other MCUs have pullups and Port modes, and open-Drain designs that give a natural OR.
Chip-Select, default async serial, and I2C are all active-low in action.
Most INT signals from Peripherals are Active-Low.

CPLDs have one method to support both 'Default H and Default L' mixed in a design, and that is a Pin-Polarity fuse.

evanh · 2015-11-01 09:26

I/O inversion would be more suitable as that then allows the internals to hold positive logic. UARTs use inverters, I2C is open-collector, pull-downs also another option, the list goes on ... None of that has anything to do with the Cog combining logic. Messing with the OR structure will just be annoying.

I suspect there are plenty of I/O pin config options there right now in the custom layout of the pin structures, but we are still waiting on Smartpins to provide access to that config, and of course on the real silicon. The FPGA doesn't have the same arrangement, so using/testing that functionality will be limited at best.
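A small Python sketch of the I/O inversion idea, assuming a per-pin invert option sits after the existing OR structure. The `invert` parameter and the function name are hypothetical; nothing in this thread confirms how the real pin config exposes it.

```python
def pin_with_inversion(cog_asserts, invert=True):
    """OR the cogs' positive-logic 'assert' bits, then optionally invert at the pin."""
    combined = int(any(cog_asserts))   # unchanged cog OR structure
    return combined ^ int(invert)      # hypothetical per-pin polarity option


idle_level = pin_with_inversion([0, 0])     # no cog asserting: pin idles high (1)
active_level = pin_with_inversion([0, 1])   # one cog asserts: pin goes low (0)
```

Each cog keeps writing 1 to mean "active", so several cogs can share an active-low line without changing the combining logic.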

evanh · 2015-11-01 09:44

I probably should have started by saying the pins have far more config in the Prop2 compared to, well, just direction control only in the Prop1.

Speaking of which, I wonder how much need there is for having the direction control inside every Cog now? We could have a PortC in those addresses.

Seairth · 2015-11-01 12:45

evanh wrote: »

Speaking of which, I wonder how much need there is for having the direction control inside every Cog now? We could have a PortC in those addresses.

The direction control in every cog is necessary for fast mode changes, such as with fast I2C, one-wire interfaces, and capacitive "button" tricks.

However, this got me thinking: why can't it be implied by writing to OUTx or reading from INx? Read-after-write might require a couple of extra clock cycles to let the input settle before capture, but this is no worse than the time taken to set DIRx. Write-after-read should be two clock cycles faster than the current approach.

This would free up two registers, though I don't know if we would be able to get an internal PortC in return.

Edit: oh, wait. That won't work. There'd be no way to indicate which bits in the read or write are significant. Hence the need for DIRx.
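The point in the edit above can be illustrated with a short Python sketch, assuming a simple model where a pin is driven only when its DIR bit is set. The function name and the 8-bit width are hypothetical.

```python
def pin_states(out_bits, dir_bits, width=8):
    """Return a string, MSB first: '0'/'1' for driven pins, 'z' for undriven."""
    states = []
    for i in range(width):
        if (dir_bits >> i) & 1:
            states.append(str((out_bits >> i) & 1))
        else:
            states.append("z")
    return "".join(reversed(states))


# Pin 1 is deliberately driven low; pins 3..7 are simply not in use.
# Both look like 0 in the OUT value, so only DIR can tell them apart.
states = pin_states(out_bits=0b00000101, dir_bits=0b00000111)
```

Here `states` is `"zzzzz101"`: the zero on pin 1 is a real driven level, while the zeros on pins 3 to 7 mean nothing without the DIR mask.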

Rayman · 2015-11-01 18:23

Just updated P123 to new version.

Wondering if it would be nice to have aliases:
addct1 = addcnt
waitct1 = waitcnt

Then those who don't use interrupts can keep syntax simple...

Seairth · 2015-11-01 18:57

Rayman wrote: »

Just updated P123 to new version.

Wondering if it would be nice to have aliases:
addct1 = addcnt
waitct1 = waitcnt

Then those who don't use interrupts can keep syntax simple...

I don't see how that is simpler.

David Betz · 2015-11-01 19:06

Should this chip be called something other than "Propeller 2"? It seems like it is diverging from P1 in many ways and claiming they are the same architecture may be misleading.
