Propeller II update - BLOG

ozpropdev · 2014-02-22 19:05

jmg wrote: »

So exit is managed elsewhere ?
In that case, maybe there is coverage IJ gives, that REPx does not
Can REPx early-exit via SW ?

From the Prop Docs

The instruction-block repeater will quit repeating the block if a branch instruction executes
within the block, or if a JMPTASK instruction affects the task which is using the repeater.

cgracey · 2014-02-22 19:19

IJZ/IJZD/IJNZ/IJNZD are gone now. I shuffled a few things around and put the COGRUN/COGRUNX instructions into D/#,S/# form, so that the cog can be indicated by either a variable or a constant. Because the new cog-starting instructions pack two 16-bit addresses into the 32-bit D, the cog number no longer needs to be embedded in the instruction to create an atomic operation. This simplified decoding somewhat, too.

There's room for an 'INCJMP D,S/@' instruction. This is a few gates to add. I'll try to get it in tonight. What about a 'DECJMP D,S/@' instruction? Is that useful? I think these instructions are only useful when preceded by a condition.

cgracey · 2014-02-22 19:21

jmg wrote: »

I forget, did REPx get a counter per thread in the end ?

Yes.

cgracey · 2014-02-22 19:25

Wouldn't IJNZ (increment and jump if not 0) practically be the same as INCJMP (or IJ)? At 32 bits, it's unlikely that IJNZ would ever not branch.

This idea of modifying D and always jumping is interesting, though. Are there any other useful possibilities? How about SHRJNZ?

ozpropdev · 2014-02-22 19:33

cgracey wrote: »

Wouldn't IJNZ (increment and jump if not 0) practically be the same as INCJMP (or IJ)? At 32 bits, it's unlikely that IJNZ would ever not branch.

This idea of modifying D and always jumping is interesting, though. Are there any other useful possibilities? How about SHRJNZ?

XXXJMP may have combinations that would benefit high speed USB drivers? (Cluso?).

jmg · 2014-02-22 19:34

cgracey wrote: »

Wouldn't IJNZ (increment and jump if not 0) practically be the same as INCJMP (or IJ)? At 32 bits, it's unlikely that IJNZ would ever not branch.

True, but it would make for rather confusing source code - an IJNZ that was designed to never reach 0

cgracey · 2014-02-22 19:48

jmg wrote: »

True, but it would make for rather confusing source code - an IJNZ that was designed to never reach 0

I was forgetting that I just got rid of the IJZ/IJZD/IJNZ/IJNZD instructions. So, they can't help Tubular anymore, anyway.

mindrobots · 2014-02-22 20:06

cgracey wrote: »

Are there any other useful possibilities? How about SHRJNZ?

Acting on flag bits?

Now, that takes 2 instructions? Shift through carry and then a conditional jump?

jmg · 2014-02-22 20:07

cgracey wrote: »

This idea of modifying D and always jumping is interesting, though. Are there any other useful possibilities? How about SHRJNZ?

Shift-and-exit opcodes would help software serial cases where the SerDes HW is not quite right.
A duplex form would be handy, but I'm not quite seeing a form that manages duplex yet.

Cluso99 · 2014-02-23 00:14

While I was excited to see the IJxx instructions originally, I cannot actually see any great use for them that cannot be achieved in other ways. The DJxx ones are used often because you set a count and count down.

There were a couple of instructions that would help FS USB but, from memory, they are listed together with the single-bit CRCs.
One was a shift that also sets the condition codes. I need to look for it.

I am going to have a think about the instructions for hubexec mode to launch a cog with/without reset (cognew/cogrun) and perform a fast full load using wides into cog $0-$1EF. I am quite happy to lose the 2 registers $1F0-1F1 that would not get loaded with this method.

Tubular · 2014-02-23 02:54

OzPropDev yes you're correct, "IJ" would just increment a variable at the same time as it always jumps.

Where it's useful is where you're trying to "profile" some tight code and hit sweet spots for hub transfers / avoid stalls etc. It's an easy way to measure how many events per second you're achieving, without adding another "inc" instruction that might or might not modify the timing of the tight loop. These loops may be tasks that don't exit (eg a serial RX task). Yes, DJNZ can be used to achieve the same thing, for profiling purposes, just negate it prior to presentation, but counting up is more intuitive for some things.

Another use is where you are polling a multi dimensional array of peripherals. Say you have 4 very fast ADCs, each has 8 channels of 16 bit data. If the counter increment is automated as part of the loop, you can just extract the 4 LSBs for bit number, next 3 bits as channel number, next 2 bits as ADC number. For this you don't want to skip a bit every 2^32-1 times, you just want it to increment regardless. Yes, there are other ways to do this and timing is generally not as critical, so adding an INC is not a big deal.

Whether there are other useful forms is a good question. It would be interesting to look at the most common P1 instructions executed immediately prior to a jump, statistically...
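
To make the array-polling example above concrete, here is a rough Python model of the field extraction from a free-running loop counter. The function names are purely illustrative (they are not part of any Propeller tool), and the bit layout is just the 4 ADCs x 8 channels x 16 bits case described in the post. The second helper models the "negate it prior to presentation" step for DJNZ-based profiling, assuming the register started at 0.

```python
def decode_poll_counter(counter):
    """Split a free-running loop counter into (adc, channel, bit) fields.

    Assumed layout, following the example in the post:
      bits 3..0 -> bit number     (16 bits of data)
      bits 6..4 -> channel number (8 channels)
      bits 8..7 -> ADC number     (4 ADCs)
    Higher bits are ignored, so the counter can simply keep incrementing.
    """
    counter &= 0xFFFFFFFF              # model a 32-bit register
    bit = counter & 0xF
    channel = (counter >> 4) & 0x7
    adc = (counter >> 7) & 0x3
    return adc, channel, bit


def profile_count_from_djnz(register_value):
    """Recover an event count from a 32-bit register that started at 0
    and was decremented once per event: negate it modulo 2**32."""
    return (-register_value) & 0xFFFFFFFF
```

Because the upper bits are masked off, the decoded fields simply wrap around every 512 counts, which is why an always-incrementing counter is all this scheme needs.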

rogloh · 2014-02-23 16:11

Tubular wrote: »

OzPropDev yes you're correct, "IJ" would just increment a variable at the same time as it always jumps.

Where it's useful is where you're trying to "profile" some tight code and hit sweet spots for hub transfers / avoid stalls etc. It's an easy way to measure how many events per second you're achieving, without adding another "inc" instruction that might or might not modify the timing of the tight loop.

That alone sounds really useful to me.

dMajo · 2014-02-23 16:53

dMajo wrote: »

With hubexec being implemented, perhaps we can have a reserved hub address for cog start (like we have for the clock frequency), so that every cog starts in hubexec mode from that address. The address will hold a jump to the next instruction. The COGNEW opcode can write the jump to this address and then start the next cog. In this way the first jump executed can go either somewhere in the hub, to then execute the next instruction, or into the cog's register space, thus switching to cogexec.
If the cog needs to be loaded, the first jump will go somewhere in hub space where the cog loader routine lives. This routine can end with a jump to cog register space.
One good thing in this scenario is that the cogstop op effectively stops (resets) the cog's PC and all in/out/dir... but preserves RAM contents, thus allowing the cog to be restarted with a direct jump into cog space without reloading the code. This can be useful in energy-saving applications, where events can be answered immediately and the energy-saving state is then restored.

If you go along with your idea, which is not bad, I would prefer that the first long be like this:
LL byte: longs*4 to be loaded
LH byte: cog address*4 to start fill
HL byte: cog address*4 to start execution (sets the PC)
HH byte: mode flags (register cleanup, in/dir/out reset, counter reset, ... and whatever else comes in handy)
If you align the longs by 4 (who needs to load just one or two?), this allows for up to 1024 (256*4) longs, eventually allowing for cog register space expansion in P3 (where, with a RAM redesign, the 1024 longs could also be used for instruction/data caches when running in hubexec mode, thus saving the dedicated caches).
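
A minimal sketch of the byte-per-field prefix long proposed above, in Python. The function names are hypothetical, and the mapping of LL/LH/HL/HH to bits 7..0, 15..8, 23..16 and 31..24 is an assumption based on the naming; the post does not spell out bit positions or flag meanings.

```python
def pack_byte_prefix(longs, fill_addr, start_addr, flags=0):
    """Pack the proposed prefix long. Each quantity is stored divided by 4,
    so it must be a multiple of 4 that fits in one byte after scaling."""
    for name, value in (("longs", longs),
                        ("fill_addr", fill_addr),
                        ("start_addr", start_addr)):
        if value % 4 != 0 or not 0 <= value // 4 <= 0xFF:
            raise ValueError(f"{name} must be a multiple of 4 in 0..1020")
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in one byte")
    return ((flags << 24)                 # HH: mode flags
            | ((start_addr // 4) << 16)   # HL: execution start / 4
            | ((fill_addr // 4) << 8)     # LH: fill start / 4
            | (longs // 4))               # LL: long count / 4


def unpack_byte_prefix(word):
    """Inverse of pack_byte_prefix; returns values scaled back up by 4."""
    return {
        "longs": (word & 0xFF) * 4,
        "fill_addr": ((word >> 8) & 0xFF) * 4,
        "start_addr": ((word >> 16) & 0xFF) * 4,
        "flags": (word >> 24) & 0xFF,
    }
```

With this scaling a single byte covers a 1024-long address range in steps of 4 longs, which is the headroom for a larger cog register space mentioned in the post.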

Dave Hein wrote: »

Just out of curiosity, how would a cog transfer hub data at full speed without encountering hub stalls? A "rdlongc inda++, ptra++" in a REPx loop would hit a hub stall every 8 loops. Is it possible to fill cog memory from the hub without hub stalls?

As I understand the current hubexec and icache implementation, if the hub-to-cog-RAM copy loop fits in the icache, it leaves all the hub cycles for data transfer.

cgracey wrote: »

I realized that to make this flexible, where loading can start at any cog address for any number of longs, we won't be able to use the full-speed RDWIDEx instructions, since they only work on wide (8-long) boundaries. To force wide boundaries on loads would be ugly. So, we'll have to settle for RDLONGC's.

I had forgotten about the wides, but since I have already suggested 4-long boundaries, the same applies for eight; that is OK with me.

dMajo · 2014-02-23 17:01

cgracey wrote: »
I changed the COGINIT/COGNEW instructions around, moving them to D-only opcodes. This way, two 16-bit addresses can be packed into D for PTRA (parameters) and PTRB (program). This means that parameters will start at a long-aligned address, like on Prop1. You can use AUGD with a cog starter instruction and launch statically-placed programs in what looks like one PASM instruction.

COGINIT has been renamed to COGRUN/COGRUNX, while COGNEW/COGNEWX start an idle cog. The -X suffix means start in hub memory, without loading any code into the cog. The normal versions that load the cog can use a prefix long to state how many longs to load, where to start loading inside the cog, where to jump to when done, and whether or not to pre-clear registers.
COGRUN  #/D,#0..7
COGRUNX #/D,#0..7
COGNEW  #/D
COGNEWX #/D
These changes are compiling now. I need to adjust the assembler and reassemble the ROM code to try it out.

So, we're there! We can start a cog directly from hub memory in hub exec mode, without any delays.

What does this mean? Do the X-suffixed instructions start in hubexec, while the ones without X use the scheme below? Will it use wides?

cgracey wrote: »

I just started drawing out what kinds of things ought to be done and I came to the same conclusion!

How about the prefix long being like this:

%c0_xxx_jjjjjjjjj_sssssssss_nnnnnnnnn = Load n longs starting at s, then jump to j. If c is 1 then pre-clear cog RAM before loading.

For simple programs that start at $000, the prefix long would just be the number of longs to load, with the optional MSB set for pre-clearing registers. A prefix long of $00000000 would just mean, "Don't load anything, jump to $000."

%c1_xxxxxxxxxxxxxx_jjjjjjjjjjjjjjjj = Jump to j (probably a hub address). If c is 1 then pre-clear cog RAM before jumping.

This scheme avoids the issues of multi-tasking setup, register remapping, etc., that are best handled by application code. This just gets things started.
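
The two prefix-long layouts quoted above can be modelled with a small decoder. This is only a sketch in Python: the function name and the dictionary keys are made up for illustration, and the bit positions are read directly off the two patterns (c in bit 31, the load/jump selector in bit 30, x bits ignored).

```python
def decode_prefix_long(prefix):
    """Decode a cog-start prefix long according to the two quoted patterns.

    %c0_xxx_jjjjjjjjj_sssssssss_nnnnnnnnn -> load n longs at s, jump to j
    %c1_xxxxxxxxxxxxxx_jjjjjjjjjjjjjjjj   -> jump to 16-bit address j
    """
    prefix &= 0xFFFFFFFF
    clear = bool(prefix >> 31)            # c: pre-clear cog RAM
    if (prefix >> 30) & 1:                # selector bit set: jump-only form
        return {"mode": "jump", "clear": clear, "jump": prefix & 0xFFFF}
    return {
        "mode": "load",
        "clear": clear,
        "jump": (prefix >> 18) & 0x1FF,   # j: 9-bit cog address to jump to
        "start": (prefix >> 9) & 0x1FF,   # s: 9-bit cog address to load at
        "count": prefix & 0x1FF,          # n: number of longs to load
    }
```

For example, a prefix of $00000000 decodes as "load 0 longs, jump to $000", matching the "don't load anything" case described in the post, and a small value such as 16 decodes as "load 16 longs at $000, jump to $000".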

Cluso99 · 2014-02-24 02:22

Here is something like the code that I think I would like to run to load a cog from hub using wides - ie fastest. I have left a couple of bits to fill in that I don't quite understand yet. Something like this could perhaps be placed in the Hub ROM.

DAT
        org     $0              ' boot code for Cog 0 ...
start
' Cog code that will load and start a cog
                                ' <cogload>  is hub addr of hubexec routine to load the cog
                                ' <codeaddr> is hub addr of cog code to load ($0-1E0)
                                ' <paraddr>  is hub addr of parameters
                                ' <cog>      is cog# to start
                                ' <len>      is length to load (bytes)             
        COGNEWX ????

DAT
        orgh    $1000           ' needs to be on a wide boundary for caching
' Load a Cog $000-1EF from Hub on a wide boundary...        
cogload SETPTRA <codeaddr>
        SETINDA #0
        REPS    #62*8,#1        ' 496 longs
        RDWIDEA #62             ' 62 wides
        MOV     INDA++,$1F1     ' repeat loading 62*8 longs
' set PTRA = PAR = hub addr of parameters & PTRB = <hubaddr> of cog code in hub
' Begin cog execution at $000
        JMPD    #0              ' J to cog $0 after executing following 2 instructions...
        SETPTRA <paraddr>       ' set hub PAR  addr
        SETPTRB <codeaddr>      ' set hub CODE addr
' Cog execution will now begin at cog $0

DAT
' Cog Program to be loaded into cog $0 and executed as a COGNEW...
        orgh    $2000           ' needs to be on a wide boundary
entry   jmp     #entry          ' loop here indefinitely
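
As a sanity check on the arithmetic in the loader above (62 wides of 8 longs each filling cog $000-$1EF), here is a small Python model of the copy. It is not PASM and says nothing about timing; the function name is hypothetical, and hub memory is modelled as a list of longs indexed by long rather than by byte address.

```python
LONGS_PER_WIDE = 8   # a wide is 8 longs


def load_cog_by_wides(hub_longs, code_index, cog_ram, wides=62):
    """Copy `wides` whole wides from hub_longs[code_index:] into cog_ram[0:].

    Returns the number of longs copied. code_index must sit on a wide
    boundary, mirroring the alignment requirement noted in the listing.
    """
    if code_index % LONGS_PER_WIDE != 0:
        raise ValueError("source must start on a wide (8-long) boundary")
    total = wides * LONGS_PER_WIDE
    if code_index + total > len(hub_longs) or total > len(cog_ram):
        raise ValueError("not enough source data or destination space")
    for w in range(wides):
        src = code_index + w * LONGS_PER_WIDE
        dst = w * LONGS_PER_WIDE
        cog_ram[dst:dst + LONGS_PER_WIDE] = hub_longs[src:src + LONGS_PER_WIDE]
    return total
```

With the default of 62 wides the function copies 496 ($1F0) longs, so locations from $1F0 upward are left untouched, which corresponds to the registers the earlier post was happy not to load.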

cgracey · 2014-02-24 02:45

Cluso99 wrote: »

Here is something like the code that I think I would like to run to load a cog from hub using wides - ie fastest. I have left a couple of bits to fill in that I don't quite understand yet. Something like this could perhaps be placed in the Hub ROM.

DAT
        org     $0              ' boot code for Cog 0 ...
start
' Cog code that will load and start a cog
                                ' <cogload>  is hub addr of hubexec routine to load the cog
                                ' <codeaddr> is hub addr of cog code to load ($0-1E0)
                                ' <paraddr>  is hub addr of parameters
                                ' <cog>      is cog# to start
                                ' <len>      is length to load (bytes)             
        COGNEWX ????

DAT
        orgh    $1000           ' needs to be on a wide boundary for caching
' Load a Cog $000-1EF from Hub on a wide boundary...        
cogload SETPTRA <codeaddr>
        SETINDA #0
        REPS    #62*8,#1        ' 496 longs
        RDWIDEA #62             ' 62 wides
        MOV     INDA++,$1F1     ' repeat loading 62*8 longs
' set PTRA = PAR = hub addr of parameters & PTRB = <hubaddr> of cog code in hub
' Begin cog execution at $000
        JMPD    #0              ' J to cog $0 after executing following 2 instructions...
        SETPTRA <paraddr>       ' set hub PAR  addr
        SETPTRB <codeaddr>      ' set hub CODE addr
' Cog execution will now begin at cog $0

DAT
' Cog Program to be loaded into cog $0 and executed as a COGNEW...
        orgh    $2000           ' needs to be on a wide boundary
entry   jmp     #entry          ' loop here indefinitely

Looks good. PTRB could be used, since it otherwise contains the program's start address. INDA could be initialized in hardware to $000 on cog start.

rogloh · 2014-02-24 06:01

Now that we can start up COGs very quickly in hubexec mode, there seems to me to be something uncannily similar between calling a function in C and spawning a hubexec COG with its start address in PTRB loaded into the PC, a stack pointer passed in PTRA, and some function arguments passed on the stack.

I could imagine code can now be written to very easily create hubexec threads in different COGs, where the call address is just the function's start address in hub memory and each thread is provided with its own stack on which all its parameters are passed. That's all a function call really does anyway. The called function knows how many arguments it uses, and these are quickly copied from the hub stack into its working "VM" registers in low COG RAM as required (wide copies could be really good for doing that), and the stack pointer is moved down to make room for any locals that need to be stored in hub below the arguments. To "return", we then only need a standardized way for the called code to pass back its return code when it exits and the COG stops; this probably entails a polling mechanism in which the caller reads back return data at the top of its stack somewhere. The caller can wait on this result and/or do other work in parallel in the meantime.

I can envisage some really good things coming from this fast capability to invoke regular C functions in parallel and run them on multiple COGs, hard to pinpoint exactly what I mean, but boy it feels good ...

I know you could sorta do this type of thing also in P1 on a coarser level, but it won't be as fast or as clean given it needs to load the entire COG and entire VM each time a "function" is called. Too much overhead to be quite as useful there, but on P2 it could start to fly.

Also, it might be nice to have a way to know whether a COG is running or stopped. Is there any way to tell using a single instruction instead of polling hub memory? Random idea: if there is a lock assigned to each COG, maybe we can also block/wait on one or more of these as required by whatever the caller wants to do. The called COG could release this lock whenever it stops, signifying completion.

Roger.
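
The spawn-and-wait pattern described in this post can be sketched on a desktop machine with ordinary threads. This is only an analogy in Python, not Propeller code: a thread stands in for a hubexec COG, a dictionary stands in for the stack frame in hub memory, and a `threading.Event` stands in for the per-COG lock that is released on completion. All names are hypothetical.

```python
import threading


def spawn_on_cog(func, *args):
    """Start func(*args) on a worker thread standing in for a hubexec COG.

    Returns a frame holding the arguments, a slot for the return value,
    and a completion flag the caller can poll or block on.
    """
    frame = {"args": args, "result": None, "done": threading.Event()}

    def cog_main():
        # The callee fetches its arguments from the frame, runs, then
        # writes its return value back before signalling completion.
        frame["result"] = func(*frame["args"])
        frame["done"].set()

    threading.Thread(target=cog_main).start()
    return frame


def wait_result(frame):
    """Block until the spawned call has finished, then return its result."""
    frame["done"].wait()
    return frame["result"]
```

A caller could start several of these, carry on with its own work, and collect each result later with `wait_result`, or check `frame["done"].is_set()` to poll without blocking.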

Heater. · 2014-02-24 06:21

We should not confuse calling a function in C and starting a new thread. No matter how easy that becomes.

A called function uses the stack and the CPU time of the caller. A called function returns a result to the caller.

A thread has its own stack and uses its own CPU time. At least in the case of starting one on a new processor. A thread cannot return a result to the call that started it.

Anyway, yes. You are right. It is great that starting COGs is so simple and quick.

I can envisage some really good things coming from this fast capability to invoke regular C functions in parallel and run them on multiple COGs, hard to pinpoint exactly what I mean,

Using C and OpenMP you can split algorithms like FFTs into chunks that run in parallel for a good speed gain. This already works on the P1. It will be even better on the P2 now.

cgracey · 2014-02-24 20:33

I've spent all day narrowing down a strange problem that cropped up after I added COGRUNX/COGNEWX.

It seems that there is an obvious (but intermittent) bug in Quartus. Why this problem is manifesting itself in such an observable way, but not causing other, harder-to-detect problems is baffling.

In the top-level Verilog file, I generate an 8-bit loop called "sel" that goes like this:

00000001
00000010
00000100
00001000
00010000
00100000
01000000
10000000
<repeat>

These are the hub cycle signals. Each cog receives this set, but offset by its cog number. For example, if "sel" is %00000001, cog0 gets %00000001, cog1 gets %00000010, cog2 gets %00000100, etc.

In order to get these 8 signals to each cog, I made an interim variable called "selx" that consists of two abutted copies of "sel", so that if "sel" is %00000001, "selx" is %0000000100000001. This will let me index what is "sel" with wrapping:

wire [15:0] selx = {2{sel}};

Then, in a GENERATE loop, I pass each cog, based on an iteration variable "i", an indexed portion of "selx", which is really some offset of "sel" with wrapping:

selx[i+7:i]

What I've found, that is absolutely nuts, is that the cog receives a blurred copy of selx, so that all the 1's are doubled, occurring twice in a row, sequencing like %00000011, %00000110, %00001100, etc.

By getting rid of "selx" and using "sel" directly, I get single 1's, which are proper, but that only works for cog0. As soon as I use any interim variable as an indexing mechanism, I get double 1's! This makes no sense. This is some kind of bug in Quartus' compiler.

Since this makes no sense, I don't know what to do about it. Any "solution" I come up with is not really a solution. I think my design is triggering some strange behavior in Quartus. If I make some change somewhere, it will probably jostle things up a bit and get Quartus out of this error case.

Any ideas?
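
For reference, the intended behaviour of the wrapped indexing described above can be modelled in a few lines of Python. This is a software model of what the slice of the doubled word should produce, not a statement about what Quartus generates; the function names are illustrative only.

```python
def cog_window(sel, i):
    """Bits [i+7:i] of the 16-bit value {sel, sel}, for i in 0..7."""
    sel &= 0xFF
    selx = (sel << 8) | sel          # two abutted copies of sel
    return (selx >> i) & 0xFF


def rotate_left8(value, n):
    """Rotate an 8-bit value left by n positions."""
    n %= 8
    value &= 0xFF
    return ((value << n) | (value >> (8 - n))) & 0xFF


def is_one_hot(value):
    """True when exactly one bit of value is set."""
    return value != 0 and (value & (value - 1)) == 0
```

In this model every window of a one-hot "sel" is itself one-hot, so a doubled pattern such as %00000011 cannot come from the indexing itself. One detail to watch: taken literally, the slice [i+7:i] moves the set bit downward as i increases (it equals `rotate_left8(sel, 8 - i)`), whereas the worked example in the post shows it moving upward. Either direction gives each cog its own distinct phase.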

Cluso99 · 2014-02-24 20:41

As a temporary solution, can you define selx as the 16 bit versions...
0000000100000001
0000001000000010
etc

jmg · 2014-02-24 20:46

cgracey wrote: »

What I've found, that is absolutely nuts, is that the cog receives a blurred copy of selx, so that all the 1's are doubled, occurring twice in a row, sequencing like %00000011, %00000110, %00001100, etc.

By getting rid of "selx" and using "sel" directly, I get single 1's, which are proper, but that only works for cog0. As soon as I use any interim variable as an indexing mechanism, I get double 1's! This makes no sense. This is some kind of bug in Quartus' compiler.

Sounds like a 'too many balls in the air' type of issue. Can you simplify the GENERATE enough to avoid it?
What about a table style syntax ?

Tubular · 2014-02-24 20:48

Can you insert a buffer or two safely to make sure it isn't some kind of race condition?

With Altera sometimes we used to physically break the signal(s) out and back in again, to avoid compiler complaints and also be able to scope the signals.

cgracey · 2014-02-24 20:48

Cluso99 wrote: »

As a temporary solution, can you define selx as the 16 bit versions...
0000000100000001
0000001000000010
etc

It seems to not make any difference.

I suspect we must change the Verilog code in some significant way to get Quartus out of this mess.

Perhaps a PDP-11 compatibility mode would be just the thing. Okay, that's just a joke.

potatohead · 2014-02-24 20:51

it will probably jostle things up a bit and get Quartus out of this error case.

!?! What the heck else could it be doing?

Any chance that compiler has an update or maintenance pack? Are you running on a license with support of some kind?

My only thought mirrored jmg's: can you express this differently without too many implications?

The idea of jostling it around is spooky. And it underscores the need to get some more testing done for sure.

cgracey · 2014-02-24 20:51

Tubular wrote: »

Can you insert a buffer or two safely to make sure it isn't some kind of race condition?

With Altera sometimes we used to physically break the signal(s) out and back in again, to avoid compiler complaints and also be able to scope the signals.

I had to play the same in-and-out games to get the PLL's the way I wanted them.

These signals are all clocked by the same "clk", so I don't think it's a race condition. Quartus is very good about closing timing properly. This seems like a logic problem. I think it's a compiler error. It's just too simple to be anything else. I mean, the compiler handles many details far more complicated than this.

cgracey · 2014-02-24 20:53

jmg wrote: »

Sounds like a 'too many balls in the air' type of issue. Can you simplify the GENERATE enough to avoid it?
What about a table style syntax ?

This bug may only be something that occurs within GENERATE. The whole construct already is very simple. What is table-style syntax, though?

cgracey · 2014-02-24 20:55

potatohead wrote: »

!?! What the heck else could it be doing?

Any chance that compiler has an update or maintenance pack? Are you running on a license with support of some kind?

My only thought mirrored jmg's: can you express this differently without too many implications?

The idea of jostling it around is spooky. And it underscores the need to get some more testing done for sure.

This is Quartus Web Edition V13.0.1. I'm downloading V13.1 right now, and I'll see if that helps. I've encountered spooky stuff like this in Quartus before, but it's been several years since.

jmg · 2014-02-24 21:12

cgracey wrote: »

This bug may only be something that occurs within GENERATE. The whole construct already is very simple. What is table-style syntax, though?

I think you were doing essentially ROM mapping ?
Looking at my notes re Xilinx, I find this

always @(PhR)  begin  // found syntax using ISE Webpack 
   case (PhR)
       8'b00000000  : CDac = 3'b000;
       8'b00010000  : CDac = 3'b001;
       8'b00100000  : CDac = 3'b010;
       8'b00110000  : CDac = 3'b011;
 ...

I think that compiles as a lookup, with no registers. In your case a 1:1 pick-off should result.
It can be verbose, but it does not expect too much of the compiler...
Or maybe concatenation {} can let you map what you want?

Bob Lawrence (VE1RLL) · 2014-02-24 21:24

Perhaps a PDP-11 compatibility mode would be just the thing. Okay, that's just a joke.

Here's a link for the young followers:

http://en.wikipedia.org/wiki/PDP-11

now you can laugh

Bob Lawrence (VE1RLL) · 2014-02-24 21:34

@cgracey

This is Quartus Web Edition V13.0.1. I'm downloading V13.1 right now

"Altera has announced the release of its Quartus II software version 13.1, delivering on average 30 percent and up to 70 percent reduction in compile times compared to the previous version, through significant algorithm optimization and increased parallelizatio

Read more at: http://www.techonlineindia.com/techonline/design_centers/286075/altera-releases-quartus-ii-software-v131

re:version 13.1, delivering on average 30 percent and up to 70 percent reduction in compile times compared to the previous version,"

re:30 percent and up to 70 percent reduction in compile times

If so , you must be happy about that.

:cool:
