Propeller II update - BLOG

Bill Henning · 2014-03-20 12:32

potatohead wrote: »

@Bill: One case is inline. With common labels, inline can work extremely lean. One can quite literally, type the block, using the same names, variables, addresses, the works. Doing this is super easy, and for people wanting to use some PASM, it is a no brainer to enter in just the bit they want to use and continue on. Originally, the snippet idea was to handle that case, but now that we have HUBEX, we can now just do it inline, quickly, easily.

I love inlining code, and for that case, re-use between C, Spin etc does not make sense.

My modification for Heater's proposal was for stand-alone drivers / modules, heck it mostly modified/codified COGNEW/COGINIT usage... and should not impact in-line code.

Maybe we should call this driver model?

Perhaps coded something like:

CON
    ' constants go here

DAT

myparams PARMS ' must be long aligned

' parameter block goes here, must be long aligned
a           LONG
' but can have words/bytes in it

myentry  CODE

' cog or hubexec code goes here

The above would assemble to a blob, usable from any language. Heck, it could automatically emit a .spin wrapper and a .h for C to expose the PARMS block!

This does not effect in-lined code, or flexibility at all.

Heck, drivers following this model could also be written in gas!

The nice thing is... these blobs would be language-independent, usable from Spin, C, Forth, et al.

Heater. · 2014-03-20 12:44

As I said, the Spin community will reject the idea of uncoupling PASM from Spin.

We are forever doomed to having to duplicate effort to create the same drivers suitable use with other languages.

Yes you can take PASM drivers from OBEX and rearrange them to work outside Spin but that is extra work and becomes a pain when the original object is updated and bug fixed, how do you track and reintegrate changes?

It's a shame, thousands of hours of human life wasted.

Thank you Spin community

Heater. · 2014-03-20 12:52

Bill,

I still don't get your suggestion. I don't care how COGINIT works. That's implementation. It's the language that is controlling this. Braking the symbolic linkage between Spin and PASM. To do that you need to put PASM in a semantically different space, the PASM section.

Now we have execute from HUB now, and perhaps that code should be able to link with Spin, and perhaps my suggested PASM section should be called the COG section instead.

But what if we want to have interoperable PASM that is hub executed? Eeek.

Roy Eltham · 2014-03-20 13:21

Guys,
I think this problem can be solved in the tools. We could easily make the spin compiler have an option that outputs the PASM blobs along with everything needed to interface with it for the C/C++ compilers. In fact, it's possible that we could have it output header/source file(s) that contains everything needed for the C/C++ code to use the PASM, including the DAT blobs in global arrays. This could include C/C++ declarations of pointer variables for all the labels in the DAT. Perhaps we could even incorporate some spinwrap or spin2cpp stuff and get some C/C++ code generated for the Spin functions.

This seems like a better approach to take with this stuff. No need to change or limit Spin/PASM.

potatohead · 2014-03-20 13:21

It is a lot harder to maximize the chip in two separate spaces Heater. And was about to post up the HUBEX point myself.

IMHO, the core SPIN, which will very likely end up native to the P2, on chip development needs to be unified as it is now.
Maximizing the two is what SPIN is all about. To really keep people out of trouble, let's enforce type checks, and all the other things that make languages big and not able to be self hosted easily. Makes no sense.

We have C for that! And this difference is why I want gcc to be awesome. Those who want all those extrascan have them.

Secondly, the core SPIN as Chip will eventually write it will be simple, loose, etc...

When we get Open Spin ported and checked, gcc will be rocking by then too. Add the capabilities needed to Open Spin, and even do what Bill suggests and automate the output!

I'll bet that project ends up useful and compelling. Doing that now, breaks some of the magic that attracts people to the Propeller, and I'm very strongly opposed.

I would use that, given a simple option, or maybe even a new section definition called "PORTABLE" where the rules needed are enforced. Bring that on. It will get used.

However we accomplish it, breaking the tight design marriage of silicon, PASM and SPIN shouldn't be done.

potatohead · 2014-03-20 13:28

Re: Thank you...

Well, one could say thanks for breaking a sweet environment just to make it fit "the one way" of C, but you won't see me doing that.

To be perfectly frank, Chip's way ob building these things is compelling. A person can jump in and do neat things without all the "pro" type worries. Having that exist is important.

Truth is, adding interoperability options to OpenSpin will make great sense. Do it there and it will compete nicely. I'll use it when appropriate. Forcing it into what Chip will build isn't OK. Some us want that, and we want it for good reasons, and that we got it on P1 is a big reason why many of us are here too. An expansion at the OpenSpin level will make sense and C will make even better sense due to the much greater capability in P2. We will have fewer worries about this stuff in the end.

Just think of all the wasted hours making sure things work for others no matter what instead of just making something happen...

Cuts both ways man.

potatohead · 2014-03-20 13:56

What Roy said... Didn't see his post. Agreed.

Cluso99 · 2014-03-20 14:23

Bill Henning wrote: »

David,

I only want to get rid of the

REPD #i,#n ' constant loop count

variant, and keep the other register based variant, which I agree is extremely useful.

Bill,
I think you missed the subtlety of the REPS/REPD difference.
The 3 delay version permits conditional execution eg
if_z REPD #I,#n

Personally, I cannot see how one would use this because you are going to slip thru' into the code anyway. So I am quite happy to have it removed unless someone can see a use for the conditional execution.

cgracey · 2014-03-20 14:57

Bill Henning wrote: »
Chip,

Unless I am missing something, a dual-op instruction can be recovered:
REPS    #n,#i    - execute 1..64 instructions 1..65536 times,   requires 1 spacer instruction
REPD    #n,#i    - execute 1..64 instructions 1..512 times,     requires 3 spacer instructions

instructions (iiiiii = #i-1, n_nnnn_nnnnnnnnn_nnn/nnnnnnnnn = #n-1)                            clocks
REPD #n,#i seems to be an exact subset of REPS #n,#i - if I am correct, that frees up a double argument instruction.

REPD is unique, since it executes not in the 2nd stage of the pipeline, like REPS, but in the 4th, so that you can use a variable for the iteration count. REPD needs three spacer instructions, while REPS needs only one.

Bill Henning · 2014-03-20 15:00

Chip,

I was only referring to the REPD #n, #i variant - it duplicates the functionality of REPS #n,#i

I LIKE REPD D, #i

cgracey wrote: »

REPD is unique, since it executes not in the 2nd stage of the pipeline, like REPS, but in the 4th, so that you can use a variable for the iteration count. REPD needs three spacer instructions, while REPS needs only one.

cgracey · 2014-03-20 15:12

Bill Henning wrote: »
Very similar to Heater's proposal to above, but slightly simpler and thus more likely to be adapted

From the latest docs:

AUGS could be used to point to the parameter block with #.

The problem is that D is a register - a cog register - which pretty much requires the cog image to be patched. Except....

I think what we need to do is:
       LOCPTRA  #hubaddress_of_parameters
       LOCPTRB  #hubaddress_of_program
       COGNEW  ptrb,ptra
If SPIN is disallowed from accessing labels inside the programs code, then binary objects can be written that will work the same with Spin and C.

Spin & C can set up initial data in the parameter section, just like Spin currently does in the cog image - but this time, as the cog image is not modified, the same blob would work fine with C or SPIN.

Note the above does not care if the cog will be running in cog or hubexec mode!

A binary driver would just have to publish the definition of its parameter block, the start of which would have to be long aligned.

This also means that we can free up two dual-reg opcodes, and use argumentless opcodes for COGNEW and COGINIT ... forcing the PTRA/PTRB usage using LOCPTR. Unless I am mistaken, this will actually require less logic to implement than the current COGNEW/COGINIT!

Those docs are out-of-date, per the warning at the top of the file. COGINIT is gone, replaced by COGRUN/COGRUNX. There's also COGNEW/COGNEWX. The -X variants start the cog in hub exec mode. There is only one parameter for code/pointer now, and that is

D = %aaaaaaaaaaaaaaaa_bbbbbbbbbbbbbbbb

D[15:0] point to the starting long in hub (minus the two %00 LSBs), and become PTRB of the cog getting started (again, minus the two %00 LSBs).
D[31:16] become PTRA of the cog getting started (minus the two %00 LSB's)

So, a single 32-bit value now serves both purposes. This frees up S/# in COGRUN/COGRUNX to provide the cog number, keeping the entire operation atomic, without needing to make self-modifying code to convey the cog#.

COGRUN D/#,S/#
COGRUNX D/#,S/#
COGNEW D/#
COGNEWX D/#

For COGNEW to receive back a cog#, it must use D and not #.

cgracey · 2014-03-20 15:15

pedward wrote: »

Why can't you just give us a slower DE0-nano instead of redacting so much functionality? I'd be happy with a 40-60Mhz DE0 if it meant a full implementation of the core.

It's not speed that determines what fits. It's functionality. Otherwise, you bet I'd make full versions. This is all we can get out of the DE0-Nano.

Bill Henning · 2014-03-20 15:23

Thank you! Much better.

Now I remember you mentioning this earlier.

cgracey wrote: »

Those docs are out-of-date, per the warning at the top of the file. COGINIT is gone, replaced by COGRUN/COGRUNX. There's also COGNEW/COGNEWX. The -X variants start the cog in hub exec mode. There is only one parameter for code/pointer now, and that is

D = %aaaaaaaaaaaaaaaa_bbbbbbbbbbbbbbbb

D[15:0] point to the starting long in hub (minus the two %00 LSBs), and become PTRB of the cog getting started (again, minus the two %00 LSBs).
D[31:16] become PTRA of the cog getting started (minus the two %00 LSB's)

So, a single 32-bit value now serves both purposes. This frees up S/# in COGRUN/COGRUNX to provide the cog number, keeping the entire operation atomic, without needing to make self-modifying code to convey the cog#.

COGRUN D/#,S/#
COGRUNX D/#,S/#
COGNEW D/#
COGNEWX D/#

For COGNEW to receive back a cog#, it must use D and not #.

cgracey · 2014-03-20 15:24

Cluso99 wrote: »

Bill,
I think you missed the subtlety of the REPS/REPD difference.
The 3 delay version permits conditional execution eg
if_z REPD #I,#n

Personally, I cannot see how one would use this because you are going to slip thru' into the code anyway. So I am quite happy to have it removed unless someone can see a use for the conditional execution.

REPD #,# is needed for GETPIX:

REPD #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #3 'need 3 clocks in stage 2 of initial GETPIX
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction

Cluso99 · 2014-03-20 16:43

Dave Hein wrote: »

I've modified PASM programs to work with C by moving all of the variables that are initialized by Spin code to the beginning of the PASM code. The first line of the PASM code just jumps over the variables. This allows C to set up PASM variables just like Spin does. As an example, here are the first few lines of a VGA driver written in PASM.

                        org     0

initialization          jmp     #skipover

directionState          long    0
videoState              long    0
frequencyState          long    0
numTileLines            long    0
numTileVert             long    0
visibleScale            long    0
invisibleScale          long    0
horizontalLongs         long    0
tilePtr1                long    0
tileMap1                long    0
pixelColorsAddress      long    0
syncIndicatorAddress    long    0


skipover
                        mov     vcfg, videoState          ' Setup video hardware.
                        mov     frqa, frequencyState
                        movi    ctra, #%0_00001_101

The C startup routine looks like this

#include <stdint.h>
#include "propeller.h"

#define bitsPerPixel      1
#define horizontalScaling 1
#define horizontalPixels  640
#define NUM_TILES 256

extern int8_t vga_array[];

typedef struct VgaVarS {
    uint32_t directionState;
    uint32_t videoState;
    uint32_t frequencyState;
    uint32_t verticalScaling;
    uint32_t verticalPixels;
    uint32_t visibleScale;
    uint32_t invisibleScale;
    uint32_t horizontalLongs;
    uint32_t *colortable1;
    uint32_t *buffer1;
    uint32_t *pixelColorsAddress;
    uint32_t *syncIndicatorAddress;
} VgaVarT;

void VgaStart(int pinGroup, int verticalResolution, int *newDisplayPointer,
int *colortable, int *pixelColors, int *syncIndicator)
{
    int i, temp;
    VgaVarT vga;
    int frequencyState;

    vga.directionState = 0xff << (8 * pinGroup);
    vga.videoState = 0x200000ff | (pinGroup << 9) | (bitsPerPixel << 28);
    temp = (25175000 + 1600) / 4;
    frequencyState = 1;
    for (i = 0; i < 32; i++)
    {
        temp <<= 1;
        frequencyState <<= 1;
        if (i == 31) frequencyState++;
        if (temp >= CLKFREQ)
        {
            temp -= CLKFREQ;
            frequencyState++;
        }
    }
    vga.frequencyState = frequencyState;
    vga.verticalScaling = 480 / verticalResolution;
    vga.verticalPixels = 480 / vga.verticalScaling;
    vga.visibleScale = (horizontalScaling << 12) + ((640 * 32) >> bitsPerPixel) / horizontalPixels;
    vga.invisibleScale = ((8 << bitsPerPixel) << 12) + 160;
    vga.horizontalLongs = horizontalPixels / (32 >> bitsPerPixel);
    vga.buffer1 = newDisplayPointer;
    vga.colortable1 = colortable;
    vga.pixelColorsAddress = pixelColors;
    vga.syncIndicatorAddress = syncIndicator;

    memcpy(vga_array + 4, &vga, sizeof(VgaVarT));

    cognew(vga_array, 0);
}

Using this method, both C and Spin can use the same PASM code.

Dave,
Looks great. Only thing I would change is the def to begin at the cogstart instruction and the first long is reserved for the jump over instruction. (not explained well but I hope you know what I mean).

rogloh · 2014-03-20 16:58

Cluso99 wrote: »

Bill,
I think you missed the subtlety of the REPS/REPD difference.
The 3 delay version permits conditional execution eg
if_z REPD #I,#n

Personally, I cannot see how one would use this because you are going to slip thru' into the code anyway. So I am quite happy to have it removed unless someone can see a use for the conditional execution.

Actually it might be possible to escape the slip through execution of the conditional REPD loop by breaking out in the spacer instruction...is this allowed?

if_z  REPD #i, #n
if_nz JMP  #out
      (spacer)
      (spacer)
      (looped code)

In fact perhaps the same technique could apply to the REPS form if the escape condition is already known (eg if Z=0 don't execute the loop), despite the fact that REPS opcode itself is always non conditional. You could probably just do the jump before the REPS loop however so it doesn't buy a lot. At best it probably just saves you one clock cycle if that spacer wasn't going to get used for something else.

REPS #i, #n
if_nz JMP  #out
      (looped code)

Cluso99 · 2014-03-20 16:59

Thanks Chip. I missed the subtlety of the S/#i option.
REPD also can be conditional too IIRC ? Is this necessary - perhaps its just a freebie anyway?

BTW I wonder if REPD1 & REPD3 might be more meaningful??? Perhaps we could rename the other xxxD instructions similarly??? ie xxxDn where n = # of delayed instructions.

Cluso99 · 2014-03-20 17:03

cgracey wrote: »

REPD #,# is needed for GETPIX:

REPD #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #3 'need 3 clocks in stage 2 of initial GETPIX
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction

Aha! Chip, you have really thought this thru'. This is really clever

Bill Henning · 2014-03-20 17:07

I swear I am not trying to be difficilt, I am trying to understand every little detail of the P2...

WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #3 'need 3 clocks in stage 2 of initial GETPIX
REPS #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction

would it not behave the same?

If not, would not

WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #2 'need 3 clocks in stage 2 of initial GETPIX, uses the 1 cycle of REPS as well, so can wait 1 cycle less
REPS #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction

work?

Or even

WAIT #5 'need 3 clocks in stage 2 of initial GETPIX, uses the 1 cycle of REPS as well, so can wait 1 cycle less
REPS #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction

I am trying to understand every crook and cranny of the pipeline, so I can write crazy code as usual

cgracey wrote: »

REPD #,# is needed for GETPIX:

REPD #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #3 'need 3 clocks in stage 2 of initial GETPIX
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction

Cluso99 · 2014-03-20 17:21

I have noted that the COGRUN/COGRUNX and COGNEW/COGNEWX cater for hubex start.
Do we require both the cog and hubex versions?
The P2 hw could determine whether it is a cog or hub start address simply by whether the address >$1FF longs. Perhaps there is more silicon required for this rather than having 2 separate instructions?
Otherwise, the compiler could determine which instruction version to use???

JMP/CALL instructions already do this anyway.

BTW the same applies to the monitor program (as I have already done this with my now old P2 Debugger). I don't see any real need to see what is in the lower $200 longs of hub - we can write a program to show this, and ultimately it will be published anyway.

Just a thought.

Cluso99 · 2014-03-20 17:23

Bill Henning wrote: »

I swear I am not trying to be difficilt, I am trying to understand every little detail of the P2...
.....
I am trying to understand every crook and cranny of the pipeline, so I can write crazy code as usual

No - Not you Bill

ozpropdev · 2014-03-20 17:32

Bill Henning wrote: »

I am trying to understand every crook and cranny of the pipeline, so I can write crazy code as usual

Bill,

CRAZY CODE!
Bring it on....

cgracey · 2014-03-20 17:33

When a cog is started, you need to supply a hub address that a program will be loaded into the cog from (COGRUN/COGNEW), or that the cog will start executing at from hub memory (COGRUNX/COGNEWX).

If you were to do a COGRUNX/COGNEWX using an address below $200, you would jump to that address inside that cog, without loading anything into it. This would be okay if you knew what was sitting in the cog's memory, but would be reckless, otherwise.

potatohead · 2014-03-20 17:36

If you were to do a COGRUNX/COGNEWX using an address below $200, you would jump to that address inside that cog, without loading anything into it.

This is pretty nice. COGS can have multiple entry points now, or a COG can do something and exit, it's contents known for the next start @ address! This is worth having the instructions differentiated.

cgracey · 2014-03-20 17:37

Bill Henning wrote: »

I swear I am not trying to be difficilt, I am trying to understand every little detail of the P2...

WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #3 'need 3 clocks in stage 2 of initial GETPIX
REPS #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction

would it not behave the same?

If not, would not

WAIT #3 'need 3 clocks in stage 1 of initial GETPIX
WAIT #2 'need 3 clocks in stage 2 of initial GETPIX, uses the 1 cycle of REPS as well, so can wait 1 cycle less
REPS #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction

work?

Or even

WAIT #5 'need 3 clocks in stage 2 of initial GETPIX, uses the 1 cycle of REPS as well, so can wait 1 cycle less
REPS #128,#1 'do 128 pixels
WAIT #3 'need 3 clocks in stage 3 of initial GETPIX
GETPIX INDA++ 'GETPIX takes three clocks per instruction

I am trying to understand every crook and cranny of the pipeline, so I can write crazy code as usual

GETPIX actually begins executing in stage 1 of the pipeline, and you need to afford it three clocks per stage, from then on. Sneaking a REPS in there will cause the stages to advance after just one clock when REPS is in stage 2, which is its executing stage. There is no way around doing something like this:

WAIT #3
WAIT #3
WAIT #3
GETPIX D 'three clocks per stage needed in stages 1,2,3, then GETPIC takes 3 clocks.

Cluso99 · 2014-03-20 17:49

cgracey wrote: »

When a cog is started, you need to supply a hub address that a program will be loaded into the cog from (COGRUN/COGNEW), or that the cog will start executing at from hub memory (COGRUNX/COGNEWX).

If you were to do a COGRUNX/COGNEWX using an address below $200, you would jump to that address inside that cog, without loading anything into it. This would be okay if you knew what was sitting in the cog's memory, but would be reckless, otherwise.

Thanks Chip - Of course - I totally missed that

Bill Henning · 2014-03-20 18:18

Thanks Chip!

Now it makes sense to me.

(The coin finally dropped!)

cgracey wrote: »

GETPIX actually begins executing in stage 1 of the pipeline, and you need to afford it three clocks per stage, from then on. Sneaking a REPS in there will cause the stages to advance after just one clock when REPS is in stage 2, which is its executing stage. There is no way around doing something like this:

WAIT #3
WAIT #3
WAIT #3
GETPIX D 'three clocks per stage needed in stages 1,2,3, then GETPIC takes 3 clocks.

rogloh · 2014-03-20 18:39

cgracey wrote: »

When a cog is started, you need to supply a hub address that a program will be loaded into the cog from (COGRUN/COGNEW), or that the cog will start executing at from hub memory (COGRUNX/COGNEWX).

If you were to do a COGRUNX/COGNEWX using an address below $200, you would jump to that address inside that cog, without loading anything into it. This would be okay if you knew what was sitting in the cog's memory, but would be reckless, otherwise.

I think this address less than $200 "feature" will ultimately end up getting used in some weird and wonderful way by someone somehow. Little things like that end up allowing interesting things to be coded. A persistent pasm helper COG doing something in parallel with the code running in our own COG, has zero load time, pointer argument copied into PTRA, hmmm.....

UPDATE: I'm thinking this could potentially be a fast way to do an external memory driver, assuming there was a locking mechanism to separate COGs from competing for the same resource. Once a COG has taken the resource lock (if required), the driver COG is restarted, reads data/commands from its PTRA argument, then runs to completion writes its result, and exits. This may save the driver polling lots of different areas of memory sequentially for different COGs and thereby reduce some latency.

Sapieha · 2014-03-20 19:13

Hi Chip.

Have You removed RESD ---> else missed add them to Short descriptions

ozpropdev · 2014-03-20 19:18

Sapieha wrote: »

Hi Chip.

Have You removed RESD ---> else missed add them to Short descriptions

Its called TARG now

Propeller II update - BLOG

Comments