The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

dnalor · 2015-07-26 16:06

OK, so "verschieben" (German for "to shift" or "to postpone") is the word of the day.

dnalor · 2015-07-26 16:38

@kwinn http://forums.parallax.com/discussion/comment/1338833/#Comment_1338833 below the image!?

@cgracey
        XOR     outa,#%0001     'mainline code
        LINK    $1F4,$1F5 wc,wz 'interrupt LINK occurs
        NOP                     'pipeline NOP'd
        XOR     outa,#%1000     'interrupt code
Can a higher interrupt jump in before the XOR? If yes, I would like to have an automatic IDEFER on interrupt entry, so a lower-priority interrupt can run some critical code before a higher interrupt can break in after an IPERMIT.

MJB · 2015-07-26 18:53

Can a higher interrupt jump in before the XOR? If yes, I would like to have an automatic IDEFER on interrupt entry, so a lower-priority interrupt can run some critical code before a higher interrupt can break in after an IPERMIT.

Isn't that what it means to be a higher interrupt - interrupting the lower as fast as possible?

dnalor · 2015-07-26 19:27

Yes, and sometimes no. Here you are forced into (only) three interrupts with three levels. Sometimes you may want only two levels, or even one. With a sixteen-core device this is of course not really a must-have.
But anyhow it is important to know what happens...

jmg · 2015-07-26 20:37

Would it be possible to produce a listing of the instructions executed when single stepping?

Of course, that is merely SW.
Once a Debug Kernel has the HW to launch one opcode at a time, it can simply peek into the return register, and you have an execution trace you can report.
The important HW detail is that a return from Debug (with INT armed) executes one opcode before firing the INT back to the Debug.
Such a traced (some call this animated) flow could even be quite fast, limited mainly by the link to the PC.
A timed report would also be possible: if the Debug captured CNT before and after the call, the difference minus an offset would give the cycles executed.
From Chip's comments, mostly that cycles = 1oc, but there are paired opcodes that need 2oc and REP needs Noc.
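
The timing arithmetic described here can be sketched roughly as follows. This is Python for illustration only, not debugger code. It assumes CNT is a 32-bit free-running counter as on the P1, and the function name, the `overhead` parameter and the sample values are all hypothetical.

```python
CNT_MASK = 0xFFFFFFFF  # assumption: 32-bit free-running counter, as on the P1


def elapsed_cycles(cnt_before, cnt_after, overhead=0):
    """Cycles between two CNT samples, minus a fixed debug-call overhead.

    Masking the difference keeps the result correct when the counter
    wraps around between the two samples.
    """
    return ((cnt_after - cnt_before) & CNT_MASK) - overhead


# Made-up sample values: the counter wraps between the two captures.
print(elapsed_cycles(0xFFFFFFF0, 0x00000010, overhead=8))  # 24
```

The `overhead` value would have to be measured for the actual debug entry and exit path; 8 is only a placeholder.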

Dave Hein · 2015-07-26 21:06

I'm working on converting the P1 Spin interpreter to run on the P2, and I ran into a few instructions that don't exist on the P2. I can't find the TJNZ and TJZ instructions, so it appears that these will require a TEST instruction followed by a JNZ or JZ. Are there TJNZ and TJZ instructions for the P2? I suppose the pipelining in the P2 makes it difficult to implement these instructions.
I also had to replace JMPRET with CALLD. I may have to move the code around since the general version of CALLD uses relative addresses that only have a range of -256 to +255.

cgracey · 2015-07-26 21:26

I'm working on converting the P1 Spin interpreter to run on the P2, and I ran into a few instructions that don't exist on the P2. I can't find the TJNZ and TJZ instructions, so it appears that these will require a TEST instruction followed by a JNZ or JZ. Are there TJNZ and TJZ instructions for the P2? I suppose the pipelining in the P2 makes it difficult to implement these instructions.
I also had to replace JMPRET with CALLD. I may have to move the code around since the general version of CALLD uses relative addresses that only have a range of -256 to +255.

The JZ/JNZ/JS/JNS instructions base their decision on the D register's contents, not the flags. So, JZ/JNZ on the Prop2 are like TJZ/TJNZ on the Prop1. I'm wondering if the T should be put back.
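
To make the naming issue concrete, here is a small Python model of the two conventions. It only illustrates the distinction described above and is not a model of the real pipeline. The function names are hypothetical.

```python
def tjz_taken(d_value):
    """Test-and-jump: the decision comes from the D register's contents
    (P1 TJZ, and JZ as described above for the Prop2)."""
    return (d_value & 0xFFFFFFFF) == 0


def flag_jz_taken(z_flag):
    """Flag-based jump: the decision comes from a previously set Z flag,
    which is what JZ means on many other MCUs."""
    return bool(z_flag)


# D holds zero, but the Z flag was left clear by some earlier instruction.
d, z = 0, False
print(tjz_taken(d), flag_jz_taken(z))  # True False
```

The two conventions disagree whenever the flag is stale relative to the register, which is why the mnemonic matters.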

cgracey · 2015-07-26 21:32

I'm working on converting the P1 Spin interpreter to run on the P2, and I ran into a few instructions that don't exist on the P2. I can't find the TJNZ and TJZ instructions, so it appears that these will require a TEST instruction followed by a JNZ or JZ. Are there TJNZ and TJZ instructions for the P2? I suppose the pipelining in the P2 makes it difficult to implement these instructions.
I also had to replace JMPRET with CALLD. I may have to move the code around since the general version of CALLD uses relative addresses that only have a range of -256 to +255.

And that +255/-256 range is in bytes, not longs, making it +63/-64 instructions. Maybe I should make those relative offsets be in longs, with the assembler verifying that source and target share a common long alignment.
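
The range arithmetic can be checked with a few lines of Python. This is illustration only. It assumes a 9-bit two's-complement offset field and 4-byte instructions, as stated in the thread, and the helper name is hypothetical.

```python
def signed_range(bits):
    """Minimum and maximum of a two's-complement field of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


lo, hi = signed_range(9)            # (-256, 255)

# Offset counted in bytes: divide by the 4-byte instruction size.
reach_if_bytes = (lo // 4, hi // 4)  # (-64, 63) instructions

# Offset counted in longs: every step is one instruction.
reach_if_longs = (lo, hi)            # (-256, 255) instructions

print(reach_if_bytes, reach_if_longs)
```

Counting in longs gives four times the reach from the same 9 bits, because the two low bits are no longer spent on byte positions inside an instruction.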

Bean · 2015-07-26 21:40

Chip, I haven't commented much on the P2. But to me single-stepping and breakpoints are two features that I sorely missed going from the SX to the P1. If the P2 could have these features it would be great, but please do not delay the release of the P2 too long for anything.
Bean

jmg · 2015-07-26 21:46

The JZ/JNZ/JS/JNS instructions base their decision on the D register's contents, not the flags. So, JZ/JNZ on the Prop2 are like TJZ/TJNZ on the Prop1. I'm wondering if the T should be put back.

Yes, certainly put the T back if the opcode really does a Test and Jump. In many MCUs, JZ acts on the Zero Flag.

Dave Hein · 2015-07-26 23:22

I agree that the T should be added to the JZ and JNZ instructions since they are identical to the P1's TJZ and TJNZ instructions. And for consistency the JS and JNS should be TJS and TJNS. This would reduce the confusion.
I also think that the jump offsets should be in terms of longs and not bytes. Otherwise, the range is very small and it seems like the 2 LSBs would be wasted if the offsets refer to bytes.

jmg · 2015-07-26 23:34

I agree that the T should be added to the JZ and JNZ instructions since they are identical to the P1's TJZ and TJNZ instructions. And for consistency the JS and JNS should be TJS and TJNS. This would reduce the confusion.
I also think that the jump offsets should be in terms of longs and not bytes. Otherwise, the range is very small and it seems like the 2 LSBs would be wasted if the offsets refer to bytes.

Curious whether there can ever be a case of non-word-aligned destinations?

potatohead · 2015-07-27 00:55

Why would you want them? Instructions are word aligned.

Cluso99 · 2015-07-27 02:04

You mean instructions are long aligned (always).
We did use a trick with this for fast hub downloads on the P1, but it is no longer required.

cgracey · 2015-07-27 02:56

Chip, I haven't commented much on the P2. But to me single-stepping and breakpoints are two features that I sorely missed going from the SX to the P1. If the P2 could have these features it would be great, but please do not delay the release of the P2 too long for anything.
Bean

I hear you. I'll add what we need to make a non-maskable super-interrupt to be used for debugging. It comes down to a little steering logic to exploit INA/INB's shadow registers as hidden LINK D and S registers. It's providence that they just happen to exist and can be put to such good use.

cgracey · 2015-07-27 03:03

You mean instructions are long aligned (always).
We did use a trick with this for fast hub downloads on the P1, but it is no longer required.

In hub memory, instructions can exist at any byte offset. This is advantageous when, say, a byte string is inserted between instructions. In blocks of pure functional code, though, there should be no long-offset interruptions. So, 9-bit relative branches ought to be reckoned in longs, not bytes. I will change it.

cgracey · 2015-07-27 03:04

The JZ/JNZ/JS/JNS instructions base their decision on the D register's contents, not the flags. So, JZ/JNZ on the Prop2 are like TJZ/TJNZ on the Prop1. I'm wondering if the T should be put back.

Yes, certainly put the T back if the opcode really does a Test and Jump. In many MCUs, JZ acts on the Zero Flag.

Will do.

jmg · 2015-07-27 03:06

In hub memory, instructions can exist at any byte offset. This is advantageous when, say, a byte string is inserted between instructions. In blocks of pure functional code, though, there should be no long-offset interruptions. So, 9-bit relative branches ought to be reckoned in longs, not bytes. I will change it.

Sounds good - that is the sort of checking an Assembler can do and report if needed. Larger jump reach is always appreciated in MCUs.

Electrodude · 2015-07-27 03:12

You mean instructions are long aligned (always).
We did use a trick with this for fast hub downloads on the P1, but it is no longer required.

In hub memory, instructions can exist at any byte offset. This is advantageous when, say, a byte string is inserted between instructions. In blocks of pure functional code, though, there should be no long-offset interruptions. So, 9-bit relative branches ought to be reckoned in longs, not bytes. I will change it.

I would much rather waste up to three bytes for every byte string than have to deal with misaligned instructions. Is this the only reason why the assembler now does everything in byte addresses, or are there other reasons too? Can you please add a mode to the assembler (probably enabled through a flag in the source file) that makes all PASM addresses long addresses like on the P1?

jmg · 2015-07-27 03:46

I would much rather waste up to three bytes for every byte string than have to deal with misaligned instructions. Is this the only reason why the assembler now does everything in byte addresses, or are there other reasons too? Can you please add a mode to the assembler (probably enabled through a flag in the source file) that makes all PASM addresses long addresses like on the P1?

ASM needs to be byte-granular for Data arrays and Data management, so you cannot work only in longs. However, code can be given a rule that it needs to be Long aligned, and ASM usually has a variant of ORG to do that, plus the opcode address is checked for long alignment during build. That is what Chip is now doing.
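
A rough Python sketch of the alignment bookkeeping an assembler could do under such a rule is shown below. This is not how the actual P2 assembler works. All names are hypothetical, and the only assumption is 4-byte longs.

```python
LONG = 4  # bytes per long


def align_up(addr, alignment=LONG):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def padding_after(start, length, alignment=LONG):
    """Pad bytes needed after a data block so the next opcode is aligned."""
    end = start + length
    return align_up(end, alignment) - end


def check_opcode_address(addr):
    """Build-time check: refuse to place an opcode at an unaligned address."""
    if addr % LONG:
        raise ValueError(f"opcode at ${addr:X} is not long-aligned")
    return addr


# A 5-byte string at $100 ends at $105, so 3 pad bytes bring us to $108.
print(padding_after(0x100, 5), hex(check_opcode_address(align_up(0x105))))
```

The padding never exceeds three bytes per data block, which matches the "up to three bytes for every byte string" cost mentioned earlier in the thread.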

Electrodude · 2015-07-27 04:15

I would much rather waste up to three bytes for every byte string than have to deal with misaligned instructions. Is this the only reason why the assembler now does everything in byte addresses, or are there other reasons too? Can you please add a mode to the assembler (probably enabled through a flag in the source file) that makes all PASM addresses long addresses like on the P1?

ASM needs to be byte-granular for Data arrays and Data management, so you cannot work only in longs. However, code can be given a rule that it needs to be Long aligned, and ASM usually has a variant of ORG to do that, plus the opcode address is checked for long alignment during build. That is what Chip is now doing.

If cogram can now be byte-addressed, then where are the other two bits stored?

cgracey · 2015-07-27 04:22

I would much rather waste up to three bytes for every byte string than have to deal with misaligned instructions. Is this the only reason why the assembler now does everything in byte addresses, or are there other reasons too? Can you please add a mode to the assembler (probably enabled through a flag in the source file) that makes all PASM addresses long addresses like on the P1?

ASM needs to be byte-granular for Data arrays and Data management, so you cannot work only in longs. However, code can be given a rule that it needs to be Long aligned, and ASM usually has a variant of ORG to do that, plus the opcode address is checked for long alignment during build. That is what Chip is now doing.

If cogram can now be byte-addressed, then where are the other two bits stored?

Cog RAM is only byte-addressable in the assembler, and this is only the case since cog programs reside in hub RAM before they are loaded into cog RAM. Once a program is loaded into cog RAM, it is only long-addressable. So, the assembler makes sure that the addresses of cog registers have a common byte offset, in anticipation of them being loaded into a cog.
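
The "common byte offset" rule can be pictured with a short Python sketch. It is illustration only. The function name and the example addresses are hypothetical, and the only assumptions are 4-byte longs and a cog image that is copied from hub RAM starting at `image_base`.

```python
def cog_register_index(hub_addr, image_base):
    """Map a hub byte address inside a cog image to its cog register number.

    The image itself may start at any byte address in hub RAM, but every
    register in it must sit a whole number of longs from that start.
    """
    delta = hub_addr - image_base
    if delta % 4:
        raise ValueError("address does not share the image's byte offset")
    return delta // 4


base = 0x1002  # hypothetical image start, deliberately not long-aligned in hub
print(cog_register_index(0x1002, base), cog_register_index(0x100A, base))  # 0 2
```

Once loaded, only the long index matters. The byte address exists only in the hub image that the assembler produces.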

Electrodude · 2015-07-27 05:56

I would much rather waste up to three bytes for every byte string than have to deal with misaligned instructions. Is this the only reason why the assembler now does everything in byte addresses, or are there other reasons too? Can you please add a mode to the assembler (probably enabled through a flag in the source file) that makes all PASM addresses long addresses like on the P1?

ASM needs to be byte-granular for Data arrays and Data management, so you cannot work only in longs. However, code can be given a rule that it needs to be Long aligned, and ASM usually has a variant of ORG to do that, plus the opcode address is checked for long alignment during build. That is what Chip is now doing.

If cogram can now be byte-addressed, then where are the other two bits stored?

Cog RAM is only byte-addressable in the assembler, and this is only the case since cog programs reside in hub RAM before they are loaded into cog RAM. Once a program is loaded into cog RAM, it is only long-addressable. So, the assembler makes sure that the addresses of cog registers have a common byte offset, in anticipation of them being loaded into a cog.

If it's not really byte-addressable once it's loaded into cogram, then why does the assembler pretend it is? Does it benefit the programmer in any way? Unless there's some obvious advantage I'm missing, can there be a separate mode for cogexec (and hubexec if possible, even if that forces the compiler to emit only long-aligned code) that uses long addressing? It seems pointless to me to multiply numbers by four just to have them divided by four again later.

cgracey · 2015-07-27 06:38

I would much rather waste up to three bytes for every byte string than have to deal with misaligned instructions. Is this the only reason why the assembler now does everything in byte addresses, or are there other reasons too? Can you please add a mode to the assembler (probably enabled through a flag in the source file) that makes all PASM addresses long addresses like on the P1?

ASM needs to be byte-granular for Data arrays and Data management, so you cannot work only in longs. However, code can be given a rule that it needs to be Long aligned, and ASM usually has a variant of ORG to do that, plus the opcode address is checked for long alignment during build. That is what Chip is now doing.

If cogram can now be byte-addressed, then where are the other two bits stored?

Cog RAM is only byte-addressable in the assembler, and this is only the case since cog programs reside in hub RAM before they are loaded into cog RAM. Once a program is loaded into cog RAM, it is only long-addressable. So, the assembler makes sure that the addresses of cog registers have a common byte offset, in anticipation of them being loaded into a cog.

If it's not really byte-addressable once it's loaded into cogram, then why does the assembler pretend it is? Does it benefit the programmer in any way? Unless there's some obvious advantage I'm missing, can there be a separate mode for cogexec (and hubexec if possible, even if that forces the compiler to emit only long-aligned code) that uses long addressing? It seems pointless to me to multiply numbers by four just to have them divided by four again later.

I understand what you are saying. The thing is, the assembler makes an image that goes into hub RAM. Some of that image goes into cog RAM at runtime. Better semantics are needed in the assembler to handle these matters.

Cluso99 · 2015-07-27 07:25

Oh! It seems really wrong to me that a 32 bit instruction can be byte aligned.
Please reconsider what you are saying here... In hubexec mode, instructions can execute from any byte boundary. That means fetching becomes more complicated as instructions may cross the long boundary requiring 2 clocks to fetch it. We don't have short instructions like other processors so there seems no point.
It also means that the 2 lower address bits which would normally be "00" now matter, and that means relative jumps are 2 bits shorter than they need be.

IMHO this is a serious mistake (unless I am misunderstanding what you are saying).

You mean instructions are long aligned (always).
We did use a trick with this for fast hub downloads on the P1, but it is no longer required.

In hub memory, instructions can exist at any byte offset. This is advantageous when, say, a byte string is inserted between instructions. In blocks of pure functional code, though, there should be no long-offset interruptions. So, 9-bit relative branches ought to be reckoned in longs, not bytes. I will change it.

Cluso99 · 2015-07-27 07:37

Chip, in case I have the above wrong, and you are just saying that cog code can reside on non-long boundaries, then...
IMHO this is wrong too. On the P1 this was not possible, and COGINIT presumed the lower 2 hub address bits were "00". The P2 should also conform to this.
If hub code addresses are forced to be on a long boundary, then code in hub may be executable in both hubexec and cogexec modes.

jmg · 2015-07-27 07:38

Oh! It seems really wrong to me that a 32 bit instruction can be byte aligned.
Please reconsider what you are saying here... In hubexec mode, instructions can execute from any byte boundary. That means fetching becomes more complicated as instructions may cross the long boundary requiring 2 clocks to fetch it. We don't have short instructions like other processors so there seems no point.
It also means that the 2 lower address bits which would normally be "00" now matter, and that means relative jumps are 2 bits shorter than they need be.

IMHO this is a serious mistake (unless I am misunderstanding what you are saying).

Now I'm confused.
My understanding is Chip is saying he will align opcodes on 32b, giving +/- 256 opcodes jump range.

However, the Assembler still needs to be Byte-focused and have Byte granularity, because there are Opcodes to fetch Bytes of Data from HUB memory.

It is easy enough for the assembler to ensure that executed code is aligned to 32b, while tables and data can be byte addressed.

evanh · 2015-07-27 07:58

I think we just need to trust Chip on this one and wait for the first round of documentation/FPGA image at least.

cgracey · 2015-07-27 08:02

For cog exec code initially residing in hub RAM, all that matters is that those instructions share a common hub byte offset, so that when they are loaded into the cog RAM, everything is long-aligned. It also does not matter for hub-exec purposes whether instructions fall on absolute long boundaries, or not. It is true that it takes one more clock to begin a hub exec instruction stream if it is not absolutely long-aligned. However, this can be avoided by long-aligning your hub exec code in the assembler. This is a small price to pay for allowing data structures of mixed word lengths in hub memory. There is no reason that I see to enforce long-alignment rules in hub memory. All that would do is introduce unnecessary strictures.

evanh · 2015-07-27 08:21

jmg said:
I've never thought much of adding hardware for debug assistance. Hand crafting debug code works a treat really. ISRs aren't much different to mainline then. Chip has just done one typical hand coded debug type exercise right above.

?? The above is the basic framework for single-step debug, and that does need hardware assistance. It is that hardware assist that Chip is testing.

Oops, didn't even read it did I. I was thinking about Chip's earlier example a few pages back. My statement still stands, even if I used an inappropriate example.
