The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Part 2

cgracey · 2015-09-22 21:28

Cluso99 wrote: »

Chip

cgracey wrote: »

David Betz wrote: »

cgracey wrote: »

potatohead wrote: »

I still don't see the big deal on allowing non aligned code in what could be the boot or system area... seems a nice fix.

All user programs just start at $1000 and that area is for data, or very specialized code...

I actually feel the same way. It seems like the best solution.

Top-justifying cog execution in the hub memory could cause upset on future devices, even expanding LUT memory. Keeping things bottom-justified leaves things more open-ended.

"There is a fifth dimension, beyond that which is known to man. It is a dimension as vast as space and as timeless as infinity. It is the middle ground between light and shadow, between science and superstition, and it lies between the pit of man's fears and the summit of his knowledge. This is the dimension of imagination. It is an area which we call the COG Shadow Zone."

Pit of man's fear. It leans in that direction. It is a raspberry-seed-in-your-wisdom-tooth kind of situation.

Okay...

This is ugly as all get-out, but here's what would work very nicely (consider that this allows for a 1K x 32 LUT):

%000000000xxxxxxxxx01 = cog execution addresses 0..511
%000000000xxxxxxxxx10 = LUT execution addresses 0..511
%000000000xxxxxxxxx11 = LUT execution addresses 512..1023
all others = hub execution addresses
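As an illustration only, the decode rule listed above can be modelled in a few lines of Python. This is a sketch of the proposal as described in this post, not of any shipped silicon; the function name `decode_exec_address` and the returned tuples are made up for the example.

```python
def decode_exec_address(addr: int) -> tuple[str, int]:
    """Classify a 20-bit execution address under the proposed LSB scheme.

    Hypothetical helper: returns (space, index), where index is a long
    index for cog/LUT and a byte address for hub.
    """
    addr &= 0xFFFFF                  # the PC is 20 bits wide
    lsbs = addr & 0b11               # the two low bits select cog/LUT
    index = (addr >> 2) & 0x1FF      # the nine 'x' bits
    # Cog/LUT execution needs the top nine bits clear and non-zero LSBs.
    if (addr >> 11) == 0 and lsbs != 0:
        if lsbs == 0b01:
            return ("cog", index)            # cog addresses 0..511
        if lsbs == 0b10:
            return ("lut", index)            # LUT addresses 0..511
        return ("lut", 512 + index)          # LUT addresses 512..1023
    return ("hub", addr)                     # everything else is hub-exec


if __name__ == "__main__":
    for a in (0x00000, 0x00001, 0x00006, 0x007FF, 0x00801):
        print(f"${a:05X} ->", decode_exec_address(a))
```

Under this reading, only addresses below $00800 with non-zero low bits are special, which matches the statement that special-consideration memory spans $00000 to $007FF.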

This way, hub-exec would work from $00000 - perfect for ROM booting

Special-consideration memory only goes from $00000 to $007FF.

Nobody would notice these funny %01, %10 and %11 LSB's in cog/LUT addresses because they would be contained in symbols, with their LSB's established by the particular ORGCOG/ORGLUT/ORGLUT2 directive used before their declaration.

I think someone suggested something like this before.

Increasing the LUTs to 1K x 32 would only take about 1.2 mm2 of die area. If we couldn't fit it into this device, it could certainly go into a future smaller-geometry chip. We could implement it on the FPGA, in any case.

This would give 1,528 internal instructions per cog.

P.S. It was Seairth who had proposed something like this on the prior page.

Chip,
Makes perfect sense to me and I'd love to have 1Kx32 LUT if there's die space

But, why do we have to have hub-exec able to run from non-long aligned code???
Seems to me that we have the cart before the horse and that is complicating the PC counter.

Why couldn't we just address all instructions on long boundaries and save the 2 bits (and its complications for the masses to understand)?
The PC would contain an extra 2 (hidden) bits (that could be extended in future P2's) to designate COG/LUT/HUB.
The jump/call/return instructions would still contain these 2 bits, but the compiler would insert these depending on whether the address was in COG/LUT/HUB.

But simplifying even further, there should be no reason to differentiate the COG/LUT so we can have seamless instruction addresses from COG $000-$3FF (or $5FF), ignoring the special register gap. The compiler will just insert these 2 address bits.

So, in reality, the PC would be the same as you have now, just that it would increment by 4, and the last 2 bits would be defined as you have suggested here but would be hidden from the user (except in the case of actual hand assembly).

So, I am just saying, hide these 2 bits from the user. Hope I have made this clear enough

BTW We can live with the extra 2 bits being the address for simplifying your pnut compiler.

The trouble with getting rid of the two LSB's of the PC and insisting that hub-exec be long-aligned is that we lose the address matching between... wait a minute. I understand what you are saying now. While we need the full address range, we don't need to track the bottom two LSB's of the PC if we are always long-aligned. I get it. We still need 20-bit addressing for bytes/words/longs.

Currently, there is only one long-aligned rule for hub memory: When RDFAST/WRFAST wrap around to repeat a block, you must use long-aligned addresses or you will get some errant data in the partial longs during the wrap. There's no way around this. That's the only time when you really need to think about long alignment: on block-wrapping fast reads and writes.

Hub-exec doesn't care about the alignment. This saves memory when placing odd-size strings or data in-between PASM code. By forcing long alignment for all instructions, we could save on some adder chains, and maybe 20 flops per cog. It would introduce a caveat that hub-exec instructions must be long-aligned. I don't know if it's worth it. The one good thing I can see is that it would clear the air on cog-exec and lut-exec, and erase ambiguities surrounding the execution-address LSB's (that only exist in people's minds).

What do you guys think??

P.S. We can't have seamless cog-to-LUT execution because of the special I/O registers from $1F8..$1FF.

David Betz · 2015-09-22 21:51

I don't see any problem with requiring hub-exec to require long aligned instructions.

cgracey · 2015-09-22 21:52

Wait! We CAN have seamless cog-to-LUT execution IF we move the special-function registers down to the start of cog memory (ie registers $000..$007). Then, the PC could flow right into LUT space, making a seamless 1k instruction space.

Since cog-exec code doesn't necessarily need to load starting at $000, anymore, we can and will be putting it everywhere.

What do you guys think about that? It means ORG'ing your cog code at $008 (x4). ORG(COG) could automatically do that, if no operand was given.

P.S. This would mean that you could load your interrupt vectors as part of your code, without having to make discrete writes to $1F0..$1F5.

potatohead · 2015-09-22 21:52

Me neither

jmg · 2015-09-22 21:57

cgracey wrote: »

Wait! We CAN have seamless cog-to-LUT execution IF we move the special-function registers down to the start of cog memory (ie registers $000..$007). Then, the PC could flow right into LUT space, making a seamless 1k instruction space.

Since cog-exec code doesn't necessarily need to load starting at $000, anymore, we can and will be putting it everywhere.

What do you guys think about that? It means ORG'ing your cog code at $008 (x4). ORG(COG) could automatically do that, if no operand was given.

Sounds worthwhile, - anything simple the tools can do, and check.

Does the overflow happen (largely?) without caveats to most code ?
ie could users simply write, and the expanded code runs. ?

What happens when they then go over the top of LUT ?
Does that flow into HUB ?

cgracey · 2015-09-22 22:01

jmg wrote: »

cgracey wrote: »

Wait! We CAN have seamless cog-to-LUT execution IF we move the special-function registers down to the start of cog memory (ie registers $000..$007). Then, the PC could flow right into LUT space, making a seamless 1k instruction space.

Since cog-exec code doesn't necessarily need to load starting at $000, anymore, we can and will be putting it everywhere.

What do you guys think about that? It means ORG'ing your cog code at $008 (x4). ORG(COG) could automatically do that, if no operand was given.

Sounds worthwhile, - anything simple the tools can do,
Does the overflow happen without caveats to most code ?
ie could users simply write, and the expanded code runs. ?

What happens when they then go over the top of LUT ?
Does that flow into HUB ?

The only caveat to cog/LUT code would be that the portion of code that exists in LUT could not be self-modifying in the normal sense, because it is out-of-range of the D-specified registers. You would have to use RDLUT/WRLUT or SETQ2+RDLONG to access it. If you hit the top address of $3FF (x4), the assembler would error out because you just crossed a boundary that needs to be handled with distinct intent.
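A rough Python model of the contiguous cog/LUT instruction space described here may help. It assumes cog registers at $000..$1FF followed directly by a 512-long LUT at $200..$3FF, as discussed in this post; the name `locate_instruction` and the tuple layout are hypothetical.

```python
COG_LONGS = 0x200   # cog registers $000..$1FF
LUT_LONGS = 0x200   # LUT longs, assumed to follow at $200..$3FF


def locate_instruction(index: int) -> tuple[str, int, bool]:
    """Return (space, offset, d_s_reachable) for an instruction index.

    d_s_reachable is True when the 9-bit D/S fields can address the long,
    which is what ordinary self-modifying code relies on.
    """
    if not 0 <= index < COG_LONGS + LUT_LONGS:
        # Mirrors the assembler erroring out past the top of LUT.
        raise ValueError(f"${index:03X} is outside the cog/LUT space")
    if index < COG_LONGS:
        return ("cog", index, True)
    # LUT longs sit beyond 9-bit register addressing.
    return ("lut", index - COG_LONGS, False)


if __name__ == "__main__":
    for i in (0x008, 0x1FF, 0x200, 0x3FF):
        print(f"${i:03X} ->", locate_instruction(i))
```

The `False` flag for LUT addresses corresponds to the caveat above: code living in LUT would have to be modified through RDLUT/WRLUT or a SETQ2 block transfer rather than through D-field writes.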

Seairth · 2015-09-22 22:02

Too many options! I'm not saying that you shouldn't change it, but can you wait (or settle on one) until after you get an initial version of the FPGA image released? That way, we can start testing while you continue to tweak.

evanh · 2015-09-22 22:06

The Program Counter is special in that respect isn't it? There is no other relative addressing mode, in the current Prop2, that covers all address spaces like that.

I can see some wanting an empty scratch pad at the start.

tonyp12 · 2015-09-22 22:15

>move the special-function registers down to the start of cog memory (ie registers $000..$007).

Sounds OK, as on the prop1 we never got an option for the special registers to be included in the 512-long transfer.
You would only use org 8 (if it is possible to fill in at an offset?) if you don't want to clear or set the first 8 longs.

org 0
long 0 ' default is to clear the special registers but can be pre-set.
long 0
long 0
long 0
long 0
long 0
long 0
long 0
org(cog) 'optional only needed if you forget to write the above 8 longs or wanted them non-initialized
code goes here

jmg · 2015-09-22 22:52

cgracey wrote: »

jmg wrote: »

cgracey wrote: »

Wait! We CAN have seamless cog-to-LUT execution IF we move the special-function registers down to the start of cog memory (ie registers $000..$007). Then, the PC could flow right into LUT space, making a seamless 1k instruction space.

Since cog-exec code doesn't necessarily need to load starting at $000, anymore, we can and will be putting it everywhere.

What do you guys think about that? It means ORG'ing your cog code at $008 (x4). ORG(COG) could automatically do that, if no operand was given.

Sounds worthwhile, - anything simple the tools can do,
Does the overflow happen without caveats to most code ?
ie could users simply write, and the expanded code runs. ?

What happens when they then go over the top of LUT ?
Does that flow into HUB ?

The only caveat to cog/LUT code would be that the portion of code that exists in LUT could not be self-modifying in the normal sense, because it is out-of-range of the D-specified registers. You would have to use RDLUT/WRLUT or SETQ2+RDLONG to access it. If you hit the top address of $3FF (x4), the assembler would error out because you just crossed a boundary that needs to be handled with distinct intent.

In both cases the ASM can give clear error messages, so it sounds well worth doing.
It is also not a bad thing to have the INIT constants first in code, as it forces users to do the housekeeping, and means the smallest loadable pgm is one compact piece.

I presume those 0..7 declared as constants will load-into the registers ?
What is then the smallest executable visible program ? Init... then INC Port & Loop ?

jmg · 2015-09-22 22:55

tonyp12 wrote: »

org 0
long 0 ' default is to clear the special registers but can be pre-set.
long 0
long 0
long 0
long 0
long 0
long 0
long 0
org(cog) 'optional only needed if you forget to write the above 8 longs or wanted them non-initialized
code goes here

I would prefer a segment approach so users can comment out any line, and not break things.
things like SEGREG and SEGCOG would encapsulate the ORG and also tell the assembler ASM code was legal, or not.

You want to avoid this not giving any errors
ORG 00
Code...

tonyp12 · 2015-09-22 23:06

>I would prefer a segment approach

That's what I do in IAR on the MSP430 asm, hardcoding fixed addresses with org is now frowned on.

RSEG DATA16_N /* start of ram */
RSEG DATA16_C /* Define FLASH segment*/
RSEG CODE

and I use ALIGNRAM 1 (non-initialized padding) and EVEN (padding set to zero) for 16bit boundary
But this is all compiler stuff that can wait.

Cluso99 · 2015-09-22 23:11

Yes Chip... you got it

I was about to suggest having LUT first $000-1FF/3FF followed by COG Registers.

This way the address space is contiguous.
The COG Registers are at the top of COG/LUT $3F8..3FF(or $5F8..5FF)
The register space still works for all instructions because the D & S results are only 9 bits, so effectively registers are still addressed as $000..$1FF (the compiler takes care of this)
We can still use COG Registers as lookups based on $000+offset
Self-modifying code only works in the register space (ie code above $1FF/3FF) - easy to do an ORG $200/400 (The compiler can catch errors here)
We can still use FIT to check the boundaries (registers & special registers)

For a later revision of P2, perhaps the instructions could be modified (by global cog switch?) such that either S or D when read from cog (clock2) could read one of S or D from LUT by fetching/using 11 bits from the S or D address rather than 9 bits. LUT space could even be expanded further in a later revision.

Roy Eltham · 2015-09-22 23:14

Chip,
Assuming you moved special regs and made it so you could seamlessly go from cog to LUT with execution... what's stopping you from seamlessly going to hubexec from LUT? Isn't it essentially the same thing? You'd, of course, have the stall for the fifo to fill up to begin execution from hub, but that's the same as if you branched to hub, right?

I guess you could just put a branch in the last instruction slot of the LUT to go into hubexec.

Also, I still really strongly prefer not having the weird handling of the first 4k of hub. Just start hubexec at $1000. The rom can still load starting at $0000, just that the entry point of the rom image is at $1000 in...

potatohead · 2015-09-23 01:47

Yeah me too. Amazing how a simple thing can cause it all to move and shift like it has.

I do like the idea of registers in low COG address space and the seamless execute from COG to LUT code. Good outcome that will see a lot of use.

cgracey · 2015-09-23 04:37

I got the special registers moved to the bottom of cog RAM. It makes code with interrupts a lot better. Since the interrupt vectors are located right after the special registers, they can conveniently precede code now.

There is one huge headache with all this, which was latent, all along: It's that this addressing scheme needs you to shift everything up by two bits to go from cog register address to actual assembler address. And then you must divide by 4 to get offset counts.

Here is what some code looks like. This is the main loader:

DAT
		orgh

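entry		setq	#(x-begin)/4		'number of longs to load
		rdlong	8<<2,ptrb[code-entry]	'block-load code from hub, starting at register 8
		jmp	#8<<2			'jump to register 8 (address 8<<2)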
code
		org	8<<2

begin		clkset	#$FF			'switch to 80MHz (if pll, else 50MHz)
		wrfast	#0,#0			'ready to write entire memory
		setedg	#%0_10_111111		'select negative edge on p64

:loop		getedg				'clear edge detector
		waitedg				'wait for start bit

		rep	#2,#7			'ready for 8 bits
		waitx	waita			'wait for middle of 1st data bit
		testb	inb,#31		wc	'sample rx
		rcr	x,#1			'rotate bit into byte
		waitx	waitb			'wait for middle of nth data bit

		shr	x,#32-8			'justify received byte
		wfbyte	x			'write to hub
		djnz	bytes,@:loop		'loop until all bytes received

		wrfast	#0,#0			'wait for last byte to be written

		setq	#0			'launch new program
		coginit	#0,#$00001

bytes		long	$8_0000
waita		long	25+12-6
waitb		long	25-6
x		res	1

See all that divide-by-4 and multiply-by-4 stuff in the first several lines? It's tricky. My problem at a few different points in getting the new register locations to work was getting all this address stuff straight. It's too treacherous. This needs to be simplified, somehow, so that all that div/shl math goes away. It's very fatiguing to deal with, as it must be perfect before anything works right. It's just too complicated.
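To make the shift and divide arithmetic concrete, here is a small Python sketch of the conversions the loader above performs by hand. The helper names are invented for the illustration and are not part of any assembler.

```python
def reg_to_byte_addr(reg: int) -> int:
    """Cog register index -> assembler (byte) address, i.e. reg << 2."""
    return reg << 2


def byte_addr_to_reg(addr: int) -> int:
    """Assembler (byte) address -> cog register index; must be long-aligned."""
    if addr & 0b11:
        raise ValueError(f"${addr:05X} is not long-aligned")
    return addr >> 2


def long_count(start_addr: int, end_addr: int) -> int:
    """Number of longs between two byte addresses, like (x-begin)/4."""
    return (end_addr - start_addr) // 4


if __name__ == "__main__":
    begin = reg_to_byte_addr(8)        # org 8<<2
    x = begin + 4 * 20                 # pretend 20 longs of code and data follow
    print(hex(begin), byte_addr_to_reg(begin), long_count(begin, x))
```

Every place where the listing writes `8<<2` or `/4` is one of these three conversions, which is why a single slip leaves the loader pointing at the wrong long.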

In looking at that code, I realized that some of the math could go away by using labels. This is much more tolerable, but still not a cake walk:

DAT
		orgh

entry		setq	#(x-begin)/4		'number of longs to load
		rdlong	begin,ptrb[code-entry]	'block-load code from hub, starting at 'begin'
		jmp	#begin			'jump to the loaded code
code
		org	8<<2

begin		clkset	#$FF			'switch to 80MHz (if pll, else 50MHz)
		wrfast	#0,#0			'ready to write entire memory
		setedg	#%0_10_111111		'select negative edge on p64

:loop		getedg				'clear edge detector
		waitedg				'wait for start bit

		rep	#2,#7			'ready for 8 bits
		waitx	waita			'wait for middle of 1st data bit
		testb	inb,#31		wc	'sample rx
		rcr	x,#1			'rotate bit into byte
		waitx	waitb			'wait for middle of nth data bit

		shr	x,#32-8			'justify received byte
		wfbyte	x			'write to hub
		djnz	bytes,@:loop		'loop until all bytes received

		wrfast	#0,#0			'wait for last byte to be written

		setq	#0			'launch new program
		coginit	#0,#$00001

bytes		long	$8_0000
waita		long	25+12-6
waitb		long	25-6
x		res	1

cgracey · 2015-09-23 04:50

Here is the new cog register map:

// addressable cog registers
//
//	addr		read		write		name
//	-------------------------------------------------------------
//
//	000		INA		-		INA / IJMP0
//	001		INB		-		INB / IRET0
//	002		RAM		RAM+OUTA	OUTA
//	003		RAM		RAM+OUTB	OUTB
//	004		RAM		RAM+DIRA	DIRA
//	005		RAM		RAM+DIRB	DIRB
//	006		PTRA		PTRA		PTRA
//	007		PTRB		PTRB		PTRB
//
//	008		RAM		RAM		user / ADRA
//	009		RAM		RAM		user / ADRB
//	00A		RAM		RAM		user / IJMP1
//	00B		RAM		RAM		user / IRET1
//	00C		RAM		RAM		user / IJMP2
//	00D		RAM		RAM		user / IRET2
//	00E		RAM		RAM		user / IJMP3
//	00F		RAM		RAM		user / IRET3
//
//	010-1FF		RAM		RAM		user
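For reference, the same map can be written as a lookup table. The Python below is only a convenience sketch: the list holds the names from the table above in register order, and the byte address of each entry is taken to be its register index shifted left by two, per the register<<2 addressing discussed in this thread. `REGISTER_MAP` and `byte_address` are names made up for the example.

```python
# Names for registers $000..$00F, in order, taken from the map above.
# $000 and $001 also double as IJMP0 and IRET0 in that map.
REGISTER_MAP = [
    "INA", "INB", "OUTA", "OUTB", "DIRA", "DIRB", "PTRA", "PTRB",
    "ADRA", "ADRB", "IJMP1", "IRET1", "IJMP2", "IRET2", "IJMP3", "IRET3",
]


def byte_address(name: str) -> int:
    """Byte address of a named register: register index << 2."""
    return REGISTER_MAP.index(name) << 2


if __name__ == "__main__":
    for reg, name in enumerate(REGISTER_MAP):
        print(f"${reg:03X}  ${reg << 2:05X}  {name}")
```

With this table, plain user registers start at $010, which is byte address $00040.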

cgracey · 2015-09-23 04:56

I think we just have to live with this register<<2 addressing. There seems to be no way out, given the greater hub context that must be regarded for other code and data.

jmg · 2015-09-23 05:09

cgracey wrote: »

I think we just have to live with this register<<2 addressing. There seems to be no way out, given the greater hub context that must be regarded for other code and data.

There will always be some of that, but labels make sense, and a SEGCOG or similar could swallow the ORG 8<<2 - the tools should be able to help the users here, and catch any concentration lapses with error messages.

Roy Eltham · 2015-09-23 07:29

Chip,
I think you feel the need to do the x<<2 thing for addresses in cog space because you are used to the P1 method where cog memory was addressed in longs (which was the "odd" way of doing it). Just use the byte addresses for everything and eventually you'll get used to using them.

Special regs are at:
$00000 INA
$00004 INB
$00008 OUTA
$0000C OUTB
$00010 DIRA
$00014 DIRB
$00018 PTRA
$0001C PTRB
etc.

Code space is from $00010 to $007FF (cog) and $00800 to $01000 (lut).

I think it's simpler to just think of things like this, it's how every other chip/system I have ever used works (except the P1 cog space).

Also, can't we just make the immediate operator (#) just do the div by 4 for you, since you always need to do it? (as in the assembler code can just use the proper 9 bits when making the opcode).

cgracey · 2015-09-23 07:56

Roy Eltham wrote: »

Chip,
I think you feel the need to do the x<<2 thing for addresses in cog space because you are used to the P1 method where cog memory was addressed in longs (which was the "odd" way of doing it). Just use the byte addresses for everything and eventually you'll get used to using them.

Special regs are at:
$00000 INA
$00004 INB
$00008 OUTA
$0000C OUTB
$00010 DIRA
$00014 DIRB
$00018 PTRA
$0001C PTRB
etc.

Code space is from $00010 to $007FF (cog) and $00800 to $01000 (lut).

I think it's simpler to just think of things like this, it's how every other chip/system I have ever used works (except the P1 cog space).

Also, can't we just make the immediate operator (#) just do the div by 4 for you, since you always need to do it? (as in the assembler code can just use the proper 9 bits when making the opcode).

I see what you are saying about just getting used to it.

I don't think it would be a good idea to have div-by-4 happen automatically for #, because it creates a discontinuity between behavior from immediate values and register contents.

cgracey · 2015-09-23 08:03

I got this cog/lut continuity working after moving the special registers down to the bottom of cog memory.

Here is a program that is one long chunk of code that spans from cog to lut:

dat
	orgh	$00001			'start in hub-exec at $00001 (non-aligned address below $1000)

	loc	adra,@code		'load cog starting at 'begin' with 1st half code
	setq	#$1F0-1
	rdlong	begin,adra

	loc	adra,@code + $1F0<<2	'load lut starting at $000 with 2nd half code
	setq2	#$200-1
	rdlong	$000,adra

	jmp	#begin			'cog/lut now hold one contiguous program, jump to it

code					'hub address of cog/lut program

	org	$010,$3FF		'set cog/lut org to register $010, set limit to end of lut

begin	mov	dira,#$1F		'start of cog/lut program, enable outputs

loop	notb	outa,#4			'toggle pins in a loop

	long	$F4240400 [250]		'notb outa,#0 (250 instances)
	long	$F4240401 [250]		'notb outa,#1 (250 instances)
	long	$F4240402 [250]		'notb outa,#2 (250 instances)
	long	$F4240403 [250]		'notb outa,#3 (250 instances)

	jmp	@loop
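The split between cog and LUT in this listing can be checked with a little arithmetic. The Python sketch below assumes the layout used in the listing (org at register $010, cog ending at $1FF, limit $3FF); `split_program` is a hypothetical helper, not an assembler feature.

```python
COG_TOP = 0x1FF   # last cog register


def split_program(org: int, n_longs: int, limit: int = 0x3FF) -> tuple[int, int]:
    """Return (cog_longs, lut_longs) for a program of n_longs placed at org."""
    last = org + n_longs - 1
    if last > limit:
        raise ValueError(f"program ends at ${last:03X}, past limit ${limit:03X}")
    cog_longs = max(0, min(last, COG_TOP) - org + 1)
    return cog_longs, n_longs - cog_longs


if __name__ == "__main__":
    # mov + notb + 4 blocks of 250 longs + jmp
    n = 1 + 1 + 4 * 250 + 1
    cog, lut = split_program(0x010, n)
    print(n, hex(cog), lut)   # 1003 0x1f0 507
```

The $1F0 cog longs agree with the `setq #$1F0-1` in the loader. The loader then fills the whole $200-long LUT, although by this count only 507 of those longs hold program code.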

Here's the program running:

Roy Eltham · 2015-09-23 08:58

cgracey wrote: »

I see what you are saying about just getting used to it.

I don't think it would be a good idea to have div-by-4 happen automatically for #, because it creates a discontinuity between behavior from immediate values and register contents.

Yeah, maybe instead of # doing it, we could have another symbol that means "immediate with div by 4". Maybe ##?

I think labels should be the byte address, not the long address, otherwise we'll have oddity between labels in cog code vs hub code, and labels should be able to mark data that can be unaligned, so doing #label should require a div 4 in cog space. Right?

Roy Eltham · 2015-09-23 09:03

Also, when you have the orgh $0001, that means the code starts at 1 byte into hub ram right? So your IP starts at 1 instead of 0. I still really hate that.
I'd much rather just have hub exec space start at $01000.

The only difference would be having orgh $01000 in front, and making the IP start at $01000 instead of $00001. You can have the booter either read in the ROM starting at $01000, or have it start at 0, and just have the entry point in the ROM be at offset $01000. Seriously, it's better than having non-aligned memory addresses below $1000 = hub exec, and aligned ones = cog/lut exec. This is just ugh... seriously, people are going to see that and scratch their heads going "WTF? What kind of kluge mess is that?"

cgracey · 2015-09-23 09:35

Roy Eltham wrote: »

Also, when you have the orgh $0001, that means the code starts at 1 byte into hub ram right? So your IP starts at 1 instead of 0. I still really hate that.
I'd much rather just have hub exec space start at $01000.

The only difference would be having orgh $01000 in front, and making the IP start at $01000 instead of $00001. You can have the booter either read in the ROM starting at $01000, or have it start at 0, and just have the entry point in the ROM be at offset $01000. Seriously, it's better than having non-aligned memory addresses below $1000 = hub exec, and aligned ones = cog/lut exec. This is just ugh... seriously, people are going to see that and scratch their heads going "WTF? What kind of kluge mess is that?"

I'll change the hub-exec rule to $01000+.

I made the ORG (for cog/lut) use register (long index) addresses, not byte addresses. That should probably be changed to byte addresses, right?

jmg · 2015-09-23 09:45

cgracey wrote: »

I made the ORG (for cog/lut) use register (long index) addresses, not byte addresses. That should probably be changed to byte addresses, right?

The tools can check either way, but a useful way to decide could be to create some LST and MAP files and see how they scan to a user's eyeballs.

Any ORG can include an implicit align and just labels can be checked for legal alignment.

If the Data arrays can be byte granular, then it seems natural to report all address values in LST and MAP files as bytes, as users will want to check record alignments etc.

Seairth · 2015-09-23 11:44

Wouldn't all/most of that addressing bit shift stuff go away if you made hub instructions long-aligned and starting at byte $1000? In this case, all instruction addressing is in terms of longs, not bytes:

Cog: $000-$1FF
LUT: $200-$3FF
Hub: $0400-$1FFFF

This would have some other advantages as well:

* If relative addresses were byte-oriented, this gives relative addresses a greater range. If relative addresses were long-oriented, then it is now more consistent.

* The 20-bit address would now cover 4x as much instruction space. Of course, that won't do much for the P2, but a few people have expressed extending memory on an FPGA. And then there is the P3.

Also, I will make a plug for one variation on the above scheme:

Instruction Addressing that is local to a cog is in the form %0xxx_xxxxxxxx_xxxxxxxx. Instruction Addressing that is global to all cogs is in the form %1xxx_xxxxxxxx_xxxxxxxx. This would make the current P2 implementation look like:

Cog: $000-$1FF
LUT: $200-$3FF
Hub: $80000-$9FFFF

This makes the entire hub memory executable. Because addressing is long-aligned, the hub can still be extended to $FFFFF (an additional 384K instructions), so you certainly aren't limiting your options. Further, this provides additional cog-local addressing space, should that ever be desired (e.g. LUT2).

And this does not affect data addressing, since each memory type has its own instruction set:

Cog: $000-$1FF (long-addressing)
LUT: $000-$1FF (long-addressing)
Hub: $00000-$7FFFF (byte-addressing)
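The local/global variation can be sketched in Python as follows. This models only what the post describes: long-aligned instruction addresses, with the top bit of the 20-bit address selecting hub. The function name and the `hub_longs` parameter (set here to 512KB / 4) are assumptions for the example.

```python
def instr_to_location(ia: int, hub_longs: int = 0x20000) -> tuple[str, int]:
    """Map a 20-bit long-aligned instruction address to (space, address).

    Cog and LUT results are long offsets; hub results are byte addresses.
    """
    ia &= 0xFFFFF
    if ia & 0x80000:                       # %1xxx... = global (hub)
        long_index = ia & 0x7FFFF
        if long_index >= hub_longs:
            raise ValueError(f"${ia:05X} is beyond the installed hub RAM")
        return ("hub", long_index << 2)    # instruction index -> byte address
    if ia <= 0x1FF:
        return ("cog", ia)
    if ia <= 0x3FF:
        return ("lut", ia - 0x200)
    raise ValueError(f"${ia:05X} is an unassigned cog-local address")


if __name__ == "__main__":
    for a in (0x000, 0x200, 0x80000, 0x9FFFF):
        print(f"${a:05X} ->", instr_to_location(a))
```

In this model, instruction address $80000 lands on hub byte $00000 and $9FFFF on hub byte $7FFFC, so the whole 512KB is executable while data instructions keep their own byte or long addressing.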

jac_goudsmit · 2015-09-23 16:12

Roy Eltham wrote: »

cgracey wrote: »

I see what you are saying about just getting used to it.

I don't think it would be a good idea to have div-by-4 happen automatically for #, because it creates a discontinuity between behavior from immediate values and register contents.

Yeah, maybe instead of # doing it, we could have another symbol that means "immediate with div by 4". Maybe ##?

How about & (like the "pointer to" operator in C/C++)? I proposed this as a fix for a problem in PropGCC which eventually got solved in some other way.

===Jac

Roy Eltham · 2015-09-23 19:20

cgracey wrote: »

I'll change the hub-exec rule to $01000+.

I made the ORG (for cog/lut) use register (long index) addresses, not byte addresses. That should probably be changed to byte addresses, right?

Thanks Chip!

I think everything should use byte addressing as far as what you are expected to type into it, and have the assembler convert things that can be safely/cleanly converted automatically (such as the org stuff). I, also, like the idea of using ## or & (as Jac suggests) for cases when we have immediate values that should be shifted. To me this is a lot cleaner and easier to wrangle than having to manage all the /4 and <<2 stuff in your code.
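As a purely hypothetical illustration of the idea, an assembler could treat the two immediate forms along the lines below. Nothing here reflects an actual tool: `encode_immediate` and its `shifted` flag are invented to show plain `#` versus a divide-by-4 form such as `##` or `&`.

```python
def encode_immediate(value: int, shifted: bool = False) -> int:
    """Return the 9-bit field for an immediate operand.

    shifted=False models a plain '#': the value is used as written.
    shifted=True models the suggested '##' or '&' form: the value is a
    byte address and is divided by 4 to give a register index.
    """
    if shifted:
        if value & 0b11:
            raise ValueError(f"${value:X} is not long-aligned")
        value >>= 2
    if not 0 <= value <= 0x1FF:
        raise ValueError(f"${value:X} does not fit in 9 bits")
    return value


if __name__ == "__main__":
    print(encode_immediate(8))                   # plain immediate
    print(encode_immediate(0x20, shifted=True))  # byte address of register 8
```

Keeping the shift in a separate operator leaves plain `#` consistent with register contents, which was the objection raised earlier to doing the division automatically.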

Rayman · 2015-09-23 19:40

Does having both with byte addressing help when porting hubexec code to cog code?
Seems like it would help...
