Propeller II

Heater. · 2012-08-10 06:31

The aforementioned David May is/was a professor of computer science in Bristol. He has famously said that when it comes to programming parallel systems like the Transputer and XMOS, he finds that students of digital hardware pick it up quicker than CS grads and end up better at it.
I guess guys designing logic in Verilog or whatever find "parallel" thinking natural.
I always thought he could be talking about you:)

evanh · 2012-08-10 08:44

Too true. Many a person has vowed to commit suicide (figuratively speaking) if they ever had to look at ladder logic again. Some people just don't like to contemplate parallel flows. I think they feel the job is being made too hard that way and it must be easier to solve with procedural flow instead.

In defence of current mainstream processor architecture, it does have huge adaptability simply through its generalisation. There is substantial parallelism in the various buffers, stages, execution units, and of course caches. And mass production, along with its ASICs, has made it hard to compete against. Not all of it is inefficient and it's pretty good at many things with a single architecture (PC dominant).

Hardware reconfigurability has a cost of its own. It requires real-estate to fit the configuration and associated controls, and the regular cell is not always a nice fit for the job. Along with the regular cell is the regular routing, of which much may not be used. It requires both extra real-estate and extra routing space for the unused or convoluted use of available resources. This makes the chip bigger and hotter than a custom design.

It all comes down to RAM access to large memory spaces, the larger the better: how to manage TBytes of space efficiently (at low power) and still get incredibly high effective transfer rates, without putting difficult programming limits on the software developers.

Access time can be sacrificed. This is what is done now: on average, the core hardware mostly hides the latencies. The alternative argument is to bring the RAM in closer to the processing elements and divide it up into smaller blocks. It could be argued the Cell chip is an attempt in this direction.

It's not impossible to make it work, methinks. Managing the virtualised size of the overall dataset while forming it up to flow through the physical RAM blocks efficiently looks to be the next challenge for general computing. Like registerisation, it prolly needs the compiler to parcel the data/code off in blocks so the kernel can fit it physically.

More C++ extensions to come ... ;P

evanh · 2012-08-10 10:06

The mythical memristor will also be a central part of any new parallel architecture that might have a chance. Without it, SRAM is far too bulky and hot. And everything else is too slow. MRAM is the closest commercial contender so far.

Bill Henning · 2012-08-10 11:13

Nice...

I do have an interesting idea that would allow for single instruction serialization and possibly other useful hacks:

MAPCZ cpin, zpin wc wz ' wc / wz controls the signal routing

For Carry:

If output:

current state of C flag is reflected on cpin (pin 0-n)

If input:

carry flag reflects state of cpin (pin 0-n)

This allows any current rotate instruction to be used for serial input/output to a pin directly, at one bit per instruction

It would work the same for the zero flag.

' output example, make sure to set P31 to output first

MAPCZ  31,0 wc   ' set up P31 as serial out

' ugly incorrect syntax for REPS
REPS 1,#32 ' repeat one instruction for 32 bits, could xmit any number of bits really
nop ' pipeline food
ROR data,#1 wc

And now for the reverse

'input example, make sure to set P30 for input first

MAPCZ 30,0 wc
nop ' pipeline food
RCL data,#1
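
To make the two snippets above concrete, here is a small Python model (not PASM) of what they are meant to do. It assumes the MAPCZ behaviour proposed in this post, which is only an idea and not an existing instruction: with the pin mapped as an output, each `ROR data,#1 wc` puts the bit rotated out into C and therefore onto the pin; with the pin mapped as an input, each `RCL data,#1` shifts the pin state in through C. The function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def serial_out(data, nbits=32):
    """Model 'ROR data,#1 wc' repeated nbits times with C mapped to a pin.

    Each rotate copies bit 0 into C (and so onto the pin), LSB first.
    Returns the list of pin states and the final register value.
    """
    pin_states = []
    for _ in range(nbits):
        c = data & 1                                # C = bit rotated out
        data = ((data >> 1) | (c << 31)) & MASK32   # rotate right by one
        pin_states.append(c)                        # pin follows C (assumed MAPCZ behaviour)
    return pin_states, data

def serial_in(pin_states, data=0):
    """Model 'RCL data,#1' repeated once per sample with C driven by a pin.

    Each rotate shifts data left and inserts the pin state (C) at bit 0.
    """
    for c in pin_states:
        data = ((data << 1) | (c & 1)) & MASK32
    return data

bits, after = serial_out(0xA5000001)
print(bits[:8], hex(after))
```

Note that sending with ROR is LSB first while receiving with RCL fills from bit 0 upwards, so a word sent one way and received the other arrives bit-reversed; a real link would pick matching rotate directions.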

And another cute possibility

' fast fake interrupts - set P32 and P33 to inputs

MAPCZ 32,33 wc,wz  ' P32 and P33 are pseudo-interrupts

loop REPS 2,#511 ' repeat the two conditional jumps below
nop ' pipeline food
if_c jmp #interrupt1
if_z jmp #interrupt2
jmp #loop

This gives a 1-2 clock cycle response in most cases (2-3 cycles in 1 of 512 cases) to 2 interrupts, per cog dedicated to interrupts.

cgracey wrote: »

Right, SETPC writes the carry to a pin. There's also: OFFP (make input), CLRP, SETP, NOTP, SETPC, SETPNC, SETPZ, SETPNZ.

For inputting, we have GETP (into C and NZ) and GETNP (into NC and Z).

Update:

Might be usable to decode differential signalling, as it could decode the four possible states of (C,Z) using four JMPs.

Bill Henning · 2012-08-10 11:20

Please DO have RDxxx/WRxxx work on I/O! I need it for some ideas; hated it did not work on P1.

Maybe use the NR bit to have it read/write shadow regs instead?

cgracey wrote: »

Yes, that would be a problem, all right. I could make it so that RDxxxx instructions couldn't affect I/O, but only shadow ram.

I could see creating a lot of caveats to make this work, which might not even be adequate for what winds up being needed. I think the way to make this debugger work is to keep it above-board, where you have the user agree to give you $1F0..$1F5 of cog RAM, and you use some ordinary hub ram that is owned by the debugger. I'll keep thinking about this, though. Where there's a will...

pedward · 2012-08-10 12:34

I think LMM could be reworked in the P2 to use the TSS swapping stuff. You could put sentinel code at the top of COG memory that will swap registers, load more COG code, then swap back. It could increase the code caching significantly. Instead of 8 or 16 instructions at a time, you could probably have 256 or more instructions in a block. The COG loading would happen as a function of yielding control of the COG to the overlay loader. Just some thoughts.

jmg · 2012-08-10 13:06

Cluso99 wrote: »

Probably. However this may not help. The way I do the debugging is that in essence I single-step every instruction using an LMM mode execution unit.

That is the best way to do what I'd call Silicon Assisted Simulation, and you can get full workspace access, but at the cost of speed.

I see another couple of Debug areas where a hardware paddle board would assist (and if it used a 'Same-prop', it could also swallow some of the Debug LMM engine).

a) Time of flight timing / Loop counting - for that, you need a tiny debug overhead to set/clear a pin, and externally time/count that.
b) Real time watch - using SNDSER, you could send a small number of watch variables with low speed impact.
This would be a compile time choice, but the smallest I/O would be a single SNDSER per loop, which costs 1 Long and ?? Cycles.

and of course, once the pathway works, others will think of many more permutations....

Cluso99 wrote:

Actually, thinking further, using SNDSER may be a nice way to unload the decoding off chip, provided of course it's not being used already. Guess it depends if we can really run SNDSER from each cog - I presume we can.

It would need to be per-cog, but I'm not even clear if this Opcode set even 'made the cut' ?

SRLM · 2012-08-10 13:14

My Armchair Propeller Design:

I would like to see the Hub area become an executable region for code, with its own dedicated processor. This processor would be faster than cogs, and would have access to the Hub memory on every cycle. This design supports the driver model that is adopted by the Propeller community: the cogs support interaction with specific peripherals, and push their data to the hub for processing. With my suggestion, the processing could be done where the data is, instead of in another cog (and the bandwidth limitations of pulling the data and processing instructions in). You would get PASM speed on hub data.

* Yes, there are limitations (RAM multi-port/access, 9 bit SRC/DEST fields in PASM, die space, etc.) I'm choosing to ignore all of that.

Anyway...

Circuitsoft · 2012-08-10 13:38

SRLM wrote: »

My Armchair Propeller Design:

I would like to see the Hub area become an executable region for code, with its own dedicated processor. This processor would be faster than cogs, and would have access to the Hub memory on every cycle. This design supports the driver model that is adopted by the Propeller community: the cogs support interaction with specific peripherals, and push their data to the hub for processing. With my suggestion, the processing could be done where the data is, instead of in another cog (and the bandwidth limitations of pulling the data and processing instructions in). You would get PASM speed on hub data.

* Yes, there are limitations (RAM multi-port/access, 9 bit SRC/DEST fields in PASM, die space, etc.) I'm choosing to ignore all of that.

Anyway...

Sort of like...

Circuitsoft wrote: »

My next choice after that would be a P4x32A, with 32 I/O pins, and another 32 I/O pins to make it look like an SRAM so it could be connected to a larger CPU as a peripheral. I imagine the external CPU taking every other slot on the hub. In a lot of ways, the Propeller makes CPLD-like functions accessible to programmers.

Cluso99 · 2012-08-10 16:16

Chip:

DEBUG code..

Having slept on it I have come to the conclusion that it is not worth altering/mangling hardware to support it. If required, then the user should be prepared to forgo cog $1F0-$1F6 for the debug/LMM execution unit to live in. This would be co-operative with another cog or P2. No need to place code in the ROM either. It would be better to place any unused hub ROM as RAM in holes (see below).

With P2, there are a few nice ways we can either improve debug speed by using more resources, or run slower using fewer resources. Remember we have internal direct cog comms using the PortD, we have cog-cog or prop-prop comms using pins and SNDSER & friends, and we have faster hub access. Exciting times ahead.

It has even provoked my thinking on what can be further achieved on P1.

Spare hub ROM space
I would rather see any spare space made into "holes" of RAM. Uses could be defined for...

1. One long reserved for hub allocation "_HUBTOP". Always points to the lowest allocated (top down) byte used (i.e. in the P1 this would be $8000).
2. Perhaps a "registry" type table for cogs along the lines Catalina uses. If so, I would like 8 cogs * 4 longs = 32 longs = 128 bytes. This would need to be on a boundary (e.g. a 128 byte boundary if 128 bytes).
3. Perhaps a few other longs for pointers.
4. Other???

Specifically I would like to see the #1 _HUBTOP implemented so there is a fixed location for it! Boot code should set it up. #2 could just be allocated from the top of hub and used.
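
A minimal Python sketch of how a fixed `_HUBTOP` long could be used, under the assumptions stated in this post: the pointer starts at the top of hub RAM ($8000 is the P1 figure quoted above, the P2 size is not assumed here) and every allocation moves it down. The class and method names are hypothetical, not part of any Parallax tool.

```python
HUB_SIZE = 0x8000   # P1 value from the post; stands in for whatever the P2 provides

class HubTop:
    """Top-down hub allocator driven by a single pointer (_HUBTOP)."""

    def __init__(self, hub_size=HUB_SIZE):
        self.hubtop = hub_size          # lowest allocated address so far

    def alloc(self, nbytes, align=4):
        """Reserve nbytes below the current top; align must be a power of two."""
        addr = (self.hubtop - nbytes) & ~(align - 1)   # round down to the boundary
        if addr < 0:
            raise MemoryError("hub RAM exhausted")
        self.hubtop = addr
        return addr

hub = HubTop()
# Item 2 of the list: 8 cogs * 4 longs * 4 bytes = 128 bytes on a 128-byte boundary
registry = hub.alloc(8 * 4 * 4, align=128)
print(hex(registry), hex(hub.hubtop))
```

Because every driver reads and updates the same long, independently written objects could claim hub space without knowing about each other, which is the point of having a fixed location for it.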

Cluso99 · 2012-08-10 18:05

P2 BOOT from SD card idea

The normal boot process will be from SPI Flash. Therefore routines will exist in the P2 ROM for SPI access. Three (3) pins will be used for MOSI/D, MISO/Q, SCLK/C. An extra pin (4) may be required for -CS/-S.

How about this for an idea...

The /CS pin used by the prop is tested for the value of the external pullup resistor:
>=47K for Flash boot (includes floating)
<=20K for SD boot
(or vice versa)

Once determined, the P2 would place an internal pullup on the -CS line and proceed to drive it as required.

A slower alternative, and more failsafe, would be to try booting from the Flash, and if it is not found, or the first sector is either all 0s or all 1s, then booting would proceed from the SD card. This permits an uninitialised (unprogrammed) SPI Flash to be on the same bus as the SD card. It also means that the P2 would not require a PropPlug or equivalent, or the Flash to be preprogrammed. Just insert a programmed SD card.
Remember, the P2 should be able to do USB slave (standard or high speed comfortably). So, if this software could be loaded from the SD card (or Flash for that matter), no PropPlug would be required.

I would think one of these methods should provide a failsafe way of booting from SPI Flash, and we can then prove that SD boots correctly before releasing it as a guaranteed feature.
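
The two selection schemes above can be sketched in Python as follows. This is only an illustration of the decision logic in this post; the function names are hypothetical, and nothing here reflects actual P2 boot ROM code. The resistor thresholds are the ones suggested above, and values between them are deliberately left undefined.

```python
def source_from_pullup(r_ohms):
    """Classify the external pull-up on the CS pin.

    r_ohms is None for a floating pin. Returns "flash", "sd", or None when the
    value falls in the gap between the two thresholds from the post.
    """
    if r_ohms is None or r_ohms >= 47_000:
        return "flash"
    if r_ohms <= 20_000:
        return "sd"
    return None

def flash_is_blank(first_sector):
    """True if Flash is absent (None) or its first sector is all 0s or all 1s."""
    return (first_sector is None
            or all(b == 0x00 for b in first_sector)
            or all(b == 0xFF for b in first_sector))

def choose_boot(first_flash_sector):
    """Fallback scheme: try Flash first, use SD if Flash is absent or blank."""
    return "sd" if flash_is_blank(first_flash_sector) else "flash"

print(source_from_pullup(10_000), choose_boot(bytes([0xFF]) * 512))
```

The fallback scheme needs no extra resistor, at the price of a Flash read attempt on every boot.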

The boot process for the SD would be...

Use a fixed location on the SD card (perhaps the reserved boot code sectors left over from windows on a FAT16 or FAT32 formatted card)
Alternately, follow the FAT directory to locate the first file base (this is quite simple as both heater & I have done this originally on ZiCog). The SD card will be formatted to use 32KB blocks minimum, so there is no need to worry about further blocking on the SD card for the initial boot process, as the code will only be sufficient to load a simple loader that would then know about the FAT16/32 formatting.
Direct access to the SD card is quite simple. Remember, we are only interested in obtaining sufficient boot code to commence operation. Therefore, there is no requirement at this time to verify that the SD card is formatted correctly.
Access to the SD card during boot is therefore limited to quite primitive read only instructions using direct sector locations.

I am reasonably sure this can be done with fairly minimal code.
It would be really fantastic to be able to boot directly from SD card.

Comments anyone???

Cluso99 · 2012-08-10 18:38

falf wrote: »

In six years, there has been no evolution. when Chinese take over this company, then we will see progress.

What rubbish... Jinglish manuals will be worse than what we have. They only produce or copy other products. Rarely do they actually design anything. Anyway, it's way OT.

evanh · 2012-08-10 21:01

SRLM wrote: »

My Armchair Propeller Design:

I would like to see the Hub area become an executable region for code, with its own dedicated processor. This processor would be faster than cogs, and would have access to the Hub memory on every cycle. This design supports the driver model that is adopted by the Propeller community: the cogs support interaction with specific peripherals, and push their data to the hub for processing. With my suggestion, the processing could be done where the data is, instead of in another cog (and the bandwidth limitations of pulling the data and processing instructions in). You would get PASM speed on hub data. ...

In theory that could be achieved by a Cog itself. LMM-like extensions to the native instructions would be needed, but a Cog could achieve full speed execution of instructions read directly from Hub. It would consume 100% of that Cog's Hub allocations (with a wider prefetch maybe) but that's often okay anyway. The good part is it wouldn't negatively impact any of the other Cogs.

Having said all that, I feel the current LMM support for full speed inline code snippets is perfectly okay.

pedward · 2012-08-10 21:09

falf wrote: »

In six years, there has been no evolution. when Chinese take over this company, then we will see progress.

Shouldn't you be exacting a toll on a bridge somewhere?

evanh · 2012-08-10 22:15

evanh wrote: »

It's not impossible to make it work, methinks. Managing the virtualised size of the overall dataset while forming it up to flow through the physical RAM blocks efficiently looks to be the next challenge for general computing. Like registerisation, it prolly needs the compiler to parcel the data/code off in blocks so the kernel can fit it physically.

Not having to shuffle the bulk data around is a key point of this model. The bulk data never goes away. Just keeps getting added to and maybe edited ... like Wikipedia. So, distributing the relevant code and new input and results to/from each block becomes the primary flows.

If the I/O is high volume, the LHC for example, then a more traditional customised buffering/filtering approach can be added to the front end.

cgracey · 2012-08-10 22:39

Bill Henning wrote: »

Please DO have RDxxx/WRxxx work on I/O! I need it for some ideas; hated it did not work on P1.

Maybe use the NR bit to have it read/write shadow regs instead?

RDxxxx/WRxxxx will work on all the I/O registers - don't worry.

I think the 8 executable registers at $1F8..$1FF are too much trouble to set up for regular I/O write blocking and special writing to make them useful as instruction locations. They only represent 1/64th of the executable memory, anyway.

cgracey · 2012-08-10 22:42

pedward wrote: »

I think LMM could be reworked in the P2 to use the TSS swapping stuff. You could put sentinel code at the top of COG memory that will swap registers, load more COG code, then swap back. It could increase the code caching significantly. Instead of 8 or 16 instructions at a time, you could probably have 256 or more instructions in a block. The COG loading would happen as a function of yielding control of the COG to the overlay loader. Just some thoughts.

That would work fine, as long as that swap code is below $1F6 (INDA).

cgracey · 2012-08-10 22:44

SRLM wrote: »

My Armchair Propeller Design:

I would like to see the Hub area become an executable region for code, with its own dedicated processor. This processor would be faster than cogs, and would have access to the Hub memory on every cycle. This design supports the driver model that is adopted by the Propeller community: the cogs support interaction with specific peripherals, and push their data to the hub for processing. With my suggestion, the processing could be done where the data is, instead of in another cog (and the bandwidth limitations of pulling the data and processing instructions in). You would get PASM speed on hub data.

* Yes, there are limitations (RAM multi-port/access, 9 bit SRC/DEST fields in PASM, die space, etc.) I'm choosing to ignore all of that.

Anyway...

The problem, always, is that nothing is for free. For that processor to get access to the hub RAM would mean that cogs would get less access.

evanh · 2012-08-10 22:55

cgracey wrote: »

The problem, always, is that nothing is for free. For that processor to get access to the hub RAM would mean that cogs would get less access.

I suppose it could work as a low priority fetch, where the Cogs always have priority over the Hub's execution unit. This way the Hub executes only when a Cog doesn't use its allocation. If all Cogs are using 100% of their respective hub accesses, unlikely imho, then the Hub processor would freeze up.
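
As a rough illustration of the low priority fetch idea, here is a Python sketch of the arbitration. It assumes the usual 8-slot rotation where slot t belongs to cog t mod 8, and that an unused slot can be handed to the hub processor with no penalty; both the names and that zero-cost hand-over are assumptions, not anything confirmed for the Prop II.

```python
NUM_COGS = 8

def arbitrate(requests):
    """Decide who gets each hub slot.

    requests[t] is True when the cog that owns slot t (cog t % NUM_COGS)
    actually wants its access. Unused slots fall through to the hub processor.
    """
    return [f"cog{t % NUM_COGS}" if wanted else "hub"
            for t, wanted in enumerate(requests)]

def hub_share(requests):
    """Fraction of slots the hub processor ends up with."""
    grants = arbitrate(requests)
    return grants.count("hub") / len(grants) if grants else 0.0

print(arbitrate([True, False, True]))
print(hub_share([True, False] * 8))
```

With every cog using every slot the share drops to zero, which is the freeze-up case mentioned above.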

cgracey · 2012-08-10 22:59

evanh wrote: »

I suppose it could work as a low priority fetch, where the Cogs always have priority over the Hub's execution unit. This way the Hub executes only when a Cog doesn't use its allocation. If all Cogs are using 100% of their respective hub accesses, unlikely imho, then the Hub processor would freeze up.

Maybe we could have a 9-cycle bus, where the cogs get their 1 cycle each, and this also gets 1.

evanh · 2012-08-10 23:10

On the subject of major feature requests I'm keen on having a dual thread in the Cogs but prioritised instead of sliced. So one thread can WAIT while the other consumes the spare cycles on less timing critical processes.

I addressed the matter here - http://forums.parallax.com/showthread.php?138458-Prop-II-question-Serial-Chip-To-Chip-Communication&p=1080491&viewfull=1#post1080491

PS: I know this requires a duplicate register set. How much real-estate does the register set take?

EDIT: The advantage is that, with cooperation between the two threads, one can achieve exacting timing and at the same time make effective use of spare Cog cycles.
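
A toy Python model of the prioritised (rather than sliced) dual thread, to show what "consumes the spare cycles" means here. It assumes one instruction issue per cycle and ignores pipeline and register-set details entirely; the function name and the trace format are made up for illustration.

```python
def run_prioritised(high, low_work):
    """Issue one instruction per cycle, high-priority thread first.

    high: sequence of "run"/"wait", the high-priority thread's state per cycle.
    low_work: cycles of work the low-priority thread has queued.
    Returns a per-cycle trace: "H", "L", or "-" for an idle cycle.
    """
    trace = []
    for state in high:
        if state == "run":
            trace.append("H")          # high priority always wins the cycle
        elif low_work > 0:
            trace.append("L")          # WAIT cycles go to the low-priority thread
            low_work -= 1
        else:
            trace.append("-")
    return trace

print("".join(run_prioritised(["run", "wait", "wait", "run", "wait"], 2)))
```

The high-priority thread's timing is never disturbed, because the other thread only ever gets cycles that would otherwise be spent waiting.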

evanh · 2012-08-10 23:13

cgracey wrote: »

Maybe we could have a 9-cycle bus, where the cogs get their 1 cycle each, and this also gets 1.

No, it's not worth sacrificing Cog speed, imho.

mindrobots · 2012-08-10 23:19

cgracey wrote: »

Maybe we could have a 9-cycle bus, where the cogs get their 1 cycle each, and this also gets 1.

If the hub processor was truly faster, you could interleave hub memory access on a 16-cycle bus. (C0-hub-C1-hub-C2-hub...)
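
For comparison, the two slot layouts being discussed (the 9-slot rotation from the quote and the 16-slot interleave) can be written out in Python. This only lists who owns each slot in one rotation; it makes no claim about clock rates, and the function names are hypothetical.

```python
def nine_slot_schedule():
    """Eight cogs get one slot each, then the hub processor gets one."""
    return [f"C{i}" for i in range(8)] + ["hub"]

def interleaved_schedule():
    """C0-hub-C1-hub-...: the hub processor gets every other slot."""
    schedule = []
    for i in range(8):
        schedule += [f"C{i}", "hub"]
    return schedule

def slots_per_rotation(schedule, who):
    """How many slots one participant owns in a single rotation."""
    return schedule.count(who)

for sched in (nine_slot_schedule(), interleaved_schedule()):
    print(len(sched), slots_per_rotation(sched, "C0"), slots_per_rotation(sched, "hub"))
```

In both layouts each cog still gets one slot per rotation; the difference is the rotation length (9 versus 16 slots) and how many of those slots go to the hub processor (1 versus 8).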

The strange part to me would be the expanded hub instruction set with the additional addressing bits to cover the entire memory space.

evanh · 2012-08-10 23:24

mindrobots wrote: »

The strange part to me would be the expanded hub instruction set with the additional addressing bits to cover the entire memory space.

Methinks it would be an entirely different instruction set with only one or two accumulators. Quite a lot of work.

pedward · 2012-08-10 23:25

These are more like P3 level changes. I'd do a 64bit wide instruction with more COG RAM and perhaps External HUB RAM. This way die space is allocated to 64KB-128KB of COG ram each and HUB RAM may be external to the chip. Of course it would be 90nm... Some form of Simultaneous Multi-Threading would also be possible.

The other possibility is that COG 0 has the Hub RAM and COG 1-8 access HUB RAM as normal, but it's just an additional port on the HUB memory. This way the ports don't grow exponentially. So COG 0 has 1-4MB of RAM, COG 1-8 have 64-128KB of COG RAM. COG 0 can use RDLONG, etc, or natively address the whole HUB memory. You would have 9 COGs and 5 ports on the HUB memory, 4 on the other COG memory. HUB windows would be orthogonal like now, 8 slices. The Master/Super COG would always run from HUB.

For the majority of non-indoctrinated developers, a flat 1-4MB memory space for their single threaded code would help them to "just use" the P3. They would apply software peripherals like a child bakes a cake, but for the indoctrinated programmers, it would be incredibly powerful and a huge performance leap. That much memory would make possible a lot of things that would open the applications of the Prop outside of just the mCU market and into the SoC market.

A 64bit Prop 3 with 4MB on-chip HUB, 1 super-COG, 8 b64 COGs, probable clock speed of 400MHz+, single cycle and additional SIMD/MIMD instructions, hardware single cycle arithmetic, perhaps floating point too. That would be 3600-4500 MIPS, but twice the data width of the P2, so instructions that can parallel process 2 longs at a time, doubling the effective instruction rate for certain operations, perhaps an overall 15% speed improvement just from long-long instructions. I would refer to this as parallel issue half-width instructions. Perhaps it could be made more generic, having 32bit Prop1/2 instructions that can be parallel issued line-by-line in the assembler, only accessing the lower 512 longs of COG RAM.
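
The MIPS range follows from simple arithmetic, sketched below in Python. It assumes one instruction per clock per core and counts 9 cores (1 super-COG plus 8 COGs); reading the upper figure as a 500 MHz clock is an inference from the numbers, not something stated in the post.

```python
def peak_mips(cores, clock_mhz, instr_per_clock=1):
    """Peak MIPS if every core retires instr_per_clock instructions each cycle."""
    return cores * clock_mhz * instr_per_clock

print(peak_mips(9, 400))   # 3600
print(peak_mips(9, 500))   # 4500
```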

evanh · 2012-08-10 23:44

I think you're being a bit optimistic with the P3 internal memory sizing there.

As for external, say DDR3, interface, it's not possible to implement the low latency random access demands of the GBytes per second rate needed to feed all 8 Cogs at once. That's why caches exist and the headaches that go with them.

EDIT: Well, not possible with DRAM at least. External SRAM or MRAM could maybe pull it off with some sacrifices. And you're talking about hundreds of pins on the Prop3 for the RAM alone.

cgracey · 2012-08-10 23:57

evanh wrote: »

On the subject of major feature requests I'm keen on having a dual thread in the Cogs but prioritised instead of sliced. So one thread can WAIT while the other consumes the spare cycles on less timing critical processes.

I addressed the matter here - http://forums.parallax.com/showthread.php?138458-Prop-II-question-Serial-Chip-To-Chip-Communication&p=1080491&viewfull=1#post1080491

PS: I know this requires a duplicate register set. How much real-estate does the register set take?

EDIT: The advantage is that, with cooperation between the two threads, one can achieve exacting timing and at the same time make effective use of spare Cog cycles.

I would really have to think about all the ramifications of having prioritized tasks. We do have task switching in the Prop II, though, that preserves Z and C into the same register which holds the program counter. Here, 8 different programs run in a round-robin fashion:

' tasksw = jmpret inda,++inda wz,wc

pub go

  coginit(0,@pgm,0)

DAT

pgm			org

			fixinda	pcs+7,pcs	'set pointers to 8 pc's

c0			notp	#0		'0
			nop	#25
			notp	#0
			tasksw
			jmp	#c0

c1			notp	#1		'1
			nop	#50
			notp	#1
			tasksw
			jmp	#c1

c2			notp	#2		'2
			nop	#75
			notp	#2
			tasksw
			jmp	#c2

c3			notp	#3		'3
			nop	#100
			notp	#3
			tasksw
			jmp	#c3

c4			notp	#4		'4
			nop	#125
			notp	#4
			tasksw
			jmp	#c4

c5			notp	#5		'5
			nop	#150
			notp	#5
			tasksw
			jmp	#c5

c6			notp	#6		'6
			nop	#175
			notp	#6
			tasksw
			jmp	#c6

c7			notp	#7		'7
			nop	#200
			notp	#7
			tasksw
			jmp	#c7


pcs			long	c0,c1,c2,c3,c4,c5,c6,c7
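
For readers who do not know PASM, the structure of the program above can be mimicked with Python generators. This is an analogy only: each generator stands in for one of the c0..c7 loops, `yield` stands in for TASKSW, and the list of generators plays the role of the `pcs` table of saved program counters. The pin toggles and delays are reduced to a log entry, and the loops are made finite so the example terminates.

```python
def task(pin, log, rounds):
    """One task: do its work for this turn, then hand over control."""
    for _ in range(rounds):
        log.append(pin)    # stands in for the NOTP / delay / NOTP sequence
        yield              # stands in for TASKSW

def run_round_robin(num_tasks=8, rounds=2):
    """Resume each task in turn until all of them have finished."""
    log = []
    tasks = [task(pin, log, rounds) for pin in range(num_tasks)]
    while tasks:
        for t in list(tasks):          # iterate over a copy while removing
            try:
                next(t)                # resume where the task last yielded
            except StopIteration:
                tasks.remove(t)
    return log

print(run_round_robin(3, 2))
```

As in the PASM version, switching is cooperative: a task keeps the processor until it reaches its own switch point.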

evanh · 2012-08-11 00:11

cgracey wrote: »

I would really have to think about all the ramifications of having prioritized tasks. We do have round-robin task switching in the Prop II, though, that even preserves Z and C into the same register which holds the program counter:

Yep, I see that as one application for the low priority hardware thread. What I'm thinking is the high priority hardware thread has its own entire set of 504? general registers (haven't considered the special registers) where it can have the low level I/O functions and data loaded and instantly ready for any pin activity. The CLUT can be used as an interchange with the low priority thread so that there is no need for Hub accesses, thereby improving the tight loop even further.

On the real-estate side of things even the pipelines will have to be duplicated so it's a significant request I know. The latency results are pretty cool though.

evanh · 2012-08-11 00:35

Apologies, I'd not tried to understand that code example. I'm not familiar with Cog assembly having not actually written any. I assumed it was a way of loading the next LMM context but that's not the case is it, it's more like a sequence of gotos with only the smallest of context and can all fit in a single Cog, right?

Makes my request look a little bloated.

EDIT: Oh, the whole set of LMM contexts is contained in the Cog, right? ... I think I need to learn a little more and shut up now ...

Sapieha · 2012-08-11 01:27

Hi Chip.

I have been thinking about what you said, that the D-long field is hard-wired in MOVF.
You now have this for the field mover:
SETF D/# - set up field mover
%w_xxdd_yyss

Why not make it 32 bits wide, so that it holds the D-long address as well?
%E_MSSSSSSSSS_MDDDDDDDDD_rr_w_xxdd_yyss

E = use S, D pointers: 0 = use only the w_xxdd_yyss field control, 1 = use the entire extended capabilities
M = 0 - main COG memory address, 1 - LUT memory area address
rr = spare bits, maybe 10=increment, 11=decrement for the D-long

That still leaves those 2 bits free, perhaps to give the increment/decrement possibilities on the D-long.

With that we have a complete BYTE handler for the entire COG-LUT memory area.
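
The proposed 32-bit layout can be checked by packing and unpacking it in Python. The field order is taken straight from the `%E_MSSSSSSSSS_MDDDDDDDDD_rr_w_xxdd_yyss` pattern above, most significant bit first; the names `MS` and `MD` for the two M bits are made up here, and none of this describes an actual Prop II encoding.

```python
# (name, width) pairs, most significant field first
FIELDS = [
    ("E", 1), ("MS", 1), ("S", 9), ("MD", 1), ("D", 9),
    ("rr", 2), ("w", 1), ("xx", 2), ("dd", 2), ("yy", 2), ("ss", 2),
]
assert sum(width for _, width in FIELDS) == 32   # the pattern fills one long exactly

def pack(**values):
    """Build the 32-bit word; fields that are not given default to 0."""
    word = 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name} out of range")
        word = (word << width) | v
    return word

def unpack(word):
    """Split a 32-bit word back into a dict of fields."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out

w = pack(E=1, S=0x1FF, D=0x055, w=1, ss=3)
print(f"{w:032b}", unpack(w))
```

With E=0 only the low 9 bits matter, so the existing `%w_xxdd_yyss` values would keep working unchanged under this proposal.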

cgracey wrote: »

Dave, I would post one, but I don't have time to make one right now. You can probably infer the REPS/SETINDx instructions, but here's SETF/MOVF:

SETF D/# - set up field mover

%w_xxdd_yyss

w: 0=byte, 1=word
xx: destination field control, 00=static, 01=rotate left by 8/16 bits, 10=increment, 11=decrement
dd: initial destination field, 00=byte0/word0, 01=byte1/word0, 10=byte2/word1, 11=byte3/word1
yy: source field control, 0x=static, 10=increment, 11=decrement
ss: initial source field, 00=byte0/word0, 01=byte1/word0, 10=byte2/word1, 11=byte3/word1

MOVF D,S - moves a byte/word from S into a byte/word in D, leaving other bits unchanged (except in the case of xx=01, in which bits rotate by 8 or 16)
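
A Python sketch of the MOVF data movement as described above, to make the field numbering concrete. It assumes that in word mode the upper bit of the 2-bit field number selects the word (matching the "10=byte2/word1" table) and that increment/decrement wrap modulo 4; the rotate mode (xx=01) is left out because the description does not pin down the exact ordering. Function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def movf(d, s, w, dd, ss):
    """Copy one byte (w=0) or word (w=1) from field ss of S into field dd of D.

    All other bits of D are left unchanged. Rotate mode (xx=01) is not modelled.
    """
    width = 16 if w else 8
    mask = (1 << width) - 1
    src = (ss >> 1) if w else ss          # assumed: top bit picks the word
    dst = (dd >> 1) if w else dd
    value = (s >> (src * width)) & mask
    cleared = d & ~(mask << (dst * width)) & MASK32
    return cleared | (value << (dst * width))

def step(field, control):
    """Advance a 2-bit field number: 10=increment, 11=decrement, else unchanged."""
    if control == 0b10:
        return (field + 1) & 3
    if control == 0b11:
        return (field - 1) & 3
    return field

# Byte 0 of S into byte 3 of D, then word 0 of S into word 1 of D
print(hex(movf(0xAABBCCDD, 0x44332211, 0, 3, 0)))
print(hex(movf(0xAABBCCDD, 0x44332211, 1, 2, 1)))
```

Calling `step` on dd and ss after each `movf` gives the auto-advancing behaviour that SETF sets up, for example unpacking the four bytes of one long into four separate longs.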