
Propeller II update - BLOG


Comments

  • cgracey Posts: 14,133
    edited 2014-01-18 03:30
    Cluso99 wrote: »
    Chip,
    I am curious as to whether we could split the cog ram into 2x 256 longs (less the few registers), where in hubexec mode the upper 256 longs could be used instead as a larger hub cache. This would do away with the 8 cache lines of 8 longs each; instead of 8 lines we would have 32.

    If the cog was split into 2x 256 longs, might we be able to use the upper half as the aux ram under some circumstances - could that save the aux ram? I have not totally thought through all the implications. But perhaps if the upper half was not usable for cache lines, it could be used as the clut instead of having separate aux.

    Would any of these changes give us any more hub ram - maybe another 64KB?


    That 4-port memory is our own design (in silicon, not Verilog), and it won't bend to function any differently than it's designed to. Someday on Prop3, we'll synthesize the whole thing, including memories, and then things like what you're talking about will be possible.
  • dMajo Posts: 855
    edited 2014-01-18 12:04
    cgracey wrote: »
    No. There is a 32-bit XCH (exchange) system that can route 32 bits per clock between/among any/all cogs.
    potatohead wrote: »
    lol
    Am I the only one who totally forgot about this?
    Cluso99 wrote: »
    What might make sense with P2 is comms between cogs where we use hub fifos (FullDuplexSerial, etc); we also set a bit in port D to indicate a byte is available, and the cog waitpeqs on this rather than using a hub slot in a tight loop, freeing the slot.

    Potatohead/Cluso, the assertions below were made with the cogs' port D handshaking in mind (a rough sketch of that handshake follows these quotes).
    dMajo wrote: »
    That means that to use this feature the hungry cog will need to coginit another one and, within that second cog, release the resource (and perhaps waitpeq forever on internal io, if nothing else uses it).
    dMajo wrote: »
    Cluso, by coding an object that uses 2 cogs, it's enough that one of them doesn't use any hub slot; the other automatically gets all the unused slots, and thus has double the bandwidth guaranteed (the same applies for more cogs). I see no reason to over-complicate that.
    dMajo wrote: »
    If you are so smart/clever a coder that you need this functionality, you can adapt the same code so that hub access happens on one window in four. By doing so you will have acquired knowledge of someone else's code, understood how it works, and can see whether this is doable without issues. The end result is the same, even without a sharing (donation) option.
    If you are writing from scratch then again you do not need a sharing option, because the code can be written to not use all of the slots. If you are not able to accomplish that, take your time to study first, to understand how it works, and in the meantime stay out of hungry modes.
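
    A rough picture of that port D handshake, as a hypothetical P1-flavoured PASM sketch (the actual P2 mnemonics and port D details were still in flux, and the pin/label names here are invented):
    loop    waitpeq ready, ready        ' sleep until the "byte ready" bit is raised
            rdbyte  char, tail_ptr      ' fetch the byte from the hub fifo
            add     tail_ptr, #1        ' advance the fifo tail (wrap omitted)
            ' ... process char, then acknowledge/clear the handshake bit ...
            jmp     #loop

    ready   long    |< READY_BIT        ' mask for the producer's handshake bit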



    cgracey wrote: »
    I've been thinking about how a cog has 512 registers.

    I found early on in Prop1 development that 256 registers just wasn't enough room to write applications in, so I doubled the memory to 512 longs and that proved adequate for everything I wanted to do.

    Now that we have hub execution, most programs will probably live outside the cog, with only some high-speed code needing to be in the cog registers, while the rest of the registers are variables.

    Is there any compelling reason to go back to 256 registers, instead of 512? I'm just wondering. This would shrink the silicon area needed for those space-hungry 4-port RAMs and would free up two instruction bits, with D and S fields falling back to a more natural 8 bits, instead of 9. I don't think I'm going to make such a change, but I thought I'd see if anyone had any interesting ideas about this.

    Chip, if hubexec will be limited to a single-threaded cog then I will be fine having icache/dcache in cog ram, eating up 256 registers and leaving the other 256 for cog execution, thus freeing the silicon resources used by the current icache. Since this is not possible due to the ram design, I think I also prefer 512 registers.
  • Cluso99 Posts: 18,069
    edited 2014-01-18 15:33
    Chip,
    I understand that the cog ram is an inhouse custom 4 port design.
    Wouldn't this work for aux ram (if you could configure the upper 256 longs as aux), where the write port would be the hub read / cog read side (wide is possible), one read port would be the cog write side (so now we have the r/w port of the aux), and the next read port would become the video out port? The last read port in this scenario would be unused.

    Wouldn't the cache lines be similar - the write port is the hub read side (byte/word/long/wide)? This could then be used nicely for the instruction cache lines.

    It may be necessary to re-lay out the cog ram into smaller sub-blocks - would this be a big job?

    I am just wondering if you could simplify the design somewhat, with fewer instructions required, by effectively folding aux/cache/cog into one configurable 256-long block. There would certainly be some productive gains possible here, including possibly making hubexec able to execute and self-modify.
  • David Betz Posts: 14,511
    edited 2014-01-18 16:12
    Cluso99 wrote: »
    including possible hubexec able to execute and selfmodify.
    I hate to admit this, but I have been hoping that the days of self-modifying code on the Propeller were numbered because of the new stack instructions as well as hub execution. It certainly seems like it should be far less necessary now. What would be the harm of restricting self-modification to code running from COG memory?
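
    The classic idiom at stake is patching an instruction's source field at runtime - something like this P1-style sketch (label names invented) - which only makes sense when code executes from COG registers, since hubexec instructions never sit in patchable cog ram:
            movs    read_ins, #table    ' patch the source field of read_ins
            nop                         ' spacer: P1 pipelining means the very next
                                        ' instruction can't be the one just modified
    read_ins mov    value, 0-0          ' the 0-0 source is replaced at runtime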
  • evanh Posts: 15,347
    edited 2014-01-18 16:15
    I'm not completely up on what Aux's 256 longs will be used for, but having it available alongside the full 512 longs of CogRAM was part of its feature set, I believe.

    Switchably using CogRAM for HubExec caching seems much more acceptable in terms of space constraints. However, I suspect the actual wiring will be the roadblock on this one. The wide hub reads require a wide bus coming further into the guts of the cog.

    Reducing Cog space has too many consequences for the many existing changes and features that have been added to the Prop2.


    EDIT: Switchable address ranges kind of defeats what Chip was asking anyway. He was more after instruction encoding space than making caches bigger.
  • potatohead Posts: 10,260
    edited 2014-01-18 16:53
    Never mind...
  • evanh Posts: 15,347
    edited 2014-01-18 16:55
    potatohead wrote: »
    Never mind...

    :) There is a delete post button when (re)editing
  • ozpropdev Posts: 2,792
    edited 2014-01-18 17:04
    evanh wrote: »
    I'm not completely up on what Aux's 256 longs will be used for, but having it available alongside the full 512 longs of CogRAM was part of its feature set, I believe.

    Switchably using CogRAM for HubExec caching seems much more acceptable in terms of space constraints. However, I suspect the actual wiring will be the roadblock on this one. The wide hub reads require a wide bus coming further into the guts of the cog.

    Reducing Cog space has too many consequences for the many existing changes and features that have been added to the Prop2.


    EDIT: Switchable address ranges kind of defeats what Chip was asking anyway. He was more after instruction encoding space than making caches bigger.

    AUX ram (formerly called CLUT ram, "Color LookUp Table") is still used by the video subsystem as well.
  • ctwardell Posts: 1,716
    edited 2014-01-18 17:11
    I think we should leave the 512 long cog ram as is.

    It's my opinion that we are headed down a questionable path moving toward hubexec being the 'normal' mode.

    I think hubexec is great for a lot of 'business logic' type code, VM's, etc., but I don't think it is needed for the vast majority of peripheral type code.

    If the size of the cog ram is reduced it will force a lot of code that could have fit within the cog to use hubexec mode.

    C.W.
  • User Name Posts: 1,451
    edited 2014-01-18 17:24
    I have no specific plans of ever using hubexec. The embedded control stuff I do is ideally suited to individual cogs and their own exclusive RAM.

    Still, it's a bit premature to make bold declarations. Given all that's packed into the P2, who knows what crazy synergies and unforeseen use patterns might arise?
  • mindrobots Posts: 6,506
    edited 2014-01-18 17:55
    ctwardell wrote: »
    I think we should leave the 512 long cog ram as is.

    It's my opinion that we are headed down a questionable path moving toward hubexec being the 'normal' mode.

    I think hubexec is great for a lot of 'business logic' type code, VM's, etc., but I don't think it is needed for the vast majority of peripheral type code.

    If the size of the cog ram is reduced it will force a lot of code that could have fit within the cog to use hubexec mode.

    C.W.

    +1 (or +2 or 3 on this one!)
  • ozpropdev Posts: 2,792
    edited 2014-01-18 18:39
    With the fabulous selection of special 1-cycle instructions in the P2, I think users will be pleasantly surprised how much can be done in "cog mode".
    Compact size and speed - PASM is hard to beat! :)
    This in combination with hub exec mode makes one powerful beast!
  • Cluso99 Posts: 18,069
    edited 2014-01-18 22:49
    My cog ram suggestions were aimed at better utilising the cog ram: it might be possible to remove the clut/aux ram and use half of the cog ram when required. Of course this is coming from someone (me ;) ) who a short time ago was asking for a 512-long clut/aux ram.

    But things have changed so much with the advent of hubexec mode and wide mode that perhaps (as Chip suggested) it is worth discussing the merits of cog ram size. I thought it should be discussed, including the clut/aux/fifos and the hubexec caches.

    I just imagined that a tidy-up here could reduce space and the instruction count (because we may no longer need the clut for extra cog storage/stacks/etc), and bring a general improvement and overall simplification.

    I wondered how much die space this could free, and whether some of it could become more hub space (or a new block of space shared between cogs)?
  • ozpropdev Posts: 2,792
    edited 2014-01-18 23:10
    I seem to recall Chip mentioning somewhere that he was going to modify (or has modified) the XFR block to allow transfers from HUB to AUX ram for improved video operation. This would save the double handling of pixel data from HUB to COG to AUX. Maybe in his latest hub exec work and his recent "snow blindness" he forgot about that? :)
  • pik33 Posts: 2,358
    edited 2014-01-19 00:08
    Heater. wrote: »

    So we only need 13 COG registers: AX, BX, CX, DX, SI, DI, IP, SP, BP, DS, CS, SS, ES. That should be enough that the P2 architecture is useful for the next 30 years or so.

    Guys...stop beating on me...I was only joking...

    Then assign some opcodes to fully implement all 80286 instructions, so we can run Windows 3.1 on it :)

    English is not my mother tongue, so reading all of this got too difficult. I understand the P2 gets more complex every week and is starting to lose the P1's simplicity. I think this is not good.

    Another question for Chip: can you make a version of the DE2 P2 emulator programmable via the DE2's onboard RS232?
  • kbash Posts: 117
    edited 2014-01-19 01:17
    pik33 wrote: »
    Then assign some opcodes to fully implement all 80286 instructions, so we can run Windows 3.1 on it :)

    English is not my mother tongue, so reading all of this got too difficult. I understand the P2 gets more complex every week and is starting to lose the P1's simplicity. I think this is not good.

    Another question for Chip: can you make a version of the DE2 P2 emulator programmable via the DE2's onboard RS232?

    Pik33,

    English IS my mother tongue and (trust me on this one) IT'S NOT YOUR ENGLISH!

    I found my copy of the Intel 432 Micromainframe manual the other day. As I glanced through its pages it gave me an eerie sense of deja vu, watching the P2 evolve from a relatively straightforward, logical, incredible step up from the P1 into something that more resembles a supercomputer of the 1970s than an embedded microcontroller.

    I THINK that most of these enhancements will not affect our ability to take our P1 code, transfer it to the P2, and just use the originally planned features that we desperately needed a year ago. Chip said the other day that we should "Just think of them as friends you haven't met yet, who will all want to help you out."

    Chip has established a lot of reasons for us to trust his judgement about microprocessor design, and so far... I do. But it sent a chill down my spine to go through the iAPX 432 Micromainframe manual and realize that even a giant like Intel could waste hundreds of man-years on a design that was too complex for its own good.

    K.B.


  • jmg Posts: 15,155
    edited 2014-01-19 01:49
    kbash wrote: »
    .. But it sent a chill down my spine to go through the iAPX 432 Micromainframe manual and realize that even a giant like Intel could waste hundreds of man-years on a design that was too complex for its own good.

    Intel's sales and stock price say it was not a waste at all :)
    Yardsticks change, and even the 432 was very simple compared with what they ship currently.

    Checking the wiki, Intel learned from the 432, and it was out of step with the process and tools they had at the time (the wiki says it started in 1975... a year before their 16-bit 8086 project began).

    That makes it rather more like the P2 from 2+ years ago, before the change to Verilog.

    P1 is 32-bit, and P2 is also 32-bit, so there is no large step in data size here.

    Intel also had no FPGA path to prove the Verilog.

    P2 is constrained by the package and process, so there is an upper limit to the total available features.
  • evanh Posts: 15,347
    edited 2014-01-19 02:41
    jmg wrote: »
    Intel's sales and stock price say it was not a waste at all :)
    That, of course, has very little to do with Intel's design choices, nor does it reflect the effectiveness of one design over another. It has everything to do with IBM choosing the 8088 for the PC and then leaving the clones to take over board and case manufacturing, the general idea being that chip fabrication was too far out of reach for them. There is likely a lot of behind-the-scenes politics involved. I doubt the real story has even come close to being told yet.
  • Heater. Posts: 21,230
    edited 2014-01-19 03:23
    kbash,

    Oh yes, the Intel 432. I tried to read the 432 manuals at the time. I did not understand a word of it. Despite my being employed writing 8- and 16-bit Intel and Motorola assembler at the time, this beast was something else. I decided I was just too stupid to deal with it. Little did I suspect it was not just me but everyone else. The 432 failed big time.

    Do you also recall the Intel i860 from 1989? A super duper RISC machine with some serious floating point performance. The only problem was that it too was far too complex, with its parallel execution of float and integer ops, its long instruction pipeline that you had to deal with in your programs, and so on. Programming it was hard. Getting peak performance out of it in assembler was impossible. Even Intel did not know how to write an FFT that reached the theoretical peak flops. Compilers did not do any better. The i860 failed big time.

    Then of course we have the Intel Itanium, an ongoing train wreck that we can watch in real time. Again an architecture change, to VLIW, that was so complex that neither humans nor compilers can program it.

    Meanwhile, what did succeed? As we know: the 286, the 386, the 486, the Pentium and so on. All extensions on extensions of the original x86, itself mostly compatible with the 8-bit 8080!

    The big slap for Intel was AMD continuing the tradition with the x64 extension. Intel had not wanted to go there, preferring Itanium.

    What is the point of this ramble?

    1) KISS.

    Complex things don't get used. They put people off. Nobody has the time to invest in learning how to deal efficiently with hundreds of obscure instructions, multiple operating modes and weird interactions between them. It's even harder for compilers to use this stuff.

    2) Backwards compatibility wins.

    Not 100% backwards compatible, and not at the binary level. But there are big wins in being able to easily adapt old code, and even just old ideas, to new devices. I think simple user familiarity has a lot going for it as well.
  • Brian Fairchild Posts: 549
    edited 2014-01-19 03:34
    Heater. wrote: »
    1) KISS.

    Complex things don't get used.

    I've always been fascinated by how few instructions your average compiler uses.
  • Heater. Posts: 21,230
    edited 2014-01-19 03:37
    @jmg,
    jmg wrote: »
    Intel's sales and stock price say it was not a waste at all...
    This says nothing about the qualities of i432, i860 or Itanium or anything else.


    Top marks to Intel for sticking their necks out and trying different architectures and ideas. Everybody likes to complain how awful x86 is. Intel has been bravely trying to get us away from x86 for years. What does everybody buy? x86!


    Intel have been making huge money out of x86 and so could afford to investigate some sideways options. It's good that they tried.


    Parallax does not have the resources of an Intel.
    jmg wrote: »
    ... the 432 was very simple, compared with what they ship currently...
    Yes, I'm sure the 432 had far fewer transistors and connections than a modern CPU. But from a user's perspective it was not simple.


    I wonder if that i432 manual is online somewhere so we can remind ourselves how complex it was.
  • Heater. Posts: 21,230
    edited 2014-01-19 04:16
    Brian,
    I've always been fascinated by how few instructions your average compiler uses.
    Yep. Let's remind ourselves with an example.
    for (i = 0; i < 100; i++)
    {
        *x++ = *y++;
    }
    
    Clearly all we are doing here is moving a block of memory. Given enough smarts the compiler could see that and make use of the block move instructions you find on x86 and other chips. Mostly they do not. I just tried it with GCC and Clang. In general it's just hard for the compiler to analyse the intent of your code and make use of special instructions where appropriate.

    Turns out it's just as well. Using the hardware block moves can be slower on new x86 devices!

    Oddly, GCC uses the REP instruction prefix for that loop. REP is supposed to go in front of a block instruction. However, there is no block instruction, just REP by itself. As far as I can tell this is not the intended usage of REP. Not sure what's going on there yet.

    Here is the GCC output for that loop:
    .L2:
            movl    (%rdx), %ecx
            addq    $4, %rdx
            movl    %ecx, (%rax)
            addq    $4, %rax
            cmpq    $x+400, %rax
            jne     .L2
            rep
    
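    For comparison, had the compiler (or a human) spotted the block move, the whole loop could collapse onto the x86 string hardware. A hand-written, untested sketch in the same AT&T syntax, assuming x and y are the arrays from the C fragment above:
            movq    $y, %rsi        # source pointer (address of y)
            movq    $x, %rdi        # destination pointer (address of x)
            movq    $100, %rcx      # 100 longs to copy
            rep movsl               # string block move: copies %rcx longs
    Whether that is actually faster on a modern x86 is another question, as noted above.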

    I know, this is getting way off topic.
  • kwinn Posts: 8,697
    edited 2014-01-19 08:27
    Heater. wrote: »
    .........................................................................

    What is the point of this ramble?

    1) KISS.

    Complex things don't get used. They put people off. Nobody has the time to invest in learning how to deal efficiently with hundreds of obscure instructions, multiple operating modes and weird interactions between them. It's even harder for compilers to use this stuff.

    2) Backwards compatibility wins.

    Not 100% backwards compatible, and not at the binary level. But there are big wins in being able to easily adapt old code, and even just old ideas, to new devices. I think simple user familiarity has a lot going for it as well.

    +1 Simplicity was one of the things that drew me to the P1 in the first place. That's not to say I don't like most of the improvements in the P2; I am just concerned that some of the additions may be so complex, hard to use, or specialized that they do not get used. That would be a waste of silicon and would add to the price of the chip.
  • jmg Posts: 15,155
    edited 2014-01-19 10:55
    Heater. wrote: »
    2) Backwards compatibility wins.

    Not 100% backwards compatible, and not at the binary level. But there are big wins in being able to easily adapt old code, and even just old ideas, to new devices. I think simple user familiarity has a lot going for it as well.

    History shows us that backward compatibility matters, and certainly being able to easily adapt old code.
    History also has examples of missteps by those who believed binary-level compatibility was not really needed.
    Without it, you rather contradict the 'being able to easily adapt old code' aspiration.

    Anyone designing a follow-on chip should be thinking carefully about their existing code base, and whilst ASM use is less than it was, being able to run existing code at the same speed should not be ignored.

    In a P2 context, 'being able to easily adapt old code' would mean having one or more COGs run ported P1 code at equal speed, whilst not limiting other COGs to that same low speed.

    Lessons from History :
    Intel MCS251 - Intel nearly got this right - they even had a switch for binary or extended mode, but strangely chose to make it OTP instead of run-time.
    At a time when most user code was in assembler, that decision removed from designers the choice of linking smarter/faster libraries into existing code for a simple upgrade in numeric performance.

    Philips XA - Philips decided binary compatibility did not really matter to customers, but they also did very little to support conversion of existing code flows.
  • jmg Posts: 15,155
    edited 2014-01-19 11:09
    Heater. wrote: »
    Intel have been making huge money out of x86 and so could afford to investigate some sideways options. It's good that they tried.

    Yes, and even though that particular part did not hit critical mass, the design experience was not wasted.
    Heater. wrote: »
    Parallax does not have the resources of an Intel.
    2014 Intel? - clearly not.
    Relative to Intel circa 1975, though? - I'd place Parallax in 201x well ahead on tools; Intel in 1975 tried to do something that was ahead of the tools of the time. We now have 32/64-bit computing pretty much everywhere.
  • Heater. Posts: 21,230
    edited 2014-01-19 11:50
    jmg,
    History shows us that backward compatibility matters, and certainly being able to easily adapt old code.
    Yep. You could not run 8080 binaries on an 8086, but you could easily run 8080 assembler source through a translator to 8086 assembler (conv86) and then build that program to run on x86. I believe it was conceptually an architectural change that people could deal with, as opposed to the weird worlds of the 432 or i860 or Itanium.

    Whilst we are at it, history also shows that nobody cares about backward compatibility. ARM processors dominate the mobile landscape - a place where Intel and their "backwards compatible with the 8080" processors would love to go.
    Anyone designing a follow-on chip should be thinking carefully about their existing code base, and whilst ASM use is less than it was, being able to run existing code at the same speed should not be ignored.
    Nah. I go with Linus Torvalds and Linux here. Close is good enough. Don't expect us to add layer upon layer of baggage to support generations of old code forever.

    This "compatibility" is such a big thing. Consider:

    Back in the day people gave up on binary compatibility and started to write their code in high level languages like C. Recompile and it works anywhere.

    That was fine, but it does not work across operating systems. Enter monstrosities like Java: write once, run anywhere - by virtue of dragging a whole OS abstraction along with it.

    That turned out not to be so good. Enter JavaScript: it works any place that has a web browser. Nowadays, any place that has a JS engine.

    Which is pretty much anywhere.

    Nobody cares about your HUB or COG execution. Or your hardware thread scheduling and stack in AUX. They just want to do stuff.

    I know this might sound a bit "out there" but it seems to be the way the world is going.
  • Heater. Posts: 21,230
    edited 2014-01-19 12:00
    jmg,
    2014 Intel? - clearly not.
    Clearly I cannot speak of the financial health of Parallax. Something I know nothing about.
    My gut tells me the P2 development is a big drain on a small company. The sooner it gets out the door and I can pay money for it, the better :)
  • potatohead Posts: 10,260
    edited 2014-01-19 12:17
    Nah. I go with Linus Torvalds and Linux here. Close is good enough. Don't expect us to add layer upon layer of baggage to support generations of old code forever.

    Yes. At some level, we have portable so that it's actually portable, right? Right.

    When this chip was first discussed, the issue of compatibility was also discussed. The overall vision was to grow the platform toward a higher level of capability, to put some emphasis on PASM so that people doing real-time work got the things they need, and to blow out the instructions / addressing / SPIN - all deemed a good idea so that the whole thing could be maximized and set a foundation for a P3 someday.

    Given how this ecosystem works, P1 code is very likely to be run on P1 chips. There will be P1 chips, and all that activity simply continues. The expectation that a P2 will run P1 code opens a very large can of worms, and it detracts from what a P2 is, which needs to be maximized in its own way - and we've done that. (Have we ever, sheesh!)

    For now, at the scale things are working at, and given the highly differentiated design, comparisons to Intel et al. really don't mean much. One strength of this chip is doing a control task - taking I/O, math, some user interface, and whatever else - together, in parallel, sans an operating system. Pretty much everything else with a similar capability set really isn't going there. Whether or not that makes sense in the end is something we are all going to see play out. Maybe it really doesn't, or we see some lean kind of OS get dropped in there, or something else happens! Who knows?

    "just want to do stuff" may play out very well given pieces get written in ways that people can combine and link together in creative ways. Right now, that's on the table, possible and IMHO, likely. Frankly, P1 code will get in the way of that. Better to build new and take best advantage of this device as it is.

    And we've got C, so there will be some reasonable reuse, except in those cases where inline / native code is present. Then again, those bits are typically small and easily rewritten, given that compartmentalization has been reasonably applied.

    From what I see so far, even on a low-clock version of the P2, most of the body of useful things we've done on the P1 actually fits into a few COGs! The P2 is going to take us to a whole new level of capability, and to get that, code is going to have to be written, not carried forward from the P1, which had very different design goals at the time.

    Finally, there really isn't some grand vision or path in place like Intel and others have had. It's more like ARM, where the devices get made to perform the tasks - not so much the march of general-purpose computing that many of these comparisons are based on.

    If this takes off, Heater is going to be right about it. People won't care much about the lower-level magic, other than being able to do things. Those of us building that lower-level magic may well find opportunities there, none of which are really going to involve P1 code if they are to be maximized at all.
  • evanh Posts: 15,347
    edited 2014-01-19 12:19
    I've always been fascinated by how few instructions your average compiler uses.

    That's a side effect of x86 having gone RISC as of the 80486. Intel took that to the extreme with the Pentium 4 and even disconnected the native instructions, which hurt badly - not that anyone would admit it at the time.

    The interesting part of this is that the register map legacy of the x86 has skewed compiler design ever since. There is a much bigger opportunity for compilers to get back on track now that 64-bit mode brings a general-purpose register set. And I think you'll find there are compilers using SSE and the FPU and various other special instructions - whatever is faster.
  • potatohead Posts: 10,260
    edited 2014-01-19 12:21
    One other thing to take away from Linus and Linux in general.

    "close enough" works like a great filter! Really useful code will get carried forward. There is enough interest for it to happen and because of that, it shall.

    Sad news for the binary blobs we are going to have to deal with on the P2 and beyond, but then again that is the cost of binary blobs. Open code, on the other hand, will run through this filter with the really valuable things carried forward and the baggage left aside, as it should be.