New Hub Scheme For Next Chip

David Betz · 2014-05-21 18:56

Cluso99 wrote: »

IMHO, irrespective of anything else...

DJNZ & friends should be relative !

JMPRET should at least have a relative mode (for the S = goto address anyway). There is an argument both ways for the D return address.

This is important for relocatable code. It should have been in P1, but Chip didn't conceive code running anywhere but cog.

One problem with DJNZ having a relative address is it means that an instruction near the end of COG memory won't be able to branch back to an instruction near the beginning since the 9 bit value will be interpreted as a signed offset. Is that acceptable?
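To make the concern concrete, here is a minimal sketch in Python (not PASM). It assumes a 512-long COG, a signed 9-bit offset in the range -256..+255 measured from the branch instruction itself, and no wraparound at the ends of COG memory. None of these details are confirmed for the new chip, and the function names are purely illustrative.

```python
COG_LONGS = 512    # assumed COG size in longs
OFFSET_BITS = 9    # width of the S field (assumption: reused as a signed offset)

def sign_extend(value, bits=OFFSET_BITS):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value

def can_reach(src, dst, bits=OFFSET_BITS):
    """True if dst lies within the signed offset range of src (no wraparound assumed)."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return lowest <= dst - src <= highest

print(sign_extend(0x1FF))    # -1: all ones is a one-long backward step
print(can_reach(500, 490))   # True: a tight loop near the end of the COG
print(can_reach(500, 10))    # False: end of COG back to the beginning
```

If the hardware instead wrapped the sum modulo 512, every COG address would be reachable, and the limit would only matter for address spaces larger than the offset can span.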

Roy Eltham · 2014-05-21 19:00

Bill,
You can say that abs/rel stuff is not separate instructions, but according to Chip's list they were. That's what I was going by... I assume they need different verilog at some point in their paths.

It doesn't really matter how many instructions and registers are needed for hubexec. I just know that Chip would rather not do hubexec. Perhaps the "triviality" of it isn't so trivial anymore? Perhaps the combination of all of it makes things start going bad? I dunno really.

Bill Henning · 2014-05-21 19:03

Cluso99 wrote: »

IMHO, irrespective of anything else...

DJNZ & friends should be relative !

100% agreed

Cluso99 wrote: »

JMPRET should at least have a relative mode (for the S = goto address anyway). There is an argument both ways for the D return address.

Maybe. Frankly, the only use I see for JMPRET anymore is for cooperative threading; I'd much rather use CALLA with a stack pointed to by INDA. Much cleaner.

Cluso99 wrote: »

This is important for relocatable code. It should have been in P1, but Chip didn't conceive code running anywhere but cog.

<putting on computer scientist hat>

1) Any flavor of JMP and CALL should have both relative and absolute addressing modes, at least for hubexec

2) Any data references to hub memory, regardless of being FIFO based RDFxxxx or WRFxxx or random RDxxx/WRxxx based, should optionally add a BASE register value before using the address - this would give us relocatable data.

This would give us relocatable libraries with local data and much more.

<gazing into the future, P3 mode>

Having base and limit registers for code, and separate base and limit registers for data, allows for memory protection and for data and code relocation. Memory protection buys us safety from one cog's hubexec task clobbering another task's memory.
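As a rough illustration of the base and limit idea, here is a minimal Python sketch. The register names, the byte-granular limit, and the fault behaviour are all assumptions made for the example; nothing like this exists in the current design.

```python
class ProtectionFault(Exception):
    """Raised when an access falls outside the task's window."""

def translate(offset, base, limit):
    """Map a task-relative offset to a physical hub address.

    base  -- first hub address owned by the task (hypothetical BASE register)
    limit -- size of the task's window in bytes (hypothetical LIMIT register)
    """
    if not 0 <= offset < limit:
        raise ProtectionFault(f"offset {offset:#x} is outside limit {limit:#x}")
    return base + offset

# The same task-relative offset lands at different physical addresses
# depending on where the task was loaded, which is what makes it relocatable.
print(hex(translate(0x10, base=0x4000, limit=0x1000)))   # 0x4010
print(hex(translate(0x10, base=0x8000, limit=0x1000)))   # 0x8010
```

With the base alone you get relocation; the limit check is what adds protection between tasks.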

<far future... think P5+...>

Hello MMU.... hello FPU!

<taking off computer scientist hat, putting on fire proof suit and oxy tank>

David Betz · 2014-05-21 19:07

Roy Eltham wrote: »

Bill,
You can say that abs/rel stuff is not separate instructions, but according to Chip's list they were. That's what I was going by... I assume they need different verilog at some point in their paths.

It doesn't really matter how many instructions and registers are needed for hubexec. I just know that Chip would rather not do hubexec. Perhaps the "triviality" of it isn't so trivial anymore? Perhaps the combination of all of it makes things start going bad? I dunno really.

Well, Chip is the architect of the P2. If he doesn't think hubexec belongs then he should ignore all of the requests for it and leave it out. I only promoted it because I thought good C performance was a requirement for P2. Parallax is free to decide that it is not and that LMM is good enough. I would certainly support them in that decision if they think that is what is best for their customers or even if Chip just thinks it's the cleanest design.

Cluso99 · 2014-05-21 19:08

Bill Henning wrote: »

Am I the only one that finds it ironic that I came up with LMM, and I am working hard to kill it?

I've been chuckling about that one since Chip said "I am trying very hard not to think of executing from the hub".

Yes, it does make me chuckle a bit.

However, in the P1 there was no way to change it, so you invented LMM.

Now, you have the ability to input to the design, so you don't have to circumvent the problem. Fix the shortcomings - ie use hubexec.

But, as I have tried to explain, hubexec (ignoring the hub aspect of it) is really native mode for almost all the other micros. The prop just has this additional mode (we call it the default or standard mode) that it can run self-modifying code in the cog register space.

Cluso99 · 2014-05-21 19:11

David Betz wrote: »

One problem with DJNZ having a relative address is it means that an instruction near the end of COG memory won't be able to branch back to an instruction near the beginning since the 9 bit value will be interpreted as a signed offset. Is that acceptable?

I believe it is acceptable.
Most all cases using the DJNZ & friends are to execute a tight loop. Next most common use is to skip a few instructions.

So, the question is rather, what is the major use of this ??? If you agree with me, then it's a no-brainer (and you get relocatable code).

Bill Henning · 2014-05-21 19:11

Roy,

They are separate op codes, that invoke the same abs/rel decode logic. Once you have that block, it is essentially wired to the instruction decode table, saying "these instructions decode thus".

Chip would rather not do hubexec as right now he is under a ton of pressure to simplify everything to the bone.

I think he needs a vacation; too many topsy-turvy changes due to "5W panic", P1+ simplification, naysayers etc. Hubexec is a bit more work/debug, but the benefits are huge.

He is also remembering CALL/CALLLR/CALLA/CALLB/CALLX/CALLY, RET/RETLR/RETA/RETB/RETX/RETY, four line LRU cache, 256 bit bus, one line data cache, tasks, state saves/restores for debugging soft threads etc.

I've been working on LMM, and many variations, since 2006. Hubexec will be 2x-4x faster, I cannot see a way around it, and I tried.

The ONLY way around it with the new hub is to spend a TON of money on a very specialized gcc back end and treat everything as 16 instruction VLIW chunks. And that will not happen, and even if it did, it would still be slower, and use more memory, than hubexec.

Go look at my performance analysis. The memory savings and performance increase due to hubexec cannot be ignored if Parallax wants to enter new markets.

If all Parallax wants is its current educational niche, it can live without hubexec.

Btw, one simplification would be no REP blocks in hubexec (one instruction block trivial, fifo size supportable fairly easily, bigger than fifo size HUGE pain for Chip).

Roy Eltham wrote: »

Bill,
You can say that abs/rel stuff is not separate instructions, but according to Chip's list they were. That's what I was going by... I assume they need different verilog at some point in their paths.

It doesn't really matter how many instructions and registers are needed for hubexec. I just know that Chip would rather not do hubexec. Perhaps the "triviality" of it isn't so trivial anymore? Perhaps the combination of all of it makes things start going bad? I dunno really.

David Betz · 2014-05-21 19:13

Cluso99 wrote: »

I believe it is acceptable.
Most all cases using the DJNZ & friends are to execute a tight loop. Next most common use is to skip a few instructions.

So, the question is rather, what is the major use of this ??? If you agree with me, then it's a no-brainer (and you get relocatable code).

I agree with you. I only mentioned it to see if anyone else thought that the inability to reach any COG address might be a problem.

Cluso99 · 2014-05-21 19:15

Roy Eltham wrote: »

Bill,
You can say that abs/rel stuff is not separate instructions, but according to Chip's list they were. That's what I was going by... I assume they need different verilog at some point in their paths.

It doesn't really matter how many instructions and registers are needed for hubexec. I just know that Chip would rather not do hubexec. Perhaps the "triviality" of it isn't so trivial anymore? Perhaps the combination of all of it makes things start going bad? I dunno really.

Yes, Roy. It became way too overcomplicated, particularly because into the mix was multi-tasking. That pipeline must have been a nightmare.

I would rather go back to the basic minimal requirements. If there is time, then just maybe a few helper bits might be in order. But just the basics makes for a huge performance boost.

potatohead · 2014-05-21 19:20

I have the same objections Roy does, and agree with Chip on not wanting to do hubex. I really dislike creating a memory zone in the COG, and a cut to 256 registers. Really, why not just ask for an entirely different CPU that has nothing at all to do with a COG as we know it today?

Personally, I have come to the realization that we are talking about three different CPUs

There is the one Chip would build, and the one Cluso wants, and another one that runs big code from the HUB generally.

Packing all of that into what needs to be one CPU is where all the pain, punishment and complexity is. Going down that road is highly undesirable.

Given we have a third of the instructions dedicated to this hubex, I wonder if it does not make more sense to make an instruction or two that really make LMM shine?

Or maybe make an LMM state machine type thing that does what a software kernel does, and when it bumps into one of the opcodes that need to be ignored, such as JMP, have it run COG code to handle, then carry on?

If hubex can be done without making a complete mess, fine, but I see an increasingly messy set of ideas here.

Hey, there is a LUT now! Can we also make it a stack, run code from it, god knows what else?

The HUB scheme is different. Can we just make really big COGS and ignore that? Or can we include a mode for every possible scenario we might have to think about otherwise?

Look a FIFO! No, it needs to be two of them, and we need lots of options because... well, because!

Might as well call those GIGOs. Or maybe we can combine it all into a MIMO GIGO processor that does everything except for just process!

Sheesh...

I'm off to some camping. Make sure it doesn't also play music with spare HUB slots while I'm gone. Thanks.

Roy Eltham · 2014-05-21 19:23

Bill,
I understand the hubexec is a big perf gain and memory savings. Chip understands that too. He's just not happy with it as it currently stands. Like I said before, if there was some simpler/cleaner way to achieve the performance without doing the full hubexec, then that would be more acceptable, I think. Perhaps if hubexec could be done in some cleaner more elegant way that fit in with the rest of the chip better, he'd like that too.

I certainly am not saying that hubexec is out or won't be done. In fact, I think ultimately we will end up with some form of hubexec. It won't be in this first FPGA image, and it may take a bunch of back and forth over what's really needed.

I, also, think we all need to step outside of the box we have constructed around ourselves of this version of hubexec, and then maybe we might see another path that fits better with the current P2 design. Like, what if there is a way we could get normal cog native PASM (self modifying and all) to work with HUB memory at half or quarter cog native speed? Maybe there isn't a solution better than what we have, but it's worth thinking about.

David Betz · 2014-05-21 19:27

potatohead wrote: »

Or maybe make an LMM state machine type thing that does what a software kernel does, and when it bumps into one of the opcodes that need to be ignored, such as JMP, have it run COG code to handle, then carry on?

Is that a trap mechanism? Dangerously close to interrupts I think! :-)

If hubex can be done without making a complete mess, fine, but I see an increasingly messy set of ideas here.

There is a simple path to this but everyone seems to insist on piling on additional nice but not necessary features.

I'm off to some camping. Make sure it doesn't also play music with spare HUB slots while I'm gone. Thanks.

Play music? That sounds really cool! :-)

David Betz · 2014-05-21 19:30

Roy Eltham wrote: »

Bill,
I understand the hubexec is a big perf gain and memory savings. Chip understands that too. He's just not happy with it as it currently stands. Like I said before, if there was some simpler/cleaner way to achieve the performance without doing the full hubexec, then that would be more acceptable, I think. Perhaps if hubexec could be done in some cleaner more elegant way that fit in with the rest of the chip better, he'd like that too.

I certainly am not saying that hubexec is out or won't be done. In fact, I think ultimately we will end up with some form of hubexec. It won't be in this first FPGA image, and it may take a bunch of back and forth over what's really needed.

I, also, think we all need to step outside of the box we have constructed around ourselves of this version of hubexec, and then maybe we might see another path that fits better with the current P2 design. Like, what if there is a way we could get normal cog native PASM (self modifying and all) to work with HUB memory at half or quarter cog native speed? Maybe there isn't a solution better than what we have, but it's worth thinking about.

There are certain features of executing instructions from the hub that can't really be done in any other way. For example, we need a PC with more bits to address the additional memory. Because of that, we need a way to express those larger addresses in instructions. That could be done using AUGS (is that even still in P2?) but it would mean every CALL or JMP would be 64 bits. We also need a way to load constants since a Propeller instruction can only manage 9 bits as it stands. There are LMM macros to load 32 bit constants but those won't be usable when executing code from the hub. I believe those are really the only problems that need to be solved. The rest is stuff that got added because it was "nice" and/or because "it only takes a few gates".
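A small Python sketch of the constant-loading problem. It assumes that a prefix word such as AUGS would carry the upper 23 bits and that the following instruction would keep its usual 9-bit field; the thread does not state the actual encoding, so treat the split as hypothetical.

```python
IMM_BITS = 9    # immediate field of a Propeller instruction

def split_constant(value):
    """Split a 32-bit constant into (prefix payload, 9-bit immediate)."""
    value &= 0xFFFFFFFF
    return value >> IMM_BITS, value & ((1 << IMM_BITS) - 1)

def join_constant(prefix_payload, immediate):
    """Rebuild the 32-bit constant from the two fields."""
    return ((prefix_payload << IMM_BITS) | immediate) & 0xFFFFFFFF

payload, imm = split_constant(0x12345678)
print(hex(payload), hex(imm))             # 0x91a2b 0x78
print(hex(join_constant(payload, imm)))   # 0x12345678
```

Under that assumption a full 32-bit value costs two 32-bit words, which is where the 64 bits per CALL or JMP comes from.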

Bill Henning · 2014-05-21 19:33

potatohead wrote: »

I have the same objections Roy does, and agree with Chip on not wanting to do hubex. I really dislike creating a memory zone in the COG, and a cut to 256 registers. Really, why not just ask for an entirely different CPU that has nothing at all to do with a COG as we know it today?

Personally, I have come to the realization that we are talking about three different CPUs

There is the one Chip would build, and the one Cluso wants, and another one that runs big code from the HUB generally.

Packing all of that into what needs to be one CPU is where all the pain, punishment and complexity is. Going down that road is highly undesirable.

Above is your personal opinion.

potatohead wrote: »

Given we have a third of the instructions dedicated to this hubex,

FUD, BS.

Proof please.

potatohead wrote: »

I wonder if it does not make more sense to make an instruction or two that really make LMM shine?

Ok, show us how, and show a model for it, with cycles.

I do not believe it is possible. See my models earlier.

potatohead wrote: »

Or maybe make an LMM state machine type thing that does what a software kernel does,

It's called hubexec.

potatohead wrote: »

and when it bumps into one of the opcodes that need to be ignored, such as JMP, have it run COG code to handle, then carry on?

No handwaving, unsupported theories, please. Show us how it would work, and how fast it would be.

potatohead wrote: »

If hubex can be done without making a complete mess, fine, but I see an increasingly messy set of ideas here.

Chip/Roy made the new hub version. The "messy" stuff is what is needed for decent large code performance.

potatohead wrote: »

Hey, there is a LUT now! Can we also make it a stack, run code from it, god knows what else?

If there is such a LUT, and extra capabilities that help the prop take little extra logic, they should be added; otherwise the LUT is wasted silicon when not in use.

Personally, I'd be fine with raster display cogs using registers 0..255 as the lut, or having say four lut's in the hub with four video circuits, and leaving that stuff out of the cogs.

potatohead wrote: »

The HUB scheme is different. Can we just make really big COGS and ignore that? Or can we include a mode for every possible scenario we might have to think about otherwise?

Look a FIFO! No, it needs to be two of them, and we need lots of options because... well, because!

Might as well call those GIGOs. Or maybe we can combine it all into a MIMO GIGO processor that does everything except for just process!

Sheesh...

I'm off to some camping. Make sure it doesn't also play music with spare HUB slots while I'm gone. Thanks.

Take Chip with you, he needs a break too

FYI, two fifo's are not needed. Chip asked if there was any use for two, I showed one use. I was not saying put it in, just answering Chip.

RossH · 2014-05-21 19:35

David Betz wrote: »

I agree with you. I only mentioned it to see if anyone else thought that the inability to reach any COG address might be a problem.

Yes, it's a problem. Not impossible to work around, but it would definitely be perceived as a problem, because (and I speak from experience here) it's very frustrating when you are constantly working with no spare longs that you have to add a jump instruction because the destination of your relative jump is too far away!

Ross.

RossH · 2014-05-21 19:42

I'd like to add my vote to those who think that if HubExec is getting too complex, it can and should be ditched for this next chip. Perhaps it can be added to a later version.

HubExec was always just a "nice to have". It is not the "natural" mode of execution of the Propeller, and is not likely to become so.

Ross.

David Betz · 2014-05-21 19:46

RossH wrote: »

I'd like to add my vote to those who think that if HubExec is getting too complex, it can and should be ditched for this next chip. Perhaps it can be added to a later version.

HubExec was always just a "nice to have". It is not the "natural" mode of execution of the Propeller, and is not likely to become so.

Ross.

Before you dismiss it entirely, I'd like to mention one feature of hubexec besides just improving execution speed. Removing the LMM kernel and its associated "macros" from COG memory leaves a lot more COG memory available for other purposes like perhaps a CMM interpreter that could be used to improve code density in areas of the code where performance isn't as critical. It also provides space for fast library functions to live in COG memory instead of slower hub memory. Anyway, there are advantages beyond just the increase in instruction execution speed.

Bill Henning · 2014-05-21 19:54

NOTE: WITHOUT FIFO OR CACHES, HUBEXEC IS SAME PERFORMANCE AS LMM ... ie 12.5MIPS max @ 200MHz ... which SUCKS

WITH FIFO / CACHES HUBEXEC IS 2X TO 4X LMM PERFORMANCE!

1) Absolute minimum hubexec requirements: (this version is NOT good, wastes too much memory)

Chip's FIFO

LR link register just below I/O registers, fixed location, treat as special register

JMP D/#
CALL D/# ... no need for RET, JMP LR is RET

LOCPTRA #
LOCPTRB #

AUGS

Total of five instructions. (two have two addressing modes) ... seven binary opcodes total

2) Slightly better version, relative addressing is needed

Chip's FIFO

LR link register just below I/O registers, fixed location, treat as special register

JMP D/#/@
CALL D/#/@ ... no need for RET, JMP LR is RET

LOCPTRA #/@ ... allows limited relocatable data
LOCPTRB #/@

AUGS
AUGD

Total of six instructions (two have three addressing modes, and two have two addressing modes) ... twelve binary opcodes

OPTIONS

A) For large assembly language programs add CALLP and RETP based on INDA stack in cog, 'P' is for PASM

CALLP D/#/@
RETP

Adds two instructions (one has three addressing modes) ... four binary opcodes

C) For C and other stack frame language support add CALLC and RETC based on PTRB stack in hub

CALLC D/#/@ ... if GCC does not use CALLC/RETC I guarantee other compilers will
RETC

LOCINST #/@ ... for function pointers

Adds three instructions (one has three addressing modes, one has two addressing modes) ... six binary opcodes

Maximum current proposal (2) + (A) + (C) = 11 instructions!

HOW THE FRUIT IS THAT 1/3 TO 1/2 OF ALL THE INSTRUCTIONS???? I AM TIRED OF THE FUD BEING SPREAD!

IF YOU WANT TO COUNT ADDRESSING MODE VARIATIONS YOU HAVE TO COUNT D/# AS TWO FOR ALL INSTRUCTIONS THAT SUPPORT IT!

Even the old P2 variations with many more CALL/RET variants did not take 1/3 of all the opcodes.
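The counts above can be checked with a short Python tally. The lists copy the instructions and addressing-mode counts from version 2, the PASM stack option, and the C stack-frame option; the helper name is just for illustration.

```python
# (instruction, number of addressing modes) as listed in the proposal
VERSION_2 = [("JMP", 3), ("CALL", 3), ("LOCPTRA", 2), ("LOCPTRB", 2),
             ("AUGS", 1), ("AUGD", 1)]
PASM_STACK_OPTION = [("CALLP", 3), ("RETP", 1)]
C_STACK_OPTION = [("CALLC", 3), ("RETC", 1), ("LOCINST", 2)]

def tally(*groups):
    """Return (instruction count, binary opcode count) over the given groups."""
    items = [item for group in groups for item in group]
    return len(items), sum(modes for _, modes in items)

print(tally(VERSION_2))                                     # (6, 12)
print(tally(PASM_STACK_OPTION))                             # (2, 4)
print(tally(C_STACK_OPTION))                                # (3, 6)
print(tally(VERSION_2, PASM_STACK_OPTION, C_STACK_OPTION))  # (11, 22)
```

So the full proposal is 11 instructions, or 22 binary opcodes if every addressing mode is counted separately.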

NICE TO HAVE, BUT NOT NECESSARY

- allowing INDA/INDB stacks, PTRA/PTRB stacks ... one of each is enough
- more LOCxxx variations for byte, word etc

NOTE

THE TWO ADDRESS MODE INSTRUCTIONS SHARE THE SAME LOGIC FOR THE TWO ADDRESSING MODES

THE THREE ADDRESS MODE INSTRUCTIONS SHARE THE SAME LOGIC FOR THE THREE ADDRESSING MODES

IF YOU WANT TO DISPUTE ANY OF THE ABOVE, YOU MUST PROVIDE ANALYSIS SHOWING YOUR REASONING, OTHERWISE YOU ARE SPREADING FUD

Cluso99 · 2014-05-21 20:03

RossH wrote: »

Yes, it's a problem. Not impossible to work around, but it would definitely be perceived as a problem, because (and I speak from experience here) it's very frustrating when you are constantly working with no spare longs that you have to add a jump instruction because the destination of your relative jump is too far away!

Ross.

Ross,
We are referring to only these instructions...
DJZ, DJNZ, TJZ, TJNZ, TJS, TJNS, JP, JNP D,S/@
In Chip's last P16X64A (16 Apr) Instruction map, these instructions had both relative and absolute modes.

JMPSW D/@ was also there. Not sure about this one.

All JMP & CALLx #abs/@rel have both relative and absolute modes.

SO, if Chip implements these, we are good to go on hubexec and use extended cog ram (LUT) or whatever.

Whether you want it or not, why waste the mainly unused LUT when it could be used for extra cog RAM? IMHO it is not going to take many gates.

RossH · 2014-05-21 20:08

David Betz wrote: »

Before you dismiss it entirely, I'd like to mention one feature of hubexec besides just improving execution speed. Removing the LMM kernel and its associated "macros" from COG memory leaves a lot more COG memory available for other purposes like perhaps a CMM interpreter that could be used to improve code density in areas of the code where performance isn't as critical. It also provides space for fast library functions to live in COG memory instead of slower hub memory. Anyway, there are advantages beyond just the increase in instruction execution speed.

I'm not dismissing it. In fact, I'm still generally in favor of it. But not at any cost.

Ross.

David Betz · 2014-05-21 20:10

Can someone remind me what LOCPTRA and LOCPTRB do? I guess all of the complex addressing that goes along with the PTRA and PTRB registers is still in the P2 design?

Cluso99 · 2014-05-21 20:12

Bill,
What did LOC do again?
Did it replace the next instruction's S operand with LOC's immediate 17 bit address ???
So a DJNZ <reg>,<goto> would become a 17bit <goto> ?

RossH · 2014-05-21 20:19

Bill Henning wrote: »

NOTE: WITHOUT FIFO OR CACHES, HUBEXEC IS SAME PERFORMANCE AS LMM ... ie 12.5MIPS max @ 200MHz ... which SUCKS

You don't have to shout, Bill. The thing is that not everyone agrees with you that this outcome "SUCKS".

For myself, I'd be happy with this level of performance from the next Propeller chip. And so would most people (as agreed in this thread).

Ross.

Bill Henning · 2014-05-21 20:19

David,

Latest indication from Chip was that PTRA/PTRB survived with at least pre/post increment and decrement modes, but the indexing may move to a prefix INDEX instruction.

WRLONG var,--ptra ' push onto stack
RDLONG var,ptra++ ' pop from stack

An index would allow accessing variables on the stack frame directly.
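For anyone who wants to see the pointer arithmetic spelled out, here is a minimal Python model of that push/pop pair. It assumes 4-byte little-endian longs, a pointer that moves by 4 per access, and an index scaled in longs; the class and method names are made up for the example, and the real PTRA modes may differ.

```python
class HubStack:
    """Descending stack in a byte-addressed memory, modelled on --ptra / ptra++."""

    def __init__(self, size_bytes, top):
        self.mem = bytearray(size_bytes)   # stand-in for hub RAM
        self.ptra = top                    # stack pointer, starts at the top

    def push(self, value):
        """WRLONG var,--ptra : pre-decrement, then write."""
        self.ptra -= 4
        self.mem[self.ptra:self.ptra + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")

    def pop(self):
        """RDLONG var,ptra++ : read, then post-increment."""
        value = int.from_bytes(self.mem[self.ptra:self.ptra + 4], "little")
        self.ptra += 4
        return value

    def peek(self, index):
        """Indexed read of the long at ptra + 4*index, leaving ptra unchanged."""
        addr = self.ptra + 4 * index
        return int.from_bytes(self.mem[addr:addr + 4], "little")

stack = HubStack(size_bytes=64, top=64)
stack.push(111)
stack.push(222)
print(stack.peek(0), stack.peek(1))   # 222 111
print(stack.pop(), stack.pop())       # 222 111
```

The `peek` method is the part an INDEX prefix would provide: reading a stack-frame variable without disturbing the pointer.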

Ray,

LOCPTRA #/@ ' set PTRA to embedded 17 bit long absolute or relative address
LOCPTRB #/@ ' set PTRB to embedded 17 bit long absolute or relative address

You are thinking of the AUGS prefix for source, and AUGD for destination.

ALL

These absolute/relative 17 bit instructions will save a TON of hub memory... over 10%

Bill Henning · 2014-05-21 20:21

Ross,

For the educational market, it might be OK.

For larger embedded projects, that need large code, 12.5MIPS is pathetic. I thought the idea was to break into new markets.

I was shouting because I am tired of FUD and unreasoned arguments from some (not you).

RossH wrote: »

You don't have to shout Bill. The thing is that not everyone agrees with you that this outcome "SUCKS".

For myself, I'd be happy with this level of performance from the next Propeller chip. And so would most people (as agreed in this thread).

Ross.

Bill Henning · 2014-05-21 20:22

FYI ALL

I am in favor of an FPGA release real soon, hubexec can wait for a couple of weeks for the next FPGA release, or whenever Chip has time for it.

Roy Eltham · 2014-05-21 20:43

Bill,
I was going by the instruction list Chip shared with me. It has all the individual binary opcodes listed. There are far fewer total instructions in this design than the previous one. The hubexec ones Chip had marked looked to be at least 1/3 of the list. Maybe I misunderstood? Maybe he had more stuff than you listed, some of which maybe isn't needed. I was just sharing what I saw.

You don't need to spew forth all caps bolded page long messages stating a bunch of stuff we already know. You come across as standing there yelling at us all like we are children. We all get it that hubexec with fifo/cache is faster than LMM. We all know it's preferred to get the better performance (except maybe RossH) for C via some form of hubexec. I was just sharing what Chip has said (he's said it here on these forums recently too) about not being happy with hubexec and its complexity and such.

Also, I'm getting tired of people throwing around FUD in here at every turn. I know I used it once or twice a bit ago, but it's really tiring hearing everyone spew it for everything and it's really pointless. Why? Because when it comes down to it, no one knows for sure what we will end up with in the P2 or performance of LMM or hubexec or anything. Even you with all your spreadsheet calcs and matter of fact preaching about analysis and evidence can't really "prove" anything at all about the actual real chip we will end up with. You can only make educated estimations based on what we think it will be. So stop being so harsh and forceful. Share your information and ideas, discuss things with everyone, and let's all work towards getting the best P2 we can by helping Chip as much as possible. I know it will mean disagreement, but it doesn't have to mean telling everyone they are wrong and spreading FUD and demanding evidence and proof of everything in all caps and bolded!

Also, I would bet hubexec will take longer than a "couple weeks" to get into the FPGA image after the initial one is delivered.

Phil Pilgrim (PhiPi) · 2014-05-21 20:45

Bill,

How did you come up with 12.5 MIPS? Please show an example of the LMM loop that operates at that speed.

Thanks,
-Phil

Roy Eltham · 2014-05-21 20:51

It's based on getting one long every 16 clocks, so at 200 MHz you get 12.5 MIPS. It's worse than 12.5 MIPS when you factor in branching and other LMM stuff.
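The arithmetic, as a quick Python check. It models only the fetch limit (one hub long per 16 clocks, one instruction per long) and ignores branching and kernel overhead, so it is an upper bound rather than a measured figure; the function name is illustrative.

```python
def peak_mips(clock_hz, clocks_per_long):
    """Upper bound in MIPS when each instruction needs one hub long
    and a long arrives every `clocks_per_long` clocks."""
    return clock_hz / clocks_per_long / 1e6

print(peak_mips(200_000_000, 16))   # 12.5
```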

Although, I think LMM with the FIFO would give quite a bit more performance. Bill's earlier post shows that.

Invent-O-Doc · 2014-05-21 20:57

FUD FUD FUD FUD FUD FUD.......

(passing around some FUD)
