Hub Execution Model Thread (split from blog)

Bill Henning · 2013-12-03 09:24

HUB EXECUTION MODEL:

Note - these instructions have normal conditional execution bits :-)

Any cog mode jump or cog mode call outside of the OCTL window exits hub execution mode, with PTRA pointing to next hub instruction

This would make the P2 very competitive with (actually, due to its 8 cores, totally outclass) ARM chips without hardware floating point that run at up to 160MHz

It would also save Parallax the development cost of a quad-long based VLIW style GCC port (at a guess, about $250K)

It would be useful to standardize on the 8-long cache being mapped to $1E0, as then instructions within the cache could refer to constants embedded in the code at known locations.

HJMP D/#addr

TTTTTTT ZC I CCCC AAAAAAAAA AAAAAAAAA

Enters hub-exec mode if in cog mode

If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets ptra to scaled address, fetches OCTL, jumps to first instruction in octl)

If not immediate address, jumps to address in D (sets ptra to D, fetches OCTL, jumps to first instruction in octl)

Setting C and Z does not make sense for HJMP, so could be used as additional address bits, or relative jumps

C could indicate add AAAAAAAAAAAAAAAAAA00 to PTRA (forward relative jump)
Z could indicate subtract AAAAAAAAAAAAAAAAAA00 from PTRA (backward relative jump)

Relative jumps would be helpful for position independent code.
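
As a rough illustration of the HJMP addressing described above, here is a minimal Python sketch. It is only a model of the proposal, not real hardware behaviour. The function name `hjmp_target`, the 20-bit wrap mask, and the use of C/Z as forward/backward selectors are assumptions taken from the suggestions in this post.

```python
# Hypothetical model of the proposed HJMP target calculation (not real hardware).
ADDR_BITS = 18                 # width of the A field in the encoding shown above
ADDR_MASK = (1 << 20) - 1      # assumed: 18 field bits plus two implied zero bits


def hjmp_target(ptra, field, immediate=True, d_value=0, c=False, z=False):
    """Return the new PTRA after an HJMP."""
    if not immediate:
        # Register form: jump to the hub address held in D.
        return d_value & ADDR_MASK
    # Immediate form: the field is a long index, so scale it by 4 (AAAA...A00).
    scaled = (field & ((1 << ADDR_BITS) - 1)) << 2
    if c:
        return (ptra + scaled) & ADDR_MASK   # suggested forward relative jump
    if z:
        return (ptra - scaled) & ADDR_MASK   # suggested backward relative jump
    return scaled                            # plain absolute jump
```

For example, `hjmp_target(0x100, 4, c=True)` gives `0x110`, while the same field with `z=True` gives `0xF0`.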

HCALL D/#addr

TTTTTTT ZC I CCCC AAAAAAAAA AAAAAAAAA

AUX = ++PTRA

Saves next hub instruction address value onto the AUX stack using --SPA, then jumps to the target address as described below

It would also be very desirable to be able to enter hub-exec mode with HCALL, as then cog-only mode would be able to execute library code from the hub

This would largely eliminate the cog memory limitation; note all cogs could share hub subroutine libraries.

If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets ptra to scaled address, fetches OCTL, jumps to first instruction in octl)

If not immediate address, jumps to address in D (sets ptra to D, fetches OCTL, jumps to first instruction in octl)

WC could be applied to set a flag in case stack wraps around
WZ could be applied to set a flag if there is a stack collision with SPB

HRET {#offset}

TTTTTTT ZC I CCCC AAAAAAAAA AAAAAAAAA

PTRA = AUX[SPA++] + offset
execute instruction in cog memory right after the HJMP that entered hub-exec mode

It would be highly desirable that if hub code was invoked with HCALL, that the HRET would go back to cog execution mode - see explanation in HCALL

Offset is scaled by 4, normally 0, but could be used to pop up several levels - think exceptions; of course SUBSPA #offset would do the same

WC could be applied to set a flag in case stack wraps around
WZ could be applied to set a flag if there is a stack collision with SPB
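
The HCALL/HRET pairing on the AUX stack can be sketched in Python as below. This is a model under stated assumptions: the AUX size of 256 longs, the class name `AuxStack`, and the way the wrap flag is computed are all illustrative guesses at what WC might report, not part of the proposal.

```python
AUX_LONGS = 256  # assumed AUX size; the post does not state it


class AuxStack:
    """Hypothetical model of HCALL/HRET using the AUX stack and SPA."""

    def __init__(self, spa=0):
        self.aux = [0] * AUX_LONGS
        self.spa = spa
        self.wrapped = False  # what WC might report (assumption)

    def hcall(self, return_addr, target):
        """Push the next hub instruction address using --SPA, return new PTRA."""
        new_spa = (self.spa - 1) % AUX_LONGS
        self.wrapped = new_spa > self.spa     # pre-decrement went below zero
        self.spa = new_spa
        self.aux[self.spa] = return_addr
        return target

    def hret(self, offset=0):
        """PTRA = AUX[SPA++] + offset, with the offset scaled by 4."""
        value = self.aux[self.spa]
        new_spa = (self.spa + 1) % AUX_LONGS
        self.wrapped = new_spa < self.spa     # post-increment went past the top
        self.spa = new_spa
        return value + offset * 4
```

A call followed by a plain `hret()` returns to the saved address; a non-zero offset returns that many longs past it.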

Loading Constants

RDLONG reg,ptra++
long constant
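
Why this works: PTRA doubles as the hub program counter, so the post-increment both fetches the inline constant and steps execution past it. A minimal Python sketch of that idea follows; the helper name `rdlong_ptra_postinc` and the bytearray standing in for hub RAM are illustrative assumptions.

```python
def rdlong_ptra_postinc(hub, ptra):
    """Model RDLONG reg,ptra++: read the long at PTRA, then step PTRA past it.

    `hub` is a bytes-like stand-in for hub RAM; longs are little-endian.
    """
    value = int.from_bytes(hub[ptra:ptra + 4], "little")
    return value, ptra + 4


# Usage: the constant sits in the long right after the RDLONG instruction.
hub = bytearray(16)
hub[4:8] = (0xDEADBEEF).to_bytes(4, "little")
reg, next_ptra = rdlong_ptra_postinc(hub, 4)
# reg now holds the constant; next_ptra points at the instruction after it
```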

Loading Variables

Assuming the cache is visible at $1E0

1e0: rdlong reg,$1E1 ' any two consecutive longs of the eight in the cache visible at $1E0, could be word, byte
1e1: long 23 bit address

Saving Variables

1e0: wrlong reg,$1E1 ' any two consecutive longs of the eight in the cache visible at $1E0, could be word, byte
1e1: long 23 bit address

Limitations

- REPxxx loops must fit in the 8 long OCTL cache
- DJNZ and friends must fit in the 8 long window
- any type of jump/loop or call that is not HJMP / HCALL / HRET exits hub execution mode
- RDxxxxC and WRxxxxC instructions must not be used in hub execute model
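
The first two limitations amount to a placement check that an assembler or compiler could do. Here is a minimal Python sketch, assuming the 8-long windows are aligned on 32-byte boundaries (the post does not say so explicitly); `fits_in_window` is a hypothetical helper name.

```python
WINDOW_BYTES = 8 * 4  # one OCTL: eight longs


def fits_in_window(start, length_longs):
    """True if `length_longs` longs starting at byte address `start`
    stay inside a single 8-long window (assumed 32-byte aligned)."""
    if not 1 <= length_longs <= 8:
        return False
    end = start + length_longs * 4 - 1
    return start // WINDOW_BYTES == end // WINDOW_BYTES
```

A loop that fails this check would need padding to the next window boundary, or would have to run from cog memory instead.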

Possible Improvements

- it should be possible to support calling cog subroutines from hub execution mode using JMPRET, as long as they can return to hub execution mode
- adding a CSEG register that is added to all HJMP/HCALL addresses would eliminate the need for relative jumps
- adding a DSEG register for non-HJMP/HCALL/HRET hub references would also allow relocatable data

- it would be relatively easy to write two cog subroutines for HCALLH and HRETH that would use a hub based stack via PTRB for code that needed a large stack
- in hub stack mode, stack variables can be referenced with indexes of PTRB

- by writing a small relocating loader, it would be possible to support multiple HUBEXEC C programs at the same time, running in different cogs

Folks, with this the P2 is no longer just a microcontroller - it is also a full fledged microprocessor!

HISTORY:

- with the new process, the DAC bus would no longer fit
- removing the DAC bus allowed chip to increase the hub to 256KB
- increasing the hub to 256KB made Chip think of RDOCTL/WROCTL
- jazzed suggested trying to run directly out of the hub
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223354&viewfull=1#post1223354
- Chip tried not to think of executing from the hub
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223807&viewfull=1#post1223807
- I could not help thinking about it, as LMM came about from wanting to run code from the hub
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG/page175
http://forums.parallax.com/showthread.php/89640-ANNOUNCING-Large-memory-model-for-Propeller-assembly-language-programs!
- Chip started thinking about it... including auto-loading sequential 8-long thunks
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223818&viewfull=1#post1223818
- initially, I considered 8-long thunks, VLIW style
- Chip suggested supporting relative jumps
- David asked if Chip successfully avoided thinking about executing from the hub, but Chip thought some more about it, and more
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG/page175
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223922&viewfull=1#post1223922
- the 8-long grain for hub jumps and calls bothered me, so I finally proposed HUBEXEC, HJMP, HCALL, HRET
http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG/page179

The rest, shall we say, is history - read the thread starting at Chip's post above to see the great discussion that ensued!

The Bright Future

For the P3, using DDR2/3/4/+, this model could be extended to XEXEC - bringing in a small cache of longs, and executing in the same manner as the HUB EXEC model.

By adding CSEG and DSEG registers, and ideally an SSEG (for external stack) with limit registers on each... we could have essentially unlimited memory, and port Linux, as the segment/limit register pairs will effectively act like a per-cog MMU.
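
To make the segment/limit idea concrete, here is a minimal Python sketch of base-plus-limit translation. It is generic segment arithmetic rather than a description of any planned hardware; `translate` and `SegmentFault` are hypothetical names.

```python
class SegmentFault(Exception):
    """Raised when an offset falls outside its segment's limit."""


def translate(offset, base, limit):
    """Map a program-relative offset to a physical address.

    `base` plays the role of CSEG/DSEG/SSEG, `limit` is the paired limit
    register: offsets must satisfy 0 <= offset < limit.
    """
    if not 0 <= offset < limit:
        raise SegmentFault(f"offset {offset:#x} outside limit {limit:#x}")
    return base + offset
```

Relocating a program is then just a matter of loading a different base; the code itself keeps using the same offsets.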

Bill Henning · 2013-12-03 09:25

Summary of revised HUBEXEC instructions, including suggested instruction bit encoding:

Link register versions for easy GCC support

AUX stack versions for Spin, other virtual machines, and other compilers for greater performance

Instructions with embedded 18 bit hub address (lowest 2 bits zero due to long alignment, and implied)

TTTTTTT ZC I CCCC jjAAAAAAAAA AAAAAAA

where jj select between HJMP / HCALL / HCALLA / HCALLB

TTTTTTT ZC I CCCC 00AAAAAAAAA AAAAAAA HJMP
TTTTTTT ZC I CCCC 01AAAAAAAAA AAAAAAA HCALL
TTTTTTT ZC I CCCC 10AAAAAAAAA AAAAAAA HCALLA
TTTTTTT ZC I CCCC 11AAAAAAAAA AAAAAAA HCALLB

TTTTTTT is the seven bit op code to be assigned by Chip

ZC,I,CCCC as normal P1/P2 usage

jj selects between the four hub-address instructions
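
The bit layout above can be packed and unpacked as in the following Python sketch. It reads the fields left to right from bit 31 (7-bit opcode, Z, C, I, 4-bit condition, 2-bit jj, 16-bit long address). The opcode value is a placeholder since it is still to be assigned, and the function names are illustrative.

```python
MNEMONICS = ("HJMP", "HCALL", "HCALLA", "HCALLB")  # indexed by jj


def encode(opcode7, z, c, i, cond, jj, addr):
    """Pack one hub-address instruction; `addr` is a long-aligned byte address."""
    if addr & 3:
        raise ValueError("hub address must be long aligned")
    field = addr >> 2                      # low two bits are implied zeros
    if not 0 <= field < (1 << 16):
        raise ValueError("address does not fit in 16 field bits")
    return (((opcode7 & 0x7F) << 25) | ((z & 1) << 24) | ((c & 1) << 23)
            | ((i & 1) << 22) | ((cond & 0xF) << 18) | ((jj & 3) << 16) | field)


def decode(word):
    """Unpack the fields of a 32-bit instruction word."""
    return {
        "opcode": (word >> 25) & 0x7F,
        "z": (word >> 24) & 1,
        "c": (word >> 23) & 1,
        "i": (word >> 22) & 1,
        "cond": (word >> 18) & 0xF,
        "mnemonic": MNEMONICS[(word >> 16) & 3],
        "addr": (word & 0xFFFF) << 2,      # restore the implied zero bits
    }
```

Sixteen field bits plus the two implied zeros give an 18 bit byte address, which covers the 256KB hub.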

HJMP D/#addr

TTTTTTT ZC I CCCC 00AAAAAAAAA AAAAAAA HJMP

Enters hub-exec mode if in cog mode

Exits hubexec mode and jumps to cog address if address < $1E0, or assign a unique op-code to HRET below instead of aliasing to HJMP

If in HUBEXEC mode

If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets ptra to scaled address, fetches OCTL, jumps to first instruction in octl)

If not immediate address, jumps to address in D (sets ptra to D, fetches OCTL, jumps to first instruction in octl)

Setting C and Z does not make sense for HJMP, so could be used as additional address bits, or relative jumps

C could indicate add address to the program counter (forward relative jump)
Z could indicate subtract address from the program counter (backward relative jump)

Relative jumps would be helpful for position independent code.

HCALL D/#addr

TTTTTTT ZC I CCCC 01AAAAAAAAA AAAAAAA HCALL

LR = ++PC

Saves next hub instruction address value into a link register, then PC = specified address

I strongly recommend using location $1F0 as the link register; that way compiled code can rely on its known location, and there would be no need for a 'SETLR reg' instruction or linker magic. KISS principle

If immediate address, jumps to AAAAAAAAAAAAAAAAAA00 (sets program counter to scaled address, fetches OCTL, jumps to first instruction in octl)

If not immediate address, jumps to address in D (sets program counter to D, fetches OCTL, jumps to first instruction in octl)

WC could be applied to set a flag in case stack wraps around
WZ could be applied to set a flag if there is a stack collision with SPB

It would also be very desirable to be able to enter hub-exec mode with HCALL, as then cog-only mode would be able to execute library code from the hub

This would largely eliminate the cog memory limitation; note all cogs could share hub subroutine libraries.

HCALLA D/#addr

TTTTTTT ZC I CCCC 10AAAAAAAAA AAAAAAA HCALLA

AUX = ++PC

Saves next hub instruction address value onto the AUX stack using --SPA, then PC = address

Only difference from HCALL is using a hardware stack instead of a link register

HCALLB D/#addr

TTTTTTT ZC I CCCC 11AAAAAAAAA AAAAAAA HCALLB

AUX = ++PC

Saves next hub instruction address value onto the AUX stack using --SPA, then PC = address

Only difference from HCALLA is using the SPB pointer

Zero or Single operand instructions (bit pattern to be assigned by Chip, a lot are available)

HRET

No opcode needed; exactly equivalent to

HJMP $1F0 ' proposed fixed Link register, can be used as regular general purpose register when not in hubexec mode

If desired, assign unique opcode that jumps to hub address in $1F0

HRETA

Op code to be assigned

PC = [SPA++]

HRETB

Op code to be assigned

PC = [SPB++]

NOTE:

allowing a direct 9 bit constant in the source field would allow cleaning up the stack by removing local variables and arguments.

Example:

HRETA #-9

SPA -= 9 before popping the return address

Instructions with embedded 23 bit constant

Opcode encoding to be assigned by Chip.

BIG #const23

Suggested by David; as per Chip's or David's usage, it extends the 9 bit immediate constants in instructions to a full 32 bits.

It may be useful to allocate $1F1 as the "BIG" value register, and store the created 32 bit constant in it, so subsequent instructions can use it.

Example:

RDLONG reg,#const32 ' assembler replaces with RDLONG / BIG pair as per David's suggestion

mul reg, #5

add reg,3

WRLONG reg, $1F1 ' saves one long as address already computed in 'big' register

Such code is VERY common, so the potential for savings is significant.
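
How a 23 bit prefix and a 9 bit immediate could combine into a 32 bit value is sketched below in Python. The split (BIG supplying the upper 23 bits, the following instruction's immediate supplying the lower 9) is an assumption; the post only says the pair yields a full 32 bits. Function names are illustrative.

```python
def split_const32(value):
    """What an assembler might do: split a 32-bit constant into
    (const23 for BIG, imm9 for the following instruction)."""
    return (value >> 9) & 0x7FFFFF, value & 0x1FF


def big_immediate(const23, imm9):
    """What the hardware might do: rebuild the 32-bit value.

    Assumes BIG holds the upper 23 bits and the instruction the lower 9.
    """
    return ((const23 & 0x7FFFFF) << 9) | (imm9 & 0x1FF)


hi, lo = split_const32(0xDEADBEEF)
rebuilt = big_immediate(hi, lo)   # round-trips back to the original constant
```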

Bill Henning · 2013-12-03 09:25

Reserved for sample code

Snippet #1: Load & Execute a big block of cog code (think FCACHE)

' fcache equivalent, in-line, takes 5 longs

	setinda	#0
	reps  	#$1E0,#1
	setptrb	from_ptr
	rdlong	inda++,ptrb++
	jmp   	 #0
	nop	' next hub instruction, does not need to be a NOP
	nop
	nop

' loaded block returns with
	
	hjmp	ptra

Snippet #2: Dump a big block of cog code (think debugger)

' inverse fcache - swap out the cog contents, useful when debugging

	setinda	#0
	reps 	#len,#1
	setptrb	to_ptr
	wrlong	inda++,ptrb++
	jmp   	#0
	nop	' next hub instruction, does not need to be a NOP
	nop
	nop

David Betz · 2013-12-03 09:39

I assumed PTRA/PTRB would be extended from the current 17 bits to 18 just to support the hub RAM

It should be fairly easy to modify GCC for this mode (famous last words)

(Let's take this discussion to the HUBEXEC thread)

Using PTRx would of course mean that you couldn't have two HUB threads in the same COG. I guess that wouldn't perform very well anyway since each would trash the other's icache.

David Betz · 2013-12-03 09:42

Don't we still have a potential conflict between the use of the hub slot to fill the icache and a RDxxxx or WRxxxx instruction trying to access the hub at the same time? One would have to stall I guess.

Dave Hein · 2013-12-03 09:47

Why not just implement a multilevel cache? If the data or instruction is not in the cog cache fetch it from the hub cache. If it's not in the hub cache then fetch it from external memory.

EDIT: Get rid of dedicated hub access windows, and use a smarter hub bus arbiter.

Bill Henning · 2013-12-03 09:53

Exactly. Only one HUBEXEC thread per cog (for P2) - we would need several OCTL caches otherwise.

pthreads will work nicely!

David Betz wrote: »

Using PTRx would of course mean that you couldn't have two HUB threads in the same COG. I guess that wouldn't perform very well anyway since each would trash the other's icache.

Bill Henning · 2013-12-03 09:55

I was worried about that too, but in another thread Chip said there is no conflict!

http://forums.parallax.com/showthread.php/152070-To-cache-or-not-to-cache...-musings-on-improving-Spin-LMM-performance

The non-C RD/WR's don't clobber the cache.

David Betz wrote: »

Don't we still have a potential conflict between the use of the hub slot to fill the icache and a RDxxxx or WRxxxx instruction trying to access the hub at the same time? One would have to stall I guess.

Bill Henning · 2013-12-03 09:56

Proper caching is a great idea for P3, but too much to bite off in the time we have available (while Beau is changing the transistor size as needed for the new process)

For the sake of determinate timing (needed for hard real time) we need to keep hub windows.

Dave Hein wrote: »

Why not just implement a multilevel cache? If the data or instruction is not in the cog cache fetch it from the hub cache. If it's not in the hub cache then fetch it from external memory.

EDIT: Get rid of dedicated hub access windows, and use a smarter hub bus arbiter.

David Betz · 2013-12-03 09:56

Bill Henning wrote: »

I was worried about that too, but in another thread Chip said there is no conflict!

http://forums.parallax.com/showthread.php/152070-To-cache-or-not-to-cache...-musings-on-improving-Spin-LMM-performance

The non-C RD/WR's don't clobber the cache.

I'm not worried about clobbering the icache. I'm more worried that two stages of the pipeline might need to access the hub slot at the same time, the icache fill logic and the normal RDxxxx/WRxxxx instruction handling.

Edit: Maybe that can be handled by saying that icache fill always wins the arbitration and stalls the pipeline until the cache line is read. That will also stall the RDxxxx/WRxxxx instruction.

Bill Henning · 2013-12-03 10:00

I am not worried about it. At a guess, Chip will queue hub slot accesses - whether from embedded RD/WR or cache reload - and handle them one at a time.

David Betz wrote: »

I'm not worried about clobbering the icache. I'm more worried that two stages of the pipeline might need to access the hub slot at the same time, the icache fill logic and the normal RDxxxx/WRxxxx instruction handling.

Edit: Maybe that can be handled by saying that icache fill always wins the arbitration and stalls the pipeline until the cache line is read. That will also stall the RDxxxx/WRxxxx instruction.

potatohead · 2013-12-03 10:03

I've been thinking about the HUB PASM model some.

Basically, it's hardware LMM with some helpers.

Seems to me, the conflict between HUB data read write and execute is best solved by simply returning to the COG to write the data, which is nicely provided for in the current instruction proposal. Or, employ another COG, or task in a COG to get this done in parallel with the HUB PASM program execution. It would look sort of like the math operations do.

So then, one needs to organize the COG for this to really be maximized.

One approach is COG library code, or MCU model.

I really like the idea of shared library code frankly.

In that model, one blasts through snippets much like a snippet in SPIN 2. The parallels here are beautiful. Call it, work gets done, return to COG for business as usual. The conflict here isn't that big of a deal.

The other one is the CPU strategy. The conflict here is a big deal as the majority of the time will be spent in HUBEXEC mode.

Put the WIDE register block somewhere. I like top of COG personally.

Setup a few PASM routines that can blast data out to the HUB, then jump to the HUB PASM program. All of the rest of COG LONGS are basically CPU working registers! So do the work, and when it makes sense to get data to the HUB, or to another COG via PORT D, return to the COG to do that.

For larger chunks of data, the penalty is less. Worst case: read, modify, write a single byte.

Additionally, a second COG could be watching PORT D for data. It writes to the HUB, leaving the COG running HUB PASM to carry on with some timing assumptions!

I just caught the other posts. If the non C ops can run, perhaps queued or with a stall, that's yet another option... This is unreal.

David Betz · 2013-12-03 10:06

potatohead wrote: »

Seems to me, the conflict between HUB data read write and execute is best solved by simply returning to the COG to write the data, which is nicely provided for in the current instruction proposal. Or, employ another COG, or task in a COG to get this done in parallel with the HUB PASM program execution. It would look sort of like the math operations do.

Since access to hub-based variables is a large part of what any program does, I'd rather let the hardware arbitrate access to the hub than force two mode changes on every hub data access.

potatohead · 2013-12-03 10:09

And it appears we will get something like that. I just caught the other posts on reload and didn't see it. I would prefer that too, though I am thinking about the implications of PORT D and another COG able to perform data operations...

jazzed · 2013-12-03 10:09

It's good to separate this.

Whether Chip actually wants to implement it for P2 is an open question. That needs to be answered quickly BTW so we know what to expect. Personally I don't think this should be even considered as a P2 feature.

Still it's good to flesh out what is actually needed for a day when it is seriously considered.

When I mentioned the idea here in paragraph 3 I wanted to dump the requirement for an LMM interpreter and all that it involves. The biggest win is speed, other wins are not needing to waste 2KB on an interpreter among others. Needing any kind of an interpreter for such a feature is a non-starter - that seems to be clear.

But for serious consideration one should look at the implemented LMM interpreters and all aspects of what they do. And anyone who has written one should speak up!

This means going from startup, to multi-cog execution, using ADD/SUB PC, all the FCALL services, and to any effect of ignoring attributes like fcache.

The other question is how would this work with external memory? Should it? I'm guessing that would still require an interpreter unless Chip can volunteer a way to make fetch-exec from external memory possible. I suspect it would be a big friggin headache.

Bill Henning · 2013-12-03 10:10

potatohead,

for a large blast, a lot fits in 8 longs, with REPS being usable within the 8-long block; so cog-hub, hub-cog, hub-hub copies should fit

otherwise we call a cog subroutine. FCACHE is not dead yet :-)

For single byte, save ptra, and use ptra++/ptrb++ in a REPS loop to move the data.

THIS WILL BE FUN!

potatohead · 2013-12-03 10:10

BTW: How much time do we have available?

David Betz · 2013-12-03 10:13

Bill Henning wrote: »

HUB EXECUTION MODEL: ... By adding CSEG and DSEG registers, and ideally an SSEG (for external stack) with limit registers on each... we could have essentially unlimited memory, and port Linux, as the segment/limit register pairs will effectively act like a per-cog MMU.

Do you know how many x86 programmers groaned when you mentioned segment registers? :-)
Seriously though, how would these segment registers help with XMM? You'd need to also have some sort of TLB and with it traps to handle TLB misses. I think traps are dangerously close to interrupts and might not be tolerated well by the Propeller community!

Bill Henning · 2013-12-03 10:16

The circle of life :-)

Wanting to execute out of the hub leads to LMM, FCACHE. Those lead to RDxxxxC and RDQUAD. Getting rid of the DAC bus and bumping the hub to 256KB gives Chip the idea for RDOCTL; he valiantly tries not to think of executing out of the hub. I play with ideas.

No idea if he wants to put it into the P2, but as you say, still a worthwhile separate discussion for followup chips.

FYI, I talk about XMM version at the end of post#1.

jazzed wrote: »

It's good to separate this.

Whether Chip actually wants to implement it for P2 is an open question. That needs to be answered quickly BTW so we know what to expect. Personally I don't think this should be even considered as a P2 feature.

Still it's good to flesh out what is actually needed for a day when it is seriously considered.

When I mentioned the idea here in paragraph 3 I wanted to dump the requirement for an LMM interpreter and all that it involves. The biggest win is speed, other wins are not needing to waste 2KB on an interpreter among others. Needing any kind of an interpreter for such a feature is a non-starter - that seems to be clear.

But for serious consideration one should look at the implemented LMM interpreters and all aspects of what they do. And anyone who has written one should speak up!

This means going from startup, to multi-cog execution, using ADD/SUB PC, all the FCALL services, and to any effect of ignoring attributes like fcache.

The other question is how would this work with external memory? Should it? I'm guessing that would still require an interpreter unless Chip can volunteer a way to make fetch-exec from external memory possible. I suspect it would be a big friggin headache.

Bill Henning · 2013-12-03 10:21

I was the first one that groaned when I thought of it - but it is the easiest, simplest way to add memory protection and relocation. I don't want to think of full MMU and virtual memory for cogs. (I plead the fifth on being an x86 programmer in the past under DOS, minix, coherent etc)

I was thinking of treating XMM as a large linear space for the P3.

Segments/Limits was how Unix was first run, and provide for memory protection and relocation; heck they have even been used for swapping to virtual store.

You really only need TLB's and traps if you are swapping... I am thinking of running right out of the external memory, with one to four small cache lines per cog. The prop has a microcontroller heritage, and I don't think it needs full virtual memory / MMU capability; and with per-cog segment/limit registers, virtual memory is still possible, if silly on a microcontroller.

Actually segment registers are not evil - ugly 16 bit segment registers, used to provide only offsets to 64KB segments are evil. Using full 32 bit segment pointers / limits / pointers they are relatively painless, and allow trivial relocation.

David Betz wrote: »

Do you know how many x86 programmers groaned when you mentioned segment registers? :-)
Seriously though, how would these segment registers help with XMM? You'd need to also have some sort of TLB and with it traps to handle TLB misses. I think traps are dangerously close to interrupts and might not be tolerated well by the Propeller community!

jazzed · 2013-12-03 10:25

Bill Henning wrote: »

The circle of life :-)

Wanting to execute out of the hub leads to LMM, FCACHE. Those lead to RDxxxxC and RDQUAD. Getting rid of the DAC bus and bumping the hub to 256KB gives Chip the idea for RDOCTL; he valiantly tries not to think of executing out of the hub. I play with ideas.

No idea if he wants to put it into the P2, but as you say, still a worthwhile separate discussion for followup chips.

FYI, I talk about XMM version at the end of post#1.

Ya, the devil is in the details.

Since you have a history section which I just noticed, you should add a link to my post

David Betz · 2013-12-03 10:25

Bill Henning wrote: »

You really only need TLB's and traps if you are swapping... I am thinking of running right out of the external memory, with one to four small cache lines per cog. The prop has a microcontroller heritage, and I don't think it needs full virtual memory / MMU capability; and with per-cog segment/limit registers, virtual memory is still possible, if silly on a microcontroller.

But you essentially *are* swapping if you use a two-level memory architecture where hub serves as a cache for off-chip memory. You need a way for the CPU to recognize that a PC it is trying to fetch is not represented by data in hub memory and that it needs to suspend its operation and fetch that data from external memory before it can proceed. This is essentially a page fault or second-level cache miss. In any case, you have to vector off to separate code to handle it while maintaining the state of the code that caused the fault. This is generally handled by a trap.

Edit: Or are you suggesting that this all happens in hardware?

potatohead · 2013-12-03 10:31

Personally, I don't think it should work with external memory.

I'm on the fence about it being a P2 feature.

Execute in place changes things. A lot.

Some here believe working on the design to bridge the time gap based on what we learned on what I'll call the dry run makes sense. If we get execute in place at some high speed, say 90 percent? 80 percent? I think that's worth adding, because it is a very serious differentiator and it doesn't break the basics of what a Propeller is and it doesn't mean we can't still do LMM.

Here is what it comes down to for me:

I really don't want the basic dynamics of a Propeller broken on this chip. I think it's important to have those be solid so that current users can adopt this thing and carry the whole community forward. Anything that threatens that really should be considered very carefully.

I don't know whether or not execute in place does that. Maybe it does. I'm already thinking about most COGS now able to spill over into the HUB, and so where do they do it, how is that managed, etc...? Looks to be a real mess from that POV.

Which is why I'm on the fence.

Oh, and I hate to say this, but I feel the same way about execute in place (hardware LMM assist) as others do about opening up the HUB timing slots.

So, if I'm shouted down in the same way others were, no worries. That may be the right thing to do.

It's up to Chip, and I would say Parallax, because they will take the risk, and they will need to add the value needed to carry most of us forward. If they think they can do that, I'm going to trust it, because I believe in them.

Not much else to say.

Bill Henning · 2013-12-03 10:32

Done

jazzed wrote: »

Ya, the devil is in the details.

Since you have a history section which I just noticed, you should add a link to my post

David Betz · 2013-12-03 10:34

This *does* seem like a fairly big change even if you leave out any support for external memory. I guess we have to trust Chip to decide if it is a risk. I'd be happy to have it left for P3 if it's risky. I certainly wouldn't want a bug in this to ruin the next foundry run!

Bill Henning · 2013-12-03 10:36

Yep.

For hub exec:

the OCTL buffer is a single line cache

For P3 xmm exec:

Totally bypasses the hub, DDR2/+ loads one (or small number of) cache lines, in hardware. This will not be as fast, or as deterministic as hub exec, but the more cache lines, the closer it will be.

We can also expect video to compete for bandwidth.

In both cases we say goodbye to LMM loops, and run MUCH faster.

FCACHE will still be useful in both cases.

Highest performance: cog only (fully deterministic)

Second: hub exec (can be deterministic if coded directly in assembler)

Third: xmm exec (not deterministic, but good for overall application)

David Betz wrote: »

But you essentially *are* swapping if you use a two-level memory architecture where hub serves as a cache for off-chip memory. You need a way for the CPU to recognize that a PC it is trying to fetch is not represented by data in hub memory and that it needs to suspend its operation and fetch that data from external memory before it can proceed. This is essentially a page fault or second-level cache miss. In any case, you have to vector off to separate code to handle it while maintaining the state of the code that caused the fault. This is generally handled by a trap.

Edit: Or are you suggesting that this all happens in hardware?

potatohead · 2013-12-03 10:40

re: Segments

Well, big segment registers are just a different scale of evil. Sort of a "don't worry about it today" kind of evil that somebody somewhere will worry about eventually.

That said, I've no opinion on them. If this feature goes in, it goes in. Given how this has gone, I suspect blowing it all out again for P3 would see them eliminated or another "worry tomorrow" evil will replace them.

Bill Henning · 2013-12-03 10:40

It can be a small change if:

- Chip goes to RDOCTL anyway
- AUX only hardware stack
- only the three simple HMM instructions are added (described above)
- Chip, please do not even consider the xmm exec for P2! THAT would be considerable work

The cog already has a mechanism to handle back to back reads, so this just requires a tiny bit of arbitration.

Frankly, it's probably less work than hub slot management

... but only Chip knows how much work it will be.

Even this simple version can support a hub based stack, with two helper cog routines (HCALLH, HRETH)

Knowing Chip, he is busily adding this as I type this...

David Betz wrote: »

This *does* seem like a fairly big change even if you leave out any support for external memory. I guess we have to trust Chip to decide if it is a risk. I'd be happy to have it left for P3 if it's risky. I certainly wouldn't want a bug in this to ruin the next foundry run!

Bill Henning · 2013-12-03 10:42

I was not even considering segment registers for P2!

I was not even considering hardware xmm support for P2!

Dang it, I think I need to add more disclaimers in discussions....

Just to give you a bigger evil to think on...

64 bit P3, with 64 bit segment/limit registers

potatohead wrote: »

re: Segments

Well, big segment registers are just a different scale of evil. Sort of a "don't worry about it today" kind of evil that somebody somewhere will worry about eventually.

That said, I've no opinion on them. If this feature goes in, it goes in. Given how this has gone, I suspect blowing it all out again for P3 would see them eliminated or another "worry tomorrow" evil will replace them.

David Betz · 2013-12-03 10:49

Bill Henning wrote: »

I was not even considering segment registers for P2!

I was not even considering hardware xmm support for P2!

Dang it, I think I need to add more disclaimers in discussions....

Just to give you a bigger evil to think on...

64 bit P3, with 64 bit segment/limit registers

To be fair, you *did* mention the segment registers under a section labeled "P3".

Bill Henning · 2013-12-03 10:53

Correct! I am thinking about them for P3, but not for P2... too big a change there, and the addition may cost a clock cycle.

It was also under "Possible Improvements", but I meant segment registers for P3, which is why it also showed up in the P3 section... in my excitement, I did not clarify sufficiently!

David Betz wrote: »

To be fair, you *did* mention the segment registers under a section labeled "P3".
