Okay. I've got the DE2-115 running new Verilog with 256KB hub memory and RDWIDE/RDWIDEC/WRWIDE.
I modified XFR to handle 8 longs (a WIDE) and also made changes to all the ROM programs so that they work with the 256KB hub. PTRA/PTRB have been increased to 18 bits. Everything seems to be working fine. Right now, I'm compiling for 5 cogs on the DE2-115.
To get this Verilog done, we need to implement the new pin instructions for USB and get the SERDES worked out.
Regarding the new mnemonics: the number of bytes. Valid values would be 1, 2, 4, 16, 32, etc. No need to decrypt the mnemonic.
Excellent news! I guess you successfully avoided thinking about executing code from hub memory? :-)
Probably a feature for P3 at this point.
Fantastic, Chip!
That was quick.
These changes don't take long to make. This change involved half a dozen files, but most changes we make are confined to the cog Verilog file only.
I'm trying to think about how to think about it.
It may be possible to get something running quickly, but it would take some real consideration to make it work well. I could see a lot of people using it right off the bat, just because they suppose they need a big memory model, but then finding it has some strange caveats and getting the idea that the whole chip is goofy. Their frame of reference would almost certainly be opposite to how this chip works best. With WIDEs, you can easily get 1/2 speed PASM in large memory model programs.
The tricky thing about executing from hub memory is that you would be operating in a hybrid situation, with a context that is not quite "cog". Maybe some idea will pop up that lights the way. Right now, it's just murky.
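Back to that "1/2 speed PASM" figure: the loop behind it would look something like this. The mnemonics, and the idea that the fetched WIDE lands in an executable window right after the fetch, are assumptions here, not settled syntax, just a sketch of the shape of it:
             ' hypothetical RDWIDE-based LMM loop (window placement and mnemonics assumed)
lmm_loop     rdwide  ptra++        ' fetch the next 8 longs of hub code (waits for a hub window)
wide_window  nop                   ' assume the 8 fetched longs land here and execute in order;
             nop                   '  hub-level branches are done LMM-style, by the fetched code
             nop                   '  reloading PTRA and jumping back to lmm_loop
             nop
             nop
             nop
             nop
             nop
             jmp     #lmm_loop     ' roughly 8 clocks to fetch + 8 clocks to execute = ~1/2 native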
Since so much has been added rather unexpectedly, and all for the good... maybe the final release should just be called the Propeller Three, and let the Propeller Two remain a memorial to all the hard work.
It is not just a grand idea. Those people that have looked in on the Propeller Two and wandered away might be more enticed to get involved again if they are made aware that the improvements have been ongoing.
It just might end a lot of 'why so long?' discussion at the final launch. And it would send a strong message that this is the best yet.
At least consider a modified name, like the Propeller 2X or 2PLUS.
Chip,
Any more thoughts on the AUX stuff?
I don't think we have room to even double the AUX memories. It's not a matter of area, exactly, but of placement. Also, that is a custom memory that we designed. To modify it is a big task, unlike most of these Verilog changes.
There have been some interesting proposals about increasing AUX's accessibility, but I don't have the room in my head at the moment to think about them clearly. I need to get the USB pin instructions implemented next, and come to some rest point on executing from the hub.
I guess this is complicated because the hub access slot may already be in use by a data access in another pipeline stage? Would executing from hub cause too many stalls in the pipeline to be worthwhile?
Yowza! I didn't even think about that possibility. I like it, though, because it brings resolution. As soon as we can determine that it won't work well, we'll be done worrying about it.
I guess this is where a real icache would help since there would be no contention for it with other pipeline stages. That certainly sounds like a P3 feature though.
Anyway, I'm pretty happy with 256k of hub memory! That will improve the amount of C code we can fit in hub significantly especially using the PropGCC CMM instruction set.
Another possible conflicting scenario would be two or more COGs trying to use the same HUB-resident code, with instructions randomly changing during each one's execution phase.
At the least, I foresee heavy use of the semaphores, just to ensure the shared area is treated as read-only in those situations.
From another perspective, shared data is almost expected to change its contents in situations like the one depicted above, so read/write access should not be blocked for it.
Yanomani
'icache' - Now why did you have to go and mention that? Now I'm thinking again, to no avail. The icache would have to be the 8 WIDEs. The LMM program couldn't do any WIDE operations. It would need to be able to enter "cog" context and run routines in the cog RAM, and then switch back to hub mode.
The key to making this work might be to somehow always stay in "cog" context, maybe by constraining the cog's PC from 0..7 (keep the PC at %000000xxx), until a branch out occurs. We could execute in the background a virtual 'RDWIDE PTRA++' and have the WIDE window at $000..$007 in the cog. That would take a minimal amount of logic. Hub branches would be effected by changing PTRA. There would be a coarse 8-long block to respect, unless you used cog code above $008 to do finer queuing.
Yes, sharing code is easiest if the code can be considered read-only. That would mean that code executing from hub would not be able to use self-modifying code. I guess we'd have to look closely at the P2 instruction set to make sure that it is possible to completely avoid self-modifying code and still have a usable processor. I think it probably is but I'm not absolutely sure. You'd certainly have to use AUX memory for subroutine linkage and not the standard CALL/RET instructions.
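For anyone who hasn't looked at why CALL/RET counts as self-modifying, the P1-style linkage below shows the pattern (the labels are made up, and the CALL/RET mentioned above patches instructions in the same spirit):
             ' CALL patches the matching RET with the return address, so the code writes itself
             call    #blink        ' assembles to a JMPRET that writes the return address
                                   '  into the source field of blink_ret
blink        nop                   ' subroutine body (placeholder)
blink_ret    ret                   ' rewritten by every CALL #blink, which is why code like this
                                   '  can't sit in shared, read-only hub memory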
Thinking further, semaphore-controlled behavior, i.e., warning but not blocking, should perhaps be the way to do it.
I can see a lot of cases where a partial or even full rewrite of a code stream could be good behavior: one COG controlling many others' behavior by changing their execution paths, right there in the HUB contents.
One could easily take another COG out of an infinite loop using this technique, and recover its normal behavior, without having to fully stop and reload it.
Yanomani
I was on an 80-mile trip, driving back home, just wondering about how useful a 256-bit bus between HUB RAM and the COGs would be, and you come up with them, and RDOCTLs!
Way, way, way damn good!
If you still have some time, and coffee, and after flushing your stack: automatic RDOCTLs in the background are the way to go, just as if you'd used some endless REPS with them, but when the straight-line instruction block must be cut off, in the case of an out-of-straight-line JUMP or CALL, the REPS vanish away automatically.
Jumping inside the OCT block should otherwise be preserved, since the target instruction is already present.
Perhaps, if JUMPs whose target is already loaded and progressing inside the pipeline could activate the "no execute" bit of the intervening instructions, they would act as a 1, 2, or up-to-3 SKIP.
Have you noticed my earlier post, at #3499?
Perhaps it could help a bit.
Wow! My dream seems to be coming true at last.
I'll be looking forward to Chipmas again this winter.
Hey, I read that earlier and thought there were some gems in there, but didn't know that I remembered. You were already explaining what I just thought I thought of, myself. I think this is the way to do hub execution. Staying as much as possible in a "cog" context is the way to keep it sane. It's like LMM would be using RDWIDE, but without having to do the RDWIDEs and then waiting after each one for the results to become executable. We'll get the RDWIDEs abutting and keep the PC looping from $000..$007 until a jail-breaking branch occurs.
It's now 6:30am here. My wife is getting up to get the kids ready for school and I've got to get some sleep.
Chip
Like Chilly Willy, the singing polar bear, in the tale that the Old Captain told us:
Rockaby baby, la la la la...
Happy dreams!
IF a control bit could be set in such situations, even jumping from other addresses back inside the $000..$007 space could restart the automatic 'RDWIDE PTRA++', and have the code loupe sliding over the HUB panorama code picture.
The trickiest part would be re-syncing the 256-bit read operation, or stalling first-time re-execution until the next HUB slot tick.
Yanomani
It may be simpler to NOT map it to cog memory locations 0-7 (which would also interfere with tasking, memory mapping)
Consider:
- the cog program counter will have to grow to 18 bits, or PTRA can be used
- in hub-exec mode
xxxxxxxxxxxxxLLL00 (18 bit hub address)
where LLL is the "long index" into the wide cache
on a hub-instruction-fetch, xxxxxxxxxxxxx is compared to the previous yyyyyyyyyyyyy - if it is the same, push cache line LLL into the instruction pipeline
if it is not the same, stall for next hub cycle (P3 optimization: if the hub window goes by while executing the cache, pre-fetch following cache line - ie have TWO wide caches)
if the PC is PTRA, then RDLONGC reg,PTRA++ can be used to fetch 32 bit constants in the code stream, and will increment the PC so that there is no attempt to execute constants
Due to the cache being 32 bytes, this supports (transparently) hub sizes up to 20 bits (1MB) - leaving plenty of headroom for P3
To enter this mode, I suggest:
HUBEXEC ptra
to exit, just use cog jumps, to re-enter, use HUBEXEC/HJMP/HCALL/HRET
It needs HJMP / HCALL / HRET for hub-execution versions of those instructions, maybe even an HDJNZ.
Just an HJMP or HCALL, and return with HRET or a regular cog jmp, etc.
Restrictions on this mode:
- REPx / DJNZ etc. would have to fit in the 8-line cache
- no use of the RDxxxxC or WRxxxxC instructions in the hub code, non-C versions only, so the cache is not spoiled (would lead to thrashing)
It should get very close to COG performance, I'd guess 90%+ of native PASM.
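To make the flow concrete, usage might look roughly like this. Every mnemonic below is just the proposal above, nothing implemented, the labels are made up, and how an 18-bit hub target would actually be encoded in HJMP/HCALL is still open:
             ' sketch of the proposed hub-exec flow (proposed mnemonics only)
             hubexec ptra          ' enter hub-exec mode, fetching wides from the address in PTRA
             ' ...instructions now stream out of the 8-long wide cache...
             rdlongc temp, ptra++  ' pull an in-line 32-bit constant; the PC steps past it
             hcall   hub_func      ' hub-address call (target encoding is one of the open questions)
             jmp     #cog_routine  ' any ordinary cog jump drops back to normal cog execution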
This is the same set of dynamics that exists when using the XFR circuit to move data into the WIDE registers. Before jumping to $000, you'd do an initial 'RDWIDE PTRA++', and then a 'JMP #$000'. When you got to $000, the WIDE data would be executable. Then, you'd have the cog do an instruction-less 'RDWIDE PTRA++' in the background to keep the whole show going. The initial and instruction-less RDWIDEs could just be enabled by some instruction, and like you said, a branch cancels the mode. It should work fine. And since it doesn't try to break out of "cog" context in any way, there are no crazy caveats to learn.
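Spelled out as code, the entry would be just this, with PTRA assumed to already point at the hub code (the instruction that enables the background refills isn't named yet):
             ' entry sequence for the scheme above (background-refill enable not yet named)
             rdwide  ptra++        ' initial fetch: prime the WIDE window at $000..$007
             jmp     #$000         ' jump into the window; the background 'RDWIDE PTRA++'
                                   '  keeps it refilled from here on
             ' a branch outside $000..$007 cancels the mode; hub branches change PTRA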
I think I missed something. What do these instructions do? I assume that HCALL and HRET don't actually modify instructions like CALL/RET do? Or maybe you're expecting HCALL to use all 18 bits of D+S and push its return address on an AUX memory stack?
Yeah, Bill. These would be worth implementing. They would round out what was missing for finer address control and calls/returns without resorting to discrete subroutines in PASM.
After one of you mentioned that we wouldn't want the PC staying at 0..7 during hub execution because it would undermine register remapping, I just realized that we could constrain it to the WIDE window, which could be based at any address in the cog.
I want WIDELD, not RDWIDE.
So then the product of this is PASM sitting in the HUB, directly executed with HEXEC ptra, where ptra holds the address?
Don't use some instructions, hubops, etc., which would be reserved for LMM and COG PASM.
This gets called HUB PASM, and we now have PASM, LMM, XMM execute models. Wow.
Return to the COG via a standard JMP instruction, canceling the hardware HUB execution mode. And carry on at top speed and full use of HUB operations.
Bill, I think that's about the simplest model there is, given the state of things right now.
Well yes, I have to agree with Kerry.
1MB PASM programs @ ~90 percent of native with few restrictions.
Holy Buckets! All of a sudden, that 256KB HUB makes a big difference. Plenty of room to be paging in fairly large programs from external memory.
Can't wait to play with an FPGA image.
@JMG, well OK. Here we are this morning. I'm going to concede your point. Maximizing it right now makes perfect sense. On the assumption this all makes sense. I'm thinking it will.