Okay. I've got the DE2-115 running new Verilog with 256KB hub memory and RDWIDE/RDWIDEC/WRWIDE.
I modified XFR to handle 8 longs (a WIDE) and also made changes to all the ROM programs so that they work with the 256KB hub. PTRA/PTRB have been increased to 18 bits. Everything seems to be working fine. Right now, I'm compiling for 5 cogs on the DE2-115.
To get this Verilog done, we need to implement the new pin instructions for USB and get the SERDES worked out.
Regarding the new mnemonics: the number of bytes. Valid values would be 1, 2, 4, 16, 32, etc. No need to decrypt the mnemonic.
Excellent news! I guess you successfully avoided thinking about executing code from hub memory? :-)
Probably a feature for P3 at this point.
Fantastic, Chip!
That was quick.
These changes don't take long to make. This change involved half a dozen files, but most changes we make are confined to the cog Verilog file only.
I'm trying to think about how to think about it.
It may be possible to get something running quickly, but it would take some real consideration to make it work well. I could see a lot of people using it right off the bat, just because they suppose they need a big memory model, but then finding it has some strange caveats and getting the idea that the whole chip is goofy. Their frame of reference would almost certainly be opposite to how this chip works best. With WIDEs, you can easily get 1/2 speed PASM in large memory model programs.
The tricky thing about executing from hub memory is that you would be operating in a hybrid situation, with a context that is not quite "cog". Maybe some idea will pop up that lights the way. Right now, it's just murky.
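Back to that "1/2 speed PASM" figure: the loop behind it would look something like this. The mnemonics, and the idea that the fetched WIDE lands in an executable window right after the fetch, are assumptions here, not settled syntax, just a sketch of the shape of it:
             ' hypothetical RDWIDE-based LMM loop (window placement and mnemonics assumed)
lmm_loop     rdwide  ptra++        ' fetch the next 8 longs of hub code (waits for a hub window)
wide_window  nop                   ' assume the 8 fetched longs land here and execute in order;
             nop                   '  hub-level branches are done LMM-style, by the fetched code
             nop                   '  reloading PTRA and jumping back to lmm_loop
             nop
             nop
             nop
             nop
             nop
             jmp     #lmm_loop     ' roughly 8 clocks to fetch + 8 clocks to execute = ~1/2 native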
Since so much has been added rather unexpectedly, and all for the good... maybe the final release should just be called the Propeller Three, and let the Propeller Two remain a memorial to all the hard work.
It is not just a grand idea. Those people that have looked in on the Propeller Two and wandered away might be more enticed to get involved again if they are made aware that the improvements have been ongoing.
It just might end a lot of 'why so long?' discussion at the final launch. And it would send a strong message that this is the best yet.
At least consider a modified name, like the Propeller 2X or 2PLUS.
Chip,
Any more thoughts on the AUX stuff?
I don't think we have room to even double the AUX memories. It's not a matter of area, exactly, but of placement. Also, that is a custom memory that we designed. To modify it is a big task, unlike most of these Verilog changes.
There have been some interesting proposals about increasing AUX's accessibility, but I don't have the room in my head at the moment to think about them clearly. I need to get the USB pin instructions implemented next, and come to some rest point on executing from the hub.
I guess this is complicated because the hub access slot may already be in use by a data access in another pipeline stage? Would executing from hub cause too many stalls in the pipeline to be worthwhile?
Yowza! I didn't even think about that possibility. I like it, though, because it brings resolution. As soon as we can determine that it won't work well, we'll be done worrying about it.
I guess this is where a real icache would help since there would be no contention for it with other pipeline stages. That certainly sounds like a P3 feature though.
Anyway, I'm pretty happy with 256k of hub memory! That will improve the amount of C code we can fit in hub significantly especially using the PropGCC CMM instruction set.
Another possible conflicting scenario would be two or more COGs trying to use the same HUB-resident code, with instructions randomly changing during each one's execution phase.
At the least, I foresee heavy use of the semaphores, just to ensure the shared area is treated as read-only in those situations.
From another perspective, shared data is almost expected to change its contents in situations like the one depicted above, so read/write access should not be blocked for it.
Yanomani
'icache' - Now why did you have to go and mention that? Now I'm thinking again, to no avail. The icache would have to be the 8 WIDEs. The LMM program couldn't do any WIDE operations. It would need to be able to enter "cog" context and run routines in the cog RAM, and then switch back to hub mode.
The key to making this work might be to somehow always stay in "cog" context, maybe by constraining the cog's PC from 0..7 (keep the PC at %000000xxx), until a branch out occurs. We could execute in the background a virtual 'RDWIDE PTRA++' and have the WIDE window at $000..$007 in the cog. That would take a minimal amount of logic. Hub branches would be effected by changing PTRA. There would be a coarse 8-long block to respect, unless you used cog code above $008 to do finer queuing.
Yes, sharing code is easiest if the code can be considered read-only. That would mean that code executing from hub would not be able to use self-modifying code. I guess we'd have to look closely at the P2 instruction set to make sure that it is possible to completely avoid self-modifying code and still have a usable processor. I think it probably is but I'm not absolutely sure. You'd certainly have to use AUX memory for subroutine linkage and not the standard CALL/RET instructions.
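For anyone who hasn't looked at why CALL/RET counts as self-modifying, the P1-style linkage below shows the pattern (the labels are made up, and the CALL/RET mentioned above patches instructions in the same spirit):
             ' CALL patches the matching RET with the return address, so the code writes itself
             call    #blink        ' assembles to a JMPRET that writes the return address
                                   '  into the source field of blink_ret
blink        nop                   ' subroutine body (placeholder)
blink_ret    ret                   ' rewritten by every CALL #blink, which is why code like this
                                   '  can't sit in shared, read-only hub memory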
Thinking further, semaphore-controlled behavior, i.e., warning but not blocking, should perhaps be the way to do it.
I can see a lot of cases where a partial or even full rewrite of a code stream could be good behavior: one COG controlling many others' behavior by changing their execution paths, right there in the HUB contents.
One could easily take another COG out of an infinite loop using this technique, and recover its normal behavior, without having to fully stop and reload it.
Yanomani
I was on an 80-mile trip, driving back home, just wondering about how useful a 256-bit bus between HUB RAM and the COGs would be, and you come up with them, and RDOCTLs!
Way, way, way damn good!
If you still have some time, and coffee, and after flushing your stack: automatic RDOCTLs in the background are the way to go, just as if you'd used some endless REPS with them, but when the straight-line instruction block must be cut off, in the case of an out-of-straight-line JUMP or CALL, the REPS vanish away automatically.
Jumping inside the OCT block should otherwise be preserved, since the target instruction is already present.
Perhaps, if JUMPs whose target is already loaded and progressing inside the pipeline could activate the "no execute" bit of the intervening instructions, they would act as a 1, 2, or up-to-3 SKIP.
Have you noticed my earlier post, at #3499?
Perhaps it could help a bit.
Wow! My dream seems to be coming true at last.
I'll be looking forward to Chipmas again this winter.
Hey, I read that earlier and thought there were some gems in there, but didn't know that I remembered. You were already explaining what I just thought I thought of, myself. I think this is the way to do hub execution. Staying as much as possible in a "cog" context is the way to keep it sane. It's like LMM would be using RDWIDE, but without having to do the RDWIDEs and then waiting after each one for the results to become executable. We'll get the RDWIDEs abutting and keep the PC looping from $000..$007 until a jail-breaking branch occurs.
It's now 6:30am here. My wife is getting up to get the kids ready for school and I've got to get some sleep.
Chip
Like Chilly Willy, the singing polar bear, in the tale that the Old Captain told us:
Rockaby baby, la la la la...
Happy dreams!
IF a control bit could be set in such situations, even jumping from other addresses back inside the $000..$007 space could restart the automatic 'RDWIDE PTRA++', and have the code loupe sliding over the HUB panorama code picture.
The trickiest part would be re-syncing the 256-bit read operation, or stalling first-time re-execution until the next HUB slot tick.
Yanomani
It may be simpler to NOT map it to cog memory locations 0-7 (which would also interfere with tasking, memory mapping)
Consider:
- the cog program counter will have to grow to 18 bits, or PTRA can be used
- in hub-exec mode
xxxxxxxxxxxxxLLL00 (18 bit hub address)
where LLL is the "long index" into the wide cache
on a hub-instruction-fetch, xxxxxxxxxxxxx is compared to the previous yyyyyyyyyyyyy - if it is the same, push cache line LLL into the instruction pipeline
if it is not the same, stall for next hub cycle (P3 optimization: if the hub window goes by while executing the cache, pre-fetch following cache line - ie have TWO wide caches)
if the PC is PTRA, then RDLONGC reg,PTRA++ can be used to fetch 32 bit constants in the code stream, and will increment the PC so that there is no attempt to execute constants
Due to the cache being 32 bytes, this supports (transparently) hub sizes up to 20 bits (1MB) - leaving plenty of headroom for P3
To enter this mode, I suggest:
HUBEXEC ptra
to exit, just use cog jumps, to re-enter, use HUBEXEC/HJMP/HCALL/HRET
It needs HJMP / HCALL / HRET for hub-execution versions of those instructions, maybe even an HDJNZ.
Just an HJMP or HCALL, and return with HRET or a regular cog jmp, etc.
Restrictions on this mode:
- REPx / DJNZ etc. would have to fit in the 8-line cache
- no use of the RDxxxxC or WRxxxxC instructions in the hub code, non-C versions only, so the cache is not spoiled (would lead to thrashing)
It should get very close to COG performance, I'd guess 90%+ of native PASM.
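To make the flow concrete, usage might look roughly like this. Every mnemonic below is just the proposal above, nothing implemented, the labels are made up, and how an 18-bit hub target would actually be encoded in HJMP/HCALL is still open:
             ' sketch of the proposed hub-exec flow (proposed mnemonics only)
             hubexec ptra          ' enter hub-exec mode, fetching wides from the address in PTRA
             ' ...instructions now stream out of the 8-long wide cache...
             rdlongc temp, ptra++  ' pull an in-line 32-bit constant; the PC steps past it
             hcall   hub_func      ' hub-address call (target encoding is one of the open questions)
             jmp     #cog_routine  ' any ordinary cog jump drops back to normal cog execution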
This is the same set of dynamics that exists when using the XFR circuit to move data into the WIDE registers. Before jumping to $000, you'd do an initial 'RDWIDE PTRA++', and then a 'JMP #$000'. When you got to $000, the WIDE data would be executable. Then, you'd have the cog do an instruction-less 'RDWIDE PTRA++' in the background to keep the whole show going. The initial and instruction-less RDWIDEs could just be enabled by some instruction, and like you said, a branch cancels the mode. It should work fine. And since it doesn't try to break out of "cog" context in any way, there are no crazy caveats to learn.
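Spelled out as code, the entry would be just this, with PTRA assumed to already point at the hub code (the instruction that enables the background refills isn't named yet):
             ' entry sequence for the scheme above (background-refill enable not yet named)
             rdwide  ptra++        ' initial fetch: prime the WIDE window at $000..$007
             jmp     #$000         ' jump into the window; the background 'RDWIDE PTRA++'
                                   '  keeps it refilled from here on
             ' a branch outside $000..$007 cancels the mode; hub branches change PTRA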
I think I missed something. What do these instructions do? I assume that HCALL and HRET don't actually modify instructions like CALL/RET do? Or maybe you're expecting HCALL to use all 18 bits of D+S and push its return address on an AUX memory stack?
Yeah, Bill. These would be worth implementing. They would round out what was missing for finer address control and calls/returns without resorting to discrete subroutines in PASM.
After one of you mentioned that we wouldn't want the PC staying at 0..7 during hub execution because it would undermine register remapping, I just realized that we could constrain it to the WIDE window, which could be based at any address in the cog.
I want WIDELD, not RDWIDE.
So then the product of this is PASM sitting in the HUB, directly executed with HEXEC ptra, where ptra holds the address?
Don't use some instructions, hubops, etc., which would be reserved for LMM and COG PASM.
This gets called HUB PASM, and we now have PASM, LMM, XMM execute models. Wow.
Return to the COG via a standard JMP instruction, canceling the hardware HUB execution mode. And carry on at top speed and full use of HUB operations.
Bill, I think that's about the simplest model there is, given the state of things right now.
Well yes, I have to agree with Kerry.
1MB PASM programs @ ~90 percent of native with few restrictions.
Holy Buckets! All of a sudden, that 256KB HUB makes a big difference. Plenty of room to be paging in fairly large programs from external memory.
Can't wait to play with an FPGA image.
@JMG, well OK. Here we are this morning. I'm going to concede your point. Maximizing it right now makes perfect sense. On the assumption this all makes sense. I'm thinking it will.