Propeller II update - BLOG - Page 119 — Parallax Forums

Propeller II update - BLOG

Comments

  • cgracey Posts: 14,133
    edited 2013-12-03 04:57
    Okay. I've got the DE2-115 running new Verilog with 256KB hub memory and RDWIDE/RDWIDEC/WRWIDE.

    I modified XFR to handle 8 longs (a WIDE) and also made changes to all the ROM programs so that they work with the 256KB hub. PTRA/PTRB have been increased to 18 bits. Everything seems to be working fine. Right now, I'm compiling for 5 cogs on the DE2-115.

    To get this Verilog done, we need to implement the new pin instructions for USB and get the SERDES worked out.
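The pointer width mentioned above follows directly from the hub size. As a quick sanity check, here is a minimal Python sketch of the arithmetic. The constant and function names are illustrative only and do not come from the Verilog or the tools.

```python
# Sizes taken from the post: 256KB hub, a WIDE is 8 longs.
HUB_BYTES = 256 * 1024
WIDE_LONGS = 8
BYTES_PER_LONG = 4


def address_bits(size_bytes: int) -> int:
    """Number of bits needed to address every byte in a memory of size_bytes."""
    return (size_bytes - 1).bit_length()


ptr_bits = address_bits(HUB_BYTES)            # width needed for PTRA/PTRB
wide_bytes = WIDE_LONGS * BYTES_PER_LONG      # bytes moved by one RDWIDE/WRWIDE
wides_in_hub = HUB_BYTES // wide_bytes        # number of WIDE-aligned blocks

print(ptr_bits, wide_bytes, wides_in_hub)     # 18 32 8192
```

So a byte pointer into a 256KB hub needs 18 bits, and the hub holds 8192 WIDE-sized blocks of 32 bytes each.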
  • Seairth Posts: 2,474
    edited 2013-12-03 05:02
    ozpropdev wrote: »
    What do the numbers represent?

    The number of bytes. Valid values would be 1, 2, 4, 16, 32, etc. No need to decrypt the mnemonic.
  • David Betz Posts: 14,511
    edited 2013-12-03 05:05
    cgracey wrote: »
    Okay. I've got the DE2-115 running new Verilog with 256KB hub memory and RDWIDE/RDWIDEC/WRWIDE.

    I modified XFR to handle 8 longs (a WIDE) and also made changes to all the ROM programs so that they work with the 256KB hub. PTRA/PTRB have been increased to 18 bits. Everything seems to be working fine. Right now, I'm compiling for 5 cogs on the DE2-115.

    To get this Verilog done, we need to implement the new pin instructions for USB and get the SERDES worked out.
    Excellent news! I guess you successfully avoided thinking about executing code from hub memory? :-)
    Probably a feature for P3 at this point.
  • ozpropdev Posts: 2,792
    edited 2013-12-03 05:06
    cgracey wrote: »
    Okay. I've got the DE2-115 running new Verilog with 256KB hub memory and RDWIDE/RDWIDEC/WRWIDE.

    I modified XFR to handle 8 longs (a WIDE) and also made changes to all the ROM programs so that they work with the 256KB hub. PTRA/PTRB have been increased to 18 bits. Everything seems to be working fine. Right now, I'm compiling for 5 cogs on the DE2-115.

    To get this Verilog done, we need to implement the new pin instructions for USB and get the SERDES worked out.

    Fantastic Chip!
    That was quick :)
  • cgracey Posts: 14,133
    edited 2013-12-03 05:13
    ozpropdev wrote: »
    Fantastic Chip!
    That was quick :)

    These changes don't take long to make. This change involved half a dozen files, but most changes we make are confined to the cog Verilog file, only.
  • cgracey Posts: 14,133
    edited 2013-12-03 05:14
    David Betz wrote: »
    Excellent news! I guess you successfully avoided thinking about executing code from hub memory? :-)
    Probably a feature for P3 at this point.

    I'm trying to think about how to think about it.

    It may be possible to get something running quickly, but it would take some real consideration to make it work well. I could see a lot of people using it right off the bat, just because they suppose they need a big memory model, but then finding it has some strange caveats and getting the idea that the whole chip is goofy. Their frame of reference would almost certainly be opposite to how this chip works best. With WIDEs, you can easily get 1/2 speed PASM in large memory model programs.

    The tricky thing about executing from hub memory is that you would be operating in a hybrid situation, with a context that is not quite "cog". Maybe some idea will pop up that lights the way. Right now, it's just murky.
  • LoopyByteloose Posts: 12,537
    edited 2013-12-03 05:15
    Since so much has been added rather unexpectedly, and all for the good... maybe the final release should just be called the Propeller Three, and let the Propeller Two just remain a memorial to all the hard work.

    It is not just a grand idea. Those people that have looked in on the Propeller Two and wandered away might just be more enticed to get involved once again if they are made aware that the improvements have been ongoing.

    It just might end a lot of discussion at the final launch of 'why so long?'. And it would send a strong message that this is the best yet.

    At least consider a modified name, like the Propeller 2x or 2PLUS.
  • ozpropdev Posts: 2,792
    edited 2013-12-03 05:17
    cgracey wrote: »
    These changes don't take long to make. This change involved half a dozen files, but most changes we make are confined to the cog Verilog file, only.

    Chip,
    Any more thoughts on the AUX stuff?
  • cgracey Posts: 14,133
    edited 2013-12-03 05:38
    ozpropdev wrote: »
    Chip,
    Any more thoughts on the AUX stuff?

    I don't think we have room to even double the AUX memories. It's not a matter of area, exactly, but of placement. Also, that is a custom memory that we designed. To modify it is a big task, unlike most of these Verilog changes.

    There have been some interesting proposals about increasing AUX's accessibility, but I don't have the room in my head at the moment to think about them clearly. I need to get the USB pin instructions implemented next, and come to some rest point on executing from the hub.
  • David Betz Posts: 14,511
    edited 2013-12-03 05:48
    cgracey wrote: »
    and come to some rest point on executing from the hub.
    I guess this is complicated because the hub access slot may already be in use by a data access in another pipeline stage? Would executing from hub cause too many stalls in the pipeline to be worth while?
  • cgracey Posts: 14,133
    edited 2013-12-03 05:54
    David Betz wrote: »
    I guess this is complicated because the hub access slot may already be in use by a data access in another pipeline stage? Would executing from hub cause too many stalls in the pipeline to be worth while?

    Yowza! I didn't even think about that possibility. I like it, though, because it brings resolution. As soon as we can determine that it won't work well, we'll be done worrying about it.
  • David Betz Posts: 14,511
    edited 2013-12-03 05:59
    cgracey wrote: »
    Yowza! I didn't even think about that possibility. I like it, though, because it brings resolution. As soon as we can determine that it won't work well, we'll be done worrying about it.
    I guess this is where a real icache would help since there would be no contention for it with other pipeline stages. That certainly sounds like a P3 feature though.

    Anyway, I'm pretty happy with 256k of hub memory! That will improve the amount of C code we can fit in hub significantly especially using the PropGCC CMM instruction set.
  • Yanomani Posts: 1,524
    edited 2013-12-03 06:01
    David Betz wrote: »
    I guess this is complicated because the hub access slot may already be in use by a data access in another pipeline stage? Would executing from hub cause too many stalls in the pipeline to be worth while?

    Another possible conflict would be two or more COGs trying to use the same HUB-resident code for themselves, randomly changing instructions during each one's execution phase.
    At the least, I foresee heavy use of the semaphores, if only to ensure the shared area is treated as read-only in those situations.
    From another perspective, shared data is almost expected to change its contents in situations like the one depicted above, so r/w access should not be blocked for it.

    Yanomani
  • cgracey Posts: 14,133
    edited 2013-12-03 06:10
    David Betz wrote: »
    I guess this is where a real icache would help since there would be no contention for it with other pipeline stages. That certainly sounds like a P3 feature though.

    Anyway, I'm pretty happy with 256k of hub memory! That will improve the amount of C code we can fit in hub significantly especially using the PropGCC CMM instruction set.

    'icache' - Now why did you have to go and mention that? Now I'm thinking again, to no avail. The icache would have to be the 8 WIDEs. The LMM program couldn't do any WIDE operations. It would need to be able to enter "cog" context and run routines in the cog RAM, and then switch back to hub mode.

    The key to making this work might be to somehow always stay in "cog" context, maybe by constraining the cog's PC from 0..7 (keep the PC at %000000xxx), until a branch out occurs. We could execute in the background a virtual 'RDWIDE PTRA++' and have the WIDE window at $000..$007 in the cog. That would take a minimal amount of logic. Hub branches would be effected by changing PTRA. There would be a coarse 8-long block to respect, unless you used cog code above $008 to do finer queuing.
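To make the windowed scheme above a little more concrete, here is a minimal Python sketch of it. This is only a behavioral model under stated assumptions, not the actual cog logic: the class name `HubExecSketch`, the list-of-longs hub, and the method names are all hypothetical, and hub-slot timing is not modeled at all. It shows the PC confined to 3 bits, a background 'RDWIDE PTRA++' each time the PC wraps, and hub branches done by changing PTRA on an 8-long boundary.

```python
WIDE_LONGS = 8
BYTES_PER_LONG = 4
WIDE_BYTES = WIDE_LONGS * BYTES_PER_LONG


class HubExecSketch:
    """Toy model: the cog executes from an 8-long WIDE window refilled from hub."""

    def __init__(self, hub_longs, start_addr):
        self.hub = hub_longs        # hub memory as a list of longs (index = byte address // 4)
        self.ptra = start_addr      # WIDE-aligned hub byte address
        self.window = []            # the 8 longs currently visible to the cog
        self.window_base = None     # hub byte address of window[0]
        self.pc = 0                 # only the low 3 bits are ever used
        self._rdwide()              # initial fill before execution starts

    def _rdwide(self):
        """Model of 'RDWIDE PTRA++': load 8 longs into the window, advance PTRA."""
        first = self.ptra // BYTES_PER_LONG
        self.window = self.hub[first:first + WIDE_LONGS]
        self.window_base = self.ptra
        self.ptra += WIDE_BYTES

    def step(self):
        """Return (hub_byte_address, long) of the next instruction to execute."""
        addr = self.window_base + self.pc * BYTES_PER_LONG
        instr = self.window[self.pc]
        self.pc = (self.pc + 1) & 0b111   # keep the PC at %000000xxx
        if self.pc == 0:
            self._rdwide()                # background fetch of the next WIDE
        return addr, instr

    def hub_branch(self, target_addr):
        """A hub branch is effected by changing PTRA (coarse 8-long granularity)."""
        assert target_addr % WIDE_BYTES == 0, "target must sit on an 8-long boundary"
        self.ptra = target_addr
        self.pc = 0
        self._rdwide()


# Usage: a fake hub where each long simply holds its own index.
hub = list(range(64))
cog = HubExecSketch(hub, 0)
trace = [cog.step() for _ in range(10)]
print([addr for addr, _ in trace])   # [0, 4, 8, 12, 16, 20, 24, 28, 32, 36]
cog.hub_branch(128)
print(cog.step())                    # (128, 32)
```

The model makes the "coarse 8-long block" restriction visible: `hub_branch` can only land on a WIDE boundary, so finer targets would need help from cog code outside the window.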
  • David Betz Posts: 14,511
    edited 2013-12-03 06:11
    Yanomani wrote: »
    Another possible conflict would be two or more COGs trying to use the same HUB-resident code for themselves, randomly changing instructions during each one's execution phase.
    At the least, I foresee heavy use of the semaphores, if only to ensure the shared area is treated as read-only in those situations.
    From another perspective, shared data is almost expected to change its contents in situations like the one depicted above, so r/w access should not be blocked for it.

    Yanomani
    Yes, sharing code is easiest if the code can be considered read-only. That would mean that code executing from hub would not be able to use self-modifying code. I guess we'd have to look closely at the P2 instruction set to make sure that it is possible to completely avoid self-modifying code and still have a usable processor. I think it probably is but I'm not absolutely sure. You'd certainly have to use AUX memory for subroutine linkage and not the standard CALL/RET instructions.
  • Yanomani Posts: 1,524
    edited 2013-12-03 06:13
    David Betz wrote: »
    Yes, sharing code is easiest if the code can be considered read-only. That would mean that code executing from hub would not be able to use self-modifying code. I guess we'd have to look closely at the P2 instruction set to make sure that it is possible to completely avoid self-modifying code and still have a usable processor. I think it probably is but I'm not absolutely sure. You'd certainly have to use AUX memory for subroutine linkage and not the standard CALL/RET instructions.

    Thinking further, semaphore-controlled behavior, i.e., warning but not blocking, should perhaps be the way to do it.
    I can certainly see a lot of cases where a partial or even full rewrite of a code stream could be good behavior: one COG controlling the behavior of many others by changing their execution paths, right there in the HUB contents.
    One could easily take another COG out of an infinite loop using this technique and recover its normal behavior without having to fully stop and reload it.

    Yanomani
  • Yanomani Posts: 1,524
    edited 2013-12-03 06:15
    Chip

    Have you noticed my earlier post, at #3499?

    Perhaps it could help a bit

    Yanomani wrote: »
    Chip

    I was on an 80-mile trip, driving back home, just wondering about how useful a 256-bit bus between HUB RAM and the COGs would be, and you came up with them, and RDOCTLs!
    Way, way, way damn good!

    If you still have some time, and coffee, and after flushing your stack: automatic RDOCTLs in the background are the way to go, just as if you'd used some endless REPS with them. But when the straight-line instruction block must be cut off, as with a JUMP or CALL out of the straight line, the REPS vanish automatically.
    Jumping inside the OCT block should otherwise be preserved, since the target instruction is already present.
    Perhaps if JUMPs whose target is already loaded, and progressing inside the pipeline, could activate the "no execute" bit of the intervening instructions, it would act as a SKIP of one, two, or up to three instructions.

    Yanomani
  • Heater. Posts: 21,230
    edited 2013-12-03 06:20
    myself:
    And to think. All I ever wanted from a PII originally was an order of magnitude faster execution speed and 256K RAM and 64 pins....

    Wow! My dream seems to be coming true at last.

    I'll be looking forward to Chipmas again this winter.
  • cgracey Posts: 14,133
    edited 2013-12-03 06:33
    Yanomani wrote: »
    Chip

    Have you noticed my earlier post, at #3499?

    Perhaps it could help a bit

    Hey, I read that earlier and thought there were some gems in there, but didn't know that I remembered. You were already explaining what I just thought I thought of, myself. I think this is the way to do hub execution. Staying as much as possible in a "cog" context is the way to keep it sane. It's like LMM would be using RDWIDE, but without having to do the RDWIDEs and then waiting after each one for the results to become executable. We'll get the RDWIDEs abutting and keep the PC looping from $000..$007 until a jail-breaking branch occurs.

    It's now 6:30am here. My wife is getting up to get the kids ready for school and I've got to get some sleep.
  • Yanomani Posts: 1,524
    edited 2013-12-03 06:33
    cgracey wrote: »
    'icache' - Now why did you have to go and mention that? Now I'm thinking again, to no avail. The icache would have to be the 8 WIDEs. The LMM program couldn't do any WIDE operations. It would need to be able to enter "cog" context and run routines in the cog RAM, and then switch back to hub mode. The key to making this work might be to somehow always stay in "cog" context, maybe by constraining the cog's PC from 0..7 (keep the PC at %000000xxx), until a branch out occurs. We could execute in the background a virtual 'RDWIDE PTRA++' and have the WIDE window at $000..$007 in the cog. That would take a minimal amount of logic. Hub branches would be effected by changing PTRA. There would be a coarse 8-long block to respect, unless you used cog code above $008 to do finer queuing.

    Chip

    If a control bit could be set in such situations, even jumping from other addresses back inside the $000..$007 space could restart the automatic 'RDWIDE PTRA++' and have the code LOUPE sliding over the HUB PANORAMA code picture.
    The trickiest part would be re-syncing the 256-bit read operation, or stalling the first re-execution until the next HUB slot tick.

    Yanomani
  • Bill Henning Posts: 6,445
    edited 2013-12-03 06:41
    It may be simpler to NOT map it to cog memory locations 0-7 (which would also interfere with tasking, memory mapping)

    Consider:

    - the cog program counter will have to grow to 18 bits, or PTRA can be used

    - in hub-exec mode

    xxxxxxxxxxxxxLLL00 (18 bit hub address)

    where LLL is the "long index" into the wide cache

    on a hub-instruction-fetch, xxxxxxxxxxxxx is compared to the previous yyyyyyyyyyyyy - if it is the same, push cache line LLL into the instruction pipeline

    if it is not the same, stall for next hub cycle (P3 optimization: if the hub window goes by while executing the cache, pre-fetch following cache line - ie have TWO wide caches)

    if the PC is PTRA, then RDLONGC reg,PTRA++ can be used to fetch 32 bit constants in the code stream, and will increment the PC so that there is no attempt to execute constants

    Due to the cache being 32 bytes, this supports (transparently) hub sizes up to 20 bits (1MB) - leaving plenty of headroom for P3

    To enter this mode, I suggest:

    HUBEXEC ptra

    to exit, just use cog jumps, to re-enter, use HUBEXEC/HJMP/HCALL/HRET

    needs

    HJMP / HCALL / HRET

    for hub-execution versions of those instructions, maybe even an HDJNZ

    cgracey wrote: »
    'icache' - Now why did you have to go and mention that? Now I'm thinking again, to no avail. The icache would have to be the 8 WIDEs. The LMM program couldn't do any WIDE operations. It would need to be able to enter "cog" context and run routines in the cog RAM, and then switch back to hub mode. The key to making this work might be to somehow always stay in "cog" context, maybe by constraining the cog's PC from 0..7 (keep the PC at %000000xxx), until a branch out occurs. We could execute in the background a virtual 'RDWIDE PTRA++' and have the WIDE window at $000..$007 in the cog. That would take a minimal amount of logic. Hub branches would be effected by changing PTRA. There would be a coarse 8-long block to respect, unless you used cog code above $008 to do finer queuing.
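The address split and tag compare that Bill outlines above can be sketched in a few lines of Python. This is a hedged illustration only: `split_hub_address` and `WideCacheSketch` are hypothetical names, the hub is modeled as a plain list of longs, and a "stall" is just a counter standing in for waiting on the next hub cycle. With an 18-bit address, the 13 `x` bits are the tag, `LLL` selects one of the 8 longs in the cached WIDE, and the low two bits are always zero for long-aligned instructions.

```python
def split_hub_address(addr):
    """Split a long-aligned hub byte address into (tag, long_index).

    Layout from the post: xxxxxxxxxxxxxLLL00 for an 18-bit address.
    """
    tag = addr >> 5                    # the 'x' bits: which 32-byte WIDE
    long_index = (addr >> 2) & 0b111   # the 'LLL' bits: which long inside it
    return tag, long_index


class WideCacheSketch:
    """Toy single-line instruction cache holding one WIDE (8 longs)."""

    def __init__(self, hub_longs):
        self.hub = hub_longs    # hub memory as a list of longs
        self.tag = None         # tag of the cached WIDE (the 'yyyy...' in the post)
        self.line = []          # the 8 cached longs
        self.stalls = 0         # how often we had to wait for a hub read

    def fetch(self, addr):
        """Return the instruction long at hub byte address addr."""
        tag, index = split_hub_address(addr)
        if tag != self.tag:
            # Miss: the real hardware would stall until the next hub cycle.
            self.stalls += 1
            first = tag * 8
            self.line = self.hub[first:first + 8]
            self.tag = tag
        return self.line[index]


# Usage: straight-line code stalls once per 8 instructions in this model.
hub = list(range(32))
cache = WideCacheSketch(hub)
fetched = [cache.fetch(addr) for addr in range(0, 64, 4)]
print(fetched == list(range(16)), cache.stalls)   # True 2
```

A backward jump into the previous WIDE (for example `cache.fetch(8)` after the run above) misses again, which is why loops such as REPx or DJNZ would want to fit inside one 8-long line.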
  • Bill Henning Posts: 6,445
    edited 2013-12-03 06:46
    actually, no need for HUBEXEC

    Just a HJMP or HCALL, and return with HRET or regular cog jmp etc.

    Restrictions on this mode:

    REPx / DJNZ etc would have to fit in the 8 line cache

    no use of the RDxxxxC or WRxxxxC instructions in the hub code, non-C versions only so cache is not spoiled (would lead to thrashing)

    It should get very close to COG performance, I'd guess 90%+ of native pasm.
  • cgracey Posts: 14,133
    edited 2013-12-03 06:47
    Yanomani wrote: »
    Chip

    If a control bit could be set in such situations, even jumping from other addresses back inside the $000..$007 space could restart the automatic 'RDWIDE PTRA++' and have the code LOUPE sliding over the HUB PANORAMA code picture.
    The trickiest part would be re-syncing the 256-bit read operation, or stalling the first re-execution until the next HUB slot tick.

    Yanomani

    This is the same set of dynamics that exist when using the XFR circuit to move data into the WIDE registers. Before jumping to $000, you'd do an initial 'RDWIDE PTRA++', and then a 'JMP #$000'. When you'd get to $000, the WIDE data would be executable. Then, you'd have the cog do an instruction-less 'RDWIDE PTRA++' in the background to keep the whole show going. The initial and instruction-less RDWIDE's could just be enabled by some instruction, and like you said, a branch cancels the mode. It should work fine. And since it doesn't try to break out of "cog" context in any way, there are no crazy caveats to learn.
  • David Betz Posts: 14,511
    edited 2013-12-03 06:47
    HJMP / HCALL / HRET

    for hub-execution versions of those instructions, maybe even an HDJNZ
    I think I missed something. What do these instructions do? I assume that HCALL and HRET don't actually modify instructions like CALL/RET do? Or maybe you're expecting HCALL to use all 18 bits of D+S and push its return address on an AUX memory stack?
  • cgracey Posts: 14,133
    edited 2013-12-03 06:53
    David Betz wrote: »
    I think I missed something. What do these instructions do? I assume that HCALL and HRET don't actually modify instructions like CALL/RET do? Or maybe you're expecting HCALL to use all 18 bits of D+S and push its return address on an AUX memory stack?

    Yeah, Bill. These would be worth implementing. They would round out what was missing for finer address control and calls/returns without resorting to discrete subroutines in PASM.
  • Yanomani Posts: 1,524
    edited 2013-12-03 06:59
    cgracey wrote: »
    It's now 6:30am here. My wife is getting up to get the kids ready for school and I've got to get some sleep.

    Like Chilly Willy, the singing polar bear, in the tale that the Old Captain told us:

    Rockaby baby, la la la la...

    Happy dreams!
  • cgracey Posts: 14,133
    edited 2013-12-03 07:08
    After one of you mentioned that we wouldn't want the PC staying at 0..7 during hub execution because it would undermine register remapping, I just realized that we could constrain it to the WIDE window, which could be based at any address in the cog.
  • pedward Posts: 1,642
    edited 2013-12-03 08:28
    cgracey wrote: »
    I think WIDE is it. It reads well and is very simple to remember.

    I want WIDELD, not RDWIDE :)
  • potatohead Posts: 10,260
    edited 2013-12-03 08:29
    So then the product of this is PASM sitting in the HUB, directly executed with HEXEC ptra with ptra the address?

    Don't use some instructions, hubops, etc... which would be reserved for LMM and COG PASM.

    This gets called HUB PASM, and we now have PASM, LMM, XMM execute models. Wow.

    Return to the COG via a standard JMP instruction, canceling the hardware HUB execution mode. And carry on at top speed and full use of HUB operations.

    Bill, I think that's about the simplest model there is, given the state of things right now.

    Well yes, I have to agree with Kerry.

    1MB PASM programs @ ~90 percent of native with few restrictions.

    Holy Buckets! All of a sudden, that 256KB HUB makes a big difference. Plenty of room to be paging in fairly large programs from external memory.

    Can't wait to play with an FPGA image.

    @JMG, well OK. Here we are this morning. I'm going to concede your point. Maximizing it right now makes perfect sense. On the assumption this all makes sense. I'm thinking it will. :)
  • David Betz Posts: 14,511
    edited 2013-12-03 08:33
    I'm still not sure I understand HCALL and HRET. Where does HCALL put its return address and where does HRET get it from?