The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Part 2

Comments

  • potatohead Posts: 9,759
    edited 2015-09-24 - 00:04:08
    Oh man, a long based processor would be very nice, simple and clean. All addressing just like COG addressing on P1...

    Honestly, the COG address space is just going to be different. Adding the operator to specify that difference makes sense. I for sure don't feel good about the lengthy shift characters being everywhere. Laborious and error prone.

    If all addressing is longs, there are lots of shifts and masks everywhere. Having that byte granularity in the HUB sense is best, to me at least.

    And I don't get it at all. Why such a fuss over things that make a lot of sense in the context of how Propellers have been programmed so far?

    The one gripe was no standard tools, and we got gcc for that and it works too. Those that really felt that was best use gcc, and SPIN / PASM are just fine otherwise.

    This Prop should play out just the same. We will have a bunch of gcc users looking for those tools, as well as people using SPIN / PASM.



    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: https://forums.parallax.com/discussion/123709/commented-graphics-demo-spin
  • tonyp12 Posts: 1,945
    edited 2015-09-24 - 00:11:29
    On ARM, everything is 32-bit aligned;
    there are only two instructions, LDRB and STRB, that can handle bytes, and they only work together with a register.
    So most of the time it takes two instructions to get anything done.

    The few times you need bytes on P2, could you not use a special register that you have to write it to and then read from that will shift+mask.
    e.g one extra step to get byte from a long, but not a big deal

    Rdlong specialbytetranslaterreg, hub address ' you need to read a byte in to the cog anyway.
    mov label, specialbytetranslaterreg wz wc ' wz/wc sets the 1 of 4 mask location and always shift down to bit0-7

  • potatohead wrote: »
    As that "new user" so many years ago, I was able to jump on a P1 and get things done in a DAY. Nothing about it was hard, and lots about it was a lot of fun.

    I really want that same overall feel for P2 with SPIN and PASM. Ideally, the on chip system will complete the picture with the whole thing one design, made to operate together, etc...

    I absolutely agree! And I want that again too!

    However, you and I are at an advantage here: we both already have P1 PASM under our belt and we both have been deeply involved in the P2 design. Even without an FPGA image to work with, I guarantee both of us can write advanced P2-style PASM. We can never be "new users" again, even for the P2.

    This will not be the case for someone who has never looked at a Propeller before. The P2 is already much more advanced (and complex) than the P1 is. The learning curve is going to be steeper. I'm just concerned that it's also going to be less fun.

    (more on this in my reply to another comment... hold on...)
  • potatohead wrote: »
    ...
    This has come up a pile of times before. And I'll say it again, if SPIN and PASM didn't make the great sense they did, I would have passed on this chip in a second, never thinking twice. It's really important that we leave SPIN and PASM to its creator, who is Chip, and let him do what he does with languages and tools.

    Yes, that is different, and that is precisely why a lot of us like using those tools and languages.

    Bear in mind, one of the design specs is "fun to use"
    Absolutely Agree!

    I love the simplicity of PASM.
    Spin syntax (the short operators like != etc catch me often) I don't enjoy so much. I'd rather have a more BASIC-like syntax. But I do like the enforced indentation (as long as the IDE marks it like PropTool can).
    BTW There could always be an alternate syntax giving the same bytecode output.

    I am really an Assembler Programmer. However, I only do PASM in the P1 when required. If there is no speed issue, the Spin is actually easier.

    When I "accidentally found" the P1 I was over-awed (if that's a word) with its capabilities, multicore and no interrupts. I immediately ordered a ProtoBoard or 2. Then I had to wait. Meanwhile I started programming.

    When it arrived, I had my blinking LED program running in minutes! You cannot do that with any other chip that I know of (and I have programmed a lot of them).

    One thing is certain, to me anyway - PASM + SPIN2 will go together well. They will be used by lots of newbies to just get started.

    And, +1 for wanting some short form macro ability, at least for the ALT/AUG+JMP/CALL/RET instructions.
    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • jmg Posts: 13,777
    Roy Eltham wrote: »
    jmg,
    however, the # thing means immediate value across all pasm instructions, not just for the jmp #label case. So you are proposing making it inconsistent.
    I'm not following - other ASMS use # for immediate ?

    A label does not always translate to an immediate, and once you add relative jumps, or relocatable code, what is loaded into the actual opcode is nothing like the literal #label.

    Then, there is JMP $+Offset assembler.... common in other MCUs

    Not that users care much what is loaded, they just want simpler and clean easy to read code.

  • Chip,

    Why can't we just use longs on the outside (ie visible to the programmer)?

    The only time we use bytes and words is with RD/WR-BYTE/WORD. So the only time we really need to worry about byte addressing is when referencing hub.

    So what has happened to make us use byte addresses everywhere on the P2 ???
    Everything was fine on the P1 so can't we do the same on P2?
    I am confused!
  • Roy Eltham wrote: »
    Seairth, jmg, etc.
    I STRONGLY disagree with you. In fact, I argue that doing it as I suggested makes it MUCH easier for the new person. Having to put addr/4 or addr<<2 all over your code and know which to use when is extra complication, just because the actual stuff is byte addressed but most of the opcodes only contain 9 bits for cog addressing, so you need to /4 the values; however, some of them expect the larger 20-bit address.

    mov x, ##value (or mov x, &value) is a lot cleaner than mov x, #value/4

    You misunderstood what I was saying. I wasn't arguing that we should keep "/4" instead. I'm saying "##" is obscuring the fact that you have to do "/4" at all!

    Here's the way I see it:

    1. Hub memory supports instructions at any byte offset. Because of that,
    2. Instructions in cog (and now LUT) memory are also treated like they're byte addressable to keep things consistent (even though cog/LUT instructions must always be long-aligned). Because of that,
    3. Cog/LUT instruction addresses have two extra bits that must be dealt with. Because of that,
    4. We have to do things like "/4" and "<<2". Because of that,
    5. We add "##" as syntax sugar.

    Each step adds complexity to address the complexity before it. Instead, if we treat the instruction addresses as long offsets (including in the hub), that entire list above goes away. To me, that is simpler. That is easier to learn and to understand. That is more fun.
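
To make steps 3 and 4 concrete, here is a minimal Python sketch (not PASM) of the arithmetic that "/4" and "<<2" perform when a cog/LUT register number is expressed as a byte-style address. The helper names are hypothetical and are not part of any Propeller tool; the long-alignment check reflects the statement in step 2 that cog/LUT instructions must always be long-aligned.

```python
# Hypothetical helpers for illustration only; not part of any Propeller toolchain.
BYTES_PER_LONG = 4

def byte_to_long_addr(byte_addr: int) -> int:
    """Equivalent of 'addr/4' (or 'addr>>2'): drop the two sub-long bits."""
    if byte_addr % BYTES_PER_LONG != 0:
        raise ValueError("cog/LUT instructions must be long-aligned")
    return byte_addr >> 2

def long_to_byte_addr(long_addr: int) -> int:
    """Equivalent of 'addr<<2': restore the two sub-long bits as zeros."""
    return long_addr << 2

# Byte-style address $20 names cog register $008, and the reverse.
print(hex(byte_to_long_addr(0x20)))
print(hex(long_to_byte_addr(0x008)))
```

If instruction addresses were long offsets everywhere, as argued above, neither helper would be needed.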
  • jmg Posts: 13,777
    tonyp12 wrote: »
    The few times you need bytes on P2, could you not use a special register that you have to write it to and then read from that will shift+mask.
    e.g one extra step to get byte from a long, but not a big deal
    It is things like packed records where byte granularity is important. If you have multiple COGS working in the same memory, you need that as Atomic Granularity too.
    It can be a pain to mix, but (as Chip says) I do not see much choice ?
  • Seairth Posts: 2,374
    edited 2015-09-24 - 01:43:41
    cgracey wrote: »
    The whole conundrum is in supporting less-than-long data (words and bytes). They need extra bits to resolve addresses among longs.

    It would be great to make a machine that is just long-based - what a relief that would be! Supporting words and bytes, though, requires those extra sub-bits. Then there's the issue of how to handle the addressing scheme which must involve all three sizes.

    Chip, I think I must be missing something about the new design. I understand that the address lines to the hub memory must have the lower two bits so that you can address individual bytes. What I don't understand is why this also affects instruction fetching. If instruction addressing (not data addressing) is in longs instead of bytes, then I would think that:

    * pc[8:0] would exactly match the cog or lut address lines (depending on pc[9]).
    * {pc[16:0], 2'b00} would exactly match the address lines to the hub memory.
    * pc[19:17] would be reserved for future expansion (assuming you didn't implement the other suggestion I made above)
    * pc would increment by 1 regardless of execution mode.

    I don't see how this affects or is affected by supporting less-than-long data addressing.
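
The long-granular PC described in the bullets above can be sketched in Python as follows. This is only a model of the proposal, not of the actual hardware. One thing the post does not state is how the PC selects hub versus cog/LUT memory; the sketch assumes, hypothetically, that values below $400 select cog/LUT and everything else selects hub. The function name is also made up.

```python
def decode_long_pc(pc: int):
    """Decode a 20-bit PC that counts longs, per the proposal above.

    Assumption (not in the post): pc values below $400 select cog/LUT
    memory; all other values select hub memory.
    """
    if pc < 0x400:
        region = "lut" if (pc >> 9) & 1 else "cog"   # pc[9] picks cog or LUT
        return region, pc & 0x1FF                    # pc[8:0] drives the address lines
    return "hub", (pc & 0x1FFFF) << 2                # {pc[16:0], 2'b00} drives the hub

for pc in (0x010, 0x210, 0x400, 0x401):
    region, addr = decode_long_pc(pc)
    print(f"pc=${pc:05X} -> {region} address ${addr:05X}")
```

Because the PC counts longs, incrementing it by 1 advances the hub byte address by 4, which is the point of the last bullet.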
  • Rayman wrote: »
    Does having both with byte addressing help when porting hubexec code to cog code?
    Seems like it would help...

    Actually, all hub code can run in the cog, if it fits.
  • Seairth wrote: »
    Please don't add more address operators. This is just obscuring complication with syntax sugar. This makes PASM more difficult to learn for new people. I'm sure some of you will disagree, but you need to remember that you have an entirely different perspective of the P2 than a new person will.

    And, personally, I think it makes the Propeller less fun to program for. I'd much rather get rid of the complication and keep the fun!

    I agree. I was just revisiting Prop2-Hot and looking at its address operators. We are much simplified in this Prop2. Much of that simplification comes from not having alignment rules.
  • Cluso99 wrote: »
    I am quite concerned about the Special Registers being located at COG $000+.

    Cog RAM $000+ is often used for tables. Now that they cannot be "0" based means adding an extra offset value to get the table. While this is often not a problem, it is if the table is being continually used which will slow down the code.

    Some examples...

    1. Font table: I use a font table located in cog $000+ within the video generator cog. It is extremely timing dependant!

    2. Vector table: Currently there is no other way, but in my faster spin interpreter I have a vector table located in hub. It will be much faster for this to be located in COG or LUT. If it's in COG $000 it will be much faster to decode each spin opcode. IIRC the average spin opcode uses about 50 instructions. Cutting just one instruction in EVERY op code will yield another 2% gain. With LUT-exec and stacks, we are going to see a dramatic improvement in spin execution time. Every bit of speed will help. I am also sure we will see other interpreters making an appearance on P2, as well as GCC :)

    If Bill is around I would love to hear his opinion ???

    Meanwhile, Chip may I suggest you just leave it as you now have it (Special Registers at $000+). This way we can check it out.
    We all need an FPGA code release :)

    The big advantage to putting those special registers at $000..$007 is that cog and LUT become one uninterrupted code space. It makes 1k-instruction programs much easier to write, as there's no interruption where those special registers used to be. So, no cutting your program in half all the time.

    You can always use the LUT as a quick lookup table with zero-based addressing. The RDLUT is a 3-clock instruction, though, not a 2-clock.
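
The 2% figure in the quoted post is simple arithmetic on its estimate of about 50 instructions per Spin opcode. A minimal Python sketch, where the 50-instruction average is the poster's recollection ("IIRC") rather than a measurement:

```python
# Estimate from the quoted post: an average Spin opcode takes about 50 instructions.
avg_instructions_per_opcode = 50
instructions_saved = 1  # e.g. no extra add for a table that is not zero-based

saving = instructions_saved / avg_instructions_per_opcode
print(f"{saving:.0%} fewer instructions per opcode")
```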
  • potatohead wrote: »
    I actually hope it's both...

    Byte code SPIN still has a place for code size reasons.

    It would be neat to have object-level and PUB/PRI-level control over whether Spin code is compiled or interpreted.
  • cgracey wrote: »
    Here is the new cog register map:
    //	addr		read		write		name
    //	-------------------------------------------------------------
    //
    //	008		RAM		RAM		user / ADRA
    //	009		RAM		RAM		user / ADRB
    

    By the way, what is ADRA/ADRB?
  • cgracey wrote: »
    potatohead wrote: »
    I actually hope it's both...

    Byte code SPIN still has a place for code size reasons.

    It would be neat to have object-level and PUB/PRI-level control over whether Spin code is compiled or interpreted.

    Now that SPIN won't be in the ROM, the nice thing is that it can be improved even after the P2 is released! I suggest adding a "_version" const (or something similar) to SPIN2 in anticipation of having a living language spec.
  • Chip,
    This is what I mean about the addressing and the PC.
    I am using contiguous addresses for COG & LUT, and leaving the ability to make LUT 4KB if there is space.

    [Attached image: 431 x 555, 70K]
  • Chip,
    How fast can the OnSemi RAM be accessed? Could it be 2x the P2 clock speed by chance???
  • I'm confused, wasn't the purpose of having the cog memory long aligned so that the addresses could fit in 9 bits? How does the P2 work now?

    Almost all 32bit processors enforce some kind of long alignment rules because memory fetches are long aligned.
  • Cluso99 Posts: 15,219
    edited 2015-09-24 - 02:09:06
    Chip,
    Here is what I thought your program (posted some posts back) could look like...
    dat
            orgh    $00000                  |  orgh    $00001                  'start in hub-exec at $00001 (non-aligned address below $1000)
                                            |  
            loc     adra,@code              |  loc     adra,@code              'load cog starting at 'begin' with 1st half code
            setq    #(codeend-code)         |  setq    #$1F0-1
            rdlong  begin,adra              |  rdlong  begin,adra
                                            |  
                                            |  loc     adra,@code + $1F0<<2    'load lut starting at $000 with 2nd half code
                                            |  setq2   #$200-1
                                            |  rdlong  $000,adra
                                            |  
            jmp     #begin                  |  jmp     #begin                  'cog/lut now hold one contiguous program, jump to it
                                            |  
    code                                    |                                  'hub address of cog/lut program
                                            |  
            org     $200                    |  org     $010,$3FF               'set cog/lut org to register $010, set limit to end of lut
                                            |  
    begin   mov     dira,#$1F               |  mov     dira,#$1F               'start of cog/lut program, enable outputs
                                            |  
    loop    notb    outa,#4                 |  notb    outa,#4                 'toggle pins in a loop
                                            |  
            long    $F4240400 [250]         |  long    $F4240400 [250]         'notb outa,#0 (250 instances)
            long    $F4240401 [250]         |  long    $F4240401 [250]         'notb outa,#1 (250 instances)
            long    $F4240402 [250]         |  long    $F4240402 [250]         'notb outa,#2 (250 instances)
            long    $F4240403 [250]         |  long    $F4240403 [250]         'notb outa,#3 (250 instances)
                                            |  
            jmp     @loop                   |  jmp     @loop
    codeend
    
    No need to orgh $1000 or whatever since the program counter will have a flag determining whether the code is in hub or cog/lut.

    You will note that I have used SETQ to set the number of times the RD/WR-LONG/WORD/BYTE will execute. Thus the count only needs to be the number of long/word/byte 's that need to be copied. The Verilog will add 1/2/4 where needed (when executing the rdlong/etc).

    I have also presumed since we now have contiguous COG/LUT that the SETQ could be changed (later) to allow a full copy of COG/LUT (ie can use 11 bits).

    Is this possible since it is a lot easier than at present? And does it make sense, or am I missing something???


    Here is another piece of code you posted...
    DAT
                    orgh    $00000
    
    entry           setq    #(x-begin)               |  setq #(x-begin)/4            'number of longs to load
                    rdlong  begin,ptrb[code-entry]   |  rdlong begin,ptrb[code-entry]
                    jmp     #begin                   |  jmp  #begin
    code                                             |  
                    org     8                        |  org  8<<2
                                                     |  
    begin           clkset  #$FF                     |  clkset #$FF                  'switch to 80MHz (if pll, else 50MHz)
                    wrfast  #0,#0                    |  wrfast #0,#0                 'ready to write entire memory
                    setedg  #%0_10_111111            |  setedg #%0_10_111111         'select negative edge on p64
                                                     |  
    :loop           getedg                           |  getedg                       'clear edge detector
                    waitedg                          |  waitedg                      'wait for start bit
                                                     |  
                    rep     #2,#7                    |  rep  #2,#7                   'ready for 8 bits
                    waitx   waita                    |  waitx waita                  'wait for middle of 1st data bit
                    testb   inb,#31         wc       |  testb inb,#31        wc      'sample rx
                    rcr     x,#1                     |  rcr  x,#1                    'rotate bit into byte
                    waitx   waitb                    |  waitx waitb                  'wait for middle of nth data bit
                                                     |  
                    ---------------------------------|------------------------------------------------------------------
    coginstrs  ...
                    ...some code/data in register space
                    jmp     #lut1                   'jump into lut code
    
                    org     512                      'just to ensure we are in lut
    lut             ...some code/data in LUT space
    
    lut1            ...some lut-exec code
    :loop           mov     x,y                     'moves cog register "x" 32bit contents to cog register "y"
                    ...
                    djnz    #:loop,count
                    ...
                    jmp     #coginstrs             'jump to cog register space
    
    
    ORGH no longer needs to start at an offset.
    SETQ sets a count (longs for rdlong).
    ORG 8<<2 for cog no longer is in bytes, so ORG 8 can be used.


  • Bill Henning Posts: 6,445
    edited 2015-09-24 - 02:57:11
    Since you asked :)

    I would use long addresses for instructions. I don't like wasting two of 20 address bits, and I see no benefit to non-long-aligned code.

    It also allows prop3+ to have a larger executable space.

    Love byte addressing for (RD|WR){BYTE|WORD|LONG}

    Regarding moving the I/O registers etc to the front - I don't really like it for the reasons Cluso99 mentions, but it allows for more cog ram on later props.

    cog-code should start executing at say $010, so perhaps table reading could move to the LUT? (I miss INDA/INDB)

    so, like Seairth wrote:

    Long addressed:

    $00000-$001FF = cog ram local to cog executing, rd/wr byte/long/word can use same range as mailboxes
    $00200-$003FF = LUT, long addressable, local to cog

    Byte addressed:

    $00400-endofhub = byte addressed

    In the future, for byte code interpreters, I could see

    LUTJMP D/# - jump to the address held in LUT location addressed by D/S

    EMUL8 D/# - read byte from hub (usually ptra++ or ptrb++) and jump through jump table held in lut (saves instruction, delay slot)

    Consider a Spin interpreter using EMUL8 also using the second half of the LUT... 1k instructions...

    In other news...

    I will be dusting off my DE2-115 :) :) :)

    and PIC16F assembly is still as "interesting" as ever. Ugh.
    Cluso99 wrote: »
    I am quite concerned about the Special Registers being located at COG $000+.

    Cog RAM $000+ is often used for tables. Now that they cannot be "0" based means adding an extra offset value to get the table. While this is often not a problem, it is if the table is being continually used which will slow down the code.

    Some examples...

    1. Font table: I use a font table located in cog $000+ within the video generator cog. It is extremely timing dependant!

    2. Vector table: Currently there is no other way, but in my faster spin interpreter I have a vector table located in hub. It will be much faster for this to be located in COG or LUT. If it's in COG $000 it will be much faster to decode each spin opcode. IIRC the average spin opcode uses about 50 instructions. Cutting just one instruction in EVERY op code will yield another 2% gain. With LUT-exec and stacks, we are going to see a dramatic improvement in spin execution time. Every bit of speed will help. I am also sure we will see other interpreters making an appearance on P2, as well as GCC :)

    If Bill is around I would love to hear his opinion ???

    Meanwhile, Chip may I suggest you just leave it as you now have it (Special Registers at $000+). This way we can check it out.
    We all need an FPGA code release :)

    www.mikronauts.com / E-mail: mikronauts _at_ gmail _dot_ com / @Mikronauts on Twitter
    RoboPi: The most advanced Robot controller for the Raspberry Pi (Propeller based)
  • Just been searching for the latest P2 Instruction Set Chip posted a while back.
    Does anyone have a link or can repost the set please?
  • Roy,
    I love PASM, it's by far the best ASM language I have ever used, and I have used a half dozen or more. It's super simple and consistent. I want the P2 version to retain that as much as possible while it adds all the new abilities.

    I have totally lost track of the issues here but I totally agree with that. I hope PASM does not get messed up.

    What I'm worrying about is how easy it will be to get P1 PASM working on the P2.
  • Cluso99 wrote: »
    Chip,

    Why can't we just use longs on the outside (ie visible to the programmer)?

    The only time we use bytes and words is with RD/WR-BYTE/WORD. So the only time we really need to worry about byte addressing is when referencing hub.

    So what has happened to make us use byte addresses everywhere on the P2 ???
    Everything was fine on the P1 so can't we do the same on P2?
    I am confused!

    The way the Prop1 addresses memory means that longs must be long-aligned and words must be word-aligned, while bytes can be anywhere. One thing that means is that you cannot have structures made up of mixed word sizes.

    On the Prop2, there are no such limitations. There is only one issue where any type of hub alignment matters, and that is on fast r/w blocks that wrap - they must be long-aligned to wrap properly. In no other case does it matter, so it makes understanding hub memory dead simple. The ONLY place I see it being a pain is in reconciling cog and LUT longs, which each have a single address, with longs in hub, which take four addresses. That's why <<2 and >>2 come into play. Those could be cleaned up by the approach taken in the development tools, though.

    You know that there is a one-clock penalty for reading/writing hub longs and words that cross long boundaries, but that minor penalty can be overcome by using long alignment, if you want. It is not necessary, though, and I don't see a reason to force it, as it would just introduce a caveat to where something can be.
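
The two alignment models described above can be compared with a small Python sketch. It follows only what the post states: on the Prop1 longs must be long-aligned and words word-aligned, while on the Prop2 any address is allowed and a long or word that crosses a long boundary costs one extra clock. The function names are hypothetical.

```python
SIZES = {"byte": 1, "word": 2, "long": 4}

def p1_is_aligned(addr: int, kind: str) -> bool:
    """Prop1 rule as described: each item is aligned to its own size."""
    return addr % SIZES[kind] == 0

def p2_crosses_long_boundary(addr: int, kind: str) -> bool:
    """Prop2 as described: any address works, but an access that spans
    two hub longs incurs the one-clock penalty."""
    return (addr & 3) + SIZES[kind] > 4

for addr, kind in [(0x100, "long"), (0x102, "long"), (0x101, "word"), (0x103, "word")]:
    print(f"${addr:05X} {kind}: P1 aligned={p1_is_aligned(addr, kind)}, "
          f"P2 penalty={p2_crosses_long_boundary(addr, kind)}")
```

Note that a word at an odd address such as $101 breaks the Prop1 rule but does not cross a long boundary, so under the Prop2 description it carries no penalty.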

    I think what Roy said about insisting on byte-address-level reckoning for cog and LUT is the key to happiness (or peace, at least) here because it maintains consistency of understanding between cog/LUT memory and hub memory, at least size-wise. What SEPARATES cog/LUT memory from hub memory is another issue which touches on sensibilities.

    About not wasting two bits of the PC by supporting non-long-alignment: Remember that we still need to have ANOTHER two bits beyond the PC's bits to reach down to words and bytes. Those two bits must be encoded into the instructions for reckoning absolute and relative addresses. We are at 20 bits for those purposes and there are no more bits for bigger addresses in the opcode set. So, these two sub bits of the 18-bit PC, if you want to see them that way, total about 20 flops per cog, with 16 of them being in the 8-level PUSH/POP/CALL/RET hardware stack. They are not resource hogs and if we got rid of them, we would be forced into long-alignment for all instructions. That would be the only effect of getting rid of them. We wouldn't get a 4x-size hub memory map because we are constrained to 20 bits for byte-level addresses. However, if we totally got rid of words and bytes (which I've really thought about), we could have a 4x-size hub memory map. Supporting words and bytes is a pain, but I realize that for many reasons they are vital. If we didn't have bytes, each of us would hit a wall as soon as we needed a memory-efficient mechanism to handle them. We'd be doing read-modify-writes on hub longs and pulling our hair out, knowing we were mired in the reinvention of an old wheel.
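
The numbers in the paragraph above can be checked with a few lines of Python. The sketch uses only the figures given in the post (20 address bits, two sub-long bits, an 8-level hardware stack). The post's "about 20 flops" total is not itemized beyond the 16 in the stack, so only that part is computed here.

```python
ADDRESS_BITS = 20        # address bits available in the opcode set, per the post
BYTES_PER_LONG = 4

byte_addressed_bytes = 2 ** ADDRESS_BITS                    # 20 bits select a byte
long_addressed_bytes = 2 ** ADDRESS_BITS * BYTES_PER_LONG   # 20 bits select a long

SUB_LONG_BITS = 2        # the two bits below the 18-bit long address
STACK_LEVELS = 8         # 8-level PUSH/POP/CALL/RET hardware stack
stack_flops = SUB_LONG_BITS * STACK_LEVELS

print(byte_addressed_bytes // 1024, "KB reachable with byte addresses")
print(long_addressed_bytes // 1024, "KB reachable with long-only addresses")
print(stack_flops, "flops hold the sub-long bits in the hardware stack")
```

The 4x-size memory map only appears when all 20 bits select longs, which is why dropping the two sub bits from the PC alone would not enlarge the map.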
  • cgracey Posts: 11,488
    edited 2015-09-24 - 05:55:34
    Cluso99 wrote: »
    Chip,

    Why can't we just use longs on the outside (ie visible to the programmer)?

    The only time we use bytes and words is with RD/WR-BYTE/WORD. So the only time we really need to worry about byte addressing is when referencing hub.

    So what has happened to make us use byte addresses everywhere on the P2 ???
    Everything was fine on the P1 so can't we do the same on P2?
    I am confused!

    That's interesting!

    I wonder if we could reckon ALL memory by long-address and consider the two orphaned LSBs as fractions: 0.00, 0.25, 0.50, 0.75. Actually, those could be expressed as .0, .1, .2, .3.

    It would be a little weird to understand that some hub-exec code starts at xxxx.3, for example. But, that's life. I think that would really look strange to people. Perhaps just having the tools unify hub-addressing notions with cog/LUT realities would be best.
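
As an illustration of the ".0 to .3" idea, the Python sketch below splits a 20-bit byte address into a long address and a sub-long digit. The "$LLLLL.s" output format and the function name are hypothetical; the post only floats the notation and does not define a syntax.

```python
def long_dot_sub(byte_addr: int) -> str:
    """Format a 20-bit byte address as '$LLLLL.s' (hypothetical notation):
    the upper 18 bits as a long address, the two LSBs as a digit 0..3."""
    return f"${byte_addr >> 2:05X}.{byte_addr & 3}"

# Hub-exec code starting at byte address $01003 would read as long $00400, offset .3
print(long_dot_sub(0x01003))
```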
  • Fixed point addressing...

    It is weird.
  • cgracey Posts: 11,488
    edited 2015-09-24 - 06:43:46
    Seairth wrote: »
    cgracey wrote: »
    Here is the new cog register map:
    //	addr		read		write		name
    //	-------------------------------------------------------------
    //
    //	008		RAM		RAM		user / ADRA
    //	009		RAM		RAM		user / ADRB
    

    By the way, what is ADRA/ADRB?

    They are generic registers that can receive 20-bit address results from the LOC instruction, in addition to PTRA and PTRB.
  • Cluso99 wrote: »
    Chip,
    How fast can the OnSemi RAM be accessed? Could it be 2x the P2 clock speed by chance???

    It's rated for ~350MHz, but getting a 2x clock spread around is too much trouble. We also wouldn't have enough setup time to do much, given the clock uncertainty.
  • pedward wrote: »
    I'm confused, wasn't the purpose of having the cog memory long aligned so that the addresses could fit in 9 bits? How does the P2 work now?

    Almost all 32bit processors enforce some kind of long alignment rules because memory fetches are long aligned.

    There are two bits in the PC to resolve non-aligned code addresses in the hub. So, during hub exec, all the bits are used to specify the byte-start of the long instruction, while in cog exec, PC[10:2] feeds the cog RAM and the two LSBs are ignored.
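
The bit selection described above can be modeled in a few lines of Python. This sketch assumes a 20-bit PC (the 18-bit long address plus the two sub bits mentioned elsewhere in the thread) and takes the execution mode as a plain parameter, since the post does not say how the mode is chosen. The function name is hypothetical.

```python
def fetch_address(pc: int, hub_exec: bool) -> int:
    """Return the address used to fetch the next instruction.

    hub exec: every PC bit is used, giving the byte-start of the instruction.
    cog exec: PC[10:2] feeds the cog RAM and the two LSBs are ignored.
    """
    if hub_exec:
        return pc & 0xFFFFF          # full 20-bit byte address (assumed PC width)
    return (pc >> 2) & 0x1FF         # 9-bit cog register index from PC[10:2]

print(hex(fetch_address(0x00020, hub_exec=False)))  # cog register index
print(hex(fetch_address(0x00023, hub_exec=False)))  # same register: LSBs ignored
print(hex(fetch_address(0x01003, hub_exec=True)))   # unaligned hub byte address
```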
  • Thanks Chip.

    I realise we need to keep the byte and word access to/from hub. But that is the only requirement where we need to see the lowest 2 bits. And they are only non-zero (presuming we must long align instructions in hub - and it is my belief this should be demanded) when we want to access bytes (00/01/10/11) and words (00/10).
    But when we are referring to hub longs those bits should be 00.
    In other words, words should be word aligned and longs long aligned, just as we have in the P1. That made sense and was easy to understand.
    When we reference cog or lut, they should always be accessed as longs. If it's necessary anywhere (and I didn't see that in P1V code) then they should be hidden from the user and be 00.

    IMHO I think the whole byte addressing idea came about because of hub-exec. But that should not have happened as the instructions should always be long aligned. I don't see any reason for them not to be. It's not like we have varying sized instructions as on some processors.

    Therefore, the PC should only hold bits Addr[19:2] with bits[1:0]=00 assumed. We then just need a flag to indicate whether the address is in hub, or in cog/lut where cog and lut should IMHO be represented as contiguous addresses A[12:2] with A[1:0]=00 assumed. Of course D & S can only normally contain A[10:2] which we use as D[8:0] and S[8:0].

    To be continued..(must go and pick up wifey)
  • Cluso99 wrote: »
    Chip,
    Here is what I thought your program (posted some posts back) could look like...
    dat
            orgh    $00000                  |  orgh    $00001                  'start in hub-exec at $00001 (non-aligned address below $1000)
                                            |  
            loc     adra,@code              |  loc     adra,@code              'load cog starting at 'begin' with 1st half code
            setq    #(codeend-code)         |  setq    #$1F0-1
            rdlong  begin,adra              |  rdlong  begin,adra
                                            |  
                                            |  loc     adra,@code + $1F0<<2    'load lut starting at $000 with 2nd half code
                                            |  setq2   #$200-1
                                            |  rdlong  $000,adra
                                            |  
            jmp     #begin                  |  jmp     #begin                  'cog/lut now hold one contiguous program, jump to it
                                            |  
    code                                    |                                  'hub address of cog/lut program
                                            |  
            org     $200                    |  org     $010,$3FF               'set cog/lut org to register $010, set limit to end of lut
                                            |  
    begin   mov     dira,#$1F               |  mov     dira,#$1F               'start of cog/lut program, enable outputs
                                            |  
    loop    notb    outa,#4                 |  notb    outa,#4                 'toggle pins in a loop
                                            |  
            long    $F4240400 [250]         |  long    $F4240400 [250]         'notb outa,#0 (250 instances)
            long    $F4240401 [250]         |  long    $F4240401 [250]         'notb outa,#1 (250 instances)
            long    $F4240402 [250]         |  long    $F4240402 [250]         'notb outa,#2 (250 instances)
            long    $F4240403 [250]         |  long    $F4240403 [250]         'notb outa,#3 (250 instances)
                                            |  
            jmp     @loop                   |  jmp     @loop
    codeend
    
    No need to orgh $1000 or whatever since the program counter will have a flag determining whether the code is in hub or cog/lut.

    You will note that I have used SETQ to set the number of times the RD/WR-LONG/WORD/BYTE will execute. Thus the count only needs to be the number of longs/words/bytes that need to be copied. The Verilog will add 1/2/4 where needed (when executing the rdlong/etc).
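    As a rough illustration of that proposal, here is a Python model of a repeated read. It assumes that "add 1/2/4" refers to the hub address step per byte/word/long, and that the count is the plain number of items; `block_read` is a hypothetical name, not an instruction or a Verilog signal.

```python
def block_read(hub, addr, count, size):
    """Read `count` items of `size` bytes (1, 2 or 4) from `hub`.

    `hub` is a bytes-like object; values are decoded little-endian,
    as Propeller hub memory is.
    """
    if size not in (1, 2, 4):
        raise ValueError("size must be 1 (byte), 2 (word) or 4 (long)")
    items = []
    for _ in range(count):
        items.append(int.from_bytes(hub[addr:addr + size], "little"))
        addr += size  # assumed meaning of the 1/2/4 added per transfer
    return items


if __name__ == "__main__":
    hub = bytes(range(16))
    print([hex(v) for v in block_read(hub, 0, 2, 4)])
```

    The caller only supplies an item count; the step size follows from which of byte/word/long is being read.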

    I have also presumed since we now have contiguous COG/LUT that the SETQ could be changed (later) to allow a full copy of COG/LUT (ie can use 11 bits).

    Is this possible, since it is a lot easier than at present? And does it make sense, or am I missing something???


    Here is another piece of code you posted...
    DAT
                    orgh    $00000
    
    entry           setq    #(x-begin)               |  setq #(x-begin)/4            'number of longs to load
                    rdlong  begin,ptrb[code-entry]   |  rdlong begin,ptrb[code-entry]
                    jmp     #begin                   |  jmp  #begin
    code                                             |  
                    org     8                        |  org  8<<2
                                                     |  
    begin           clkset  #$FF                     |  clkset #$FF                  'switch to 80MHz (if pll, else 50MHz)
                    wrfast  #0,#0                    |  wrfast #0,#0                 'ready to write entire memory
                    setedg  #%0_10_111111            |  setedg #%0_10_111111         'select negative edge on p64
                                                     |  
    :loop           getedg                           |  getedg                       'clear edge detector
                    waitedg                          |  waitedg                      'wait for start bit
                                                     |  
                    rep     #2,#7                    |  rep  #2,#7                   'ready for 8 bits
                    waitx   waita                    |  waitx waita                  'wait for middle of 1st data bit
                    testb   inb,#31         wc       |  testb inb,#31        wc      'sample rx
                    rcr     x,#1                     |  rcr  x,#1                    'rotate bit into byte
                    waitx   waitb                    |  waitx waitb                  'wait for middle of nth data bit
                                                     |  
                    ---------------------------------|------------------------------------------------------------------
    coginstrs  ...
                    ...some code/data in register space
                    jmp     #lut1                   'jump into lut code
    
                    org     512                      'just to ensure we are in lut
    lut             ...some code/data in LUT space
    
    lut1            ...some lut-exec code
    :loop           mov     x,y                     'copies the 32-bit contents of cog register "y" into cog register "x"
                    ...
                    djnz    count,#:loop
                    ...
                    jmp     #coginstrs             'jump to cog register space
    
    
    ORGH no longer needs to start at an offset.
    SETQ sets a count (longs for rdlong).
    ORG for cog is no longer in bytes, so ORG 8 can be used instead of ORG 8<<2.


    I was of the mind yesterday that I should expand RDLONG-repeat to automatically flow from cog to LUT. This would involve one more D bit in the RDLONG instruction and one more D bit in the SETQ instruction. Both could be done, but then I started thinking how it would booger up the instruction set for this single-purpose accommodation and I decided against it. It's still pulling at me, though. It would be nice to have a single means to load both cog and LUT. It could be as simple as this:
    entry	setq2	#$3F0-1		'load $010..$3FF
    	rdlong	$010,ptrb[(code-entry)>>2]
    
    	jmp	#begin		'cog/lut now hold one contiguous program, jump to it
    
    code				'hub address of cog/lut program
    
    	org	$010,$3FF	'set cog org to register $010, set limit to end of lut
    
    begin	<cog+lut code>		'your cog+lut program
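    To make the single-load idea concrete, here is a small Python model of it. It treats cog registers $000..$1FF and LUT $200..$3FF as one 1024-long space and reads the `-1` in `setq2 #$3F0-1` as a count-minus-one encoding; both are assumptions drawn from the listing above, and `load_cog_lut` is a hypothetical name.

```python
COG_LONGS = 0x200  # cog registers $000..$1FF
LUT_LONGS = 0x200  # LUT $200..$3FF, modelled as contiguous with the cog


def load_cog_lut(hub_longs, start_reg, setq_value):
    """Copy setq_value + 1 longs from hub_longs into cog+LUT at start_reg."""
    mem = [0] * (COG_LONGS + LUT_LONGS)
    count = setq_value + 1  # assumed count-minus-one encoding
    if start_reg < 0 or start_reg + count > len(mem):
        raise ValueError("load would run past the end of the LUT")
    if count > len(hub_longs):
        raise ValueError("hub image is shorter than the requested load")
    mem[start_reg:start_reg + count] = hub_longs[:count]
    return mem


if __name__ == "__main__":
    image = list(range(1, 0x3F0 + 1))  # stand-in for the program longs
    mem = load_cog_lut(image, 0x010, 0x3F0 - 1)
    print(hex(mem[0x010]), hex(mem[0x3FF]))
```

    With these numbers the load fills $010..$3FF exactly, crossing from cog into LUT at $200 without a second instruction sequence.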
    