
Memory mapped I/O vs I/O mapped I/O

evanh Posts: 15,187
edited 2015-03-17 04:35 in Propeller 2
I just bumped into a subject raised by Seairth in the P1V forum - http://forums.parallax.com/showthread.php/160278-The-need-for-CTRx-addressable-registers

He was originally looking into a windowed mechanism for maintaining a small range of Cog addresses used for the I/O map. But going straight for a dedicated I/O map with its own instructions is the cleaner approach imho.


I believe the basis for memory mapped I/O is that you don't need any special instructions - not even one - for managing all I/O. And the bonus is that you have the entire existing instruction set available for accessing this memory mapped I/O.

It's the "keep it simple" approach. Brilliant! Case closed. :)
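
For anyone who wants a concrete picture, here's a minimal P1 PASM sketch (pin choices are arbitrary, purely for illustration): because OUTA, DIRA and INA sit at ordinary cog addresses ($1F0..$1FF), any existing instruction can act on them directly - no I/O opcodes anywhere.

DAT
        org     0
start   mov     dira, led_mask          ' special register as the target of a plain MOV
        mov     time, cnt
        add     time, half_sec
:loop   xor     outa, led_mask          ' toggle the LED pin with a single ALU instruction
        test    btn_mask, ina wz        ' sample an input pin straight into the Z flag
        waitcnt time, half_sec
        jmp     #:loop

led_mask  long  |< 16                   ' arbitrary LED pin
btn_mask  long  |< 0                    ' arbitrary button pin
half_sec  long  40_000_000              ' half a second at 80 MHz
time      long  0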


Not quite ... it presumes one detail ... that memory address space is effectively unlimited. And, as we know, not all processing units have large addressing ranges. The more I/O that's provided, the more a small address space gets squeezed.

The most important question very quickly becomes: how many different ways of accessing I/O and special registers are actually useful? I haven't yet studied real examples, but I'm of the opinion, as per my post in the other topic, that load and store are all that's useful for data massaging, because of the need for atomic data manipulation.

Special examples, like waitvid, have special instructions. We've already got some dedicated instructions, so why not finish the job and make a dedicated I/O address map as well?


Anyone who's written lots of PASM code, please butt in.

Comments

  • evanh Posts: 15,187
    edited 2015-03-14 16:11
    In case it wasn't obvious, I'm thinking this is a matter that Chip should consider for the Propeller 2 design. Ie: Throw away the now 16 special addresses in Cog space and instead have fully I/O mapped I/O. This gives the full 512 address range back to program and data.
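    To put a shape on the idea (the mnemonics and I/O address names below are made up, not anything Chip has shown), a dedicated map would mean something like:

        iord    temp, #IO_PHSA          ' hypothetical "load from I/O address" into cog RAM
        add     temp, offset            ' full existing instruction set available on the copy
        iowr    temp, #IO_FRQA          ' hypothetical "store from cog RAM to I/O address"

    with the 9-bit source field reused as a 512-entry I/O address space, leaving all 512 cog longs for code and data.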
  • potatohead Posts: 10,254
    edited 2015-03-14 16:41
    Memory mapped does offer the potential to use an IO result directly, where dedicated instructions add cycle time.

    On a load store type CPU, that isn't as big of an advantage. Props are memory to memory in the COG.

    If we do get hubexec, the size of the cog space isn't as big of a deal as it might be otherwise.

    Now that we have the smart pins, non binary IO cases are likely to have instructions, for a nice balance.
  • jmg Posts: 15,145
    edited 2015-03-14 16:46
    evanh wrote: »
    I believe the basis for memory mapped I/O is that you don't need any special instructions - not even one - for managing all I/O. And the bonus is that you have the entire existing instruction set available for accessing this memory mapped I/O.
    ...

    The most important question very quickly becomes: how many different ways of accessing I/O and special registers are actually useful? I haven't yet studied real examples, but I'm of the opinion, as per my post in the other topic, that load and store are all that's useful for data massaging, because of the need for atomic data manipulation.

    It is a good idea to not consume Code space, but a full second 512 presumes a spare bit in the Opcode space, to signal IO or Memory opcode.

    It is also not quite just load and store, as you need to have ORL, ANL, XRL for atomic bit access.
    Also, you ideally need associated bits in the same field, for true atomic access.
    ie sometimes you do want to control two timers with one opcode.


    If you are doing a 'smarter pass' on IO, then also consider mapping some boolean flags as virtual pins, so the WAITxx opcodes can work on peripheral flags.

    A variant that was WAIT_BIT_AND_CLEAR (perhaps CC on WAITxx? ) would give the most code-compact flag handling.
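    (A concrete P1 illustration of that atomic-bit point, with an arbitrarily chosen pin pair: one memory-mapped logical opcode updates several bits on the same clock, which is what the equivalent I/O-map opcodes would have to match.)

        or      dira, two_pins          ' make both pins outputs in one atomic operation
        andn    outa, two_pins          ' drive both low together - no window between them
        xor     outa, two_pins          ' toggle the pair on the same clock

    two_pins long   (|< 4) | (|< 5)     ' arbitrary pin pair, just for the example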
  • evanh Posts: 15,187
    edited 2015-03-14 17:04
    jmg wrote: »
    It is a good idea to not consume Code space, but a full second 512 presumes a spare bit in the Opcode space, to signal IO or Memory opcode.
    Nope, that's the nice part about only a restricted number of opcodes. It doesn't consume anywhere near a whole bit at all.

    It is also not quite just load and store, as you need to have ORL, ANL, XRL for atomic bit access.
    Also, you ideally need associated bits in the same field, for true atomic access.
    ie sometimes you do want to control two timers with one opcode.

    Are those PASM instructions? Assuming you are referring to multi-cycle atomic memory operations, those only work as atomic when the target is real RAM, or a working buffer. When it's raw I/O the data is latched every sys clock, so the only sure way is unidirectional flow. Ie: load and store is all that's needed.

    If you are doing a 'smarter pass' on IO, then also consider mapping some boolean flags as virtual pins, so the WAITxx opcodes can work on peripheral flags.

    A variant that was WAIT_BIT_AND_CLEAR (perhaps CC on WAITxx? ) would give the most code-compact flag handling.

    Yep, all good stuff.
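    (To make the unidirectional-flow point concrete: PHSA advances by FRQA on every system clock, so the dependable pattern is a one-way load into cog RAM with the arithmetic done on the copy - any read-modify-write aimed back at a register like that is racing the hardware.)

        mov     now, phsa               ' one-way load: snapshot the free-running accumulator
        mov     delta, now
        sub     delta, last             ' elapsed ticks, computed entirely in cog RAM
        mov     last, now               ' keep for the next pass

    now     long    0
    delta   long    0
    last    long    0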
  • evanh Posts: 15,187
    edited 2015-03-14 17:12
    potatohead wrote: »
    On a load store type CPU, that isn't as big of an advantage. Props are memory to memory in the COG.
    It could be argued that HubExec is a load/store architecture. ;)

    If we do get hubexec, the size of the cog space isn't as big of a deal as it might be otherwise.
    There is opportunity for having larger "caching" here. Chip is indicating the HubExec buffers will be visibly addressable under the special register set, but it's limited to only 8 locations.

    Now that we have the smart pins, non binary IO cases are likely to have instructions, for a nice balance.
    Which makes the case for memory mapping being unimportant.
  • ozpropdev Posts: 2,791
    edited 2015-03-14 18:03
    While Hubexec solves the code size issue nicely, cog programs are not left behind.
    Cog program sizes in the "HOT" P2 were significantly smaller than P1 programs.
    Lots of gains were made with the new efficient instructions.
    Many operations were now reduced to single instructions: auto indexing of pointer registers, direct hub writes of 9 bits without a separate LONG directive.
    The testing of bits without having a collection of masks was also a nice feature.
    Word, byte and nibble instructions also were big winners in code size reduction.
    Even IO made gains with combined DIR/OUT functions also saving space.
    Those IO instructions such as SETP, CLRP, OFFP etc. also had flow-on effects to SPIN 2.
    These instructions would have made it even easier for beginners/newbies, eliminating the need to even think about pin direction.
    These IO shortcuts appear to be missing in the new "Cool" P2 but maybe that's part of the smart pins now.
    Most of the other stuff appears to have made the jump across to the new P2 including my other favourite "REP" block repeat instruction :)
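    (To give a flavour of that saving - going from memory of the since-retired "hot" instruction docs, so treat the exact semantics as approximate - where P1 needs a mask long plus separate DIR and OUT read-modify-writes, those opcodes took a pin number directly.)

        ' P1 today: a mask long and two read-modify-writes
        or      dira, led_mask          ' make the pin an output
        or      outa, led_mask          ' drive it high

        ' "hot" P2 style, approximately:
        ' setp    #16                   ' drive pin 16 high - no mask long needed
        ' clrp    #16                   ' drive it low

    led_mask long   |< 16               ' arbitrary pin, just for the example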
  • jmg Posts: 15,145
    edited 2015-03-14 18:21
    evanh wrote: »
    Are those PASM instructions? Assuming you are referring to multi-cycle atomic memory operations, those only work as atomic when the target is real RAM, or a working buffer. When it's raw I/O the data is latched every sys clock, so the only sure way is unidirectional flow. Ie: load and store is all that's needed.

    For data flow, maybe, but there are a lot of peripheral booleans that need to be tested/set/cleared, hence you also need AND / OR / XOR logical opcodes (and some even allow INC/DEC), and it is natural in the Prop to allow the WAITxx opcodes to access IO space as well.
    So the opcodes needed grows to more than just 2.
    IIRC, some ARMs 'patch-in' IO opcodes via the large address space - but that pushes you to loading 32b values.
    (eg you can map any pin to a unique per-bit address, map more for SET and more for CLEAR... it is easy to hit hundreds of address locations).
    ARM vendors tend to be rather less worried about absolute code size.
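    (For anyone unfamiliar with that trick: on a Cortex-M3/M4, for example, the peripheral bit-band alias gives every bit of the 0x40000000 region its own word address - alias = 0x42000000 + byte_offset*32 + bit_number*4 - so an ordinary store of 1 or 0 sets or clears a single bit. Hundreds of addresses spent, but no special opcodes needed.)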
  • msrobots Posts: 3,704
    edited 2015-03-14 18:44
    Maybe I do not get this proposal of memory mapped IO.

    But the memory map of a cog is not virtually infinite (32 bits); it is restricted to 512 longs by the available opcode space. And that is 9 bits for source/destination.

    Or you think about memory mapped IO in the HUB RAM space, using rdlong/word/byte and wrlong/word/byte. But that is a hub op, and way slower than register access inside the cog.

    So to use the normal cog instructions for IO access to 512 virtual registers we would need two more bits per opcode to distinguish RAM / IO for source and destination. They are just not there: cond (4) + instr (6) + ZCRI (4) + dest (9) + src (9) already accounts for all 32 bits.

    And for special IO instructions? Well, it's the same, isn't it? You will need a lot of instructions: movIO, testIO, orIO, andIO, xorIO, waitxxIOs, all the rorIO/rolIO/shiftIO instructions and tons more.

    It's just ugly and will also need space in the opcode. Going to a 34-bit long in a cog is not really feasible and would need 34-bit hub and external RAM/flash/EEPROMs. So what to do?

    A "switch between IO and cog RAM" instruction will eat a long every time it is used in code, and will easily eat up the 16 longs of special registers. Also not a solution.

    We could kick out the conditional execution of code. Would gain 3 bits. But it is a main part of PASM and I would really miss it.

    I am quite sure that @Chip went through all these variations before he came up with his smart pins.

    In some sense they are pin mapped registers, not memory mapped: 64 (96?) independent subsystems controlling the pins (pin pairs?). Sort of minion cogs, running in parallel to the cogs like the timers/counters do now. But who, besides Chip, knows?

    As far as I could follow the vague descriptions of the smart pins, Chip's plan was to have some fast internal serial(?) bus around all pins for configuration and a different(?) parallel(?)/muxed(?) bus for direct access to each cog.

    He also stated that - since the smart pins are somehow independent of the cog Verilog - the capability of the pins is mainly restricted by the die size left over after finishing the cogs.

    Basically the classic Roman approach of divide and conquer.

    We may find out. Some day.

    Mike
  • potatohead Posts: 10,254
    edited 2015-03-14 19:28
    It could be argued that HubExec is a load/store architecture.

    Yes, exactly. In its simplest implementation, we get a CPU with 500 or so registers! In the last iteration we saw, it was pretty easy to work between HUB and COG, which I was eager to explore some. Stuff some handy routines in the COG, and have the program running in HUB... We get to play with all that again soon.

    In terms of the I/O, it's going to come from the COG anyway, so then we might as well just talk about the COG.

    @msrobots: Yes, mostly where I'm at on it too.

    For bit mashing, it's best to have memory mapped I/O as a ton of ops get done directly. To match it, we need lots of instructions. Seems a bit silly, given where things are at.

    Honestly, if this is all about a few longs in the COG, I would prioritize both speed in the bit banging use cases and number of opcodes overall above freeing a few COG longs, particularly now that we will get 16 of them, and hubexec...

    All academic though. Chip's gonna do what he does. When we get an image, it's time to play!
  • evanh Posts: 15,187
    edited 2015-03-14 20:01
    jmg wrote: »
    For data flow, maybe, but there are a lot of peripheral booleans that need to be tested/set/cleared, hence you also need AND / OR / XOR logical opcodes (and some even allow INC/DEC), and it is natural in the Prop to allow the WAITxx opcodes to access IO space as well.
    None of that is needed as such, but as OzProp just mentioned there is a code size case for twiddling config bits. Interestingly, Potatohead just highlighted how SmartPins won't have any memory mapped registers at all. Potentially including replacement counters.

    The only bit oriented special registers looking to be kept in Cog space will be the direct IN/OUT/DIR registers .... Everything else is solidly read/written as longs, so could be just as effective in an I/O map with simple load/store instructions.

    Hmm .... maybe SmartPins makes the issue mostly irrelevant.
  • evanh Posts: 15,187
    edited 2015-03-14 20:15
    potatohead wrote: »
    Memory mapped does offer the potential to use an IO result directly, where dedicated instructions add cycle time.

    On a load store type CPU, that isn't as big of an advantage. Props are memory to memory in the COG.

    My main point from the big write-up is that real use cases always end up storing in a software buffer before reading/writing the hardware. It's a natural load/store operation.

    The one exception is literal bit bashing. There is bitwise configuration but we can forget that with SmartPins, it'll be whole longs at a time. That leaves just literal bit bashing ...
  • potatohead Posts: 10,254
    edited 2015-03-14 21:47
    Yes, it does.

    And yeah, I think the smart pins idea will render most of this discussion moot. The counters and everything are going onto the pins, sort of...

    The basic digital I/O case is likely to stay as is, and bit bashing is a very useful case to keep optimized. Lots of stuff we do today on P1 would have been impacted with more instructions being required for an I/O transaction.

    Last I understood, full speed DAC operation will be by pin / cog group. Any COG can set the DAC, "with an instruction" or some operation or other. I don't think this is going to be memory mapped at all. There is also the matter of the FIFO buffer, we all were pretty eager to see that be a two way thing, out for video, audio, signals, and in for capture of same, among other tasks.

    Honestly, it's premature. Chip had just started to sort out the lower level technical details, which he shared. From what he shared, the vast majority of it won't look like it does on a P1 today at all. Probably only the digital I/O case (IN, OUT, DIR) will; the rest, who knows?
  • Cluso99 Posts: 18,069
    edited 2015-03-15 19:52
    Hubexec would permit the cog space to be extended, with the extended cog space being single ported.
    This would allow more register space in the lower 512 cog space.

    Anyway, everything is moot until we know what Chip is/has done.

    IIRC we are almost 1 year since the last published instruction set. We do not know what it is now, or anything else for that matter.
    All design is being done offline and I suspect that when it is published it will be frozen, other than bug fixes.
  • kwinn Posts: 8,697
    edited 2015-03-16 08:42
    evanh wrote: »
    In case it wasn't obvious, I'm thinking this is a matter that Chip should consider for the Propeller 2 design. Ie: Throw away the now 16 special addresses in Cog space and instead have fully I/O mapped I/O. This gives the full 512 address range back to program and data.

    It's an idea worth checking out, but it would be a good idea to profile the current P1 code to see if the gains are worthwhile. Being able to use multiple instructions to read/write data from/to the registers does have some advantages.
  • Seairth Posts: 2,474
    edited 2015-03-17 04:35
    For those who aren't reading the P1V thread, I posted another idea related to this topic. In the case of the P2, the idea might be even easier to implement since much of the infrastructure is already in place.
  • Hi all

    In case you missed it:
    the Propeller already has MEMORY mapped IO.