ZiCog a Zilog Z80 emulator in 1 Cog - Page 27 — Parallax Forums

ZiCog a Zilog Z80 emulator in 1 Cog


Comments

  • heaterheater Posts: 3,370
    edited 2010-01-17 10:12
    Ron: Sounds like you have a handle on running AltairZ80 emulator. So you could avoid all that writing to the PC with the W command and then xmodemf stuff.

    Just modify the AltairZ80 startup script, the one called "cpm2" in the base AltairZ80 install, so that it attaches your Forth floppy as drive C: or D: or whatever, and attaches something as hard disk J:.
    It does not matter if that something does not exist yet, as AltairZ80 will create a new empty HD image file for it.
    Then just copy from floppy to HD from inside AltairZ80:

    a:> PIP J:=c:*.*

    Now you have your files on a new HD image to use in ZiCog.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • Ron SutcliffeRon Sutcliffe Posts: 420
    edited 2010-01-17 12:58
    Ah, he said, "Too easy". :)
    thanks heater
    Ron
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-01 03:29
    This discussion (on the DracBlade thread) really applies to the ZiCog code rather than specific implementations, so I thought it best to transfer the discussion here. Hope no-one minds.
    Bill said... (in the DracBlade thread)


    With the ever-increasing number of XMM solutions, have you considered decoupling ZiCog from the memory access?

    I have not looked at the sources, so I don't know how easy the following would be - or if you have thought of something similar.

    Loosely, there are three types of memory accesses:

    - code fetch
    - data read
    - data write

    I am thinking of a solution where the memory access is handed off to another cog, and ZiCog requests memory actions through hub locations.

    (
    sample code deleted
    )

    The beauty of this approach is that it TOTALLY decouples ZiCog from specific XMM implementation, and the memory cog can try to do all sorts of caching etc.

    Adding new XMM targets is trivial.

    Frees up some LONGs in ZiCog

    Doing split I/D for 128K memory (which I think MP/M supported) is easy.

    Doing banked memory on any XMM becomes MUCH easier.

    Even better, in any instruction that is not a JUMP/CALL, the next instruction read can be done in parallel with executing the current instruction!

    Simply ask for the next instruction before processing the current one.

    The hub delay slots can also be used :)

    I think it would potentially run faster.

    This would also make it trivial to provide breakpoints for execution or data access, and monitoring locations, performance etc.

    On the hardware side...

    This would also allow a super-cheap ZiCog config I was thinking about, by using two MCP23K256 SPI RAMs or FRAMs (with a speed penalty)

    What do you guys think?
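Since the sample code was deleted from the quote, here is a rough host-side sketch of the "ask for the next instruction before processing the current one" idea. It is a Python model, not PASM, and everything in it (the `MemoryCog` class, `prefetch`, `fetch`, the hit/miss counters) is a hypothetical name made up for illustration. The prefetch runs synchronously here, whereas on a real Prop it would overlap with execution in the other cog.

```python
NOP, JP, HALT = 0x00, 0xC3, 0x76  # real Z80 opcodes


class MemoryCog:
    """Hypothetical stand-in for the memory cog; a bytearray plays the XMM."""

    def __init__(self, ram):
        self.ram = ram
        self.ready_addr = None   # address of the byte waiting in the "hub"
        self.ready_byte = None
        self.hits = 0
        self.misses = 0

    def prefetch(self, addr):
        # The emulator posts the address it expects to need next.
        self.ready_addr = addr & 0xFFFF
        self.ready_byte = self.ram[self.ready_addr]

    def fetch(self, addr):
        addr &= 0xFFFF
        if addr == self.ready_addr:
            self.hits += 1           # byte was already waiting
            return self.ready_byte
        self.misses += 1             # first fetch, or the PC was changed
        return self.ram[addr]


def run(ram, start=0):
    """Run NOP/JP/HALT only and return (prefetch hits, prefetch misses)."""
    mem = MemoryCog(ram)
    pc = start

    def next_byte():
        nonlocal pc
        value = mem.fetch(pc)
        pc = (pc + 1) & 0xFFFF
        mem.prefetch(pc)             # ask for the next byte before executing
        return value

    while True:
        opcode = next_byte()
        if opcode == JP:
            low = next_byte()
            high = next_byte()
            pc = (high << 8) | low   # the pending prefetch is now stale
        elif opcode == HALT:
            return mem.hits, mem.misses
        elif opcode != NOP:
            raise ValueError(f"opcode {opcode:#04x} not modelled")


ram = bytearray(0x10000)
ram[0:5] = bytes([NOP, NOP, JP, 0x10, 0x00])   # nop, nop, jp $0010
ram[0x10:0x12] = bytes([NOP, HALT])
hits, misses = run(ram)
print(hits, misses)
```

In this toy program only the very first fetch and the fetch right after the JP miss the prefetched byte, which is the "all but PC changing instructions" case from the proposal.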




    Bill,

    The ZiCog SRAM memory drivers (for TriBlade, RamBlade and DracBlade)...

    There is a SRAM driver (for each XMM model) that allows for multi-byte access, typically for use with transferring sectors to/from the SD driver. While this driver can also perform single byte access, this is totally inefficient.

    The other SRAM access is embedded within ZiCog and is tailored to each board (by #define compile statements). It resides in a subroutine, although it is my intention to do a couple of embedded inline code sections to improve speed further.

    RamBlade has the fastest SRAM read access, using just 4 instructions. There is nothing faster than this. Even before counting the cost of passing addresses etc. to another cog, a hub access is equivalent to 4 instructions (16 cycles). Yes, that is the worst case, but that is what you have to count on. Do not forget that if you are fetching in advance, you also need instructions for comparing whether the prefetched address is the next one, etc. For banking, it is just a matter of adding the offset address.

    TriBlade is next fastest, taking 7 instructions per byte read.

    DracBlade is much slower, as the latches have to first be checked (or loaded).

    I expect serial Flash/Ram would be extremely slow. My TriBlade has provision for 1/4/8MB of Flash.

    So, the philosophy of RamBlade was to make the fastest self-contained prop, SRAM (XMM) and microSD module cheaply. When only 1 serial port or else a keyboard and TV are required (or both, if SRAM size is reduced to 128KB), it can be run as an SBC (single board computer). Otherwise, it can plug into another prop board (like the Prop Proto Board), which can then be used as an intelligent set of peripherals.



    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
    · Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
    · Prop OS: SphinxOS, PropDos, PropCmd · Search the Propeller forums (uses advanced Google search)
    My cruising website is: www.bluemagic.biz · MultiBlade Props: www.cluso.bluemagic.biz
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-01 03:45
    Thanks for transferring it here...

    I agree with you that RamBlade has the shortest read access.

    My intent with this proposal was to decouple *most* types of XMM from ZiCog, because several XMM implementations would I think greatly benefit from my proposed approach.

    I agree that naive use of SPI ram would be insanely slow - however consider treating it like a backing store, and keeping 16 to 64 pages (of 256 bytes) resident in the hub, swapping pages in and out as required. 64 pages would only take 16K of hub, and I think would be acceptably fast with a VM approach like I am suggesting. Code pages do not need to be swapped out... only updated data pages. With a 64K address space, I suspect CP/M programs would live quite nicely in a 16KB working set, without swapping much. The advantage, even if it ran at 1/2 (or even less) speed is a VERY cheap and simple ZiCog setup.
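To make the backing-store idea a bit more concrete, here is a small Python model of it (a sketch only, not ZiCog code). A `bytearray` stands in for the SPI RAM, and the `PagedStore` class, its counters, and the least-recently-used replacement policy are all assumptions for illustration, since no eviction policy was specified. Only dirty pages are written back, so untouched code pages are simply dropped.

```python
from collections import OrderedDict

PAGE_SIZE = 256  # bytes per page, as proposed above


class PagedStore:
    """Hypothetical page cache: `backing` plays the slow SPI RAM."""

    def __init__(self, backing, resident_pages=64):
        self.backing = backing
        self.resident_pages = resident_pages
        self.pages = OrderedDict()   # page number -> [data, dirty flag]
        self.page_ins = 0
        self.write_backs = 0

    def _page(self, addr):
        number = addr >> 8
        if number in self.pages:
            self.pages.move_to_end(number)       # most recently used
        else:
            if len(self.pages) >= self.resident_pages:
                old, (data, dirty) = self.pages.popitem(last=False)
                if dirty:                        # clean pages are just dropped
                    start = old * PAGE_SIZE
                    self.backing[start:start + PAGE_SIZE] = data
                    self.write_backs += 1
            start = number * PAGE_SIZE
            copy = bytearray(self.backing[start:start + PAGE_SIZE])
            self.pages[number] = [copy, False]
            self.page_ins += 1
        return self.pages[number]

    def read(self, addr):
        return self._page(addr)[0][addr & 0xFF]

    def write(self, addr, value):
        entry = self._page(addr)
        entry[0][addr & 0xFF] = value & 0xFF
        entry[1] = True                          # must be written back later


hub_bytes = 64 * PAGE_SIZE                       # 64 resident pages
store = PagedStore(bytearray(0x10000), resident_pages=2)
store.write(0x0000, 0xAA)   # page 0 becomes dirty
store.read(0x0100)          # page 1, clean
store.read(0x0200)          # evicts page 0, which is written back
store.read(0x0300)          # evicts page 1, no write-back needed
print(hub_bytes, store.page_ins, store.write_backs)
```

The tiny two-page cache is only there to force evictions quickly; with 64 pages the same class would model the 16K hub footprint mentioned above.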

    I also think the overlapping of fetch/execute made possible for all but PC changing instructions would be a big win.

    I really like your RamBlade - I think it is simply brilliant - however my suggested approach would make supporting other XMM targets easier, and would likely speed up XMM not as fast as RamBlade .. it might even speed up RamBlade, due to the potential overlapping of fetch/execute.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-01 05:07
    I moved this discussion here following Bill's discussion but it appears to have also digressed more to the SphinxOS. Will leave it here for now.
    Drac said... (from DracBlade thread)

    Re BST let's take a typical object - eg the attached one which is the keyboard object. How much would it take to convert that to relocatable cog DAT code that you could put in a separate file on an SD card? (For the moment, leave all the spin code as-is and just consider PASM)

    There are the VAR statements at the beginning - would they go in the cog code or stay as they are? Would you add the locations of the buffers to the Start list? And what would be the best way to get data out of buffers if those are now internal in a cog?

    @cluso - yes good point re moving the thread - I'll follow it along on the zicog thread.
    Drac: The implication of separately compiled cog code (cog objects) is that it must be self-resident with minimal hub interaction, as this must go through fixed hub locations (known location and defined size/protocol). This is in fact what CPM and other OSes do. We access these drivers through a defined interface. In SphinxOS, the standard output has a long and the standard input has a long as their hub interface ($7F80 & $7FC0). The multiple character buffers are within the cog.

    For input (i.e. a keyboard)... as another cog/spin takes a character out and clears the hub long, the driver cog looks, sees it is empty (=0) and places another character into the hub buffer (if one is available in the cog buffer). The process repeats. This method may not be quite as fast but reduces the hub footprint. The opposite works for output.

    Once everything goes via this standard interface, the driver may be substituted without changing the calling program. So, you can substitute a pc serial for the keyboard, or a remote wireless connection. The calling program has no need to know.
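The single-long handshake described above can be sketched on a PC like this. It is a Python model rather than driver code, and the names `InputDriver`, `poll` and `read_char` are hypothetical, as is the use of a one-element list to play the shared hub long. One consequence the sketch makes visible is that a character value of 0 cannot be passed this way, because 0 means "empty".

```python
from collections import deque


class InputDriver:
    """Stands in for the driver cog; `hub` is a 1-element list = the hub long."""

    def __init__(self, hub):
        self.hub = hub
        self.buffer = deque()        # the multi-character buffer inside the cog

    def receive(self, text):
        # Characters arriving from the keyboard, serial line, etc.
        self.buffer.extend(text.encode("ascii"))

    def poll(self):
        # Refill only when the consumer has cleared the long (0 = empty).
        if self.hub[0] == 0 and self.buffer:
            self.hub[0] = self.buffer.popleft()


def read_char(hub):
    """Caller side: take the character and clear the long, or return None."""
    ch = hub[0]
    if ch == 0:
        return None
    hub[0] = 0
    return ch


hub = [0]
driver = InputDriver(hub)
driver.receive("hi")
driver.poll()
driver.poll()                 # no effect: the long is still occupied
first = read_char(hub)
nothing = read_char(hub)      # driver has not refilled yet
driver.poll()
second = read_char(hub)
print(chr(first), nothing, chr(second))
```

Because `read_char` only ever touches the hub long, swapping `InputDriver` for a serial or wireless source would not change the caller, which is the substitution point being made.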

    So far, for input, we have serial, keyboard, 1-pin keyboard (almost complete). For output, we have serial, video (TV composite 3pin), with 1-pin TV as a wip.

    We also have fsrw (SD card driver), but this has the whole file system interface as well and currently takes 3 cogs, so this warrants some reworking.

    Here is a link to SphinxOS code v006 (near the bottom of the page) http://forums.parallax.com/showthread.php?p=819353
    IIRC the driver is sx---.spin and the calling interface is isx---.spin (two calling files, one input and one output). The drivers can be substituted while the calling interface remains unchanged.

  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-01 05:52
    Bill said...
    My intent with this proposal was to decouple *most* types of XMM from ZiCog, because several XMM implementations would I think greatly benefit from my proposed approach.
    Bill, you should look at the ZiCog code and the RamBlade driver.

    Any block movements between hub and SRAM are done by a call to the driver in another cog. This is easy to replace with any XMM driver.

    The other access, which is the ZiCog code emulation (fetching instructions and read/write to Z80's memory model in SRAM) only occurs in 1 subroutine (IIRC). This is also easy to replace with any other driver subroutine. (see my code below)

    Where there are issues however, is with the TriBlade SD access, because it requires ZiCog to relinquish the bus. RamBlade does not suffer from this problem although at present I have not removed these interspersed code bits (and there are quite a few).

    So, pretty much, I believe this is already done simply. An include might work better.
    Bill said...
    I also think the overlapping of fetch/execute made possible for all but PC changing instructions would be a big win.
    For your approach of using a separate driver for fetching instructions from SRAM...

    Let's presume the best case, which means the driver cog has already prefetched the next byte instruction and it is waiting in the hub for the cog to take it...

    fetch       rdbyte  opcode,hubptr  wz      'fetch the next opcode
          if_z  jmp     #fetch                 'not ready so wait
     
    

    The best we can hope for is that we are at the hub window, so these 2 instructions will take 8+4 cycles = 12 cycles. The worst case is we just missed a hub window, so the 2 instructions will take 16+4 = 20 cycles. So the average should be 16 cycles. Due to the indeterminate way ZiCog executes, I believe there is no way to ensure a hub window is achieved.

    This presumes that
    • the code has been prefetched
    • we have done no comparisons to check that the prefetched data is from the correct address, which we must also do
    • ignores the overhead of the address changes where we have to tell the driver to fetch an alternate address byte
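The cycle counts above can be written out as a small calculation. This is plain arithmetic in Python using the figures from this post (8 or 16 cycles for the rdbyte, 4 for the jmp); the two extra instructions for an address check are a hypothetical number picked only to show how the overhead adds up.

```python
CYCLES_PER_INSTRUCTION = 4   # an ordinary (non-hub) PASM instruction


def polled_fetch_cycles(rdbyte_cycles, extra_instructions=0):
    """rdbyte plus the jmp, plus any (assumed) address-check instructions."""
    return rdbyte_cycles + CYCLES_PER_INSTRUCTION * (1 + extra_instructions)


best = polled_fetch_cycles(8)        # arrived right at the hub window
worst = polled_fetch_cycles(16)      # just missed the hub window
average = (best + worst) / 2
with_check = polled_fetch_cycles(16, extra_instructions=2)  # hypothetical check
print(best, worst, average, with_check)
```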
    Here is my RamBlade fetch code...
    read_memory_byte
                            mov     outa, address              'copy address bits to P0-18, -CE=0, -WE=t/s(=1 pullup)
                            add     address, #1                'increment address (req delay anyway)
                            mov     data_8, ina                'read SRAM
                            shr     data_8, #24                'reposition 8 data bits and clear upper bits
    read_memory_byte_ret    ret
    
    


    You will note that I get a "free" address increment (not used currently since I require 2 separate routines for this to work this way as one requires NO increment).

    So my access is 4 instructions = 16 clock cycles. I do not count the call/ret instructions as these are the same no matter what. But with my way I can code this inline and save the call/ret for 2 longs extra of code per call location. A gain of 33%. I do not have to perform any check instructions to see that I have the correct prefetch instruction or the like.
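One way to read the 33% figure is as the share of cycles saved by dropping the call/ret pair. The arithmetic below is only that reading, done in Python and assuming 4 clock cycles for each of the instructions involved; the variable names are made up for the example.

```python
CYCLES_PER_INSTRUCTION = 4

read_body = 4 * CYCLES_PER_INSTRUCTION     # mov, add, mov, shr
call_ret = 2 * CYCLES_PER_INSTRUCTION      # call + ret around the routine

as_subroutine = read_body + call_ret
inline = read_body
saving = (as_subroutine - inline) / as_subroutine
print(as_subroutine, inline, round(saving * 100))
```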

    So you can see from this, there is no faster way.

    Of course, we have not examined the overall code impact. Fetching and read/write SRAM is only part of the overall code, particularly in an emulation. More impact will be had with LMM code and the like.

    Regards



  • BradCBradC Posts: 2,601
    edited 2010-02-01 06:28
    Cluso99 said...
    It must be self-resident with minimal hub interaction as this must be with fixed (known location and defined size/protocol) hub locations.

    Whilst I agree completely about the defined size/protocol, why does the location have to be fixed? That is precisely what PAR is for.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Life may be "too short", but it's the longest thing we ever do.
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-01 06:36
    BradC: If the compilation is done separately it is dangerous to be anywhere but fixed.
    I am expecting subsequent binaries to be distributed. (just like CPM had with LS and others)

  • BradCBradC Posts: 2,601
    edited 2010-02-01 06:48
    Cluso99 said...
    BradC: If the compilation is done separately it is dangerous to be anywhere but fixed.
    I am expecting subsequent binaries to be distributed. (just like CPM had with LS and others)

    I don't understand I'm afraid. Any binary chunk designed to be loaded into a cog is entirely independent of its hub ram location. Provided you give it a structure it can parse in PAR when you boot it, you should be able to load it from any long aligned address, anywhere in ram. The structure you pass to it would contain the addresses of any mailbox or configuration locations.
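As a rough illustration of the PAR approach, here is a Python model where a `bytearray` plays hub RAM and the "cog" reads its mailbox addresses from a small block that PAR points to. The two-long block layout and the `start_driver` function are assumptions invented for the sketch; the addresses $7F80 and $7FC0 are just borrowed from earlier in the thread as example values.

```python
import struct

HUB_SIZE = 32 * 1024


def start_driver(hub, par):
    """Model of cog start-up: PAR points at a block of two mailbox addresses."""
    if par % 4:
        raise ValueError("PAR can only carry a long-aligned address")
    # Assumed layout: two little-endian longs, one per mailbox.
    return struct.unpack_from("<II", hub, par)


hub = bytearray(HUB_SIZE)

# First copy of the driver, told to use one pair of mailboxes.
struct.pack_into("<II", hub, 0x7000, 0x7F80, 0x7FC0)
first = start_driver(hub, 0x7000)

# Second copy of the same binary, pointed at a different pair.
struct.pack_into("<II", hub, 0x7010, 0x7E00, 0x7E04)
second = start_driver(hub, 0x7010)
print([hex(a) for a in first], [hex(a) for a in second])
```

Nothing in `start_driver` depends on where the block lives, so the same image can be started several times with different mailboxes.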

    Heck, even loading and relocating an entire Propeller binary file is a doddle provided you don't go hardcoding HUB addresses into it.

  • heaterheater Posts: 3,370
    edited 2010-02-01 09:54
    I had to forgo the use of structures passed to the COG via PAR for ZiCog. There was not enough space in the COG for the PASM code to parse it all in. So any parameters ZiCog needs are "POKED" into the PASM whilst it is in HUB prior to loading/starting the COG.

    Not sure how any Prop OS COG loader would deal with that.

  • BradCBradC Posts: 2,601
    edited 2010-02-01 10:14
    heater said...
    I had to forgo the use of structures passed to the COG via PAR for ZiCog. There was not enough space in the COG for the PASM code to parse it all in. So any parameters ZiCog needs are "POKED" into the PASM whilst it is in HUB prior to loading/starting the COG.

    Not sure how any Prop OS COG loader would deal with that.

    You would need a dedicated long to tell the spin code where, and how many longs to shove into the config block in the DAT code.

    You could use one of the register longs, as you can compile into them, but they will be zeroed on load anyway. Make your last directive fit 497 instead and use the last long to tell the spin loader where to shove the config data. That would of course require padding out the resulting DAT file to get it in the right place, but again it's not difficult.
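A minimal sketch of that loader step, in Python rather than Spin: the cog image is a list of longs, and its last long tells the loader where the config block sits and how many longs it takes. Packing the offset into the upper 16 bits and the count into the lower 16 is purely an assumption for the example, as is the `poke_config` name.

```python
def poke_config(image, config):
    """Return a copy of `image` with `config` poked in before cog load.

    Assumed descriptor in the last long: (offset << 16) | count.
    """
    descriptor = image[-1]
    offset, count = descriptor >> 16, descriptor & 0xFFFF
    if len(config) != count:
        raise ValueError("config size does not match the descriptor")
    if offset + count > len(image) - 1:
        raise ValueError("config block would overwrite the descriptor")
    patched = list(image)
    patched[offset:offset + count] = config
    return patched


image = [0] * 8                      # tiny stand-in for a cog image
image[-1] = (2 << 16) | 3            # config block: 3 longs starting at long 2
patched = poke_config(image, [11, 22, 33])
print(patched)
```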

  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-01 12:29
    heater: I guess we could use an overlay to do the initialisation and then load the normally resident overlay.

    Brad: On further reflection, you of course are correct (heck you wrote the compiler). I don't fully understand the object structure created by the compiler. Your idea of the mailbox (called the rendezvous in Sphinx) address being passed to the cog allows multiple copies to be installed, which is another advantage. So we could have more than 1 serial port, etc. Anyway, I have now moved this type of discussion to Sphinx where it is more appropriate.

  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-01 18:44
    Cluso,

    With respect, I think you are missing my point. I acknowledged that your RamBlade code is great - and is probably the optimal solution for ZiCog.

    Furthermore, I think at $40, it is an incredible value - the only reason I have not ordered a few is that I am too busy working on upcoming products. Plug it into a $32 usb proto board, add the VGA/kb kit, and you have a working ZiCog system for about $40+$33+$15+s/h ($88+s/h)

    I have absolutely no doubt that your RamBlade will greatly outperform latched designs even using my caching approach - as I have stated several times, your "no latches" approach has definite advantages for random access.

    I came up with this alternate approach for three reasons:

    - to help Dr_Acula's design get better performance from his latched XMM solution
    - to make possible a reduced-performance very low cost "HoboCPM" board using SPI Ram.
    - to improve the performance of ZiCog on Morpheus, once it is ported

    I suspect having such a cog would significantly speed up the average memory access on Dr_Acula's design.

    I am in the middle of working on about four products right now, but I may take a day to lay out a "HoboCPM" or "HoboEmu" or "HoboZiCog" board and send it off for a prototype run just to test out my theories :)

    The reason I have not dug into ZiCog - even though I think it is brilliant and I can't wait to run it - is due to a lack of time, and also due to Mike Hustleton kindly starting to port it to Morpheus.

    Bill

    Post Edited (Bill Henning) : 2/1/2010 6:55:21 PM GMT
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-02 03:13
    @Bill: Yes, I wish I had more time too. There are just too many great things going on with the prop to do them all.

    The SRAM (XMM) driver I use for the TriBlade and RamBlade (the TriBlade has a latch for -CE) does block moves by specifying the size. Obviously it works for a length of 1. ZiCog currently has only 1 SRAM readbyte and writebyte routine, so this could easily be changed to call the SRAM/XMM driver.

    It would then be possible to write just a new XMM driver for Morpheus, or, as you suggest, a SPI flash. In fact, since many boards use 64KB eeproms, the easiest may be to use this with a driver. Of course, this is not big enough for a disk emulator.

    BTW, I do not know how you have implemented your SRAM on Morpheus.

    On TriBlade, my mistake was using the Address lines for accessing the microSD (DI/DO/Clk pins) so I have to ensure they are tristated before I call another cog. On RamBlade I use the Data lines for accessing the microSD (DI/DO/Clk pins) so this is not an issue and the interface is easier.

    You may be correct in surmising that another cog may help Drac's performance. As you may have noticed, I gave Drac a lot of assistance to get his DracBlade running.

    BTW: If you just add 2x10K, 1x100R, 1x270R (can be 191R-1K1), 1xLM1117T-3.3 (3V3 voltage regulator TO220) and 1x100uF 25V to the RamBlade, plus a 5V regulated supply (use a camera/phone charger or other USB powerpack), a PS2 keyboard and a TV (mono only), you can have an extremely cheap ZiCog or SphinxOS single board computer (SBC).

  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-02 04:20
    Agreed, 24 hours in a day is not long enough :)

    Actually I was suggesting SPI RAM - MCP 23K256. Two would be 64K - and of course that would be too small for disk emulation!
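For what it is worth, the address split across two such chips is simple. The sketch below is Python arithmetic only, assuming each chip holds 32K bytes and that the top address bit picks the chip; the `locate` helper is a made-up name and no SPI traffic is modelled.

```python
CHIP_BYTES = 32 * 1024   # one 256 Kbit SPI RAM


def locate(addr):
    """Map a 16-bit Z80 address to (chip number, offset inside that chip)."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("address outside the 64K Z80 space")
    return addr // CHIP_BYTES, addr % CHIP_BYTES


total = 2 * CHIP_BYTES
print(total, locate(0x7FFF), locate(0x8000))
```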

    SRAM on Morpheus is a hybrid latched design; lowest 8 address bits are direct, upper 16 are latched (24 bit address space). The design is admittedly optimized for screen buffer burst reads, which I do at 20MB/sec. I suspect performance wise (for random access) it sits between RamBlade and DracBlade.
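To show why a hybrid latched layout favours burst reads over random access, here is a small Python model of the address split. It assumes the latch is reloaded only when the upper 16 bits change, which is a guess made for the sketch and not a description of the actual Morpheus driver; the `LatchedBus` class and its counter are hypothetical.

```python
class LatchedBus:
    """Model: low 8 address bits direct, upper 16 bits held in a latch."""

    def __init__(self):
        self.latched = None
        self.latch_loads = 0

    def access(self, addr):
        upper = (addr >> 8) & 0xFFFF     # goes through the latch
        lower = addr & 0xFF              # driven directly on the pins
        if upper != self.latched:        # assumed: reload only on change
            self.latched = upper
            self.latch_loads += 1
        return upper, lower


burst = LatchedBus()
for addr in range(512):                  # sequential burst, e.g. a scan line
    burst.access(addr)

scattered = LatchedBus()
for addr in (0x000000, 0x010000, 0x000001, 0x010001):   # ping-pong access
    scattered.access(addr)

print(burst.latch_loads, scattered.latch_loads)
```

The 512-byte burst needs only two latch loads, while the four scattered accesses need one each.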

    I know you help not just Dr_Acula but others... which is why I was surprised at your initial reply, as it did not address my suggestion, which was largely intended to boost DracBlade's performance without changing the hardware.

    I saved myself some PCB layout time by realizing that if I pull the DIP flash, Morpheus can take two of those SRAM's right now - so I could test the SPI VM idea on my existing board!

    I think you should make a tiny companion board for RamBlade, that adds precisely the components you list - as you say, it would make a great little ZiCog/SphinxOS board :)

    Post Edited (Bill Henning) : 2/2/2010 4:29:55 AM GMT
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-02 09:00
    Bill: Yes, the SPI SRAM certainly would be nice and small. Could probably share DI/DO/Clk with uSD.

    I guess I like to wrench the speed out of things. I don't think ZiCog would have got off the ground if it was slower than the old Z80.

    Drac has taken a different approach and upped the ante with all his interfaces on the 1 prop. Gee we certainly could do with that 64pin Prop 1B now - pity we didn't realise that a year ago!

    Your Morpheus SRAM interface must be similar to Blade #1 on the TriBlade. It has VGA/TV/KBD/Mouse, plus 512KB SRAM with A0-10 (2KB) directly addressed and A11-18 via a latch. However, some parts are either/or. My idea was the same as yours, for video buffer. However, no-one has tried it to my knowledge (me either).

    Funny you should mention an add-on pcb with the 1-pin keyboard, video, and regulators on it. I have a pcb partially laid out. I also have another, alternative partially laid out for months too. However, I am still looking for that elusive 25th hour in the day - I see you have not found it either. LOL
    Bill said...
    I know you help not just Dr_Acula but others... which is why I was surprised at your initial reply, as it did not address my suggestion, which was largely intended to boost DracBlade's performance without changing the hardware.
    Sorry, I did not mean to be so negative. I guess I didn't think about this being feasible since Drac has all cogs used. However, the SRAM driver cog could probably do what you suggest. I am still skeptical that this could improve the performance, but hey, give it a try. I like to be proven wrong (really, because I learn something).


    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade,·RamBlade,·SixBlade, website
    · Single Board Computer:·3 Propeller ICs·and a·TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: CPUs Z80 etc; Micros Altair etc;· Terminals·VT100 etc; (Index) ZiCog (Z80) , MoCog (6809)·
    · Prop OS: SphinxOS·, PropDos , PropCmd··· Search the Propeller forums·(uses advanced Google search)
    My cruising website is: ·www.bluemagic.biz·· MultiBlade Props: www.cluso.bluemagic.biz

    Post Edited (Cluso99) : 2/2/2010 9:19:02 AM GMT
  • Toby SeckshundToby Seckshund Posts: 2,027
    edited 2010-02-02 09:00
    I will not attempt to compare myself to the likes of you all, but ...

    With DracBlade, I was wondering about a halfway house of using all 16 pins on the one side of the Prop for a 16-bit latched address and multiplexed data (8051/8085-ish, with "ALE") and pushing the SD over to pins 20-23, with just 4 pins for the VGA.

    Or if I get my finger out perhaps the DRAM might come out to play again, having cleared the extra pins.


    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Style and grace : Nil point
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-02 09:24
    @Toby: Go for it. I am not sure of what you get with 4pin VGA. You can save a sync pin by using a pair of exclusive or's and a few resistors, but this will require a mod to the VGA driver (I do not know how to do this). There is a thread on this, originally by Phil. Also, I am unsure what you will get using 1 pin per R, G & B or do you mean using a pair of resistors to do a single output for R+G+B? You will also need to be careful with the pin use as the Video Counter has certain requirements in that department.

    P.S. I would definitely give the DRAM a miss.

  • Toby SeckshundToby Seckshund Posts: 2,027
    edited 2010-02-02 11:50
    Cluso

    Certainly will miss out the DRAMs until later.

    The VGA stuff was going to be H + V syncs on separate pins, as normal, and the other two would give the background (black or primary colour) and the foreground white by driving the other colours. Two colour feeds on one pin should be OK but to drive all three, for B+W, would need buffers. I have played with the VGA demo from PropBASIC and it shows promise, at reduced resolution.

    Or I am being blissfully deluded, again.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Style and grace : Nil point
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-02 12:19
    Toby said...
    Or I am being blissfully deluded, again.
    I have no idea. However, it is easy enough to try. Disable some of the pins when the VSL is being set.

  • Dr_AculaDr_Acula Posts: 5,484
    edited 2010-02-02 13:33
    Re: "I agree that naive use of SPI ram would be insanely slow - however consider treating it like a backing store, and keeping 16 to 64 pages (of 256 bytes) resident in the hub, swapping pages in and out as required. 64 pages would only take 16K of hub, and I think would be acceptably fast with a VM approach like I am suggesting. Code pages do not need to be swapped out... only updated data pages. With a 64K address space, I suspect CP/M programs would live quite nicely in a 16KB working set, without swapping much. The advantage, even if it ran at 1/2 (or even less) speed is a VERY cheap and simple ZiCog setup."

    Sorry I've missed a fair bit of this discussion.

    One thing that would be cool about a sphinx OS is it frees up to 14k of ram in hub. That could well end up being a very useful space for buffering - and hence speed up memory access.

    Some musing re CP/M - most programs spend a lot of time in three places:
    0-FFH - for the calls to the BIOS
    100H to nnH for the main program that is running
    E600 and up for CP/M itself, especially waiting for keyboard input.

    The blocks are not contiguous so you need a smarter way of managing memory. So presumably memory management code needs to first check that the block being sought is indeed in the list of hub ones. That takes a few cycles - possibly more than a single ramblade access, but it would certainly improve performance of a latched solution.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.smarthome.viviti.com/propeller
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-02 19:43
    Sharing everything but /CS is exactly what I was thinking of.

    I know what you mean about trying to wrench the most speed out possible :) but in some cases slower is "good enough".

    I practically drool when I think of what we could do with an 84 pin PLCC Prop1.5 - most important would be VERY fast XMM. Just think... Z80PC EQU OUTB.... need I say more?

    Yes, my SRAM interface is similar to your Blade #1. A0-A7 direct, A8-A23 via two latches. Getting the XMM/video drivers working was... interesting.

    Nice! That tiny "baseboard" will be very popular. No, I have not found the 25th hour. Too bad sleep is not optional.

    Trying the VMMCog is on my (long) TODO list. I realized I can try it on Morpheus, by using both SPI Flash/SRAM sockets for SPI SRAMs :) so I don't need to lay out a PCB or solder two SPI rams to a Protoboard (or Propteus).

    I actually think it will work OK. I am guessing 50%+ the performance of parallel ram for programs whose "working set" fits within 16KB (i.e. 90%+ of the time the program only uses a 16KB subset of the full 64KB). Of course I could be wrong... I too learn by experience :)

  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-03 02:36
    @Bill: Once the operating model is understood, it is likely some fine tuning of the various ram blocks could be done. Of course, finding this is actually fairly simple at the expense of slowing it down while doing so. Just add a little hub table and increment each time a block is accessed.

    There are obviously only certain sections of the 64KB CPM space that would require "overlaying". From what I understand, the blocks would be from about $FF00 or $F000 downwards, leaving as much as possible from $0000 upwards always resident.
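
    The "little hub table" idea can be tried off-target first. Below is a minimal sketch in Python (not PASM), assuming one counter per 256-byte block of the 64KB space. The block size and all names are illustrative assumptions and are not taken from the ZiCog source.

```python
PAGE_SIZE = 256                  # assumed block size, matching the 256-byte pages discussed in this thread
NUM_PAGES = 65536 // PAGE_SIZE   # 256 blocks in the 64KB Z80 address space


def new_hit_table():
    """One access counter per 256-byte block."""
    return [0] * NUM_PAGES


def record_access(table, z80_addr):
    """Increment the counter of the block that contains z80_addr."""
    table[(z80_addr & 0xFFFF) // PAGE_SIZE] += 1


def hottest_pages(table, n):
    """Return up to n block numbers that were accessed, busiest first."""
    ranked = sorted(range(NUM_PAGES), key=lambda p: table[p], reverse=True)
    return [p for p in ranked[:n] if table[p] > 0]


table = new_hit_table()
for addr in (0x0005, 0x0100, 0x0101, 0x0102, 0xE600, 0xE601):
    record_access(table, addr)
print([hex(p) for p in hottest_pages(table, 3)])
```

    On real hardware the same counting would slow the emulator down while it runs, as noted above, so it is only a profiling aid.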

  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-03 05:00
    I was thinking of doing a classic LRU page replacement policy. (LRU = Least Recently Used)

    In a 64KB address space there are (obviously) 256 pages of 256 bytes.

    My paper design uses cog locations 0..255 for the LRU page table. When the cog is started, location 0 would contain a JMP #$100, which would do a MOV 0,#0 to clear the first page table entry.

    Bits 0-8 (the source field) of the page table entry would hold the hub location: the low 7 bits are the upper 7 bits (on prop 1) of the hub address where that page resides.

    The upper 23 bits (bits 9-31) would be the access counter, allowing counting up to 8M accesses.

    As there is only 32KB of hub ram on the current prop, bit 8 would be used as a 'dirty' bit (set whenever a write occurs to that page).

    If a page is not present in memory, the whole register is set to 0.

    When any count approaches 8M, all counts should be cut in half.
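
    As a sanity check of that layout, here is a small Python model of one page table long. It assumes the hub page number sits in bits 0-6, the dirty flag in bit 8, and the access count from bit 9 upwards, so that a count of one is 512 (the "512+hub page" value mentioned in the sample code comments). The helper names are made up for illustration. Clamping the halved count to 1 is an added assumption, so that a resident entry can never turn into the all-zero "not present" value.

```python
HUB_PAGE_MASK = 0x7F          # bits 0-6: hub page number (32KB hub / 256 bytes = 128 pages)
DIRTY_BIT = 1 << 8            # bit 8: page has been written since it was loaded
COUNT_SHIFT = 9               # access count starts at bit 9
COUNT_ONE = 1 << COUNT_SHIFT  # a count of one is 512


def make_entry(hub_page, count=1, dirty=False):
    """Pack a page table entry; a resident page always has count >= 1."""
    assert 0 <= hub_page <= HUB_PAGE_MASK and count >= 1
    return (count << COUNT_SHIFT) | (DIRTY_BIT if dirty else 0) | hub_page


def hub_page(entry):
    return entry & HUB_PAGE_MASK


def is_dirty(entry):
    return bool(entry & DIRTY_BIT)


def count(entry):
    return entry >> COUNT_SHIFT


def halve_counts(table):
    """Age all resident entries in place; zero (not present) entries stay zero."""
    for i, entry in enumerate(table):
        if entry:
            low_bits = entry & (COUNT_ONE - 1)       # hub page and dirty flag
            aged = max(count(entry) >> 1, 1)         # never age a count down to 0
            table[i] = (aged << COUNT_SHIFT) | low_bits


entry = make_entry(0x12)
print(entry, hub_page(entry), is_dirty(entry), count(entry))
```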

    Say you wanted to read $3F29 in the virtual memory address space.

    
    vmm_addr  long  0   ' the vmm address to be read
    vmm_data  long  0   ' data read / or data to be written
    vmm_tmp   long  0   ' temporary
    vmm_hub   long  0   ' hub address of the virtual byte
    
    ' read a byte - NOT optimized yet, off-the-cuff sample code, I am sure it can be optimized
    
    vmm_read
          mov     vmm_tmp, vmm_addr
          shr     vmm_tmp, #8             ' virtual page number = address / 256
          movd    vmm_check, vmm_tmp      ' patch the page table register into the test below
          movs    vmm_fix, vmm_tmp        ' ...and into the fetch of the entry
    vmm_check
          or      0, #0 wz                ' is the page table entry zero (page not resident)?
      if_z  jmp   #page_load
    vmm_fix
          mov     vmm_hub, 0              ' fetch the page table entry
          and     vmm_hub, #$7F           ' keep only the hub page number (drop dirty bit and count)
          shl     vmm_hub, #8             ' hub page number -> hub byte address
          and     vmm_addr, #$FF          ' offset within the page
          or      vmm_hub, vmm_addr
          rdbyte  vmm_data, vmm_hub       ' read the byte from the hub copy of the page
    vmm_read_ret  ret
    
    ' page_load finds the smallest non-zero entry in the first 256 registers, and saves that entry into vmm_tmp, and that register number in MIN
    ' if bit 8 is set, it needs to save that virtual page first
    ' set register whose address is saved in MIN to zero
    ' read 256 bytes into the HUB page pointed to by vmm_tmp bits 0..7 
    ' set the register for the newly loaded virtual page to 512+hub page. (access count of one, pointing at correct hub page)
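
    The read path and the page_load steps listed in the comments can be modelled on a PC before writing the PASM. This is a rough Python sketch, not the VMM cog itself: `backing` stands in for the SPI RAM, `hub` for the hub page buffer, and every class and method name is hypothetical. It evicts the resident page with the smallest access count, as the comments describe. Strictly speaking that is a least-frequently-used policy rather than true LRU.

```python
PAGE_SIZE = 256


class VmmModel:
    """Toy model of a paged 64KB address space backed by a slower store."""

    def __init__(self, backing, hub_pages):
        self.backing = backing                       # bytearray(65536), the "SPI RAM"
        self.hub = bytearray(hub_pages * PAGE_SIZE)  # resident page buffer
        self.free = list(range(hub_pages))           # hub page slots not yet used
        self.table = {}                              # virtual page -> [hub slot, count, dirty]
        self.misses = 0

    def _page_load(self, vpage):
        self.misses += 1
        if self.free:
            slot = self.free.pop(0)
        else:
            # evict the resident page with the smallest access count
            victim = min(self.table, key=lambda p: self.table[p][1])
            slot, _, dirty = self.table.pop(victim)
            if dirty:  # a modified page must be written back first
                self.backing[victim * PAGE_SIZE:(victim + 1) * PAGE_SIZE] = \
                    self.hub[slot * PAGE_SIZE:(slot + 1) * PAGE_SIZE]
        self.hub[slot * PAGE_SIZE:(slot + 1) * PAGE_SIZE] = \
            self.backing[vpage * PAGE_SIZE:(vpage + 1) * PAGE_SIZE]
        self.table[vpage] = [slot, 0, False]

    def _locate(self, addr):
        vpage, offset = (addr >> 8) & 0xFF, addr & 0xFF
        if vpage not in self.table:
            self._page_load(vpage)
        entry = self.table[vpage]
        entry[1] += 1                                # count this access
        return entry, entry[0] * PAGE_SIZE + offset

    def read(self, addr):
        _, hub_addr = self._locate(addr)
        return self.hub[hub_addr]

    def write(self, addr, value):
        entry, hub_addr = self._locate(addr)
        self.hub[hub_addr] = value & 0xFF
        entry[2] = True                              # mark the page dirty


backing = bytearray(65536)
backing[0x3F29] = 0xAB
vmm = VmmModel(backing, hub_pages=2)
print(hex(vmm.read(0x3F29)), "misses:", vmm.misses)
```

    Feeding a recorded address trace through `read` and `write` and looking at `misses` would give the hit ratio for a given number of hub pages.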
    
    
    



    It might be more efficient to store the page table entries as long {hubpage:7, dirty:1, count:24} because then a simple SHR by #25 would get you the hub page address, and SHL hub page by #8 and it has zeroed the low 8 bits, making it ready to or in the offset. Sorry, I have not spent any time optimizing it yet.

    With 64 pages (16KB of hub buffer) I'd expect well over 90% hit ratio (test code would actually be able to calculate this).

    When there is a hit, the unoptimized code above takes about 11 instructions and one hub access... call it 44+22 cycles worst case, less than 1us anyway.

    If we assume that the average ZiCog instruction emulation takes 2us for the instruction, and .5us for an unlatched byte read, the total instruction time would increase to 3us on page hits, and something much worse on misses. Say 256 bytes * 8 bits per byte * 100ns (assuming 10Mbps SPI read) = 204.8us + 2us for the instruction - let's call it 207us.

    At 2.5us per ZiCog instruction, 1M instructions would take 2.5s

    If 90% of the instructions hit, there would be 900,000 hits at 3us, and 100,000 misses at 207us, for a total of 2.7s + 20.7s = 23.4s - approximately 10.7% the speed of pure xmm ZiCog

    If 95% of the instructions hit, there would be 950,000 hits at 3us, and 50,000 misses at 207us, for a total of 2.85s + 10.35s = 13.2s - approx. 19% the speed of pure xmm ZiCog

    If the Z80 is like most mainframes, the hit rate would be more like 99%

    At 99%, there would be 990,000 hits at 3us, and 10,000 misses at 207us, for a total of 2.97s + 2.07s = 5.04s - approx. 49.6% of the speed of pure xmm ZiCog.
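
    The same back-of-envelope arithmetic can be put in a few lines of Python so other hit ratios and timings are easy to try. The 3us hit, 207us miss and 2.5us pure-XMM figures are the guesses from this post, not measurements, and the function name is made up.

```python
def vm_run_time_s(instructions, hit_ratio, hit_us=3.0, miss_us=207.0):
    """Total run time in seconds when each instruction is either a page hit or a page miss."""
    hits = instructions * hit_ratio
    misses = instructions - hits
    return (hits * hit_us + misses * miss_us) / 1e6


XMM_US = 2.5              # assumed average instruction time with direct parallel XMM
N = 1_000_000
xmm_s = N * XMM_US / 1e6

for ratio in (0.90, 0.95, 0.99):
    vm_s = vm_run_time_s(N, ratio)
    print(f"{ratio:.0%} hits: {vm_s:.2f}s, {xmm_s / vm_s:.1%} of XMM speed")
```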

    Of course if the average ZiCog instruction (with XMM) took 3.75us (50% more than above) the VM approach could reach 75% of XMM performance.

    Note that hits would take at most 0.8us, and that page reads at 20Mbps would take 103.5us.

    Optimizing the read code would have good effect.

    Using 20Mbps reads from the SPI flash would offer a dramatic improvement.

    My best guess?

    For "average" software, 50%+ of XMM speed should be attainable.

    This will be fun to test :)

    Does anyone have any idea how many ZiCog instructions are processed per second? The 2.5us average (including TriBlade unlatched read) was a WAG based on reading almost 400k instructions per sec earlier.
    Cluso99 said...
    @Bill: Once the operating model is understood, it is likely some fine tuning of the various ram blocks could be done. Of course, finding this is actually fairly simple at the expense of slowing it down while doing so. Just add a little hub table and increment each time a block is accessed.

    There are obviously only certain sections of the 64KB CPM space that would require "overlaying". From what I understand, the blocks would be from about $FF00 or $F000 downwards, leaving as much as possible from $0000 upwards always resident.

    That's the beauty of the LRU algorithm... initialization code, and infrequently used code would automatically be swapped out, and the most used code would always be resident automatically!


    Post Edited (Bill Henning) : 2/3/2010 5:12:06 AM GMT
  • heaterheater Posts: 3,370
    edited 2010-02-03 05:47
    Bill: Re: Your question about moving the Z80's memory access operations to another COG.

    Yes it has been considered. Basically it would work like the ZiCog's IN and OUT instructions work now.

    I do like the idea of decoupling chunks of software functionality wherever possible, from a software engineering point of view. Straight away it makes life much easier for those who want to port to different hardware, like mikediv wants to do for the Hydra. Looks like it saves a few LONGs in the Z80 COG as well.

    There are three reasons why I have not pursued that idea:

    1) Conservation of COG. I always looked at COGs as being few in number and precious. It seemed a waste to use a whole 32 bit CPU for just XMM access. Hence "ZiCog" a Z80 emulator in ONE Cog.

    2) Speed. I have yet to see how it can be done in a way that does not slow things down. This is perhaps not such a big issue. Given all the PASM that has to be executed per Z80 op the impact may not be so great. On the other hand I like Cluso's RamBlade attitude, "Everything for speed".

    3) Simplicity. At least for the early ZiCog versions there was only one XMM solution.

    Regarding banked memory for CP/M 3 and MP/M. We have code in place for bank switching the Z80 RAM space. It's very small, tight and fast. I don't see much room for improvement there.

    One problem you seem to have glossed over is the idea that the Cog handling the RAM can somehow do work in the background and hence recover the time lost in COG-COG communications. As far as I can tell this is not possible, or at least won't work as well as one might expect.

    Consider: It looks like the memory COG could be (pre)fetching the next Z80 opcode while the current Z80 instruction is executing.
    Problem: The current instruction does a data access to memory. Oops, it has to wait in the "data_fetch" until the memory COG gets around to it.
    Problem: When a Z80 jump, call or ret is made the prefetched op is now junk and a new op has to be fetched. This throws away the prefetch time saving. It also means the "code_fetch" path has to check if the requested address is already prefetched or not. It has to do this on every code fetch, this eats time. There are a lot of jumps in Z80 code.
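
    To put a number on how often the prefetch would be thrown away, the check can be modelled in Python and run over an address trace. This is only a sketch of the idea with a one-byte prefetch buffer. The class and its fields are hypothetical and nothing here comes from the ZiCog source.

```python
class PrefetchModel:
    """One-byte opcode prefetch with the 'is it already fetched?' check on every fetch."""

    def __init__(self, memory):
        self.memory = memory
        self.prefetched_addr = None    # address of the byte currently held, if any
        self.prefetched_byte = 0
        self.hits = 0
        self.misses = 0

    def _prefetch(self, addr):
        addr &= 0xFFFF
        self.prefetched_addr = addr
        self.prefetched_byte = self.memory[addr]

    def code_fetch(self, pc):
        pc &= 0xFFFF
        if pc == self.prefetched_addr:     # this comparison is paid on every code fetch
            self.hits += 1
            value = self.prefetched_byte
        else:                              # jump, call or ret: the prefetched byte is junk
            self.misses += 1
            value = self.memory[pc]
        self._prefetch(pc + 1)             # guess that execution continues linearly
        return value


memory = bytes(range(256)) * 256           # 64KB of dummy "code"
model = PrefetchModel(memory)
for pc in (0x0100, 0x0101, 0x0102, 0x0200, 0x0201):
    model.code_fetch(pc)
print("hits:", model.hits, "misses:", model.misses)
```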

    Now it could be that with all the "swings and roundabouts" we have going on here that a dedicated XMM Cog solution can be made that is faster than what we have now or at least breaks even. So, Bill, if you would like to experiment with it we would love to see what the results are :)

    P.S. I've softened up my stance on "wasting" a COG for XMM. As it is we've eaten up all the Prop pins for RAM and the HUB is pretty full so there is no point in saving COGs that have nowhere to work.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • Dr_AculaDr_Acula Posts: 5,484
    edited 2010-02-03 05:47
    Bill, that is a fascinating analysis of the speeds. There would definitely be speed increases for latched versions as you would read in blocks of data and hence most of the time would only be changing the low byte latch. Plus some other code optimisations would make it possibly twice as fast to access memory.

    In practice a typical program sitting at 100H is going to be almost always linear with local jumps so that code will be very efficient. There will also be bios calls (keyboard, display output) which will jump to locations in high ram, but these will be the same each time so those blocks will end up on the list fairly early on and then stay there.

    As a rough guide the dracblade runs the same as about a 3.5MHz Z80. Cluso's runs faster.

    I guess if sphinx does manage to save a whole lot of hub ram we can experiment with what to do with that. Video buffer ram for graphics? Faster speed? Or maybe the user can choose.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.smarthome.viviti.com/propeller
  • heaterheater Posts: 3,370
    edited 2010-02-03 06:30
    Bill: PLEASE, PLEASE can you find some time to implement your XMM driver COG with "code_fetch", "data_fetch" etc.

    I was writing that last post only minutes after waking up so, still tired, the following motivations had not occurred to me:

    1) We have a number of XMM options that are just inherently slow. Those that use complicated latching schemes and those that could be made using serial devices. In these cases any speed hit due to COG-COG communications is probably not going to notice much in the final result. We can still have the "all out for speed" Tri/RamBlade option in the ZiCog code wrapped up in #ifdefs so nothing lost.
    Using serial memory appeals to me, may be slow but I'd love to have some free pins such that ZiCog can do IN/OUT to them directly from Z80 code.

    2) If you add operations for reading and writing WORDs we get more speed back. ZiCog does a lot of WORD accesses.

    3) This can probably be done without wasting a COG. Just combine it with the TriBlade XMM block move driver or such.

    4) MoCog. The MoCog 6809 emulator PASM is getting huge. It will require two COGs. Hopefully only two. If both those COGs need access to XMM (likely) then your suggested XMM handler COG would a) Save duplicating access code in two COGs. b) Make life much easier, saves having two COGs fighting for those RAM pins.

    5) ZyCog. Yes, "ZyCog" not "ZiCog". See below.


    What the heck is ZyCog?

    For a long time now I've pondered two things:

    1) Is there a nice byte code, like Spin, that could be interpreted in one or two Cogs, like Z80, but more efficient and with much larger address space, for use with code in external memory?

    2) Is there such a byte code that exists already and has a nice compiler to go with it? C or whatever, so that we have a ready to run tool chain. Yes there is Java but "no thank you".

    Recently I found the answer, the ZPU processor core from ZyLin AS.

    Get this:

    1) The ZPU processor core is the smallest, in terms of logic blocks, 32 bit CPU.
    2) Its instructions are all byte wide, good for XMM.
    3) There are only a handful of instructions, can probably emulate the ZPU in one COG.
    4) There is a version of GCC that generates code for the ZPU.

    Yes, that's right, with ZPU emulation we can use the GCC compiler for the Propeller and have huge programs in external RAM.

    Hence my new project "ZyCog" the ZyLin ZPU processor in a COG.

    ZyCog is as yet unannounced and has no Prop code. It's just an idea so don't tell anyone :)
    At least I got as far as getting the GCC to generate ZPU code to experiment with.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.

    Post Edited (heater) : 2/3/2010 6:44:59 AM GMT
  • heaterheater Posts: 3,370
    edited 2010-02-03 06:40
    Bill: "Does anyone have any idea how many ZiCog instructions are processed per second?"

    A long time ago this was measured with a frequency counter whilst the Z80 was executing its op code test program. The results were published here somewhere, but I have no idea where and no memory of the numbers. Fewer than 1 million but more than 500,000 per second.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • Cluso99Cluso99 Posts: 18,069
    edited 2010-02-03 07:26
    A few comments on the above...

    For ZiCog, the XMM cog should have 2 separate rendezvous locations, one for instruction fetch and one for data fetch. That way the prefetch doesn't get flushed every time a data fetch occurs. Also, it may as well prefetch words, or even longs.

    ZyCog - love the idea :)

  • heaterheater Posts: 3,370
    edited 2010-02-03 08:19
    Cluso: Bill already has code and data separated, see the DracBlade thread. Like so:

    codefetch  long  0
    coder      long  0
    
    dataread   long  0
    datar      long  0
    
    datawrite  long  0
    dataw      long  0
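
    How the two cogs would use those six longs is not spelled out here, so the following Python model is a guess at one possible protocol. It assumes the first long of each pair carries the Z80 address plus a "request pending" flag in bit 31, that the second long carries the byte, and that the memory cog clears the request long when done. The flag, the polling order and all names other than the six longs are assumptions.

```python
PENDING = 1 << 31   # hypothetical "request outstanding" flag in the address long


class Mailbox:
    """The six hub longs, modelled as plain attributes."""

    def __init__(self):
        self.codefetch = 0
        self.coder = 0
        self.dataread = 0
        self.datar = 0
        self.datawrite = 0
        self.dataw = 0


def service_once(mbox, ram):
    """One pass of the memory cog's polling loop; returns how many requests it served."""
    served = 0
    if mbox.datawrite & PENDING:
        ram[mbox.datawrite & 0xFFFF] = mbox.dataw & 0xFF
        mbox.datawrite = 0               # clearing the long signals completion
        served += 1
    if mbox.dataread & PENDING:
        mbox.datar = ram[mbox.dataread & 0xFFFF]
        mbox.dataread = 0
        served += 1
    if mbox.codefetch & PENDING:
        mbox.coder = ram[mbox.codefetch & 0xFFFF]
        mbox.codefetch = 0
        served += 1
    return served


ram = bytearray(65536)
ram[0x0100] = 0xC3
mbox = Mailbox()
mbox.codefetch = PENDING | 0x0100        # the Z80 cog posts an opcode fetch
service_once(mbox, ram)
print(hex(mbox.coder))
```

    Because the code and data requests use separate longs, a data access does not disturb a pending or prefetched code fetch in this model.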
    
    



    WORD access is suggested above. It's surprising how much WORD access goes on in an 8 bit CPU. All those jumps, calls and rets need WORD access. Then there's the loading, storing, PUSHing and POPing, of 16 bit registers.

    Looks like what we are about to design is the world's first 8 bit processor with a 16 bit data bus!

    More on ZyCog later....

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.