TAQOZ - Tachyon Forth for the P2 BOOT ROM


Comments

  • Peter Jakacki Posts: 7,834
    edited November 2
    IIRC I looked at the timing at the time on the scope and spoke to Ray about how slow it was. He was a bit surprised until we realised that hubexec code running in a loop that inserts waits and also tries to send and receive in the same loop will run slow. Whereas my routines run from cog, use rep, no waits, and are optimised either for send or receive since SD card data is essentially half-duplex.

    NOTE: I always use a scope or LA when dealing with I/O and I consider the use of this essential to doing any kind of work involving I/O. If I had enough boot ROM I could build the equivalent of SPLAT except it would also output to VGA for capturing I/O timing.

    Tachyon Forth - compact, fast, forthwright and interactive
    useforthlogo-s.png
    --->CLICK THE LOGO for more links<---
    Latest binary V5.4 includes EASYFILE +++++ Tachyon Forth News Blog
    P2 SHORTFORM DATASHEET +++++ TAQOZ documentation
    Brisbane, Australia
  • Cluso99 Posts: 14,161
    edited November 3
    It's also very specific to the SD initialisation time.
    If FLASH is present by CS pull-up, FLASH will be tried first, which further delays SD boot.

    It's fair to say, SD was not designed to be fast, as it's not really practical. If you want fast, use FLASH first.

    The design decisions for my routines were:
    * To be callable from a user's program, so minimal COG/LUT footprint
    * To boot from any SD, so minimal risk code (slow clocking to meet original spec)

    The code to initialise the SD and read the MBR/VOL and/or a small FAT32 file as a two-stage boot process will not run much faster.
    The two-stage process can use the crystal (180+MHz), so it can then boot the second data/file considerably faster than is possible with the rcosc at ~22MHz. This will give a superior boot. The ROM can never know which xtal option to use to boot faster.

    Note that the FLASH will always be faster at booting than SD. This could be used as a first stage boot.

    Peter,
    I am curious. You have the FLASH on my P2D2 with code installed, yet it does not boot from there: if I have an SD card installed, it boots from the SD instead.
    Do you know why, as the FLASH, if present, should boot first? Is Chip looking for a signature in FLASH?
    Just checked the boot code... yes, a checksum is done on the first $400 bytes of FLASH, and if it is not valid the boot code will try SD or Serial.

    My Prop boards: P8XBlade2, RamBlade, CpuBlade, TriBlade
    Prop OS (also see Sphinx, PropDos, PropCmd, Spinix)
    Website: www.clusos.com
    Prop Tools (Index) , Emulators (Index) , ZiCog (Z80)
  • Ray, the Flash is being used as a backup and restore for a TAQOZ image, but it requires a separate secondary boot loader written to the Flash, which I would have to include somewhere. I've been thinking about how you envisage your SD routines being used, but I don't think anyone will use them because they are too slow even at 200MHz. If that is the case, why not just optimize the routines for SD booting, because that IS the PRIMARY purpose of the boot ROM, and run them in a cog?

  • Peter Jakacki wrote: »
    Ray, the Flash is being used as a backup and restore for a TAQOZ image, but it requires a separate secondary boot loader written to the Flash, which I would have to include somewhere. I've been thinking about how you envisage your SD routines being used, but I don't think anyone will use them because they are too slow even at 200MHz. If that is the case, why not just optimize the routines for SD booting, because that IS the PRIMARY purpose of the boot ROM, and run them in a cog?

    I totally agree, Ray. It's all about booting. Tool systems are going to totally rewrite the upper RAM, anyway. That 6s needs to come down to as little as possible. If need be, let's get your code running in another cog this time, like we had talked about earlier. If it can run fast in hub, that's ideal, but if not, run it in a cog.
  • Chip or Peter,
    Can you time this on a CRO please...

    Compile with pnut, save the xxx.obj as file "_BOOT_P2.BIX" on an SD FAT32.
    Boot P2, timing from the end of the reset pulse to the start of the high pulse on P0.

    Let me know if you have FLASH installed and the R4 10K pull-up on Flash/CS P61 (this means there is a delay while trying to read FLASH first)
    CON
      _greenled     = 0                                     ' 1=ON
    DAT
                    orgh    0
                    org     0
    entry
                    drvh    #_greenled                      ' 1=ON
    .loop           drvnot  #_greenled                      ' GREEN
                    waitx   delay
                    jmp     #.loop
    
    delay           long    $2_000000                       ' 2*16MHz
    

    Meanwhile I will think about if there is another easier way to test this.
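
For anyone reading the scope trace, it helps to know how long each LED state lasts once the test program is running. A rough Python check of the `delay` constant from the listing, assuming the RC oscillator runs somewhere between 20MHz and the ~22MHz mentioned earlier in the thread (the function name `waitx_seconds` is mine, and the small fixed overhead of the instructions is ignored):

```python
DELAY_CLOCKS = 0x2_000000  # the `delay` long from the listing: 2 * 2**24 clocks


def waitx_seconds(clocks: int, sysclk_hz: float) -> float:
    """Approximate duration of a WAITX of `clocks` system clocks."""
    return clocks / sysclk_hz


# Assumed RC oscillator frequencies; the real value varies from chip to chip.
for hz in (20e6, 22e6):
    print(f"{hz / 1e6:.0f} MHz: {waitx_seconds(DELAY_CLOCKS, hz):.2f} s per LED state")
```

So after the first rising edge on P0, the pin should toggle roughly every 1.5 to 1.7 seconds.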
  • Cluso99 wrote: »
    Chip or Peter,
    Can you time this on a CRO please...

    Compile with pnut, save the xxx.obj as file "_BOOT_P2.BIX" on an SD FAT32.
    Boot P2,timing the end of the reset pulse to the start of the high pulse on P0.

    Let me know if you have FLASH installed and the R4 10K pull-up on Flash/CS P61 (this means there is a delay while trying to read FLASH first)
    CON
      _greenled     = 0                                     ' 1=ON
    DAT
                    orgh    0
                    org     0
    entry
                    drvh    #_greenled                      ' 1=ON
    .loop           drvnot  #_greenled                      ' GREEN
                    waitx   delay
                    jmp     #.loop
    
    delay           long    $2_000000                       ' 2*16MHz
    

    Meanwhile I will think about if there is another easier way to test this.

    You need a cheap Chinese oscilloscope. If you buy yourself one, I'll reimburse you somehow.
  • Cluso99 Posts: 14,161
    edited November 3
    cgracey wrote: »
    You need a cheap Chinese oscilloscope. If you buy yourself one, I'll reimburse you somehow.

    I have been thinking how to time the code using the P2 CNT counter.
    After some time going up the wrong garden path, I now think I have a better way - just coding it now :smile:
  • Cluso99 wrote: »
    I have been thinking how to time the code using the P2 CNT counter.
    After some time going up the wrong garden path, I now think I have a better way - just coding it now :smile:

    Try to achieve whatever is possible using the 20MHz+ RC oscillator.

    Do you think 1s is reasonable?
  • Cluso99, let booting be your only objective. That's what people are going to use.
  • rogloh Posts: 853
    edited November 3
    FWIW a 20MHz AVR MCU can hit reasonable transfer speeds of over 1MB/s with FATFS with large 4k blocks in SPI mode. Admittedly this does not take into account initial startup latency so it might not be a fair comparison, but here it is regardless. I would hope that once initialized we could achieve somewhat similar transfer rates on a P2. I guess as of now P2 is about an order of magnitude slower if it takes 6s to boot 512kB even if the card init/boot timeout is about 1s or thereabouts (though not sure what it might be). The AVR has the luxury of a hardware shift register, whereas we would need to bitbang but there is the pin streamer too and smartpins if they can be leveraged here.

    http://elm-chan.org/fsw/ff/res/rwtest1.png

    http://elm-chan.org/fsw/ff/00index_e.html
  • rogloh wrote: »
    FWIW a 20MHz AVR MCU can hit reasonable transfer speeds of over 1MB/s with FATFS with large 4k blocks in SPI mode. Admittedly this does not take into account initial startup latency so it might not be a fair comparison, but here it is regardless. I would hope that once initialized we could achieve somewhat similar transfer rates on a P2. I guess as of now P2 is about an order of magnitude slower if it takes 6s to boot 512kB even if the card init/boot timeout is about 1s or thereabouts (though not sure what it might be). The AVR has the luxury of a hardware shift register, whereas we would need to bitbang but there is the pin streamer too and smartpins if they can be leveraged here.

    http://elm-chan.org/fsw/ff/res/rwtest1.png

    http://elm-chan.org/fsw/ff/00index_e.html

    Even with bit-banging the P2 can do around 200kB/s at 20MHz and what with smart pins we should be able to do much better. Once the clock is running then I am getting over 3MB/s rates. The slow boot is more to do with the fact that Ray was trying to make it general purpose and safe but we know the speed it's running at so all we need it to do is boot as fast as it possibly can (and safely). Plenty of testers available soon, and hopefully they will help out.

    I will try out the synchronous modes on the smart pins next to see if we can't do a lot better at 20MHz.
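
To put the 200kB/s bit-banged figure in perspective, here is a small Python sketch of the clock budget it implies at 20MHz. The helper name is mine, and it assumes one data bit per SPI clock with no protocol overhead:

```python
def clocks_per_byte(sysclk_hz: float, bytes_per_second: float) -> float:
    """System clocks available for each byte at a given transfer rate."""
    return sysclk_hz / bytes_per_second


cpb = clocks_per_byte(20e6, 200e3)
print(f"{cpb:.0f} clocks per byte, {cpb / 8:.1f} clocks per bit")
```

That is about 100 clocks per byte, or 12.5 clocks per bit, which gives a feel for how tight the inner loop has to be at the RC oscillator speed.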




  • Peter Jakacki wrote: »
    Plenty of testers available soon, and hopefully they will help out.

    Can't wait to help out . : ]
  • rogloh Posts: 853
    edited November 3
    Good news is it seems we all still have an opportunity to improve things from here, and if more ROM space does happen to eventuate in time, who knows what other wonderful features the P2 may end up with. The whole HDMI feature is looking pretty good now; even the bitbang type has good potential. With that and USB it may ultimately be possible to be truly self-hosting from reset at some point. Take your P2, point it at an HDTV, plug in a USB keyboard, hit reset, and you are away, ready to code or debug some hardware on your board, with no PC or internet required to first obtain flash/boot images etc. And perhaps some nice VGA-style IDE could be crafted in, but maybe that's just getting too ambitious. I know things like this should all be doable with boot flash, serial terminals, SD cards etc., but maybe there are still some useful possibilities even without those extra requirements.

    Edit: I'd forgotten the frequency detection thing. Boot code may not know how to set up the PLL the right way to achieve video frequencies (e.g. 250MHz), which could throw a spanner in the works for the self-hosting ideas above unless it can be made to sense the frequency itself somehow, or unless some new rule requires any self-hosting mode to run at a known frequency. Reality always gets more complicated, as usual.
  • Here are some SD boot times for loading a single-sector FAT32 file. The time is measured from SD code start (i.e. after Chip's boot section has run and passed control to my routine) until the SD has been initialised, the FAT32 file has been located, and its first sector (it's a one-sector file) has been read into hub $00000.
    From a power boot (when the SD card takes the longest to initialise):
    ~7,285,151 clocks (presuming 20MHz -> 330ms)
    From reset with previous SD initialisation (i.e. warm boot):
    ~2,406,791 clocks (presuming 20MHz -> 120ms)

    A 512B FAT32 file at RCOSC assumed 20MHz loads in 0.330s and 0.120s (from above)
    A 496KB FAT32 file at RCOSC assumed 20MHz loads in 10.985s and 10.729s
    A 496KB FAT32 file at 96MHz loads in 3.064s and 2.842s
    A 496KB FAT32 file at 250MHz loads in 1.638s and 1.591s

    I have tried running the code in COG but it's not working yet.
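
As a cross-check on the figures above, the following Python sketch converts the quoted clock counts to milliseconds and works out the overall load rate for the 496KB file. It assumes a nominal 20MHz RC oscillator as in the post; the helper names are mine. At exactly 20MHz the cold-boot count comes out nearer 364ms than 330ms, while 330ms would correspond to roughly 22MHz, in line with the "~22MHz" rcosc figure mentioned earlier in the thread, so treat the absolute times as approximate.

```python
RCOSC_HZ = 20_000_000  # assumed nominal value; the real RC oscillator varies


def clocks_to_ms(clocks: int, hz: int = RCOSC_HZ) -> float:
    """Convert a system-clock count to milliseconds at the given frequency."""
    return clocks / hz * 1000.0


def overall_rate_kb_s(size_kb: float, total_s: float) -> float:
    """Average load rate including SD init and file lookup time."""
    return size_kb / total_s


print(f"cold: {clocks_to_ms(7_285_151):.1f} ms, warm: {clocks_to_ms(2_406_791):.1f} ms")
print(f"clock implied by 330 ms: {7_285_151 / 0.330 / 1e6:.1f} MHz")

# Cold-boot totals for the 496KB file, as quoted in the post.
for label, total_s in (("RCOSC", 10.985), ("96MHz", 3.064), ("250MHz", 1.638)):
    print(f"{label}: {overall_rate_kb_s(496, total_s):.0f} KB/s overall")
```

The overall rate at RCOSC works out to about 45KB/s, rising to roughly 300KB/s at 250MHz.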
  • cgracey Posts: 10,128
    edited November 3
    Cluso99 wrote: »
    Here are some SD boot times loading a single sector FAT32 file with the time from SD code start (ie after Chip's boot section has run and passes control to my routine) until the SD has been initialised and the FAT32 file has been located and it's first sector (it's a 1 sector length file) has been read into hub $00000. From a power boot (takes the SD card the longest time to initialise)
    ~7,285,151 clocks (presuming 20MHz -> 330ms)
    From reset with previous SD initialisation (ie warm boot)
    ~2,406,791 clocks (presuming 20MHz -> 120ms)

    A 512B FAT32 file at RCOSC assumed 20MHz loads in 0.330s and 0.120s (from above)
    A 496KB FAT32 file at RCOSC assumed 20MHz loads in 10.985s and 10.729s
    A 496KB FAT32 file at 96MHz loads in 3.064s and 2.842s
    A 496KB FAT32 file at 250MHz loads in 1.638s and 1.591s

    I have tried running the code in COG but it's not working yet.

    It seems to me that you would only want to load a loader program which could turn on the crystal and do the big load.

    Is there a way to specify the length of the load to be done by your booter?
  • Yes. It loads the file's length, up to a max of 496KB, so you cannot overwrite the 16KB loaded from the ROM. A 2-stage loader is the expected concept.
  • Cluso99 wrote: »
    I have been thinking how to time the code using the P2 CNT counter.
    You could just load Bean's Reciprocal Frequency Counter into a P1, and use that ?

  • cgracey wrote: »
    It seems to me that you would only want to load a loader program which could turn on the crystal and do the big load.
    The crystal value is somewhat unknown at boot time ?
    The values above are approx. 50kB/s loading, so modest sizes of code will load fine, i.e. a 1s boot is possible, just not a full-image load.
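
A rough Python sketch of that budget, assuming the ~50kB/s rate and the 330ms / 120ms init times quoted in the thread (the function name and the integer-millisecond arithmetic are my own choices):

```python
def bytes_loadable(budget_ms: int, init_ms: int, rate_bytes_per_s: int) -> int:
    """Bytes that fit in the boot budget once SD init time is subtracted."""
    remaining_ms = max(0, budget_ms - init_ms)
    return remaining_ms * rate_bytes_per_s // 1000


RATE = 50_000  # approximate load rate at RCOSC, from the figures above

print("cold boot:", bytes_loadable(1000, 330, RATE), "bytes in 1 s")
print("warm boot:", bytes_loadable(1000, 120, RATE), "bytes in 1 s")
```

Under those assumptions a 1s boot leaves room for roughly 33KB from power-up, or about 44KB from a warm reset.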

  • rogloh wrote: »
    Edit: I'd forgotten the frequency detection thing. Boot code may not know how to setup the PLL the right way to achieve video frequencies (eg. 250MHz) which could throw a spanner in the works for self hosting ideas above unless it can be made to sense itself somehow, or having some new rule requiring any new self hosting mode to run at some known frequency. Reality always gets more complicated, as usual.
    Yes, P2 lacks any means to check presence or value of external clock/xtal, so boot ROM has to be 20MHz RCFAST focused.
  • rogloh wrote: »
    I guess as of now P2 is about an order of magnitude slower if it takes 6s to boot 512kB even if the card init/boot timeout is about 1s or thereabouts (though not sure what it might be).
    From above, a tiny load of 512B takes 330ms or 120ms for System-POR or P2-(re)reset respectively.
    rogloh wrote: »
    The AVR has the luxury of a hardware shift register, whereas we would need to bitbang but there is the pin streamer too and smartpins if they can be leveraged here.

    Chip uses the Smart pins for UART Boot mode.

    On the topic of speed, can the RCFAST feed the PLL/VCO ? if yes, then that opens another speed boost, where it could select ~100MHz perhaps under some user choice ?

  • No, the RC fast cannot feed the PLL.
  • With boot pins, are there any without pull-ups and pull-downs such that a voltage divider could be constructed which when booting is read as an analogue value by smart-pin with ranges indicating *safe* alternate speeds to switch to. For development boards the positive half of the divider could be connected +ve to 3 x current limit resistors to an RGB led with a trim-pot below it to set the 'speed range' for the divider, the low side terminal is the pin and there is a pull-down resistor to -ve. Different operating voltages or the RGB led give a visual indication of the range selected, so Parallax's customers will not need a multimeter (think education, introduction to a hobby) to select the correct voltage if there is a problem with the SD card's speed. The pull-up and pull-down resistors should be such that any digital signal on the pin is unaffected.

    If this is a workable solution, provide a jumper or small pcb switch to disconnect the leds connection to the pin and bottom half of the voltage divider for people who use battery power. The leds are only there as a visual indication to set the voltage range which selects switching to a *safe* speed.
  • 78rpm wrote: »
    With boot pins, are there any without pull-ups and pull-downs such that a voltage divider could be constructed which when booting is read as an analogue value by smart-pin with ranges indicating *safe* alternate speeds to switch to. For development boards the positive half of the divider could be connected +ve to 3 x current limit resistors to an RGB led with a trim-pot below it to set the 'speed range' for the divider, the low side terminal is the pin and there is a pull-down resistor to -ve. Different operating voltages or the RGB led give a visual indication of the range selected, so Parallax's customers will not need a multimeter (think education, introduction to a hobby) to select the correct voltage if there is a problem with the SD card's speed. The pull-up and pull-down resistors should be such that any digital signal on the pin is unaffected.

    If this is a workable solution, provide a jumper or small pcb switch to disconnect the leds connection to the pin and bottom half of the voltage divider for people who use battery power. The leds are only there as a visual indication to set the voltage range which selects switching to a *safe* speed.

    Interesting idea. Some i2c parts use such 'trinary levels' for address setting.

    Even existing pins could be used for this, the digital decision can remain, but a finer detail could be extracted from the exact voltage.
    Some parts may have external loads, (eg I think SD cards have CE pullups already) so that may complicate some pins usage ?
  • rogloh wrote: »
    FWIW a 20MHz AVR MCU can hit reasonable transfer speeds of over 1MB/s with FATFS with large 4k blocks in SPI mode.
    http://elm-chan.org/fsw/ff/res/rwtest1.png

    Hmm, do they mean bytes or bits here ?
    1M+ is plausible, but they also claim 2M+ and that's not possible as bytes over the 10Mb/s link.
    Likewise in their benchmark 2, they claim 18Mb/s link, but 7561kB/s - which must actually mean bits/s ?

  • @jmg, good point, though according to the footnote in benchmark 1, the CF modes were using GPIO (probably parallel) so they could potentially be higher than 1MB/s. The 20MHz AVR USART-SPI is limited to 10MHz which can do ~1.2 MB/s best case, which was not exceeded.

    Benchmark 2 - not sure, different processor, ARM I think. They do show the bus speed limits here at 9000kB/s for a 72MHz processor which may be possible, if their kB = 1000 not 1024 anyway and that processor supports SPI rates that high?

    I do think they are talking MB/s not Mb/s
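
The bus ceilings being discussed can be checked with a few lines of Python. This is only a sketch: it assumes one bit per SPI clock and no command or token overhead, and the function name is mine.

```python
def spi_ceiling_bytes_per_s(sck_hz: int) -> float:
    """Upper bound on SPI payload rate: one bit per clock, eight bits per byte."""
    return sck_hz / 8


for sck_hz in (10_000_000, 72_000_000):
    rate = spi_ceiling_bytes_per_s(sck_hz)
    print(f"{sck_hz / 1e6:.0f} MHz SCK: {rate / 1000:.0f} kB/s ({rate / 2**20:.2f} MiB/s)")
```

A 10MHz clock tops out at 1250kB/s (about 1.19MiB/s), which fits the ~1.2MB/s figure, and a 72MHz clock gives 9000kB/s when kB means 1000 bytes.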
  • Peter Jakacki Posts: 7,834
    edited November 4
    These secondary bootloaders annoy me in that they are an extra layer that needs to be added when all we need to know at boot is what it has for a clock. So I propose that both the Flash and the SD have configuration parameters that are set by the user so that when they save a file they can also describe their system that the boot loader can use. The loader wants to know if there is an oscillator or a crystal, the frequency of that crystal, the desired runtime frequency after booting, and perhaps the baud rate of any terminal that may be connected. The SD config file can either be written in the MBR or as a file _CONFIG_.TXT and the serial Flash could use from address 0 as its "MBR".

    That way the SD can boot all 512kB if it needs to within about 200ms or less especially since the config may override any resistor settings perhaps.
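
To make the proposal concrete, here is a purely hypothetical Python sketch of what parsing such a config could look like. Nothing about the format has been decided in this thread: the `key = value` layout, the key names (`source`, `xtal_hz`, `run_hz`, `baud`) and the `BootConfig` class are all assumptions for illustration, covering only the four parameters listed above.

```python
from dataclasses import dataclass


@dataclass
class BootConfig:
    source: str   # "xtal" or "osc" (hypothetical values)
    xtal_hz: int  # frequency of the crystal or external oscillator
    run_hz: int   # desired runtime frequency after booting
    baud: int     # terminal baud rate, if a terminal is connected


def parse_boot_config(text: str) -> BootConfig:
    """Parse hypothetical 'key = value' lines; '#' starts a comment."""
    fields = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip().lower()] = value.strip()
    return BootConfig(
        source=fields["source"],
        xtal_hz=int(fields["xtal_hz"]),
        run_hz=int(fields["run_hz"]),
        baud=int(fields["baud"]),
    )


EXAMPLE = """
# hypothetical _CONFIG_.TXT contents
source  = osc
xtal_hz = 12000000
run_hz  = 180000000
baud    = 115200
"""

print(parse_boot_config(EXAMPLE))
```

For scale, loading all 512kB in 200ms needs about 2.6MB/s, which is below the 3MB/s+ rates reported earlier in the thread once the clock is running.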

    BTW, I have PS/2 keyboards working very nicely in conjunction with VGA! All the meta keys are working too.

  • Nice work Peter.

    I agree. Perhaps a fallback, should those be wrong or missing.

    It would be super nice to put them in the clear, at a simple offset from the start of the image, too. People can edit a binary if needed, or a tool, boot loader menu, or OS can do that.

    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: https://forums.parallax.com/discussion/123709/commented-graphics-demo-spin
  • Peter Jakacki wrote: »
    BTW, I have PS/2 keyboards working very nicely in conjunction with VGA! All the meta keys are working too.

    Very cool, bring on that ROM size upgrade! Won't take you long and we'll have HDMI + USB soon enough and boot right into something like VGA IDE.
    :smile:
  • jmg wrote: »
    Cluso99 wrote: »
    I have been thinking how to time the code using the P2 CNT counter.
    You could just load Bean's Reciprocal Frequency Counter into a P1, and use that ?

    Yes, I did think of getting my P1 out of the closet. I just wanted to use the P2's CNT register for that tho.
  • We do not know the crystal frequency.
    We don't know if it's an oscillator.
    We don't know the desired run frequency.

    Here is my code for HUBSET...
    _XTALFREQ       = 12_000_000                                    ' crystal frequency
    _XDIV           = 12                                            ' crystal divider to give 1MHz
    _XMUL           = 180                                           ' crystal / div * mul to give _CLOCKFREQ
    _CLOCKFREQ      = _XTALFREQ / _XDIV * _XMUL                     ' %0000_xxxE_DDDDDD_MMMMMMMMMM_PPPP_CC_SS  ' set clock generator mode
    _SETFREQ        = $0100_00F4 + (_XDIV-1)<<18 + (_XMUL-1)<<8     ' %0000_0001_dddddd_mmmmmmmmmm_1111_01_00  ' ena-xtal+PLL,div=12,mul=180,p=15,0pF,20MHz+
    _ENAFREQ        = _SETFREQ + 3                                  ' %0000_0001_dddddd_mmmmmmmmmm_1111_01_11  ' enable xxMHZ oscillator
    .....
    ''-------[ Set Xtal ]---------------------------------------------------------- 
                    hubset  #0                              ' set 20MHz+ mode
                    hubset  ##_SETFREQ                      ' setup oscillator
                    waitx   ##20_000_000/100                ' ~10ms
                    hubset  ##_ENAFREQ                      ' enable oscillator
    '+-----------------------------------------------------------------------------+
    

    Here, I am using a 12MHz oscillator. Also, I have used PPPP=15, which is another divider. Neither of these options is coded as a CON yet.
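
For anyone wanting to check the constants by hand, this Python sketch recomputes them and decodes the fields using the bit layout given in the comments (`%0000_xxxE_DDDDDD_MMMMMMMMMM_PPPP_CC_SS`). The `fields` helper is mine. One thing to watch when porting the expression: in Python `+` binds tighter than `<<`, so the shifts need explicit parentheses.

```python
XTALFREQ = 12_000_000
XDIV = 12
XMUL = 180

CLOCKFREQ = XTALFREQ // XDIV * XMUL

# Parenthesise the shifts: Python evaluates + before <<.
SETFREQ = 0x0100_00F4 + ((XDIV - 1) << 18) + ((XMUL - 1) << 8)
ENAFREQ = SETFREQ + 3


def fields(value: int) -> dict:
    """Split a HUBSET clock value into the fields named in the listing."""
    return {
        "E": (value >> 24) & 0x1,
        "D": (value >> 18) & 0x3F,
        "M": (value >> 8) & 0x3FF,
        "P": (value >> 4) & 0xF,
        "CC": (value >> 2) & 0x3,
        "SS": value & 0x3,
    }


print(f"CLOCKFREQ = {CLOCKFREQ:,} Hz")
print(f"SETFREQ   = ${SETFREQ:08X}  {fields(SETFREQ)}")
print(f"ENAFREQ   = ${ENAFREQ:08X}  {fields(ENAFREQ)}")
```

With the values above this gives $012CB3F4 for the setup value and $012CB3F7 for the enable value, with D=11 and M=179 (each one less than the divider and multiplier) and a target of 180MHz.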

    We cannot code for all these variances in the Boot ROM. The only way we can use these is to configure the first stage FLASH/SD bootloader.

    Now, in my P1 OS, I can tell which hardware I am using, and therefore know which XTAL and clockfreq I am using. Therefore I can have one SD card which I can use in all my various boards.
    I don't expect this luxury in P2 unless we all can agree with a standard xtal or oscillator. Any takers? Start a new thread if you think we could get consensus - I am not holding my breath :(