TAQOZ - Tachyon Forth for the P2 BOOT ROM

Peter Jakacki · 2018-11-02 22:17

IIRC I looked at the timing on the scope back then and spoke to Ray about how slow it was. He was a bit surprised until we realised that hubexec code running in a loop that inserts waits and also tries to send and receive in the same loop will run slowly, whereas my routines run from cog, use rep, have no waits, and are optimised for either send or receive, since SD card data is essentially half-duplex.

NOTE: I always use a scope or LA when dealing with I/O, and I consider one essential for any kind of work involving I/O. If I had enough boot ROM I could build the equivalent of SPLAT, except it would also output to VGA for capturing I/O timing.

Cluso99 · 2018-11-03 00:12

It's also very specific to the SD initialisation time.
If FLASH is present by CS pull-up, FLASH will be tried first, which further delays SD boot.

It's fair to say, SD was not designed to be fast, as it's not really practical. If you want fast, use FLASH first.

The design decisions for my routines were:
* To be callable from a user's program, so minimal COG/LUT footprint
* To boot from any SD, so minimal risk code (slow clocking to meet original spec)

The code to initialise the SD and read the MBR/VOL and/or a small FAT32 file as a two-stage boot process will not run much faster.
The two-stage process can use the crystal (180+MHz), so it can then boot the second data/file considerably faster than is possible with the rcosc at ~22MHz. This will give a superior boot. The ROM can never know which xtal option is fitted, so it cannot use the crystal to boot faster by itself.

Note that the FLASH will always be faster at booting than SD. This could be used as a first stage boot.

Peter,
I am curious. You have the FLASH on my P2D2 with code installed, yet it does not boot from it: if I have an SD card installed, it boots from the SD instead.
Do you know why? The FLASH, if present, should boot first. Is Chip looking for a signature in FLASH?
Just checked the boot code... yes, a checksum is done on the first $400 bytes of FLASH, and if it is not valid the boot code will try SD or Serial.
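
As a rough illustration of the kind of signature check described above, the sketch below sums the first $400 bytes as 32-bit little-endian longs and compares the result with an expected value. The summing rule, the function names and the `expected` parameter are assumptions made for illustration; the thread only states that a checksum is done over the first $400 bytes of FLASH.

```python
import struct

BOOT_BLOCK_SIZE = 0x400  # first $400 bytes of FLASH, as stated in the thread

def sum_longs(block: bytes) -> int:
    """Sum a block as little-endian 32-bit longs, wrapping at 32 bits."""
    if len(block) % 4 != 0:
        raise ValueError("block length must be a multiple of 4")
    longs = struct.unpack("<%dI" % (len(block) // 4), block)
    return sum(longs) & 0xFFFFFFFF

def flash_image_valid(flash: bytes, expected: int) -> bool:
    """Hypothetical validity test: boot-block checksum equals `expected`."""
    if len(flash) < BOOT_BLOCK_SIZE:
        return False
    return sum_longs(flash[:BOOT_BLOCK_SIZE]) == expected

# Erased NOR flash reads back as $FF bytes, so a blank chip gives a fixed sum.
erased = b"\xFF" * BOOT_BLOCK_SIZE
print(hex(sum_longs(erased)))
```

With a check of this kind, a FLASH that holds data but no valid boot block fails the test, and the boot code falls through to SD or Serial, which matches the behaviour reported above.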

Peter Jakacki · 2018-11-03 00:31

Ray, the Flash is being used as a backup and restore for a TAQOZ image, but that requires a separate secondary boot loader written to the Flash, which I would have to include somewhere. I've been thinking about how you envisage your SD routines being used, but I don't think anyone will use them because they are too slow even at 200MHz. If that is the case, why don't you just optimize the routines for SD booting, because that IS the PRIMARY purpose of the boot ROM, and run them in a cog?

cgracey · 2018-11-03 01:12

Peter Jakacki wrote: »

Ray, the Flash is being used as a backup and restore for a TAQOZ image, but that requires a separate secondary boot loader written to the Flash, which I would have to include somewhere. I've been thinking about how you envisage your SD routines being used, but I don't think anyone will use them because they are too slow even at 200MHz. If that is the case, why don't you just optimize the routines for SD booting, because that IS the PRIMARY purpose of the boot ROM, and run them in a cog?

I totally agree, Ray. It's all about booting. Tool systems are going to totally rewrite the upper RAM, anyway. That 6s needs to come down to as little as possible. If need be, let's get your code running in another cog this time, like we had talked about earlier. If it can run fast in hub, that's ideal, but if not, run it in a cog.

Cluso99 · 2018-11-03 02:09

Chip or Peter,
Can you time this on a CRO please...

Compile with pnut, then save the xxx.obj as file "_BOOT_P2.BIX" on a FAT32-formatted SD card.
Boot the P2, timing from the end of the reset pulse to the start of the high pulse on P0.

Let me know if you have FLASH installed and the R4 10K pull-up on Flash/CS P61 (this means there is a delay while trying to read FLASH first).

CON
  _greenled     = 0                                     ' 1=ON
DAT
                orgh    0
                org     0
entry
                drvh    #_greenled                      ' 1=ON
.loop           drvnot  #_greenled                      ' GREEN
                waitx   delay
                jmp     #.loop

delay           long    $2_000000                       ' 2 * $1000000 (~33.5M) clocks

Meanwhile I will think about whether there is another, easier way to test this.
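
For anyone reading the scope trace, it may help to know how long each half-period of the LED toggle should be. The sketch below converts the `delay` constant into seconds. The RC oscillator frequency is an assumption (the thread quotes both 20MHz+ and ~22MHz), and the couple of extra clocks that the instructions themselves take are ignored.

```python
DELAY_CLOCKS = 0x2_000000  # value of `delay` in the test program above

def delay_seconds(clocks: int, clock_hz: float) -> float:
    """Time taken to count `clocks` cycles at `clock_hz`."""
    return clocks / clock_hz

# Assumed RC oscillator frequencies; the real value varies from chip to chip.
for mhz in (20, 22):
    seconds = delay_seconds(DELAY_CLOCKS, mhz * 1e6)
    print(f"{mhz} MHz: {seconds:.3f} s between toggles")
```

Measuring the toggle period on the scope therefore also gives an estimate of the actual RC oscillator frequency on a given board.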

cgracey · 2018-11-03 03:23

Cluso99 wrote: »
Chip or Peter,
Can you time this on a CRO please...


You need a cheap Chinese oscilloscope. If you buy yourself one, I'll reimburse you somehow.

Cluso99 · 2018-11-03 04:17

cgracey wrote: »
You need a cheap Chinese oscilloscope. If you buy yourself one, I'll reimburse you somehow.

I have been thinking how to time the code using the P2 CNT counter.
After some time going up the wrong garden path, I now think I have a better way - just coding it now

cgracey · 2018-11-03 04:29

Cluso99 wrote: »
I have been thinking how to time the code using the P2 CNT counter.
After some time going up the wrong garden path, I now think I have a better way - just coding it now

Try to achieve whatever is possible using the 20MHz+ RC oscillator.

Do you think 1s is reasonable?

cgracey · 2018-11-03 04:30

Cluso99, let booting be your only objective. That's what people are going to use.

rogloh · 2018-11-03 06:07

FWIW a 20MHz AVR MCU can hit reasonable transfer speeds of over 1MB/s with FATFS with large 4k blocks in SPI mode. Admittedly this does not take into account initial startup latency so it might not be a fair comparison, but here it is regardless. I would hope that once initialized we could achieve somewhat similar transfer rates on a P2. I guess as of now P2 is about an order of magnitude slower if it takes 6s to boot 512kB even if the card init/boot timeout is about 1s or thereabouts (though not sure what it might be). The AVR has the luxury of a hardware shift register, whereas we would need to bitbang but there is the pin streamer too and smartpins if they can be leveraged here.

http://elm-chan.org/fsw/ff/res/rwtest1.png

http://elm-chan.org/fsw/ff/00index_e.html

Peter Jakacki · 2018-11-03 10:53

rogloh wrote: »

FWIW a 20MHz AVR MCU can hit reasonable transfer speeds of over 1MB/s with FATFS with large 4k blocks in SPI mode. Admittedly this does not take into account initial startup latency so it might not be a fair comparison, but here it is regardless. I would hope that once initialized we could achieve somewhat similar transfer rates on a P2. I guess as of now P2 is about an order of magnitude slower if it takes 6s to boot 512kB even if the card init/boot timeout is about 1s or thereabouts (though not sure what it might be). The AVR has the luxury of a hardware shift register, whereas we would need to bitbang but there is the pin streamer too and smartpins if they can be leveraged here.

http://elm-chan.org/fsw/ff/res/rwtest1.png

http://elm-chan.org/fsw/ff/00index_e.html

Even with bit-banging the P2 can do around 200kB/s at 20MHz and what with smart pins we should be able to do much better. Once the clock is running then I am getting over 3MB/s rates. The slow boot is more to do with the fact that Ray was trying to make it general purpose and safe but we know the speed it's running at so all we need it to do is boot as fast as it possibly can (and safely). Plenty of testers available soon, and hopefully they will help out.
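
To put those rates in perspective, the sketch below works out how many system clocks are available per SPI bit at a given throughput, and the minimum SPI bit clock a given byte rate implies. It is plain arithmetic on the figures quoted above; the function names are made up for illustration, and command and token overhead on the SD bus is ignored.

```python
def clocks_per_bit(clock_hz: float, bytes_per_sec: float) -> float:
    """System clocks available for each transferred bit."""
    return clock_hz / (bytes_per_sec * 8)

def min_spi_clock_hz(bytes_per_sec: float) -> float:
    """Lowest SPI bit clock that can sustain the given byte rate."""
    return bytes_per_sec * 8

print(clocks_per_bit(20e6, 200e3))   # bit-banging at 20MHz, 200kB/s
print(min_spi_clock_hz(3e6) / 1e6)   # 3MB/s expressed as an SPI clock in MHz
```

So 200kB/s at 20MHz leaves about 12.5 clocks per bit, and 3MB/s needs an SPI clock of at least 24MHz, which is only reachable once the system clock has been raised well above the RC oscillator.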

I will try out the synchronous modes on the smart pins next to see if we can't do a lot better at 20MHz.

Ym2413a · 2018-11-03 11:16

Peter Jakacki wrote: »

Plenty of testers available soon, and hopefully they will help out.

Can't wait to help out . : ]

rogloh · 2018-11-03 11:38

Good news is it seems we all still have an opportunity to improve things from here, and if more ROM space does happen to eventuate in time, who knows what other wonderful features the P2 may end up with. The whole HDMI feature is looking pretty good now; even the bitbang type has good potential. With that and USB it may ultimately be possible to be truly self-hosting from reset at some point. Take your P2, point it at an HDTV, plug in a USB keyboard, hit reset and you are away, ready to code or debug some hardware on your board, with no PC or internet required to first obtain flash/boot images etc. And perhaps some nice VGA-style IDE could be crafted in, but maybe that's just getting too ambitious. I know things like this should all be doable with boot flash, serial terminals, SD cards etc., but maybe there are still some useful possibilities even without those extra requirements.

Edit: I'd forgotten the frequency detection thing. Boot code may not know how to set up the PLL the right way to achieve video frequencies (e.g. 250MHz), which could throw a spanner in the works for the self-hosting ideas above, unless it can be made to sense the frequency itself somehow, or some new rule requires any new self-hosting mode to run at a known frequency. Reality always gets more complicated, as usual.

Cluso99 · 2018-11-03 12:27

Here are some SD boot times for loading a single-sector FAT32 file. The time is measured from SD code start (i.e. after Chip's boot section has run and passed control to my routine) until the SD has been initialised, the FAT32 file has been located, and its first sector (it's a 1-sector file) has been read into hub $00000.

From a power boot (takes the SD card the longest time to initialise):
~7,285,151 clocks (presuming 20MHz -> 330ms)
From reset with previous SD initialisation (i.e. warm boot):
~2,406,791 clocks (presuming 20MHz -> 120ms)

A 512B FAT32 file at RCOSC (assumed 20MHz) loads in 0.330s from power boot and 0.120s from warm boot (from above)
A 496KB FAT32 file at RCOSC (assumed 20MHz) loads in 10.985s (power boot) and 10.729s (warm boot)
A 496KB FAT32 file at 96MHz loads in 3.064s (power boot) and 2.842s (warm boot)
A 496KB FAT32 file at 250MHz loads in 1.638s (power boot) and 1.591s (warm boot)

I have tried running the code in COG but it's not working yet.
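
The clock counts above can be converted to time with a one-line calculation. The sketch below does this for two assumed RC oscillator frequencies, since the actual frequency was not measured. At exactly 20MHz the power-boot count works out to about 364ms, while the quoted 330ms corresponds to roughly 22MHz; the warm-boot figure of 120ms matches 20MHz.

```python
POWER_BOOT_CLOCKS = 7_285_151  # from the post above
WARM_BOOT_CLOCKS = 2_406_791   # from the post above

def clocks_to_ms(clocks: int, clock_hz: float) -> float:
    """Convert a clock count to milliseconds at the given clock frequency."""
    return clocks / clock_hz * 1000.0

# The RC oscillator frequency is an assumption; the thread mentions 20MHz and ~22MHz.
for mhz in (20, 22):
    power_ms = clocks_to_ms(POWER_BOOT_CLOCKS, mhz * 1e6)
    warm_ms = clocks_to_ms(WARM_BOOT_CLOCKS, mhz * 1e6)
    print(f"{mhz} MHz: power boot {power_ms:.0f} ms, warm boot {warm_ms:.0f} ms")
```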

cgracey · 2018-11-03 13:25

Cluso99 wrote: »

Here are some SD boot times for loading a single-sector FAT32 file. The time is measured from SD code start (i.e. after Chip's boot section has run and passed control to my routine) until the SD has been initialised, the FAT32 file has been located, and its first sector (it's a 1-sector file) has been read into hub $00000.

From a power boot (takes the SD card the longest time to initialise):
~7,285,151 clocks (presuming 20MHz -> 330ms)
From reset with previous SD initialisation (i.e. warm boot):
~2,406,791 clocks (presuming 20MHz -> 120ms)

A 512B FAT32 file at RCOSC assumed 20MHz loads in 0.330s and 0.120s (from above)
A 496KB FAT32 file at RCOSC assumed 20MHz loads in 10.985s and 10.729s
A 496KB FAT32 file at 96MHz loads in 3.064s and 2.842s
A 496KB FAT32 file at 250MHz loads in 1.638s and 1.591s

I have tried running the code in COG but it's not working yet.

It seems to me that you would only want to load a loader program which could turn on the crystal and do the big load.

Is there a way to specify the length of the load to be done by your booter?

Cluso99 · 2018-11-03 15:27

Yes. It loads the file's full length, up to a max of 496KB, so you cannot overwrite the 16KB loaded from the ROM. A 2-stage loader is the expected concept.

jmg · 2018-11-03 21:01

Cluso99 wrote: »

I have been thinking how to time the code using the P2 CNT counter.

You could just load Bean's Reciprocal Frequency Counter into a P1, and use that ?

jmg · 2018-11-03 21:08

cgracey wrote: »

It seems to me that you would only want to load a loader program which could turn on the crystal and do the big load.

The crystal value is somewhat unknown at boot time ?
The values above are approx. 50 kB/s loading, so modest code sizes will load fine, i.e. a 1s boot is possible, just not with a full-image load.
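
A quick way to check that estimate is to derive the gross load rate from Cluso99's timings and then see how much code fits into a 1s boot. The sketch below assumes KB means 1024 bytes, uses the gross rate (total time including card initialisation), and subtracts the quoted 0.330s power-boot time for the first sector from the budget, so it is a rough estimate only.

```python
FILE_BYTES = 496 * 1024  # assumes KB means 1024 bytes

# (clock description, power-boot seconds, warm-boot seconds), as quoted in the thread
MEASURED = [
    ("RCOSC, assumed 20MHz", 10.985, 10.729),
    ("96MHz", 3.064, 2.842),
    ("250MHz", 1.638, 1.591),
]

def gross_rate(num_bytes: int, seconds: float) -> float:
    """Average bytes per second over the whole load."""
    return num_bytes / seconds

def bytes_within(budget_s: float, init_s: float, rate_bytes_per_s: float) -> int:
    """Bytes that fit in the time left after initialisation."""
    usable = max(0.0, budget_s - init_s)
    return int(usable * rate_bytes_per_s)

for label, power_s, warm_s in MEASURED:
    print(f"{label}: {gross_rate(FILE_BYTES, power_s) / 1000:.1f} kB/s from power boot")

rc_rate = gross_rate(FILE_BYTES, 10.985)
print(bytes_within(1.0, 0.330, rc_rate), "bytes fit in a 1 s power boot (rough estimate)")
```

On these assumptions the RC-oscillator rate comes out at about 46 kB/s, and roughly 30KB of code fits within a 1s power boot.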

jmg · 2018-11-03 21:09

rogloh wrote: »

Edit: I'd forgotten the frequency detection thing. Boot code may not know how to set up the PLL the right way to achieve video frequencies (e.g. 250MHz), which could throw a spanner in the works for the self-hosting ideas above, unless it can be made to sense the frequency itself somehow, or some new rule requires any new self-hosting mode to run at a known frequency. Reality always gets more complicated, as usual.

Yes, P2 lacks any means to check presence or value of external clock/xtal, so boot ROM has to be 20MHz RCFAST focused.

jmg · 2018-11-03 21:17

rogloh wrote: »

I guess as of now P2 is about an order of magnitude slower if it takes 6s to boot 512kB even if the card init/boot timeout is about 1s or thereabouts (though not sure what it might be).

From above, a tiny load of 512B takes 330ms from system POR or 120ms from a P2 (re)reset.

rogloh wrote: »

The AVR has the luxury of a hardware shift register, whereas we would need to bitbang but there is the pin streamer too and smartpins if they can be leveraged here.

Chip uses the Smart pins for UART Boot mode.

On the topic of speed, can the RCFAST feed the PLL/VCO? If yes, then that opens up another speed boost, where it could select ~100MHz, perhaps under some user choice?

cgracey · 2018-11-03 22:07

No, the RC fast cannot feed the PLL.

78rpm · 2018-11-03 22:11

Among the boot pins, are there any without pull-ups or pull-downs, so that a voltage divider could be fitted and read at boot as an analogue value by a smart pin, with voltage ranges indicating *safe* alternate speeds to switch to? For development boards, the positive half of the divider could run from +ve through 3 current-limit resistors to an RGB LED, with a trim-pot below it to set the 'speed range' for the divider; the low-side terminal is the pin, and there is a pull-down resistor to -ve. The different operating voltages of the RGB LED give a visual indication of the range selected, so Parallax's customers will not need a multimeter (think education, introduction to a hobby) to select the correct voltage if there is a problem with the SD card's speed. The pull-up and pull-down resistors should be chosen so that any digital signal on the pin is unaffected.

If this is a workable solution, provide a jumper or small PCB switch to disconnect the LEDs' connection to the pin and the bottom half of the voltage divider, for people who use battery power. The LEDs are only there as a visual indication for setting the voltage range, which selects the *safe* speed to switch to.
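
The decision logic behind this idea can be sketched in a few lines: compute the divider voltage, then map it onto a small number of bands. Everything specific below (the 3.3V supply, the resistor values, the band thresholds and the band names) is a hypothetical placeholder, not something proposed in the thread.

```python
def divider_voltage(vcc: float, r_top: float, r_bottom: float) -> float:
    """Voltage at the midpoint of a two-resistor divider."""
    return vcc * r_bottom / (r_top + r_bottom)

# Hypothetical bands: (lowest voltage in band, meaning). Must be sorted by voltage.
BANDS = [
    (0.0, "RC fast only"),
    (1.1, "speed range 1"),
    (2.2, "speed range 2"),
]

def select_range(volts: float) -> str:
    """Return the highest band whose threshold the voltage reaches."""
    chosen = BANDS[0][1]
    for threshold, name in BANDS:
        if volts >= threshold:
            chosen = name
    return chosen

v = divider_voltage(3.3, 10e3, 10e3)  # equal resistors give half the supply
print(f"{v:.2f} V -> {select_range(v)}")
```

In practice the bands would need generous guard gaps to allow for resistor tolerance and ADC error, which is one reason only a few coarse ranges would be practical.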

jmg · 2018-11-03 22:24

78rpm wrote: »

With boot pins, are there any without pull-ups and pull-downs such that a voltage divider could be constructed which when booting is read as an analogue value by smart-pin with ranges indicating *safe* alternate speeds to switch to. For development boards the positive half of the divider could be connected +ve to 3 x current limit resistors to an RGB led with a trim-pot below it to set the 'speed range' for the divider, the low side terminal is the pin and there is a pull-down resistor to -ve. Different operating voltages or the RGB led give a visual indication of the range selected, so Parallax's customers will not need a multimeter (think education, introduction to a hobby) to select the correct voltage if there is a problem with the SD card's speed. The pull-up and pull-down resistors should be such that any digital signal on the pin is unaffected.

If this is a workable solution, provide a jumper or small pcb switch to disconnect the leds connection to the pin and bottom half of the voltage divider for people who use battery power. The leds are only there as a visual indication to set the voltage range which selects switching to a *safe* speed.

Interesting idea. Some i2c parts use such 'trinary levels' for address setting.

Even existing pins could be used for this, the digital decision can remain, but a finer detail could be extracted from the exact voltage.
Some parts may have external loads, (eg I think SD cards have CE pullups already) so that may complicate some pins usage ?

jmg · 2018-11-03 22:40

rogloh wrote: »

FWIW a 20MHz AVR MCU can hit reasonable transfer speeds of over 1MB/s with FATFS with large 4k blocks in SPI mode.
http://elm-chan.org/fsw/ff/res/rwtest1.png

Hmm, do they mean bytes or bits here ?
1M+ is plausible, but they also claim 2M+ and that's not possible as bytes over the 10Mb/s link.
Likewise in their benchmark 2, they claim 18Mb/s link, but 7561kB/s - which must actually mean bits/s ?

rogloh · 2018-11-03 23:03

@jmg, good point, though according to the footnote in benchmark 1, the CF modes were using GPIO (probably parallel) so they could potentially be higher than 1MB/s. The 20MHz AVR USART-SPI is limited to 10MHz which can do ~1.2 MB/s best case, which was not exceeded.

Benchmark 2 - not sure, different processor, ARM I think. They do show the bus speed limits here at 9000kB/s for a 72MHz processor which may be possible, if their kB = 1000 not 1024 anyway and that processor supports SPI rates that high?

I do think they are talking MB/s, not Mb/s.
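
The bytes-versus-bits question can be checked against the hard ceiling of a serial link: a link clocked at N bits per second cannot carry more than N/8 bytes per second. The sketch below applies that to the figures discussed above. The helper names are made up for illustration, and protocol overhead is ignored, so real rates sit below the ceiling.

```python
def spi_ceiling_bytes_per_sec(bits_per_sec: float) -> float:
    """Upper bound on payload bytes per second for a 1-bit-wide serial link."""
    return bits_per_sec / 8

def plausible_as_bytes(claimed_bytes_per_sec: float, link_bits_per_sec: float) -> bool:
    """True if the claimed byte rate does not exceed the link ceiling."""
    return claimed_bytes_per_sec <= spi_ceiling_bytes_per_sec(link_bits_per_sec)

print(spi_ceiling_bytes_per_sec(10e6))   # 10MHz AVR USART-SPI
print(plausible_as_bytes(2e6, 10e6))     # 2MB/s over a 10Mb/s serial link?
print(plausible_as_bytes(7561e3, 18e6))  # 7561kB/s over an 18Mb/s link?
```

A 10Mb/s link tops out at 1.25MB/s, so figures above that must come from a different interface (such as the parallel GPIO used for CF) or from a different unit.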

Peter Jakacki · 2018-11-04 01:55

These secondary bootloaders annoy me in that they are an extra layer that needs to be added, when all the boot code needs to know is what it has for a clock. So I propose that both the Flash and the SD have configuration parameters that are set by the user, so that when they save a file they can also describe their system in a way the boot loader can use. The loader wants to know whether there is an oscillator or a crystal, the frequency of that crystal, the desired runtime frequency after booting, and perhaps the baud rate of any terminal that may be connected. The SD config could either be written in the MBR or stored as a file _CONFIG_.TXT, and the serial Flash could use the area from address 0 as its "MBR".

That way the SD can boot all 512kB, if it needs to, within about 200ms or less, especially since the config may perhaps override any resistor settings.
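
To make the proposal concrete, here is a minimal sketch of what parsing such a config might look like. The key=value text format, the key names (`source`, `xtal_hz`, `run_hz`, `baud`) and the `BootConfig` type are all hypothetical; the thread names the four parameters but does not define a file format.

```python
from dataclasses import dataclass

@dataclass
class BootConfig:
    clock_source: str  # "xtal" or "osc" (hypothetical spelling)
    xtal_hz: int       # frequency of the crystal or oscillator
    run_hz: int        # desired runtime frequency after booting
    baud: int          # baud rate of any attached terminal

def parse_config(text: str) -> BootConfig:
    """Parse a hypothetical key=value config; '#' starts a comment."""
    values = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {raw!r}")
        values[key.strip().lower()] = value.strip()
    source = values["source"].lower()
    if source not in ("xtal", "osc"):
        raise ValueError(f"unknown clock source: {source!r}")
    return BootConfig(source, int(values["xtal_hz"]),
                      int(values["run_hz"]), int(values["baud"]))

example = """
# hypothetical _CONFIG_.TXT contents
source  = xtal
xtal_hz = 12000000
run_hz  = 180000000
baud    = 115200
"""
print(parse_config(example))
```

A ROM implementation would more likely use a fixed binary layout at a known offset to keep the parser tiny; the text form is used here only because it is easy to read.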

BTW, I have PS/2 keyboards working very nicely in conjunction with VGA! All the meta keys are working too.

potatohead · 2018-11-04 02:14

Nice work Peter.

I agree. Perhaps a fallback, should those be wrong or missing.

Would be super nice to put them in the clear, at a simple offset from the start of the image too. People can edit a binary if needed, or a tool, boot loader menu, or OS can do that.

rogloh · 2018-11-04 02:17

Peter Jakacki wrote: »

BTW, I have PS/2 keyboards working very nicely in conjunction with VGA! All the meta keys are working too.

Very cool, bring on that ROM size upgrade! Won't take you long and we'll have HDMI + USB soon enough and boot right into something like VGA IDE.

Cluso99 · 2018-11-04 02:26

jmg wrote: »

Cluso99 wrote: »

I have been thinking how to time the code using the P2 CNT counter.

You could just load Bean's Reciprocal Frequency Counter into a P1, and use that ?

Yes, I did think of getting my P1 out of the closet. I just wanted to use the P2's CNT register for that tho.

Cluso99 · 2018-11-04 02:45

We do not know the crystal frequency.
We don't know if it's an oscillator.
We don't know the desired run frequency.

Here is my code for HUBSET...

_XTALFREQ       = 12_000_000                                    ' crystal frequency
_XDIV           = 12                                            ' crystal divider to give 1MHz
_XMUL           = 180                                           ' crystal / div * mul to give _CLKFREQ
_CLOCKFREQ      = _XTALFREQ / _XDIV * _XMUL                     ' %0000_xxxE_DDDDDD_MMMMMMMMMM_PPPP_CC_SS  ' set clock generator mode
_SETFREQ        = $0100_00F4 + (_XDIV-1)<<18 + (_XMUL-1)<<8     ' %0000_0001_dddddd_mmmmmmmmmm_1111_01_00  ' ena-xtal+PLL,div=12,mul=180,p=15,0pF,20MHz+
_ENAFREQ        = _SETFREQ + 3                                  ' %0000_0001_dddddd_mmmmmmmmmm_1111_01_11  ' enable xxMHZ oscillator
.....
''-------[ Set Xtal ]---------------------------------------------------------- 
                hubset  #0                              ' set 20MHz+ mode
                hubset  ##_SETFREQ                      ' setup oscillator
                waitx   ##20_000_000/100                ' ~10ms
                hubset  ##_ENAFREQ                      ' enable oscillator
'+-----------------------------------------------------------------------------+

Here, I am using a 12MHz oscillator. I have also used PPPP=15, which is another divider. Neither of these options is coded as a CON yet.
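
The constants above can be cross-checked by rebuilding the HUBSET value and decoding its fields according to the %0000_xxxE_DDDDDD_MMMMMMMMMM_PPPP_CC_SS layout in the comments. The sketch below is a model of that arithmetic only, not code for the P2. Note that the original expression relies on the assembler's operator precedence, whereas Python applies + before <<, so the shifts are parenthesised here.

```python
XTALFREQ = 12_000_000  # crystal/oscillator frequency, from the post
XDIV = 12              # divider, giving 1MHz
XMUL = 180             # multiplier

def set_freq(xdiv: int, xmul: int) -> int:
    """Rebuild _SETFREQ: base value plus the divider and multiplier fields."""
    return 0x0100_00F4 + ((xdiv - 1) << 18) + ((xmul - 1) << 8)

def decode(value: int) -> dict:
    """Split a HUBSET clock value into the fields named in the comments."""
    return {
        "E": (value >> 24) & 0x1,
        "D": (value >> 18) & 0x3F,
        "M": (value >> 8) & 0x3FF,
        "P": (value >> 4) & 0xF,
        "CC": (value >> 2) & 0x3,
        "SS": value & 0x3,
    }

setfreq = set_freq(XDIV, XMUL)
enafreq = setfreq + 3  # _ENAFREQ: same value with SS = %11
clockfreq = XTALFREQ // XDIV * XMUL

print(hex(setfreq), decode(setfreq))
print(hex(enafreq), decode(enafreq))
print(clockfreq)
```

The decoded D and M fields hold the divider and multiplier minus one (11 and 179), and _CLOCKFREQ comes out at 180MHz, which agrees with the _XMUL = 180 constant.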

We cannot code for all these variations in the Boot ROM. The only way we can use them is to configure the first-stage FLASH/SD bootloader.

Now, in my P1 OS, I can tell which hardware I am using, and therefore know which XTAL and clockfreq I am using. Therefore I can have one SD card which I can use in all my various boards.
I don't expect this luxury in P2 unless we can all agree on a standard xtal or oscillator. Any takers? Start a new thread if you think we could get consensus - I am not holding my breath.
