Thanks for the update! I tried this new version and it was not noticeably faster at running my xbasic test program. The old version of Catalina using the -x3 memory layout took 12 seconds and this version took 11 seconds. That could easily be explained by errors in pushing the stopwatch button on my watch. In any case, it's good to have the cache bug fixed since I'm sure it could have caused problems.
Thanks,
David
Something is not right. On my C3 it is at least twice as fast. I hope I haven't missed something out of the upgrade. Can you post your binary (and your makefile options) and I'll try it when I get home.
Thanks,
Ross.
Here is the binary and the makefile. Hopefully I didn't mess something up!
No, it looks like it was me that messed up. I was working with some experimental changes to the caching algorithm, and I appear to have left them enabled.
In the file Catalina_SPI_Cache.spin you will find a line (currently commented out) that says:
'#define DISABLE_HASH
Remove the quote mark (i.e. define the symbol DISABLE_HASH) and try your program again. Note that you also have to recompile both the xmm.binary (in the utilities folder). You should see the program speed double.
Ross.
Thanks Ross! As you suggested, defining DISABLE_HASH almost doubled the speed of xbasic. It now takes about 7 seconds to compile and run my test program rather than 11-12. While that is certainly an improvement, it is still too slow to be useful. This is only a 35 line program. This isn't Catalina's fault entirely though. The xbasic bytecode compiler makes three passes over the source code so it is parsing the program three times. I may try compiling xbasic for the PIC24H on Andre' LaMothe's Chameleon PIC board just to see how it performs. It may not be much better. Of course, xbasic runs with blinding speed on my MacBook Pro! :-)
Hi David,
Additional speed improvements are possible, but it's never going to make the C3 an order of magnitude faster - not while programs have to be executed out of serial memory! At some point someone may make a parallel RAM add-on board for the C3, and that could change things.
I will keep the caching driver as an option since it also improves performance on other platforms - provided you can afford to sacrifice that much Hub RAM!
Ross.
David,
One more suggestion - why not arrange to load and save the byte-coded format? This was common practice in the "old" days of Basic interpreters (which were all generally pretty slow!). This makes the compilation speed less of an issue.
Ross.
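A minimal sketch of the load/save idea in C, assuming the compiled program is just a flat array of bytes. The function names, the `XB01` tag and the file layout are hypothetical and are not taken from xbasic; the point is only that a saved image can be reloaded without running the compiler passes again.

```c
#include <stdio.h>
#include <string.h>

#define IMAGE_MAGIC "XB01"   /* hypothetical 4-byte tag, not a real xbasic format */

/* Save a compiled image: tag, 4-byte little-endian length, then the bytes.
   Returns 0 on success, -1 on any I/O error. */
int save_image(const char *path, const unsigned char *code, unsigned long length)
{
    unsigned char header[8];
    FILE *fp = fopen(path, "wb");
    int ok;

    if (fp == NULL) return -1;
    memcpy(header, IMAGE_MAGIC, 4);
    header[4] = (unsigned char)(length & 0xFF);
    header[5] = (unsigned char)((length >> 8) & 0xFF);
    header[6] = (unsigned char)((length >> 16) & 0xFF);
    header[7] = (unsigned char)((length >> 24) & 0xFF);
    ok = fwrite(header, 1, 8, fp) == 8 &&
         fwrite(code, 1, (size_t)length, fp) == (size_t)length;
    if (fclose(fp) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Load an image written by save_image into code[0..capacity-1].
   Returns the number of bytes loaded, or -1 if the file is missing,
   has the wrong tag, is truncated, or does not fit. */
long load_image(const char *path, unsigned char *code, unsigned long capacity)
{
    unsigned char header[8];
    unsigned long length;
    FILE *fp = fopen(path, "rb");
    long result = -1;

    if (fp == NULL) return -1;
    if (fread(header, 1, 8, fp) == 8 && memcmp(header, IMAGE_MAGIC, 4) == 0) {
        length = (unsigned long)header[4] | ((unsigned long)header[5] << 8) |
                 ((unsigned long)header[6] << 16) | ((unsigned long)header[7] << 24);
        if (length <= capacity &&
            fread(code, 1, (size_t)length, fp) == (size_t)length) {
            result = (long)length;
        }
    }
    fclose(fp);
    return result;
}
```

With something like this, the interpreter could try `load_image` first and fall back to compiling the source only when no saved image is found.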
That is certainly possible. In fact, this basic system started out as a compiler that ran on a PC and a VM that ran on the PIC, AVR, or Propeller. Andre' convinced me that we needed a language that would run on the Propeller without need for a PC so I stripped my compiler down and made it fit on the Propeller with external memory.
Does xbasic run on the dracblade? Also, do you have a link to a xbasic download by any chance?
It's kind of a work in progress. For instance, I haven't completed the heap manager for dynamic strings yet. It should run on the Dracblade but I haven't tried it. I'll attach the sources to this message if you promise not to laugh too loud when you look at them! :-)
Thanks for that. So - dumb question here, but is this the same as the xbasic you find when you search google? Or is this something you are writing yourself?
I thought David would answer this question, so I didn't.
Yes, xbasic runs on the DracBlade using the same caching driver as the C3. It is slightly faster than on the C3 - say 5s rather than 6s or 7s to run David's test program.
I don't think David would regard that as a really significant speed up.
However, just out of interest, I also tried it on the RamBlade and it runs in about 1.5s - this is partly due to the faster XMM RAM on the RamBlade (I think it is the fastest platform in that respect) and also because the RamBlade clock speed is 100MHz instead of 80MHz. I wonder if David would consider that fast enough for his purposes?
David and I have exchanged a few PMs today and maybe it is worth taking this to a discussion here as this is very interesting.
Cluso's ramblade is definitely the fastest platform around. I think this gives us a benchmark to work from in terms of how fast things can be if you really optimise the code.
I took another look at the dracblade driver code and there are a few things that could be improved.
''Dracblade driver for talking to a ram chip via three latches
'' Modified code from Cluso's triblade
' DoCmd(command_, hub_address, ram_address, block_length)
' R - read bytes at address n up (n to n+block_length) where n =0 to 65535 (ie lower 64k of the sram chip)
' W - write bytes at address n up
' I - initialise
' N - Led on
' F - Led off
' H - set high latch to value in ramaddress A16 to A23 (will include the led)
VAR
' communication params(5) between cog driver code - only "command" and "errx" are modified by the driver
long command, hubaddrs, ramaddrs, blocklen, errx, cog ' rendezvous between spin and assembly (can be used cog to cog)
' command = R, W, N, F H =0 when operation completed by cog
' hubaddrs = hub address for data buffer
' ramaddrs = ram address for data ($0000 to $FFFF)
' blocklen = ram buffer length for data transfer
' errx = returns =0 (false=good), else <>0 (true & error code)
' cog = cog no of driver (set by spin start routine)
PUB start : err_
' Initialise the Drac Ram driver. No actual changes to ram as the read/write routines handle this
  command := "I"
  cog := 1 + cognew(@tbp2_start, @command)
  if cog == 0
    err_ := $FF ' error = no cog
  else
    repeat while command ' driver cog sets =0 when done
    err_ := errx ' driver cog sets =0 if no error, else xx = error code
PUB stop
  if cog
    cogstop(cog~ - 1)
PUB DoCmd(command_, hub_address, ram_address, block_length) : err_
' Do the command: R, W, N, F, H
  hubaddrs := hub_address ' hub address start
  ramaddrs := ram_address ' ram address start
  blocklen := block_length ' block length
  command := command_ ' must be last !!
' Wait for command to complete and get status
  repeat while command ' driver cog sets =0 when done
  err_ := errx ' driver cog sets =0 if no error, else xx = error code
PUB rendezvous
  return @command
DAT
'' +--------------------------------------------------------------------------+
'' | Dracblade Ram Driver (with grateful acknowledgements to Cluso) |
'' +--------------------------------------------------------------------------+
org 0
tbp2_start ' setup the pointers to the hub command interface (saves execution time later
' +-- These instructions are overwritten as variables after start
comptr mov comptr, par ' -| hub pointer to command
hubptr mov hubptr, par ' | hub pointer to hub address
ramptr add hubptr, #4 ' | hub pointer to ram address
lenptr mov ramptr, par ' | hub pointer to length
errptr add ramptr, #8 ' | hub pointer to error status
cmd mov lenptr, par ' | command I/R/W/G/P/Q
hubaddr add lenptr, #12 ' | hub address
ramaddr mov errptr, par ' | ram address
len add errptr, #16 ' | length
err nop ' -+ error status returned (=0=false=good)
' Initialise hardware (unlike the triblade, just tristates everything and read/write set the pins)
init mov err, #0 ' reset err=false=good
mov dira,zero ' tristate the pins
done wrlong err, errptr ' status =0=false=good, else error x
wrlong zero, comptr ' command =0 (done)
' wait for a command (pause short time to reduce power)
pause mov ctr, delay wz ' if =0 no pause
if_nz add ctr, cnt
if_nz waitcnt ctr, #0 ' wait for a short time (reduces power)
rdlong cmd, comptr wz ' command ?
if_z jmp #pause ' not yet
' decode command
cmp cmd, #"R" wz ' R = read block
if_z jmp #rdblock
cmp cmd, #"W" wz ' W = write block
if_z jmp #wrblock
cmp cmd, #"N" wz ' N= led on
if_z jmp #led_turn_on
cmp cmd, #"F" wz ' F = led off
if_z jmp #led_turn_off
cmp cmd, #"H" wz ' H sets the high latch
if_z jmp #sethighlatch
mov err, cmd ' error = cmd (unknown command)
jmp #done
tristate mov dira,zero ' all inputs to zero
jmp #done
' turn led on
led_turn_on or HighLatch,ledpin ' set the led pin high
jmp #OutputHighLatch ' send this out
led_turn_off andn HighLatch,ledpin ' set the led pin low
jmp #OutputHighLatch ' send this out
' set high address bytes with command H, pass value in third variable of the DoCmd
' 4 bytes - masks off all but bits 16 to 23
sethighlatch call #ram_open ' gets address value in 'address'
shr address,#16 ' shift right by 16 places
and address,#$FF ' ensure rest of bits zero
mov HighLatch,address ' put value into HighLatch
jmp #OutputHighLatch ' and output it
'---------------------------------------------------------------------------------------------------------
'Memory Access Functions
rdblock call #ram_open ' get variables from hub variables
rdloop call #read_memory_byte ' read byte from address into data_8
wrbyte data_8,hubaddr ' write data_8 to hubaddr ie copy byte to hub
add hubaddr,#1 ' add 1 to hub address
add address,#1 ' add 1 to ram address
djnz len,#rdloop ' loop until done
jmp #init ' reinitialise
wrblock call #ram_open
wrloop rdbyte data_8, hubaddr ' copy byte from hub
call #write_memory_byte ' write byte from data_8 to address
add hubaddr,#1 ' add 1 to hub address
add address,#1 ' add 1 to ram address
djnz len,#wrloop ' loop until done
jmp #init ' reinitialise
ram_open rdlong hubaddr, hubptr ' get hub address
rdlong ramaddr, ramptr ' get ram address
rdlong len, lenptr ' get length
mov err, #5 ' err=5
mov address,ramaddr ' cluso's variable 'ramaddr' to dracblade variable 'address'
ram_open_ret ret
read_memory_byte call #RamAddress ' sets up the latches with the correct ram address
mov dira,LatchDirection2 ' for reads so P0-P7 tristate till do read
mov outa,GateHigh ' actually ReadEnable but they are the same
andn outa,GateHigh ' set gate low
nop ' short delay to stabilise
nop
mov data_8, ina ' read SRAM
and data_8, #$FF ' extract 8 bits
or outa,GateHigh ' set the gate high again
read_memory_byte_ret ret
write_memory_byte call #RamAddress ' sets up the latches with the correct ram address
mov outx,data_8 ' get the byte to output
and outx, #$FF ' ensure upper bytes=0
or outx,WriteEnable ' or with correct 138 address
mov outa,outx ' send it out
andn outa,GateHigh ' set gate low
nop ' no nop doesn't work, one does, so put in two to be sure
nop ' another NOP
or outa,GateHigh ' set it high again
write_memory_byte_ret ret
RamAddress ' sets up the ram latches. Assumes high latch A16-A18 low so only accesses 64k of ram
mov dira,LatchDirection ' set up the pins for programming latch chips
mov outx,address ' get the address into a temp variable
and outx,#$FF ' mask the low byte
or outx,LowAddress ' or with 138 low address
mov outa,outx ' send it out
andn outa,GateHigh ' set gate low
' ?? a NOP
or outa,GateHigh ' set it high again
' now repeat for the middle byte
mov outx,address ' get the address into a temp variable
shr outx,#8 ' shift right by 8 places
and outx,#$FF ' mask the low byte
or outx,MiddleAddress ' or with 138 middle address
mov outa,outx ' send it out
andn outa,GateHigh ' set gate low
or outa,GateHigh ' set it high again
RamAddress_ret ret
OutputHighLatch ' sends out HighLatch to the 374 that does A16-19, led and the 4 spare outputs
mov dira,latchdirection ' setup active pins 138 and bus
mov outa,HighLatch ' send out HighLatch
or outa,HighAddress ' or with the high address
andn outa,GateHigh ' set gate low
or outa,GateHigh ' set the gate high again
OutputHighLatch_ret jmp #tristate ' set pins tristate
delay long 80 ' waitcnt delay to reduce power (#80 = 1uS approx)
ctr long 0 ' used to pause execution (lower power use) & byte counter
GateHigh long %00000000_00000000_00000001_00000000 ' HC138 gate high, all others must be low
Outx long 0 ' for temp use, same as n in the spin code
LatchDirection long %00000000_00000000_00001111_11111111 ' 138 active, gate active and 8 data lines active
LatchDirection2 long %00000000_00000000_00001111_00000000 ' for reads so data lines are tristate till the read
LowAddress long %00000000_00000000_00000101_00000000 ' low address latch = xxxx010x and gate high xxxxxxx1
MiddleAddress long %00000000_00000000_00000111_00000000 ' middle address latch = xxxx011x and gate high xxxxxxx1
HighAddress long %00000000_00000000_00001001_00000000 ' high address latch = xxxx100x and gate high xxxxxxx1
'ReadEnable long %00000000_00000000_00000001_00000000 ' /RD = xxxx000x and gate high xxxxxxx1
' commented out as the same as GateHigh
WriteEnable long %00000000_00000000_00000011_00000000 ' /WE = xxxx001x and gate high xxxxxxx1
Zero long %00000000_00000000_00000000_00000000 ' for tristating all pins
data_8 long %00000000_00000000_00000000_00000000 ' so code compatibility with zicog driver
address long %00000000_00000000_00000000_00000000 ' address for ram chip
ledpin long %00000000_00000000_00000000_00001000 ' to turn on led
HighLatch long %00000000_00000000_00000000_00000000 ' static value for the 374 latch that does the led, hA16-A19 and the other 4 outputs
1) there is a deliberate delay
' wait for a command (pause short time to reduce power)
pause mov ctr, delay wz ' if =0 no pause
if_nz add ctr, cnt
if_nz waitcnt ctr, #0 ' wait for a short time (reduces power)
rdlong cmd, comptr wz ' command ?
- maybe save some lines there
2) Reading in blocks of data. There are 19 address lines on a 512k chip and at the moment these are in two groups - the High group A16 to A18 and the Low and Middle group which are grouped together. This seemed natural for the Z80 emulations with 16 bit addresses.
But what if we separate out the Low and Middle latches?
I count 46 instructions to read one byte from external memory. Surely that can be decreased?!!
First thing might be to leave the middle latch unchanged and just change the lower latch. Maybe do it in groups of 4 bytes, or maybe in groups of 16 or 256?
I think that can save 8 instructions per byte.
Also I think by doing things in blocks, you don't have to keep checking for new instructions each byte. Say the requesting program wanted a Long, well then you can skip a whole lot of rechecking code for new requests.
I think that can halve the number of instructions per byte if you do Longs.
And then one might think about optimising further. For C, it depends on the probability that an instruction will cause a branch outside a block of n bytes. At the extremes, say you requested byte x and it read in the next 64k of bytes. This will take a lot of time but with a small probability that a jump will go outside this block. Read in 1 long, and that is inefficient too. I'm not sure of the maths, but say the probability of a jump was 10%, then maybe as a guess it might be best to read in 16 bytes as a block?
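The trade-off can be put into a toy model. The sketch below is only a guess at the maths, not a measurement: it assumes execution starts at the first byte of a block, that each byte consumed has the same probability `p_jump` of jumping out of the block, and that a block read costs a fixed `setup_cost` plus `byte_cost` per byte. All names and cost figures are made up for illustration.

```c
/* Expected number of bytes actually used from a block of block_size bytes,
   if each byte consumed has probability p_jump of jumping out of the block. */
double expected_useful_bytes(double p_jump, int block_size)
{
    double total = 0.0;
    double still_inside = 1.0;   /* probability we have not jumped out yet */
    int k;

    for (k = 0; k < block_size; k++) {
        total += still_inside;
        still_inside *= (1.0 - p_jump);
    }
    return total;
}

/* Cost, in arbitrary instruction units, per byte that is actually used. */
double cost_per_useful_byte(double p_jump, int block_size,
                            double setup_cost, double byte_cost)
{
    return (setup_cost + byte_cost * block_size) /
           expected_useful_bytes(p_jump, block_size);
}

/* Try power-of-two block sizes from 1 to 256 and return the cheapest one. */
int best_block_size(double p_jump, double setup_cost, double byte_cost)
{
    int best = 1;
    int size;
    double best_cost = cost_per_useful_byte(p_jump, 1, setup_cost, byte_cost);

    for (size = 2; size <= 256; size *= 2) {
        double cost = cost_per_useful_byte(p_jump, size, setup_cost, byte_cost);
        if (cost < best_cost) {
            best_cost = cost;
            best = size;
        }
    }
    return best;
}
```

The answer depends entirely on the assumed costs and jump probability, so real instruction counts from the driver would have to be plugged in before drawing any conclusion about the best block size.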
The driver code above already has an instruction for reading in blocks, it is just that I think mostly we read in blocks of 1, ie a byte. Ross, a) is that how catalina works and b) where is the source code for the dracblade driver file and what is it called?
So you might pass an address n=0 to 512k.
1) is this in the same high/medium latch range as the last request?
2) If yes, read bytes but only change the low latch.
3) If no then update the medium and high latches.
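The three steps above can be sketched as a small C model, just to show the bookkeeping. It does not touch any pins; `latch_state` and `latch_select` are made-up names, and the model assumes a 512k chip where A0-A7 go to the low latch and A8-A18 to the middle and high latches.

```c
#include <stdint.h>

typedef struct {
    uint32_t last_upper;    /* address bits A8 and above from the previous access */
    int      valid;         /* 0 until the first access has set every latch */
    unsigned latch_writes;  /* running total, only for comparing strategies */
} latch_state;

/* Decide which latches an access needs.
   Returns 1 if only the low latch must be written,
   or 3 if the middle and high latches must be updated as well. */
unsigned latch_select(latch_state *st, uint32_t address)
{
    uint32_t upper = (address >> 8) & 0x7FFu;  /* A8..A18 */
    unsigned writes = 1;                       /* the low latch is always written */

    if (!st->valid || upper != st->last_upper) {
        writes = 3;                            /* middle + high + low */
        st->last_upper = upper;
        st->valid = 1;
    }
    st->latch_writes += writes;
    return writes;
}
```

In the PASM driver the same test would be a compare of the shifted address against a saved copy, followed by a conditional call to the code that sets the upper latches.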
I wonder also about a lookahead cache. The requesting spin code requests a byte at address n. The cog goes and starts reading from this address. I'd need to check speeds, but there is a fairly good chance the cog will be faster than the requesting spin, so the cog will always be ahead of the requesting program. From the requesting program's point of view, it requests byte n and for the next 256 bytes the values are always correct in a buffer.
Then there is another variable - how often would the cog code check the passed parameter to see if the calling program wants a different block. Maybe if the probability of a branch in C is 10%, you check only every 10 bytes? If so, that saves even more code.
Hi Dr_A ...
Yes absolutely - I've not really done any optimization on the original caching driver code yet. In fact it only currently supports the DRACBLADE at all because David's and Jazzed's original driver code already did!
What I plan to do next is rewrite the interface from the caching driver to use my standard XMM code. That code is already written for all XMM platforms, and is much more optimized (although probably still a long way from being as good as it could be!).
That's about the last thing I expect to do before I am ready to release Catalina 3.0.
David, Jazzed ...
I found a bug in the Catalina SD Card driver initialization code that seems to show up on the C3. I've now fixed it, but if you are having occasional strange problems with programs sometimes not being able to access the SD card (but which work ok when you reload them) then this may be the reason. It may also have affected other platforms - for example I think it is the reason I was having occasional problems with the SD card on the RamBlade (and for which I was - quite unfairly - blaming Cluso!).
Oh darn. Someone *extremely* clever has already split the middle and lower latch! This XMM driver looks extremely well optimised. I think only caching would improve that, and any improvements due to caching will apply equally to the C3.
XMM_IncAddr
add XMM_Addr,#1 ' inc sram address
mov outx,XMM_Addr ' does result of incrementing ...
and outx,#$FF ' ... require updating latch 8 - 15 or 16 - 19?
tjnz outx,#XMM_Set0_7 ' if not, just set latch for addr bits 0 - 7
call #XMM_SetAddr ' otherwise we must set all latches
jmp #XMM_IncAddr_ret ' done
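To see how much this saves on a sequential read, here is a small C model of the rule in XMM_IncAddr. It only counts latch writes and is not the driver itself; it assumes three latches (A0-A7, A8-15, A16-19) and that the first byte of a transfer sets all of them. The function name is made up.

```c
#include <stdint.h>

/* Count latch writes for reading `count` consecutive bytes starting at `start`.
   After each increment only the A0-A7 latch is written, unless the low byte
   wrapped round to zero, in which case all three latches are written. */
unsigned long sequential_latch_writes(uint32_t start, uint32_t count)
{
    uint32_t addr = start;
    unsigned long writes;
    uint32_t i;

    if (count == 0) return 0;
    writes = 3;                                  /* first byte: set every latch */
    for (i = 1; i < count; i++) {
        addr++;
        writes += ((addr & 0xFFu) != 0) ? 1 : 3; /* wrap to $00 needs all latches */
    }
    return writes;
}
```

Under these assumptions a 256-byte read that starts on a page boundary needs one full latch setup and then a single latch write per byte, which is why splitting the low latch from the others pays off.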
Sorry I didn't post my reply in here. I was trying not to hijack your thread to discuss xbasic. I guess I should try the RamBlade. I've had one for a long time but have never done anything with it. I guess I stopped when I discovered that you couldn't use the standard pin 31/30 serial I/O. How do you have your RamBlade configured?
Thanks,
David
Just with the SRAM and SD Card. I just use the normal PropPlug for comms. As shown on the diagram below, you plug it onto the middle 4 pins for programming the EEPROM, and the bottom 4 pins for terminal I/O (and use Catalyst to load programs off the SD Card).
Thanks Ross! I'll have to try that setup. I guess I was put off a bit by the fact that I would have to reconnect the serial interface to reprogram the card. It makes development a bit of a pain. I wonder why he used the high numbered pins for his SRAM interface?
If I know Cluso, it was done for a good reason - most likely because it allowed the SRAM to be used with the least possible number of instructions.
Ross.
P.S. In a lot of ways, the RamBlade is my favorite board. If only it could be powered by the USB port, it would be the ideal "portable" Prop platform!
The cogjects project now has 8 cogjects. These can be used by Spin, and they can also be used by Catalina. I have plans to write drivers for other languages as well.
What this means in C is that there is no more 'inline' PASM code in the C program. Do the debugging in Spin and then, when it works, move it over to C. The following code is for the Serial driver, and I have left the Spin code in as this will be useful in translating Spin in the future.
From a practical perspective, Spin can only do so much even with cogjects. The SD driver takes about 1/4 of hub, a decent video buffer takes just under 20k, and there is not much space left for code.
C in XMM on the other hand puts the SD driver into external memory and most of the hub is free for a video buffer.
/* PASM cogject demonstration, see also cogject example in spin*/
#include <stdio.h>
#include <stdlib.h> // exit
#include <string.h> // strlen, strcpy
unsigned long cogarray[511]; // external memory common cog array
// start of C functions
void clearscreen() // white text on dark blue background
{
int i;
for (i=0;i<40;i++)
{
t_setpos(0,0,i); // move cursor to next line
t_color(0,0x08FC); // RRGGBBxx eg dark blue background 00001000 white text 11111100
}
}
void sleep(int milliseconds) // sleep function
{
_waitcnt(_cnt()+(milliseconds*(_clockfreq()/1000))-4296);
}
char peek(int address) // function implementation of peek
{
return *((char *)address);
}
void poke(int address, char value) // function implementation of poke
{
*((char *)address) = value;
}
void external_memory_cog_load(int cognumber, unsigned long cogdata[], unsigned long parameters_array[]) // load a cog from external memory
{
unsigned long hubcog[511]; // create a local array, this is in hub ram, not external ram
int i;
for(i=0;i<511;i++) // the arrays hold 511 longs, so i<512 would run one past the end
{
hubcog[i]=cogdata[i]; // move from external memory to a local array in hub
}
_coginit((int)parameters_array>>2, (int)hubcog>>2, cognumber); // load the cog
}
unsigned long serial_start(unsigned long rxpin,unsigned long txpin,unsigned long mode, unsigned long baudrate, int cognumber, unsigned long par[], unsigned long cogdata[])
{
/*
PUB start(rxpin, txpin, mode, baudrate) : okay
'' Start serial driver - starts a cog
'' returns false if no cog available
''
'' mode bit 0 = invert rx
'' mode bit 1 = invert tx
'' mode bit 2 = open-drain/source tx
'' mode bit 3 = ignore tx echo on rx
stop
longfill(@rx_head, 0, 4)
longmove(@rx_pin, @rxpin, 3)
bit_ticks := clkfreq / baudrate
buffer_ptr := @rx_buffer
okay := cog := cognew(@entry, @rx_head) + 1
*/
unsigned long okay = 0; // placeholder so the return value is defined; the _coginit result is not passed back yet
unsigned long bit_ticks;
unsigned long buffer_ptr;
par[0] = 0; // rx_head longfill(@rx_head, 0, 4)
par[1] = 0; // rx_tail
par[2] = 0; // tx_head
par[3] = 0; // tx_tail
par[4] = rxpin; // longmove(@rx_pin, @rxpin, 3)
par[5] = txpin; // note - if rewrite the pasm code could save a couple of hub longs here
par[6] = mode; // as rxpin and txpin are not used anywhere else
bit_ticks = _clockfreq() / baudrate; // bit_ticks := clkfreq / baudrate
par[7] = bit_ticks;
buffer_ptr = (unsigned long)&par[9]; // buffer_ptr := @rx_buffer points to start of circular buffer
par[8] = buffer_ptr; // pointer to the start of the circular buffers
// rx buffer is 9 to 12 and tx buffer is 13 to 16 (16 bytes =4 longs)
external_memory_cog_load(cognumber,cogdata,par); // load from external ram
// okay returns the cog number or -1 if a fail page 119 manual. Ignored here
// printf("par array is at %u \n",(unsigned long)&par[0]);
// printf("par array entry 1 is at %u \n",(unsigned long)&par[1]);
// printf("par array entry 7 is at %u \n",(unsigned long)&par[7]);
// printf("rx_head is at %u \n",(unsigned long)&par[9]);
// printf("buffer_ptr is %u \n",par[8]);
return okay;
}
void serial_tx(char tx,unsigned long par[])
{
/*
PUB tx(txbyte)
'' Send byte (may wait for room in buffer)
repeat until (tx_tail <> (tx_head + 1) & $F)
tx_buffer[tx_head] := txbyte
tx_head := (tx_head + 1) & $F
if rxtx_mode & %1000
rx
*/
unsigned long tx_head;
int address;
while ( par[3] == ((par[2] + 1 ) & 0xF)) {} // wait if the head has looped right round and is now one less than the tail
tx_head = par[2]; // get the head value
address = par[8] + 16 + tx_head; // location of rx buffer plus 16 to get tx buffer plus the head value
poke(address,tx); // poke the tx byte value to hub ram
tx_head = tx_head + 1; // add one
tx_head = tx_head & 0xF; // logical and with 15
par[2] = tx_head; // store it back again
// need to add the echo mode?
}
unsigned long serial_rxcheck(unsigned long par[])
{
/*
PUB rxcheck : rxbyte
'' Check if byte received (never waits)
'' returns -1 if no byte received, $00..$FF if byte
rxbyte--
if rx_tail <> rx_head
rxbyte := rx_buffer[rx_tail]
rx_tail := (rx_tail + 1) & $F
*/
unsigned long rxbyte; // actually is a long, so can return -1 FFFFFFFF if nothing and 0-FF if a byte
int address; // hub address
rxbyte = 0; // set explicitly to zero
rxbyte = rxbyte - 1; // return ffffffff if nothing
if (par[1] != par[0])
{
address = par[8] + par[1]; // par[8] is the rx buffer, par[1] is rx_tail
rxbyte = peek(address); // get the return byte from the buffer
par[1] = (par[1] +1) & 0xF; // add one to tail
}
return rxbyte;
}
unsigned long serial_rx(unsigned long par[])
{
/*
PUB rx : rxbyte
'' Receive byte (may wait for byte)
'' returns $00..$FF
repeat while (rxbyte := rxcheck) < 0
*/
unsigned long rxbyte; // actually is a long, not a byte
while ((rxbyte = serial_rxcheck(par)) == -1) {} // 0xffffffff and -1 works, but " < 0" gives a compiler error
return rxbyte; // return the value
}
void serial_rxflush(unsigned long par[]) // flush receive buffer
{
while (serial_rxcheck(par) != -1) {} // keep checking until buffer clear
}
unsigned long serial_rxtime(unsigned long ms,unsigned long par[]) // wait ms milliseconds for byte, -1 if nothing
{
unsigned long rxbyte = -1;
unsigned long counter = 0; // start a counter, 10ms ticks
ms = ms / 10; // internal delay for 1ms ticks is too high
while (((rxbyte = serial_rxcheck(par)) == -1) && (counter < ms)) // wait until a byte or counter times out
{
_waitcnt(_cnt()+(10*(_clockfreq()/1000))-4296); // wait 10 milliseconds
counter +=1; // add one to counter
}
return rxbyte;
}
void serial_str(char lineoftext[],unsigned long par[]) // send out the string
{
/*
'' Send string
repeat strsize(stringptr)
tx(byte[stringptr++])
*/
int i;
for(i=0; i<strlen(lineoftext);i++)
{
serial_tx(lineoftext[i],par); // send out the bytes one at a time
}
}
void serial_dec(signed long value,unsigned long par[]) // send out decimal value - unsigned
{
/*
'' Print a decimal number
if value < 0
-value
tx("-")
i := 1_000_000_000
repeat 10
if value => i
tx(value / i + "0")
value //= i
result~~
elseif result or i == 1
tx("0")
i /= 10
*/
char lineoftext[12] = ""; // enough room for a 32 bit long 2^32 and possibly the minus sign
sprintf(lineoftext, "%ld", value); // convert to a string (value is a long, so use %ld)
// printf ("lineoftext is now: %s\n", lineoftext);
serial_str(lineoftext,par); // send out the string
}
void serial_hex(unsigned long value, unsigned long par[]) // send out a hex value
/*
'' Print a hexadecimal number
value <<= (8 - digits) << 2
repeat digits
tx(lookupz((value <-= 4) & $F : "0".."9", "A".."F"))
*/
{
char lineoftext[9] = ""; // enough room for FFFFFFFF plus the terminating zero
sprintf(lineoftext,"%lx",value); // convert to hex value (value is an unsigned long)
serial_str(lineoftext,par); // send it out
}
void serial_crlf(unsigned long par[]) // send a crlf
{
serial_tx(13,par); // cr
serial_tx(10,par); // lf
}
int EoF (FILE* stream)
{
register int c, status = ((c = fgetc(stream)) == EOF);
ungetc(c,stream);
return status;
}
void readcog(char *filename,unsigned long external_cog[]) // read in a .cog file into external memory array
{
int i;
FILE *FP1;
i = 0;
if((FP1=fopen(filename,"rb"))==0) // open the file
{
fprintf(stderr,"Can't open file %s\n",filename);
exit(1);
}
fseek(FP1,0,0);
for(i=0;i<24;i++)
{
getc(FP1); // read in the first 24 bytes and discard
}
i = 0;
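while(!EoF(FP1) && (i<505)) // run until end of file or 511-6
{
unsigned long b0 = (unsigned long)(getc(FP1) & 0xFF); // read the four bytes in separate statements,
unsigned long b1 = (unsigned long)(getc(FP1) & 0xFF); // because C does not define the order in which
unsigned long b2 = (unsigned long)(getc(FP1) & 0xFF); // several getc() calls in one expression are evaluated
unsigned long b3 = (unsigned long)(getc(FP1) & 0xFF);
external_cog[i] = b0 | (b1<<8) | (b2<<16) | (b3<<24); // assemble the long, low byte first
i+=1;
}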
if(FP1)
{
fclose(FP1); // close the file
FP1=NULL;
}
printf("external array cog first long = 0x%lx \n",external_cog[0]); // hex value
}
void serial_demo(unsigned long serial_parameters[]) // demonstrate the serial cog code
{
int i;
unsigned long value = 0x80000000; // as a signed 32-bit long this is -2147483648, the most negative value
char lineoftext[80]; // for string testing
unsigned long received_byte; // actually a long, not a byte
clearscreen(); // white on blue vga
printf("Clock speed %u \n",_clockfreq()); // see page 28 of the propeller manual for other useful commands
printf("Catalina running in cog number %i \n",_cogid()); // integer
readcog("serial.cog",cogarray); // read into general external memory cog array
serial_start(31,30,0,38400,7,serial_parameters,cogarray); // start serial cog pins 31,30, mode 0, cog 7, 38400 baud
printf("Started serial driver\n");
for(i=0; i<10; i++)
{
serial_tx(65+i,serial_parameters); // test sending a byte 10x (delay for starting a serial terminal program)
sleep(500);
printf("send byte %u \n",65+i);
}
serial_crlf(serial_parameters);
strcpy(lineoftext,"This is a really long string test with a slow baud rate to check buffer overruns"); // store a string
serial_str(lineoftext,serial_parameters); // send it out
serial_crlf(serial_parameters); // new line
serial_dec(value,serial_parameters); // send out a big decimal number
serial_crlf(serial_parameters); // new line
serial_str("Hex value is ",serial_parameters);
serial_hex(value,serial_parameters); // send out a hex value
serial_crlf(serial_parameters);
serial_rxflush(serial_parameters); // flush the receive buffer
printf("Type a character within the next 3 seconds \n"); // test the timeout
received_byte = serial_rxtime(3000,serial_parameters); // get a byte with a timeout
printf("character was ascii %d \n",received_byte); // %d is signed
printf("type some characters \n");
for (i=0;i<10;i++) // test 10 times (more than 16 would be needed to wrap the 16 byte buffer)
{
received_byte = serial_rx(serial_parameters); // get a byte
serial_tx(received_byte,serial_parameters); // echo it back
printf("sent back byte %u \n",received_byte);
}
printf("demo program finished \n");
}
void main ()
{
unsigned long serial_parameters[16]; // reserve hub space in main for buffer, head tail pointers
serial_demo(serial_parameters); // demo routines
while (1); // endless loop as prop reboots on exit from main()
}
A quick question
In spin
n <-= 1
in C, is this
n = (n << 1) | (n >> 31);
Also - I now have catalina booting up in text mode, then stopping the vga drivers and reloading a graphics driver 160x120. I can change the colors from within C eg screen[0] = 0xffffffff sets 4 pixels to white.
However, the screen buffer is stored in longs, and I want to access it in bytes. In spin, the command is
byte[myarray][number] := n
but how would you do this in C?
get the unsigned long, and clear one byte and replace with the new byte?
or get a pointer to the start of the array, add n bytes, then poke a value into hub ram?
or another way?
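A minimal sketch of both options from the question, offered as an illustration rather than as Catalina's recommended way. The function names are made up for this example, and uint32_t is used so the code also behaves on a PC; on the Propeller an unsigned long is 32 bits, so the rotate expression in the question works there as long as n is unsigned. The mask-and-shift version assumes byte 0 is the least significant byte of each long, which matches the Propeller's little-endian hub memory.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by one bit, like Spin's  n <-= 1 .
   The operand must be unsigned and exactly 32 bits wide. */
uint32_t rotl1(uint32_t n)
{
    return (n << 1) | (n >> 31);
}

/* Option 1: fetch the long, clear one byte, merge in the new byte.
   Byte 0 is taken to be the least significant byte of each long. */
void set_byte_masked(uint32_t *longs, unsigned index, uint8_t value)
{
    unsigned shift = (index & 3u) * 8u;      /* bit position of the byte */
    uint32_t *p = &longs[index >> 2];        /* long that holds the byte */
    *p = (*p & ~((uint32_t)0xFF << shift)) | ((uint32_t)value << shift);
}

/* Option 2: view the same memory as bytes, much like
   Spin's  byte[myarray][number] := n . */
void set_byte_ptr(uint32_t *longs, unsigned index, uint8_t value)
{
    ((unsigned char *)longs)[index] = value;
}
```

Option 2 is usually the simpler choice, since accessing any object through an unsigned char pointer is allowed in C. For the screen buffer it would be called as set_byte_ptr(screen, number, n).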
Comments
Hi David,
Additional speed improvements are possible, but it's never going to make the C3 an order of magnitude faster - not while programs have to be executed out of serial memory! At some point someone may make a parallel RAM add-on board for the C3, and that could change things.
I will keep the caching driver as an option since it also improves performance on other platforms - provided you can afford to sacrifice that much Hub RAM!
Ross.
David,
One more suggestion - why not arrange to load and save the byte-coded format? This was common practice in the "old" days of Basic interpreters (which were all generally pretty slow!). This makes the compilation speed less of an issue.
Ross.
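The load/save idea above could be sketched roughly as follows. This is not xbasic's actual file format: the 8-byte header (a magic tag followed by the code length, both stored low byte first), the IMAGE_MAGIC value and the function names are all assumptions made for the example.

```c
#include <stdio.h>
#include <stdint.h>

#define IMAGE_MAGIC 0x58424331ul   /* hypothetical tag, not a real xbasic value */

/* Write a header plus the compiled bytecode. Returns 0 on success, -1 on error. */
int save_image(FILE *fp, const uint8_t *code, uint32_t length)
{
    uint8_t header[8];
    int i;
    for (i = 0; i < 4; i++) {
        header[i]     = (uint8_t)(IMAGE_MAGIC >> (8 * i));  /* magic, low byte first */
        header[4 + i] = (uint8_t)(length >> (8 * i));       /* length, low byte first */
    }
    if (fwrite(header, 1, sizeof header, fp) != sizeof header) return -1;
    if (fwrite(code, 1, length, fp) != length) return -1;
    return 0;
}

/* Read an image back into code[]. Returns the bytecode length, or -1 if the
   header is wrong, the image does not fit, or the file is truncated. */
long load_image(FILE *fp, uint8_t *code, uint32_t capacity)
{
    uint8_t header[8];
    uint32_t magic = 0, length = 0;
    int i;
    if (fread(header, 1, sizeof header, fp) != sizeof header) return -1;
    for (i = 0; i < 4; i++) {
        magic  |= (uint32_t)header[i]     << (8 * i);
        length |= (uint32_t)header[4 + i] << (8 * i);
    }
    if (magic != IMAGE_MAGIC || length > capacity) return -1;
    if (fread(code, 1, length, fp) != length) return -1;
    return (long)length;
}
```

With something like this, the three compiler passes would only be paid when the source changes; a normal run would call load_image and go straight to the VM.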
That is certainly possible. In fact, this basic system started out as a compiler that ran on a PC and a VM that ran on the PIC, AVR, or Propeller. Andre' convinced me that we needed a language that would run on the Propeller without need for a PC so I stripped my compiler down and made it fit on the Propeller with external memory.
It's kind of a work in progress. For instance, I haven't completed the heap manager for dynamic strings yet. It should run on the Dracblade but I haven't tried it. I'll attach the sources to this message if you promise not to laugh too loud when you look at them! :-)
I think David's xbasic is different to the one you are probably finding on Google.
Also, I have added improvements to the way plugins are registered for release 3.0 ...
This program:
produces this output:
This should greatly simplify identifying, stopping and re-starting cogs at runtime.
Ross.
Hi Dr_A,
I thought David would answer this question, so I didn't.
Yes, xbasic runs on the DracBlade using the same caching driver as the C3. It is slightly faster than on the C3 - say 5s rather than 6s or 7s to run David's test program.
I don't think David would regard that as a really significant speed up.
However, just out of interest, I also tried it on the RamBlade and it runs in about 1.5s - this is partly due to the faster XMM RAM on the RamBlade (I think it is the fastest platform in that respect) and also because the RamBlade clock speed is 100MHz instead of 80MHz. I wonder if David would consider that fast enough for his purposes?
Ross.
Cluso's ramblade is definitely the fastest platform around. I think this gives us a benchmark to work from in terms of how fast things can be if you really optimise the code.
I took another look at the dracblade driver code and there are a few things that could be improved.
1) there is a deliberate delay
- maybe save some lines there
2) Reading in blocks of data. There are 19 address lines on a 512k chip and at the moment these are in two groups - the High group A16 to A18 and the Low and Middle group which are grouped together. This seemed natural for the Z80 emulations with 16 bit addresses.
But what if we separate out the Low and Middle latches?
I count 46 instructions to read one byte from external memory. Surely that can be decreased?!!
First thing might be to leave the middle latch unchanged and just change the lower latch. Maybe do it in groups of 4 bytes, or maybe in groups of 16 or 256?
I think that can save 8 instructions per byte.
Also I think by doing things in blocks, you don't have to keep checking for new instructions each byte. Say the requesting program wanted a Long, well then you can skip a whole lot of rechecking code for new requests.
I think that can halve the number of instructions per byte if you do Longs.
And then one might think about optimising further. For C, it depends on the probability that an instruction will cause a branch outside a block of n bytes. At the extremes, say you requested byte x and it read in the next 64k of bytes. This will take a lot of time but with a small probability that a jump will go outside this block. Read in 1 long, and that is inefficient too. I'm not sure of the maths, but say the probability of a jump was 10%, then maybe as a guess it might be best to read in 16 bytes as a block?
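One way to put rough numbers on the block-size guess above is a simple geometric model. It assumes every byte has the same, independent probability of ending the sequential run with a jump, which is a simplification and not something measured on Catalina. The function name is made up for this sketch.

```c
/* Expected number of bytes actually consumed from a block of block_size
   bytes, if each byte independently has probability p_jump of ending the
   sequential run (simple geometric model, an assumption for illustration). */
double expected_bytes_used(double p_jump, int block_size)
{
    double still_running = 1.0;  /* probability that the run reaches byte k */
    double expected = 0.0;
    int k;
    for (k = 0; k < block_size; k++) {
        expected += still_running;        /* byte k is used if no jump came before it */
        still_running *= 1.0 - p_jump;    /* chance of surviving past byte k */
    }
    return expected;
}
```

Under this model, with the 10% figure guessed above, a 16-byte block yields roughly 8 useful bytes on average, and larger blocks add very little because the total can never exceed 1/p = 10 bytes.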
The driver code above already has an instruction for reading in blocks, it is just that I think mostly we read in blocks of 1, ie a byte. Ross, a) is that how catalina works and b) where is the source code for the dracblade driver file and what is it called?
So you might pass an address n=0 to 512k.
1) is this in the same high/medium latch range as the last request?
2) If yes, read bytes but only change the low latch.
3) If no then update the medium and high latches.
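The three steps above can be modelled in C as below. This is only a sketch of the decision logic: the real driver is PASM and writes the latches through I/O pins, and the struct, the function name and the latch_writes counter are hypothetical, added so the saving can be counted.

```c
#include <stdint.h>

/* Hypothetical model of the address latches on a 512k (19-bit) SRAM. */
typedef struct {
    uint32_t last_upper;    /* address bits A8..A18 held in the middle/high latches */
    int      upper_valid;   /* 0 until the latches have been loaded once */
    unsigned latch_writes;  /* number of latch updates so far, for comparison */
} latch_state;

/* Select an address, rewriting the middle and high latches only when
   bits A8..A18 differ from the previous request. */
void select_address(latch_state *s, uint32_t address)
{
    uint32_t upper = (address >> 8) & 0x7FFu;   /* 11 bits: A8..A18 */
    if (!s->upper_valid || upper != s->last_upper) {
        s->latch_writes += 2;                   /* middle latch + high latch */
        s->last_upper = upper;
        s->upper_valid = 1;
    }
    s->latch_writes += 1;                       /* low latch (A0..A7) is always written */
}
```

For sequential reads inside one 256-byte page this costs one latch write per byte instead of three.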
I wonder also about a lookahead cache. The requesting spin code requests a byte at address n. The cog goes and starts reading from this address. I'd need to check speeds, but there is a fairly good chance the cog will be faster than the requesting spin, so the cog will always be ahead of the requesting program. From the requesting program's point of view, it requests byte n and for the next 256 bytes the values are always correct in a buffer.
Then there is another variable - how often would the cog code check the passed parameter to see if the calling program wants a different block. Maybe if the probability of a branch in C is 10%, you check only every 10 bytes? If so, that saves even more code.
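A rough host-side sketch of the lookahead buffer idea, assuming the external RAM can be stood in for by a plain byte array. On the Propeller the refill would be done by the driver cog; every name here (readahead, readahead_get, BLOCK_SIZE) is invented for the example. The caller is assumed to pass only addresses below xmm_size.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 256u

typedef struct {
    const uint8_t *xmm;       /* stand-in for the external memory */
    uint32_t xmm_size;        /* size of that memory in bytes */
    uint32_t base;            /* external address held in buffer[0] */
    uint32_t count;           /* valid bytes in buffer (0 = nothing loaded yet) */
    unsigned refills;         /* number of block reads performed */
    uint8_t  buffer[BLOCK_SIZE];
} readahead;

/* Return the byte at address, refilling the buffer starting at that
   address whenever the request falls outside the buffered range. */
uint8_t readahead_get(readahead *r, uint32_t address)
{
    /* Unsigned subtraction wraps, so addresses below base also miss. */
    if (address - r->base >= r->count) {
        uint32_t count = r->xmm_size - address;
        if (count > BLOCK_SIZE) count = BLOCK_SIZE;   /* do not read past the end */
        memcpy(r->buffer, r->xmm + address, count);
        r->base = address;
        r->count = count;
        r->refills++;
    }
    return r->buffer[address - r->base];
}
```

Zero the struct, set xmm and xmm_size, and then compare refills against the number of requests to see how often a given access pattern leaves the buffered range.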
Hi Dr_A ...
Yes absolutely - I've not really done any optimization on the original caching driver code yet. In fact it only currently supports the DRACBLADE at all because David's and Jazzed's original driver code already did!
What I plan to do next is rewrite the interface from the caching driver to use my standard XMM code. That code is already written for all XMM platforms, and is much more optimized (although probably still a long way from being as good as it could be!).
That's about the last thing I expect to do before I am ready to release Catalina 3.0.
David, Jazzed ...
I found a bug in the Catalina SD Card driver initialization code that seems to show up on the C3. I've now fixed it, but if you are having occasional strange problems with programs sometimes not being able to access the SD card (but which work ok when you reload them) then this may be the reason. It may also have affected other platforms - for example I think it is the reason I was having occasional problems with the SD card on the RamBlade (and for which I was - quite unfairly - blaming Cluso!).
Hi Dr_A,
All the XMM code for all platforms is now in the file XMM.inc in the target directory. Look for the section marked #elseifdef DRACBLADE
If you can streamline the DracBlade code, I'll include it in the next release.
Ross.
P.S. If you modify the code, try not to use any more longs - the XMM kernel has very few longs to spare!
Oh darn. Someone *extremely* clever has already split the middle and lower latch! This XMM driver looks extremely well optimised. I think only caching would improve that, and any improvements due to caching will apply equally to the C3.
Sorry I didn't post my reply in here. I was trying not to hijack your thread to discuss xbasic. I guess I should try the RamBlade. I've had one for a long time but have never done anything with it. I guess I stopped when I discovered that you couldn't use the standard pin 31/30 serial I/O. How do you have your RamBlade configured?
Thanks,
David
Just with the SRAM and SD Card. I just use the normal PropPlug for comms. As shown on the diagram below, you plug it onto the middle 4 pins for programming the EEPROM, and the bottom 4 pins for terminal I/O (and use Catalyst to load programs off the SD Card).
Ross.
If I know Cluso, it was done for a good reason - most likely because it allowed the SRAM to be used with the least possible number of instructions.
Ross.
P.S. In a lot of ways, the RamBlade is my favorite board. If only it could be powered by the USB port, it would be the ideal "portable" Prop platform!
Catalina 3.0 has been released. It has a new thread here.
Ross.
What this means in C is that there is no more 'inline' PASM code in the C program. Do the debugging in Spin and then, when it works, move it over to C. The following code is for the serial driver, and I have left the Spin code in as this will be useful when translating Spin in the future.
From a practical perspective, Spin can only do so much even with cogjects. The SD driver takes about 1/4 of hub, a decent video buffer takes just under 20k, and there is not much space left for code.
C in XMM on the other hand puts the SD driver into external memory and most of the hub is free for a video buffer.