Fast reading/writing of many I/O pins at once?
z80jon
Posts: 12
in Propeller 2
I've written some C code for the Propeller 1, but my first C project on the P2 (using FlexProp) is quite a big one--I'm trying to write 1-2MB to an external parallel RAM chip in a fraction of a second, ideally. (Sidenote: I'm not particularly familiar with P2 ASM or Spin).
As it stands now, my code to write a single byte takes ~2.6 microseconds to run, which is far slower than I'd like. I know that at 160 MHz, each instruction takes ~12.5 ns to execute. Given that I'm only changing the values of 8 data lines and 21 address lines and doing a few other things, I'm not sure why it's taking this long, and I'm especially unsure how to speed things up.
I know that the P2 is stable in excess of 200 MHz, but my efforts to increase the clock speed have thus far not worked.
The chip I'm using is an IS61WV20488FALL, which certainly has the speeds to handle it.
Right now, I'm not using sequential pins for address/data, and oddly enough from my experiments in using set_outputs and 'pretending' I had them sequentially lined up for code benchmarking purposes, I didn't see any notable difference in timing.
I guess what I'd like help with is this:
-How can I increase the clock speed of the P2? (I've played around a bit with clkset, but haven't gotten anything to run faster than the default speed set when flashed by FlexProp.)
-How could my code be optimized further?
#include "ram.h"
#include "globalConfig.h"
#include "simpletools.h"
#include "propeller.h"
#define RAM_DEBUG 1
void writeByte(uint32_t addr, uint8_t data) {
    //ready address and set the system to WRITE...
    setAddrPins(addr);
    high(RAM_OE);
    setRAMIOPins(data);
    low(RAM_WE);
    low(RAM_CS);
    //brief delay to let the RAM chip read the lines
    __asm {
        nop
    };
    high(RAM_WE);
}

uint8_t readByte(uint32_t addr) {
    setRAMIOPinsToRead();
    setAddrPins(addr);
    high(RAM_WE);
    low(RAM_OE);
    low(RAM_CS);
    //brief delay to give the chip time to propagate outputs
    __asm {
        nop
    };
    return readRAMIOPins();
}

void setAddrPins(uint32_t addr) {
    set_output(RAM_A0, addr & 0x01);
    set_output(RAM_A1, (addr & 0x02) >> 1);
    set_output(RAM_A2, (addr & 0x04) >> 2);
    set_output(RAM_A3, (addr & 0x08) >> 3);
    set_output(RAM_A4, (addr & 0x010) >> 4);
    set_output(RAM_A5, (addr & 0x020) >> 5);
    set_output(RAM_A6, (addr & 0x040) >> 6);
    set_output(RAM_A7, (addr & 0x080) >> 7);
    set_output(RAM_A8, (addr & 0x0100) >> 8);
    set_output(RAM_A9, (addr & 0x0200) >> 9);
    set_output(RAM_A10, (addr & 0x0400) >> 10);
    set_output(RAM_A11, (addr & 0x0800) >> 11);
    set_output(RAM_A12, (addr & 0x01000) >> 12);
    set_output(RAM_A13, (addr & 0x02000) >> 13);
    set_output(RAM_A14, (addr & 0x04000) >> 14);
    set_output(RAM_A15, (addr & 0x08000) >> 15);
    set_output(RAM_A16, (addr & 0x010000) >> 16);
    set_output(RAM_A17, (addr & 0x020000) >> 17);
    set_output(RAM_A18, (addr & 0x040000) >> 18);
    set_output(RAM_A19, (addr & 0x080000) >> 19);
    set_output(RAM_A20, (addr & 0x0100000) >> 20);
}

void setRAMIOPins(uint8_t data) {
    set_output(RAM_D0, data & 0x01);
    set_output(RAM_D1, (data & 0x02) >> 1);
    set_output(RAM_D2, (data & 0x04) >> 2);
    set_output(RAM_D3, (data & 0x08) >> 3);
    set_output(RAM_D4, (data & 0x10) >> 4);
    set_output(RAM_D5, (data & 0x20) >> 5);
    set_output(RAM_D6, (data & 0x40) >> 6);
    set_output(RAM_D7, (data & 0x80) >> 7);
}

void setRAMIOPinsToRead() {
    set_direction(RAM_D0, 0);
    set_direction(RAM_D1, 0);
    set_direction(RAM_D2, 0);
    set_direction(RAM_D3, 0);
    set_direction(RAM_D4, 0);
    set_direction(RAM_D5, 0);
    set_direction(RAM_D6, 0);
    set_direction(RAM_D7, 0);
}

uint8_t readRAMIOPins() {
    return (input(RAM_D7) << 7) | (input(RAM_D6) << 6) |
           (input(RAM_D5) << 5) | (input(RAM_D4) << 4) |
           (input(RAM_D3) << 3) | (input(RAM_D2) << 2) |
           (input(RAM_D1) << 1) | input(RAM_D0);
}
Comments
It appears that you are writing one address bit at a time. Why not write all the address bits with one instruction?
Likewise with the data in and out. Why one bit at a time? Why not write all 8 bits with one instruction?
Code looks best inside a code block (change from quote to code):
There are several ways to improve your memory access performance.
The first way is to move to parallel instead of bitwise access.
The second way is to use inline PASM code for your reads/writes instead of doing it with SPIN2 or C code.
The third way is to make use of certain P2 capabilities and instructions such as using MUXQ and setting a group of bits from a base pin to the same value.
The P2 has the ability to change multiple bits in parallel with a single instruction when you access the "outa" or "outb" registers. There are other useful instructions that can manipulate individual bits or bytes in 32 bit quantities as well. There is no need to keep everything bit oriented.
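To make the "whole group of bits at once" idea concrete, here is a minimal sketch in plain C that runs on a host PC. It models a masked update of a 32-bit port register, which is roughly what a SETQ + MUXQ pair does on the P2 in one step. The function name write_field and its parameters are made up for this illustration and are not part of any P2 library.

```c
#include <stdint.h>

/* Replace the `width` bits starting at bit `base` of *reg with `value`,
   leaving every other bit untouched. `reg` stands in for a port register
   such as OUTA; this is a host-side model, not real pin access. */
static inline void write_field(uint32_t *reg, unsigned base, unsigned width,
                               uint32_t value)
{
    uint32_t ones = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    uint32_t mask = ones << base;
    *reg = (*reg & ~mask) | ((value << base) & mask);
}
```

One call such as write_field(&reg, 8, 21, addr) replaces the 21 separate set_output calls in setAddrPins, provided the address pins are contiguous.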
My recommendation is to read up about IO pin control instructions (drvh/drvl/flth/fltl etc) and the dira/dirb, outa/outb registers in the P2 documentation. Once you've figured that out your memory access routines then can make good use of the parallel IO capabilities of the P2.
Another thing you should really do is to combine the address and data pins of your memory into contiguous groups of bits instead of scattering them individually. For example, if you have 21 address bits, 8 data bits and 3 control pins (conveniently already 32 bits total) and want to get close to an optimal high speed byte banging memory access routine, you can combine them into the same port like this (you'll see why in a moment). In this case we use port A, which is the lower 32 I/O pins:
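As one possible illustration of such a grouping, the sketch below assumes a hypothetical layout: data on P0..P7, address on P8..P28, and active-low CS, OE and WE on P29..P31. These pin numbers are an assumption for the example, not necessarily the assignment the post had in mind, and the code is plain host-side C rather than P2 pin access.

```c
#include <stdint.h>

/* Hypothetical port A layout (assumed for illustration only):
   P0..P7   data bus D0..D7
   P8..P28  address bus A0..A20
   P29 = CS, P30 = OE, P31 = WE (all active low) */
enum {
    DATA_BASE = 0, DATA_BITS = 8,
    ADDR_BASE = 8, ADDR_BITS = 21,
    PIN_CS = 29, PIN_OE = 30, PIN_WE = 31
};

#define DATA_MASK (((1u << DATA_BITS) - 1u) << DATA_BASE)
#define ADDR_MASK (((1u << ADDR_BITS) - 1u) << ADDR_BASE)
#define CTRL_MASK ((1u << PIN_CS) | (1u << PIN_OE) | (1u << PIN_WE))

/* Build the 32-bit port image for an address/data pair with all three
   control lines idle (high). */
static inline uint32_t port_image(uint32_t addr, uint8_t data)
{
    return CTRL_MASK
         | ((addr << ADDR_BASE) & ADDR_MASK)
         | ((uint32_t)data << DATA_BASE);
}
```

With the three fields covering all 32 bits, the whole bus state fits in one register-sized value, which is what makes the single-instruction updates possible.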
You would first need to initialize the P2 output pin directions and control pin levels to a default state once somewhere to begin with using something like this... (I've hard coded it for the above pins but you can use other constant expressions):
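A rough sketch of what that initial state could look like, again as a host-side C model under the same assumed layout (data P0..P7, address P8..P28, CS/OE/WE on P29..P31). The port_state struct and ram_port_init are hypothetical names; on real hardware the two values would be written to OUTA and DIRA, with the output levels set before the directions so the control lines come up inactive.

```c
#include <stdint.h>

/* Masks for the assumed layout (illustrative, not from the original post) */
#define DATA_MASK 0x000000FFu   /* P0..P7   */
#define ADDR_MASK 0x1FFFFF00u   /* P8..P28  */
#define CTRL_MASK 0xE0000000u   /* P29..P31 */

/* Host-side model of the two port A registers */
typedef struct { uint32_t dira, outa; } port_state;

static inline port_state ram_port_init(void)
{
    port_state p;
    p.outa = CTRL_MASK;              /* CS, OE, WE high: RAM deselected   */
    p.dira = ADDR_MASK | CTRL_MASK;  /* drive address and control pins,   */
                                     /* leave the data bus floating       */
    return p;
}
```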
Inlined memory reads can then become something like this snippet (I've read back the result into some arbitrary "data" register that you would also need, but you could just as well read this back into your "addr" parameter and return that instead if you don't mind trashing it):
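The read sequence can be modelled in plain C as follows. This is only a sketch under the assumed layout (data P0..P7, address P8..P28, CS P29, OE P30, WE P31), with hypothetical function names; real code would do the same steps with inline PASM on OUTA and INA and would need a short wait between selecting the RAM and sampling the pins.

```c
#include <stdint.h>

#define ADDR_BASE 8
#define ADDR_MASK 0x1FFFFF00u
#define CS_BIT    (1u << 29)
#define OE_BIT    (1u << 30)
#define WE_BIT    (1u << 31)

/* OUTA image during a read: new address, CS and OE low, WE high.
   Bits outside the address and control fields are preserved. */
static inline uint32_t read_cycle_outa(uint32_t outa, uint32_t addr)
{
    outa = (outa & ~ADDR_MASK) | ((addr << ADDR_BASE) & ADDR_MASK);
    outa &= ~(CS_BIT | OE_BIT);
    outa |= WE_BIT;
    return outa;
}

/* With the data bus on P0..P7, the byte is simply the low 8 bits of the
   sampled input register. */
static inline uint8_t read_cycle_data(uint32_t ina)
{
    return (uint8_t)(ina & 0xFFu);
}
```

Placing the data bus at the bottom of the port means the result needs no shifting, only a byte extract.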
The inlined memory writes can then become something like this snippet:
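A write cycle can be sketched the same way, as a host-side C model that lists the port states in order. It assumes the layout used for illustration here (data P0..P7, address P8..P28, CS P29, OE P30, WE P31); port_state and write_cycle are hypothetical names, and the number of wait clocks between steps depends on the RAM and the P2 clock.

```c
#include <stdint.h>

#define DATA_MASK 0x000000FFu
#define ADDR_MASK 0x1FFFFF00u
#define CS_BIT    (1u << 29)
#define OE_BIT    (1u << 30)
#define WE_BIT    (1u << 31)

/* Host-side model of the two port A registers */
typedef struct { uint32_t dira, outa; } port_state;

/* Fill steps[0..2] with the successive port states of one write cycle. */
static inline void write_cycle(port_state steps[3], uint32_t addr, uint8_t data)
{
    uint32_t image = ((addr << 8) & ADDR_MASK) | data;
    uint32_t drive_all = ADDR_MASK | DATA_MASK | CS_BIT | OE_BIT | WE_BIT;

    /* 1: address and data valid, data pins driven, CS and WE low, OE high */
    steps[0].dira = drive_all;
    steps[0].outa = image | OE_BIT;
    /* 2: WE rises, which ends the write on a typical asynchronous SRAM */
    steps[1].dira = drive_all;
    steps[1].outa = image | OE_BIT | WE_BIT;
    /* 3: deselect the RAM and float the data bus again */
    steps[2].dira = drive_all & ~DATA_MASK;
    steps[2].outa = image | OE_BIT | WE_BIT | CS_BIT;
}
```

Because address and data share one port, step 1 is a single register write for all 29 bus lines, and the last step returns the data bus to the floating idle state that keeps reads cheap.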
With this code your memory reads should be able to execute in ~10 clocks plus any read access time delays for your memory based on the P2 clock frequency and its internal delays (you will need to experiment a bit there). For writes this is ~16 clocks plus any extra write delays you need to put in for your RAM's timing and the P2 frequency. By keeping the data bus floating while idle it's optimized for reads.
If you are running at 160 MHz you can speed up your code to something like 110-120 ns per write instead of 2.6 microseconds (a speedup of roughly 22x to 24x), assuming the memory can be accessed that fast (e.g. using 10 ns memory).
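The arithmetic behind those figures is easy to check. The helper names below are made up for this sketch: 16 clocks at 160 MHz is 100 ns, so a few extra wait clocks give the 110-120 ns range, and at about 115 ns per byte a 2 MB transfer takes roughly 0.24 s, compared with about 5.5 s at 2.6 microseconds per byte.

```c
#include <stdint.h>

/* Nanoseconds taken by `clocks` system clocks at a clock rate of `mhz` MHz */
static inline double clocks_to_ns(unsigned clocks, double mhz)
{
    return clocks * 1000.0 / mhz;
}

/* Seconds needed to move `bytes` bytes at `ns_per_byte` nanoseconds each */
static inline double transfer_seconds(uint32_t bytes, double ns_per_byte)
{
    return bytes * ns_per_byte * 1e-9;
}
```

For example, transfer_seconds(2u * 1024u * 1024u, clocks_to_ns(16, 160.0)) gives the best-case time for 2 MB before any RAM wait states are added.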
Hope this helps....
For setting the system clock, look at cdemo.c in the samples folder of FlexProp.
And yes, you should use the 3.3V version not the 1.8V version of this RAM.
Andy