Fast reading/writing of many I/O pins at once?

z80jon · 2020-12-17 22:27

I've written some C code for the Propeller 1, but my first C project on the P2 (using FlexProp) is quite a big one--I'm trying to write 1-2MB to an external parallel RAM chip in a fraction of a second, ideally. (Sidenote: I'm not particularly familiar with P2 ASM or Spin).

As it stands now, my code to write a single byte takes ~2.6 microseconds to run, which is far slower than I'd like. I know that at 160 MHz, each instruction takes ~12.5 ns to execute. Given that I'm trying to change the values of 8 data lines and 21 address lines and do a few other things, I'm not sure why it's taking this long, and I'm especially unsure how to speed things up.

I know that the P2 is stable in excess of 200 MHz, but my efforts to increase the clock speed have thus far not worked.

The chip I'm using is an IS61WV20488FALL, which certainly has the speeds to handle it.

Right now, I'm not using sequential pins for address/data, and oddly enough from my experiments in using set_outputs and 'pretending' I had them sequentially lined up for code benchmarking purposes, I didn't see any notable difference in timing.

I guess what I'd like help with is this:
- How can I increase the clock speed of the P2? (I've done a bit of playing around with clkset, but haven't gotten anything to go faster than the default speed it has when flashed by FlexProp.)
- How could my code be optimized further?

#include "ram.h"
#include "globalConfig.h"
#include "simpletools.h"
#include "propeller.h"
#define RAM_DEBUG 1

void writeByte(uint32_t addr, uint8_t data) {

    //ready address and set the system to WRITE...
    setAddrPins(addr);
    high(RAM_OE);
    setRAMIOPins(data);
    low(RAM_WE);
    low(RAM_CS);

    //brief delay to let the RAM chip read the lines
    __asm {
        nop
    };
    high(RAM_WE);

}

uint8_t readByte(uint32_t addr) {
    setRAMIOPinsToRead();
    setAddrPins(addr);
    high(RAM_WE);
    low(RAM_OE);
    low(RAM_CS);

    //brief delay to give the chip time to propagate outputs
    __asm {
        nop
    };

    return readRAMIOPins();
}

void setAddrPins(uint32_t addr) {
    set_output(RAM_A0, addr & 0x01);
    set_output(RAM_A1, (addr & 0x02) >> 1);
    set_output(RAM_A2, (addr & 0x04) >> 2);
    set_output(RAM_A3, (addr & 0x08) >> 3);
    set_output(RAM_A4, (addr & 0x010) >> 4);
    set_output(RAM_A5, (addr & 0x020) >> 5);
    set_output(RAM_A6, (addr & 0x040) >> 6);
    set_output(RAM_A7, (addr & 0x080) >> 7);
    set_output(RAM_A8, (addr & 0x0100) >> 8);
    set_output(RAM_A9, (addr & 0x0200) >> 9);
    set_output(RAM_A10, (addr & 0x0400) >> 10);
    set_output(RAM_A11, (addr & 0x0800) >> 11);
    set_output(RAM_A12, (addr & 0x01000) >> 12);
    set_output(RAM_A13, (addr & 0x02000) >> 13);
    set_output(RAM_A14, (addr & 0x04000) >> 14);
    set_output(RAM_A15, (addr & 0x08000) >> 15);
    set_output(RAM_A16, (addr & 0x010000) >> 16);
    set_output(RAM_A17, (addr & 0x020000) >> 17);
    set_output(RAM_A18, (addr & 0x040000) >> 18);
    set_output(RAM_A19, (addr & 0x080000) >> 19);
    set_output(RAM_A20, (addr & 0x0100000) >> 20);
}

void setRAMIOPins(uint8_t data) {
    set_output(RAM_D0, data & 0x01);
    set_output(RAM_D1, (data & 0x02) >> 1);
    set_output(RAM_D2, (data & 0x04) >> 2);
    set_output(RAM_D3, (data & 0x08) >> 3);
    set_output(RAM_D4, (data & 0x10) >> 4);
    set_output(RAM_D5, (data & 0x20) >> 5);
    set_output(RAM_D6, (data & 0x40) >> 6);
    set_output(RAM_D7, (data & 0x80) >> 7);
}

void setRAMIOPinsToRead() {
    set_direction(RAM_D0, 0);
    set_direction(RAM_D1, 0);
    set_direction(RAM_D2, 0);
    set_direction(RAM_D3, 0);
    set_direction(RAM_D4, 0);
    set_direction(RAM_D5, 0);
    set_direction(RAM_D6, 0);
    set_direction(RAM_D7, 0);
}

uint8_t readRAMIOPins() {
    return (input(RAM_D7) << 7) | (input(RAM_D6) << 6)
         | (input(RAM_D5) << 5) | (input(RAM_D4) << 4)
         | (input(RAM_D3) << 3) | (input(RAM_D2) << 2)
         | (input(RAM_D1) << 1) | input(RAM_D0);
}

DaveJenson · 2020-12-17 22:42

I for one am a bit confused.
It appears that you are writing one address bit at a time. Why not write all the address bits with one instruction?

Likewise with the data in and out. Why one bit at a time? Why not write all 8 bits with one instruction?
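
To make the difference concrete, here is a small host-side sketch in plain C (it does not touch real P2 registers; `out_reg` is just a variable standing in for an output register). It assumes the 21 address lines sit on contiguous pins starting at pin 0, which is not how the original wiring is laid out. The names `addr_bitwise` and `addr_parallel` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: address lines occupy 21 contiguous pins starting at pin 0. */
#define ADDR_BITS 21
#define ADDR_MASK ((1u << ADDR_BITS) - 1u)   /* 0x001FFFFF */

/* One bit at a time: 21 separate read-modify-write steps on the
   simulated output register, similar in spirit to 21 set_output() calls. */
uint32_t addr_bitwise(uint32_t out_reg, uint32_t addr) {
    for (int bit = 0; bit < ADDR_BITS; bit++) {
        uint32_t m = 1u << bit;
        out_reg = (addr & m) ? (out_reg | m) : (out_reg & ~m);
    }
    return out_reg;
}

/* All address bits at once: a single masked update of the register. */
uint32_t addr_parallel(uint32_t out_reg, uint32_t addr) {
    assert((addr & ~ADDR_MASK) == 0);   /* address must fit in 21 bits */
    return (out_reg & ~ADDR_MASK) | addr;
}
```

Both functions leave the upper 11 bits of the register alone and produce the same result; the second one simply gets there in one step instead of 21.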

TonyB_ · 2020-12-17 23:08

Welcome to the forum z80jon.

Code looks best inside a code block (change from quote to code):

#include "ram.h"
#include "globalConfig.h"
#include "simpletools.h"
#include "propeller.h"
#define RAM_DEBUG 1

void writeByte(uint32_t addr, uint8_t data) {

    //ready address and set the system to WRITE...
    setAddrPins(addr);
    high(RAM_OE);
    setRAMIOPins(data);
    low(RAM_WE);
    low(RAM_CS);
...

rogloh · 2020-12-18 01:45

Hello z80jon,

There are several ways to improve your memory access performance.
The first way is to move to parallel instead of bitwise access.
The second way is to use inline PASM code for your reads/writes instead of doing it with SPIN2 or C code.
The third way is to make use of certain P2 capabilities and instructions such as using MUXQ and setting a group of bits from a base pin to the same value.

The P2 has the ability to change multiple bits in parallel with a single instruction when you access the "outa" or "outb" registers. There are other useful instructions that can manipulate individual bits or bytes in 32 bit quantities as well. There is no need to keep everything bit oriented.

My recommendation is to read up about IO pin control instructions (drvh/drvl/flth/fltl etc) and the dira/dirb, outa/outb registers in the P2 documentation. Once you've figured that out your memory access routines then can make good use of the parallel IO capabilities of the P2.

Another thing you should really do is group the address and data pins of your memory into contiguous runs of pins rather than scattering them individually. For example, if you have 21 address bits, 8 data bits and 3 control pins (conveniently 32 bits in total) and want to get close to an optimal high-speed byte-banging memory access routine, you can put them all on the same port like this (you'll see why in a moment). In this case we use port A, which is the lower 32 IO pins:

P0-P20 = Address0-Address20 (actual address bus pin order within this group doesn't matter with basic static RAM)
P21 = read
P22 = chip select 
P23 = write
P24-P31 = Data0-Data7 (actual data bus bit order within this group doesn't matter with basic static RAM)

You would first need to initialize the P2 pin directions and control pin levels to a default state once, somewhere at startup, using something like this (I've hard-coded it for the above pins, but you can use other constant expressions):

mov outa, ##$00E00000 ' control pins should idle high (bits 21,22,23)
mov dira, ##$00FFFFFF ' leave data bus floating, all other control and address bits are output pins
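
For anyone who would rather derive those two constants than hard-code them, here is a host-side C sketch that builds them from the pin numbers in the layout above. The helper `pin_mask` and the functions `idle_outa` / `idle_dira` are hypothetical names, not part of any P2 library.

```c
#include <assert.h>
#include <stdint.h>

/* Pin assignments taken from the layout above. */
enum {
    ADDR_BASE = 0,  ADDR_COUNT = 21,   /* P0-P20  address            */
    CTRL_BASE = 21, CTRL_COUNT = 3,    /* P21-P23 read, CS, write    */
    DATA_BASE = 24, DATA_COUNT = 8     /* P24-P31 data               */
};

/* Mask with `count` consecutive bits set, starting at bit `base`. */
uint32_t pin_mask(unsigned base, unsigned count) {
    assert(count >= 1 && base + count <= 32);
    uint32_t ones = (count == 32) ? 0xFFFFFFFFu : ((1u << count) - 1u);
    return ones << base;
}

/* Idle output levels: the three control pins high, everything else low. */
uint32_t idle_outa(void) {
    return pin_mask(CTRL_BASE, CTRL_COUNT);
}

/* Idle directions: address and control pins are outputs, data bus floats. */
uint32_t idle_dira(void) {
    return pin_mask(ADDR_BASE, ADDR_COUNT) | pin_mask(CTRL_BASE, CTRL_COUNT);
}
```

With the layout above, `idle_outa()` comes out as $00E00000 and `idle_dira()` as $00FFFFFF, matching the two literals.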

Inlined memory reads can then become something like this snippet (I've read back the result into some arbitrary "data" register that you would also need, but you could just as well read it back into your "addr" parameter and return that instead if you don't mind trashing it):

setq ##$001FFFFF  ' setup the mask to only affect the address bits
muxq outa, addr ' only modify the masked port output bits with the new address value
drvl #21+(1<<6) ' drive both RD and CS pins low (1<<6 means one extra pin above the nominated pin gets the same treatment)
{ optional waitx #delay here if needed for reads, where delay is 0..n for a 2..n+2 clock delay }
getbyte data, ina, #3 ' get byte #3 (bits 24..31) of the ina long, which is the data bus value
drvh #21+(1<<6) ' drive both RD and CS pins high
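
If the PASM is unfamiliar, the following host-side C model may help show what the read sequence does to the port bits. It is only a sketch: `muxq`, `getbyte` and `sim_read` are ordinary C functions written for this illustration, `outa` is a plain variable, and `ram_byte` stands in for whatever the RAM chip would drive onto P24-P31. The model relies on MUXQ copying source bits into the destination only where the SETQ mask is 1, and on GETBYTE extracting byte N of a long.

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_MASK  0x001FFFFFu   /* P0-P20                     */
#define RD_CS_MASK 0x00600000u   /* P21 (read) and P22 (CS)    */

/* Model of SETQ + MUXQ: take bits from s where q is 1, keep d elsewhere. */
uint32_t muxq(uint32_t d, uint32_t s, uint32_t q) {
    return (d & ~q) | (s & q);
}

/* Model of GETBYTE: byte n (0..3) of s, zero-extended. */
uint32_t getbyte(uint32_t s, unsigned n) {
    assert(n < 4);
    return (s >> (8u * n)) & 0xFFu;
}

/* Simulated read cycle on a fake port A. */
uint8_t sim_read(uint32_t *outa, uint32_t addr, uint8_t ram_byte) {
    *outa = muxq(*outa, addr, ADDR_MASK);   /* setq / muxq outa, addr   */
    *outa &= ~RD_CS_MASK;                   /* drvl #21+(1<<6)          */
    /* Pins read back: our own outputs plus the byte the RAM drives. */
    uint32_t ina = (*outa & 0x00FFFFFFu) | ((uint32_t)ram_byte << 24);
    uint8_t data = (uint8_t)getbyte(ina, 3);/* getbyte data, ina, #3    */
    *outa |= RD_CS_MASK;                    /* drvh #21+(1<<6)          */
    return data;
}
```

Starting from the idle value $00E00000, a read of address $0ABCDE leaves the port at $00EABCDE afterwards: the address bits have changed and the control pins are back high.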

The inlined memory writes can then become something like this snippet:

setbyte outa, data, #3 ' setup the data bus value
dirh #24+(7<<6) ' setup the data pins as outputs (base data pin plus 7 pins above it)
setq ##$001FFFFF ' setup the mask to affect the address bits
muxq outa, addr ' modify the address bits of the output port using the mask
drvl #22+(1<<6) ' drive both CS and WR pins low
{ optional waitx #delay here if needed for writes, where delay is 0..n for a 2..n+2 clock delay }
drvh #22+(1<<6) ' drive both CS and WR pins high
fltl #24+(7<<6) ' float the data bus (base data pin plus 7 pins above it)
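
The `#pin+(n<<6)` operands in the snippets above pack a base pin and a count of additional pins into one value. As a rough illustration, this host-side C sketch decodes such an operand into a bit mask for port A. The function name `pinfield_mask` is invented here, and the sketch deliberately rejects groups that would run past P31 rather than modelling what the hardware does in that case.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a pin-group operand: bits 5..0 = base pin, bits 10..6 = number
   of additional pins above the base. Returns the matching port A mask. */
uint32_t pinfield_mask(unsigned field) {
    unsigned base  = field & 0x3Fu;
    unsigned extra = (field >> 6) & 0x1Fu;
    assert(base + extra < 32);   /* sketch covers port A only, no wrap */
    uint32_t ones = (extra == 31) ? 0xFFFFFFFFu
                                  : ((1u << (extra + 1u)) - 1u);
    return ones << base;
}
```

So `#21+(1<<6)` covers P21 and P22, `#22+(1<<6)` covers P22 and P23, and `#24+(7<<6)` covers the whole data bus P24-P31.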

With this code your memory reads should be able to execute in ~10 clocks plus any read access time delays for your memory based on the P2 clock frequency and its internal delays (you will need to experiment a bit there). For writes this is ~16 clocks plus any extra write delays you need to put in for your RAM's timing and the P2 frequency. By keeping the data bus floating while idle it's optimized for reads.

If you are running at 160MHz you can speed up your code to something like 110-120ns per write instead of 2.6 microseconds (a speedup of roughly 22x), assuming the memory can be accessed that fast (e.g. using 10ns memory).
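
The arithmetic behind these figures can be checked with a couple of helper functions. This is plain host-side C with made-up names (`clocks_to_ns`, `bulk_seconds`); it only converts clock counts to time and does not measure anything on real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of `clocks` system clocks at `clk_hz`, in nanoseconds. */
double clocks_to_ns(unsigned clocks, double clk_hz) {
    assert(clk_hz > 0.0);
    return (double)clocks * 1e9 / clk_hz;
}

/* Time to transfer `bytes` bytes at `ns_per_byte` each, in seconds. */
double bulk_seconds(uint32_t bytes, double ns_per_byte) {
    return (double)bytes * ns_per_byte * 1e-9;
}
```

At 160MHz, `clocks_to_ns(16, 160e6)` gives 100 ns for the bare write sequence. For a full 2MB, `bulk_seconds(2u * 1024u * 1024u, 2600.0)` is about 5.5 seconds at the original 2.6 microseconds per byte, while at 120 ns per byte it drops to roughly 0.25 seconds.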

Hope this helps....

Cluso99 · 2020-12-18 02:11

You should also be aware that there are delays within the silicon for outputs to appear at the pins, and for inputs to reach registers (these reads actually precede the read instruction, as the pins are clocked into the silicon on every clock, so they are in the pipeline). IIRC there is a section with timing now in the docs. If not, there is a thread that I did quite some time ago - I would need to search for it tho.

JRoark · 2020-12-18 15:07

As a point of clarity: are you using the IS61WV20488FALL or IS61WV20488FBLL?

Ariba · 2020-12-18 15:44

Here are some snippets that use simpletools functions: (for sure pins must be contiguous for address and also for data, but the order inside the group is not important)

#define  ADDRLO  0      //pins
#define  ADDRHI  20
#define  DATALO  24
#define  DATAHI  31
#define  OE      21
#define  CS      22
#define  WR      23

  //init
  _pinh(CS);    //disable all control lines
  _pinh(OE);
  _pinh(WR);
  set_directions(ADDRHI,ADDRLO, 0x1FFFFF);  //Data=inp, addr=output (21 address bits)

  //write byte
  set_directions(DATAHI,DATALO, 0xFF);      //data = outputs
  set_outputs(ADDRHI,ADDRLO, myaddr);
  set_outputs(DATAHI,DATALO, mydata);
  _pinl(CS);
  _pinl(WR);
  _pinl(WR);    //a short delay, no pin change (write goes here)
  _pinh(WR);
  _pinh(CS);

  //read byte
  set_directions(DATAHI,DATALO, 0x00);      //data = inputs
  set_outputs(ADDRHI,ADDRLO, myaddr);
  _pinl(CS);
  _pinl(OE);
  mydata = get_states(DATAHI,DATALO);       //read byte
  _pinh(OE);
  _pinh(CS);

For sysclock setting look at cdemo.c in the samples folder of FlexProp

And yes, you should use the 3.3V version not the 1.8V version of this RAM.

Andy
