Memory expansion for P1

thej · 2017-12-30 03:20

I've been digging through the forum for memory expansion options for the P1 but have not been finding much that is still available.

What P1 boards with extra RAM or RAM boards that can expand an existing P1 board are available?

Thanks!
J

David Betz · 2017-12-30 03:23

How about this?

http://mghdesigns.com/propeller/dna.html

You can plug a SPI SRAM into one of the sockets.

Peter Jakacki · 2017-12-30 03:25

Since the P1 does not have an external bus and has only 32 I/O in total, there is no easy or efficient way to expand memory, even though it has been done. My memory expansion option involves getting more out of hub RAM using Tachyon and using the second 32k of the EEPROM for dictionary and binary image "ROMs". No hardware involved, no pins sacrificed, and no slow, messy external memory addressing schemes either.

David Betz · 2017-12-30 03:29

I guess another option would be to use an FPGA with P1v and more than 32k of hub RAM. Tachyon sounds more practical though! :-)

thej · 2017-12-30 03:49

I just purchased the ASC+ board last week. Still waiting for it to arrive.
Since it is "Arduino compatible", I guess I should be able to use an Arduino RAM shield.
Any reasons why that might not work?

I had found the 32MB expansion board made by Jazzed but the site linked in the forum no longer works.
http://forums.parallax.com/discussion/127044

thej · 2017-12-30 04:26

I will be using Tachyon for all my coding on the P1.
However, I will need a lot more than 32K for my various "planned" hijinks ...er... projects.

I'm currently looking at this arduino board.
https://www.tindie.com/products/FemtoCow/spi-ram-shield-for-arduino/
It has up to 512KB on board.

Peter, how would accessing this from tachyon work?
It is a SPI interface to the RAM board but does it need to use LMM or XMM ??
Is that in Tachyon? (still reading through the docs)

Cluso99 · 2017-12-30 10:38

I built the RamBlade (still available, link in signature) to add 512KB of SRAM. RamBlade uses all pins except 2, which can be used for serial, or for 1-pin mono video and a 1-pin PS/2 keyboard. SD is also supported, as is my Prop OS in this config.

I use this as the base in one of my projects that runs a large database and big C (Catalina C) program.

Not sure what mileage you will get from an Arduino RAM add-on board. There are hardware considerations when mixing Arduino and Propeller, as the Prop runs on 3V3. SPI RAM has speed restrictions too.

Peter Jakacki · 2017-12-30 10:40

@thej - I have not yet run so short of code memory that I would be desperate enough to implement a form of XMM. I run the SD card interface at 10MHz, and even the SPI software routines run at about half that. Even so, the best you are ever going to do, assuming you have a cog dedicated to handling this, is about 1us for every byte. And if you have to jump, it slows down again while you wait. This might not sound too bad if you are used to XMM, but what do you need the extra memory for? Is it code or data or what? Do you need to write often, or can you just use SD memory? In Tachyon you can address the card or a file up to 4GB as virtual memory, so I guess it would be possible for me to execute code at the virtual memory layer with just a few lines of code.

BTW, Tachyon does have an LMM instruction, but that is mainly to allow it to run PASM code from hub. The only reason I need to do that is that hub operations such as coginit and clkset need to be executed from PASM but aren't so time-critical that they need to be in cog memory.
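As a rough check on the "about 1us for every byte" estimate: eight clocks at a 10MHz SPI clock take 0.8us, so roughly 1us per byte once some loop overhead is added. A minimal Python sketch of that arithmetic follows; the function name and the 25% overhead figure are assumptions picked for illustration, not measurements from Tachyon.

```python
def byte_time_us(spi_clock_hz, overhead_fraction=0.0):
    """Time to move one byte over single-bit SPI, in microseconds.

    overhead_fraction is a hypothetical allowance for loop and hub
    overhead; it is an assumption, not a measured value.
    """
    bits_per_byte = 8
    raw_us = bits_per_byte / spi_clock_hz * 1e6
    return raw_us * (1.0 + overhead_fraction)


if __name__ == "__main__":
    print(byte_time_us(10_000_000))        # raw shift time at 10 MHz
    print(byte_time_us(10_000_000, 0.25))  # with an assumed 25% overhead
```

Any address change (a jump, in the XMM case) adds a new command and address phase on top of this, which is why random access comes out slower than the sequential figure.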

jmg · 2018-01-01 01:50

thej wrote: »

Memory expansion for P1...
I will be using Tachyon for all my coding on the P1.
However, I will need a lot more than 32K for my various "planned" hijinks ...er... projects.

What does 'a lot more' mean in numbers ?
What speed do you need, and what memory types ?

Microchip and OnSemi have 128k Byte SPI Static SRAM parts.

There are HyperRAM and OctaRAM parts, if you can work with SDRAM refresh rules.

Then, FRAM can cover both RAM and FLASH, tho at a price premium. I see 4Mbit FRAMs are shipping.

Or, there is ReRAM, (MB85AS4MT) - cheaper than FRAM, 1.2M cycles, with a 256 byte page write, write time seems bit-changed proportional ? Not quad SPI & modest MHz, but may work well with P1 ?

Or MRAM, also showing 4MBits, is close to FRAM in having 'instant' writes.

Serial FLASH is the lowest cost per-bit, but it can have larger erase times, so it's less 'RAM replacement' useful.

mikeologist · 2018-01-01 05:44

jmg wrote: »

thej wrote: »

Memory expansion for P1...
I will be using Tachyon for all my coding on the P1.
However, I will need a lot more than 32K for my various "planned" hijinks ...er... projects.

What does 'a lot more' mean in numbers ?
What speed do you need, and what memory types ?

Microchip and OnSemi have 128k Byte SPI Static SRAM parts.

There are HyperRAM and OctaRAM parts, if you can work with SDRAM refresh rules.

Then, FRAM can cover both RAM and FLASH, tho at a price premium. I see 4Mbit FRAMs are shipping.

Or, there is ReRAM, (MB85AS4MT) - cheaper than FRAM, 1.2M cycles, with a 256 byte page write, write time seems bit-changed proportional ? Not quad SPI & modest MHz, but may work well with P1 ?

Or MRAM, also showing 4MBits, is close to FRAM in having 'instant' writes.

Serial FLASH is the lowest cost per-bit, but it can have larger erase times, so it's less 'RAM replacement' useful.

What is the max random byte reads per second that can one reasonably achieve with external RAM?

I'm just getting started with some parallel SRAM and I'm getting 14,500 random byte reads or writes per second and a little over 20,000 sequential byte reads or writes per second, but it's just a 70ns chip and I've just been working in Spin so far.

Peter Jakacki · 2018-01-01 06:16

Parallel RAM is impractical as it consumes much of the I/O, and interfacing in Spin is obviously too slow. An 8-bit array of 23LCV1024 1Mb parts would only need 10 I/O, but a single device would work for 128KB and only 4 I/O. With my routines I would get around 1MB/sec sequential, but that would slow down to under 500kB/sec for random reads. Still good, but what do you really need it for? Is there a better way? I'd rather wait for P2 and do it properly than kludge the P1.

mikeologist · 2018-01-01 06:30

Peter Jakacki wrote: »

Parallel RAM is impractical as it consumes much of the I/O, and interfacing in Spin is obviously too slow. An 8-bit array of 23LCV1024 1Mb parts would only need 10 I/O, but a single device would work for 128KB and only 4 I/O. With my routines I would get around 1MB/sec sequential, but that would slow down to under 500kB/sec for random reads. Still good, but what do you really need it for? Is there a better way? I'd rather wait for P2 and do it properly than kludge the P1.

My goal with parallel memory interface would be to explore sequencing other cogs to read the data at the same time as one cog writes it to RAM by adding a few id pins too. At this point it is academic and I can't get a P2 cheaply. I recognize that there are faster ways and your speeds are impressive.

MIchael_Michalski · 2018-01-01 17:43

I'm thinking, if you really need the speed, once you're talking about external memory it makes sense to simply add a shift register along with an SPI memory device. Then you can run the SPI bus at 125MHz, the maximum frequency of the video generator. That comes out to about 4 million 32-bit words per second, or 16 million bytes per second. You'd only need an 8-bit serial-in, parallel-out device that can latch the output. Every 8 bits it would latch the output and you could read it any time before the next byte was ready. Of course there's a lot of overhead to the commands.

jmg · 2018-01-01 19:37

mikeologist wrote: »

What is the max random byte reads per second that can one reasonably achieve with external RAM?
I'm just getting started with some parallel SRAM and I'm getting 14,500 random byte reads or writes per second and a little over 20,000 sequential byte reads or writes per second, but it's just a 70ns chip and I've just been working in Spin so far.

You can certainly get much faster than that in ASM. The P1 has a 20MHz opcode rate, but can also generate a clock to 40MHz using the timers.
In ASM, the generic 4 opcodes of shift.clk.data.clk give a peak SPI clock of 5MHz
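
The 5MHz figure follows directly from the instruction rate: four instructions per bit at 20 million instructions per second. A small Python sketch of that arithmetic, assuming the usual 80MHz system clock and 4 clocks per ordinary instruction (the function name is made up for illustration):

```python
def bitbang_spi_clock_hz(instrs_per_bit, sysclk_hz=80_000_000, clocks_per_instr=4):
    """Peak SPI clock when every bit costs a fixed number of cog instructions."""
    instr_rate_hz = sysclk_hz / clocks_per_instr   # 20 MHz opcode rate at 80 MHz
    return instr_rate_hz / instrs_per_bit


if __name__ == "__main__":
    print(bitbang_spi_clock_hz(4))  # shift, clk, data, clk
    print(bitbang_spi_clock_hz(2))  # if a counter generates the clock for you
```

Letting a counter produce the clock removes the two clock-toggling instructions from the loop, which is where the 10MHz figures mentioned elsewhere in this thread come from.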

Depends on what you have already, and how much memory you need, and how you want to read it ?

some examples :

Simplest are QuadSPI SRAMs like 23LCV1024, N01S830 (above) - these have a sticky quad mode, so you can load CMD.ADDRESS with a 10 spi-clock overhead to first nibble out.
They are low cost, and easy to use - but will not quite hit video playback speeds. Their 20MHz spec is for 2.5V, so 3v3 can likely go higher.

Also simple to understand, are parallel SRAMs and helper logic.
You need to set address, so latches, counters, PLDs, FPGAs, or even 'use another P1' are all possible ways to wrap generic SRAM, to improve pin counts.
A cpld could let you latch address and auto-inc on R/W, for best burst bandwidth.

More compact and modern, but harder to start with, is a RAM intensive FPGA like Lattice ICE40UP5K-SG48
This would make a nifty P1 companion, & in a compact QFN48 you get 1Mbit SRAM and some useful 5280 Logic elements, roughly the same price as a P1
This can clock faster than a 23LCV1024, and give many more choices on ports, and widths.

mikeologist · 2018-01-01 23:56

jmg wrote: »

Also simple to understand, are parallel SRAMs and helper logic.
You need to set address, so latches, counters, PLDs, FPGAs, or even 'use another P1' are all possible ways to wrap generic SRAM, to improve pin counts.
A cpld could let you latch address and auto-inc on R/W, for best burst bandwidth.

That is my intention as I'd like to explore transmitting the same data to other cogs at the same time. I translated my Spin code to C last night and it came up to a respectable 160kB/s from my CY62256N. I've attached the C code if you can recommend improvements. If I can use that same transaction time to copy that data to 4 other cogs then I can eliminate many slower hub ram operations. I have several different options to play with on the way. I intended to start with the EFM8UB3. This is just to study the potential. If I get good results I'd like to move on to a 16bit, 10ns module and a wider 16bit controller aiming for the 4 to 8MB range.

MIchael_Michalski wrote: »

I'm thinking, if you really need the speed, once you're talking about external memory it makes sense to simply add a shift register along with an SPI memory device. Then you can run the SPI bus at 125MHz, the maximum frequency of the video generator. That comes out to about 4 million 32-bit words per second, or 16 million bytes per second. You'd only need an 8-bit serial-in, parallel-out device that can latch the output. Every 8 bits it would latch the output and you could read it any time before the next byte was ready. Of course there's a lot of overhead to the commands.

Now, a question that I have not asked yet. Can multiple cogs receive the same data from SPI RAM simultaneously? I got a deal on some 32Mb SPI chips (N25Q032A) to play with, but I think they cap out at about 108MHz.

@thej hope all this helps your decision process

Peter Jakacki · 2018-01-02 00:22

Some have said that you can use the video generator but where are the examples? The cog only runs at 20MIPs and with a few tricks we can use the counter and implement a 10MHz SPI interface. If we parallel load a shift register we would still have to read/write from hub for anything sustained. We could try to use the video shift register in 2 color mode but that is only good for shifting data out, no clocks, and more importantly, no data in etc. The video sub-system is not adaptable as an SPI interface except maybe for write-only memory

jmg · 2018-01-02 00:40

mikeologist wrote: »

Now, a question that I have not asked yet. Can multiple cogs receive the same data from SPI RAM simultaneously? I got a deal on some 32Mb SPI chips (N25Q032A) to play with, but I think they cap out at about 108MHz.

108MHz is well above what P1 can manage, so should not be a problem.

Sure, receive can be by many cogs, as they have no idea who else may be snooping.
However, SPI read is not totally read-only, you need to write the command and address, before the turnaround to data out.
That means only one COG can be in charge of command and address, & SPI_CLK, but many COGs could look at the data, if you wanted to.
Those slave parts would need to carefully sync to the SPI_CLK, or some frame pin.
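
To make the "read is not totally read-only" point concrete, here is a rough Python model of one read transaction. It assumes 23LC1024-style framing, i.e. a READ opcode of 0x03 followed by a 24-bit address in single-bit SPI mode; check the datasheet of the actual part before relying on it. The function names are hypothetical.

```python
READ_OPCODE = 0x03  # assumed: READ instruction of 23LC1024-style serial SRAMs


def read_header(address):
    """Bytes the controlling cog must shift out before any data comes back."""
    if not 0 <= address < (1 << 24):
        raise ValueError("address must fit in 24 bits")
    return bytes([READ_OPCODE,
                  (address >> 16) & 0xFF,
                  (address >> 8) & 0xFF,
                  address & 0xFF])


def clocks_for_read(n_bytes, address=0):
    """SPI clocks for one read transaction in single-bit mode."""
    return 8 * (len(read_header(address)) + n_bytes)


def bytes_per_second(spi_clock_hz, n_bytes):
    """Payload rate when every transaction fetches n_bytes."""
    return spi_clock_hz * n_bytes / clocks_for_read(n_bytes)


if __name__ == "__main__":
    print(read_header(0x012345).hex())
    print(clocks_for_read(1), clocks_for_read(256))
    print(bytes_per_second(10_000_000, 1), bytes_per_second(10_000_000, 256))
```

Only the cog that owns CS and SPI_CLK sends the header; listening cogs would skip those first 32 clocks and sample only the data phase, which is why they need a shared clock or frame reference.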

mikeologist · 2018-01-02 01:17

jmg wrote: »

108MHz is well above what P1 can manage, so should not be a problem.

Sure, receive can be by many cogs, as they have no idea who else may be snooping.
However, SPI read is not totally read-only, you need to write the command and address, before the turnaround to data out.
That means only one COG can be in charge of command and address, & SPI_CLK, but many COGs could look at the data, if you wanted to.
Those slave parts would need to carefully sync to the SPI_CLK, or some frame pin.

Very cool, sounds like both options are worth pursuing. I'll put the results in the Shrinking Standards thread. Just waiting for several slow boats from China.

MIchael_Michalski · 2018-01-02 09:27

Peter Jakacki wrote: »

Some have said that you can use the video generator but where are the examples? The cog only runs at 20MIPs and with a few tricks we can use the counter and implement a 10MHz SPI interface. If we parallel load a shift register we would still have to read/write from hub for anything sustained. We could try to use the video shift register in 2 color mode but that is only good for shifting data out, no clocks, and more importantly, no data in etc. The video sub-system is not adaptable as an SPI interface except maybe for write-only memory

You should be able to set any PLL mode to drive the video generator, including outputting the clock on an I/O pin. That way you don't have to output the clock using 2-bit mode but should have the real clock signal. Hub reads and writes won't be an insurmountable problem if you're using an external shift register, but you will have to slow things down a little. You may or may not need to transfer to hub RAM, depending on the application.

Let's assume an 80MHz SPI clock. It takes 8 cycles to transfer data to hub RAM once you're synced. Send the read command by using a waitvid. Wait for the command to complete, and disable the video generator output but leave the clock running. Then wait for your data to be ready. It probably makes sense to use a 16-bit shift register. Read the word: that's 4 system clocks. Then write to hub RAM: that's 8 more (if we got everything synced up before we started). Do a 4-cycle instruction, and we are at 16 clocks. Then read the new word, write it to hub RAM, execute another 4-cycle instruction, and repeat. Every 8 clocks, if your device supports that mode, the next byte is clocked out. You can use the extra instructions to monitor the loop iterations, do the loop itself, and end the loop when you have all your data. The timing should work.
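
The cycle budget in that paragraph can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming an 80MHz system clock and the instruction costs quoted above (4 clocks for an ordinary instruction, 8 for a hub write that is already in sync); the function names are invented for the example.

```python
def loop_clocks(instr_clocks):
    """Total system clocks for one pass through the receive loop."""
    return sum(instr_clocks)


def shift_register_fill_clocks(width_bits, spi_clock_hz, sysclk_hz=80_000_000):
    """System clocks it takes the external shift register to collect one word."""
    return width_bits * sysclk_hz / spi_clock_hz


if __name__ == "__main__":
    budget = loop_clocks([4, 8, 4])   # read pins, synced hub write, one spare instruction
    fill = shift_register_fill_clocks(16, 80_000_000)
    print(budget, fill, budget <= fill)
```

With a 16-bit register the loop and the fill time both come to 16 clocks, so there is no slack at all; an 8-bit register at the same SPI clock would fill in 8 clocks and overrun this loop.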

An interesting statistic: if you can receive a block of data at 125MHz, that's 3.9 million dwords/sec, 256ns/word (not counting overhead), or about 130us to fill up a cog's RAM. Since it takes about 100us to start a cog up and load its memory, this could be quite a useful way to reload and repurpose a cog in the middle of a program. A small routine in a cog's memory could load its new program about as fast as the cog could be reinitialized.
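
Those numbers can be reproduced with a short Python sketch. It assumes 32-bit longs, a 512-long cog RAM, an 80MHz system clock, and 16 clocks per long for a normal cog load (the last figure is an assumption chosen to match the roughly 100us quoted above); the function names are hypothetical.

```python
def long_time_ns(bit_rate_hz, bits_per_long=32):
    """Time to receive one 32-bit long at the given serial bit rate."""
    return bits_per_long / bit_rate_hz * 1e9


def cog_fill_time_us(bit_rate_hz, longs=512):
    """Time to stream a full cog image over the serial link, ignoring overhead."""
    return longs * long_time_ns(bit_rate_hz) / 1000.0


def coginit_load_time_us(sysclk_hz=80_000_000, longs=512, clocks_per_long=16):
    """Assumed time for the hub to load a cog at one long per hub window."""
    return longs * clocks_per_long / sysclk_hz * 1e6


if __name__ == "__main__":
    print(long_time_ns(125_000_000))
    print(cog_fill_time_us(125_000_000))
    print(coginit_load_time_us())
```

Command and address overhead on the SPI side is not included, so the real reload time would be somewhat longer than the streamed figure.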

SaucySoliton · 2018-01-03 06:23

Oh how I wish the Propeller had an input shift register... It may be possible to use the counters in feedback mode as an 80MHz shift register. You'll need 1 pin and 1 counter per bit. The counters invert their feedback, but that is easily fixed with one XOR instruction. A possible problem is the delay caused by routing the signals to pins and back.

What if the video generator transmits at 20Mbps and we receive with a sequence of MOVs? The sustained transfer rate would certainly be less than 10Mbps, but we could transfer the data and hand the bus off to another cog.

Related:
http://forums.parallax.com/discussion/164171/what-is-the-maximum-possible-throughput-of-spi-on-the-prop
http://forums.parallax.com/discussion/166850/boost-your-propeller-s-memory-256x-from-32kb-8mb

Wuerfel_21 · 2018-01-03 19:58

Send and receive at 20MHz (in bursts of 32 bits only) can be done using counters.
Sending is done by setting the CTR to NCO mode with a FRQ of zero, then just using shift instructions to shift all bits through. (The other CTR can be used to generate a clock.)
Receiving is a little more complex. I think you need to set the CTR up to count edges of the clock signal (which must be CLKFREQ/4 and generated from the same clock source) where the data is high. Then, by shifting the PHS register up every 4 cycles, the bits end up where they are supposed to be.

At least that's what I remember; I may be wrong.

EDIT:
The maximum speed at which data can be transferred in/out of the prop is 80 Megabit per second, by using P0 through P15 as a 16 bit parallel port and doing something like this:
(Moderators, how do you make code blocks again?)

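' caveat: the P1 manual describes INA as readable only as a source operand, so this may need a MOV into a temporary first
loop    wrword  ina,pointer     ' store the 16 pin bits to hub RAM at pointer
        add     pointer,#2      ' step the hub address by one word
        djnz    length,#loop    ' repeat until all words are stored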

(also works for 8bit and 32 bit, but the latter is rather pointless)
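
The 80 Megabit figure comes from fitting the whole loop into one 16-clock hub window. A rough Python model of that, assuming an 80MHz system clock, a hub slot every 16 clocks, a minimum of 8 clocks for a synced hub instruction and 4 clocks for the others (all names here are invented for the sketch):

```python
HUB_WINDOW = 16  # system clocks between a cog's hub access slots (assumed)


def loop_period_clocks(other_instr_clocks, hub_min_clocks=8):
    """Loop period, rounded up to the next whole hub window."""
    needed = hub_min_clocks + other_instr_clocks
    windows = -(-needed // HUB_WINDOW)  # ceiling division
    return windows * HUB_WINDOW


def throughput_mbit(width_bits, other_instr_clocks, sysclk_hz=80_000_000):
    """Bits moved per second, in Mbit/s, for one word per loop pass."""
    return width_bits * sysclk_hz / loop_period_clocks(other_instr_clocks) / 1e6


if __name__ == "__main__":
    print(throughput_mbit(16, 8))   # wrword + add + djnz
    print(throughput_mbit(16, 12))  # same loop with one extra 4-clock instruction
    print(throughput_mbit(8, 8))    # byte-wide port
```

Note how sensitive this is: one extra 4-clock instruction in the loop (for example, copying the pins into a temporary before the hub write) pushes it past the hub window and halves the rate in this model.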

A single cog can do about 10 Megabit using the 20 MHz shift tricks; if you manage to sync two up, the full bandwidth of 20 Megabit may be usable.
Note that an SD card has a limit of 25 MHz in SPI mode.

MIchael_Michalski · 2018-01-04 00:26

I've used a counter and a shift left to get data in from a counter pin. However, in a USB driver I've been playing with, I am trying to get hardware-assisted data input. I set up the cog to create a clock mask from the video generator (love that thing). The idea I've been toying with is to send data without waitvid to maintain tight timing. Instead I just make sure that my instruction timing is such that the right data is on the bus when the video generator fetches it (which I've had various degrees of success with, as I have to be able to execute the right instructions in the right sequence at the right times, and fit everything else in). It works like this:

Suppose your data rate is 10MHz. Set up the video generator to run at 80MHz. Send 00010000:00111100:00011000:00010000 out the video generator. You set it up so that the counter counts up when the A pin AND the B pin are both high. The data line is the B pin, and the output from the video generator is the A pin. Notice that the period of each data bit is 8 clocks, the same time it takes to shift a byte out of the video generator. The first bit is masked by the video generator most of the time, but right in the middle the video generator line goes high. If the data line is high as well, and if the timing is just right, you can catch that on the counter, so you get +1 added to the counter value. Similarly for the second bit: if the data line is high and the timing is just right, you can catch it on the counter. But the output from the video generator is wider, twice as wide actually, so you get two counts. So if the second bit is high, you get +2 added to the counter. The third bit works the same way, except that it's twice as wide again, so it gets four counts if the data line is high. That's three bits with just one instruction to send data to the video generator.

You then shift left three positions and send more data to the video generator, so you need two instructions every three bits to receive data. That leaves four instructions to play with. I didn't try to go 8 clocks wide, which would allow me to get 8 counts per bit and thus 4 bits at a time, because there needs to be enough slop in the system to receive data that isn't exactly timed right. If you were communicating synchronously, with something where the data you're receiving is clocked by a timing signal based on the same source the Propeller uses, it might work (for example, if you set up another of the Propeller's counters to provide it).
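
The weighting trick can be modelled in a few lines of Python. This is only a simulation of the idea under ideal timing: each data bit is held for 8 clocks, the mask windows are 1, 2 and 4 clocks wide, and the counter adds one for every clock where mask and data are both high. The mask layout and function name are assumptions made for the sketch.

```python
# Hypothetical 8-clock mask windows, one per data bit; widths 1, 2 and 4.
MASKS = [
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
]


def count_three_bits(bits):
    """Model a counter that increments on every clock where mask AND data are high."""
    count = 0
    for bit, mask in zip(bits, MASKS):
        for m in mask:           # the data bit is steady for all 8 clocks
            if m and bit:
                count += 1
    return count


if __name__ == "__main__":
    for bits in ([0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]):
        print(bits, count_three_bits(bits))
```

Because the window widths are powers of two, the accumulated count is a 3-bit binary number with the first received bit in the least significant position. The windows sit in the middle of each bit period, which is where the timing slop mentioned above comes from.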

You CAN get 10 or even 20MHz by just banging out instructions, but my goal was to actually have enough extra time to do something useful. For SPI you just need to send and receive as fast as you can, so that trick isn't as useful.

Wuerfel_21 wrote: »

Send and receive at 20MHz (in bursts of 32 bits only) can be done using counters.
Sending is done by setting the CTR to NCO mode with a FRQ of zero, then just using shift instructions to shift all bits through. (The other CTR can be used to generate a clock.)
Receiving is a little more complex. I think you need to set the CTR up to count edges of the clock signal (which must be CLKFREQ/4 and generated from the same clock source) where the data is high. Then, by shifting the PHS register up every 4 cycles, the bits end up where they are supposed to be.

At least that's what I remember; I may be wrong.

EDIT:
The maximum speed at which data can be transferred in/out of the prop is 80 Megabit per second, by using P0 through P15 as a 16 bit parallel port and doing something like this:
(Moderators, how do you make code blocks again?)
loop    wrword  ina,pointer
        add     pointer,#2
        djnz    length,#loop
(also works for 8bit and 32 bit, but the latter is rather pointless)

A single cog can do about 10 Megabit using the 20 MHz shift tricks; if you manage to sync two up, the full bandwidth of 20 Megabit may be usable.
Note that an SD card has a limit of 25 MHz in SPI mode.

In retrospect, we're missing some badly needed counter modes. One that would be nice is a mode where the counter adds the phase register to itself on the rising edge of the A line and, additionally, sets the most significant bit to 1 if the B line is high. (Adding a register to itself is, of course, the same as multiplying it by 2, which is the same as shifting it left.) That plus the video generator and you have hardware serial I/O. Alternatively, they could perhaps have put a data-in line on the video generator and modified the "Waitvid" instruction. The modified waitvid instruction would not only feed the video generator but also leave the value that was shifted in in the cog RAM location specified in the destination register. You could use it for high-speed serial communications, most especially SPI up to 125MHz. But the point is moot, as the Propeller does NOT have that functionality.

jmg · 2018-01-04 00:40

MIchael_Michalski wrote: »

You CAN get 10 or even 20MHz by just banging out instructions, but my goal was to actually have enough extra time to do something useful. For SPI you just need to send and receive as fast as you can, so that trick isn't as useful.

Most QuadSPI memory has sticky-modes, so sending 4 bits per clock boosts thruput, without needing to be too specialized around video-gymnastics, and Read is similar to Write.

MIchael_Michalski · 2018-01-04 01:16

SaucySoliton wrote: »

Oh how I wish the Propeller had an input shift register... It may be possible to use the counters in feedback mode as an 80MHz shift register. You'll need 1 pin and 1 counter per bit. The counters invert their feedback, but that is easily fixed with one XOR instruction. A possible problem is the delay caused by routing the signals to pins and back.

What if the video generator transmits at 20Mbps and we receive with a sequence of MOVs? The sustained transfer rate would certainly be less than 10Mbps, but we could transfer the data and hand the bus off to another cog.

Related:
http://forums.parallax.com/discussion/164171/what-is-the-maximum-possible-throughput-of-spi-on-the-prop
http://forums.parallax.com/discussion/166850/boost-your-propeller-s-memory-256x-from-32kb-8mb

That should work, but I'd try receiving in the counter with shift instructions. If your data line is used to mask the clock line, then when the data line is high you cannot see the low-to-high transition and you won't count, but when the data line is low you will see the transition. Your data will need to be inverted after it's received.
