The last sdspi_bashed.spin2 posted by Rayman yesterday has a bug in it in readSector where a byte count of 512 is used as an immediate value. Attached is a modified version.
I'm trying to get fsrw and sdspi_bashed to work together ... not sure why my test program isn't working.
Mike, I looked at sdspi_bashed and see what you are referring to.
I wouldn't technically call it a "bug", since the code works...
Inline assembly has different rules than regular PASM2. It is more forgiving in FastSpin...
Still, I changed the # to ## and maybe looks better now...
pri readSector(p)| r, c, o,w,n,x,y ' Read 512 bytes from the card. 'RJA adding hoping to make faster
c := clk
o := do
w:=xwait
n:=4
asm
'Read in 512 bytes
mov y,##512 '#bytes in a block
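For anyone wondering why 512 in particular is the problem value: a plain # literal in PASM2 has to fit the 9-bit S field (0 to 511), while ## adds an AUGS prefix that supplies the upper 23 bits. Here is a small Python model of just that range check. The function names are made up for illustration and this is not how either compiler is implemented.

```python
from typing import Tuple

S_FIELD_BITS = 9    # width of the immediate field in a PASM2 instruction
AUGS_BITS = 23      # upper bits carried by the AUGS prefix that "##" generates


def fits_short_immediate(value: int) -> bool:
    """True if value can be encoded with a single '#' (0..511)."""
    return 0 <= value < (1 << S_FIELD_BITS)


def split_long_immediate(value: int) -> Tuple[int, int]:
    """Split a 32-bit value into (upper 23 bits for AUGS, lower 9 bits)."""
    value &= 0xFFFFFFFF
    return value >> S_FIELD_BITS, value & ((1 << S_FIELD_BITS) - 1)


print(fits_short_immediate(511))   # True
print(fits_short_immediate(512))   # False
print(split_long_immediate(512))   # (1, 0)
```

So 512 is the first value that no longer fits a single #, which is why the block size trips this while smaller counts would not.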
I made a slight change to the way the new FastBlocksRead() and FasterBlocksRead() are used. They now require a starting block# parameter. Changed to allow for things like movie files...
Here is the latest BMP speed read test (fsrw_test2.spin2).
I'm going to now see if I can make this work with both FastSpin and PNut.
(currently only works with FastSpin)
Now working in both FastSpin and PNut!
At least, this version of FSRW with the inline assembly version of the SPI driver.
The test program repeatedly loads a bmp image from SD card and shows it over VGA while sending diagnostic info out over the serial port.
Still need to work on the assembly SPI driver so that option would be available too...
Very nice to have code now that works in both FastSpin and PNut!
Update: The attached now has a fix that changed ptra to ptrb in sdspi_bashed that caused optimization issues...
The "bashed" version of the SD SPI driver (that doesn't use a cog and has inline assembly) in the previous post is now about 2X as fast as it was before.
Getting 1800 kB/s (used to be only ~900 kB/s) at 250 MHz clock.
It's now nearly as fast as the version that uses a cog.
There's not much point to the version that needs a cog now. It is a hair faster though.
Also, it could be made to not be blocking and let the main cog do other things while it works.
I think at the moment calls to the driver wait for the command to complete before returning. Probably should fix that.
I tried this new "bashed" version on the Video Test. It didn't work at first. I found a bug in the "readblocks" function for multi-block reading. There is a "readresp()" that should be there.
Now, it can play the Sintel with 250 MHz clock and the Buck Bunny with 300 MHz clock.
One thing I really liked with @lonesock's SD SPI driver is the read ahead and write behind feature.
This is the main reason why FSRW is way faster than @Kye's Fat_Engine.
On write access the main program just has to wait until the HUB sector is copied into the COG, the actual write is done in parallel or non blocking.
Same goes for reading in the next sector after delivering one into the HUB, if the main program needs the following sector it is already in the COG and just needs to get transferred into the HUB.
This non blocking use of COGs is something single core MC's can't do and where the Propeller 1/2 can show their power.
Using the LUT as sector buffer would even allow more than one sector to be read ahead or written behind.
Sadly I lost all my files for the RAISD project (4 SDs on one quickstart) through a messed-up Windows Update. I had managed to use @lonesock's SD SPI block driver inside @Kye's Fat_Engine, and both (FSRW and Fat_Engine) could do transfers of around 1,200 kbytes/second on a P1 at 80 MHz.
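The read-ahead idea described above can be modelled in a few lines of Python. This is only a sketch of the concept, not @lonesock's driver: a single worker thread stands in for the driver cog, and `fake_card_read` and `ReadAheadReader` are invented names.

```python
from concurrent.futures import ThreadPoolExecutor

SECTOR_SIZE = 512


class ReadAheadReader:
    """Wraps a blocking read_sector(n) callable and prefetches sector n+1."""

    def __init__(self, read_sector):
        self._read_sector = read_sector
        # One background worker plays the role of the dedicated driver cog.
        self._worker = ThreadPoolExecutor(max_workers=1)
        self._pending_sector = None
        self._pending = None
        self.hits = 0  # how many reads were served from the prefetch

    def read(self, sector):
        if self._pending is not None and self._pending_sector == sector:
            data = self._pending.result()  # usually finished already
            self.hits += 1
        else:
            if self._pending is not None:
                self._pending.result()     # let a stale prefetch finish first
            data = self._read_sector(sector)
        # Start fetching the following sector before returning to the caller.
        self._pending_sector = sector + 1
        self._pending = self._worker.submit(self._read_sector, sector + 1)
        return data

    def close(self):
        self._worker.shutdown(wait=True)


def fake_card_read(sector):
    """Hypothetical card: sector n is 512 bytes of the value (n & 0xFF)."""
    return bytes([sector & 0xFF]) * SECTOR_SIZE


reader = ReadAheadReader(fake_card_read)
blocks = [reader.read(n) for n in range(4)]
reader.close()
print(reader.hits)  # 3
```

Sequential access hits the prefetch on every read after the first. A random seek just falls back to a normal blocking read.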
@"Oldbitcollector (Jeff)" should still have 50 of the RAISD Kits @MacTuxLin made for me, but sadly Jeff and propellerpowered.com disappeared, so the project died.
Read ahead and write behind are nice, but I'm not actually sure they are needed with P2 being so much faster than P1.
We'll see. Maybe one day I'll come across a need for it...
Here's the latest (hopefully final) version that compiles with both PNut and FastSpin.
Also includes a video player (P2VideoPlayer.spin2) and a test program (fsrw_test2.spin2) that continuously loads a bmp image (bitmap2.bmp) from SD card (good for making sure reads are error free).
Test video for the player (qvga like resolution) can be found here: http://www.rayslogic.com/Propeller2/P2Video/P2Video.html
Well, the P1 at 80 MHz does 1,200 kbyte/sec with FSRW and you got 1,800 kbyte/sec on the P2 at 250 MHz.
Fat_Engine on a P1 does around 700 kbyte/sec without read ahead and write behind so it does make a difference of some 40%.
But I do have to agree that NOT using a dedicated cog for SD has its merits too. Going back from 16 to 8 cogs was the saddest decision made going from P2-hot to the current one.
The cogged version of sdspi gets 2000 kB/s at 250 MHz and 2400 kB/s at 300 MHz.
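To put the numbers quoted in this thread on a common footing, here is the arithmetic as system clocks spent per byte transferred. It assumes 1 kB = 1000 bytes, which the posts do not specify, and `clocks_per_byte` is just a helper name for this sketch.

```python
def clocks_per_byte(clock_mhz, kbytes_per_s):
    """System clock cycles per transferred byte (assumes 1 kB = 1000 bytes)."""
    return clock_mhz * 1_000_000 / (kbytes_per_s * 1000)


# (label, clock in MHz, throughput in kB/s), all taken from the posts above
figures = [
    ("P1 FSRW", 80, 1200),
    ("P2 bashed, before", 250, 900),
    ("P2 bashed, now", 250, 1800),
    ("P2 cogged", 250, 2000),
    ("P2 cogged", 300, 2400),
]

for label, mhz, rate in figures:
    print(f"{label:18s} {mhz:3d} MHz  {clocks_per_byte(mhz, rate):6.1f} clocks/byte")

# Fat_Engine without read ahead versus FSRW with it, both on P1
shortfall = 1 - 700 / 1200
print(f"Fat_Engine is {shortfall:.0%} slower than FSRW")  # 42%
```

The cogged driver works out to 125 clocks per byte at both 250 and 300 MHz, so it scales with the clock, and the P1 figure of about 67 clocks per byte shows how tightly that code was tuned.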
It was a sad day when we dropped from 16 to 8 cogs.
But now that we can do SD and serial without a cog, maybe it's not so bad.
Probably going to need interrupts for some applications to reduce cog count...
16 to 8 cogs...
Something had to give. So much was added to cogs (instructions) that a cog took way more silicon.
For 16 cogs I think we would be down to 128KB hub.
Smart pins also take a reasonable amount of silicon.
There is a discussion about what each block consumes silicon wise.
@msrobots
I coded the Z80 emulation using skip and exec last August and started testing it then. Just have to dig it out as soon as I have a couple more things done on my OS
While 16 cogs would be nice to have, I'm really happy with the current version of silicon. The one thing that I do wish for was more source-code compatibility between spin1 and spin2. I've got quite a few things that I really wish this (cog) driver did but I had to take a break for a while and then I got really busy. I gotta say most of the other changes were no big deal but 1) not having a default return, and 2) adopting C-like syntax for methods with no parameters made me rage quit for like 2 months. *end rant*
For the touchscreen OS, it would be really nice to use LUT sharing as a fast data path. Copy image data from SD card to LUT with one cog and its partner cog would then send that data to the display (or SRAM). I was trying to figure out the best way to do this before I started down the SPIN2 translation hell. I know it could be done but it would be nice to be usable with OTHER drivers. I originally was hoping to make something compatible with the hyperram driver but... way above my head.
The other big TODO was run a binary but then I started thinking it might be nice to run a whole-chip binary, as well as a single cog binary. Again, I never took this past the research stage...
I was also pondering the best way to keep compatibility with P1 FSRW. I'm probably going to change return values to match (pretty sure I was the one that changed reporting SD version). Since everyone else agrees that standard SD support should be dropped that's probably the first thing I'll do. I kinda liked being able to grab any ol' SD card and use it, but in the name of consistency...
Glad this was at least somewhat useful. Once I've had a chance to examine / merge all changes I'll update the top post.
Kudos to Rayman for continuing where I abandoned things
Agreed re spin1 v spin2 syntax. However, I did manage to convert Kye's FAT driver quite easily once I had my serial debugger working. There are a few gotchas. Unfortunately I was so depressed with the changes for no good reason that I forgot to document all the things I had to do.
LUT sharing can only be shared between cogs using pasm, not spin (pnut version at least, not sure about fastspin).
There is a common interface that could be used between the spin and pasm for SD....
Initialise which initialises the SD card (ie clocks plus command 0 thru command 16)
readCSDCID(@addr) which reads the CSD (16 bytes) followed by CID (16 bytes) into @addr (commands 9 & 10)
readSector(sector, @addr) which reads a sector (512 bytes) from "sector" into @addr (command 17)
writeSector(sector, @addr) which writes a sector (512 bytes) to "sector" from @addr (command 24)
This is what I've used to interface from Kye's FAT code and I shoehorned it into fsrw for Mike to try in FemtoBasic.
If we could keep this common interface, plus maybe optional readBlock and writeBlock then we could develop the different versions of the SD pasm code while keeping the different spin versions of fsrw and Kye's FAT32. This way we could mix and match to suit.
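To make the mix-and-match idea concrete, here is a rough Python rendering of that four-call interface with a RAM-backed stand-in driver. It is only a sketch: the snake_case names, `SDBlockDriver`, `RamDiskDriver` and `copy_sector` are invented for illustration, and buffers are returned rather than written to an @addr as the Spin versions would do.

```python
from abc import ABC, abstractmethod

SECTOR_SIZE = 512


class SDBlockDriver(ABC):
    """Hypothetical Python rendering of the four-call interface."""

    @abstractmethod
    def initialise(self) -> None:
        """Bring the card up (clocks plus the init command sequence)."""

    @abstractmethod
    def read_csd_cid(self) -> bytes:
        """Return the CSD (16 bytes) followed by the CID (16 bytes)."""

    @abstractmethod
    def read_sector(self, sector: int) -> bytes:
        """Return the 512 bytes stored in the given sector."""

    @abstractmethod
    def write_sector(self, sector: int, data: bytes) -> None:
        """Store exactly 512 bytes into the given sector."""


class RamDiskDriver(SDBlockDriver):
    """Stand-in driver backed by RAM instead of a real card."""

    def __init__(self, sector_count):
        self._sectors = [bytes(SECTOR_SIZE) for _ in range(sector_count)]

    def initialise(self):
        pass  # nothing to bring up for a RAM disk

    def read_csd_cid(self):
        return bytes(32)  # placeholder register contents, all zero

    def read_sector(self, sector):
        return self._sectors[sector]

    def write_sector(self, sector, data):
        if len(data) != SECTOR_SIZE:
            raise ValueError("a sector is exactly 512 bytes")
        self._sectors[sector] = bytes(data)


def copy_sector(driver: SDBlockDriver, src: int, dst: int) -> None:
    """File-system-side code that relies only on the shared interface."""
    driver.write_sector(dst, driver.read_sector(src))


disk = RamDiskDriver(sector_count=8)
disk.initialise()
disk.write_sector(0, b"\x55" * SECTOR_SIZE)
copy_sector(disk, 0, 3)
print(disk.read_sector(3)[:4])  # b'UUUU'
```

Anything written against the base class (like `copy_sector` here) would not care whether the driver underneath uses a cog, inline assembly, or smart pins.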
@Rayman
On second thought, perhaps it's best for you to start a new thread? That way you can keep it up-to-date as you make changes. Looking over the changes you made vs my original port effort, maybe you'd like to take ownership of the port as well. Remove all my old garbage and just an "acknowledgement to cheezus for his original effort" or something of the like. Whatever you think is best.
@cheezus You certainly deserve a lot of credit for being the first to get FSRW to actually work on P2 (in Spin). I tried and failed... I'm sure it must have taken a lot of time.
The pandemic has given me extra time to make it a bit faster and also compatible with the Parallax compiler. It can do video now (admittedly low quality).
It's faster than it was on P1 now, which is saying something because I'm sure Tomas Rokicki and Jonathan Dummer spent endless hours optimizing performance on P1.
At some point it might be worth looking into whether the streamer can be used within the SPIN2 COG's context for an SD SPI driver if you need to save a COG, perhaps that could buy even more performance vs bit bang. You'd need the clock pin driven too with a Smartpin perhaps.
Hopefully the assembly order of bits in a byte before the WFBYTE occurs from the streamer to HUB RAM is compatible with the SPI SD order (MSB arriving first IIRC). Otherwise a bit reversal on each data block byte is required which would be bad for performance.
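To illustrate the concern, here is a small Python model of what happens when bits that arrive MSB first are packed into a byte starting from the bottom, and the per-byte reversal that would then be needed. It models only the bit arithmetic, not the streamer itself, and the function names are made up.

```python
def pack_bits(bits, top_first):
    """Pack 8 bits, given in arrival order, into one byte.

    top_first=True puts the first bit in bit 7, otherwise in bit 0.
    """
    value = 0
    for i, bit in enumerate(bits):
        shift = 7 - i if top_first else i
        value |= (bit & 1) << shift
    return value


def reverse_bits8(b):
    """Reverse the bit order of one byte."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (b & 1)
        b >>= 1
    return result


wire_bits = [1, 0, 1, 0, 0, 0, 0, 1]   # the byte $A1 as sent MSB first

print(hex(pack_bits(wire_bits, top_first=True)))    # 0xa1
print(hex(pack_bits(wire_bits, top_first=False)))   # 0x85
print(hex(reverse_bits8(0x85)))                     # 0xa1
```

Doing that reversal in software for all 512 bytes of every block is the overhead being worried about here.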
Update: Looking at the P2 documentation it says this:
"Modes which shift data use bits bottom-first, by default. Some of these modes have the %a bit in D[16] to reorder the data sequence within the individual bytes to top-first when %a = 1."
Rayman, you may not have followed the p2gcc thread, but I got FSRW working on the P2 about 3 years ago. However, this was on the FPGA, and not the actual P2 silicon. Also it was in C, and not in Spin. FSRW is the basis for the p2gcc C file I/O. The fourth comment in this thread stated that.
However, cheezus is probably the first one to get FSRW working on the P2 in Spin.
How is p2gcc coming along? I would love to program the P2 in C++ with all the bells and whistles (but I would only use a subset that is known to perform well on MCUs of course)
I quit working on p2gcc about 10 months ago. Eric's fastspin compiler is an excellent development tool for the P2, and I think he will add some basic C++ features to it over time.
I'm using FastSpin for all my development on the P2 and I LOVE it. He's doing a fantastic job and it's a catalyst for the whole P2 community. If it wasn't for Eric we would have had to wait for Chip's compiler/interpreter to mature to the state it has now and we would be "years" behind. Thank you Eric!
But I know for sure that the P2 NEEDS gcc (or equally well known compiler) to be taken seriously by most professional embedded developers. That's not my opinion, I know this from working as an embedded developer for years. That's the sad truth and Parallax would gain tremendously by knowing this fact! (and act upon it)
There is GCC for P2, with a RISC-V emulator, which makes it even more interesting for the "professionals" that can't work without standard tools and standard architectures.
The whole P2 is very different from other architectures, so GCC alone will not help much to be taken seriously by professional embedded developers.
For me flexC has many advantages over GCC: You can include Spin2 objects, and you can use P2 inline assembly with an easy syntax.
I think Spin2Cpp lets you use Spin1 in PropGCC for P1. Hopefully, one could still do that for P2.
FastSpin seems to be filed away under Spin2Cpp, so maybe this is true...
On this topic: Not sure if it was brought up before, but could fastspin's code generator be appropriated to take in LLVM IR? That'd be a good stopgap solution for C++ compilation.
Comments
One is swapping the CS and CLK pin numbers.
The other is changing the mount fail condition from "<>0" to "<0".
See attached, look for "RJA" in two places...
Now it returns 1 for regular SD and 2 for SDHC...
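The reason for the second change is easy to see if you tabulate the two tests against the new return values. A quick Python check, assuming (as the posts suggest) that negative values mean failure and 1 / 2 report the card type. The function names are made up for this sketch.

```python
def mount_failed_old(result):
    """Old test: anything other than 0 counts as failure ("<>0")."""
    return result != 0


def mount_failed_new(result):
    """New test: only negative values count as failure ("<0")."""
    return result < 0


for result, meaning in [(-1, "error"), (1, "regular SD"), (2, "SDHC")]:
    print(result, meaning, mount_failed_old(result), mount_failed_new(result))
```

With the old test both a regular SD (1) and an SDHC card (2) would be reported as a mount failure, even though the card came up fine.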
Enjoy!
Mike
I was about to start a new thread so these files would be in the top post. But I’ll wait for you to do it now.
With FastSpin you do have the option of keeping the P1 syntax. But, I thought it’d be more useful to be compatible with the Parallax compiler.
/Johannes
https://forums.parallax.com/discussion/167839/p2-sd-card-boot-code-for-rom
Andy