Bursting data to/from Cog RAM- 6.6MB/sec using 9 pins?

jazzed · 2009-05-06 20:55

Phil Pilgrim (PhiPi) said...

jazzed,

I'm not sure I follow you. One of the gotchas with the carry flag and shifts/rotates is that only the starting bit (31 or 0) gets shifted into carry. IOW, the carry bit doesn't act like a super MSB or LSB when you do a shift or rotate with a wc.

-Phil

Ok, I see it now. The original value of bit 0 is put into C, so to get the original value, one must first shr by 7. But you did say that in so many words. Thanks again :)

Welcome to our collaboration Dr. Jim

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Propalyzer: Propeller PC Logic Analyzer
http://forums.parallax.com/showthread.php?p=788230

Post Edited (jazzed) : 5/6/2009 9:07:57 PM GMT

Giemme · 2009-05-06 21:08

@Mikediv

The other chips are:
2 x 512 Kb SRAM (K6X4008C1F)
1 x Latch 74HC573N

regards

Gianni

lonesock · 2009-05-06 21:10

Taking jazzed's & Phil's code and tweaking it a bit, I can use a simple MOV for the first byte (so the 0 bit is actually in D[0]), then a ROR to get it into position, writing the Carry flag at that point (which pulls the carry flag from, coincidentally, D[0]), saving an instruction later:

              mov       val,ina     'xxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAA .
              ror       val,#17 wc  'xxxxxxxxxAAAAAAAAxxxxxxxxxxxxxxx A0
              movi      val,ina     'bBBBBBBBBAAAAAAAAxxxxxxxxxxxxxxx A0
              shr       val,#8      '........bBBBBBBBBAAAAAAAAxxxxxxx A0
              movi      val,ina     'cCCCCCCCCBBBBBBBBAAAAAAAAxxxxxxx A0
              shr       val,#8      '........cCCCCCCCCBBBBBBBBAAAAAAA A0
              movi      val,ina     'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A0
              rcl       val,#1      'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A0

Did I miss something, or would that work?
Jonathan
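
For anyone who wants to check the bit diagrams without hardware, here is a minimal Python model of the eight instructions above. It is only a sketch: the helper names (`ror`, `movi`, `rcl`, `read_long`) are made up for illustration, and they model just the PASM behaviour the sequence relies on (ROR with wc latches the original bit 0, MOVI replaces bits 31..23 with the source's bits 8..0, RCL shifts the C flag in from the right). The four INA samples are assumed to carry the data byte on P0..P7 with arbitrary junk on P8..P31.

```python
MASK32 = 0xFFFFFFFF

def ror(value, bits):
    """PASM ROR with wc: rotate right, carry = ORIGINAL bit 0."""
    bits %= 32
    result = ((value >> bits) | (value << (32 - bits))) & MASK32
    return result, value & 1

def movi(dest, src):
    """PASM MOVI: copy src[8:0] into dest[31:23], keep dest[22:0]."""
    return (dest & 0x007FFFFF) | ((src & 0x1FF) << 23)

def rcl(value, bits, carry):
    """PASM RCL: shift left, filling the vacated low bits with the C flag."""
    fill = ((1 << bits) - 1) if carry else 0
    return ((value << bits) | fill) & MASK32

def read_long(ina_samples):
    """Model of the eight-instruction read; takes four 32-bit INA samples."""
    a, b, c, d = ina_samples
    val = a                      # mov  val,ina
    val, carry = ror(val, 17)    # ror  val,#17 wc  (C = A0)
    val = movi(val, b)           # movi val,ina
    val >>= 8                    # shr  val,#8
    val = movi(val, c)           # movi val,ina
    val >>= 8                    # shr  val,#8      (A0 falls off here)
    val = movi(val, d)           # movi val,ina
    return rcl(val, 1, carry)    # rcl  val,#1      (A0 comes back from C)

junk = 0xDEADBE00                # arbitrary state of P8..P31
samples = [junk | 0x11, junk | 0x22, junk | 0x33, junk | 0x44]
print(hex(read_long(samples)))   # 0x44332211
```

In this model the junk on the upper pins never reaches the result: the first byte read ends up in bits 7..0 and the last in bits 31..24.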

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

Post Edited (lonesock) : 5/6/2009 9:35:36 PM GMT

jazzed · 2009-05-06 21:17

MagIO2 said...
Xilinx XC9572

Nice part. Also I noticed the data sheet mentions PQFP-100 ... (wider pin spacing than TQFP-100). Is it generally available? Digikey has a VQFP-64 with 52 IO; is that a typo? This is more attractive than the Altera Max3064.


MagIO2 · 2009-05-06 21:33

I got the PLCC84 version. For that sockets are available. So you can even do wirewrap prototyping. With only 2 74HC125 and some resistors you can build the programming interface which is supported by the XILINX development environment. Ok ... my current version is 5V, but the shift register with which I tested the video generator shift out is 5V as well and it works fine. So for prototyping it should work. First I have to find the cable - we just moved ;o)

Hanno · 2009-05-06 21:45

You guys rock! I run away for a round of golf and look what everyone has come up with! I think we're onto something here. I like the idea of having just 9 IO pins- 8 for the data which will be burst, and 1 as a serial interface to tell some other device/cog whether you want to read or write a constant amount of data. Once the serial command is sent, the other device is responsible for sending a set number of bytes at a set speed. This allows you to do away with the clock and other IO pins. Exactly how to massage the data into memory will depend on the application. If all you care about is 100 bytes- and need lots of program room, you can use my original code. If you want to pack it into longs, you'll need to shift, probably in an unrolled loop. Or, if you want to store more data, then use the movs, movi, movd variants. BUT, the hardware remains the same. Who wants to cook up a simple demo so we can talk more concretely about this? (I would, but I still have work to do for next week's meetings at Parallax and Google :) ) Demo should show:
- 1 cog-the "controller" has some user interface to do something interesting
- 1 cog-the "sender" has data it wants to send to another cog
- 1 cog-the "receiver" wants that data to do something useful with it
You won't need any hardware for this, just pretend that the "sender" and "receiver" are separated by wires- in reality they're communicating via INA and OUTA. Ideally sender and receiver should share a common subroutine... Extra credit if you use some standard serial interface- but it has to be fast. And woohoo, looks like Brian is tempted to make us hardware :)
Hanno

jazzed · 2009-05-06 22:11

lonesock said...

              mov       val,ina     'xxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAA .
              ror       val,#17 wc  'xxxxxxxxxAAAAAAAAxxxxxxxxxxxxxxx A0
              movi      val,ina     'bBBBBBBBBAAAAAAAAxxxxxxxxxxxxxxx A0
              shr       val,#8      '........bBBBBBBBBAAAAAAAAxxxxxxx A0
              movi      val,ina     'cCCCCCCCCBBBBBBBBAAAAAAAAxxxxxxx A0
              shr       val,#8      '........cCCCCCCCCBBBBBBBBAAAAAAA A0
              movi      val,ina     'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A0
              rcl       val,#1      'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A0

Did I miss something, or would that work?

Works for me! But then again, without a quick test, apparently anything works for me :)
This is also faster than 6.6MB/s. Of course adding instructions for pointer adjust and djnz len
(3 inst or 2 if you get really creative) adds 100-150ns per long to make it 7.3 to 8 MB/s.
Too bad one needs overhead :)
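
The figures above can be reproduced with a little arithmetic. This sketch assumes an 80 MHz system clock and 4 clocks per instruction (both are assumptions, as is counting a taken djnz as 4 clocks); `burst_rate_mb_per_s` is a hypothetical helper, and MB here means 10^6 bytes.

```python
CLOCK_HZ = 80_000_000         # assumed system clock
CLOCKS_PER_INSTRUCTION = 4    # assumed: every instruction in the loop takes 4 clocks

def burst_rate_mb_per_s(instructions_per_long):
    """Bytes per second (in units of 10^6) when one long costs this many instructions."""
    seconds_per_long = instructions_per_long * CLOCKS_PER_INSTRUCTION / CLOCK_HZ
    return 4 / seconds_per_long / 1e6

for n in (8, 10, 11):         # bare sequence, +2 and +3 instructions of loop overhead
    print(n, "instructions/long:", round(burst_rate_mb_per_s(n), 2), "MB/s")
```

Under those assumptions the bare eight instructions come out at 10 MB/s, and two or three extra instructions per long bring that down to about 8 and 7.3 MB/s.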

MagIO2, the VQFP is similar to TQFP spacing ... 7.5 mil ugh.

Hanno, you're right you don't need a clock pin if it's 2 Propellers since they can share XI pins.
But to do external memory transfers one has no choice because of the variety of data modes.

Maybe Brian would like to talk about how to get a CPLD and SRAM on to a nice small size
prototype that could easily fit into a generally available and cheap enclosure. I've been working
on that myself, but having a more experienced layout guy do it would "de-risk" things.


Phil Pilgrim (PhiPi) · 2009-05-06 22:15

lonesock,

Excellent! I've been racking my brain, trying to get rid of that extra shift, and you solved it! This makes it possible to use the code with a 10MHz synchronous clock.

-Phil

lonesock · 2009-05-06 22:51

Glad I could help, thanks to Phil and jazzed for the heavy lifting! [8^)


mynet43 · 2009-05-06 23:09

I've been following this post all day.

It's fascinating to see the brain cells synchronize and come out with something like this.

You've definitely convinced me to move my data I/O pins to D0..D7. There's no way I want to miss out on the code that's evolved.

Keep up the great work! I was already packing stuff into 4 bytes/long and this will speed things up quite a bit, even with control lines.

Keep us posted as the hardware progresses.

Thanks everyone,

Jim

jazzed · 2009-05-06 23:20

Don't forget Andy's and Hanno's inspirational posts.


Hanno · 2009-05-06 23:28

Had some more thoughts:
- If you're using something smart like a Propeller for the sender/receiver, you don't even need the control line, 8 data lines is enough
- However, since the movs,movd,movi commands move 9 bits, you could use 9 lines and move 9 bits at a time
- To connect multiple devices, use an ethernet/canbus type strategy to listen until the bus is empty, then negotiate to use it
- Even if you're not connecting to external memory, this is a great way to share data between cogs at fast speeds!
Hanno (thanks _rjo for the signature tip- here is my very first use of it!)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Download a free trial of ViewPort- the premier visual debugger for the Propeller
Includes full debugger, simulated instruments, fuzzy logic, and OpenCV for computer vision. Now a Parallax Product!

jazzed · 2009-05-07 00:56

Nice .sig Hanno ... you forgot Logic State Analyzer.

So for 8 data lines, you assume one cog is constantly polling for some non-zero address or attention byte?

What if there is more than one "secondary" Propeller ? Having a start strobe that coincides with the address ensures that there is no confusion about what the byte means especially if a valid token in a packet between the "primary" and another "secondary" happens to have the attention byte for the "N secondary". If there is only one pair of Propellers (or cogs), it doesn't matter.


Hanno · 2009-05-07 01:18

Steve,
A separate control line does simplify things like deciding when a request is being started. It's probably worth the 1 IO pin. Have you discovered that the "z" flag can be set on "mov" instructions if you're moving a 0? Same with wrlong/rdlong. Very useful! Is anyone programming this yet? I'm getting anxious to try out some real code and critique a real solution!
(Logic Analyzer is one of the "simulated instruments", ViewPort also offers a spectrum analyzer, xy mode, and of course oscilloscope)
Hanno


mctrivia · 2009-05-07 01:50

You guys are killing me. I had the prop galore almost done; now I have to re-route all the bus lines. Oh well, it is worth it, so I am off to the drawing board.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Need to make your prop design easier or secure? Get a PropMod has crystal, eeprom, and programing header in a 40 pin dip 0.7" pitch module with uSD reader, and RTC options.

jazzed · 2009-05-07 02:31

@Mac,

I did mention P0..9 in your thread on April 17th :) For some reason I thought a ready bit was important too ... maybe for arbitration and/or "asynchronous packet" responses ... too tired to think precisely now.

@Hanno,

Yes indeed. I first ran into this feature last summer when someone was trying to use "rdlong ... wc" which was a bug of course :) I'm not coding this right now because I'm too busy trying out Verilog. Not knowing Verilog is a career weakness for me.

Give the driver your best shot ... until hardware is generally available that fits the mold, it's mostly theory anyway. You could do it inter-COG. Also, I do have a 2 Propeller board with 9 bits attached already if you want me to test something.


Post Edited (jazzed) : 5/7/2009 5:18:43 AM GMT

mctrivia · 2009-05-07 02:38

Yes jazzed, I should have listened to you in the beginning. I went with P10-19 because it allowed me to make the board a fair bit narrower. Now I am going with P0-P11, though the upper bits will be easily separated for IO use.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
My new unsecure propmod both 1x1 and full size arriving soon.

Brian Fairchild · 2009-05-07 06:35

I haven't fully read all the details in this thread but just one of those early morning random thoughts....

Does it matter if the data is 'scrambled' in the memory? Does it matter if a strange number of bytes/words/longs is mapped to a sensible number of memory bytes or if it's stored in 'odd' addresses in the memory? As long as reading and writing are symmetrical then "data in = data out".

Many years ago, in the days when 8k SRAMs were a luxury, we designed a large PCB with lots of static ram and eprom on. When we had it made we realised the pcb layout guy had messed up and the address lines were scrambled. A new version was going to put the project late and then we realised it didn't matter. The RAM was fine, all we did for eprom was write a little utility to scramble the data so that when read out it was in the right places.

If we accept that RAM is cheap then if we waste even half a chip at the expense of fast access then we're still streets ahead.
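
A toy model makes the scrambled-address argument easy to check. Everything here is hypothetical: a 4-bit address bus, a made-up `WIRING` table saying which chip pin each CPU address line actually reaches, and helper names chosen for illustration. The only requirement is that the wiring is a permutation, so no two addresses alias.

```python
WIRING = [2, 0, 3, 1]   # hypothetical mistake: CPU line i drives chip address pin WIRING[i]

def scramble(address, wiring=WIRING):
    """Address the chip actually sees for a given CPU address."""
    seen = 0
    for cpu_bit, chip_bit in enumerate(wiring):
        if address & (1 << cpu_bit):
            seen |= 1 << chip_bit
    return seen

# RAM: reads and writes go through the same scramble, so data in = data out.
ram = [0] * 16

def ram_write(address, value):
    ram[scramble(address)] = value

def ram_read(address):
    return ram[scramble(address)]

# EPROM: the programmer writes physical locations, so the image is scrambled first.
def prescramble(image, wiring=WIRING):
    chip = [0] * len(image)
    for address, value in enumerate(image):
        chip[scramble(address, wiring)] = value
    return chip

for a in range(16):
    ram_write(a, a * 3)
print([ram_read(a) for a in range(16)] == [a * 3 for a in range(16)])   # True
```

The RAM never notices the swap, and the EPROM only needs its image run through `prescramble` once before programming, which is the kind of utility described above.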

jazzed · 2009-05-07 07:01

By the way, the read32 algorithm data rate in an uncached asynchronous XMM is
up to 1.25M LIPS (5.0MB/s) depending on SRAM type.

loadxmm        ' 20 instructions ... 16 instructions for 3.3V < 50ns SRAM ...
               ' unlatched, direct address/data bus ...
               ' dira set by init code to $0fffff00
               mov      outa,   addr
               shl      outa,   #8      ' mov address bits into position
               add      outa,   _0x300  'add to address to fix endian order
               nop                         ' remove for < 50ns 3.3V SRAM
               nop                         ' remove for < 100ns 3.3V SRAM
               mov      val,    ina     'xxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAA .
               sub      outa,   #$100
               ror      val,    #17 wc  'xxxxxxxxxAAAAAAAAxxxxxxxxxxxxxxx A0
               nop                         ' remove for < 100ns 3.3V SRAM
               movi     val,    ina     'bBBBBBBBBAAAAAAAAxxxxxxxxxxxxxxx A0
               sub      outa,   #$100
               shr      val,    #8      '........bBBBBBBBBAAAAAAAAxxxxxxx A0
               nop                         ' remove for < 100ns 3.3V SRAM
               movi     val,    ina     'cCCCCCCCCBBBBBBBBAAAAAAAAxxxxxxx A0
               sub      outa,   #$100
               shr      val,    #8      '........cCCCCCCCCBBBBBBBBAAAAAAA A0
               nop                         ' remove for < 100ns 3.3V SRAM
               movi     val,    ina     'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A0
               rcl      val,    #1      'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A0
               ret

_0x300         long $300
addr           long 0               
val            res 1

Brian, if one uses a loader and SRAM the address or data order doesn't matter as long as there is no address aliasing.
But you knew that. Neat that you had a script to fix the EPROM file :)
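
Here is a rough Python model of the decrementing loadxmm read, for checking the byte order without an SRAM on the bench. It is a sketch under assumptions: the pin map follows the dira comment (address driven on P8..P27, data read on P0..P7), INA is taken to read back the driven address pins, the SRAM is a plain `bytes` object, and the nops are left out because the model has no timing. The helper names are made up.

```python
MASK32 = 0xFFFFFFFF

def movi(dest, src):
    """PASM MOVI: copy src[8:0] into dest[31:23], keep dest[22:0]."""
    return (dest & 0x007FFFFF) | ((src & 0x1FF) << 23)

def loadxmm(sram, addr):
    """Model of the decrementing 4-byte read; returns the assembled long."""
    def ina(outa):
        # Address pins P8..P27 read back as driven, SRAM data appears on P0..P7.
        return (outa & 0x0FFFFF00) | sram[(outa >> 8) & 0xFFFFF]

    outa = ((addr << 8) + 0x300) & MASK32         # mov / shl / add _0x300
    val = ina(outa)                               # mov  val,ina   (byte at addr+3)
    outa -= 0x100                                 # sub  outa,#$100
    carry = val & 1                               # ror  val,#17 wc (C = A0)
    val = ((val >> 17) | (val << 15)) & MASK32
    val = movi(val, ina(outa))                    # movi val,ina   (byte at addr+2)
    outa -= 0x100
    val >>= 8                                     # shr  val,#8
    val = movi(val, ina(outa))                    # movi val,ina   (byte at addr+1)
    outa -= 0x100
    val >>= 8                                     # shr  val,#8
    val = movi(val, ina(outa))                    # movi val,ina   (byte at addr+0)
    return ((val << 1) | carry) & MASK32          # rcl  val,#1

sram = bytes([0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77])
print(hex(loadxmm(sram, 4)))   # 0x44556677
```

In this model the byte at addr+3 lands in bits 7..0 and the byte at addr in bits 31..24, so bytes $00,$11,$22,$33 stored at ascending addresses come back as $00112233.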


Post Edited (jazzed) : 5/7/2009 7:09:53 AM GMT

Cluso99 · 2009-05-07 08:06

Nice method to recover 32 bits guys

This concept will work on both Blades #1 & #2 of the TriBladeProp.

Jazzed:
You will require the first 'nop' for <50nS 3v3 sram because...

               add      outa,   _0x300  'add to address to fix endian order
               nop                         ' remove for < 50ns 3.3V SRAM
               nop                         ' remove for < 100ns 3.3V SRAM
               mov      val,    ina     'xxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAA .

The timing for this at 5MHz....

add      IdSDeR
mov          IdSDer
              ^      <-- write of address to the output pins
               ^     <-- read of data pins
              ||     less than 12.5nS but cannot guarantee any time > 0nS

The address output will be output at the R cycle and the data will be read in the next S clock cycle. The timing between these cannot be guaranteed (I don't recall this sort of timing published on the data sheet). So you will require the first 'nop' no matter how fast the memory.

Postedit 14Dec2009: ERROR ABOVE: See http://forums.parallax.com/showthread.php?p=861676 for confirmation from Chip that "ina" is sampled on the "e" clock, not the "S" clock.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:

- Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
- Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
- Prop Tools under Development or Completed (Index)
- Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
- Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz MultiBladeProp is: www.bluemagic.biz/cluso.htm

Post Edited (Cluso99) : 12/14/2009 5:17:07 AM GMT

heater · 2009-05-07 10:52

Jazzed, I'm not sure I'm awake enough to fully understand that last code snippet. Are you saying we can now fetch LONGs through a byte wide interface at 1.25 M longs per second from normal RAM with no hardware assistance? That is we can execute XMM PASM at say 1MIP. That's totally awesome.

Looks like the addresses are progressing backwards, or is it me?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

jazzed · 2009-05-07 14:29

Yes, the address has to decrement for little endian storage ... DDCCBBAA -> 00112233. This is nice for using DJNZ though ....
Thanks for clarity on the delay Ray ... guess the NOP can be used for something.


Hanno · 2009-05-07 15:18

Good morning!
I think we're getting there together- nice teamwork everyone. Sorry I couldn't help this time around-I'll have time next Wednesday when I fly home! I've gone through this thread and collected what I think are the salient points:
- synchronous access device has a defined transfer packet format
- would be great to use the Prop's XO line as a clock
- 8 bits can be shifted/moved into 32 bits
- Xilinx XC9572 may provide a nice interface
- it's ok to waste external memory, or leave it scrambled
Study below:

jazzed: "
Obviously as you mention a synchronous access device is necessary. As I see it, one needs 2 pins in addition to the 8 for byte data. One pin would be for the clock which would be produced by the CTRA on demand. The other pin would be a start bit. The data access would be via a packet as described below.

Synchronous memory transfer packet format:

# S BYTE
1 1 WLLLLLLL
2 0 AAAAAAAA
3 0 AAAAAAAA
4 0 AAAAAAAA
5 0 XXXXXXXX
6 0 DDDDDDDD
7 0 DDDDDDDD
8 0 DDDDDDDD
M 0 DDDDDDDD

Legend:
# - Packet Byte
S - Start bit state
W - Write Bit: Write if high, Read if low
L - Length Bit: Transaction up to N bytes
A - Address Bit: Target address
X - Turn Around: Need to turn the BUS to input
D - Data Bits
M - Packet Length: N data + 5 setup

Timing all depends on how data is stored by Propeller. The packet could be smaller for smaller length and address.
The turnaround byte can be skipped in a write packet. Obviously the more data that is transferred, the higher the burst rate.
"

jazzed: BTW, I don't think it's possible to use 9 pins for this with an 8 bit bus unless the Propeller XO clock can be used some way.
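
To make the quoted packet layout a bit more concrete, here is a rough sketch that builds a write packet as a list of (start bit, byte) pairs. Several things are assumptions, not part of the description above: the address is sent most significant byte first, L is the number of data bytes, the turnaround byte is sent as $00, and `build_header` and `build_write_packet` are hypothetical names.

```python
def build_header(write, length, address):
    """First four packet bytes: WLLLLLLL with S=1, then three address bytes with S=0."""
    if not 0 <= length <= 0x7F:
        raise ValueError("length must fit in 7 bits")
    if not 0 <= address <= 0xFFFFFF:
        raise ValueError("address must fit in 24 bits")
    first = (0x80 if write else 0x00) | length
    address_bytes = address.to_bytes(3, "big")    # byte order is an assumption
    return [(1, first)] + [(0, b) for b in address_bytes]

def build_write_packet(address, data, turnaround=True):
    """Header, optional turnaround byte (X), then the data bytes, all with S=0."""
    packet = build_header(True, len(data), address)
    if turnaround:
        packet.append((0, 0x00))                  # X byte; value is an assumption
    packet += [(0, b) for b in data]
    return packet

for start, byte in build_write_packet(0x012345, b"\xAA\xBB"):
    print(start, f"{byte:08b}")
```

With the turnaround byte included the packet length is N data bytes plus 5, as in the legend; a write packet that skips it is one byte shorter.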

lonesock:
Taking jazzed's & Phil's code and tweaking it a bit, I can use a simple MOV for the first byte (so the 0 bit is actually in D[0]), then a ROR to get it into position, writing the Carry flag at that point (which pulls the carry flag from, coincidentally, D[0]), saving an instruction later

              mov       val,ina     'xxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAA .
              ror       val,#17 wc  'xxxxxxxxxAAAAAAAAxxxxxxxxxxxxxxx A0
              movi      val,ina     'bBBBBBBBBAAAAAAAAxxxxxxxxxxxxxxx A0
              shr       val,#8      '........bBBBBBBBBAAAAAAAAxxxxxxx A0
              movi      val,ina     'cCCCCCCCCBBBBBBBBAAAAAAAAxxxxxxx A0
              shr       val,#8      '........cCCCCCCCCBBBBBBBBAAAAAAA A0
              movi      val,ina     'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A0
              rcl       val,#1      'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A0

jazzed:
MagIO2 said...
Xilinx XC9572

Nice part. Also I noticed the data sheet mentions PQFP-100 ... (wider pin spacing than TQFP-100). Is it generally available? Digikey has a VQFP-64 with 52 IO; is that a typo? This is more attractive than the Altera Max3064.

Brian Fairchild:
Does it matter if the data is 'scrambled' in the memory?
If we accept that RAM is cheap then if we waste even half a chip at the expense of fast access then we're still streets ahead.

Jazzed:

loadxmm        ' 20 instructions ... 16 instructions for 3.3V < 50ns SRAM ...
               ' unlatched, direct address/data bus ...
               ' dira set by init code to $0fffff00
               mov      outa,   addr
               shl      outa,   #8      ' mov address bits into position
               add      outa,   _0x300  'add to address to fix endian order
               nop                         ' remove for < 50ns 3.3V SRAM
               nop                         ' remove for < 100ns 3.3V SRAM
               mov      val,    ina     'xxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAA .
               sub      outa,   #$100
               ror      val,    #17 wc  'xxxxxxxxxAAAAAAAAxxxxxxxxxxxxxxx A0
               nop                         ' remove for < 100ns 3.3V SRAM
               movi     val,    ina     'bBBBBBBBBAAAAAAAAxxxxxxxxxxxxxxx A0
               sub      outa,   #$100
               shr      val,    #8      '........bBBBBBBBBAAAAAAAAxxxxxxx A0
               nop                         ' remove for < 100ns 3.3V SRAM
               movi     val,    ina     'cCCCCCCCCBBBBBBBBAAAAAAAAxxxxxxx A0
               sub      outa,   #$100
               shr      val,    #8      '........cCCCCCCCCBBBBBBBBAAAAAAA A0
               nop                         ' remove for < 100ns 3.3V SRAM
               movi     val,    ina     'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A0
               rcl      val,    #1      'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A0
               ret

_0x300         long $300
addr           long 0
val            res 1

Hanno


MagIO2 · 2009-05-07 15:34

Why would you like to send the WLLLLLLL? The XC9572 simply stays in burst mode until it finds the next start bit. So, as long as it has a clock it counts up the address. Maybe you have an application where you don't need high speed access, but you want to copy 20k into HUB RAM. Can easily be done that way. And I'd prefer to have the prop control the WR signal, which then allows read modify write cycles. 52 IOs is not a typo for the 64pin version of the CPLD. Of course it has some reserved pins for GND and Vcc and so on. It's not necessary to have one pin for each of the 72 Macrocells. Some of the macrocells will be internal then. And we don't need to have pins for the shift register - only for the latch/counter. I also saw a PLC44 version of this CPLD, which has 36 IOs. Maybe in the end this would be enough. PS: Don't know why I don't have linebreaks when I post with the PS3 but ... You can decide in your personal design if you want the WR to be a dedicated pin of the propeller or not. We have a 31 bit wide shift register/latch/counter. All unused bits can be used as output pin expansion of the propeller. It's only a matter of driver programming.

Post Edited (MagIO2) : 5/7/2009 3:54:50 PM GMT

jazzed · 2009-05-07 16:39

MagIO2, I thought early on about just having the start bit be an enable; this way an "assert" change in enable loads the address and while enabled, the CPLD can count. I abandoned that though for some reason. The design I have today uses the length "tuple" ... that could change to save a byte for throughput and allow a 23 bit address range because we still need the WE bit ....

Regarding Write Enable (WE), the more work the Propeller has to do, the slower the interface will be. Perhaps writing is not as critical performance wise as reading, but it would be easier to manage with the CPLD. BTW, you can't just keep WE asserted and change the address; data will get corrupted.

Regarding clock, it would be a reasonable compromise to connect a 3 pin jumper between an XO driver output, a Propeller IO pin, and the CPLD clock input. This would de-risk the issue and allow clock flexibility, but it would be a waste of PCB real estate. Best to de-risk cut/jump though.

By the way, I think I have an incrementing asynchronous read32 that uses the same number of instructions as the decrementing read32 ... it is not symmetrical though.


hippy · 2009-05-07 17:12

lonesock said...

              mov       val,ina     'xxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAA .
              ror       val,#17 wc  'xxxxxxxxxAAAAAAAAxxxxxxxxxxxxxxx A0
              movi      val,ina     'bBBBBBBBBAAAAAAAAxxxxxxxxxxxxxxx A0
              shr       val,#8      '........bBBBBBBBBAAAAAAAAxxxxxxx A0
              movi      val,ina     'cCCCCCCCCBBBBBBBBAAAAAAAAxxxxxxx A0
              shr       val,#8      '........cCCCCCCCCBBBBBBBBAAAAAAA A0
              movi      val,ina     'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A0
              rcl       val,#1      'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A0

Very neat and very impressive. In terms of 32-bit data transfers ( I'm looking at executing code from XMM rather than just getting the data ) - PASM 20 MIPS, LMM 5 MIPS, XMM 2 MIPS which isn't too bad, although that considers instructions streamed in and not any random addressing overhead.

In terms of the block transfer protocol; would it be possible to minimise the overhead ? That is a command to set block size ( 1..128 longs ) which will be a one-off command for many uses, another to just fetch the next block, maybe another to set address without having to send all address bytes ? Maybe that needs various data transfer modes to be configured ?

Perhaps the WLLLLLLL command could be WxxxxLLL where L sets the 2^L size of block giving 4 bits of other control info, such as indicating how many address bytes follow ( None, 1, 2 or 3 ); WxxAALLL and still 2 bits spare.

This would help in executing code where one wants a single long at a time and a minimum overhead to fetch the next long/bytes is desirable
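
A quick sketch of how the suggested WxxAALLL command byte could be packed and unpacked. The bit positions follow the WxxAALLL string read left to right (W in bit 7, two spare bits, AA in bits 4..3, LLL in bits 2..0); taking the block size in longs, so that LLL = 0..7 covers the 1..128 longs mentioned above, is an assumption, and the function names are hypothetical.

```python
def encode_command(write, address_bytes, block_longs):
    """Pack W, the number of address bytes that follow (0..3) and a 2^L block size."""
    if address_bytes not in (0, 1, 2, 3):
        raise ValueError("address_bytes must be 0..3")
    if not 1 <= block_longs <= 128 or block_longs & (block_longs - 1):
        raise ValueError("block_longs must be a power of two from 1 to 128")
    lll = block_longs.bit_length() - 1            # log2 of the block size
    return (0x80 if write else 0x00) | (address_bytes << 3) | lll

def decode_command(byte):
    """Return (write, address_bytes, block_longs) for a command byte."""
    return bool(byte & 0x80), (byte >> 3) & 0x03, 1 << (byte & 0x07)

cmd = encode_command(write=False, address_bytes=3, block_longs=1)
print(f"{cmd:08b}", decode_command(cmd))   # 00011000 (False, 3, 1)
```

A single-long fetch that reuses the previous address would then be just the one command byte with AA = 0, which is the low-overhead case described above.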

Also in streaming a block; at what rate do the longs clock out ? This depends on how the Cog is streaming them in, whether an un-rolled loop or in a loop where there will need to be a pause while any 'djnz' and pointer update occurs. One of those control bits could be 'all burst', 'burst-pause-burst-pause..."

I'm assuming that whatever comes out of this will be a 'de facto XMM implementation' which everyone will seize upon so if it can be optimised for fastest 'one long at a time' fetch as well as burst mode blocks that should suit all uses and maximise interest.

jazzed · 2009-05-07 18:28

hippy said...

Perhaps the WLLLLLLL command could be WxxxxLLL where L sets the 2^L size of block giving 4 bits of other control info, such as indicating how many address bytes follow ( None, 1, 2 or 3 ); WxxAALLL and still 2 bits spare.

I like this, but too much "coding" flexibility will require more real-estate and a more expensive CPLD. Just having a fixed number of addresses will be an advantage in many ways including potential performance issues (it will take cycles to construct the first byte).

As far as clocking in/out, the "header bytes" WxxAALLL, AAAAAAAA, ... can be pushed in consecutively (CLK*1). The "data bytes" for long READ access will need CLK*2 to give the hardware time to "deliver" the goods ... Cluso99 made this clear. Long WRITE access will only need CLK*1 ... but again CPLD cost and size may impact that.

A simple implementation CPLD would cost < $4 (qty 1) for the Xilinx part.

Added: Here is an incrementing asynchronous version of the long read.

loadxmm
               ' ror ... wc latches the original bit 0 (A0); shl ... wc would latch bit 31 instead
               mov      outa,   addr
               shl      outa,   #8
               nop
               mov      val,    ina     'xxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAA .
               add      outa,   #$100
               ror      val,    #8 wc   'AAAAAAAAxxxxxxxxxxxxxxxxxxxxxxxx A0
               movs     val,    ina     'AAAAAAAAxxxxxxxxxxxxxxxbBBBBBBBB A0
               add      outa,   #$100
               ror      val,    #24     'xxxxxxxxxxxxxxxbBBBBBBBBAAAAAAAA A0
               shl      val,    #7      'xxxxxxxxbBBBBBBBBAAAAAAAA0000000 A0
               movi     val,    ina     'cCCCCCCCCBBBBBBBBAAAAAAAA0000000 A0
               add      outa,   #$100
               shr      val,    #8      '........cCCCCCCCCBBBBBBBBAAAAAAA A0
               movi     val,    ina     'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A0
               rcl      val,    #1      'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A0
               ret
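
As a cross-check on the rotate amounts and carry handling of an incrementing read, here is a small Python model. It is a sketch, not tested PASM: it takes the carry from the first rotate (`ror val,#8 wc`, which latches the original bit 0, i.e. A0), uses `ror val,#24` to bring A and B down to bits 15..0, and a plain `shl val,#7`. The pin map (address on P8..P27, data on P0..P7), the `bytes` object standing in for the SRAM and the helper names are all assumptions.

```python
MASK32 = 0xFFFFFFFF

def ror(value, bits):
    """PASM ROR with wc: rotate right, carry = ORIGINAL bit 0."""
    return ((value >> bits) | (value << (32 - bits))) & MASK32, value & 1

def movs(dest, src):
    """PASM MOVS: copy src[8:0] into dest[8:0], keep dest[31:9]."""
    return (dest & 0xFFFFFE00) | (src & 0x1FF)

def movi(dest, src):
    """PASM MOVI: copy src[8:0] into dest[31:23], keep dest[22:0]."""
    return (dest & 0x007FFFFF) | ((src & 0x1FF) << 23)

def loadxmm_incrementing(sram, addr):
    def ina(outa):
        # Address pins read back as driven, SRAM data appears on P0..P7.
        return (outa & 0x0FFFFF00) | sram[(outa >> 8) & 0xFFFFF]

    outa = (addr << 8) & MASK32        # mov / shl
    val = ina(outa)                    # mov  val,ina   (byte at addr)
    outa += 0x100
    val, carry = ror(val, 8)           # ror  val,#8 wc (C = A0)
    val = movs(val, ina(outa))         # movs val,ina   (byte at addr+1)
    outa += 0x100
    val, _ = ror(val, 24)              # ror  val,#24
    val = (val << 7) & MASK32          # shl  val,#7
    val = movi(val, ina(outa))         # movi val,ina   (byte at addr+2)
    outa += 0x100
    val >>= 8                          # shr  val,#8
    val = movi(val, ina(outa))         # movi val,ina   (byte at addr+3)
    return ((val << 1) | carry) & MASK32   # rcl val,#1

sram = bytes([0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77])
print(hex(loadxmm_incrementing(sram, 1)))   # 0x44332211
```

Here the byte at the lowest address ends up in bits 7..0, the opposite of the decrementing read.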


Post Edited (jazzed) : 5/7/2009 6:45:42 PM GMT

Cluso99 · 2009-05-08 03:51

Here is what I am doing with my next pcb...
A0... = P0...
D0..D7 = P24..P31 (corrected)

loadxmm
 mov outa, addr
 add addr, #4      '<-- nop reqd but can advance addr
 mov data, ina     'AAAAAAAA....
 add outa, #1
 shr data, #24     '000000000000000000000000AAAAAAAA
 mov d2, ina       'BBBBBBBB....
 add outa, #1
 and d2, hFF000000 'BBBBBBBB000000000000000000000000
 mov d3, ina       'CCCCCCCC....
 add outa, #1
 shr d3, #24       '000000000000000000000000CCCCCCCC
 mov d4, ina       'DDDDDDDD....
 and d4, hFF000000 'DDDDDDDD000000000000000000000000
 or data, d4       'DDDDDDDD0000000000000000AAAAAAAA
 or d2, d3         'BBBBBBBB0000000000000000CCCCCCCC
 rol d2, #16       '00000000CCCCCCCCBBBBBBBB00000000
 or data, d2       'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA
 ret
 
random
 mov outa, addr
 add addr, #1      '<-- nop reqd but can advance addr
 mov data, ina     'AAAAAAAA....
 shr data, #24     '000000000000000000000000AAAAAAAA
 ret

This method adds 2 instructions to XMM, but random is 1 instruction faster. The main disadvantage to my method is that I require the use of SI/SO and the Eeprom pins.
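
The mask-and-merge steps in the loadxmm listing above can be checked with a short Python model. This is only a sketch: it assumes the address is driven on the low pins and read back unchanged, the SRAM byte appears on P24..P31, a `bytes` object stands in for the SRAM, and `rol` and `loadxmm_p24` are made-up names.

```python
MASK32 = 0xFFFFFFFF
HFF000000 = 0xFF000000

def rol(value, bits):
    """PASM ROL: rotate left within 32 bits."""
    return ((value << bits) | (value >> (32 - bits))) & MASK32

def loadxmm_p24(sram, addr):
    def ina(outa):
        # Data byte on P24..P31, driven address bits read back on the low pins.
        return (sram[outa & 0x00FFFFFF] << 24) | (outa & 0x00FFFFFF)

    outa = addr                       # mov outa,addr
    data = ina(outa) >> 24            # mov data,ina / shr data,#24
    outa += 1
    d2 = ina(outa) & HFF000000        # mov d2,ina / and d2,hFF000000
    outa += 1
    d3 = ina(outa) >> 24              # mov d3,ina / shr d3,#24
    outa += 1
    d4 = ina(outa) & HFF000000        # mov d4,ina / and d4,hFF000000
    data |= d4                        # DDDDDDDD................AAAAAAAA
    d2 |= d3                          # BBBBBBBB................CCCCCCCC
    d2 = rol(d2, 16)                  # ........CCCCCCCCBBBBBBBB........
    return data | d2                  # DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA

sram = bytes([0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77])
print(hex(loadxmm_p24(sram, 4)))      # 0x77665544
```

Because the data sits on the top eight pins, a single shr or and is enough to isolate each byte, at the cost of the extra registers d2..d4.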


Post Edited (Cluso99) : 5/8/2009 8:31:11 AM GMT

jazzed · 2009-05-08 06:25

I thought about that some, but the possibility of writing the eeprom by accident scared me out of it. One could activate the boot eeprom write protect line for normal use and have a switch for reprogramming though. After boot, the serial ports are mostly dispensable one way or another.


virtuPIC · 2009-05-08 07:02

Cluso99 said...

...
D0..D7 = P25..P31

Shouldn't this be D0..D7 = P24..P31? I know, I am nitpicking, sorry.

Ever thought about a different wiring and using hub as target on chip memory or conversion buffer? Something like (untested!!!)
A0.. = P8..
D0..D7 = P0..P7

load_via_hub
  mov     outa, addr
  shl     outa, #8
  add     outa, #$100
  wrbyte  outa, hubbuf
  add     addr, #4
  add     outa, #$100
  wrbyte  outa, hubbuf
  'instruction
  add     outa, #$100
  wrbyte  outa, hubbuf
  'instruction
  wrbyte  outa, hubbuf
  'instruction
  'instruction
  rdlong  data, hubbuf
  return

Observations:

The code is a little shorter than loadxmm.
You can insert 4 other instructions executed in the stall time of hub accesses.
The code is slower if you don't insert such instructions.
The code is faster if you insert such instructions.
If you use this code without call / return there is space for two more instructions in the rdlong stall.
In burst transfers you can save the shl in the beginning.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Airspace V - international hangar flying!
www.airspace-v.com/ggadgets for tools & toys
