Big Spin - is it still a pipedream?

jmg · 2016-09-11 23:37

David Betz wrote: »

Nice! It should be possible to write an XMM driver for HyperRAM. You just need to write PASM read/write functions that run in a separate COG.

I find the HyperRAM Data annoyingly vague, but the devices show great promise.

Merging the P1 and HyperRAM specs gives a possible 80 opcodes inside the tCSM limit (<4us). About 20 of those set the address & preambles in the slowest access mode, which means it could burst up to 60 bytes from any given start address.

How many bytes can XMM sensibly read/buffer/cache ?

I think the CLK can run continuously on reads, with CS# used to control each read and then give the refresh budget
- ie set a timer carefully phased to the opcodes, running at 10MHz (20MHz DDR), plus some in-line PASM.
COG-HyperRAM transfers could then peak at 15MBytes/s, and that maybe halves when the COG-hub copy (done when CS=H) is included.
Once started, if care is taken with CS#, I think the timer phase will always be OK (even across Wr/Rd)?

Very similar code can be used for Double (ie 2xQuadSRAM), - that also allows(ignores) CLK when CS is hi, only here CLK is set to 20MHz not 10MHz, and the 2x needs nibble split on address (done before bursting).

A single 1xQuadSPI would run about half the CLK speed to allow Bus + (<<4), but would save needing pre-sort address nibbles in code. That results in ~ 5MBytes/sec burst from a single QuadSPI flash device.
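
The burst arithmetic above can be checked with a few lines. This is only a rough sketch: it assumes 4 clocks per PASM instruction and one byte moved per instruction during the burst, and the function name and parameters are made up for illustration, not taken from any driver.

```python
# Rough burst budget for a P1 COG driving HyperRAM.
# Assumptions (illustrative only): 4 clocks per PASM instruction,
# one byte transferred per instruction while bursting.

def burst_budget(clock_hz=80_000_000, tcsm_ns=4000, setup_ops=20, clocks_per_op=4):
    """Return (opcodes per CS# window, payload bytes, peak bytes/s)."""
    clocks_in_window = clock_hz * tcsm_ns // 1_000_000_000  # integer math avoids float rounding
    ops_per_window = clocks_in_window // clocks_per_op
    payload_bytes = ops_per_window - setup_ops              # what is left after address/preamble
    peak_bytes_per_s = payload_bytes * 1_000_000_000 / tcsm_ns
    return ops_per_window, payload_bytes, peak_bytes_per_s

ops, payload, peak = burst_budget()
print(f"{ops} opcodes, {payload} bytes per burst, {peak / 1e6:.1f} MBytes/s peak")
```

With the defaults this reproduces the 80 opcode, 60 byte, 15MBytes/s figures; that peak only counts the data phase, so the sustained rate is lower once the address setup and the CS=H time are included.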

David Betz · 2016-09-11 23:42

jmg wrote: »

David Betz wrote: »

Nice! It should be possible to write an XMM driver for HyperRAM. You just need to write PASM read/write functions that run in a separate COG.

I find the HyperRAM Data annoyingly vague, but the devices show great promise.

Merge P1 and HyperRAM specs, gives a possible 80 opcodes in the tCSM limit (<4us) , ~ 20 of those set the address in slowest access mode, which means it could burst up to 60 bytes, from a given start address.

How many bytes can XMM sensibly read/buffer/cache ?

You can use as much or as little as you want. We typically use somewhere between 2K and 8K bytes.

I think the CLK can be continual on read, and CS# used to read then give the refresh budget
- ie set a timer carefully phased to opcodes, set to 10MHz(20MHz DDR), and some in-line PASM,
COG-HR could peak at Data flows of 15MBytes/s, and maybe halves when COG-hub is included. (done when CS=H)
Once started, if care is used on CS#, I think the timer phase will always be ok (even across Wr/Rd) ?

Very similar code can be used for Double (ie 2xQuadSRAM), - that also allows(ignores) CLK when CS is hi, only here CLK is set to 20MHz not 10MHz, and the 2x needs nibble split on address (done before bursting).

A single 1xQuadSPI would run about half the CLK speed to allow Bus + (<<4), but would save needing pre-sort address nibbles in code. That results in ~ 5MBytes/sec burst from a single QuadSPI flash device.

jmg · 2016-09-12 00:16

David Betz wrote: »

jmg wrote: »

... which means it could burst up to 60 bytes, from a given start address.

How many bytes can XMM sensibly read/buffer/cache ?

You can use as much or as little as you want. We typically use somewhere between 2K and 8K bytes.

OK, amounts >> ~60 bytes would need to be done in chunks.
I guess if the handling COG has nothing else to do, it can just forward fetch until it either
a) Gets a change of address
b) Hits cache limit.

That may increase the system Icc and increase the Time-To-Jump, as it would need to finish the present burst, before being able to launch a new one.
On the other hand, 'running on empty' means a purely linear-read needs to wait a little, every 60 bytes.

What are the odds of getting to the end of 60 bytes, without needing a change of address ?
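
For what "done in chunks" might look like, here is a minimal sketch that splits one long read into bursts that each fit the ~60 byte window. The 60 byte limit comes from the estimate earlier in the thread; the function name is hypothetical, and a real driver would also stop early on a change of address or when the cache is full.

```python
def split_into_bursts(start, length, max_burst=60):
    """Yield (address, byte_count) pairs, each at most max_burst bytes long."""
    addr = start
    remaining = length
    while remaining > 0:
        count = min(max_burst, remaining)
        yield addr, count          # one CS#-low window per pair
        addr += count
        remaining -= count

bursts = list(split_into_bursts(0x1000, 150))
for addr, count in bursts:
    print(f"burst at {addr:#06x}, {count} bytes")
```

A purely linear 150 byte read therefore costs three CS# windows, which is the "wait a little, every 60 bytes" case.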

David Betz · 2016-09-12 00:32

jmg wrote: »

David Betz wrote: »

jmg wrote: »

... which means it could burst up to 60 bytes, from a given start address.

How many bytes can XMM sensibly read/buffer/cache ?

You can use as much or as little as you want. We typically use somewhere between 2K and 8K bytes.

OK, amounts >> ~60 bytes would need to be done in chunks.
I guess if the handling COG has nothing else to do, it can just forward fetch until it either
a) Gets a change of address
b) Hits cache limit.

That may increase the system Icc and increase the Time-To-Jump, as it would need to finish the present burst, before being able to launch a new one.
On the other hand, 'running on empty' means a purely linear-read needs to wait a little, every 60 bytes.

What are the odds of getting to the end of 60 bytes, without needing a change of address ?

Well, you're executing LMM instructions so that would be about 15 instructions.

jmg · 2016-09-12 00:43

David Betz wrote: »

Well, you're executing LMM instructions so that would be about 15 instructions.

Which is how much in bytes ?

David Betz · 2016-09-12 00:45

jmg wrote: »

David Betz wrote: »

Well, you're executing LMM instructions so that would be about 15 instructions.

Which is how much in bytes ?

You said 60 bytes so at 4 bytes per instruction (PASM), you get 15 instructions. This is one reason that XMM isn't terribly beneficial using a 64K EEPROM. If you use CMM mode, you can fit almost as much in hub RAM as you can fit in a 64K EEPROM in XMM mode since CMM code is more compact than XMM/LMM/PASM code.

David Betz · 2016-09-12 01:03

Also, this thread is about Spin. I'm not sure how Eric would add XMM support to FastSpin. He might use the same external memory drivers we've used for PropGCC or he might implement his own approach. My comments have been about the PropGCC XMM mode.

Dave Hein · 2016-09-12 01:07

Somebody mentioned using spin2cpp to implement XMM Spin a few hours ago.

Dave Hein wrote: »

With spin2cpp it's possible to convert Spin to C, and then compile it using the XMM model. So this capability has been around for a while.

ersmith · 2016-09-12 08:00

I've been thinking about a slightly different approach to XMM, one that keeps a cache in COG RAM instead of HUB RAM. This has the disadvantage of a much smaller cache (only 1K, probably) but the advantage that cached code can run at full speed instead of LMM speed. It'd be interesting to see how this trade-off would work in practice.

David Betz · 2016-09-12 10:15

ersmith wrote: »

I've been thinking about a slightly different approach to XMM, one that keeps a cache in COG RAM instead of HUB RAM. This has the disadvantage of a much smaller cache (only 1K, probably) but the advantage that cached code can run at full speed instead of LMM speed. It'd be interesting to see how this trade-off would work in practice.

Sounds good. Any chance we could retrofit this to PropGCC if it performs better?

ersmith · 2016-09-12 15:00

David Betz wrote: »

ersmith wrote: »

I've been thinking about a slightly different approach to XMM, one that keeps a cache in COG RAM instead of HUB RAM. This has the disadvantage of a much smaller cache (only 1K, probably) but the advantage that cached code can run at full speed instead of LMM speed. It'd be interesting to see how this trade-off would work in practice.

Sounds good. Any chance we could retrofit this to PropGCC if it performs better?

Anything's possible, especially in the Propeller world. It doesn't seem like XMM has had much traction in PropGCC though.

David Betz · 2016-09-12 15:19

ersmith wrote: »

David Betz wrote: »

ersmith wrote: »

I've been thinking about a slightly different approach to XMM, one that keeps a cache in COG RAM instead of HUB RAM. This has the disadvantage of a much smaller cache (only 1K, probably) but the advantage that cached code can run at full speed instead of LMM speed. It'd be interesting to see how this trade-off would work in practice.

Sounds good. Any chance we could retrofit this to PropGCC if it performs better?

Anything's possible, especially in the Propeller world. It doesn't seem like XMM has had much traction in PropGCC though.

It is certainly true that XMM never got a lot of traction. However, do you think FastSpin XMM will fare much better? I guess it might because there is still a preference here for Spin over C. However, you still have the problem that Parallax doesn't have any off-the-shelf boards that support XMM with anything other than 64K EEPROMs or SD cards.

Rsadeika · 2016-09-12 19:46

It is certainly true that XMM never got a lot of traction. However, do you think FastSpin XMM will fare much better?

The reason that I did not pursue PropGCC XMM is, because you could not use/start any COGs. So, if this will happen with a FastSpin XMM version, then of course I will not be using it. After all the Propeller is all about using the COGs, right?

Ray

David Betz · 2016-09-12 20:27

Rsadeika wrote: »

It is certainly true that XMM never got a lot of traction. However, do you think FastSpin XMM will fare much better?

The reason that I did not pursue PropGCC XMM is, because you could not use/start any COGs. So, if this will happen with a FastSpin XMM version, then of course I will not be using it. After all the Propeller is all about using the COGs, right?

Ray

There has been support in PropGCC for starting multiple XMM COGs for several years. Parallax just hasn't updated the version of PropGCC that comes with SimpleIDE so you can only use the new version by manually installing a new build.

Dave Hein · 2016-09-12 20:42

It really is time for Parallax to put out an update to SimpleIDE that contains the latest version of PropGCC. I mentioned this in another thread about 4 weeks ago, and the issue was that the latest version of PropGCC was incompatible with some things in the simple library. I've yet to see the issue list published. I can't understand why Parallax isn't working on this.

Cluso99 · 2016-09-13 06:27

Big Ooops...

I have been running near 512KB XMM Catalina C with other spin and pasm cogs for years! I thought PropGCC was supposed to do all those things???

yeti · 2016-09-13 08:30

Cluso99 wrote: »

Big Ooops...

https://github.com/parallaxinc/propgcc/blob/master/demos/multi-cog-xmmc/multi-cog-demo.c

ersmith · 2016-09-13 09:36

Cluso99 wrote: »

Big Ooops...

I have been running near 512KB XMM Catalina C with other spin and pasm cogs for years! I thought PropGCC was supposed to do all those things???

PropGCC has always been able to run multiple PASM COGs, even in XMM. The restriction was that the old (SimpleIDE) version couldn't start multiple XMM COGs; that is, you could only run XMM C code in one COG, with the other COGs being limited to PASM or COG C. The new version of PropGCC lifts this restriction.

David Betz · 2016-09-13 09:48

deleted

David Betz · 2016-09-13 11:43

ersmith wrote: »

I've been thinking about a slightly different approach to XMM, one that keeps a cache in COG RAM instead of HUB RAM. This has the disadvantage of a much smaller cache (only 1K, probably) but the advantage that cached code can run at full speed instead of LMM speed. It'd be interesting to see how this trade-off would work in practice.

How will you handle multiple COGs running XMM code? Will you still use a separate COG as an XMM memory driver like PropGCC does so that it can be shared among a number of XMM Spin COGs?

Cluso99 · 2016-09-13 12:09

Sorry, my misunderstanding of running multiple xmm cogs.

Problem here is the SRAM driver has to share its resource between multiple cog users. I'm not confident of anything general here being anywhere near efficient.

But having a single cog running XMM and multiple cogs running LMM plus Spin/PASM cogs makes more sense.

P2 will be different in that multiple LMM cogs compiled from GCC should be extremely efficient. My concern with multiple LMM cogs has always been power consumption and this still remains to be seen.

David Betz · 2016-09-13 12:13

Cluso99 wrote: »

Sorry, my misunderstanding of running multiple xmm cogs.

Problem here is the SRAM driver has to share its resource between multiple cog users. I'm not confident of anything general here being anywhere near efficient.

It's not so bad because arbitration between XMM COGs only has to happen on cache misses which are fairly slow anyway.

But having a single cog running XMM and multiple cogs running LMM plus Spin/PASM cogs makes more sense.

That was my thought initially as well but Parallax wanted the ability to run XMM in multiple COGs. The lack of that feature is what prevented them from promoting XMM in the early releases of PropGCC.

P2 will be different in that multiple LMM cogs compiled from GCC should be extremely efficient. My concern with multiple LMM cogs has always been power consumption and this still remains to be seen.

I assume no one will use LMM in P2 since we now have hubexec. Of course, XMM could still be interesting but there will be a huge performance penalty going from hubexec to XMM, much more than there is from going from LMM to XMM on P1.
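
On the earlier point that arbitration only has to happen on cache misses, a toy model may help. This is not how the PropGCC cache driver is written; threads stand in for COGs, a Python lock stands in for a hardware lock, and the class name, line size and lack of eviction are all simplifications assumed for illustration.

```python
import threading

class SharedXmmCache:
    """Toy shared cache: hits take no lock, only misses are serialized."""

    def __init__(self, backing, line_size=64):
        self.backing = backing        # stands in for external memory
        self.line_size = line_size
        self.lines = {}               # tag -> cached line (never evicted in this sketch)
        self.lock = threading.Lock()
        self.misses = 0

    def read_byte(self, addr):
        tag = addr // self.line_size
        line = self.lines.get(tag)    # fast path: no arbitration on a hit
        if line is None:
            with self.lock:           # slow path: one reader at a time fills a line
                line = self.lines.get(tag)
                if line is None:
                    base = tag * self.line_size
                    line = self.backing[base:base + self.line_size]
                    self.lines[tag] = line
                    self.misses += 1
        return line[addr % self.line_size]

backing = bytes(range(256)) * 4
cache = SharedXmmCache(backing)

def reader():
    for a in range(128):
        cache.read_byte(a)

workers = [threading.Thread(target=reader) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print("misses:", cache.misses)
```

Four readers walking the same 128 bytes only take the lock to fill two lines; every other access is a hit that never touches the lock.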

ersmith · 2016-09-13 12:15

David Betz wrote: »

ersmith wrote: »

I've been thinking about a slightly different approach to XMM, one that keeps a cache in COG RAM instead of HUB RAM. This has the disadvantage of a much smaller cache (only 1K, probably) but the advantage that cached code can run at full speed instead of LMM speed. It'd be interesting to see how this trade-off would work in practice.

How will you handle multiple COGs running XMM code? Will you still use a separate COG as an XMM memory driver like PropGCC does so that it can be shared among a number of XMM Spin COGs?

Ideally I'd like to have the XMM memory driver inside the same COG, but I don't know if that's going to work. My initial use case was actually a variant of CMM. The PropGCC CMM works by decompressing instructions one at a time and executing them. In FastSpin I wanted to try decompressing blocks of instructions. This would (a) potentially allow better compression (since more inter-instruction redundancy could be exploited within the block) and (b) allow an improved FCACHE (the whole block would run inside COG memory).

I then realized that XMM would essentially be the same thing, except that instead of decompressing the block of instructions we'd be loading them from external memory. Perhaps we could even combine XMM and CMM.

The tricky part is that if we want a large (1K) cache then we only have <1K of space left for the kernel and memory driver / decompressor. I think we might be able to do some of the simpler forms of XMM that way (e.g. I2C EEPROM, which is at least widely available), but you have more experience with that than I do, so please tell me if that's not practical.

Another option, of course, is to have a separate XMM memory driver like PropGCC does (maybe even to use the PropGCC ones) and use some of HUB memory as an L2 cache that the COGs load their L1 caches from.
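
The L1/L2 arrangement could be pictured roughly like this. It is only a sketch under assumed parameters: a single 1K block as the COG-side L1, a handful of hub-side L2 blocks with first-in-first-out replacement, and a byte string standing in for external memory. None of the names come from FastSpin or PropGCC.

```python
class TwoLevelCodeCache:
    """Toy model: one L1 block (COG RAM) backed by a few L2 blocks (hub RAM)."""

    def __init__(self, external, block_size=1024, l2_blocks=4):
        self.external = external      # stands in for flash/EEPROM/SRAM
        self.block_size = block_size
        self.l2_blocks = l2_blocks
        self.l1_tag = None
        self.l1_data = b""
        self.l2 = {}                  # tag -> block, kept in insertion order
        self.external_reads = 0

    def _load_l2(self, tag):
        if tag not in self.l2:
            if len(self.l2) >= self.l2_blocks:
                del self.l2[next(iter(self.l2))]   # evict the oldest block (FIFO)
            base = tag * self.block_size
            self.l2[tag] = self.external[base:base + self.block_size]
            self.external_reads += 1               # the slow external access
        return self.l2[tag]

    def fetch(self, addr):
        tag = addr // self.block_size
        if tag != self.l1_tag:                     # L1 miss: refill from hub L2
            self.l1_data = self._load_l2(tag)
            self.l1_tag = tag
        return self.l1_data[addr % self.block_size]

external = bytes(i % 251 for i in range(8192))
code_cache = TwoLevelCodeCache(external)
first = code_cache.fetch(0)       # misses L1 and L2, reads external memory
second = code_cache.fetch(1500)   # different block, second external read
third = code_cache.fetch(4)       # L1 miss, but block 0 is still in L2
print(code_cache.external_reads)
```

Jumping back to a recently used block costs a hub-to-COG copy rather than another external read, which is the point of keeping an L2 in hub memory.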

David Betz · 2016-09-13 12:25

ersmith wrote: »

David Betz wrote: »

ersmith wrote: »

I've been thinking about a slightly different approach to XMM, one that keeps a cache in COG RAM instead of HUB RAM. This has the disadvantage of a much smaller cache (only 1K, probably) but the advantage that cached code can run at full speed instead of LMM speed. It'd be interesting to see how this trade-off would work in practice.

How will you handle multiple COGs running XMM code? Will you still use a separate COG as an XMM memory driver like PropGCC does so that it can be shared among a number of XMM Spin COGs?

Ideally I'd like to have the XMM memory driver inside the same COG, but I don't know if that's going to work. My initial use case was actually a variant of CMM. The PropGCC CMM works by decompressing instructions one at a time and executing them. In FastSpin I wanted to try decompressing blocks of instructions. This would (a) potentially allow better compression (since more inter-instruction redundancy could be exploited within the block) and (b) allow an improved FCACHE (the whole block would run inside COG memory).

I then realized that XMM would essentially be the same thing, except that instead of decompressing the block of instructions we'd be loading them from external memory. Perhaps we could even combine XMM and CMM.

The tricky part is that if we want a large (1K) cache then we only have <1K of space left for the kernel and memory driver / decompressor. I think we might be able to do some of the simpler forms of XMM that way (e.g. I2C EEPROM, which is at least widely available), but you have more experience with that than I do, so please tell me if that's not practical.

I think a SPI flash driver could be made pretty small. I assume we're only talking about XMMC where only code is in external memory? In that case, you only need to support reading and not writing. The loader would have to use a different driver to write the image to flash. I think i2c EEPROM might require more code and not be practical but I haven't tried writing a highly optimized EEPROM driver. I suspect others here may know more about that than I do. How small is it possible to make a read-only PASM EEPROM driver?

Another option, of course, is to have a separate XMM memory driver like PropGCC does (maybe even to use the PropGCC ones) and use some of HUB memory as an L2 cache that the COGs load their L1 caches from.

You could certainly use the PropGCC external memory drivers but I'd probably like to change the API a bit before we do that. I patterned my API after Chip's SDRAM API that he defined for P2-hot thinking we could use the same interface on P2 eventually. That was before the new P2 got hubexec and 512K of hub memory which may make XMM less necessary. In any case, there probably isn't any reason to have the P1 and P2 API match.

Dave Hein · 2016-09-13 13:43

The read-only portion of the pasm_i2c_driver is about 144 longs. There's probably some optimization that could be done to make it a bit smaller, so 144 is an upper limit on the size of the driver code for I2C.

David Betz · 2016-09-13 13:53

Dave Hein wrote: »

The read-only portion of the pasm_i2c_driver is about 144 longs. There's probably some optimization that could be done to make it a bit smaller, so 144 is an upper limit on the size of the driver code for I2C.

That doesn't sound bad at all! As long as Eric can fit his kernel in the remaining space a COG-based cache might be possible. You'd still have to use some sort of locking if you want multiple COGs to run XMM code but that is certainly possible.

Cluso99 · 2016-09-13 15:07

Yes, I mean hubexec rather than LMM obviously. But my concerns for power are if all cogs are running hubexec then we will have a lot of power being consumed by 16 sets of hub being accessed every clock cycle. If you look at it realistically, then each instruction takes 2 clocks so for each 16 clocks you get 16 instructions. But you need to account for jumps and also for data hub accesses too.

As for XMM, I2C will be too slow, even with the faster ones. 2x Quad SPI might make the cut, as might DRAM, but my preferred option will be SRAM if I need external memory other than SD.

David Betz · 2016-09-13 15:26

Cluso99 wrote: »

Yes, I mean hubexec rather than LMM obviously. But my concerns for power are if all cogs are running hubexec then we will have a lot of power being consumed by 16 sets of hub being accessed every clock cycle. If you look at it realistically, then each instruction takes 2 clocks so for each 16 clocks you get 16 instructions. But you need to account for jumps and also for data hub accesses too.

As for XMM, I2C will be too slow, even with the faster ones. 2x Quad SPI might make the cut, as might DRAM, but my preferred option will be SRAM if I need external memory other than SD.

Does Catalina XMM use a cache? Have you heard anything from RossH lately? Does he plan to update Catalina to generate P2 code?

Publison · 2016-09-13 17:25

David Betz wrote: »

Cluso99 wrote: »

Yes, I mean hubexec rather than LMM obviously. But my concerns for power are if all cogs are running hubexec then we will have a lot of power being consumed by 16 sets of hub being accessed every clock cycle. If you look at it realistically, then each instruction takes 2 clocks so for each 16 clocks you get 16 instructions. But you need to account for jumps and also for data hub accesses too.

As for XMM, I2C will be too slow, even with the faster ones. 2x Quad SPI might make the cut, as might DRAM, but my preferred option will be SRAM if I need external memory other than SD.

Does Catalina XMM use a cache? Have you heard anything from RossH lately? Does he plan to update Catalina to generate P2 code?

Probably not. He hasn't been on since March 2015. He started a resort or something like that. Maybe the guys from OZ talk to him directly.

jmg · 2016-09-13 21:02

David Betz wrote: »

I think a SPI flash driver could be made pretty small. I assume we're only talking about XMMC where only code is in external memory? In that case, you only need to support reading and not writing. The loader would have to use a different driver to write the image to flash. I think i2c EEPROM might require more code and not be practical but I haven't tried writing a highly optimized EEPROM driver. I suspect others here may know more about that than I do. How small is it possible to make a read-only PASM EEPROM driver?

The main issue with i2c driver would be speed, not so much size.
i2c is single-bit wide, and MHz limited, both are killers.

QuadSPI gives a boost, and 2 x QuadSPI gets very similar to HyperRAM(auto refresh DRAM) as both are 8b transfers.

HyperRAM read I think can be quite small, probably smaller than QuadSPI, tho some of the Init/set mode stuff QuadSPI needs might be able to go into the cache area (run once) ?

HyperRAM has no complicated mode stuff, and P1 can generate a continual Phased CLK* using a timer, and then you feed 8 or 9 bits (Data + CS) to address the read.
The main caveat with HyperRAM is a 4us CS# limit, to allow refresh windows.
That maps to appx 15 Longs @ 80MHz or 19L at 96MHz, once you remove the address-out overhead.
19L in 4us is a peak of 4.75ML/s, and taking 50% split, that is ~2+ Mops, which seems decent ?

* Thinking some more about the Phased Clock: if the user code uses WAITCNTs, that could change the phase, so I think every new block read should re-sync the counter just before CS=L.
HyperRAM would tolerate any P1 phase wobbles while CS=H.
