Is LUT sharing between adjacent cogs very important?

evanh · 2016-05-14 10:49

Hehe, someone's post-edited my word choice.

evanh · 2016-05-14 10:54

MJB wrote: »

Even if it might look like an idea is in conflict with your own thinking.

I was pointing out a logic flaw. I figure Cluso will let me know if I've got the wrong end of the stick. Which, admittedly, I do do a bit.

T Chap · 2016-05-14 10:58

If speed is still the issue with smartpins, I am puzzled why you can't have a 32bit register/bus per cog specifically for sharing data, and allow any cog to read the data on the register. If there were space, add 2 or 4 32bit registers per cog to allow lots more data per read. Wouldn't this bus/register allow for more data per read and faster reads? Seems rather slow to restrict data to 8 bits at 20 clocks.

Rayman · 2016-05-14 11:05

That would be like a non-pinned port C. I think Chip said earlier that it takes up too much logic to implement.

jmg · 2016-05-14 11:21

T Chap wrote: »

If speed is still the issue with smartpins, I am puzzled why you can't have a 32bit register/bus per cog specifically for sharing data, and allow any cog to read the data on the register.

You can, it comes down to the numbers of routing....
If you want 1 x 32b shared across any of 16, that's a 512 wide bus, and each COG needs a 16:1 MUX onto that bus to choose which COG to read. If you want independent R&W, then you double that, to a 1024 wide bus....
If you want 2 or 4 locations, double or quadruple those again... See the problem?
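(Editor's sketch of the arithmetic above, in Python. The function name and its parameters are illustrative only and are not taken from the P2 design; it simply multiplies out the figures quoted in this post.)

```python
def shared_bus_width(cogs=16, bits=32, registers_per_cog=1, separate_read_write=False):
    """Total wires for a bus where every cog exposes its registers to all others."""
    width = cogs * bits * registers_per_cog
    if separate_read_write:
        width *= 2  # independent read and write paths double the routing
    return width


# One, two and four shared registers per cog, without and with separate R/W paths.
for regs in (1, 2, 4):
    print(regs,
          shared_bus_width(registers_per_cog=regs),
          shared_bus_width(registers_per_cog=regs, separate_read_write=True))
```

Each COG would still need its own 16:1 mux on top of this, which the sketch does not count.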

T Chap wrote: »

If there were space, add 2 or 4 32bit registers per cog to allow lots more data per read. Wouldn't this bus/register allow for more data per read and faster reads? Seems rather slow to restrict data to 8 bits at 20 clocks.

Sure it is faster, but at high logic and area cost.

The LUT share/overlay, takes a port that already exists, and muxes 2:1 onto the adjacent COG on a half-left, half-right basis.
Because this follows the existing memory structure, this allows a large memory area to be shared with more modest Logic and routing cost.
It gives same-as-local speeds, to a nearby COG.
Simplicity and speed. (yes, classic P2 attributes)

One downside, technically, is that this can attach only to the adjacent N-1 and N+1 COGs, but a fast link to 2 COGs is way better than a fast link to zero COGs.

T Chap wrote: »

Seems rather slow to restrict data to 8 bits at 20 clocks.

Yes, something faster may be needed.
The DAC+Path is still useful, because it allows any-any selection, and reuses resource already there, so is almost free.
Chip has this in there now.

It frees up other HUB access to be aligned to one path, not fight between two.
It is faster for smaller payloads, 32b speed is not looking great, (actual numbers still tbf) but 8b speed is a gain over HUB.

evanh · 2016-05-14 11:24

8 bits is 8 clocks. The 20 clocks is for 32 bits.

T Chap · 2016-05-14 11:32

I vote put LUT back in. It is a regular struggle with P1 on graphics for memory and speed. Let's not hamstring the P2 for a handful of beginners that "might" hit a snag. They will sort it out if they do. Let the beginners use a P1.

Rayman · 2016-05-14 11:38

I think things like graphics need transfer of blocks of data, not single bytes so much. For that, we have the fast read/write already.

After some setup, you get a new long transferred every (what is it?) one or two clocks.

Anyway, for blocks of data, I think that is faster than this LUT stuff.

evanh · 2016-05-14 11:38

The Hub speed is much faster now. The FIFO will be dramatic for things like frame buffers.

T Chap · 2016-05-14 11:40

OK that sounds good.

David Betz · 2016-05-14 12:04

Rayman wrote: »

I think things like graphics need transfer of blocks of data, not single bytes so much. For that, we have the fast read/write already.

After some setup, you get a new long transferred every (what is it?) one or two clocks.

Anyway, for blocks of data, I think that is faster than this LUT stuff.

One thing I worry about is that 512K of hub memory still isn't enough to do modern day graphics. The STM32F7 chip and the Atmel ARM chips have external memory busses that allow you to connect an external RAM chip that maps into the processor address space. The STM32F7 also has a cache to speed up access to the external memory. Not sure about the Atmel ARM chips.

Rayman · 2016-05-14 12:10

I thought we were going to have an SDRAM interface at one point.
Don't know if that happened or not...
Maybe the streamer helps with that?

That would be the way to get high res graphics.

evanh · 2016-05-14 12:27

Yes, I have no idea how to set it up but SDRAM can apparently be burst transferred to/from using a Streamer. Presumably at half sys-clock rate given that's the fastest a Smartpin derived clock can go.

I'm guessing the setup steps to initiate a burst, or any SDRAM transaction for that matter, will be bit-bashed.

EDIT: Ah, a whole Cog will manage it. The bit-bashing required is extensive and clock exact irrespective of the Streamer doing the DMA portion. There is a ton of modes and in-line and on-going maintenance.

T Chap · 2016-05-14 13:41

How does the SDRAM work for graphics? You still need to park the data somewhere else like SD or EEPROM that gets loaded to SDRAM on boot? Then the SDRAM acts just like regular memory? The biggest P1 issue is the Cog RAM limit of 2k for any graphics that are presently displayed (not much graphics). Not sure how 512k or SDRAM helps out when the graphics engine is still very small (cog size).

evanh · 2016-05-14 14:08

Yep, say, a 2GB SD card loading graphics into, say, 8MB SDRAM, providing a couple of 2MB page flipping/scrolling frame buffers with space to spare. HubRAM becomes intermediary scanline buffer for matching timing between the SDRAM bursting and the scanline pixel bursts. Image data may never see CogRAM as the Streamer can do the scanline data directly.
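(Editor's sketch, in Python, of how much picture fits in one of the 2MB frame buffers mentioned above. The resolutions and bit depths are assumptions chosen for illustration, and "2MB" is taken to mean 2 × 1024 × 1024 bytes.)

```python
MIB = 1024 * 1024


def frame_bytes(width, height, bits_per_pixel):
    """Bytes needed for one full frame at the given size and colour depth."""
    return width * height * bits_per_pixel // 8


def fits(width, height, bits_per_pixel, buffer_bytes=2 * MIB):
    """True if one frame fits in a buffer of buffer_bytes."""
    return frame_bytes(width, height, bits_per_pixel) <= buffer_bytes


# Hypothetical display modes, checked against a 2MB buffer.
for mode in ((640, 480, 24), (800, 600, 24), (1024, 768, 16), (1024, 768, 24)):
    print(mode, frame_bytes(*mode), fits(*mode))
```

Under these assumptions 800x600 at 24 bits per pixel fits in one buffer, while 1024x768 at 24 bits per pixel does not.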

T Chap · 2016-05-14 14:18

Ah, that is going to be very nice. Then it will be easy to do 24bit color, slick transitions, fast screen changes, sprites, etc without the 2k graphics cap.

Publison · 2016-05-14 15:23

evanh wrote: »

Hehe, someone's post-edited my word choice.

Do you think Moderators sleep in on Saturdays?

Heater. · 2016-05-14 15:31

evanh,

Hehe, someone's post-edited my word choice.

I thought I imagined a change there.

I could not make any sense out of it. The replacement word seems just as bad as the original. Some would say worse.

Publison · 2016-05-14 16:01

Heater. wrote: »

evanh,

Hehe, someone's post-edited my word choice.

I thought I imagined a change there.

I could not make any sense out of it. The replacement word seems just as bad as the original. Some would say worse.

Poor choice of words with half a cup of coffee. Updated to be PG.

potatohead · 2016-05-14 18:13

Glad someone is paying attention. I didn't even notice.

Re: graphics

Yes, the Streamer moves it from the HUB. An SD RAM COG in tandem with a video signal COG should stream pixels in nicely. One or more graphics COGS will do other things buffered in the HUB.

The pixel mixer and color engine from HOT are there too. That gives us overlay effects and various bit depths to optimize with.

Things like character displays can run right in the COG, font and all too.

This one isn't hurting for video options. I've been chipping away at a project and have been able to manage assets in a simple bitmap, viewable directly on a PC. It's fast enough to single buffer many things.

Cluso99 · 2016-05-14 18:15

Evan,
I wasn't going to bother responding to you.
Yes, you have totally misquoted me. I have no idea where you got that context.

Cluso99 · 2016-05-14 18:41

I don't require COGNEW2.

I don't need an interrupt, although perhaps a write to LUT $1FF to set and read to clear might be a help to minimise power by waiting for an interrupt.

I can use multiple objects sharing cogs without having to map cogs.

I can start and stop cogs dynamically without specifying cog numbers.

Remember guys....
We now have 16 cogs and smart pins.
We have fast hub block transfers.
..... These mean the pressure to use shared LUT would be reduced mostly to those occasions where hub transfers will not work due to latency.

The whole cog allocation scenario does not go away with smart pin cog to cog transfers. It just places additional pressure to use cogs that are specifically spaced to achieve some results.

With smart pin cog to cog transfers...
Aren't we going to have more complex pin mapping???
Where do these extra smart pins come from???
What happens to those pins where the smart pins are used for cog to cog transfers??? Are they now unusable???

BTW My understanding of shared LUT is that ALL of cog n LUT is shared with cog n+1, while cog n also sees cog n-1 LUT. So we are sharing 2KB with each adjacent cog, in both directions, giving us an effective 4KB LUT although I don't think the extra 2KB can be used as such.

Heater. · 2016-05-14 19:17

Cluso,

I can start and stop cogs dynamically without specifying cog numbers.

How?

The only way I can see to do it is to use COGNEW repeatedly until you get the required adjacent pair of COGs. Then let the other COGs, that were not adjacent, go.

The argument is that in the general case that is prone to fail. As Chip said, perhaps there are no adjacent COGs even though there are enough free COGs.
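(Editor's sketch, in Python, of the "COGNEW until you get an adjacent pair" approach and the failure case described above. The allocator is hypothetical: it assumes COGNEW hands out the lowest-numbered free cog, and it treats only N and N+1 as adjacent, with no wrap-around from 15 to 0. Neither detail is confirmed in this thread.)

```python
NUM_COGS = 16


def cognew(busy):
    """Hypothetical allocator: claim the lowest-numbered free cog, or return None."""
    for cog in range(NUM_COGS):
        if cog not in busy:
            busy.add(cog)
            return cog
    return None


def grab_adjacent_pair(busy):
    """Claim cogs one at a time until two of them are adjacent, then release the rest."""
    claimed = []
    pair = None
    while True:
        cog = cognew(busy)
        if cog is None:
            break  # ran out of cogs without finding a pair
        neighbours = [c for c in claimed if abs(c - cog) == 1]
        claimed.append(cog)
        if neighbours:
            pair = tuple(sorted((neighbours[0], cog)))
            break
    for c in claimed:
        if pair is None or c not in pair:
            busy.discard(c)  # let the non-adjacent cogs go again
    return pair


# Eight cogs are free, but no two of them are neighbours, so the search fails.
busy = {1, 3, 5, 7, 9, 11, 13, 15}
print(grab_adjacent_pair(busy))

# Here cogs 3 and 4 are both free, so the search succeeds after releasing cog 1.
busy = {0, 2}
print(grab_adjacent_pair(busy), sorted(busy))
```

Note that the search temporarily holds every free cog it tries, so two objects running it at the same time could starve each other, which is the "COG war" worry raised below.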

What happens when I use your object X that uses some such tricks in my program, then I use some other object Y that does similar? Don't they fight with each other? A kind of COG War.

Hmm..."COG wars". Perhaps we should have such a competition.

Marcus76 · 2016-05-14 19:47

How expensive is the logic for switching the connections to LUTs for cog order agnosticism? If you do a simplistic grid of muxes to switch the buses, then you would need them switched at 256 points for 16 LUTs (you could probably put the muxes in trees to half that). However, if the LUTs are more limited (e.g. you have 4 of them that can be allocated), then you will have significantly fewer points where the buses have to be switched. So while you couldn't make a chain of 16 cogs working together, you could set up a few of them to have higher speed connections in an order agnostic way.
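(Editor's sketch of the switch-point count in this post, in Python. It assumes a plain crossbar where every cog can reach every shareable LUT, so the count is just cogs times LUTs; the function name is illustrative, and the tree-of-muxes saving mentioned above is not modelled.)

```python
def crosspoints(cogs, luts):
    """Switch points in a full grid connecting each cog to each shareable LUT."""
    return cogs * luts


print(crosspoints(16, 16))  # every cog's LUT is shareable
print(crosspoints(16, 4))   # only four allocatable shared LUTs
```

Under that assumption, limiting the pool to four shared LUTs cuts the grid from 256 points to 64.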

MJB · 2016-05-14 21:00

Cluso99 wrote: »

...

BTW My understanding of shared LUT is that ALL of cog n LUT is shared with cog n+1, while cog n also sees cog n-1 LUT.
So we are sharing 2KB with each adjacent cog, in both directions, giving us an effective 4KB LUT although I don't think the extra 2KB can be used as such.

here's what Chip posted

cgracey · May 5 (edited May 5)
You do, in effect, get access to the lower AND upper cog. Your LUT is accessible by the lower cog and the upper cog's LUT is accessible by you. This doesn't take much hardware, at all. Right now, we are using 111,758 ALMs out of 113,560. We've got about 112 ALM's for each cog left. I'm already using the 'area-aggressive' setting in the fitter. This is it!
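(Editor's check of the figures in Chip's post, in Python. The two ALM counts are quoted from the post; the divisor of 16 is the cog count mentioned elsewhere in this thread.)

```python
TOTAL_ALMS = 113_560  # ALMs available, as quoted
USED_ALMS = 111_758   # ALMs in use, as quoted
COGS = 16

free_alms = TOTAL_ALMS - USED_ALMS
free_per_cog = free_alms / COGS
utilisation = USED_ALMS / TOTAL_ALMS

print(free_alms, free_per_cog, f"{utilisation:.1%}")
```

The remainder works out to roughly 112 ALMs per cog, matching the figure quoted in the post.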

jmg · 2016-05-14 21:01

Cluso99 wrote: »

BTW My understanding of shared LUT is that ALL of cog n LUT is shared with cog n+1, while cog n also sees cog n-1 LUT. So we are sharing 2KB with each adjacent cog, in both directions, giving us an effective 4KB LUT although I don't think the extra 2KB can be used as such.

Almost, but not quite. See Chip's reply above to my question.
The LUT mapping is easier to follow if you think physical and dual-port.
( a Full LUT each way needs more ports)

It is laid out dual-port done half-left, half-right, so N-1 overlays half its LUT memory upwards, and N+1 overlays half downward.

Total net gain is 0.5+0.5 LUT, and the centre COG can talk to TWO other cogs with equal speed.

Technically possible are :
Two Cogs as Master-Minion, or Master-Master
Three COGS as Minion-Master-Minion
Four COGs as Minion-Master-Master-Minion.
Each master has a Minion COG and a 50% sharing with the other master.
Simple, but still quite flexible.
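(Editor's sketch of the neighbour-only topology described above, in Python. The function name is illustrative. Whether cog 15 and cog 0 count as neighbours is not stated in this thread, so wrap-around is left as an explicit option that defaults to off.)

```python
def lut_partners(cog, num_cogs=16, wrap=False):
    """Cogs that could share LUT space with `cog`: only N-1 and N+1."""
    partners = []
    for other in (cog - 1, cog + 1):
        if wrap:
            partners.append(other % num_cogs)
        elif 0 <= other < num_cogs:
            partners.append(other)
    return partners


print(lut_partners(5))             # a middle cog can talk to two others
print(lut_partners(0))             # an end cog has one partner without wrap-around
print(lut_partners(0, wrap=True))  # two partners if the ends are joined (assumption)
```

In the Minion-Master-Minion arrangement the master is the cog whose partner list contains both minions.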

I think the 2nd port map shares with the Streamer, so any LUT operation that is not streamer-using, should work in the mapped memory ?

Cluso99 · 2016-05-14 21:34

jmg wrote: »

Cluso99 wrote: »

BTW My understanding of shared LUT is that ALL of cog n LUT is shared with cog n+1, while cog n also sees cog n-1 LUT. So we are sharing 2KB with each adjacent cog, in both directions, giving us an effective 4KB LUT although I don't think the extra 2KB can be used as such.

Almost, but not quite. See Chip's reply above to my question.
The LUT mapping is easier to follow if you think physical and dual-port.
( a Full LUT each way needs more ports)

It is laid out dual-port done half-left, half-right, so N-1 overlays half its LUT memory upwards, and N+1 overlays half downward.

Total net gain is 0.5+0.5 LUT, and the centre COG can talk to TWO other cogs with equal speed.

Technically possible are :
Two Cogs as Master-Minion, or Master-Master
Three COGS as Minion-Master-Minion
Four COGs as Minion-Master-Master-Minion.
Each master has a Minion COG and a 50% sharing with the other master.
Simple, but still quite flexible.

I think the 2nd port map shares with the Streamer, so any LUT operation that is not streamer-using, should work in the mapped memory ?

From Chip's description you quoted, my understanding is different to yours.

E.g. Cog 5 has all of its LUT readable by Cog 4 (Cog 4 will use a different instruction, wr/rdxxx; I cannot recall the actual instruction). Cog 5 uses wr/rdlut to access its own LUT. Also, Cog 5 can access all of Cog 6's LUT using wr/rdxxx.

I am not saying I am correct mind you.

jmg · 2016-05-14 21:42

Cluso99 wrote: »

From Chip's description you quoted, my understanding is different to yours.

Search is hopeless, but I asked Chip (somewhere) explicitly if it was all-left, all-right sharing, or half-left, half-right (which meets layout and Dual port rules) and he replied the latter.
It is a minor detail, but it helps users picture what is possible.

* Found it manually trawling, back on page 8....
cgracey: "It overlaid halves, as in your second supposition, but allocating any two adjacent cogs would give you sharing"

Bill Henning · 2016-05-14 21:53

Did not notice what you wrote at the end... but yes, you did suggest something very similar earlier!

The only difference is that I was suggesting a single cog image for both, conditional on WC (= cogid & 1), whereas you were suggesting the first one starting the second.

David Betz wrote: »

Bill Henning wrote: »

(I have not read the whole thread, so this may have already been suggested)

what about a

COGNEW2 wc

that starts a cog pair from the same cog image, with same par?

The code in the cog could then use the lowest bit of its cogid (ideally placed in the carry flag) to run odd/even cog code...

Didn't I just say that? :-)

Cluso99 · 2016-05-14 21:56

Jmg,
Thanks. I wonder if it would've been simpler if implemented as I thought. Perhaps there is a little less decoding my way. The LUT is still built as 2KB but doesn't need to be split in halves requiring decoding and more r & w strobes.
