New Hub Scheme For Next Chip

cgracey · 2014-05-15 21:24

David Betz wrote: »

Any chance hub execution will appear in this new chip or has that idea been abandoned?

I've got the instructions all mapped out to support it, but I will implement it later.

David Betz · 2014-05-15 21:26

cgracey wrote: »

I've got the instructions all mapped out to support it, but I will implement it later.

A P3 feature then I guess. Sorry for sidetracking the new hub discussion.

jmg · 2014-05-15 21:32

Cluso99 wrote: »

Is 'N' a global across the entire chip, or can each cog have a different 'N'?

Each COG has a different N, from my understanding.

Cluso99 wrote: »

I am not sure how this video is going to work...

It is going to get its video data directly from Hub
Fantastic, but doesn't that mean that the cog will not be able to use hub while this is happening?

Yes, while the HUB is streaming data to the HUB, at fSys there is no spare time so the COG waits.
- that leaves Blanking and Flyback times for local COG work or another COG works on the pixel-building, full time.

For slower fSys/N, in theory there are spare HUB slots going past, but a LUT Video form is going to steal the COG Bus to lookup into COG RAM, so that means the COG is frozen for alt least those cycles.

Extra HW would be needed to allow COG execute and Non-LUT Streaming and fSys/N N >= 2, but is that worth bothering with ?

Sounds like the sort of thing one might look at, when the first Pass FPGA is being tested.

cgracey · 2014-05-15 21:39

The way I see the video working, an initial block (16 longs) of hub memory would get spooled up over 16 clocks into 512 flops, then it would be transferred into the video shifter (another 512 flops). The next block would immediately get spooled up, as well, into the first set of flops so that a spooled block is always waiting to be transferred to the shifter on the clock that it is needed. For Fsys/1 cases, the spooling would go on continuously. The trouble is, this takes 1000+ flops per cog, which is 16k+ for the chip. If we cut the hub transfer from 32 bits per clock down to 16 bits per clock, we'd nearly halve the gates in the hub memory mux mechanism. We'd also halve the number of flops in the video circuits. This would result in a hub<-->cog transfer rate of one long every two clocks, which would perfectly match hub exec requirements. Hub exec would be able to execute at 100% speed in a straight line, but then need to wait for the window to come around for a new branch address. This would mean that for video, only 16-bit pixels could be streamed from memory at Fsys/1 (no more 32-bit possibility at the Fsys/1 rate). What do you guys think about this compromise?

I think the video shifter also needs to have modes where instead of all 4 DACs getting updated every clock, you could do one or two DACs, instead, so that you get more efficiency from your data, when only one or two DAC updates are needed. This would help composite video signal generation quite a bit.

potatohead · 2014-05-15 21:40

@jmg: It means a bitmap can be done in a single COG, other things are going to require two COGS.

Edit: Just saw Chip's post.

I think it's reasonable Chip. That still allows for very nice, high color depth resolutions. Definitely think about the modes. They will be useful. Because the pixel clocks are fixed, we may well want to arrange data in creative ways to maximize the stream.

Agreed on composite.

Nice thing about that is the low sweep frequency. We've got a lot of time to get it done.

Cluso99 · 2014-05-15 21:49

jmg wrote: »

Each COG has a different N, from my understanding.

That wasn't from me.

Yes, while the HUB is streaming data to the HUB, at fSys there is no spare time so the COG waits.
- that leaves Blanking and Flyback times for local COG work or another COG works on the pixel-building, full time.

For slower fSys/N, in theory there are spare HUB slots going past, but a LUT Video form is going to steal the COG Bus to lookup into COG RAM, so that means the COG is frozen for alt least those cycles.

I am fine with this, as I said further down in my post.

It may be possible to still execute within the cog (no hub accesses), similar to waitvid (but without hub access), but I don't care either way.

Extra HW would be needed to allow COG execute and Non-LUT Streaming and fSys/N N >= 2, but is that worth bothering with ?

Sounds like the sort of thing one might look at, when the first Pass FPGA is being tested.

I don't think it's worth it either.

I was more interested in understanding the proposed mechanism. My video understanding is limited.
I would like to be able to do Composite color NTSC and VGA (only one at a time is fine). There are really cheap and small composite video monitors out there for cars.

jmg · 2014-05-15 22:00

cgracey wrote: »

The way I see the video working, an initial block (16 longs) of hub memory would get spooled up over 16 clocks into 512 flops, then it would be transferred into the video shifter (another 512 flops). The next block would immediately get spooled up, as well, into the first set of flops so that a spooled block is always waiting to be transferred to the shifter on the clock that it is needed. For Fsys/1 cases, the spooling would go on continuously. The trouble is, this takes 1000+ flops per cog, which is 16k+ for the chip. If we cut the hub transfer from 32 bits per clock down to 16 bits per clock, we'd nearly halve the gates in the hub memory mux mechanism. We'd also halve the number of flops in the video circuits. This would result in a hub<-->cog transfer rate of one long every two clocks, which would perfectly match hub exec requirements. Hub exec would be able to execute at 100% speed in a straight line, but then need to wait for the window to come around for a new branch address. This would mean that for video, only 16-bit pixels could be streamed from memory at Fsys/1 (no more 32-bit possibility at the Fsys/1 rate). What do you guys think about this compromise?

There is more than one question here.
Halving the Bus widths is the same as halving the Clock speed, which seems a shame if the 32b Counters can run at 200MHz, and the memory can MUX at 200MHz, and it is a 32b core.

On the Video streaming, I think there is a much more frugal solution than 1000+ flops per cog, (yikes)

The problem is best shifted to one of matching the Address-change to the Hub Rotate for each fSys/N

That means the Memory scanned pattern changes with each N, but if you write using the same rules as you use to read, that detail is immaterial. Most video lines are build in pixel-order anyway.

- that needs a couple of 4b Adders and 4b counters.
I have Odd-N working with full Memory coverage, and am adding in a fix to change Even-N from Sparse, to full Memory coverage. That will then give fSys/N, if done via DMA, I think from 1..17 & the same Adder is used in SW to Write the Memory, either in this COG, or other COGS.

potatohead · 2014-05-15 22:05

Cluso, the basic idea is composite is two signals combined. There is the monochrome signal, and it's easy. Syncs and levels. The faster the levels, the more pixels there are, etc... The color signal consists of the reference colorburst, and then a modulated color signal which gets combined with the monochrome signal.

In the end, it's all just one pixel stream. On the P1, there was a little circuit that did the color modulation, and it got added to the monochrome signal, and it got used to do the color burst. That just made things easy. Truth is, the P1 could have and has done the entire signal with no color circuit at all! And that's what this chip will do.

We just express the whole signal as DAC levels, not just the monochrome part. There are various ways to do that too.

cgracey · 2014-05-15 22:06

Maybe it's better to stick with 32-bit transfers per clock, after all. It seemed convenient to match the hub data rate with the instruction rate for hub-exec, but that would introduce two more clocks into the PC-to-instruction latency, creating, in effect, two more stages of pipeline, which would be a mess to deal with. I think it's perhaps necessary for hub exec and video to have 16 longs' worth of flops for caching. The beauty of that arrangement is that it always fills in 16 clocks, no matter the initial hub window. So, when hub exec needs a new block, fill the cache and use it. Same for video. Video instructions would not be able to execute during hub exec.

potatohead · 2014-05-15 22:07

That means the Memory scanned pattern changes with each N, but if you write using the same rules as you use to read, that detail is immaterial. Most video lines are build in pixel-order anyway.

No. Let's do the flops. Combining things with a pattern change every N would be difficult to manage in software. Seriously complicated. Besides, wouldn't that then be different per COG? We really don't need that.

With the flops, we get a linear stream, and that's going to be important for a lot of things. The world consists of more than bitmaps.

Tubular · 2014-05-15 22:07

Can RDBLOC run continuously after those initial 2 setup cycles? Or does it need a setup before every 16 longs?

If so, just having a variation where the pipeline still picks up hub longs in the usual way (0, 8, 1, 9, 2, A etc), but always writes it to DACs register $1F7 (rather than 16 contiguous cog longs) might do the trick. It would allow 32 bits on every cycle, outputting at full Fsys/1 rate, with the onus on the programmer to order the data the way the hub picks up. For Fsys/2 its not an issue

The same technique could apply for OUTA

cgracey · 2014-05-15 22:09

jmg wrote: »

There is more than one question here.
Halving the Bus widths is the same as halving the Clock speed, which seems a shame if the 32b Counters can run at 200MHz, and the memory can MUX at 200MHz, and it is a 32b core.

On the Video streaming, I think there is a much more frugal solution than 1000+ flops per cog, (yikes)

The problem is best shifted to one of matching the Address-change to the Hub Rotate for each fSys/N

That means the Memory scanned pattern changes with each N, but if you write using the same rules as you use to read, that detail is immaterial. Most video lines are build in pixel-order anyway.

- that needs a couple of 4b Adders and 4b counters.
I have Odd-N working with full Memory coverage, and am adding in a fix to change Even-N from Sparse, to full Memory coverage. That will then give fSys/N, if done via DMA, I think from 1..17 & the same Adder is used in SW to Write the Memory, either in this COG, or other COGS.

But this would affect the global hub window order, wouldn't it? All cogs must see the same pattern for there to be all-cog/continuous access.

jmg · 2014-05-15 22:10

jmg wrote: »

For slower fSys/N, in theory there are spare HUB slots going past, but a LUT Video form is going to steal the COG Bus to lookup into COG RAM, so that means the COG is frozen for alt least those cycles.

Thinking some more on this, it might not be as complex as I thought.
The COG will have a Clock Enable already, and the WAIT is going to need to run outside that.
In that form COG is off for the whole Wait-Cell time.

If instead the Clock Enable is fed from a signal that is NOT of (Using_HUB_SLOT OR Using_LUT_BUS) then the COG cab auto-throttle. At full speed, COG pauses, at fSys/1 and fSys/2 the LUT pipeline effect also holds off the COG, but for fSYS/3 and slower, there are Clocks where neither HUB_Sot or Using_LUT_BUS over-rides apply,
For those, the COG could clock, at what will be a reduced speed.
Deciding who stops the Video line in a slow-cog mode, is a detail that needs working, if the Video is started with a count, it has it's own wait, and the COG could run-slow, then poll that, to keep in sync with Video.
Outside of Video-lines, COG gets 100% of Clock edges.

As a Clock Enable problem the HW is not complex, and some is already there.

Cluso99 · 2014-05-15 22:13

cgracey wrote: »

The way I see the video working, an initial block (16 longs) of hub memory would get spooled up over 16 clocks into 512 flops, then it would be transferred into the video shifter (another 512 flops). The next block would immediately get spooled up, as well, into the first set of flops so that a spooled block is always waiting to be transferred to the shifter on the clock that it is needed. For Fsys/1 cases, the spooling would go on continuously. The trouble is, this takes 1000+ flops per cog, which is 16k+ for the chip. If we cut the hub transfer from 32 bits per clock down to 16 bits per clock, we'd nearly halve the gates in the hub memory mux mechanism. We'd also halve the number of flops in the video circuits. This would result in a hub<-->cog transfer rate of one long every two clocks, which would perfectly match hub exec requirements. Hub exec would be able to execute at 100% speed in a straight line, but then need to wait for the window to come around for a new branch address. This would mean that for video, only 16-bit pixels could be streamed from memory at Fsys/1 (no more 32-bit possibility at the Fsys/1 rate). What do you guys think about this compromise?

I think the video shifter also needs to have modes where instead of all 4 DACs getting updated every clock, you could do one or two DACs, instead, so that you get more efficiency from your data, when only one or two DAC updates are needed. This would help composite video signal generation quite a bit.

Wow, this sure eats up a lot of space!

Do we require one video for each cog?
Maybe only one (or two) could be in the chip and grabbed by any cog?
I don't care if you take the whole cog's hub bandwidth. Either the cog can rest in a waitvid, or execute within the cog (no hub access). We now have 16 cogs after all.
Could you cut this size down if you took over the cogs hub access completely?

And remember, it's not on Ken's priority list

cgracey · 2014-05-15 22:14

Tubular wrote: »

Can RDBLOC run continuously after those initial 2 setup cycles? Or does it need a setup before every 16 longs?

If so, just having a variation where the pipeline still picks up hub longs in the usual way (0, 8, 1, 9, 2, A etc), but always writes it to DACs register $1F7 (rather than 16 contiguous cog longs) might do the trick. It would allow 32 bits on every cycle, outputting at full Fsys/1 rate, with the onus on the programmer to order the data the way the hub picks up. For Fsys/2 its not an issue

The same technique could apply for OUTA

But as soon as you want to do Fsys/2, it all blows up, doesn't it? Many video modes are going to want pixels at less than the top speed.

jmg · 2014-05-15 22:16

cgracey wrote: »

But this would affect the global hub window order, wouldn't it? All cogs must see the same pattern for there to be all-cog/continuous access.

No the Rotate is untouched, so all COGs are ok.
What you do is phase the COG Pointer so it is tracking the Rotator, at fSys/N.
The physically order of Nibble values changes with /N, but the coverage of Memory is 100%.

SW write and HW read have to use the same Adder-Algorithm, not sure if that means one adder+Muxex, or two adders.
If the COG is allowed to use spare Clocks in slow video cases, separate adders may be safer/faster.

potatohead · 2014-05-15 22:17

I think it's perhaps necessary for hub exec and video to have 16 longs' worth of flops for caching. The beauty of that arrangement is that it always fills in 16 clocks, no matter the initial hub window. So, when hub exec needs a new block, fill the cache and use it. Same for video. Video instructions would not be able to execute during hub exec.

Making these exclusive makes sense. A video signal COG can do some things during blanks, stable sync times, borders, blank lines, etc... Pretty much everything else will be done by one or more graphics COGS, and how they do it can depend on the signal COG. If it's a big bitmap, everybody does a region, etc... if it's scan line buffers, COGS tag team on buffers, etc...

We won't really need hubex to do signals, IMHO.

potatohead · 2014-05-15 22:19

The physically order of Nibble values changes with /N, but the coverage of Memory is 100%.

I'm not cool with the order changes, unless both modes are included. Having the order of things change like that makes higher than signal level code very complicated and slow.

Cluso99 · 2014-05-15 22:20

potatohead wrote: »

Cluso, the basic idea is composite is two signals combined. There is the monochrome signal, and it's easy. Syncs and levels. The faster the levels, the more pixels there are, etc... The color signal consists of the reference colorburst, and then a modulated color signal which gets combined with the monochrome signal.

In the end, it's all just one pixel stream. On the P1, there was a little circuit that did the color modulation, and it got added to the monochrome signal, and it got used to do the color burst. That just made things easy. Truth is, the P1 could have and has done the entire signal with no color circuit at all! And that's what this chip will do.

We just express the whole signal as DAC levels, not just the monochrome part. There are various ways to do that too.

Thanks for that potatohead. It helps a lot. As long as we can generate color composite using a cog I am a happy camper.
It might be possible to generate lower bandwidth VGA and a cog using this new chip too ??? (80 columns of color = 480 pixels horiz)

potatohead · 2014-05-15 22:22

Yeah. That's gonna happen. You know I like TV

Let's just hope the buffer memory arrangement stays simple, linear, sane.

Another way to think of it is putting the monochrome signal on one DAC, and the color signal on another, with a level shift. What happens when both get connected to the composite input?

Tubular · 2014-05-15 22:25

cgracey wrote: »

But as soon as you want to do Fsys/2, it all blows up, doesn't it? Many video modes are going to want pixels at less than the top speed.

Fsys/2 is where it would probably work nicely, since you'd be copying Longs 0, 1, 2, 3 etc in order.
Fsys/1 would still work but require the data to be interleaved in the hub, in advance (0,8,1,9,2 etc).

This is just an alternative (non-flop) way to achieve your proposed VID32, for the cases where N=1 or N=2, since it's difficult using the other (lots of flops) model.

For other N>2, and/or other bit depths, the other techniques would apply.

jmg · 2014-05-15 22:33

potatohead wrote: »

Combining things with a pattern change every N would be difficult to manage in software. Seriously complicated.

Nope, quite wrong - it is invisible, I mentioned the Physical memory to try to help readers grasp what is going on.
Seems I achieved the opposite....

Think of swapping pins on a PCB RAM memory.

Because you write and read the same, you have no idea the pins are swapped.
( Only If you have NV RAM, and remove it from the socket, can you tell he pins were swapped )
This is the same as that.

The master here is the Rotator, if video wants to read every 5 SysClks it needs to stay in phase, and advance its nibble by 5, every 5 clocks. If it does that, there is no waiting and video is fSys/N etc (fSys/5 here )

potatohead · 2014-05-15 22:37

And what about the many other cogs that are contributing to that bitstream?

jmg · 2014-05-15 22:38

potatohead wrote: »

I'm not cool with the order changes, unless both modes are included. Having the order of things change like that makes higher than signal level code very complicated and slow.

You can Relax, When Delta=1, things are nice and linear.
Even when Delta > 1, filling the line, is done in the same order as reading it out. Nice and linear from code.

potatohead · 2014-05-15 22:40

When it's not 1, what of the other COGS contributing to that bitstream? In most cases, video will not be a single COG solution.

jmg · 2014-05-15 22:42

potatohead wrote: »

And what about the many other cogs that are contributing to that bitstream?

{added}

When it's not 1, what of the other COGS contributing to that bitstream? In most cases, video will not be a single COG solution.

I'm not sure what you are asking ? only one cog streams to one set of DACS (or pins)
Multiple COGS can create alternating line buffers if they really want, all they need to know is the N used in fSys/N and a start address, usually Block aligned.

potatohead · 2014-05-15 22:53

Actually, multiple COGS might do this for one. I've some ideas along those lines, but that's a secondary discussion. For now, one COG is the signal COG, and it's driving the DACS, and it does so from line buffers.

Or it's doing a simple bitmap. Set that case aside too.

Accessing line RAM isn't always a linear thing! Unless of course, we want to start introducing the overhead of copying things around a lot. And those copies won't be linear, which means they will wait a lot. And that's two sets of waiting minimum now, where a linear buffer would only have the minimum waits to process objects.

Where there are objects, layers, etc... those things will get processed by other COGS, and the addressing of the line buffer RAM won't be step by step linear, unless those get sorted, or lines get built then copied, all of which doesn't need to happen if we've got a linear line buffer to start with. And all of that jazz is the "higher level code more complicated, messy and slow" part.

...and there may be times when the bitstream is generated just ahead of the signal COG rendering it too. Just saying.

To sum up, having a linear stream from the HUB helps by not having to copy from the HUB to the COG, somehow buffer that and send it to the DACS. These COGS have no AUX / CLUT. So then, if we do the simple stream from the HUB, then complicate it by making it a non-linear, variable thing, we end up doing a batch of other copies elsewhere all over the place, depending on what needs to get done, blowing the savings out that was had in the first place.

Brian Fairchild · 2014-05-15 22:58

I wondered what that noise during the night was. And now I know, it's feature creep.

jmg · 2014-05-15 23:32

potatohead wrote: »

These COGS have no AUX / CLUT.

Not directly, but Chip is talking about the DMA modes using the COG ram as LUT, sound good to me.

RossH · 2014-05-15 23:45

Brian Fairchild wrote: »

I wondered what that noise during the night was. And now I know, it's feature creep.

New Hub Scheme For Next Chip

Comments