New Hub Scheme For Next Chip - Page 14 — Parallax Forums

Comments

  • cgracey Posts: 14,222
    edited 2014-05-15 21:24
    David Betz wrote: »
    Any chance hub execution will appear in this new chip or has that idea been abandoned?


    I've got the instructions all mapped out to support it, but I will implement it later.
  • David Betz Posts: 14,516
    edited 2014-05-15 21:26
    cgracey wrote: »
    I've got the instructions all mapped out to support it, but I will implement it later.
    A P3 feature then I guess. Sorry for sidetracking the new hub discussion.
  • jmg Posts: 15,175
    edited 2014-05-15 21:32
    Cluso99 wrote: »
    Is 'N' a global across the entire chip, or can each cog have a different 'N'?

    Each COG has a different N, from my understanding.

    Cluso99 wrote: »
    I am not sure how this video is going to work...

    It is going to get its video data directly from Hub
    Fantastic, but doesn't that mean that the cog will not be able to use hub while this is happening?

    Yes, while the HUB is streaming data to the video output at fSys, there is no spare time, so the COG waits.
    - that leaves Blanking and Flyback times for local COG work or another COG works on the pixel-building, full time.


    For slower fSys/N, in theory there are spare HUB slots going past, but a LUT Video form is going to steal the COG Bus to look up into COG RAM, so that means the COG is frozen for at least those cycles.

    Extra HW would be needed to allow COG execute and Non-LUT Streaming and fSys/N N >= 2, but is that worth bothering with ?

    Sounds like the sort of thing one might look at, when the first Pass FPGA is being tested.
  • cgracey Posts: 14,222
    edited 2014-05-15 21:39
    The way I see the video working, an initial block (16 longs) of hub memory would get spooled up over 16 clocks into 512 flops, then it would be transferred into the video shifter (another 512 flops). The next block would immediately get spooled up, as well, into the first set of flops so that a spooled block is always waiting to be transferred to the shifter on the clock that it is needed. For Fsys/1 cases, the spooling would go on continuously. The trouble is, this takes 1000+ flops per cog, which is 16k+ for the chip. If we cut the hub transfer from 32 bits per clock down to 16 bits per clock, we'd nearly halve the gates in the hub memory mux mechanism. We'd also halve the number of flops in the video circuits. This would result in a hub<-->cog transfer rate of one long every two clocks, which would perfectly match hub exec requirements. Hub exec would be able to execute at 100% speed in a straight line, but then need to wait for the window to come around for a new branch address. This would mean that for video, only 16-bit pixels could be streamed from memory at Fsys/1 (no more 32-bit possibility at the Fsys/1 rate). What do you guys think about this compromise?
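
    The flop figures in this post can be checked with a rough sketch. It assumes each cog needs one 16-entry spool buffer plus one equally sized shifter, that both scale with the bus width, and that the chip has 16 cogs (the cog count is mentioned later in the thread). The function name is made up for illustration; this is back-of-envelope arithmetic, not the actual logic design.

```python
# Back-of-envelope flop count for the double-buffered video path.
LONGS_PER_BLOCK = 16   # one block is 16 hub entries
COGS = 16              # assumed from "We now have 16 cogs" later in the thread

def video_flops(bus_bits):
    """Flops per cog: a spool buffer plus a video shifter of the same size."""
    block_bits = LONGS_PER_BLOCK * bus_bits   # bits held by one buffer
    return 2 * block_bits                     # spool buffer + shifter

for bus_bits in (32, 16):
    per_cog = video_flops(bus_bits)
    print(bus_bits, per_cog, per_cog * COGS)
```

    With a 32-bit bus this gives 1024 flops per cog and 16384 for the chip, matching the "1000+" and "16k+" estimates; a 16-bit bus halves both.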

    I think the video shifter also needs to have modes where instead of all 4 DACs getting updated every clock, you could do one or two DACs, instead, so that you get more efficiency from your data, when only one or two DAC updates are needed. This would help composite video signal generation quite a bit.
  • potatohead Posts: 10,261
    edited 2014-05-15 21:40
    @jmg: It means a bitmap can be done in a single COG, other things are going to require two COGS.

    Edit: Just saw Chip's post.

    I think it's reasonable Chip. That still allows for very nice, high color depth resolutions. Definitely think about the modes. They will be useful. Because the pixel clocks are fixed, we may well want to arrange data in creative ways to maximize the stream.

    Agreed on composite. :) Nice thing about that is the low sweep frequency. We've got a lot of time to get it done.
  • Cluso99 Posts: 18,069
    edited 2014-05-15 21:49
    jmg wrote: »
    Each COG has a different N, from my understanding.
    That wasn't from me.
    Yes, while the HUB is streaming data to the video output at fSys, there is no spare time, so the COG waits.
    - that leaves Blanking and Flyback times for local COG work or another COG works on the pixel-building, full time.

    For slower fSys/N, in theory there are spare HUB slots going past, but a LUT Video form is going to steal the COG Bus to look up into COG RAM, so that means the COG is frozen for at least those cycles.
    I am fine with this, as I said further down in my post.

    It may be possible to still execute within the cog (no hub accesses), similar to waitvid (but without hub access), but I don't care either way.
    Extra HW would be needed to allow COG execute and Non-LUT Streaming and fSys/N N >= 2, but is that worth bothering with ?

    Sounds like the sort of thing one might look at, when the first Pass FPGA is being tested.
    I don't think it's worth it either.

    I was more interested in understanding the proposed mechanism. My video understanding is limited.
    I would like to be able to do Composite color NTSC and VGA (only one at a time is fine). There are really cheap and small composite video monitors out there for cars.
  • jmg Posts: 15,175
    edited 2014-05-15 22:00
    cgracey wrote: »
    The way I see the video working, an initial block (16 longs) of hub memory would get spooled up over 16 clocks into 512 flops, then it would be transferred into the video shifter (another 512 flops). The next block would immediately get spooled up, as well, into the first set of flops so that a spooled block is always waiting to be transferred to the shifter on the clock that it is needed. For Fsys/1 cases, the spooling would go on continuously. The trouble is, this takes 1000+ flops per cog, which is 16k+ for the chip. If we cut the hub transfer from 32 bits per clock down to 16 bits per clock, we'd nearly halve the gates in the hub memory mux mechanism. We'd also halve the number of flops in the video circuits. This would result in a hub<-->cog transfer rate of one long every two clocks, which would perfectly match hub exec requirements. Hub exec would be able to execute at 100% speed in a straight line, but then need to wait for the window to come around for a new branch address. This would mean that for video, only 16-bit pixels could be streamed from memory at Fsys/1 (no more 32-bit possibility at the Fsys/1 rate). What do you guys think about this compromise?

    There is more than one question here.
    Halving the Bus widths is the same as halving the Clock speed, which seems a shame if the 32b Counters can run at 200MHz, and the memory can MUX at 200MHz, and it is a 32b core.

    On the Video streaming, I think there is a much more frugal solution than 1000+ flops per cog, (yikes)

    The problem is best shifted to one of matching the Address-change to the Hub Rotate for each fSys/N

    That means the Memory scanned pattern changes with each N, but if you write using the same rules as you use to read, that detail is immaterial. Most video lines are built in pixel order anyway.

    - that needs a couple of 4b Adders and 4b counters.
    I have Odd-N working with full Memory coverage, and am adding in a fix to change Even-N from Sparse, to full Memory coverage. That will then give fSys/N, if done via DMA, I think from 1..17 & the same Adder is used in SW to Write the Memory, either in this COG, or other COGS.
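
    The odd-N versus even-N coverage remark can be illustrated with a small model. It assumes 16 hub slices selected by the low nibble of the long address, with the cog's window moving on by one slice per clock, so a read every N clocks steps the slice index by N modulo 16. The function name is hypothetical, and the fix for even N mentioned in the post is not modelled because the thread does not describe it.

```python
from math import gcd

SLICES = 16  # assumed: 16 hub slices, selected by the low address nibble

def slices_visited(n, start=0):
    """Slices a cog lands on when it reads once every n clocks."""
    seen = []
    s = start % SLICES
    while s not in seen:
        seen.append(s)
        s = (s + n) % SLICES   # the window has rotated n places by the next read
    return seen

for n in range(1, 9):
    visited = slices_visited(n)
    # The count equals 16 / gcd(n, 16): all 16 for odd n, fewer for even n.
    print(n, len(visited), SLICES // gcd(n, SLICES), visited)
```

    In this model odd N walks through all 16 slices before repeating, while even N reaches only 16/gcd(N, 16) of them, which is the "sparse" case.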
  • potatohead Posts: 10,261
    edited 2014-05-15 22:05
    Cluso, the basic idea is composite is two signals combined. There is the monochrome signal, and it's easy. Syncs and levels. The faster the levels, the more pixels there are, etc... The color signal consists of the reference colorburst, and then a modulated color signal which gets combined with the monochrome signal.

    In the end, it's all just one pixel stream. On the P1, there was a little circuit that did the color modulation, and it got added to the monochrome signal, and it got used to do the color burst. That just made things easy. Truth is, the P1 could have and has done the entire signal with no color circuit at all! And that's what this chip will do.

    We just express the whole signal as DAC levels, not just the monochrome part. There are various ways to do that too.
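
    A minimal sketch of "the whole signal as DAC levels", assuming NTSC timing: each sample is the monochrome level plus a colour carrier whose amplitude sets saturation and whose phase sets hue. The function names, the 0.0..1.0 level scale and the 8-bit DAC width are illustrative assumptions, not details of the new chip.

```python
import math

F_SC = 3_579_545.0  # NTSC colour subcarrier frequency in Hz

def composite_sample(t, luma, saturation, hue):
    """One composite level: monochrome level plus modulated colour carrier."""
    return luma + saturation * math.sin(2 * math.pi * F_SC * t + hue)

def to_dac(level, bits=8):
    """Clamp a 0.0..1.0 level to an unsigned DAC code (DAC width is assumed)."""
    full = (1 << bits) - 1
    return max(0, min(full, round(level * full)))

# Eight samples of a mid-grey pixel carrying some colour, four samples per carrier cycle.
dt = 1.0 / (4 * F_SC)
print([to_dac(composite_sample(i * dt, 0.5, 0.2, 0.0)) for i in range(8)])
```

    With saturation set to zero the output is just the monochrome signal, which is why a separate colour circuit is not strictly needed.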
  • cgracey Posts: 14,222
    edited 2014-05-15 22:06
    Maybe it's better to stick with 32-bit transfers per clock, after all. It seemed convenient to match the hub data rate with the instruction rate for hub-exec, but that would introduce two more clocks into the PC-to-instruction latency, creating, in effect, two more stages of pipeline, which would be a mess to deal with. I think it's perhaps necessary for hub exec and video to have 16 longs' worth of flops for caching. The beauty of that arrangement is that it always fills in 16 clocks, no matter the initial hub window. So, when hub exec needs a new block, fill the cache and use it. Same for video. Video instructions would not be able to execute during hub exec.
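
    The "always fills in 16 clocks, no matter the initial hub window" property can be sketched as follows, assuming the hub window advances by one slice per clock and the cache simply accepts whichever long of the block is in the window. Names are hypothetical.

```python
SLICES = 16  # assumed slice count, one long of the block per slice

def fill_cache(block, first_window):
    """Fill a 16-long cache, taking whichever slice is in the window each clock."""
    cache = [None] * SLICES
    clocks = 0
    window = first_window % SLICES
    while None in cache:
        cache[window] = block[window]     # one long arrives per clock
        window = (window + 1) % SLICES    # hub window rotates
        clocks += 1
    return cache, clocks

block = list(range(100, 116))
for w in (0, 5, 15):
    cache, clocks = fill_cache(block, w)
    print(w, clocks, cache == block)
```

    Whatever the starting window, the longs arrive in rotated order but the cache is complete after exactly 16 clocks in this model.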
  • potatohead Posts: 10,261
    edited 2014-05-15 22:07
    That means the Memory scanned pattern changes with each N, but if you write using the same rules as you use to read, that detail is immaterial. Most video lines are built in pixel order anyway.

    No. Let's do the flops. Combining things with a pattern change every N would be difficult to manage in software. Seriously complicated. Besides, wouldn't that then be different per COG? We really don't need that.

    With the flops, we get a linear stream, and that's going to be important for a lot of things. The world consists of more than bitmaps.
  • Tubular Posts: 4,706
    edited 2014-05-15 22:07
    Can RDBLOC run continuously after those initial 2 setup cycles? Or does it need a setup before every 16 longs?

    If so, just having a variation where the pipeline still picks up hub longs in the usual way (0, 8, 1, 9, 2, A etc), but always writes it to DACs register $1F7 (rather than 16 contiguous cog longs) might do the trick. It would allow 32 bits on every cycle, outputting at full Fsys/1 rate, with the onus on the programmer to order the data the way the hub picks up. For Fsys/2 it's not an issue.

    The same technique could apply for OUTA
  • cgracey Posts: 14,222
    edited 2014-05-15 22:09
    jmg wrote: »
    There is more than one question here.
    Halving the Bus widths is the same as halving the Clock speed, which seems a shame if the 32b Counters can run at 200MHz, and the memory can MUX at 200MHz, and it is a 32b core.

    On the Video streaming, I think there is a much more frugal solution than 1000+ flops per cog, (yikes)

    The problem is best shifted to one of matching the Address-change to the Hub Rotate for each fSys/N

    That means the Memory scanned pattern changes with each N, but if you write using the same rules as you use to read, that detail is immaterial. Most video lines are built in pixel order anyway.

    - that needs a couple of 4b Adders and 4b counters.
    I have Odd-N working with full Memory coverage, and am adding in a fix to change Even-N from Sparse, to full Memory coverage. That will then give fSys/N, if done via DMA, I think from 1..17 & the same Adder is used in SW to Write the Memory, either in this COG, or other COGS.


    But this would affect the global hub window order, wouldn't it? All cogs must see the same pattern for there to be all-cog/continuous access.
  • jmg Posts: 15,175
    edited 2014-05-15 22:10
    jmg wrote: »
    For slower fSys/N, in theory there are spare HUB slots going past, but a LUT Video form is going to steal the COG Bus to look up into COG RAM, so that means the COG is frozen for at least those cycles.

    Thinking some more on this, it might not be as complex as I thought.
    The COG will have a Clock Enable already, and the WAIT is going to need to run outside that.
    In that form COG is off for the whole Wait-Cell time.

    If instead the Clock Enable is fed from a signal that is NOT (Using_HUB_SLOT OR Using_LUT_BUS), then the COG can auto-throttle. At full speed the COG pauses; at fSys/1 and fSys/2 the LUT pipeline effect also holds off the COG; but for fSys/3 and slower there are clocks where neither the Using_HUB_SLOT nor the Using_LUT_BUS override applies.
    For those, the COG could clock, at what will be a reduced speed.
    Deciding who stops the Video line in a slow-COG mode is a detail that needs working out. If the Video is started with a count, it has its own wait, and the COG could run slow, then poll that to keep in sync with Video.
    Outside of Video-lines, COG gets 100% of Clock edges.

    As a Clock Enable problem the HW is not complex, and some is already there.
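
    A toy model of the auto-throttle idea, under loud assumptions: within each N-clock pixel cell the hub slot is taken on the first clock and, in LUT mode, the COG bus is taken on the following clock. The signal names come from the post; the per-cell timing and the function names are guesses made only to show the shape of the logic.

```python
def cog_clock_enable(using_hub_slot, using_lut_bus):
    """Clock enable as described: NOT (Using_HUB_SLOT OR Using_LUT_BUS)."""
    return not (using_hub_slot or using_lut_bus)

def free_clocks_per_cell(n, lut_mode=True):
    """Count clocks in one fSys/n pixel cell on which the COG may run."""
    free = 0
    for phase in range(n):
        using_hub_slot = (phase == 0)                      # assumed: fetch on first clock
        using_lut_bus = lut_mode and (phase == (1 % n))    # assumed: lookup on the next clock
        if cog_clock_enable(using_hub_slot, using_lut_bus):
            free += 1
    return free

for n in range(1, 7):
    print(n, free_clocks_per_cell(n), free_clocks_per_cell(n, lut_mode=False))
```

    Under these assumptions the COG is fully held off at fSys/1 and fSys/2 in LUT mode and starts getting clocks from fSys/3, consistent with the description in the post.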
  • Cluso99 Posts: 18,069
    edited 2014-05-15 22:13
    cgracey wrote: »
    The way I see the video working, an initial block (16 longs) of hub memory would get spooled up over 16 clocks into 512 flops, then it would be transferred into the video shifter (another 512 flops). The next block would immediately get spooled up, as well, into the first set of flops so that a spooled block is always waiting to be transferred to the shifter on the clock that it is needed. For Fsys/1 cases, the spooling would go on continuously. The trouble is, this takes 1000+ flops per cog, which is 16k+ for the chip. If we cut the hub transfer from 32 bits per clock down to 16 bits per clock, we'd nearly halve the gates in the hub memory mux mechanism. We'd also halve the number of flops in the video circuits. This would result in a hub<-->cog transfer rate of one long every two clocks, which would perfectly match hub exec requirements. Hub exec would be able to execute at 100% speed in a straight line, but then need to wait for the window to come around for a new branch address. This would mean that for video, only 16-bit pixels could be streamed from memory at Fsys/1 (no more 32-bit possibility at the Fsys/1 rate). What do you guys think about this compromise?

    I think the video shifter also needs to have modes where instead of all 4 DACs getting updated every clock, you could do one or two DACs, instead, so that you get more efficiency from your data, when only one or two DAC updates are needed. This would help composite video signal generation quite a bit.
    Wow, this sure eats up a lot of space!

    Do we require one video for each cog?
    Maybe only one (or two) could be in the chip and grabbed by any cog?
    I don't care if you take the whole cog's hub bandwidth. Either the cog can rest in a waitvid, or execute within the cog (no hub access). We now have 16 cogs after all.
    Could you cut this size down if you took over the cog's hub access completely?


    And remember, it's not on Ken's priority list ;)
  • cgracey Posts: 14,222
    edited 2014-05-15 22:14
    Tubular wrote: »
    Can RDBLOC run continuously after those initial 2 setup cycles? Or does it need a setup before every 16 longs?

    If so, just having a variation where the pipeline still picks up hub longs in the usual way (0, 8, 1, 9, 2, A etc), but always writes it to DACs register $1F7 (rather than 16 contiguous cog longs) might do the trick. It would allow 32 bits on every cycle, outputting at full Fsys/1 rate, with the onus on the programmer to order the data the way the hub picks up. For Fsys/2 it's not an issue.

    The same technique could apply for OUTA


    But as soon as you want to do Fsys/2, it all blows up, doesn't it? Many video modes are going to want pixels at less than the top speed.
  • jmg Posts: 15,175
    edited 2014-05-15 22:16
    cgracey wrote: »
    But this would affect the global hub window order, wouldn't it? All cogs must see the same pattern for there to be all-cog/continuous access.

    No the Rotate is untouched, so all COGs are ok.
    What you do is phase the COG Pointer so it is tracking the Rotator, at fSys/N.
    The physical order of Nibble values changes with /N, but the coverage of Memory is 100%.

    SW write and HW read have to use the same Adder-Algorithm, not sure if that means one adder+Muxes, or two adders.
    If the COG is allowed to use spare Clocks in slow video cases, separate adders may be safer/faster.
  • potatohead Posts: 10,261
    edited 2014-05-15 22:17
    I think it's perhaps necessary for hub exec and video to have 16 longs' worth of flops for caching. The beauty of that arrangement is that it always fills in 16 clocks, no matter the initial hub window. So, when hub exec needs a new block, fill the cache and use it. Same for video. Video instructions would not be able to execute during hub exec.

    Making these exclusive makes sense. A video signal COG can do some things during blanks, stable sync times, borders, blank lines, etc... Pretty much everything else will be done by one or more graphics COGS, and how they do it can depend on the signal COG. If it's a big bitmap, everybody does a region, etc... if it's scan line buffers, COGS tag team on buffers, etc...

    We won't really need hubex to do signals, IMHO.
  • potatohead Posts: 10,261
    edited 2014-05-15 22:19
    The physical order of Nibble values changes with /N, but the coverage of Memory is 100%.

    I'm not cool with the order changes, unless both modes are included. Having the order of things change like that makes higher than signal level code very complicated and slow.
  • Cluso99 Posts: 18,069
    edited 2014-05-15 22:20
    potatohead wrote: »
    Cluso, the basic idea is composite is two signals combined. There is the monochrome signal, and it's easy. Syncs and levels. The faster the levels, the more pixels there are, etc... The color signal consists of the reference colorburst, and then a modulated color signal which gets combined with the monochrome signal.

    In the end, it's all just one pixel stream. On the P1, there was a little circuit that did the color modulation, and it got added to the monochrome signal, and it got used to do the color burst. That just made things easy. Truth is, the P1 could have and has done the entire signal with no color circuit at all! And that's what this chip will do.

    We just express the whole signal as DAC levels, not just the monochrome part. There are various ways to do that too.
    Thanks for that potatohead. It helps a lot. As long as we can generate color composite using a cog I am a happy camper.
    It might be possible to generate lower bandwidth VGA in a cog using this new chip too ??? (80 columns of color = 480 pixels horiz)
  • potatohead Posts: 10,261
    edited 2014-05-15 22:22
    Yeah. That's gonna happen. You know I like TV :)

    Let's just hope the buffer memory arrangement stays simple, linear, sane. :)

    Another way to think of it is putting the monochrome signal on one DAC, and the color signal on another, with a level shift. What happens when both get connected to the composite input?
  • Tubular Posts: 4,706
    edited 2014-05-15 22:25
    cgracey wrote: »
    But as soon as you want to do Fsys/2, it all blows up, doesn't it? Many video modes are going to want pixels at less than the top speed.

    Fsys/2 is where it would probably work nicely, since you'd be copying Longs 0, 1, 2, 3 etc in order.
    Fsys/1 would still work but require the data to be interleaved in the hub, in advance (0,8,1,9,2 etc).

    This is just an alternative (non-flop) way to achieve your proposed VID32, for the cases where N=1 or N=2, since it's difficult using the other (lots of flops) model.

    For other N>2, and/or other bit depths, the other techniques would apply.
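
    A sketch of the pre-interleaving this would ask of the programmer at Fsys/1. It assumes the pick-up order quoted in the thread (0, 8, 1, 9, 2, A) simply continues as 3, B, 4, C and so on up to 7, F; that continuation and all function names are assumptions.

```python
def pickup_order(n=16):
    """Assumed hub pick-up order: 0, 8, 1, 9, 2, 10, ... for a 16-long block."""
    half = n // 2
    return [(i // 2) + half * (i % 2) for i in range(n)]

def interleave_for_fsys1(pixels):
    """Place pixels in hub order so they stream out in display order."""
    order = pickup_order(len(pixels))
    hub = [None] * len(pixels)
    for t, addr in enumerate(order):
        hub[addr] = pixels[t]   # the pixel wanted at time t goes where the hub looks at time t
    return hub

def stream(hub):
    """What the DACs would see, one long per clock, in pick-up order."""
    return [hub[addr] for addr in pickup_order(len(hub))]

pixels = list(range(16))
hub = interleave_for_fsys1(pixels)
print(hub)
print(stream(hub) == pixels)
```

    Under this assumed order the even-numbered pixels end up in the lower eight longs and the odd-numbered pixels in the upper eight, which is the cost of skipping the flop buffer.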
  • jmg Posts: 15,175
    edited 2014-05-15 22:33
    potatohead wrote: »
    Combining things with a pattern change every N would be difficult to manage in software. Seriously complicated.

    Nope, quite wrong - it is invisible, I mentioned the Physical memory to try to help readers grasp what is going on.
    Seems I achieved the opposite.... ;)

    Think of swapping pins on a PCB RAM memory.

    Because you write and read the same, you have no idea the pins are swapped.
    ( Only if you have NV RAM, and remove it from the socket, can you tell the pins were swapped )
    This is the same as that.

    The master here is the Rotator, if video wants to read every 5 SysClks it needs to stay in phase, and advance its nibble by 5, every 5 clocks. If it does that, there is no waiting and video is fSys/N etc (fSys/5 here )
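
    The swapped-pins analogy can be made concrete for a single 16-long block. The sketch assumes logical pixel k lives at physical nibble (k * N) mod 16, which is a one-to-one mapping only for odd N; how the block address advances across blocks is not described in the thread and is left out. Names are hypothetical.

```python
SLICES = 16  # assumed: one block of 16 longs, indexed by the low address nibble

def physical_slot(k, n):
    """Physical nibble for logical pixel k when stepping by n (odd n assumed)."""
    return (k * n) % SLICES

def write_line(pixels, n):
    """Software writer using the same stepping rule as the hardware reader."""
    ram = [None] * SLICES
    for k, p in enumerate(pixels):
        ram[physical_slot(k, n)] = p
    return ram

def read_line(ram, n):
    """Hardware reader: every n clocks the nibble advances by n."""
    return [ram[physical_slot(k, n)] for k in range(SLICES)]

pixels = list("ABCDEFGHIJKLMNOP")
ram = write_line(pixels, 5)
print("".join(ram))                  # physical order is scrambled
print("".join(read_line(ram, 5)))    # logical order comes back intact
```

    The scrambled physical layout only shows up if another party reads the block with a different N or in plain address order, which is the case potatohead raises next.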
  • potatohead Posts: 10,261
    edited 2014-05-15 22:37
    And what about the many other cogs that are contributing to that bitstream?
  • jmg Posts: 15,175
    edited 2014-05-15 22:38
    potatohead wrote: »
    I'm not cool with the order changes, unless both modes are included. Having the order of things change like that makes higher than signal level code very complicated and slow.

    You can relax. When Delta=1, things are nice and linear.
    Even when Delta > 1, filling the line, is done in the same order as reading it out. Nice and linear from code.
  • potatohead Posts: 10,261
    edited 2014-05-15 22:40
    When it's not 1, what of the other COGS contributing to that bitstream? In most cases, video will not be a single COG solution.
  • jmg Posts: 15,175
    edited 2014-05-15 22:42
    potatohead wrote: »
    And what about the many other cogs that are contributing to that bitstream?

    {added}

    When it's not 1, what of the other COGS contributing to that bitstream? In most cases, video will not be a single COG solution.

    I'm not sure what you are asking. Only one cog streams to one set of DACs (or pins).
    Multiple COGS can create alternating line buffers if they really want, all they need to know is the N used in fSys/N and a start address, usually Block aligned.
  • potatohead Posts: 10,261
    edited 2014-05-15 22:53
    Actually, multiple COGS might do this for one. I've some ideas along those lines, but that's a secondary discussion. For now, one COG is the signal COG, and it's driving the DACS, and it does so from line buffers.

    Or it's doing a simple bitmap. Set that case aside too.

    Accessing line RAM isn't always a linear thing! Unless of course, we want to start introducing the overhead of copying things around a lot. And those copies won't be linear, which means they will wait a lot. And that's two sets of waiting minimum now, where a linear buffer would only have the minimum waits to process objects.

    Where there are objects, layers, etc... those things will get processed by other COGS, and the addressing of the line buffer RAM won't be step by step linear, unless those get sorted, or lines get built then copied, all of which doesn't need to happen if we've got a linear line buffer to start with. And all of that jazz is the "higher level code more complicated, messy and slow" part.

    ...and there may be times when the bitstream is generated just ahead of the signal COG rendering it too. Just saying.

    To sum up, having a linear stream from the HUB helps by not having to copy from the HUB to the COG, somehow buffer that and send it to the DACS. These COGS have no AUX / CLUT. So then, if we do the simple stream from the HUB, then complicate it by making it a non-linear, variable thing, we end up doing a batch of other copies elsewhere all over the place, depending on what needs to get done, blowing the savings out that was had in the first place.
  • Brian Fairchild Posts: 549
    edited 2014-05-15 22:58
    I wondered what that noise during the night was. And now I know, it's feature creep.
  • jmg Posts: 15,175
    edited 2014-05-15 23:32
    potatohead wrote: »
    These COGS have no AUX / CLUT.

    Not directly, but Chip is talking about the DMA modes using the COG RAM as LUT, sounds good to me.
  • RossH Posts: 5,489
    edited 2014-05-15 23:45
    I wondered what that noise during the night was. And now I know, it's feature creep.

    :lol: