New Hub Scheme For Next Chip

Cluso99 · 2014-05-16 00:04

potatohead
What jmg is doing is swapping the address lines over (by an instruction) which means the original address sequence is being modified to cater for a different sequence other than 0,1,2,3,...F. But r/w hub data gets unscrambled when it comes back into video or cog.

Both (or more) cogs have to set their "N" to the same values/sequence in order to r/w to the same hub locations.
The disadvantage is that in this mode, all other hub accesses will also be shifted. Therefore, effectively hubexec is out too.

Does this explanation help?

Cluso99 · 2014-05-16 00:07

Brian Fairchild wrote: »

I wondered what that noise during the night was. And now I know, it's feature creep.

Actually, we lost the video and clut when the older P2 croaked with too much power usage.
So, it has to be done again.

I hope we aren't transferring the power problem from the cog to the hub though.

Brian Fairchild · 2014-05-16 00:20

Cluso99 wrote: »

Actually, we lost the video and clut when the older P2 croaked with too much power usage.

To be accurate, they were lost, along with the rest of the chip, when feature creep caused the power usage to exceed the power dissipation of the die and package.

Brian Fairchild · 2014-05-16 00:28

*IF* video is so important to future volume sales of the chip formerly known as the P1+, and I don't personally accept that it is, then why isn't it going in the hub as a dedicated video engine driving a defined set of pins?

jmg · 2014-05-16 00:32

Cluso99 wrote: »

potatohead
What jmg is doing is swapping the address lines over (by an instruction) which means the original address sequence is being modified to cater for a different sequence other than 0,1,2,3,...F. But r/w hub data gets unscrambled when it comes back into video or cog.

Both (or more) cogs have to set their "N" to the same values/sequence in order to r/w to the same hub locations.
The disadvantage is that in this mode, all other hub accesses will also be shifted. Therefore, effectively hubexec is out too.

I'm not following these claims. When using this nibble-adder mode, the rotate is unchanged, what does change is the Nibble presented to the rotator by that opcode.

I'm not sure you would want to mix an opcode in this mode, with any other HUB opcode at the same time, as the whole reason to match-spoke is to be cycle deterministic. Any nibble out of phase, will trigger a wait.

However, in HW DMA mode, (assuming the COG works in slow Clock Enable interleave), it can use spare slots, as the HW scanner, does not 'see' the wait. That is for the COG.
HW scanner uses the slots it wants, and forces a Clock-Disable on the COG at those instants.

Cluso99 · 2014-05-16 00:56

jmg wrote: »

I'm not following these claims. When using this nibble-adder mode, the rotate is unchanged, what does change is the Nibble presented to the rotator by that opcode.

I'm not sure you would want to mix an opcode in this mode, with any other HUB opcode at the same time, as the whole reason to match-spoke is to be cycle deterministic. Any nibble out of phase, will trigger a wait.

However, in HW DMA mode, (assuming the COG works in slow Clock Enable interleave), it can use spare slots, as the HW scanner, does not 'see' the wait. That is for the COG.
HW scanner uses the slots it wants, and forces a Clock-Disable on the COG at those instants.

Now you have totally confused me too. I thought I understood what you were suggesting.

Brian Fairchild · 2014-05-16 03:31

cgracey wrote: »

Hub exec does complicate the cog quite a bit, though. It takes a whole slew of instructions to support.

I can't quite put my finger on why but that statement worries me.

jmg · 2014-05-16 03:39

Brian Fairchild wrote: »

I can't quite put my finger on why but that statement worries me.

It would bother me too, if HubExec were not field-proven already, in FPGA.

JRetSapDoog · 2014-05-16 03:50

Just thinking out loud again (showing my confusion)...

cgracey wrote: »

The trouble is, this takes 1000+ flops per cog, which is 16k+ for the chip. <-- pg. 20, post 395

I think it's perhaps necessary for hub exec and video to have 16 longs' worth of flops for caching. <--pg. 20, post 400

In that it seems that 1K+ of flops is needed for video and hub exec (two great uses), maybe there are other uses for those flops, too, to help justify them.

Could the video shifter be configured such that it could automatically wrap around to keep sending the same 512 bits out repeatedly? I'm sure it could if it can't already.

If so, that would make for storing some pretty nice waveforms in it, such as sine and triangle waves or lots of other things. For example, with the DACS (after any CLUT transforms) eating the data 8 bits at a time for 256 levels (of amplitude), there would be 64 samples (512 bits/8 bits = 64) in what we could call a frame (BTW, double that to 128 samples if the initial spooled up buffer on the front end (before) the video shifter could be used).

Then, as the waveform in the frame was automatically being output on a continuous basis, kind of like a carrier wave or similar, perhaps the cog associated with the DAC at hand here could alter the CLUT on the fly to maybe "modulate" the waveform (such as make peaks higher) and also do any related calculations (I'm assuming that the CLUT transform is applied after the bits move through the video shifter). I suppose a cog could handle all of that on its own without the video shifter automatically wrapping around, but having the assistance from the video shifter would give the cog more processing time by splitting up the workload and also make things easier to conceptualize for the programmer, kind of like how, on the P1, the cog code doesn't need to worry as much about the counter stuff once things are handed off/delegated to counters. Perhaps frames of 64 (or 128) 256-level (8-bit) samples would become a kind of a standard for doing lots of signaling things.
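To make the sample arithmetic concrete, here is a minimal sketch of building one such 64-sample frame. It assumes a 512-bit shifter, unsigned 8-bit DAC levels, and samples packed lowest byte first into longs; the byte order and the function names `sine_frame` and `pack_longs` are illustrative assumptions, not anything Chip has specified.

```python
import math

SHIFTER_BITS = 512                       # video shifter size quoted above
SAMPLE_BITS = 8                          # one 8-bit DAC level per sample
SAMPLES = SHIFTER_BITS // SAMPLE_BITS    # 64 samples per frame


def sine_frame(samples=SAMPLES):
    """One full sine cycle scaled to unsigned 8-bit DAC levels (0..255)."""
    return [round(127.5 + 127.5 * math.sin(2 * math.pi * i / samples))
            for i in range(samples)]


def pack_longs(frame):
    """Pack 8-bit samples into 32-bit longs, lowest byte first (assumed order)."""
    return [frame[i] | frame[i + 1] << 8 | frame[i + 2] << 16 | frame[i + 3] << 24
            for i in range(0, len(frame), 4)]


frame = sine_frame()
longs = pack_longs(frame)
print(SAMPLES, "samples packed into", len(longs), "longs")
```

A triangle or any other waveform would only change `sine_frame`; the 16-long frame size stays the same.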

The key here would be to keep the cog from blocking. We wouldn't want (or need) the hub accessing a cog-based CLUT. Actually, I don't know how it could, but I thought I read somewhere that the hub might be involved (but maybe not, as a quick check didn't find the reference). Anyway, I believe that the hub wouldn't have to be involved in what's been stated so far. It's obviously best to avoid cog stalls if they're not really required (such that a cog can run "in the background" so to speak while something else is happening, and hopefully at full speed, excepting hub ops, of course, if the hub is busy).

cgracey wrote: »

This means that we CAN have a 256 LUT by reading the pixels from hub, translating them via cog RAM into 32-bit patterns, and outputting them to the DACs. This simplifies video quite a bit.

Once these instructions are over, they can return the DAC states to whatever they were before, with a mapped DAC register holding the four 8-bit values. That way, horizontal sync's can be done with 'MOV DAC,dacstates' and 'WAIT clocks' instructions. This simplifies the video greatly. Because there is no decoupling, though, the cog will be busy while it generates the pixels.

I'm confused on when the CLUT transform takes place. I presume the transformation happens as the pixels are being read out of the video shifter (i.e., after each "chunk" of pixels has been read out), somehow reaching back into a cog-based CLUT. However, that presumably blocks the cog by either [1] halting it while its "brain" (CLUT) is tapped, or, [2] requires an assist from the cog's processing "innards" to "upload" the CLUT value. So, in either of those two cases, the cog is prevented from executing code. That seems quite limiting and perhaps unnecessary, or a high price to pay for the savings provided by using cog registers for the CLUT (and kind of reminds me of motherboard graphics solutions using (eating into) the regular system DRAM). By the way, if some kind of DMA is being used (sorry, I forget where things stand), I'm unclear on how or if the blocking comes into play. Also, please enlighten me if n-port register memory is involved for the cog that would allow a cog-based CLUT to be read by some kind of CLUT transform circuitry (whatever and wherever it is) without causing the cog to stall or be occupied with the look-up transformation, as that would put a different spin on things.

But I can't help feeling that the CLUT needs to be outside of the cog, perhaps in a "no-man's land" between the hub or "transform circuitry" (as I don't understand how it would be the hub) and the cog. Perhaps it would be accessible to both the cog and the hub, though not necessarily at the same time (and, no, I don't know why the hub would need access, but maybe for some kind of direct manipulation for modulation or something). Anyway, having a separate CLUT area would potentially allow the transform circuitry, that I would think would be after-and-connected-with the video shifter, to access such a separate CLUT directly without causing the cog to stall, at least for non-hub cog instructions (unless n-port registers already get around such stalls).

Moreover, a separate CLUT associated with each cog allows for use of the stacks that the prior P2 was designed with, without cutting into valuable cog registers (how can we give those up!). And, conceptually, having the CLUT there (whether for color info, general data or stacks) would seem to me to be a really good organizational thing in the mind of the programmer because, if such CLUTs are not separate from cog registers, then the programmer will always need to carefully analyze how much cog space can be dedicated to it, and such analysis will often be gut-wrenching. But if the CLUT is separate/additional, then it's there if you need it (use it or lose it). But there are so many potential uses for it that such power and flexibility would seem to outweigh the times when it wasn't used (wasted). I'm pretty sure I'd be willing to eat into the 512KB somewhat (not too much, though) to have that. A 256-entry CLUT at one long per entry takes 1024 bytes, or a total of 16KB for 16 cogs, of n-port memory (sorry, I don't know how many ports it would take). It may very well be more expensive than hub memory, but if it "only" cut into hub memory by, say, 32KB (double the 16KB as a fudge factor), I'm guessing it would really be worth it from an architectural/programming standpoint.

Brian Fairchild · 2014-05-16 03:59

cgracey wrote: »

Good idea. 16x oversampling for NTSC...

cgracey wrote: »

This would help composite video signal generation quite a bit.

Composite video is dead. Please bury it and move on. The real world wants bit-mapped LCD displays.

jmg · 2014-05-16 04:10

Brian Fairchild wrote: »

Composite video is dead. Please bury it and move on. The real world wants bit-mapped LCD displays.

Nope, too sweeping a statement.
The simple factor of cables, is why we still have Composite video.
Plenty of Car Monitors have Composite, as do their Cameras.
Component Video can give better results, but needs more wires, so is less common.

I would agree Parallel connected LCDs is a significant market, but it looks like Composite (& component) can come via the Video DACs with high sample RATES almost for free, so it is valid to plan for.
People will be writing code for Composite, and expecting to port code too.

RossH · 2014-05-16 04:17

Brian Fairchild wrote: »

Composite video is dead. Please bury it and move on. The real world wants bit-mapped LCD displays.

Well, if it isn't dead now, it certainly will be by the time this chip comes out!

Ross.

Brian Fairchild · 2014-05-16 04:22

jmg wrote: »

Nope, too sweeping a statement. The simple factor of cables, is why we still have Composite video.

Composite video might exist for low-end applications but I'll bet that no more than 1% of TMCs* will use it.

RossH wrote: »

Well, if it isn't dead now, it certainly will be by the time this chip comes out!

You owe me a new keyboard. Right, off to make another coffee.....

*TMC = This Month's Chip.

Tubular · 2014-05-16 04:25

JR, for an 8 bit lookup, you can think of the CLUT as like executing the instruction
mov #$1F7, <index byte from hub>

i.e. the next 8-bit datum to look up is copied into the source. The move instruction looks up the corresponding 32-bit data stored at that location, and puts it in the DAC output register that goes to the pins. So the building blocks are in place, and the logic synthesis should merge it right in pretty easily.
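As a rough software model of that lookup (not the hardware itself), the sketch below treats the CLUT as a 256-entry list of 32-bit patterns indexed by each pixel byte. The grey-scale palette and the names `make_grey_clut`, `translate` and `dac_levels` are hypothetical, and the split of a long into four DAC bytes, lowest byte first, is an assumption.

```python
def make_grey_clut():
    """Hypothetical 256-entry palette: the same 8-bit level on all four DACs."""
    return [(v << 24) | (v << 16) | (v << 8) | v for v in range(256)]


def translate(pixels, clut):
    """Use each 8-bit pixel as an index and return the 32-bit DAC patterns."""
    return [clut[p & 0xFF] for p in pixels]


def dac_levels(pattern):
    """Split a 32-bit pattern into four 8-bit DAC values, lowest byte first."""
    return [(pattern >> shift) & 0xFF for shift in (0, 8, 16, 24)]


clut = make_grey_clut()
patterns = translate(bytes([0x00, 0x80, 0xFF]), clut)
print([hex(p) for p in patterns])
```

Swapping in a different table is all it would take to get the waveform or modulation uses mentioned in the edit note.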

Edit: Also, yes you're right there are many other applications once you have such a lookup table arrangement. Arbitrary waveform generation, frequency modulation, phase modulation etc

Brian Fairchild · 2014-05-16 04:48

jmg wrote: »

... but it looks like Composite (& component) can come via the Video DACs with high sample RATES almost for free, so it is valid to plan for.

...but color modulation is no longer part of the package. This is a little sad, in that a one-wire color signal was nice, but you wouldn't want to have to read small text on a TV, anyway, so it was kind of a novelty.

mindrobots · 2014-05-16 04:50

RossH wrote: »

Well, if it isn't dead now, it certainly will be by the time this chip comes out!

Ross.

Most of us will be!

But there will be a healthy trail of feature creepers to lead any survivors back to our rotting carcasses and dying dreams!

MJB · 2014-05-16 05:24

cgracey wrote: »

The trouble is, this takes 1000+ flops per cog, which is 16k+ for the chip. If we cut the hub transfer from 32 bits per clock down to 16 bits per clock, we'd nearly halve the gates in the hub memory mux mechanism. We'd also halve the number of flops in the video circuits. This would result in a hub<-->cog transfer rate of one long every two clocks, which would perfectly match hub exec requirements. Hub exec would be able to execute at 100% speed in a straight line, but then need to wait for the window to come around for a new branch address. This would mean that for video, only 16-bit pixels could be streamed from memory at Fsys/1 (no more 32-bit possibility at the Fsys/1 rate). What do you guys think about this compromise?

I know, "all COGs shall be equal ..."
but is there any application needing 16 video streams ?
OR could this be made a HUB resource ... maybe two of them for dual displays? ( I have NO idea of chip design !!)
should save a lot of flops.
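For a sense of scale, here is a back-of-envelope sketch of the flop counts in Chip's description: one spool buffer plus one shifter, each holding a 16-slot block. The two-engine shared case is only an assumption based on the suggestion above, and it ignores whatever flops hub exec would still need in each cog.

```python
COGS = 16
SLOTS_PER_BLOCK = 16      # one transfer per hub slot, 16 slots per block


def video_flops(bits_per_clock, stages=2):
    """Flops for the spool buffer plus the shifter (stages=2), per video engine."""
    return stages * SLOTS_PER_BLOCK * bits_per_clock


def clocks_per_long(bits_per_clock):
    """How many clocks it takes to move one 32-bit long between hub and cog."""
    return 32 // bits_per_clock


full = video_flops(32)    # 32 bits per clock
half = video_flops(16)    # the proposed 16 bits per clock
shared = 2 * full         # assumption: two full-width engines in the hub

print("per cog, 32-bit:", full, "chip:", full * COGS)
print("per cog, 16-bit:", half, "chip:", half * COGS)
print("two shared engines:", shared)
```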

Tubular · 2014-05-16 05:25

Brian Fairchild wrote: »

Composite video is dead. Please bury it and move on. The real world wants bit-mapped LCD displays.

Does this mean you're calling for the ability to do LVDS?

Cluso99 · 2014-05-16 05:27

Brian Fairchild wrote: »

Composite video is dead. Please bury it and move on. The real world wants bit-mapped LCD displays.

Not everyone has discarded composite. It is very much alive because of simplicity and cheapness.

Anyway, video of any kind is not on Ken's important list.

Brian Fairchild · 2014-05-16 05:51

Tubular wrote: »

Does this mean you're calling for the ability to do LVDS?

Oh no, nothing that complicated. I suspect the 'sweet spot' is for WQVGA display running 6:6:6.

Tubular · 2014-05-16 05:54

I did a bit of spreadsheet simulation for cases where FSys/3 and higher, and I think around 250 flops (6 elements deep * 32 bits + supporting flops) may be able to decode to a linear stream. I'm going to think a bit more about it and run some more test values tomorrow.

Intuitively, for Fsys/3, if you have a full buffer, then encounter a hub slot miss, at that point you have enough data in the buffer to cover you for 6*3=18 cycles. By the time you almost exhaust that buffer, you've gone your 15 cycles around the hub, and resume picking up the data you need to fill the buffer again.

Fsys/1 and Fsys/2 would be done more directly
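A toy simulation of that intuition, under stated assumptions: 16 hub slots, sequential longs sitting in consecutive slots, a fetch possible only when the rotating slot matches the next address and the buffer has room, and output starting once the buffer has had time to fill. The function name `count_underruns` and the exact ordering within a clock (fetch first, then consume) are my own modelling choices, not the spreadsheet's.

```python
def count_underruns(depth, divisor, clocks=4800, banks=16):
    """Count clocks where the video side wanted a long but the buffer was empty."""
    fifo = 0            # longs currently buffered
    next_addr = 0       # next sequential hub long to fetch
    underruns = 0
    start = depth       # begin output once the buffer has had time to fill
    for clk in range(clocks):
        # Hub side: the slot for bank (clk % banks) is offered on this clock.
        if fifo < depth and next_addr % banks == clk % banks:
            fifo += 1
            next_addr += 1
        # Video side: take one long every `divisor` clocks.
        if clk >= start and (clk - start) % divisor == 0:
            if fifo:
                fifo -= 1
            else:
                underruns += 1
    return underruns


for depth in range(4, 8):
    print("depth", depth, "underruns at Fsys/3:", count_underruns(depth, 3))
```

In this model a 6-deep buffer just covers the wait for the missed slot at Fsys/3, while a 5-deep one runs dry first.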

Brian Fairchild · 2014-05-16 05:54

MJB wrote: »

but is there any application needing 16 video streams ?

No, at least not any that this chip is targeted at.

tonyp12 · 2014-05-16 06:15

For video mode, could ReadBlock be made flexible so that it writes to only one address in the cog and does not advance its address?
And instead of just 16, it would do up to 1024 reads, or run forever.

Could you not then make rdblock write to the cog's DAC or OUT etc?

You will make a dummy rdlong from #$0E first, so when the ReadBlock starts (or have a flag to stall until xx0) you will get an address starting on $xxx0 first, as grabbing 16 in any order doesn't fly.
There should be rdblock sysclk 1/2/4/8/16 choices, but you also have to choose a crystal correctly and, if needed, interleave the bitmap in hub RAM beforehand.

Could it be made to use 256 longs starting at cog address $0 as a lookup table?
Your code would have a jmp #$100 initially here, but your first line of code replaces the jmp opcode with a real lookup value.

Rayman · 2014-05-16 06:19

Brian Fairchild wrote: »

I wondered what that noise during the night was. And now I know, it's feature creep.

I'm feeling deja vu... In a bad way...

JRetSapDoog · 2014-05-16 06:28

MJB wrote: »

but is there any application needing 16 video streams ?
OR could this be made a HUB resource ... maybe two of them for dual displays? ( I have NO idea of chip design !!)
should save a lot of flops.

Brian Fairchild wrote: »

No, at least not any that this chip is targeted at.

I'm odd, but I've used more than one display on the P1. The flexibility of the Prop is its distinctive feature. One reason that there aren't more multiple video setups is because it's not easy to do. But the P1/P2 simplifies things. Plus, those 16 video channels likely have more uses than just video. And the infrastructure behind them (the flops) may support hub exec or similar.

But about running 16 displays, consider one less, 15. The P2 could have a master program in one cog (if needed) and drive 15 smallish (for example ~7") WVGA LCD displays, possibly turned vertically, all deployed in a linear fashion to act as a "panoramic" message display system, each LCD displaying 1 (or 2) characters of a message. If using VGA and all of RGB + VH, one would need 75 pins, which exceeds what the package provides, but maybe one could run 12. And if using composite displays, one could run 15 on 15 pins (or maybe even more displays if there is a way to switch the DACs to other pins on the fly while still maintaining the previous pin state). That's ONE SMALL CHIP (for man, lol) driving 12 to 15 (or even 16) displays, if not more! Yes, it's an oddball example, but it shows the flexibility of the chip.

Brian Fairchild · 2014-05-16 06:36

JRetSapDoog wrote: »

The P2 could have a master program in one cog (if needed) and drive 15 smallish (for example ~7") WVGA LCD displays,

All in 512k?

Brian Fairchild · 2014-05-16 06:47

JRetSapDoog wrote: »

I'm odd, but I've used more than one display on the P1.

Actually, so have I. I've got a commercial unit out there which has two VGA outputs plus a keyboard and mouse (not used). Kind of a dual-head VT100.

JRetSapDoog · 2014-05-16 07:52

Brian Fairchild wrote: »

All in 512k?

Yeah, because you'd just need 1-bit color (white on blue, etc.) and the character bitmaps (for each letter) could be stored in the hub and then read on-the-fly as needed (or completely downloaded into a cog if small enough, though unlikely, or a mathematical, scalable font could perhaps be used) to build the display showing 1 letter per cog, basically the way tile-map drivers work. For example, designing as I go, here, using 10X the resolution of the P1's built-in font (16x32 px) for 160px wide x 320px tall and then doubling adjacent pixels in both directions to enlarge it to give 320px wide x 640px tall (to nicely fill a screen used in portrait mode, 480w x 800h along with a visually pleasing margin) would take, let's see, 160x320 pixels/char = 51,200 pixels, which, divided by 8 for bytes, gives 6400 bytes/character. Well, if we needed 100 characters (A..Z, a..z, 0..9, etc.), we'd need Bill Gates's 640KB of hub, but we don't have that, and 80 chars would take 512KB, so maybe go with 70 chars for 448KB. BTW, I'm assuming no SD card and I don't guess the external ROM chip could be read fast enough continuously. The code wouldn't take up much space, likely well under 32KB, and the video driver cog code would be 2KB or less, with multiple instances of it used, so just 2KB total, there. Well, anyway, adjust the details as necessary.
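A quick sketch to check that font budget, assuming 1 bit per pixel and that "512KB" means 512 x 1024 bytes (the figures above round KB to 1000 bytes). The 34KB reserved for code and driver is just the 32KB + 2KB guess from the paragraph above, and `glyph_bytes` and `max_chars` are made-up helper names.

```python
HUB_BYTES = 512 * 1024            # 524,288 bytes if "512KB" is binary KB


def glyph_bytes(width_px, height_px, bits_per_pixel=1):
    """Bytes needed to store one character bitmap."""
    return width_px * height_px * bits_per_pixel // 8


per_char = glyph_bytes(160, 320)


def max_chars(budget_bytes, reserved_bytes=0):
    """How many glyphs fit once code/driver space is set aside."""
    return (budget_bytes - reserved_bytes) // per_char


for chars in (100, 80, 70):
    print(chars, "chars need", chars * per_char, "bytes")
print("fit in hub with 34KB reserved:", max_chars(HUB_BYTES, 34 * 1024))
```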

Imagine a potential customer walks into Parallax's lobby and is greeted by a panoramic display of 15 LCD's boldly spelling out the message " Welcome to... " followed by " Parallax, Inc. " and messages could scroll, with a sign or message saying "Driven by 1 P2!" It would make a solid point about the flexibility of the chip.

Brian Fairchild wrote: »

Actually, so have I. I've got a commercial unit out there which has two VGA outputs plus a keyboard and mouse (not used). Kind of a dual-head VT100.

That's cool! Good to know that I'm not the only strange one out there. Eh, my meaning is resourceful one, though perhaps peculiar. You're making my case. Anyway, I don't see Chip breaking out video and handling it separately. And if he did, I would hope that he wouldn't limit it to just 1 or 2 displays. Besides, it seems that this video stuff isn't hurting anything, considering that it can bring other advantages, so why would he change now (and lose the symmetry/flexibility)! Well, yeah, the ADC's/DAC's are now limited to cog groups (I believe), but video was also limited to pin groups on the P1. Anyway, we don't want just another "me-too" ARM with dedicated (and constrained) video, do we? Not if we can get much more flexible video and other bonus features as well (although I've seen ARM's with dual and even quad display capabilities, though not necessarily using the same type of video for each monitor).

Brian Fairchild · 2014-05-16 07:56

JRetSapDoog wrote: »

Good to know that I'm not the only strange one out there. Eh, my meaning is resourceful one, though perhaps peculiar....

Lawson · 2014-05-16 08:13

cgracey wrote: »

The way I see the video working, an initial block (16 longs) of hub memory would get spooled up over 16 clocks into 512 flops, then it would be transferred into the video shifter (another 512 flops). The next block would immediately get spooled up, as well, into the first set of flops so that a spooled block is always waiting to be transferred to the shifter on the clock that it is needed. For Fsys/1 cases, the spooling would go on continuously. The trouble is, this takes 1000+ flops per cog, which is 16k+ for the chip. If we cut the hub transfer from 32 bits per clock down to 16 bits per clock, we'd nearly halve the gates in the hub memory mux mechanism. We'd also halve the number of flops in the video circuits. This would result in a hub<-->cog transfer rate of one long every two clocks, which would perfectly match hub exec requirements. Hub exec would be able to execute at 100% speed in a straight line, but then need to wait for the window to come around for a new branch address. This would mean that for video, only 16-bit pixels could be streamed from memory at Fsys/1 (no more 32-bit possibility at the Fsys/1 rate). What do you guys think about this compromise?

I think the video shifter also needs to have modes where instead of all 4 DACs getting updated every clock, you could do one or two DACs, instead, so that you get more efficiency from your data, when only one or two DAC updates are needed. This would help composite video signal generation quite a bit.

Stupid question. With the Fsys/N (N>1) modes, why can't the hub reads be buffered in COG memory adjacent to the color palette LUT? Assuming the cog is stalled during a video burst.

Marty

P.S. don't forget the symmetrical input modes! There are several 100-200Msps ADCs I'd love to play with. Also, I'd like to use the color palette LUT for streaming inputs. Several of these fast ADCs do bit-scrambling to reduce EMI.
