Seems very complicated, and it looks like, to optimize it, you need to chop your data up between 16 different blocks of RAM?
Just give each cog its own 32K of RAM, and maybe a small shared buffer or a data-exchange command, and be done with it. The 'Hub' has been outgrown with the move to 16 cogs. For a small shared data pool, fine; for stuff that these new high-power cogs will need to use all the time (data/program), no.
This seems to be moving in the opposite direction of KISS from a programmer's standpoint.
Sampling at half the sys clock is not possible with 2 cogs.
It's true that you need 16 instructions to transfer pins to/from cog RAM, and that this will overlap with the other cog transferring cog RAM to/from hub, but you still have those 2 clocks to set up the rd/wrblock, which will jitter; it can't be continuous: 16 clocks are needed to sample 16 longs from the pins, but then you need 18 to transfer them to the hub.
16 vs. 18: two cogs can't team up for continuous sampling; it's not enough.
16 instructions take 32 sysclks (instructions are 2 clocks each in most cases), so you have plenty of overlap space to cover the 2 clocks for the rd/wrblock. It's 32 clocks of sampling, then 18 clocks of writing to hub, then 14 clocks of other stuff (or NOPs) done by each of the 2 cogs. So it works out.
Two cogs doing jitter-free pin reading at 1/2 sysclk is doable.
Add 1 to startcnt for a second pair of cogs and you get 100% sysclk sampling (store to separate hub addresses and let software sort the samples out later):
REP     forever                 ' repeat forever (is there a forever setting?)
waitcnt startcnt,#64            ' startcnt is offset by 32, and ptra by 16, at init for each of the 2 cogs
mov     t0,ina                  ' 2nd cog would now start its wrblock
mov     t1,ina
mov     t2,ina
mov     t3,ina
mov     t4,ina
mov     t5,ina
mov     t6,ina
mov     t7,ina
mov     t8,ina                  ' 2nd cog is now doing the final two long writes
mov     t9,ina                  ' 2nd cog is: add ptra,#32
mov     tA,ina                  ' 2nd cog is: waitcnt startcnt,#64
mov     tB,ina
mov     tC,ina
mov     tD,ina
mov     tE,ina
mov     tF,ina                  ' 2nd cog:
wrblock t0,ptra                 '   mov t0,ina
add     ptra,#32                '   mov t1,ina
EndRep                          '   ....
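A quick numeric sanity check of that budget, as a sketch. The 2-clock instruction timing and the 18-clock wrblock are the figures claimed in these posts, not confirmed silicon behavior:

```python
# Sanity-check the two-cog tag-team budget described above.
# Assumptions (from the posts, not from final silicon): each of the 16
# 'mov tN,ina' instructions takes 2 clocks, and a 16-long wrblock takes
# 2 setup clocks plus one long per clock.
PERIOD = 64                  # waitcnt startcnt,#64: each cog loops every 64 clocks
SAMPLE = 16 * 2              # 16 mov instructions at 2 clocks each
WRBLOCK = 2 + 16             # setup + 16 longs at one per clock

spare = PERIOD - SAMPLE - WRBLOCK
print(SAMPLE, WRBLOCK, spare)    # 32, 18, 14 - the numbers quoted above

# Sampling fills exactly half the period, so the write fits in the other half:
assert SAMPLE == PERIOD // 2
assert WRBLOCK <= PERIOD - SAMPLE
```

So with the two cogs started 32 clocks apart, one samples while the other writes, and 14 clocks per loop are left over.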
Stop thinking you need to chop your data up or whatever. It's all handled for you if you don't care about the precise timing of the reads/writes. You can just read or write to hub like you always did before and it will work. If you care about synchronizing or precise timing, then you have to do something extra, like use rd/wrblock, insert appropriate spacer instructions (like on P1), or adjust your read order.
For normal, everyday reading or writing to the HUB, it just works, no need to worry about anything.
Exactly. It's a lot of steam about nothing. Those doing the most complaining, for the most part, are those most likely never to do anything that would require knowing the nitty-gritty details anyway.
On another matter... Someone expressed surprise that Cyclone V isn't faster than Cyclone IV. It may bear repeating once again that:
1) Cyclone is Altera's low-cost family. For performance, consider Stratix.
2) Altera's taglines for Cyclone V are lower power and lower system costs, not higher speed.
It's not that complex, at all. In fact, the Verilog code that I posted yesterday IS the complete solution, as it just gets synthesized and poses no extra layout effort.
Before Roy showed up, I was thinking about how the hub memory would work and I was dissatisfied with what I was imagining. I was delaying implementation because I knew I didn't have the right idea, yet. So, this is all providence.
I've been running this scheme through my head and I think it is as fast as any scheme could be that has to connect 16 cogs to 16 RAMs. Each RAM muxes one of 16 cogs' control signals for its inputs, and each cog muxes one of 16 RAMs' data outputs for reading. Though rather large in transistor count, it's way fewer circuit stages than a central hub would have imposed on the signals. So, it seems pretty ideal.
One thing about the current scheme that is not ideal is that when data is transferring between hub and cog, execution is stalled. This means that for some high-bandwidth apps, two cogs will be needed to tag-team, so that while one loads/saves data from/to the hub, the other outputs/inputs data via pins/video. If the foundry had a 3-port RAM, the hub memory exchange could occur in the background, which would be fantastic. What we have is pretty good, though, and if the cogs can be kept small, it's no big deal to use two of them in an app where sustained bandwidth is needed.
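As I read the scheme, it can be modeled in a few lines. This is my own toy sketch, not Chip's Verilog; the slice assignment and rotation direction are assumptions:

```python
# Toy model of the rotating 16-cog / 16-RAM hub (my sketch, not Chip's code).
# Assumption: long address 'a' lives in RAM slice a % 16, and on clock 't'
# cog 'c' is wired to slice (c + t) % 16, so the mux rotates every clock and
# each cog reaches every slice once per 16 clocks.

def slice_for(addr):
    return addr % 16            # which RAM holds this long

def slice_of_cog(cog, t):
    return (cog + t) % 16       # which RAM this cog is wired to on clock t

# A cog streaming sequential longs gets one long per clock: start at the
# clock where the rotator hands it slice 0, and addresses 0..15 line up.
cog = 3
start = (0 - cog) % 16          # first clock where cog 3 sees slice 0
hits = [slice_of_cog(cog, start + k) == slice_for(k) for k in range(16)]
print(all(hits))                # True: one sequential long per clock
```

That one-long-per-clock property for in-order transfers is what makes the rd/wrblock bandwidth claims work.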
Well, I'm not going to argue with any of that, as you are the expert.
However, can you at least comment on Cluso's adaptation?
Wouldn't 16 cores with separate 18-26K RAM reduce the need for crutches such as LMM for many, and allow those cores to do actual work instead of stalling / having to use the hub as their main memory?
LMM would still be available up to ~128K for large-program use, which I am not sure is ultimately a valid concern/complaint anyway.
If someone is going to use the majority of hub for one program, then what is going to be left for the other 15 cores?
It's not that complex, at all.
<snip>.
One thing about the current scheme that is not ideal is that when data is transferring between hub and cog, execution is stalled. This means that for some high-bandwidth apps, two cogs will be needed to tag-team, so that while one loads/saves data from/to the hub, the other outputs/inputs data via pins/video.
What we have is pretty good, though, and if the cogs can be kept small, it's no big deal to use two of them in an app where sustained bandwidth is needed.
I agree it is a great start, anything that bumps peak bandwidth this much, is a Whole New Engine.
The detail is that when you drop a Whole New Engine into your spiffy Hot Rod, you really should also check again the little things, like the Gearbox, Clutch, Diff and Brakes.
Some of those may need some small changes, to cooperate better with the Whole New Engine.
Where I see this now is very early on in Whole New Engine terms, and some minor silicon additions can mitigate cases and avoid having to pair COGs - pairing is never easy to code, or debug without scopes, or keep in sync during edits and updates.
It also makes porting from P1 easier, and Objects will be easier to follow and modify if they do not come with "Uses 2 COGs" in the headers...
I agree it is a great start, anything that bumps peak bandwidth this much, is a Whole New Engine.
The detail is that when you drop a Whole New Engine into your spiffy Hot Rod, you really should also check again the little things, like the Gearbox, Clutch, Diff and Brakes.
Some of those may need some small changes, to cooperate better with the Whole New Engine.
Where I see this now is very early on in Whole New Engine terms, and some minor silicon additions can mitigate cases and avoid having to pair COGs - pairing is never easy to code, or debug without scopes, or keep in sync during edits and updates.
It also makes porting from P1 easier, and Objects will be easier to follow and modify if they do not come with "Uses 2 COGs" in the headers...
I strongly agree. A major change like this shouldn't be a simple open and shut case then be rushed off to the fab without careful consideration of the potential implications. I think this is the single biggest change of the entire chip, so for Parallax's sake, I urge them to tread lightly.
The only case I picture that would need two cogs is video, pushing 24 bits per pixel at Fsys. I think the solution for that is to get the cog out of the way, and have memory stream directly to the DACs.
Remember, with this memory scheme, all cogs can stream longs in and out of the hub at Fsys (200MHz). Going between pins and hub is useful, too, for SDRAM. Those circuits in the cog would be almost nothing.
EDIT: Never mind about SDRAM. It would use a 16-bit data path and clock at Fsys/2, so that cog software could do the signalling.
Maybe there is no need to push DAC data at Fsys. It would be simple to do, though. Maybe transferring pin data to and from hub ram at Fsys would be useful. It would make a 200MS/s logic analyzer.
1) What MIPS do you get for hubexec?
2) Is that figure fixed or does it jitter?
I don't think the design is developed enough to answer your questions. It depends on whether there will be instruction and data caches, and how much of each. It also depends on whether the instruction caches will be pre-fill like they were on the previous design. Oh, I think it also depends on whether hubexec will be implemented.
One thing about the current scheme that is not ideal is that when data is transferring between hub and cog, execution is stalled. This means that for some high-bandwidth apps, two cogs will be needed to tag-team, so that while one loads/saves data from/to the hub, the other outputs/inputs data via pins/video.
Hi Chip/All. For those of us struggling to hang on to this fast-moving merry-go-round (but enjoying the ride), I was wondering if someone could put a (possibly rough) number on the phrase "high-bandwidth apps" in terms of bytes or longs per second? A number would give us a better feel for things, even though the sensitivity of the application could come into play.
For example, would a single cog be sufficient for 640x480 VGA or 800x480 WVGA video at, say, 1 byte/pixel of color depth (256 colors/pixel), which involves roughly 18 to 23 MB/s of image data at about 60 Hz? It would be nice if one cog could drive such a relatively low-resolution display (as the coding simplicity of dealing with one cog will be appreciated by the user). I'm guessing that with cog buffering between stalls (if any), this would be doable in 1 cog. Hmm, in fact, block reads might not even be necessary for the 640x480 version.
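Those figures check out on the back of an envelope (my arithmetic: visible pixels only, blanking ignored):

```python
# Back-of-envelope check of the 18-23 MB/s figures quoted above.
# Assumptions: 1 byte/pixel, 60 Hz refresh, visible pixels only (no blanking).
def visible_bandwidth(width, height, hz=60, bytes_per_pixel=1):
    return width * height * hz * bytes_per_pixel    # bytes per second

vga  = visible_bandwidth(640, 480)      # 18,432,000 B/s, ~18.4 MB/s
wvga = visible_bandwidth(800, 480)      # 23,040,000 B/s, ~23.0 MB/s
print(vga, wvga)
```

Either figure is a small fraction of a 200 MHz long-per-clock stream, which supports the guess that one cog can handle it.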
UPDATE: Okay, I see that Chip touched upon that while I was posting (and before refreshing):
The only case I picture that would need two cogs is video, pushing 24 bits per pixel at Fsys. I think the solution for that is to get the cog out of the way, and have memory stream directly to the DACs.
Hi Chip/All. For those of us struggling to hang on to this fast-moving merry-go-round (but enjoying the ride), I was wondering if someone could put a (possibly rough) number on the phrase "high-bandwidth apps" in terms of bytes or longs per second? A number would give us a better feel for things, even though the sensitivity of the application could come into play.
For example, would a single cog be sufficient for 640x480 VGA or 800x480 WVGA video at, say, 1 byte/pixel of color depth (256 colors/pixel), which involves roughly 18 to 23 MB/s of image data at about 60 Hz? It would be nice if one cog could drive such a relatively low-resolution display (as the coding simplicity of dealing with one cog will be appreciated by the user). I'm guessing that with cog buffering between stalls (if any), this would be doable in 1 cog. Hmm, in fact, block reads might not even be necessary for the 640x480 version.
The only case I picture that would need two cogs is video, pushing 24 bits per pixel at Fsys. I think the solution for that is to get the cog out of the way, and have memory stream directly to the DACs.
....
... Maybe transferring pin data to and from hub ram at Fsys would be useful. It would make a 200MS/s logic analyzer.
Certainly, a means to connect the Hub Rotate directly to/from Pins or DAC would be very useful, at fSys/N rates.
That is just the sort of small silicon assist I am talking about.
Is there still a Video Buffer / LUT in the COG ?
I have another small silicon assist, in the form of a Nibble-Adder (just 4 bits, a special adder) that can allow SW to stream HUB memory, in just one cog, at fSys/3, fSys/5, fSys/7, fSys/9, fSys/11, fSys/13, fSys/15, fSys/17.
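The reason those divisors are all odd, as I understand the idea: a 4-bit adder stepping the hub-slice index by N covers all 16 slices exactly when N is coprime to 16, i.e. odd. A sketch (mine, under that assumption):

```python
# Why odd-N divisors keep hub memory use dense (my sketch of the
# nibble-adder idea, not a spec): a 4-bit adder advances the slice
# index by N each access; the walk covers all 16 slices exactly when
# gcd(N, 16) == 1, which for integers means N is odd.
from math import gcd

def slices_visited(n):
    return len({(k * n) % 16 for k in range(16)})

dense = [n for n in range(3, 18) if slices_visited(n) == 16]
print(dense)                        # [3, 5, 7, 9, 11, 13, 15, 17]
assert all(gcd(n, 16) == 1 for n in dense)
print(slices_visited(4))            # even N: only 4 of 16 slices reachable
```

Which matches the caveat below that even-N cases only work with sparse memory use.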
Certainly, a means to connect the Hub Rotate directly to/from Pins or DAC would be very useful, at fSys/N rates.
That is just the sort of small silicon assist I am talking about.
Is there still a Video Buffer / LUT in the COG ?
I have another small silicon assist, in the form of a Nibble-Adder (just 4 bits, a special adder) that can allow SW to stream HUB memory, in just one cog, at fSys/3, fSys/5, fSys/7, fSys/9, fSys/11, fSys/13, fSys/15, fSys/17.
Fsys/N would be great, but how can that be achieved without affecting all cogs' access rates?
For example, would a single cog be sufficient for 640x480 VGA or 800x480 WVGA video at, say, 1 byte/pixel of color depth (256 colors/pixel), which involves roughly 18 to 23 MB/s of image data at about 60 Hz? It would be nice if one cog could drive such a relatively low-resolution display (as the coding simplicity of dealing with one cog will be appreciated by the user).
Good targets.
The BW can exceed these values; the tricky bits of this are:
a) getting the BW down to pixel-clock speeds of 20~40 MHz, when using a higher fSys.
fSys/N control would be nice, in a single COG.
b) 8-bit video is a little light, so a 16-bit (or more) colour-palette LUT would help - that needs the ability to insert a video LUT between the data and the pins.
Fsys/N would be great, but how can that be achieved without affecting all cogs' access rates?
I can get the odd-N cases, with efficient use of all memory, using REPS and a nibble adder.
(Even-N cases are possible, but only with sparse memory use, so even-N has caveats; it could work for video, where you re-index every scan line, but is less useful for a logic analyser.)
Hmm... I can see unused RAM is costly, but this really can help compensate for overall RAM being not quite enough.
What about using COG RAM as a LUT, via a (3+ cycle?*) video-LUT opcode:
RDBYTIL PinReg,PTR++
This reads a byte from hub, then uses it as an address into the upper 256 longs of cog RAM, and writes that long to masked pins, or a DAC.
It would also be usable for DDS sine/triangle loops, as I think the sine table is also gone?
*3 cycles may be trickier than 4, but would allow 40 MHz video at 120 MHz fSys, and support peak video clocks of ~66 MHz.
Chip, with this new system, how does it affect the performance of rd/wrbyte and rd/wrword?
I know it's awesomely fast at DMA-type stuff, but from my experience most hub-op usage isn't block-based but random pickings, and a lot of what is block-based is 'read and act upon data, then read the next long/word/byte'.
Or am I just not seeing the full picture of this new system? I'm getting a fear of it being slower in the long run than the old round-robin.
I think this compares very favorably to what Hanno was able to pull off with P1... I think he had 32-bit wide capture at 80 MSPS into ~8 kB of cog ram using 1/2 the cogs.
He did this miracle with very complicated self-modifying code and I don't think there was any time margin during acquisition to do any trigger processing...
Here we could do 64-bit wide capture at 200 MSPS into ~500 kB of HUB RAM using 1/2 the cogs and still have margin to easily do pretriggering.
And, the code would be very simple...
On top of that, we could do arbitrary waveform generation at high speed and analog capture on many other pins at slow speed...
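The depth and duration of that capture follow directly from the numbers quoted (my arithmetic):

```python
# Depth/time arithmetic for the 64-bit, 200 MSPS capture described above.
# Assumption: ~500 kB of hub RAM is available as the capture buffer.
BYTES_PER_SAMPLE = 64 // 8                  # 8 bytes per 64-bit sample
RATE = 200_000_000                          # 200 MSPS
BUFFER = 500 * 1024                         # ~500 kB of hub RAM

samples = BUFFER // BYTES_PER_SAMPLE        # capture depth in samples
capture_us = samples / RATE * 1e6           # capture window in microseconds
print(samples, capture_us)                  # 64000 samples, 320.0 us
```

So the window is short at full rate, but 64K samples of 64-bit data is still far beyond what the P1 approach could buffer.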
Here we could do 64-bit wide capture at 200 MSPS into ~500 kB of HUB RAM using 1/2 the cogs and still have margin to easily do pretriggering.
I think Chip was musing about possible DMA-like direct paths, as he said more than you quoted. My emphasis added.
[" I think the solution for that is [U]to get the cog out of the way[/U], and have memory stream directly to the DACs."]
&
[" Maybe transferring [U]pin data to and from hub ram at Fsys would be useful[/U]. It would make a 200MS/s logic analyzer."]
Once the data can stream HUB-COG via the new rotate scheme, it makes sense to think a little about what simple things can be done with it.
The already mentioned BLOCK opcodes are just one such pathway.
... but random pickings, and a lot of what is block-based is 'read and act upon data, then read the next long/word/byte'.
Or am I just not seeing the full picture of this new system? I'm getting a fear of it being slower in the long run than the old round-robin.
Depends on what 'next' means. If the data is truly random, and your code simply runs as fast as it can, then it will not be slower than the P1, as the worst-case added latency for an Any_Address HUB access is +8 opcode cycles, with an average of +4, and those are 10 ns opcodes, not 50 ns opcodes.
In most high speed cases I can think of (Buffers/FIFOs/Video/Audio), data is not Random, and the more Random cases tend to have complex code doing the indexing.
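Those latency figures are consistent with a simple model (mine, assuming the slice rotator advances one slice per clock and a random address is equally likely to be 0..15 clocks away):

```python
# Model of random-access latency in the rotating hub (my sketch).
# Assumption: a random hub address lands in a random slice, so a cog
# waits a uniform 0..15 sysclks for the rotator to reach it. At 2
# clocks per opcode, that is up to ~8 added opcode cycles, ~4 average,
# matching the +8 worst / +4 average figures quoted above.
waits_clocks = list(range(16))              # possible waits, in sysclks
worst_opcodes = max(waits_clocks) / 2       # 7.5, i.e. ~8 opcodes
avg_opcodes = sum(waits_clocks) / 16 / 2    # 3.75, i.e. ~4 opcodes
print(worst_opcodes, avg_opcodes)
```

And at 10 ns per opcode, even the worst case is under 100 ns of added wait.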
I don't care that we have a rotating hub memory, although I think it spins too fast to be useful for anything that isn't synchronized to it. So having the 16-blade propeller rather than the single blade of P1 can only be better. However... and this is a big however... this solution seems to be touted as "the final solution", but it ain't. It has never addressed low-latency "random" access, that is, any access that may even be sequential but isn't synchronized, etc. What if I wanted a 60MHz bit blaster but had stuff to do between each blast? Oh yeah, that's right, just turn a cog into a hub-slave zombie and do the real work with another cog, or tag-team them up, which makes coding so much simpler and more appealing to all those masses Parallax is trying to attract?
While Chip and Co are OCD'd on this high speed spinning whirly-jig and the army of Gigabits it will blast into space I'm afraid that most applications will only gain the cog speed improvement as the hub whirls by taunting "Ha Ha, ya missed me".
Don't get me wrong, P2 will be good but might be typecast into "it's good for blasting" or something like that but use an ARM for everything else.
I think the apprehension over this new hub memory scheme is overblown. It's true that there will be some jitter for random accesses, but the flip side is that by paying attention to the order you do your writes in, you can actually get higher throughput.
I've been thinking about the video mechanism all afternoon and I just realized that because it's going to be tied to the system clock, all the clock domain decoupling that has been part of video since Prop1 can go away. Now, there can be different instructions to do different video output streams. There's no longer a need to chain video commands, in other words. This means that we CAN have a 256 LUT by reading the pixels from hub, translating them via cog RAM into 32-bit patterns, and outputting them to the DACs. This simplifies video quite a bit.
There can now be a small set of video output instructions that get the job done in a simple way, outputting a whole visible scan line at a time:
VID32........32-bit hub-to-DAC mode at Fsys/N
VID16........16-bit hub-to-DAC mode at Fsys/N
VID8..........8-bit hub-to-LUT-to-DAC mode at Fsys/N
VID4..........4-bit hub-to-LUT-to-DAC mode at Fsys/N
VID2..........2-bit hub-to-LUT-to-DAC mode at Fsys/N
VID1..........1-bit hub-to-LUT-to-DAC mode at Fsys/N
Once these instructions are over, they can return the DAC states to whatever they were before, with a mapped DAC register holding the four 8-bit values. That way, horizontal syncs can be done with 'MOV DAC,dacstates' and 'WAIT clocks' instructions. This simplifies the video greatly. Because there is no decoupling, though, the cog will be busy while it generates the pixels.
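The hub-to-LUT-to-DAC path of the VID8 mode could be modeled like this. Purely illustrative: the names, the grayscale palette, and packing the four DAC bytes into one long are my assumptions, not Chip's design:

```python
# Toy model of the VID8 path described above: 8-bit pixels read from hub
# are translated through a 256-entry LUT in cog RAM into 32-bit patterns
# for the four 8-bit DACs. Names and data are illustrative assumptions.
lut = [(p << 24) | (p << 16) | (p << 8) | p for p in range(256)]  # e.g. grayscale

def vid8_line(pixels, lut):
    return [lut[p] for p in pixels]     # one translated long per pixel clock

line = vid8_line([0, 128, 255], lut)
print([hex(v) for v in line])           # ['0x0', '0x80808080', '0xffffffff']
```

The point being that the per-pixel work is a single lookup, which is why dropping the clock-domain decoupling makes the whole mechanism so much simpler.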
3 Port memory will have a significant area cost, which will reduce the RAM size.
For example, would a single cog be sufficient for 640x480 VGA or 800x480 WVGA video at, say, 1 byte/pixel of color depth (256 colors/pixel), which involves roughly 18 to 23 MB/s of image data at about 60 Hz? It would be nice if one cog could drive such a relatively low-resolution display (as the coding simplicity of dealing with one cog will be appreciated by the user). I'm guessing that with cog buffering between stalls (if any), this would be doable in 1 cog. Hmm, in fact, block reads might not even be necessary for the 640x480 version.
Here are the obvious transfers that are possible, at one long per clock:
hub to cog registers
hub to OUTA
hub to OUTB
hub to four 8-bit DACs
cog registers to hub
INA to hub
INB to hub
Anything else?
Your assumptions are all correct.
Is there still a Video Buffer / LUT in the COG ?
There is no LUT in the new cog.
You could then maybe mix pictures and text and graphics on the fly and output in high-res...
Yes, Chip's 'DMA' musing opens many ideas - I hope it gets in there.
The challenge is in getting fSys/N into the mix, to allow a higher fSys for the other COGs.