The Secondary Allocation can match your SuperCOG0, or it could match this new Quasi-FIFO scheme.
No, no! - what we obviously need here is a Tertiary FIFO, to augment the Primary and Secondary Allocation tables you are already proposing.
That way, the user could choose between every single hub access scheme yet proposed!
Sure, in order to implement all this we might have to sacrifice a few hundred kilobytes of Hub RAM, or perhaps a dozen or so cogs, and perhaps we might consume a few additional watts of power, and perhaps we might need to reduce the clock speed a few megahertz, and of course we would need a few dozen extra instructions to manage it all, and a few hundred pages in the data sheets to describe all the possible variations ... but so what? Just think of the massive benefit in flexibility this would give us!

I can hardly wait!

Ross.
Why is it not possible, at cog launch, to include a parameter in the cognew that enables or disables hub access for that cog? Then only the cogs that require hub access would be included in the round-robin access.
No, no! - what we obviously need here is a Tertiary FIFO, to augment the Primary and Secondary Allocation tables you are already proposing.
That way, the user could choose between every single hub access scheme yet proposed!
Sure, in order to implement all this we might have to sacrifice a few hundred kilobytes of Hub RAM, or perhaps a dozen or so cogs, and perhaps we might consume a few additional watts of power, and perhaps we might need to reduce the clock speed a few megahertz, and of course we would need a few dozen extra instructions to manage it all, and a few hundred pages in the data sheets to describe all the possible variations ... but so what? Just think of the massive benefit in flexibility this would give us!
You certainly are the forum master of Hyperbole.
The Primary and Secondary have been shown to have very low Logic cost, and no MHz impact.
Earlier I even coded a close cousin of the OP's (quasi) FIFO (a true FIFO cannot work), where the Verilog effectively 'scanned away' from the scan value to find the 'nearest requester' (sketched in C below).
That scheme will effectively allocate HUB slots as fast as they are requested, and has similar ceilings.
Yes, it is larger and slower again, as there are more balls in the air.
(but, as with most reality, not matching your hyperbole)
Because it gives the floating-point (!) averages nicely shown in #8 (i.e. very non-deterministic) and has a high logic cost, I placed it as less useful than other schemes.
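For readers trying to follow the scheme being described, here is a minimal C sketch of the 'scan away to the nearest requester' idea as I understand it from the description above. The function names and the request-clearing behaviour are my own assumptions for illustration, not jmg's actual Verilog.

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_COGS 16

/* Sketch of the 'nearest requester' scan: starting from the rotating
 * scan position, grant the hub slot to the closest cog with a pending
 * request. Purely illustrative - the real Verilog is not shown here. */
static int nearest_requester(uint16_t requests, int scan_pos)
{
    for (int offset = 0; offset < NUM_COGS; offset++) {
        int cog = (scan_pos + offset) % NUM_COGS;
        if (requests & (1u << cog))
            return cog;            /* closest pending request wins */
    }
    return -1;                     /* no cog wants this slot */
}

int main(void)
{
    uint16_t requests = 0x0112;    /* cogs 1, 4 and 8 requesting */
    for (int slot = 0; slot < 4; slot++) {
        int winner = nearest_requester(requests, slot);
        printf("slot %d -> cog %d\n", slot, winner);
        if (winner >= 0)
            requests &= ~(1u << winner);   /* request served this slot */
    }
    return 0;
}
```

Because a cog's wait depends on which other cogs happen to be requesting, grant latency varies from slot to slot - which is exactly the non-determinism (and the floating-point averages) mentioned above.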
Why is it not possible, at cog launch, to include a parameter in the cognew that enables or disables hub access for that cog? Then only the cogs that require hub access would be included in the round-robin access.
It is, of course. Primary COGs can be given deterministic access, and Secondary ones are allowed to 'have at it' - and yes, such a scheme already exists in Table Mapping solutions, as a variant on the Secondary decision choice.
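To make the cognew idea concrete, here is a minimal C sketch of a round robin that rotates over an enable mask rather than over all 16 cogs. The mask, the function name, and the behaviour are assumptions for illustration; nothing like this is confirmed for the P16X64A.

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_COGS 16

/* Hypothetical: a hub-enable mask built up as cogs are launched.
 * A cognew-style flag would set the cog's bit; the round robin then
 * rotates over enabled cogs only, so slots are not wasted on cogs
 * that never touch the hub. */
static int next_hub_cog(uint16_t hub_enable, int current)
{
    for (int i = 1; i <= NUM_COGS; i++) {
        int cog = (current + i) % NUM_COGS;
        if (hub_enable & (1u << cog))
            return cog;
    }
    return current;  /* degenerate case: nothing enabled */
}

int main(void)
{
    uint16_t hub_enable = 0x0023;  /* only cogs 0, 1 and 5 use the hub */
    int cog = 0;
    for (int slot = 0; slot < 6; slot++) {
        cog = next_hub_cog(hub_enable, cog);
        printf("slot %d -> cog %d\n", slot, cog);
    }
    return 0;
}
```

Note the caveat raised elsewhere in this thread: with dynamic membership like this, a cog's hub period depends on how many other cogs were launched with hub access enabled - which is exactly the determinism concern the Primary/Secondary split tries to address.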
I can't take all the credit - these increasingly complex and bizarre hub sharing schemes provide so much material to work with!

-Phil

Hate to say it, but it has for a while seemed as though some simply do not want users to have additional control, for whatever reason. They've stopped arguing the technical merits, which have basically been put to bed.
Sorry, but you're wrong on this. Especially on the last point.
Many of us supported some of the early hub sharing or mooching schemes, because there was some demonstrable benefit to them, and also because they incurred minimal additional cost and other consequences - and not only in additional silicon, but also in cost of learning and usage, and also compatibility and troubleshooting.
But as the schemes got more and more complex, and the costs consequently became higher and higher, some of us began asking what the actual benefit of these schemes would be, so that we could try and assess which ones were worthwhile and which ones were not.
So far, that question has never really been answered, other than a pretty lame "because more control must be better!", or an equally lame "because we can, and anyway who can predict what it might get used for in future!". To my knowledge, the closest we got to an actual concrete example of what we might be able to do with all the additional complexity was that it might enable FS USB - but on closer inspection the proposer of that idea acknowledged that even that was unlikely without other changes as well.
And to understand why this question is so important, you need look no further than the fate of the original P2 - that chip died because no-one bothered to ask these questions in the first place; people just kept piling Pelion on Ossa because each new feature seemed like a good idea. Only now, at the death, are people finally trying to decide whether each feature added was actually worth implementing, and (if so) whether there might not be a simpler form of it that could do the job at less cost.

Ross.

@PhiPi
Yes, I have a day job. That's why I'm reading the forum and making posts late at night, which in turn is why I have been dead tired at work ;-)
But as the schemes got more and more complex, and the costs consequently became higher and higher, some of us began asking what the actual benefit of these schemes would be, so that we could try and assess which ones were worthwhile and which ones were not.
With continuing hyperbole like that, is it any wonder more and more think you are merely trolling?
Claims of 'higher and higher costs' are pure nonsense - verilog has been provided showing that the gate count of this is tiny, but you seem incapable of grasping that.
Likewise, example cases of benefits have been given (many times), but again blinkered hyperbole is the response.
I think of slot allocation as being a bit like DMA on other chips. The hardware is there in the chip, but at power-up all the enable bits are set to 'OFF'. You don't have to use it, but take the trouble to read the manual, spend some time writing test code and having a play, and you will discover something wonderful.
I'm struggling to keep up to date with the multiple topics discussing slot allocation, but I don't think I've yet seen a good reason not to include a scheme like table allocation.
Yes, there might be problems with some soft-peripherals interacting, but I don't see any reason that a standard could not be worked out which would lay down a scheme for making sure that didn't happen. Yes, some more complex soft-peripherals might need a bit of head-scratching before they play nicely together, but surely that is part of being an embedded engineer? Unless the P16X64A is intended as a drag-and-drop chip, then that will always be the way.
[I first used DMA over 30 years ago on a 5MHz 8085 board with multiple 8237/8259/8274. Boy, did that setup fly, running multiple 800kbps serial lines].
I think this would work well in conjunction with the 32-slot slot-assignment table as the secondary assignment method.
Yes, something similar to this certainly could apply as the Secondary Rule, but as to 'work well' - that is less certain.
Secondary members would need separation from the Primary Group (which is another 16 signals) to avoid disturbing the important Primary determinism.
It needs more Logic than a Table to do a free-for-all Alloc from the Secondary Group, and there is no control over the relative bandwidth of the Secondary Group members, or over any pairing of resources.
Perhaps the biggest drawback is the lack of user control and visibility - making it very hard to 'sign off' during development, as right up until ship-date, the numbers can change.
A more direct Secondary table makes it possible to avoid such 'COG crosstalk', and you can 'sign off' knowing how much each project portion uses.
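For readers unfamiliar with the table approach under discussion, here is a minimal C sketch of a 32-entry slot table, showing how the number of entries a cog owns fixes its share of hub bandwidth - and makes that share inspectable for 'sign off'. The table contents are an invented example.

```c
#include <stdio.h>
#include <stdint.h>

#define TABLE_SLOTS 32

/* Hypothetical 32-entry slot table: each entry names the cog that owns
 * that hub slot. Bandwidth ratios are fixed by how often a cog appears,
 * so usage can be audited simply by inspecting the table. */
int main(void)
{
    uint8_t table[TABLE_SLOTS];
    int counts[16] = {0};

    /* Example allocation: cog 0 gets every other slot (16/32),
     * cog 1 every fourth (8/32), and cogs 2 and 3 share the rest. */
    for (int i = 0; i < TABLE_SLOTS; i++) {
        if (i % 2 == 0)      table[i] = 0;
        else if (i % 4 == 1) table[i] = 1;
        else                 table[i] = (uint8_t)(2 + (i / 4) % 2);
    }

    for (int i = 0; i < TABLE_SLOTS; i++)
        counts[table[i]]++;

    for (int cog = 0; cog < 4; cog++)
        printf("cog %d: %2d/32 slots = %4.1f%% of hub bandwidth\n",
               cog, counts[cog], 100.0 * counts[cog] / TABLE_SLOTS);
    return 0;
}
```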
Unless the P16X64A is intended as a drag-and-drop chip, then that will always be the way.
A few think that is exactly what it should try to be, which is why I then suggested a Fuse Family of parts:
PartOne, targeted at 'Plug and Play', and PartTwo, which has more control but is less 'Plug and Play' and a little more 'embedded engineer'.
With separate part codes (but the same wafer) - it is then clearer to everyone what the Objects target.
These features can be Boolean-enabled, and Parallax have proven fuse technology.
Other vendors already generate multiple part numbers from a single die design; it is common practice.
Claims of 'higher and higher costs' are pure nonsense - verilog has been provided showing that the gate count of this is tiny, but you seem incapable of grasping that.
I grasp it alright - it is just that I'm not solely focused on "gate counts". Also, I don't like to criticize Chip's amazing efforts, but you are making a very similar mistake to the one Chip made on the original P2. Each incremental feature cost was "tiny" - and he only discovered he had a problem when he put all these individual "tiny" costs together, and pretty much has had to throw the whole thing out and start all over again as a result.
The lesson Chip has learned - but you apparently haven't - is that you don't just do things because you can.
As my post said ... "additional cost and other consequences - and not only in additional silicon, but also in cost of learning and usage, and also compatibility and troubleshooting"
Although even just in the additional silicon, it is worth pointing out that relatively small changes in the operation of the hub will multiply many times over the process of verifying that the cogs, and the chip as a whole, operate correctly in all the new operating configurations that these complex schemes introduce.
But that is a one-off cost that will simply add an increment to the cost of each chip - the real costs of these schemes for users go way beyond the silicon costs. These schemes complicate - and will add significant cost to - the learning process, the adoption process, the development process, the debugging process, and the maintenance process.
So are they worthwhile? Really? Why? What new worlds do you think they will allow the P16X64A to conquer?

Ross.
So are they worthwhile? Really? Why? What new worlds do you think they will allow the P16X64A to conquer?
- it has been mentioned before...
Worlds that are cycle-deterministic, but do not come quantized neatly in multiples of 16 SysCLKs.
Worlds that do not run identical code in every COG, and where, on some COGs, cycle-deterministic bandwidths faster than the slow, fixed 100 ns are needed.
As my post said ... "additional cost and other consequences - and not only in additional silicon, but also in cost of learning and usage, and also compatibility and troubleshooting"
If you are really convinced of all those costs, then you should like the idea of a Fuse-variant, as all of those imagined costs go away on a part where nothing has changed.
As others have pointed out ... the solution to complexity is very rarely to add more complexity.

Ross.
Worlds that are cycle-deterministic, but do not come quantized neatly in multiples of 16 SysCLKs.
Worlds that do not run identical code in every COG, and where, on some COGs, cycle-deterministic bandwidths faster than the slow, fixed 100 ns are needed.
And how many of those worlds are already better addressed by other, cheaper and simpler solutions?

Ross.

What added complexity? - the fuses are already there.
Unless the P16X64A is intended as a drag-and-drop chip...
What an excellent idea. There are many here who would like to see exactly that. Jazzed, for example, started up a discussion about it a year or more ago. That is partly why all this talk of introducing COG asymmetry gets red-flagged.
This is a no-brainer. Of course the P16X64A has to be "drag and drop". Or (as I put it) "plug and play".
Anything that prevents that is (at the risk of being called a "master of hyperbole" again!) utter madness.

Ross.
Sorry, but you're wrong on this. Especially on the last point.
Many of us supported some of the early hub sharing or mooching schemes, because there was some demonstrable benefit to them, and also because they incurred minimal additional cost and other consequences - and not only in additional silicon, but also in cost of learning and usage, and also compatibility and troubleshooting.
Ross.
Which hub sharing scheme would you support then?
As I mentioned, JMG's TopScan sort of threw me initially; however, I think it became more involved because several people were complaining that they wanted Mooching, so JMG found a way to do that as well.
This may all be academic, as Chip seemed to have an idea in mind, so it may pop up later on anyway.
On the whole, several 'newbies' have said they can follow it simply enough, so I fail to see exactly why it's considered so difficult by more expert users such as yourself and Heater.
I mean, LMM came about exactly because of a failing of the Prop, to begin with.
This is yet another way to extract additional performance from the Prop, which would otherwise go unused.
People who do not want to use it are not going to be forced to use it.
People who want to use it, are going to be able to.
People who don't know any better are going to learn from their mistakes, just as they are going to learn from all of the other mistakes they are going to make with the Prop. They will then learn, or they will stop using hub sharing objects, or they can try to code a similar object with conjoined cores.
I no longer think Parallax 'needs' this to improve the P2's marketability, because that's not the plan.
However for everyone else who wants/needs the benefits of this, and especially those doing video which is mentioned all the time, it seems like a useful tool.
...so I fail to see exactly why it's considered so difficult by more expert users such as yourself and Heater
I don't like surprises. Years of surprises in all kinds of systems have worn me down. I favour regular predictable systems over chaos. It's true that often every little exception and twist in a system is understandable when looked at in isolation. It's just that these quirks have a habit of piling up. The end result is a whole mass of details you have to keep in mind all the time, and weird dependencies between things.
I just happen to think that if you want a community of users all contributing to the Propeller ecosystem it's best to make things as simple and painless as possible with no surprises. Even at the cost of a little performance here and there. Even at the cost of making some things impossible.
I mean, LMM came about exactly because of a failing of the Prop, to begin with.
This is yet another way to extract additional performance from the Prop, which would otherwise go unused.
The LMM technique is in no way comparable to any HUB arbiter hardware changes. There is nothing special about LMM, it's just the COG doing what COGs do, running code. Using LMM does not impact the operation of any other COG in the device. A user can include software into his project that may or may not use LMM without ever having to even know about it.
I'm curious why you say "because of a failing of the Prop". What "failing"? The Prop was designed the way it was and LMM tries to bend it another way. It's a bit like finding you can use a flat screwdriver as a chisel. It does not work very well but can be done to some extent. That does not mean the screwdriver has a failing.
I originally supported Cluso's "paired cog" scheme, and later the "SuperCog" scheme - because I could see they were easy to understand, easy to implement, in keeping with the Propeller philosophy, and also because I could see a definite use case for them.
Also, both have the benefit of not disrupting the determinism of the remaining cogs (or the Propeller chip as a whole), and they also fully support the kind of "plug and play" we have all grown used to on the P1.
None of the other schemes I have seen can do all this.

Ross.
Could a simple utility not be made to read a tag in the object files included in your project which specifies their hub bandwidth requirements, and then generate a suitable table to use in your code? Of course it could. Heck, it might even show you how much "spare" bandwidth you have, which you can then assign to your "super cog".
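A minimal C sketch of what such a utility's core might look like, assuming a made-up tag format (cog number plus slots wanted out of 32); it spreads each object's slots evenly through the table and reports the spare slots left for the "super cog". The object names and tag scheme are invented for illustration.

```c
#include <stdio.h>

#define TABLE_SLOTS 32

/* Hypothetical per-object bandwidth tags, as they might be read out of
 * the object files in a project. */
struct obj { const char *name; int cog; int slots_wanted; };

int main(void)
{
    struct obj objs[] = {
        { "vga_driver", 0, 16 },
        { "sd_card",    1,  4 },
        { "serial",     2,  2 },
    };
    int n = sizeof objs / sizeof objs[0];
    int table[TABLE_SLOTS];
    int used = 0;

    for (int i = 0; i < TABLE_SLOTS; i++)
        table[i] = -1;                       /* -1 = spare slot */

    /* Spread each object's slots evenly through the table. */
    for (int o = 0; o < n; o++) {
        for (int k = 0; k < objs[o].slots_wanted; k++) {
            int pos = k * TABLE_SLOTS / objs[o].slots_wanted;
            while (table[pos] != -1)         /* simple collision bump */
                pos = (pos + 1) % TABLE_SLOTS;
            table[pos] = objs[o].cog;
            used++;
        }
    }
    printf("spare bandwidth: %d/32 slots for the super cog\n",
           TABLE_SLOTS - used);
    return 0;
}
```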
I'm curious why you say "because of a failing of the Prop". What "failing"?
P8x32a fails to be like other chips. That's all.
Relative Strengths?
1. Submicrosecond deterministic behaviour
2. Software defined peripherals are flexible
3. Video support
4. Only need one chip type for many applications
5. Easy multi-processing on up to 8 cores.
Relative Weaknesses?
1. Poor single thread of execution performance
2. No built-in ADCs
3. Not enough global memory space for expanding applications
4. Not enough pins
5. Software defined peripherals are slower than hardware solutions
6. Requires external support chips for USB, storage, and crystal
7. Relatively expensive
8. Forum contributors always argue ;-)
Weakness items being addressed by the P16X64A: 2, 3, 4, and maybe 1 if we're lucky.
Item 5 gets some help from a faster clock and fewer pipeline stages. Item 8? LOL.
Pairing would benefit, particularly given the pair can hand the baton back and forth. Signal COG gets it, and the other COG gets it during blanking and other key times.
Super COG won't benefit directly. Nice to have though.
Table based schemes? No thanks. I would not want to sort that out.
If we get a QUAD per HUB read, video will run just fine round robin, as its access for the signal is linear and easily parallelized. 4 * 12.5 = 50 MHz, which does a fine display. And the more limited waitvid means instructions will need to process what comes from the HUB, and that processing is as important as the overall throughput is.
If we do nothing and it's one long per HUB, then it will default to needing more than one signal COG to get super-high color resolution and/or depth. On this design, that would mean electrically connecting some DACs. Bummer. But workable. I think we will miss being able to have more than one COG driving a given DAC with WAITVID.
If we get Quads, then we will go higher in resolution and color depth before needing more than one signal COG. At an effective 50 MHz throughput, people will be happy.
Video signalling can be done in parallel; so can buffering, graphics draw, sprites, scan-line buffering, and color conversion - all of it works great in parallel. Getting it done with a few COGs will always work, and given how people like to just drop video objects in, that's my preference.
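For what it's worth, the 4 * 12.5 = 50 figure above can be checked with a little arithmetic, assuming a 200 MHz system clock and a 16-cog round robin (both of which are assumptions here, not confirmed specs):

```c
#include <stdio.h>

/* Back-of-envelope check of the quad-read figure: each cog sees a hub
 * window at sysclk/16 = 12.5 MHz; reading a quad (4 longs) per window
 * gives an effective 4 * 12.5 = 50 M longs/s of linear bandwidth. */
int main(void)
{
    double sysclk_mhz    = 200.0;  /* assumed system clock */
    int    cogs          = 16;     /* round-robin length */
    int    longs_per_hit = 4;      /* quad read per hub window */

    double window_mhz  = sysclk_mhz / cogs;
    double m_longs_per_s = window_mhz * longs_per_hit;

    printf("hub window rate : %.1f MHz per cog\n", window_mhz);
    printf("quad-read rate  : %.1f M longs/s\n", m_longs_per_s);
    return 0;
}
```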