... Now, there can be different instructions to do different video output streams. There's no longer a need to chain video commands, in other words. This means that we CAN have a 256-entry LUT by reading the pixels from hub, translating them via cog RAM into 32-bit patterns, and outputting them to the DACs. This simplifies video quite a bit.
There can now be all kinds of video output instructions that get the job done in a simple way:
VID32........32-bit hub-to-DAC mode at Fsys/N
VID16........16-bit hub-to-DAC mode at Fsys/N
VID8..........8-bit hub-to-LUT-to-DAC mode at Fsys/N
VID4..........4-bit hub-to-LUT-to-DAC mode at Fsys/N
VID2..........2-bit hub-to-LUT-to-DAC mode at Fsys/N
VID1..........1-bit hub-to-LUT-to-DAC mode at Fsys/N
Once these instructions are over, they can return the DAC states to whatever they were before, with a mapped DAC register holding the four 8-bit values. That way, horizontal syncs can be done with 'MOV DAC,dacstates' and 'WAIT clocks' instructions. This simplifies the video greatly. Because there is no decoupling, though, the cog will be busy while it generates the pixels.
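A minimal Python sketch of the byte-mode translation just described (hub pixel byte, through a 256-entry LUT in cog RAM, out to four 8-bit DAC values). The function name and the byte packing order within the LUT long are my assumptions, not the actual hardware:

```python
# Illustrative model of the VID8 path: each hub byte indexes a 256-entry
# LUT held in cog RAM, and each LUT entry is a 32-bit long packing four
# 8-bit DAC values (byte order within the long is an assumption here).

def vid8_translate(hub_bytes, lut):
    """Translate hub pixel bytes into (dac0..dac3) tuples via the LUT."""
    assert len(lut) == 256
    out = []
    for px in hub_bytes:
        pattern = lut[px] & 0xFFFFFFFF          # 32-bit LUT long
        out.append(tuple((pattern >> (8 * i)) & 0xFF for i in range(4)))
    return out
```

In the 32-bit and 16-bit modes the LUT step would simply be skipped and the hub data would feed the DACs directly.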
Sounding very good ( wait - Did you mean fSys/N or fSys ? )
I guess these can also map to Pins (skip the DAC ?)
To clarify, these are DMA-like opcodes that are 'launched' somehow and stream from HUB until a count, or stopped ?
( I would call the LUT ones VIDLUT8, VIDLUT4 etc )
Can you give a code snippet for a Video line ?
Re the LUT ones to Read from HUB and lookup via COG RAM - I guess there is a LUT pipeline so the first Video Pixel is one-clock delayed from the HubRead, and that read will be Nibble-LSB-Sync'd ?
If scan lines are not 16N, keeping time-sync across lines might get tricky ?
If using WAIT, the COG itself will be low power, but the DMA will be spinning, so Power will be DAC's + in-between COG power.
Q: Is there a WRLONG REG, PTR-- opcode paired with WRLONG REG, PTR++, or is it only PTR++ in this latest COG design ?
I mean Fsys divided by an integer (1..64).
Because these instructions would stall execution for their duration, the cog RAM can be used as a LUT. Practically, you would do ONE instruction for a whole visible scan line, unless you didn't mind stalled-pixel zones between instructions. I need to work out the details yet on how this would be coded.
The PTRA/PTRB expressions can be -- or ++, before or after. This video streaming would definitely be forward-only.
So, at 200MHz, you could output pixels at 200MHz, 100MHz, 66MHz, 50MHz, 40MHz, etc. Those seem a little choppy, but if you need another frequency, you could use a different crystal or change the master clock multiplier.
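The divider arithmetic behind those rates is easy to sanity-check. A sketch only: the 200MHz figure and the 1..64 divider range come from this discussion, while the function name is mine:

```python
FSYS = 200_000_000  # 200 MHz system clock, per the discussion

def pixel_clock(n):
    """Pixel clock for integer divider N under the Fsys/N scheme (N = 1..64)."""
    if not 1 <= n <= 64:
        raise ValueError("N must be 1..64")
    return FSYS / n
```

The "choppy" spacing falls directly out of the integer divider: there is nothing between Fsys/2 (100MHz) and Fsys/3 (~66.7MHz).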
Great - but how does it manage (eg) fSys/3, and still stay Rotator slot-sync'd ? or do you now have a Nibble Adder in there ?
Oh man! I was on a similar path, when you mentioned the xfer types above.
It's OK that the COG is busy, because any number of other ones can be writing scan lines / tiles / whatever to the HUB. That's actually quite nice.
This also means we can do color management of sorts too. The COG will need to do some computations, but once they are done, they are done for the palette modes. I was struggling with how that might happen before.
Preserving the whole scan line is a big win!
Will writing out a part of a scan line still be an option? (Yes, I have reasons for this.)
What do you mean by writing out part of a scan line?
The cog generating the video could reload its 'palette' in the vertical blank with a block-load instruction.
I mean changing modes. Say you want 2 bits for x part of the line, then you want 8 bits for part, then go back to 2 bits...
In the other video engine, and to some (tricky) degree on P1, it is easy to make a window that is different resolution / color depth from the rest of the screen.
One use would be a character driver that has a few bitmap regions for high color, perhaps different resolution, etc... Done to maximize RAM and or present data in a real-time way.
Yes, palette changes during VBLANK make perfect sense. HBLANK would be excellent, but no worries there. Might be possible, but not a focus IMHO.
Will writing out a part of a scan line still be an option? (Yes, I have reasons for this.)
Do you mean mixing Text and Graphics ? - I would think so, with the 'DAC default pixel' proviso Chip mentioned already. - but you'd probably want a small tramline anyway
Great - but how does it manage (eg) fSys/3, and still stay Rotator slot-sync'd ? or do you now have a Nibble Adder in there ?
I'm not sure yet. It might need to buffer a block first in 16 cycles, then transfer that to the shifter and load the next block. That would take a lot of flops, though. I need to map out the timing.
You said your nibble adder was good for odd-N values, right?
Do you mean mixing Text and Graphics ? - I would think so, with the 'DAC default pixel' proviso Chip mentioned already. - but you'd probably want a small tramline anyway
Yes that is one use. There are others. Since we are in software, display lists could be driving the display too.
I mean changing modes. Say you want 2 bits for x part of the line, then you want 8 bits for part, then go back to 2 bits...
In the other video engine, and to some (tricky) degree on P1, it is easy to make a window that is different resolution / color depth from the rest of the screen.
One use would be a character driver that has a few bitmap regions for high color, perhaps different resolution, etc... Done to maximize RAM and or present data in a real-time way.
Yes, palette changes during VBLANK make perfect sense. HBLANK would be excellent, but no worries there. Might be possible, but not a focus IMHO.
There would probably be time for a palette reload in hsync. It would take 256 clocks, or 1.28us. No problem.
About switching modes: It could be done, but there would be some dead time between video commands, where the pixel value would stay static. That would be tolerable in many cases, I think.
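The palette-reload arithmetic quoted above (256 clocks at the system clock) checks out; a sketch of the sums only:

```python
FSYS = 200_000_000     # 200 MHz system clock
RELOAD_CLOCKS = 256    # one clock per LUT long, per the block-load idea

# Time to reload a full 256-entry palette, in microseconds.
reload_us = RELOAD_CLOCKS / FSYS * 1e6

# A VGA-style hsync interval is several microseconds, so 1.28 us fits easily.
```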
We get a slightly longer pixel somewhere. No worries. Really, I guess the question is whether we can specify X pixels at Fsys/N. At the speeds we are talking, that's a little tiny dead spot. Or, a border.
1.28... Yeah. I forget the speed of things relative to the last FPGA
We get a slightly longer pixel somewhere. No worries. Really, I guess the question is whether we can specify X pixels at Fsys/N. At the speeds we are talking, that's a little tiny dead spot.
I think Chip said that 'dead spots' can have a Default value (at least in the DAC cases) - maybe also in the direct-to-pins case ?
Chip, I know this isn't your current focus, but what registers are left in the high cog area? OUTA, OUTB, DIRA, DIRB, presumably. A DACS register? PTRA & B?
Counter and indirect registers are gone, presumably.
Could we just get an FPGA with the new hub method out so we can begin some testing please?
I don't much care what else it has, or is missing. At least we could do some testing.
BTW I have my own reservations on the new hub scheme, and yet I love the BW it offers.
Anyway, you have done the Verilog, so why waste it while everyone argues over it.
Let's stop arguing and try it!
So the same goes for video, DACs, etc. Please wait until you get us an FPGA image to play with... please, please, please.
It's not all tied together yet and I don't know when it will be, exactly. Stuff just needs to come together a little more before I can make an FPGA image. As soon as I have one, though, you will have one.
Here is the Verilog code that makes the memory system
I've been trying to grok the timing of this, but I keep losing track. I think it's time to throw it into ModelSim. Do you happen to have a test script for this and would you be willing to share it?
I don't have a test script. One thing that might throw you off is the ENA signal. That is an active-low chip-wide RESET signal. It's only low during whole-chip reset. After that, it stays high. Just imagine that the 's' flops are initialized to the reset pattern, and thereafter they rotate by four bits. You could get rid of the ENA sensitivity everywhere but on the 's' flops.
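So the 's' flops can be pictured as a pattern that simply rotates by four bits every clock. A tiny model of that behavior (the 64-bit width, rotate direction, and starting pattern are my assumptions, not read from the Verilog):

```python
def rotate4(s, width=64):
    """Rotate a width-bit pattern left by four bits, one step per clock."""
    mask = (1 << width) - 1
    return ((s << 4) | (s >> (width - 4))) & mask

# After width/4 clocks (16 for a 64-bit pattern), the pattern is back
# where it started, which is what makes the hub slot rotation periodic.
```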
Where do the 1..64 come from? Is it an instruction parameter? Something to do with the DACs?
It will be a parameter used by the video instructions to divide the clock by, in order to get the pixel clock. I think 4 bits will be adequate, actually, for 1..16 values.
Is 'N' a global across the entire chip, or can each cog have a different 'N'?
Ariba (here) mentioned a possible way to store the NTSC chroma as a sequence of precalculated samples output to the DACs. I.e. a 'color' would be encoded by embedding the appropriate carrier into the DAC samples themselves, and outputting at the Fsys rate, but rotating through the sequence of DAC samples.
If you always keep a 256-element LUT, then in 16-color mode, you could afford a sequence of 16 samples, for instance.
Any chance hub execution will appear in this new chip or has that idea been abandoned?
It is going to get its video data directly from Hub
Fantastic, but doesn't that mean that the cog will not be able to use hub while this is happening?
BTW I am fine with this. I would even be happy to sacrifice the cog totally for this (we have 16 of them).
I think HUBEXEC may currently be in the too hard basket for the time being.
But, when implemented, all we require to start trying it is this...
1. PC counter to be expanded to 17 bits (long address)
2. CALLH @/#nnnnnnnnnnnnnnnnn [WC,WZ] 'return address stored in a fixed location (say $1EF for now)
3. RETH [WC,WZ] 'jmp indirectly via the fixed location (say $1EF for now)
4. It would be nice to have a JMPH version of the CALLH (does not store the return address) but we could get by with the above for testing.
5. When fetching instructions from Hub, no caching to be used - just wait for hub cycle and load the contents directly to the ALU. Again, all we need for testing.
I am trying to keep this absolutely as simple as possible, so that Chip can implement this easily.
Any advances (if any) can be done later. At least we can then begin testing.
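As a sanity check of the proposed semantics, steps 2 and 3 above can be modeled in a few lines of Python. This is entirely hypothetical; $1EF is just the placeholder location from the list, and the cog is modeled as a plain dictionary:

```python
RET_SLOT = 0x1EF  # fixed cog location for the return address (placeholder)

def callh(cog, pc, target):
    """CALLH: save the return address at $1EF, then jump to target."""
    cog[RET_SLOT] = pc + 1   # return address is the next instruction
    return target            # new PC

def reth(cog):
    """RETH: jump indirectly via the saved return address."""
    return cog[RET_SLOT]
```

A JMPH (step 4) would be the same as callh minus the store into $1EF.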
If you always keep a 256-element LUT, then in 16-color mode, you could afford a sequence of 16 samples, for instance.
Good idea. 16x oversampling for NTSC would be ~57MHz. It would be easy to make a one-DAC byte output mode, as well, in the video shifter. There could even be a mode where the byte output is looked up from 4 longs, giving 16 discrete byte values, with wrapping inside the 4 longs. That way, pixels could be any duration. Actually, the initial pixels could be bytes, where 4 bits determine which 4-long group and the other 4 bits determine initial offset within those 16 bytes. Ah, maybe that's getting too complex.
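A rough sketch of what one such LUT sequence could look like: 16 DAC samples carrying one cycle of the chroma carrier around a luma level, to be replayed at 16x the carrier rate (~57MHz). The function, scaling, and ranges are purely illustrative, not NTSC-accurate encoding:

```python
import math

def chroma_samples(luma, saturation, hue_deg, n=16):
    """Precompute n DAC samples embedding one carrier cycle (illustrative).

    Hue maps to the carrier phase, saturation to its amplitude, and the
    result is clamped to an 8-bit DAC range.
    """
    samples = []
    for i in range(n):
        phase = 2 * math.pi * i / n + math.radians(hue_deg)
        v = luma + saturation * math.sin(phase)
        samples.append(max(0, min(255, round(v))))
    return samples
```

With saturation 0 the sequence degenerates to a flat luma level, i.e. a gray; rotating through the table at the oversampled rate reproduces the carrier for colored pixels.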
Comments
What is "N" in the "Fsys/N"?
1, 2, 3,... 64
Fsys is the system clock frequency (200MHz).
You said your nibble adder was good for odd-N values, right?
Yes, but I think I may have just solved the Even Values issue too - just compiling now...
Counter and indirect registers are gone, presumably.
You got it pretty straight.
1F7 = DACS?
1F8 = PTRA
1F9 = PTRB
1FA = INA
1FB = INB
1FC = OUTA
1FD = OUTB
1FE = DIRA
1FF = DIRB