Understanding WAITVID

Vega256 · 2011-07-21 06:46

ericball wrote: »

PixelClocks is VSCL[19..12]. However, because the TV PLLA frequency and line period are fixed, the number of PLLA per pixel is inversely proportional to the horizontal resolution.

With the normal TV 3640 PLLA per line, 2987 PLLA correspond to the "square pixel" 640x480 (240 non-interlaced) frame. So 320 at 9 PLLA per pixel will be fairly close.

Just for reference: square pixel NTSC is 12.272727MHz or 780 pixels per line (640 active). The NES has a 5.369MHz pixel clock (3/2 * colorburst) for 341 pixels per line (256 active).
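
The figures quoted above can be checked with a little arithmetic. This is a minimal Python sketch, assuming the standard NTSC relationships (colorburst = 315/88 MHz, 227.5 colorburst cycles per line); the constant and function names are made up for illustration and do not come from any driver.

```python
from fractions import Fraction

# Assumed NTSC constants (not taken from any driver source).
COLORBURST_HZ = Fraction(315_000_000, 88)        # 3.579545... MHz
PLLA_HZ = 16 * COLORBURST_HZ                     # 57.272727... MHz
PLLA_PER_LINE = int(16 * Fraction(455, 2))       # 227.5 colorburst cycles x 16

def plla_per_pixel(pixel_clock_hz):
    """PLLA clocks that elapse during one pixel at the given pixel clock."""
    return PLLA_HZ / pixel_clock_hz

SQUARE_PIXEL_HZ = Fraction(135_000_000, 11)      # 12.272727... MHz
NES_PIXEL_HZ = Fraction(3, 2) * COLORBURST_HZ    # 5.369... MHz

square = plla_per_pixel(SQUARE_PIXEL_HZ)
nes = plla_per_pixel(NES_PIXEL_HZ)

print("PLLA per line:", PLLA_PER_LINE)
print("PLLA per square pixel:", float(square))
print("PLLA for 640 square pixels:", round(640 * square))
print("PLLA per NES pixel:", float(nes))
print("PLLA for 320 pixels at 9 PLLA each:", 320 * 9)
```

Because PixelClocks is an integer, only whole numbers of PLLA per pixel are available, which is why 320 pixels at 9 PLLA each is "fairly close" to the square-pixel width rather than exact.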

I was reviewing the manual and the VSCL register. PixelClocks is counted in PLLA clocks: it sets how many PLLA clocks elapse before the video generator shifts out the next pixel. My question is, what determines the PLL frequency? I am sure that I am overlooking something.

ericball · 2011-07-21 07:19

The PLL frequency is determined by FRQA and CTRA. However, since the video generator uses a 16 step shift register to generate color, the PLLA frequency is always 16 times the colorburst frequency, i.e. for NTSC 3,579,545.45 Hz x 16 = 57,272,727 Hz.
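
As a rough illustration of how FRQA relates to that frequency, here is a minimal Python sketch. It assumes an 80 MHz system clock, the usual counter NCO relation (f = FRQ x clkfreq / 2^32), and a CTRA PLL setting that delivers 16 times the NCO frequency as PLLA. The function names are hypothetical, and the CTRA bit pattern itself is not shown.

```python
from fractions import Fraction

CLKFREQ_HZ = 80_000_000                        # assumed system clock
COLORBURST_HZ = Fraction(315_000_000, 88)      # NTSC colorburst, 3.579545... MHz

def frq_for_nco(nco_hz, clkfreq_hz=CLKFREQ_HZ):
    """Nearest FRQx value for a desired NCO frequency."""
    return round(Fraction(nco_hz) * 2**32 / clkfreq_hz)

def plla_hz(frq, clkfreq_hz=CLKFREQ_HZ, pll_multiplier=16):
    """PLLA frequency, assuming the PLL output is pll_multiplier x the NCO."""
    return pll_multiplier * Fraction(frq * clkfreq_hz, 2**32)

frqa = frq_for_nco(COLORBURST_HZ)
print("FRQA:", frqa)
print("PLLA (Hz):", float(plla_hz(frqa)))
```

FRQA can only take integer values, so the result lands within a fraction of a hertz of the target rather than exactly on it.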

Vega256 · 2011-07-21 07:26

So then some of the values and frequencies in these graphics drivers are based upon constant NTSC values and others depend on the desired resolution. What parts of the driver are the parts that can be edited based on desired features?

potatohead · 2011-07-21 09:11

Generally speaking, you can fairly easily modify the active graphics area.

If you modify the core timing of the driver, which is based on the colorburst, you are going to have to edit / rewrite the signal part of the driver, because all the signal pulses are built on that timing reference.

Once you've entered the scan-line, you have some options there. The PLLA per scan line is determined by the base timing of the driver.

The scan line is both the porches (overscan) and the active area (graphics).

For the driver you are working with, Eric cited the PLLA / scan line. That's what you've got to work with. If you are looking for a specific PLLA / pixel, you need to calculate the number of pixels, as I did above, subtract that from the total PLLA, then size the porches from what is left over.

Essentially, the scan line timing is fixed, so more PLLA / pixel will equal fatter pixels, consuming more of the screen, leaving smaller borders.
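
The budgeting described above can be sketched in a few lines of Python. This assumes the 3640 PLLA per line that Eric cited; the function name is hypothetical, and the leftover figure has to cover everything outside the active area (sync, colorburst, and both borders), not just the borders.

```python
PLLA_PER_LINE = 3640   # NTSC figure cited earlier in the thread

def line_budget(pixels, plla_per_pixel, total=PLLA_PER_LINE):
    """Return (active PLLA, leftover PLLA) for one scan line."""
    active = pixels * plla_per_pixel
    if active > total:
        raise ValueError("active area does not fit in one scan line")
    return active, total - active

for pixels, per_pixel in ((320, 9), (256, 10), (256, 11)):
    active, leftover = line_budget(pixels, per_pixel)
    print(pixels, "pixels at", per_pixel, "PLLA:", active, "active,", leftover, "left over")
```

Raising the PLLA per pixel at a fixed pixel count makes the active number grow and the leftover shrink, which is the "fatter pixels, smaller borders" trade-off.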

Vega256 · 2011-07-21 09:36

Gotcha. Well, it's off to work with me. I'll see how close I can get to a high resolution with no artifacts, tearing, or banding. I may just be able to bend the code a bit on the DK driver, but like you said, the PLLA is based on the core timing of the driver, so, I should look to change everything else based on that frequency.

Vega256 · 2011-07-22 11:41

How do tile/sprite engines keep up with the TV drivers that render one line at a time? It would seem to me that the TV driver requests lines faster than the render cog can keep the scanline buffer filled with the right data.

potatohead · 2011-07-22 13:49

There are a few methods.

The most important thing is to select data formats for the sprites that align well with the Propeller. The 4x8 size is great for full-color sprites: each row is only 4 pixels, so the maximum amount of data gets transferred per hub operation, and bit masking only needs to happen on byte boundaries, which keeps the number of masks and shifts to a minimum. Other sizes might make sense, depending on the buffer format used by the signal, or TV, COG.

Maybe that buffer stores color, pixels, color, pixels, color, pixels, etc.... Or maybe it's pixels, pixels, pixels, pixels in one linear buffer, and colors, colors, colors on another one. Could do fixed color sets too, just storing pixels only, or in the case of the full color option, pixels are colors, requiring only one buffer.

Each series of shift, add, move, and mask operations consumes scan line time. The more there are, the fewer sprites are possible per scan line.
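
To make the byte-boundary masking concrete, here is a minimal Python sketch of overlaying one long of full-color sprite data (4 pixels, one byte each) onto one long of the scan line buffer. The transparent color value of $00 and the function name are assumptions for illustration; a real driver would do this with a handful of PASM mask and shift instructions.

```python
TRANSPARENT = 0x00   # hypothetical "see-through" color value

def overlay_long(background, sprite, transparent=TRANSPARENT):
    """Combine two 32-bit longs of 8-bit pixels, byte by byte."""
    result = 0
    for shift in (0, 8, 16, 24):
        sprite_pixel = (sprite >> shift) & 0xFF
        background_pixel = (background >> shift) & 0xFF
        pixel = background_pixel if sprite_pixel == transparent else sprite_pixel
        result |= pixel << shift
    return result

print(hex(overlay_long(0x02020202, 0x00AC5C00)))
```

Every pixel sits on a byte boundary, so only four fixed masks are ever needed per long. That is what keeps the per-sprite cost low.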

How the sprite data is stored impacts this as well. Sprite lists need data alignments much like color and pixel data does, and they might need sorting too, depending on what was done.

The other technique is to use rotating buffers. With a single buffer, one really only gets part of a scan line to render with. The rest is spent syncing with the signal COG. It's also not possible to utilize multiple render COGS very easily with a single buffer, because timing gets in the way. A fair amount of logic is required to make sure things happen in the right order on a single buffer, limiting everything overall.

If there are multiple buffers, a lot of things change. The signal COG can be writing one buffer to the screen, while render COGS can attack the other buffers, often rolling over to the next one when they finish early. A few buffers can yield multiple scan lines of time to get sprites placed, at the cost of more render COGS, and RAM.

There is enough time to do a lot, and actually get tens of sprites per line running nicely with transparency. The real key is thinking through the ops needed, factoring out duplicates and structuring data so that logic is kept minimal and as consistent as possible.

It's worth stepping through one to understand the buffer scheme used, how the sprites are organized in RAM, and what the ops are, and when they happen and how the COGS synchronize.

There are differences in sweep frequencies too. TV is the slowest, so the most can be done. VGA is faster, limiting sprite / line numbers.

In general, full color sprites top out at somewhere around 256 pixels, depending on the speed of the Propeller. 4 color data can be drawn at a higher resolution. 16 color data is about as tough as full color data is, just because it's not native to the prop, requiring some table lookups and such that are not needed with full color (8 bits per pixel), and the 2 / 4 color data options.

Vega256 · 2011-07-22 14:15

Yeah, I am planning on using 8 bits per pixel, the full color option. My problem, I suppose, is the timing. It is my understanding that the TV driver just "draws" but the sprite/tile engine gives the TV driver something to draw in a particular format. Going back to the DK driver I am using, one of the parameters of the TV driver is the address of a variable. In this variable, the driver will write the number of the line that is next to be drawn. Using only the number of the next line, how exactly can my tile/sprite driver be on cue to deliver the next stream of pixels?

potatohead · 2011-07-22 14:37

You have to ensure that the TV COG writes its state to the HUB. If it does that at a very consistent time, say when the scan line is finished or when it begins a scan line, the render COG then either pre-renders the first necessary scan line and moves to a wait loop, so that it knows when it can move to the next one, or waits for a signal to begin rendering the first necessary scan line.

It's either pre-render and wait, or wait, then render.

Let's say it's pre-render, then wait. It would look like this:

Start TV COG, Start Render COG.

TV COG writes its current scan line to HUB, just before drawing that scan line. A great time is after the end of the graphics area, in the back porch or the right overscan border.

TV COG does this every scan line, and that's all it does.

Render COG fills the buffer with the first scan line full of data.

Render COG reads scan line from HUB

If it's less than the first displayed line, keep waiting.

If it's the first displayed line, then increment the line check, and begin to fill the buffer again.

Loop back to the waiting process.

That's a single buffer scenario, where the render COG is literally rendering just ahead of the TV COG.
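
The steps above can be modeled as a small, purely sequential Python simulation. It is only a sketch of the ordering, not real driver code: the line numbers are made up, render() is a stand-in for the real renderer, and the model refills the buffer only after the TV cog has drawn it, whereas on real hardware the two cogs run concurrently and the refill races the beam.

```python
FIRST_VISIBLE = 3      # hypothetical first displayed scan line
LAST_VISIBLE = 6       # hypothetical last displayed scan line
LINES_PER_FRAME = 10   # shortened frame, for illustration only

def render(line):
    """Stand-in for the real tile/sprite renderer."""
    return f"pixels for line {line}"

def run_frame():
    buffer = render(FIRST_VISIBLE)     # pre-render before the beam arrives
    waiting_for = FIRST_VISIBLE        # the render cog's "line check"
    shown = []
    for beam_line in range(LINES_PER_FRAME):
        hub_line = beam_line           # TV cog publishes its line to the hub
        if FIRST_VISIBLE <= beam_line <= LAST_VISIBLE:
            shown.append((beam_line, buffer))   # TV cog draws the buffer
        # Render cog polls the hub value and acts only on the line it awaits.
        if hub_line == waiting_for and waiting_for < LAST_VISIBLE:
            waiting_for += 1
            buffer = render(waiting_for)        # refill for the next line
    return shown

for line, pixels in run_frame():
    print(line, pixels)
```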

A double buffer scenario is about the same, only the render COG fills buffers, and then tells the TV COG which buffer to draw from.

A dual buffer scenario can be found in my current Potatotext 2 driver. There is a single buffer scenario in the older one in the OBEX, and a multi-buffer scheme can be found in the Tile + Sprite driver linked in my blog. I don't know how the DK driver buffer scheme works. Never looked at that one very closely.

If I were you, I would write my sprite code, count cycles and figure out how long it takes. Then choose a buffer scheme, then connect the two cogs together rendering simple test data, and when all of that works, add in the sprites.

Edit: It is important for one COG to be in charge of each part of the process. The TV cog is in charge of communicating the scan line and the frame state. That means two variables: display state (blanking or visible) for one, and scan line / buffer address for the other.

One scheme I like is for the render COG to put a buffer address in for the TV cog to render, and to have the TV COG clear that address, when it's done. The render COG can just write the address, go off and fill another scan line in a multi-buffer scenario, then come back to watch for when it's cleared. When it is cleared, it writes a new address for the TV COG, and repeats.
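
A minimal Python model of that address hand-off might look like the following. The Mailbox class stands in for a single shared hub long, and the buffer addresses are hypothetical. Each "step" function represents one pass through that cog's loop.

```python
from collections import deque

class Mailbox:
    """One hub long shared by both cogs; 0 means "empty"."""
    def __init__(self):
        self.address = 0

def render_cog_step(mailbox, ready):
    """Post the next filled buffer, but only once the TV cog has cleared the long."""
    if mailbox.address == 0 and ready:
        mailbox.address = ready.popleft()

def tv_cog_step(mailbox, drawn):
    """Draw from the posted buffer, then clear the long to signal completion."""
    if mailbox.address != 0:
        drawn.append(mailbox.address)
        mailbox.address = 0

mailbox = Mailbox()
ready = deque([0x1000, 0x1100, 0x1200])   # hypothetical filled buffer addresses
drawn = []
for _ in range(4):
    render_cog_step(mailbox, ready)
    tv_cog_step(mailbox, drawn)

print([hex(address) for address in drawn])
```

Only one cog ever writes a non-zero value and only the other ever writes zero, so neither side needs a lock.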

Maybe have the TV cog update the scan line variable, or keep it in a counter in one or both COGs too.

At the higher level, your SPIN program, or attached computer, will need to know the display state, blanking or not, so that it knows when it can draw to it, or not.

Vega256 · 2011-07-22 15:47

Let's suppose I pre-render, then wait. The render cog prepares data for line 0, but by the time the buffer is ready, the TV is already on line 2, 3, or even further away than that. What do I do then, wait until the beam wraps around?

In short, can preparation of the buffer take too long?

potatohead · 2011-07-22 16:48

Nope.

You've got to get it all done before the beam gets there.

Or... Use a buffered display, such as a bitmap, or tiles where the rendering of things can be decoupled from the scan line drawing of the display.

If your sprite code is taking a few scan lines, consider multiple buffers and have COGS render to them concurrently. Or... Simplify the sprite render code, optimizing away excessive hub access cycles, extra instructions, and combine ops where possible. Hub access windows are 2, 6, 10 instructions. Out of order processing of sprite data can help with the HUB access window delays, as can sizing data to take best advantage of the HUB transfer times.

With a single buffer, you have to get it done in less than a scan line. With a double buffer, you can get it done in a scan line, and you've got the possibility to have some slop in the system, like the occasional sprite set that takes a bit longer than a scan line. Because the buffers are latched to the display, a little over-run will just appear as a missing pixel, or sprite element, or might not even be seen, depending on where the beam is when it happens.

Multiple buffers can extend the draw time to the number of scan lines used to buffer, because COGS can operate round robin on the buffers. That is what is done in the Tile & Sprite driver in my blog. Jim Bagley came up with that buffer scheme, and it's fast and fails nicely.

Go and read through that driver, understand how the buffers work, and look at the sprite code. It's flat out brilliant. Well worth however much time it takes. Jim knows his prop stuff cold.

Basically, with two or more buffers, you end up rendering several scans in advance of the beam. With a single buffer, you are literally rendering just ahead of the beam. The closer the render happens to where the beam is, the less variance the process can tolerate.

The maximum variance happens when the entire display is buffered, like a bitmap, or tiles.

Cut your sprite renderer back to say, just a coupla sprites. Then get it rendering perfectly, and position test sprites at all the screen extents. When that works, then add sprites to see where failure happens.

Then make decisions. Either add COGS, buffers, or change the buffer latches so that failure doesn't corrupt the display. The best failures are where the sprite just isn't visible on a timing intense scan line and nothing else happens. The worst failures are where the screen timing breaks waiting on a buffer, glitching the display, potentially losing sync. Always let the signal drive everything, and if something is taking too long, have it check to see if it needs to exit to move on, so that the maximum number of display elements are rendered correctly at the tighter timing conditions.

Doesn't hurt to clock your prop up to 100 MHz either, though you really should make every effort to get it running at 80 MHz. That's what everybody can be assumed to have.

ericball · 2011-07-23 08:32

Tile drivers are full screen drivers - the data for the entire screen is stored in memory. So there are no synchronization issues. Tile drivers also have the advantage that they can output 16 pixels per tile width, which gives either higher resolution (not really important for TV due to artifact / modulation restrictions) or more time for work per WAITVID. The limitation is only 4 colors per tile. Bitmap drivers (byte per pixel) don't have this restriction but the size of the bitmap is limited by available HUB RAM. Line drivers avoid this limitation but require one or more "render cogs" to turn the tile/sprite information into pixels. (Tile drivers effectively do this "on the fly" as they have enough WAITVID time.)

The number of line buffers required by a line driver depends on the time required for a render cog to create the buffer. If the render can be done during HSync (~11.4usec) then only one buffer is required, otherwise take the time required to render the line, divide by 63.555usec (rounded up) and that's the number of render cogs you will need with one buffer per render cog plus one for output.
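
That rule of thumb translates directly into a few lines of Python. This sketch uses the two timing figures quoted above; the function name and return format are made up for illustration.

```python
import math

LINE_US = 63.555    # NTSC scan line period, as quoted above
HSYNC_US = 11.4     # approximate HSync time, as quoted above

def line_driver_resources(render_us):
    """Return (render cogs, line buffers) for a given per-line render time."""
    if render_us <= HSYNC_US:
        return 1, 1                      # render fits inside HSync: one buffer
    cogs = math.ceil(render_us / LINE_US)
    return cogs, cogs + 1                # one buffer per render cog, plus one for output

for render_us in (5.0, 60.0, 150.0):
    print(render_us, "us per line ->", line_driver_resources(render_us))
```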

Just for comparison, my sprite driver used up to 5 cogs to render 140+ 8x8 sprites. Each cog rendered to a 240 pixel buffer in cog RAM, then output in sequence.

potatohead · 2011-07-23 08:45

Actually, you can do a render that is longer than a HSync, but less than a scan line, if you trigger the render to happen right at the start of the front porch, or right border / overscan. The renderer will race ahead of the beam, using the border + HSync time as "buffer" time to complete operations. As long as the render does not suffer a slow down greater than that time, the display will appear un-corrupted. Did that in one of my earlier drivers successfully.

Vega256 · 2011-07-23 08:53

Perhaps I am over-complicating things in filling my scanline buffer. The concept behind my current code is this.

The TV driver is setup for a 256x240 screen resolution. I will manage each tile as an 8x8.

(256 / 8) = 32 tiles horizontally
(240 / 8) = 30 tiles vertically

32 x 30 = 960 tiles

There is a tile table in main RAM, stored as an array, that is 960 bytes large. Each byte represents one of the 960 tile positions on-screen, so table [0] is the first tile in the first row (upper-leftmost tile), table [31] is the last tile in the first row (upper-rightmost tile), and so on up to the last tile in the last row. The value of each byte references one of the 256 tiles defined in main RAM, so table [0] := 0 means that the first tile should be the first block defined in RAM, etc. Of course, arriving at an address from only a table entry requires some math. In summary,

- Render cog calculates address to move tile data from based on retrieved table entry
- Render cog moves the tile data from the address calculated into the scanline buffer
- Repeat

You see, for every 8 pixels moved into the buffer, the render cog accesses the tile table again at the next index.
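
The address math in this design can be written out as a minimal Python sketch, assuming 8x8 tiles at one byte per pixel (64 bytes per tile, 8 bytes per tile row) and a 32-column table. The function names and the example base address are hypothetical. Note that the scan line number has to be split in two: the tile row (line // 8) selects the table row, and the line within the tile (line % 8) selects the row of pixels inside the tile.

```python
TILES_PER_ROW = 32
TILE_HEIGHT = 8
BYTES_PER_TILE_ROW = 8                               # 8 pixels, one byte each
BYTES_PER_TILE = TILE_HEIGHT * BYTES_PER_TILE_ROW    # 64

def table_index(scan_line, column):
    """Index into the 960-byte tile table for this scan line and tile column."""
    return (scan_line // TILE_HEIGHT) * TILES_PER_ROW + column

def tile_row_address(tile_base, entry, scan_line):
    """Hub address of the 8 pixels this scan line needs from tile `entry`."""
    line_in_tile = scan_line % TILE_HEIGHT
    return tile_base + entry * BYTES_PER_TILE + line_in_tile * BYTES_PER_TILE_ROW

TILE_BASE = 0x4000   # hypothetical address of the first tile definition
print(table_index(0, 31), table_index(8, 0), table_index(239, 31))
print(hex(tile_row_address(TILE_BASE, 1, 9)))
```

Both line-dependent terms are constant for a whole scan line, so they can be computed once per line rather than once per tile.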

What are your guys' opinion on this design?

potatohead · 2011-07-23 09:10

That kind of screen should be possible with one render COG, and may be possible in one COG period. If it's multi-cog, two buffers are all that is needed. One would work, but I don't recommend that, simply because it becomes difficult to add COGS for sprites and such.

I resolved that problem by incorporating a scan line counter and a tile line counter (0-7).

Prior to entering the scan line, the base address of the tile table is added to the tile line counter, pre-computing the offset into the tile table needed for each vertical line in the tiles.

During the scan line, the render cog reads a byte of "screen" memory, from the tile table, multiplies it by 8, adds it to the pre-computed base address + offset above, then reads from the tile pixel data in the HUB, either writing it to the waitvid directly if a single cog design is in play, or to a buffer, if a multi-cog scenario is in play.

The key is to have the render COG only computing what is needed for the tile specifically, spending most of its time fetching pixel data, and writing it to the scan line buffer.

Chances are, you are doing too many ops in your render loop, and/or are bumping into the HUB access windows, slowing things down more than would be necessary.

Also, doing full color tiles is timing tight at 80 MHz. I think I was able to get 320 pixels at 80 MHz, doing nothing but tiles, though that might have been at 96, or 100. Either way, that pixel resolution, with only 4 pixels per waitvid, requires fairly tight code, or multiple COGS, because the waitvid loops will be short, and the number of HUB operations high for the number of pixels desired.

Maybe you should post up some code. The discussion from here would be a lot easier.

I see you are using the full field, 240 lines vertically. Are you also keeping the overscan small as well, such that the 256 pixels nearly fills the frame? If so, most displays won't display all the tiles, although most small LCD displays will, as will PC capture cards. Most anything that is actually built as a TV, for consumers, will have portions of the display hidden, both vertically and horizontally. Of course, you could just carve out a border too, blanking a few tiles all around.

The design is fine. That's how most tile displays have been written. The other alternative, built by Chip, is to store tile addresses in the tile table, along with color index data. That's how the Parallax tile drivers work, and they allow palettes of 4 colors per tile, and many pixels per tile. Clever actually.

What you've done is a nice 8x8, which is efficient for a lot of reasons. Should work well, if you get your render code tuned to beat the beam.

Edit: Make sure you are doing the minimum HUB ops too. When you fetch from the screen, that's a byte per tile. You could fetch all four, and make a longer loop that does 4 tiles at once, shaving off 3 HUB operations per 4 tiles. When fetching pixel data, that's a long too, don't get each byte. Two fetches required per tile, 4 pixels per fetch, one long.
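
As a sketch of the "fetch all four" idea: one long read from the tile table holds four consecutive tile entries, which can then be split apart in registers. The Python below assumes the Propeller's little-endian byte order (lowest address in the least significant byte) and a long-aligned table address. The function names are hypothetical.

```python
def split_long(value):
    """Split one 32-bit long into its four bytes, lowest address first."""
    return [(value >> shift) & 0xFF for shift in (0, 8, 16, 24)]

def hub_ops_per_four_tiles(fetch_table_as_long):
    """Hub reads for four 8x8 full-color tiles on one scan line."""
    table_reads = 1 if fetch_table_as_long else 4
    pixel_reads = 4 * 2                  # two longs of pixels per tile
    return table_reads + pixel_reads

# Four table entries 0, 1, 2, 3 as they would sit in consecutive hub bytes.
packed = int.from_bytes(bytes([0, 1, 2, 3]), "little")
print(split_long(packed))
print(hub_ops_per_four_tiles(False), "->", hub_ops_per_four_tiles(True))
```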

Vega256 · 2011-07-23 09:22

I was working on the code the other day, so, it is a bit torn apart but the core routines are the same.

con

  _clkmode = xtal1 + pll16x
  _xinfreq = 5_000_000

  paramcount = 2

  '80,000,000 = 1 second
  '80 = 1 microsecond
  '8 = 0.1 microsecond

  'Refresh Rate = 16666.666666667 microseconds
  
var

  long nextLineAddress
  long scanlineBufferAddress
  long tileDefAddress
  long tableAddress

  long nextLine
  long scanlineBuffer [64]
  byte table [960]

obj

  tv    :       "dk_tv_drv"
    
pub main

  nextLineAddress := @nextLine
  scanlineBufferAddress := @scanlineBuffer
  tileDefAddress := @tile0
  tableAddress := @table

  'longfill (@scanlineBuffer, $02020202, 64)          'Clear scanline buffer to black

  bytefill (@table, 7, 960)
  table [0] := 0
  table [1] := 1
  table [2] := 2
  table [3] := 3
  table [4] := 4
  table [5] := 5
  table [6] := 6 

  tv.start (@nextLineAddress)
  cognew (@start, @nextLineAddress)
  'waitcnt ((clkfreq * 2) + cnt)                                                                                 

dat


{
white           $07
lightGrey       $05
grey            $04
black           $02

red             $5C
orange          $6C
yellow          $8C
green           $AC
blue            $0C
purple          $2C
magenta         $3C
}


tile0   long  $07070707, $07070707
        long  $07070707, $07070707
        long  $07070707, $07070707
        long  $07070707, $07070707
        long  $07070707, $07070707
        long  $07070707, $07070707
        long  $07070707, $07070707
        long  $07070707, $07070707

        long  $0c0c0c0c, $0c0c0c0c
        long  $0c0c0c0c, $0c0c0c0c
        long  $0c0c0c0c, $0c0c0c0c
        long  $0c0c0c0c, $0c0c0c0c
        long  $0c0c0c0c, $0c0c0c0c
        long  $0c0c0c0c, $0c0c0c0c
        long  $0c0c0c0c, $0c0c0c0c
        long  $0c0c0c0c, $0c0c0c0c

        long  $05050505, $05050505
        long  $05050505, $05050505
        long  $05050505, $05050505
        long  $05050505, $05050505
        long  $05050505, $05050505
        long  $05050505, $05050505
        long  $05050505, $05050505
        long  $05050505, $05050505

        long  $acacacac, $acacacac
        long  $acacacac, $acacacac
        long  $acacacac, $acacacac
        long  $acacacac, $acacacac
        long  $acacacac, $acacacac
        long  $acacacac, $acacacac
        long  $acacacac, $acacacac
        long  $acacacac, $acacacac

        long  $04040404, $04040404
        long  $04040404, $04040404
        long  $04040404, $04040404
        long  $04040404, $04040404
        long  $04040404, $04040404
        long  $04040404, $04040404
        long  $04040404, $04040404
        long  $04040404, $04040404

        long  $5c5c5c5c, $5c5c5c5c
        long  $5c5c5c5c, $5c5c5c5c
        long  $5c5c5c5c, $5c5c5c5c
        long  $5c5c5c5c, $5c5c5c5c
        long  $5c5c5c5c, $5c5c5c5c
        long  $5c5c5c5c, $5c5c5c5c
        long  $5c5c5c5c, $5c5c5c5c
        long  $5c5c5c5c, $5c5c5c5c

        long  $2c2c2c2c, $2c2c2c2c
        long  $2c2c2c2c, $2c2c2c2c
        long  $2c2c2c2c, $2c2c2c2c
        long  $2c2c2c2c, $2c2c2c2c
        long  $2c2c2c2c, $2c2c2c2c
        long  $2c2c2c2c, $2c2c2c2c
        long  $2c2c2c2c, $2c2c2c2c
        long  $2c2c2c2c, $2c2c2c2c

        long  $02020202, $02020202
        long  $02020202, $02020202
        long  $02020202, $02020202
        long  $02020202, $02020202
        long  $02020202, $02020202
        long  $02020202, $02020202
        long  $02020202, $02020202
        long  $02020202, $02020202        

dat


org 0



                        'Load parameters (four consecutive longs starting at PAR)
start                   mov nextLineAddr, par          'Address of the next-line variable
                        rdlong nextLineAddr, nextLineAddr
                        mov scanlineBufferAddr, par    'Address of the scanline buffer
                        add scanlineBufferAddr, #4
                        rdlong scanlineBufferAddr, scanlineBufferAddr
                        mov baseTileDefAddr, par       'Address of the first tile definition
                        add baseTileDefAddr, #8
                        rdlong baseTileDefAddr, baseTileDefAddr
                        mov tableAddr, par             'Address of the tile table
                        add tableAddr, #12
                        rdlong tableAddr, tableAddr

                        rdlong lineNum, nextLineAddr
                        mov baseIndex, #0
                        mov maxIndex, #32
                        mov lineIndex, lineNum
                        rol lineIndex, #3
                        

                        

                        
loadNextLine            mov tableIndex, baseIndex                                   'Initialize the tile table index
                        mov scanlineIndex, scanlineBufferAddr
loadNextEntry           mov tileEntryAddr, tableAddr
                        add tileEntryAddr, tableIndex
                        rdbyte tileTableEntry, tileEntryAddr                    'Read in an entry from the tile table 
                        rol tileTableEntry, #6                                  'Multiply table entry by 64 (bytes per 8x8 tile)

                        mov tileAddr, baseTileDefAddr
                        add tileAddr, tileTableEntry    'Obtain the base address of the tile referenced
                        add tileAddr, lineIndex


loadTile                rdlong tileData, tileAddr
                        wrlong tileData, scanlineIndex
                        add tileAddr, #4
                        add scanlineIndex, #4
                        djnz tileSectorCounter, #loadTile

                        mov tileSectorCounter, #2

                        add tableIndex, #1

                        'rdlong lineNum, nextLineAddr
                        'cmp lineNum, #1 wz
              'if_nz    jmp #loadNextEntry

                        cmp tableIndex, maxIndex wz
              if_nz     jmp #loadNextEntry
loop2                   jmp #loop2

                        rdlong lineNum, nextLineAddr
                        mov baseIndex, lineNum
                        rol baseIndex, #5
                        'add baseIndex, #1
                        mov maxIndex, baseIndex
                        add maxIndex, #32
                        mov lineIndex, lineNum
                        rol lineIndex, #3
                        jmp #loadNextLine
                         
                        'cmp nextline, 
                        'jmp #loadNextLine 
loop                    'jmp #loop







loopControl             long    16
tileSectorCounter       long    2

cogBuffer               res     64


nextLineAddr            res
scanlineBufferAddr      res
baseTileDefAddr         res
tableAddr               res

lineNum                 res

tableIndex              res
lineIndex               res
tileEntryAddr           res
tileTableEntry          res
tileAddr                res

tileData                res
scanlineIndex           res

baseIndex               res
maxIndex                res

currentLine             res

potatohead · 2011-07-23 09:32

Right off the bat, I can see a lot of extra cycles. Think about doing more initial work to make the render loop smaller. This can go faster. Good news for you.

loadTile                rdlong tileData, tileAddr                         
                        wrlong tileData, scanlineIndex

This is the same as:

loadTile       rdlong tileData, tileAddr
               nop  '<--- put something here, because it's time spent anyway
               nop
               wrlong tileData, scanlineIndex

One thing you can think about is the HUB windows are 2, 6, 10 instructions.

If you have, say 3 instructions between HUB ops, you might as well have 6, because that's how long it will really take.

If you have no instructions, you might as well stuff 2 in there, for the same reason. What you want to do is have as few HUB ops as possible, with the windows optimized for the best balance on computation vs window size. One long window, plus one short one = long time. Two moderate windows = less time, but probably the same computations!
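
The 2, 6, 10 pattern falls out of a simple timing model, sketched here in Python. It assumes a Propeller 1 where each cog gets a hub slot every 16 clocks, a hub instruction takes at least 8 clocks once its slot arrives, and ordinary instructions take 4 clocks each. It also assumes the loop has already settled into alignment with the hub. The function name is made up.

```python
HUB_SLOT_CLOCKS = 16        # assumed: one hub slot per cog every 16 clocks
HUB_OP_MIN_CLOCKS = 8       # assumed: minimum cost of rdlong/wrlong
INSTRUCTION_CLOCKS = 4      # assumed: cost of an ordinary instruction

def clocks_between_hub_ops(n_instructions):
    """Clocks from the start of one hub op to the start of the next."""
    busy = HUB_OP_MIN_CLOCKS + INSTRUCTION_CLOCKS * n_instructions
    slots = -(-busy // HUB_SLOT_CLOCKS)      # round up to a whole hub slot
    return slots * HUB_SLOT_CLOCKS

for n in (0, 2, 3, 6, 7, 10):
    print(n, "instructions ->", clocks_between_hub_ops(n), "clocks")
```

Under these assumptions, 0 and 2 instructions cost the same, as do 3 and 6, and 7 and 10. Anything that fits inside a window that is already being paid for is effectively free.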

You've got one long one there, and a really short one, and a medium length one. Moving some instructions into the short one basically means executing "free" instructions, making the loop faster overall. Combining and simplifying computations will probably get this under the wire, fast enough.

Prioritize your computations, keeping as many of them out of the render loop as possible, and structure the order of things to best fit the HUB windows. Finally, do as few computations as possible.

This can be simpler too, unless I'm missing something. Need to read it over closely and step through.

I'll go and fetch some of my code that does almost the same kind of screen. That's on another box. Maybe others will start to chime in on this one too, and we can get it faster!

potatohead · 2011-07-23 10:19

Here's a renderer code chunk from a 4x8 hi-color tile driver. 4x8 is the easiest and fastest. Doing 8x8 will require a coupla more hub-ops. I would double the loop, one for the lower long, and one for the upper long, adjusting counters and shifts and such accordingly. Or, use two render COGS, each one doing part of the tile, for higher resolutions, if needed.

next_scan               [some code]


                        mov     _fontline, active_scan                           'Prepare to operate on active_scan
                        and     _fontline, #%111        wr                       'only need scan line modulo 8
                        shl     _fontline, #2                                    'one long per tile row (#3 for 8x8 tile)
                        mov     _fontsum, _fontline                              'calculate font table offset once per scanline
                        add     _fontsum, fonttab                                'font table offset keyed to active scanlines
                                                                                 'pointer to tile table done!!

                        add     active_scan, #1                                  'pre add counter for next scanline

                        mov     count, numwtvd                                   'do every character on scanline
                        mov     _lnram, lnram                                    'point to beginning of line buffer


scanloop1               add     _lnram, #4                                       'index to next buffer element

                        RDbyte  A, _screen                                       'get tile offset from screen array
                        shl     A, #5                                            'multiply by 32 (#6 for 8x8 tile)
                        add     A, _fontsum                                       'calculate effective tile Y address
                        
                        RDlong  B, A                                             'fetch pixel data
                        add     _screen, #1                                      'point to next tile table address
                        mov     C, active_scan                                   'prepare to adjust tile table pointer
                        
                        WRlong  B, _lnram                                        'write to scan buffer
                        djnz    count, #scanloop1                                'done with all the buffer writes?
                                                                                 'no, goto scanloop1
                                                                                 'yes, prepare for next scan line 
                        mov     C, active_scan                                   'need working copy of scan line counter
                        and     C, #%111                                         'get modulo 8 (tiles 8 rows high)
                        cmp     C, #0  wz, wc                                    'are we done with a full set of tiles?
              if_NZ     sub     _screen, numwtvd                                 'no, keep screen pointer on same set of tiles
                                                                                 'otherwise, it points at next row on screen
                        jmp     #next_scan                                       'do next scan line

'   _fontline = vertical offset into tiles modulo (0-7)
'   _fontsum = base tile address, plus vertical offset into tiles
'   These values are common for an entire scan line



'   count = number of tiles to process.  This loop was doing 4x8 tiles.

'   _lnram = HUB scan line buffer

'   A, B, C = temp operating variables

'   active_scan = current scan line

'   numwtvd = number of waitvids per scan line

'   Various constants, #4, #%111, #5, etc... are all sized for 4x8 tiles.

'   Note the block of initial computations is outside the render loop.  Also note the HUB windows
'   are all two instructions.  Adding one instruction to any of those bumps the time to 6
'   instructions, slowing the loop considerably.



DAT
fonttab       long      $06000600  '<---fontline 0 = 0  (scan line 0)
              long      $00060006  '<---fontline 1 = 4   (scan line 1)
              long      $06000600
              long      $00060006  '<---fontline 3 = 12
              long      $06000600
              long      $00060006
              long      $06000600
              long      $00060006  '<---fontline 7 = 28   (scan line 7)

'Fontsum = fonttab + fontline  This is how you calculate the vertical offset into the tile.  All other
'computations are done with that sum, simplifying the render loop, which just gets a tile number and does
'the required multiply to get the pixel data from this table.

'Say tile 1 is desired, and we are on scan line 3, and fonttab = 1000.  1000 + (3*4) = 1012.  That's the base
'address the render loop uses, so all tiles are offset by three rows.

'Render COG reads the tile table, multiplies the entry by 32 for a (4x8) tile, or 64 for a (8x8) tile, and adds that
'to fontsum.  For tile 1, the address is offset from fontsum, which equals 1012 + (1*32) = 1044



tile1         long      $06000600
              long      $00060006
              long      $06000600
              long      $00060006     '<----render cog points here, instead of at the top of the tile (1044)
              long      $06000600
              long      $00060006
              long      $06000600
              long      $00060006

tile2         long      $06000600
              long      $00060006
              long      $06000600
              long      $00060006
              long      $06000600
              long      $00060006
              long      $06000600
              long      $00060006

I stripped out a lot of clutter to highlight what the render code can look like. I'll look yours over later today and see if I can spot some easy speed ups. Thought you might like another one to look at and think on.

(not on a prop at the moment, so I can't run any of this stuff, or I would have just posted up a working 8x8 )

Vega256 · 2011-07-23 14:16

potatohead wrote: »

Here's a renderer code chunk from a 4x8 hi-color tile driver. 4x8 is the easiest and fastest. Doing 8x8 will require a coupla more hub-ops. I would double the loop, one for the lower long, and one for the upper long, adjusting counters and shifts and such accordingly. Or, use two render COGS, each one doing part of the tile, for higher resolutions, if needed.

Ah, I see.

Our math is on the same plane, but your engine is more efficient. I should definitely be able to speed things up if I were to execute the same math in fewer steps, as you did. In my driver, I went around acquiring the terms in the address equation and then added them together with the base address. But you acquired all the terms, progressively accumulating them together. You also made good use of the "free execution time" in between HUB operations.

Confirming my understanding of the HUB instructions; if I were to have two, one immediately after the other, the acting cog would just sit there until the HUB gives it access again?

potatohead · 2011-07-23 14:44

Yes, exactly.

mov     C, active_scan

Looks like I've got that one in there twice! It was removed on a later version. This was early code for the tile driver in my blog. Be sure and strip that out. Took me several re-writes to get to that instruction sequence. Use any and all of it as you see fit. I have a fairly consistent problem using too many variables. Write it, then get rid of 10 percent of it, then do it again... After doing that a few times, I now think about the variables needed first, then write from there, trying not to add new ones. Works for me.

Yeah, the COG can only progress past a HUB operation when the window is open. It waits otherwise. It can either be waiting, or doing something. Either is fine, and either produces the same result.

Some consider it a poor practice, but I like to drop "nop" instructions between HUB ops, then work to stuff real instructions in there. I'll also use them to test where things really matter. Better to cycle count though. Visually, the "nop" is a simple device to get started. Just don't leave 'em in there like I've been caught doing a time or two.

Remember, 2, 6, 10.

I also like to group them so I can see the windows.

Cluso helped me big once. Was working on a messy loop, trying to optimize for the HUB. What helped was to get it working, even if slow. Then write another one, using a simple branch to toggle between loops. Put things out of order, leaving the original in the code so it makes sense. Recommended. It's easier to rewrite a complex loop for out of order speed from the original, clean, sequential source than it is to deal with the out of order product. At least that's true for me.

Doubling this one would work for the 8x8 tiles too. Instead of reading one screen address, just read two, with RDWORD, do the loop I've got there, then just do another one for the other half of the 8x8 tile. You can pre-add the offset needed for the odd tiles, or even tiles, and just duplicate the instructions seen here. I'll bet that ends up quick enough.
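To picture what the RDWORD version hands back, here is a tiny Python illustration (not Propeller code). It assumes the screen array holds one byte per tile and that the word comes back little-endian, so the byte at the lower address lands in the low 8 bits. The helper name split_screen_word is made up for this sketch, and on the real chip the screen pointer would need to be word aligned.

```python
def split_screen_word(word):
    """Split a 16-bit value, as a word read of the screen array would
    return it, into the two tile numbers it contains."""
    first = word & 0xFF           # byte at the lower (even) address
    second = (word >> 8) & 0xFF   # byte at the next address
    return first, second

# Two adjacent screen entries: tile 3 followed by tile 7.
screen = bytes([3, 7, 1, 0])
word = int.from_bytes(screen[0:2], "little")
print(split_screen_word(word))  # (3, 7)
```

Each of the two tile numbers then goes through the same shift and add as in the single-tile loop.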

From your comments, I think you see where the speed ups can come from, so I'm not gonna step through the one you posted up here. Post up your next iteration, if you want, later. There are people here who can always squeeze something out of it.

In general, you can get a 15-30 percent speed increase just by framing instructions to fit the window best case. You can get another chunk by combining operations. That's roughly equal to clocking up a prop from 80 to 100 MHz!!

Vega256 · 2011-07-23 15:13

potatohead wrote: »

Here's a renderer code chunk from a 4x8 hi-color tile driver. 4x8 is the easiest and fastest. Doing 8x8 will require a coupla more hub-ops. I would double the loop, one for the lower long, and one for the upper long, adjusting counters and shifts and such accordingly. Or, use two render COGS, each one doing part of the tile, for higher resolutions, if needed.

In the beginning of your code, why did you AND the line number with 7?

potatohead · 2011-07-23 15:33

Active scan will run from 0 to however many scan lines are in play. The tiles are stacked up on the screen vertically, one after the other. I do the AND operation so I can get active_scan modulo 8. That number is multiplied by the number of bytes in each tile row, so it can be added to the base tile addresses before the render loop starts.

For each scan line, one row of the tiles will be drawn to the screen. That row is the active scan line modulo 8, which is what the AND operation does. Think of it like a counter that just goes 0, 1, 2, 3, ... 7, 0, 1, 2, ....

Scan line 0 is tile row 0, so nothing needs to be added to the base tile pixel data address. Scan line 1 is tile row 1. The addition will offset the base tile row address by one row. Scan line 7 is tile row 7, and scan line 8 is tile row 0 again. The AND operation just computes the relationship between tiles and scanlines, which is only the three lower order bits.

The render loop then only needs to fetch the tile out of the tile table, multiply by the tile size, and add to the base address that has been offset by the row number needed for that scan line.
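The address arithmetic described above can be checked with a few lines of Python (a model of the math only, not driver code). The function name tile_row_address and its parameter names are made up for this sketch; the shifts default to the 4x8 case from the posted code (4 bytes per tile row, 32 bytes per tile).

```python
def tile_row_address(fonttab, tile, scanline, row_shift=2, tile_shift=5):
    """HUB address of the pixel data for one tile on one scan line."""
    # Done once per scan line: keep the low three bits of the scan line
    # (the row inside an 8 row tile) and scale by the bytes per row.
    fontline = (scanline & 0b111) << row_shift
    fontsum = fonttab + fontline
    # Done once per tile inside the render loop: one shift and one add.
    return fontsum + (tile << tile_shift)

print(tile_row_address(1000, 1, 3))   # 1044
print(tile_row_address(1000, 1, 11))  # 1044, scan line 11 is tile row 3 again
print(tile_row_address(1000, 0, 8))   # 1000, scan line 8 wraps to tile row 0
```

For 8x8 tiles the same function would be called with row_shift=3 and tile_shift=6, matching the #3 and #6 notes in the code comments.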

Edit: This is why powers of two come in so handy for video related stuff. If the tiles were, say, 9 rows high, a simple AND operation wouldn't cut it. One would either need to do the AND with one more bit, then compare and adjust the value (harder), or maintain another variable to count tile rows (easier). With the 8 row high size, the relationship between scan lines and tiles comes down to one instruction. A 9 row tile would impact the render loop too, because a simple shift and add would not get the right pixel data, unless the tiles were stored in a wasteful way, allowing for a shift. The next shift up from 32 is 64, so 9 row tiles would be located on 64 byte boundaries, assuming 4x9, with bytes wasted, due to only 9 of 16 rows actually being displayed.
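To put numbers on the 9 row case, here is a short Python sketch (arithmetic only). It assumes 4 bytes per tile row, as in the 4x8 example; padded_tile_stride and next_tile_row are hypothetical helper names.

```python
def padded_tile_stride(rows, bytes_per_row=4):
    """Smallest power-of-two tile spacing that still allows a plain shift,
    plus the number of bytes wasted per tile at that spacing."""
    size = rows * bytes_per_row
    stride = 1
    while stride < size:
        stride <<= 1
    return stride, stride - size

def next_tile_row(row, rows):
    """The 'easier' option: a separate tile row counter that wraps at rows."""
    row += 1
    return 0 if row == rows else row

print(padded_tile_stride(8))  # (32, 0)
print(padded_tile_stride(9))  # (64, 28)
print(next_tile_row(8, 9))    # 0
```

With 9 rows, each 36 byte tile sits in a 64 byte slot, so 28 bytes per tile go unused.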

Vega256 · 2011-07-23 16:37

What's the difference between your font table and your tile table?

potatohead · 2011-07-23 16:50

The font table is simply the start of the pixel data, like for an 8x8 text font. An 8x8 font, at 2 colors, takes 2 KB and has 256 tiles in it, 8 bytes / tile, sequential.

I call the tile table, "the screen", like old computers did. It contains one byte pointers to the font_table.

Vega256 · 2011-07-23 20:04

I just thought of something. Shouldn't the driver wait for the TV to at least do a Vsync so that the signal will start at scanline 0?

potatohead · 2011-07-23 20:38

Yeah.

Have the TV COG either output the current scan line, which the render COG can read, or just have it output a 1 state when it's at the bottom of the screen, finished with the last scan line, or maybe at the end of VBLANK. The render COG writes a zero to that location, then waits for the 1, at which point it enters its frame loop. When it's done with all the scan lines, have it write the zero again, and wait for the TV cog to signal another frame.
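Here is a toy Python model of that frame latch, stepping one scan line per tick. It is only a sketch of the handshake: the 4 line frame, the simulate function, and the idea that the renderer draws a line in the same tick the TV shows it are all simplifications invented for the example.

```python
LINES = 4  # scan lines per frame in this toy model (hypothetical)

def simulate(ticks, start_delay):
    """Return (tv_line, render_line) pairs for every line the renderer drew."""
    flag = 0            # stands in for the shared HUB variable
    render_line = None  # None means the renderer is waiting for the flag
    pairs = []
    for tick in range(ticks):
        tv_line = tick % LINES
        if tick == start_delay:
            flag = 0                      # renderer start-up: clear a stale 1
        elif tick > start_delay:
            if render_line is None and flag == 1:
                render_line = 0           # TV signalled a new frame
            if render_line is not None:
                pairs.append((tv_line, render_line))
                render_line += 1
                if render_line == LINES:  # frame done: write 0, wait again
                    render_line = None
                    flag = 0
        if tv_line == LINES - 1:
            flag = 1                      # TV cog: last scan line finished
    return pairs

print(simulate(12, 5))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Whatever tick the renderer starts on, it clears the flag first and so never picks up mid-frame; its line count always matches the TV's.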

ericball · 2011-07-24 04:55

potatohead wrote: »

Actually, you can do a render that is longer than a HSync, but less than a scan line, if you trigger the render to happen right at the start of the front porch, or right border / overscan. The renderer will race ahead of the beam, using the border + HSync time as "buffer" time to complete operations. As long as the render does not suffer a slow down greater than that time, the display will appear un-corrupted. Did that in one of my earlier drivers successfully.

This only works if the rendering can be done left to right.

potatohead · 2011-07-24 08:43

Actually I did it left to right in Potatotext 1. I went and looked to see if I had a djnz instruction in there and didn't. It's left to right, just ahead of the beam. That render was convoluted too. Took nearly the entire scan line to do. It was that experience that more or less convinced me to use a double buffer from then on. Very difficult to utilize more than one render COG, due to the precise timing issues. At any one time, only a small window of time is available for another COG to write the buffer. The write will either not be seen as the beam has passed, or is stomped on by the primary render COG having yet to write its render product to the buffer.

Not saying it was wise. Only that it could be done.

Vega256 · 2011-07-25 08:02

I have bad news but very good news.

The very good news is that I got my tile driver running. The bad news is that the tiles are being shifted down the screen for every frame render; almost as if the tile driver is late filling the buffer by one line.

potatohead · 2011-07-25 08:14

That has happened to me. Yeah, you are close!

You have a sync problem. The way to solve it is to think through the states of both the TV driver and the renderer and have them do interlocking things to sync up. It could be you are really close, and not properly initializing your renderer. Have it keep track of its own scan lines, and reset that every frame sync.

For the frame: Have the TV driver write a one to a HUB variable at some specific time. I recommend after the last scan line has completed. Have the renderer start up, render its first scan line, then write a 0 to that same location, looping to check for the 1, before rendering all the scan lines. When they are done, initialize for the next frame, and repeat. The renderer can count its own scan lines. That is all that you need to sync up for the frame.

For the scan line, I don't know if you are using a single or double buffer. The single method has been discussed above, and it's tricky. Not recommended. Reads like you've got the scan line working though, so don't tinker with it yet.

For a double buffer, I suggest having the TV COG read its buffer address during the HBLANK, so that it's rendering from one of the two buffers on that scan line, directed by the render COG. After it fetches that buffer address, have it write a 0 to that location. The render cog writes the buffer address it just rendered to that same location, and it loops and waits for the zero to be written before advancing to render the next scan line.

Those two latches will keep the render COG rendering in the right place, at the right time. The 80x50 driver in my blog uses that basic latch sync technique, if you want to look at some code.
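The double buffer latch can be walked through the same way. This Python sketch is not the 80x50 driver's code; the buffer addresses, the mailbox variable, and the one-step-per-scan-line ordering are assumptions made to keep the example small.

```python
BUFFERS = (0x100, 0x200)  # hypothetical HUB addresses of the two line buffers

def simulate(lines):
    """Return the scan line numbers the TV cog displays, in order."""
    hub = {}     # buffer address -> line number rendered into it
    shown = []
    # Renderer primes the first buffer and posts its address.
    hub[BUFFERS[0]] = 0
    mailbox = BUFFERS[0]
    next_line = 1
    for _ in range(lines):
        # HBLANK: TV cog fetches the buffer address, then writes 0.
        current = mailbox
        mailbox = 0
        # Renderer sees the 0 and renders the next line into the other buffer.
        target = BUFFERS[next_line % 2]
        assert target != current  # never the buffer being displayed
        hub[target] = next_line
        mailbox = target          # post the address it just rendered
        next_line += 1
        # Active video: TV cog displays from the buffer it fetched.
        shown.append(hub[current])
    return shown

print(simulate(6))  # [0, 1, 2, 3, 4, 5]
```

The TV side always shows lines in order, and the renderer is held to one line ahead because it cannot post a new address until the previous one has been taken.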
