At what frame update speeds do these HDTV displays start to have visible issues? I.e., just how slow can you go?
Chip was referring to the old analogue CRTs that do have flicker and sync issues, as opposed to HDTVs which don't have such issues because they have built-in scan converters.
24p will be the bottom limit for framerate. The bottom limit for resolution/pixel clock and the like is a question still to be answered. EDIT: HDTVs will likely have lower minimums than LCD monitors to accommodate PAL/NTSC frequencies.
I'm sorry, but that is absolutely not my experience, or we are talking past one another. To be clear on what I was writing about, say we are on a PAL interlaced display. All the lines are displayed at a rate of 25hz. Motion faster than that will not be scan line synchronized, and the shape of the image will degrade.
Degrade is a relative thing. Upping the dot clock rate degrades the perceived pixel quality too. Just because motion is not as crisp as a still image doesn't mean it ain't a 50 Hz effective framerate.
The effective motion framerate is not the same thing as a PAL encoded "frame". The field rate is the motion framerate for interlaced footage.
In progressive scan, motion can be 50hz without tearing. Interlaced is constrained to 25hz.
Progressive, by both definitions, is the full image space update rate. It only changes information in the captured/rendered image after the complete set of pixels has been updated. Interlaced does not have this limit, so it can have a higher effective framerate than its full pixel count would otherwise imply.
The degree of object tearing depends on the position change. With standard definition TV sets, the cost is vertical resolution. Full vertical resolution = half the motion rate, half vertical resolution = full motion rate: 25 and 50 Hz respectively for PAL, 30 and 60 Hz for NTSC.
At the same data rate, interlaced has the same full framerate as sequential but double the resolution (definition). It loses detail in motion and gains it back again as the motion reduces, although a good deinterlacer will maintain almost all of the detail at the cost of added lag.
The trade-offs you are describing sound more like progressive vs sequential.
Video processing can mitigate this, at the potential cost of both detail and/or temporal accuracy for the motion. Put one of those 3D games up on an interlaced display and go move around some. Or do the same with some bright sprites or vectors against a dark background. It's gonna tear or blur on an interlaced display when the motion exceeds the "show all the scan lines" rate; the tear will happen on analog devices with no processing, and the blur will happen on those that feature video processors.
This is tainted by horribly bad deinterlacing experiences. Just remember that interlaced content has had to be deinterlaced for every LCD. But in the case of 50 Hz refresh on a CRT you are going to be seeing a lot of flicker with or without interlacing, so you aren't exactly comparing apples with apples. Wind that up to your beloved 100 Hz and interlaced won't be so crummy. Still half the bandwidth of a sequential 100 Hz at the same resolution.
Degrade is a relative thing. Upping the dot clock rate degrades the perceived pixel quality too. Just because motion is not as crisp as a still image doesn't mean it ain't a 50 Hz effective framerate.
Indeed we are talking past one another. To get pixel perfect motion, it is necessary to constrain motion to the "display all the scan lines" rate. Once that is exceeded, things degrade. You are essentially saying that's no big deal, and I would agree for most content. Computer graphics like for CAD or many types of data visualization, and games in general do not always meet that criteria, and I wrote up where people see the differences and how they play out.
There are enough problem use cases to warrant things like "GAME" mode on modern HDTVs. Much of the processing loop is skipped, leaving a more pixel perfect display. The difference can be seen on a typical "program guide" where the text will change. First it's one character, then a blend of the two, somewhat indistinct, then the new character. A slow video processor might take a few frames to get that done, a faster one less, and GAME mode doesn't do it much at all, leaving a slightly less refined image, but one that doesn't mush things together across frames.
Video processors have their costs, and the costs or trade-offs between these various things was all I was getting at.
I don't have a "beloved 100 Hz" preference now, but I did on early graphics systems. Actually it was 70-90 Hz for me. Anywhere in there, display fatigue was nearly eliminated, and interaction was precise and productive. That is optimal for games, CAD, and visualization.
I do keep a CRT around, component for TV type devices and VGA for computer type devices, and I do so for the pixel perfect display it delivers. Video processors leave artifacts. I enjoy the accurate, analog display stream. For most of what I do, though, I don't care; those things happen on LCD mostly, sometimes plasma.
aren't exactly comparing apples with apples.
You seem to be arguing the benefits of interlacing and that people won't notice or it doesn't matter. Comparing interlaced vs progressive scans at 50 / 60 Hz is comparing apples to apples, because that is what people got on their TVs until recently, with the HDTV sets. For a movie, or some other "natural" content, interlace delivers superior resolution, and it's great! For computer graphics, it can be great, or not so great, depending on the use case, which again I was writing about more than anything else.
The trade-offs you are describing sound more like progressive vs sequential.
Progressive is a sequential scan. The differences described are between interlaced and progressive/sequential. The simple cost of interlaced display is motion artifacts and/or image artifacts, depending on whether there is a video processor patching it all up or one is viewing on a CRT sans video processor.
This is odd
Upping the dot clock rate degrades the perceived pixel quality too.
Well yes! That one is interesting, and since you brought it up, there are motion advantages there TOO! Lots of ways to exploit it as well. Again, I'm writing about TV type displays as most of the things I'm going to write about become marginal at higher resolutions.
Say we are on a TV and we've got 160 pixels of horizontal resolution and the frame rate is 60 Hz, non-interlaced. An object's motion can be updated 60 times per second, and its motion conveyed by any of 160 pixels. This means there is a minimum motion possible, assuming the image isn't changed to fool the eye. Take that same image and run the dot clock at 320 pixels. On NTSC displays, some minor color detail is lost; however, the minimum horizontal motion just got smaller, allowing for more fluid representations of more motion paths. It's "smoother", essentially. Do that again, with a 512 pixel clock, and an NTSC display will render artifacts on the fringes of things, and more color detail is lost, but horizontal motion is now very precise. Doing things like scrolling multiple backgrounds to give a depth illusion allows for more varied backgrounds and a smoother look.
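A quick worked example of that minimum-motion point (my own illustration in plain C, not Propeller code): at a fixed 60 Hz frame rate, the finest horizontal step an object can take is one pixel per frame, so the dot clock sets how slow a scroll can be and still look continuous.

```c
/* Illustration only: minimum horizontal motion step vs. dot clock,
   assuming the 60 Hz non-interlaced TV example above. */
#include <stdio.h>

int main(void)
{
    const double frame_hz = 60.0;
    const int widths[] = { 160, 320, 512 };   /* horizontal resolutions discussed above */

    for (int i = 0; i < 3; i++) {
        int w = widths[i];
        double step    = 1.0 / w;             /* finest position change, in screen widths   */
        double slowest = step * frame_hz;     /* 1 pixel/frame scroll, in screen widths/sec */
        printf("%3d px: min step = 1/%d of the screen, slowest smooth scroll = "
               "%.3f screens/sec (%.1f sec to cross)\n",
               w, w, slowest, 1.0 / slowest);
    }
    return 0;
}
```

At 160 pixels a one-pixel-per-frame scroll crosses the screen in under three seconds; at 512 pixels the same one-pixel step gives a crawl of over eight seconds, which is why the higher dot clock reads as smoother motion.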
The artists can exploit the dot clock by making art center around the 160 pixel color detail level, while adding highlights at higher resolutions and/or keeping regions of art closely aligned on the color wheel, emphasizing luma instead of color overall. The difference in motion is stark, which is why game consoles went ahead and did that on a TV, offering VGA, S-video, or component outputs for those with better displays.
A key point to note here is those are managed trade-offs, sans video processing. When properly exploited, the user just sees a better experience, despite a fairly crappy signal delivering it. Add a video processor to that, and often the results are sub-par, but sometimes excellent depending on the video processor.
I'm a very perceptive and technical person regarding displays. When I'm viewing things I'm not really interacting with, most of these trade-offs do not matter. When I am interacting with it, many of them do, and sometimes they are annoying, taking me out of the experience.
One big annoyance today is HD image compression. Some scenes are so bad that I see little colored squares during periods of intense motion! I have since learned many people either don't care or can't see them due to the nature of their perception. Tune your average cable carrier and take a photo during an action scene in an action movie with a car, some mega explosion, etc., and it's amazing how much really isn't there. View that same title on DVD / Blu-Ray or in the theater and it's an entirely different thing.
That motion sure is fast though! Can't really discern the objects on low bit rate streams, but it's fast!
So, back to the 24p discussion. I really like this one and 25/30 Hz progressive scanned signals because the processing on them is simple. Just send all the scan lines. The video COG will be simple, and for a dynamically drawn display, no secondary buffer is needed either. 24-30 fps leaves a lot of time to build images too, which can be exploited for more complex dynamic displays, or even timed writes to a bitmap buffer. I like displays like that, and will find out what the HDTV sets will deal with, just because it's fun to exploit display technology, artifacts and such.
I don't care for video processors much and turn them off for my projects. The primary reason is they don't always do the same thing. If they did, I would like them much better.
Do you guys really think it's important to keep the UDQM/LDQM pins of the SDRAM connected to the Prop2 chip? If we grounded them, we could save two pins, but wouldn't be able to do byte-level writes, only word-level and up. Is it worth committing two pins to?
They are very important; otherwise byte writes will take a full read cycle, masking of the read value to isolate the byte, OR-ing in the new value, and a full write cycle.
My guesstimate is that:
- with UDQM/LDQM it will take ~ 12 cycles to write a byte (assuming 10 cycles for SDRAM access, 2 cycles for testing odd/even address and SHL #8 if needed)
- without them it will take ~ 10 to read the word + 4 + 10 to write the word = ~ 24 cycles to write a byte (same assumptions as above, adding an AND and an OR)
All byte writes would take twice as long, which will matter for any point / line / circle etc. graphics op.
Given that I expect 8bpp graphics modes to be used a lot, we need UDQM/LDQM desperately.
Do you guys really think it's important to keep the UDQM/LDQM pins of the SDRAM connected to the Prop2 chip? If we grounded them, we could save two pins, but wouldn't be able to do byte-level writes, only word-level and up. Is it worth committing two pins to?
I think that we could and should do without byte level access to SDRAM, and I say that while I plan to have 8bit per pixel graphics buffers in SDRAM. The reason is that while byte level access seems like it's "required" in that case, the performance impact of doing byte level SDRAM writes means that everyone should avoid it entirely. I plan to do all my SDRAM graphics stuff in blocks (tiles as I would call them based on GPU tech, but not the same kind of tiles you might be thinking of based on old school graphics). It's more work up front to get the "system" working, but in the long run it's faster (for most things) and likely scales across multiple COGs better.
I honestly feel that losing two whole I/O pins just to have byte level access to SDRAM isn't worth it. This is just for the module Parallax is going to be making, right? If you really want your board to have byte level access then you can consume the two extra pins.
Maybe, I'm less worried about it since I tend to not care that much about changing one pixel at a time....
I agree, single byte access isn't necessary, as it'll mean changes all round to the SD driver too, and saving two pins is great; there's a LOT you can do with two pins on Prop2.
I understand Bill's argument very well, but in the code I've written so far, I've found that reading/writing anything less than several QUADs is too slow. I'm thinking that my own graphics practices will have to move toward being more block-oriented, like Roy was describing.
For anyone who wants to understand how the SDRAM works, I'll describe it in the way I think about it:
- Imagine 4 separate cabinets (SDRAM "banks"), each holding 8,192 pull-out drawers (SDRAM "rows").
- In each drawer is a loop of 512 words (SDRAM "columns"), which can be accessed sequentially from any offset for reading and writing.
- Only one drawer can be opened ("active") at a time within each cabinet.
- If you want to open another drawer within a cabinet, you must first close the one currently open (or "precharge" it).
- Each read and write operation specifies which bank and column is to be accessed, with the assumption that the row of interest has already been made active in that bank.
Once a bank and row have been made active, a word in that row can be randomly read or written on every clock cycle. There are only 512 words per row, though, so you very often need to close one drawer and open another. If you imagine the analogy of the cabinet of drawers and what a pain it is to open and close drawers, versus looking into a drawer to see what is there, you get the idea. Opening and closing drawers is going to be the predominant effort in dealing with non-block accesses. To get efficiency, data might need to be arranged so that 512 words is a meaningful chunk, on its own. Roy explained to me once that graphic memory is often arranged so that a limited 2D area, rather than a scan line, say, would be grouped as contiguous memory, or a "row", in this case.
If I think about writing a simple text editor and how quaint it would be to be able to scroll buffer contents by one byte using byte accesses, it makes my code very simple, but it's maybe 20x slower than doing read-modify-write operations on 1024-byte (512-word) blocks. So, while I like the idea of byte access, it seems like a crutch. This is why I'm asking whether we should support UDQM/LDQM on our Prop2 module.
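To make the drawer analogy concrete, here is a small sketch (illustrative C, not driver code) that splits a linear word address into the geometry Chip describes: 4 banks x 8192 rows x 512 columns of 16-bit words. The particular bit layout, column in the low bits with row and bank above it, is an assumption; a real driver can map the bits however it likes.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t bank, row, col; } sdram_addr_t;

/* 4 banks x 8192 rows x 512 columns of 16-bit words = 32 MB total. */
static sdram_addr_t decode(uint32_t word_addr)
{
    sdram_addr_t a;
    a.col  =  word_addr        & 0x1FF;    /* 9 bits:  512 words per row (the "loop")   */
    a.row  = (word_addr >> 9)  & 0x1FFF;   /* 13 bits: 8192 rows per bank (the drawers) */
    a.bank = (word_addr >> 22) & 0x3;      /* 2 bits:  4 banks (the cabinets)           */
    return a;
}

int main(void)
{
    /* Walking linearly, every 512th word lands in a new row, i.e. a new drawer
       has to be opened (and the old one precharged) for each 1 KB of data.    */
    for (uint32_t w = 510; w <= 513; w++) {
        sdram_addr_t a = decode(w);
        printf("word %3u -> bank %u, row %4u, col %3u\n", w, a.bank, a.row, a.col);
    }
    return 0;
}
```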
The graphics memory arrangement I was describing to Chip is called "Morton order" or "Z-order" and here's a decent description of it: http://en.wikipedia.org/wiki/Z-order_curve
It also has the benefit of being easy to encode/decode the address from the X/Y, since you just interleave the bits to go from X/Y to address, and vice versa.
GPUs have used this memory layout for textures for a very long time, because it's a much more cache-friendly layout when doing accesses across textures at arbitrary angles/slopes. It's used for backbuffers often as well, although GPUs also support a "linear" format for backbuffers that allows easier software access, but it's lower performance in most cases, and even then it breaks up the backbuffer into a bunch of "tiles" that are something like 64x128 pixels or similar.
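For reference, the bit interleave Roy mentions is tiny in code. A minimal sketch (C; the 16-bit coordinate size is an arbitrary choice for the example):

```c
#include <stdint.h>

/* Spread the 16 bits of v out to the even bit positions: abcd -> 0a0b0c0d. */
static uint32_t spread_bits(uint32_t v)
{
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Z-order (Morton) address: X in the even bits, Y in the odd bits.
   For an 8bpp power-of-two texture this is the byte offset directly. */
static uint32_t morton_encode(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

Decoding is the mirror image: pull out the even and odd bits and compact them back into X and Y.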
I'm all for simple code but if that's at the expense of efficiency/performance then I have to wonder if the trade off is worth it.
For my 2 cents I would say no, do not support byte wide access.
Roy, can you explain a little more about your graphics 'blocks/tiles' please.
It seems very clear to me that SDRAM isn't ideal for low-latency access, but rather for high bandwidth. It was also clear to me that it's necessary to do block caching of SDRAM contents in whatever the most efficient order is.
I totally understand what you mean re/ text editors, and being able to chunk (scatter-gather) dram writes. I'd never do that for a screen buffer, and frankly, that is such a low-bandwidth use case that it is not relevant.
Unfortunately it does matter significantly for 8 bits per pixel computer graphics.
Look at various line, circle, arc, etc. algorithms. Those are the problem children - they will not chunk nicely at all.
They will degenerate into reading a word, modifying a byte in it, then writing it back.
When there is an attempt to scatter-gather, N words will have to be read in, the correct byte modified, and the whole line cache flushed when a different scan line needs to be modified.
You can help this somewhat by maintaining multiple line / tile caches at once, but it will break with diagonal lines, circles, vertical lines, etc.
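To make the point concrete, here is what an 8bpp plot looks like without the byte lanes, sketched in C (the sdram_read_word/sdram_write_word calls stand in for a hypothetical driver; they are not an existing API): every pixel becomes a word read-modify-write, and a line walk scatters those accesses across words and rows in a way no simple cache chunks nicely.

```c
#include <stdint.h>
#include <stdlib.h>

uint16_t sdram_read_word (uint32_t word_addr);                 /* hypothetical driver hooks */
void     sdram_write_word(uint32_t word_addr, uint16_t value);

/* Plot one 8bpp pixel when only 16-bit writes are possible:
   full read cycle, merge the byte in software, full write cycle. */
static void plot8(uint32_t byte_addr, uint8_t color)
{
    uint16_t w = sdram_read_word(byte_addr >> 1);
    if (byte_addr & 1)
        w = (uint16_t)((w & 0x00FF) | ((uint16_t)color << 8)); /* replace upper byte */
    else
        w = (uint16_t)((w & 0xFF00) | color);                  /* replace lower byte */
    sdram_write_word(byte_addr >> 1, w);
}

/* Standard Bresenham walk: each step touches a different word, and every
   change of scan line is 'pitch' bytes away - typically a different row. */
void line8(int x0, int y0, int x1, int y1, int pitch, uint8_t color)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;

    for (;;) {
        plot8((uint32_t)(y0 * pitch + x0), color);
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}
```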
Regarding tiling the screen... that will make reading the raster complex, and will severely slow down "blit" style raster ops.
Before you decide not to implement the byte write control lines, I invite you, Roy, Baggers (and anyone else) to write line drawing algorithms using both methods, and benchmark randomly plotting 1M points and drawing 100k lines - both ways (with control lines, and without).
Blitting rectangular pixel blocks on a tiled vs non-tiled approach at pixel (NOT tile) boundaries will also show a huge speed difference.
If you do not have the control lines, I predict 16bpp graphics will dominate - which has update bandwidth issues at 1080p
Update:
Roy, thanks for the link, will check it out. These tiled approaches have an issue with soft-scrolling large bitmaps horizontally, and retrieving the raster from the SDRAM will be more complicated.
Coley - I understand your viewpoint, byte access does not matter for textures etc., but it does for plotting points / lines / circles etc. in 8bpp
pedward - please try cached point/line plotting - you will see what I mean.
My first iteration is going to use 32x32 pixel "tiles" (for 8bit pixels), I'll make my "render" cog operate on those. If you are familiar with the PowerVR tech at all, it's kind of similar to that.
You filter your data into buckets for each tile based on what intersects with it. Then render each tile in isolation and clipped. Then ship it off to the SDRAM backbuffer.
I figure I can get it working all with one cog rendering to tiles, one cog running the SDRAM driver, and a 3rd cog drawing the display (reading from the SDRAM front buffer). Then I can add more render cogs to gain performance. When we have 8 cogs, it seems reasonable to have 4 cogs rendering tiles.
Another thing I will try is having tiles be larger, in multiples of the 32x32 area, so 64x32, 64x64, etc. Larger tiles means less sorting and less redundant work across tiles, but uses more memory and makes other things like non-Z-buffer HSR (hidden surface removal) slower.
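A rough sketch of that pipeline in C, just to pin down the shape of it (the sizes, types, and the sdram_write_tile() call are made up for illustration; this is not Roy's actual code): primitives are binned by which 32x32 tiles their bounding boxes touch, each tile is rendered in isolation into a small buffer, and the finished tile goes out to the SDRAM backbuffer as one burst.

```c
#include <stdint.h>
#include <string.h>

#define TILE      32
#define TILES_X   (640 / TILE)               /* assumed 640x480, 8bpp example */
#define TILES_Y   (480 / TILE)
#define MAX_PRIMS 64

typedef struct { int x0, y0, x1, y1; } prim_t;          /* bounding box only, for the sketch */
typedef struct { const prim_t *list[MAX_PRIMS]; int count; } bucket_t;

static bucket_t buckets[TILES_Y][TILES_X];

/* Filter each primitive into the buckets of every tile it intersects. */
void bin_primitive(const prim_t *p)
{
    for (int ty = p->y0 / TILE; ty <= p->y1 / TILE; ty++)
        for (int tx = p->x0 / TILE; tx <= p->x1 / TILE; tx++) {
            bucket_t *b = &buckets[ty][tx];
            if (b->count < MAX_PRIMS)
                b->list[b->count++] = p;
        }
}

void render_prim_clipped(uint8_t dst[TILE][TILE], const prim_t *p, int ox, int oy);
void sdram_write_tile(int tx, int ty, const uint8_t src[TILE][TILE]);   /* one burst out */

void render_frame(void)
{
    static uint8_t tile[TILE][TILE];
    for (int ty = 0; ty < TILES_Y; ty++)
        for (int tx = 0; tx < TILES_X; tx++) {
            memset(tile, 0, sizeof tile);                        /* clear the working tile    */
            bucket_t *b = &buckets[ty][tx];
            for (int i = 0; i < b->count; i++)                   /* draw only what hit it     */
                render_prim_clipped(tile, b->list[i], tx * TILE, ty * TILE);
            sdram_write_tile(tx, ty, tile);                      /* ship it to the backbuffer */
        }
}
```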
I just realized something that makes supporting UDQM/LDQM kind of moot:
Because once a row is opened, you can do back-to-back reads and writes, so this byte-masking can be handled in software! There will be about a 6-clock difference, having to read and write versus controlling UDQM/LDQM, but that's not much of a price to pay for two pins, if someone wants to write a byte-level SDRAM driver. I'm getting rid of those two connections on the Prop2 module. This frees two more I/Os.
Bill,
Line, circle, etc. drawing to chunks of the screen at a time is only slightly more complex than going to the whole screen. It has some added overhead and redundant work, but overall I bet the performance to SDRAM is faster doing so than ANY form of byte/word access. Not only faster, but likely an order of magnitude faster (if not more than one order).
@Chip,
Can you please specify weak pull-downs on the DQMs with 0603-size resistors?
That way people who may find a real need for controlling DQM can have it via some wires.
Your P2 protoboard will after all be an everything for everyone board, right?
That is an excellent idea; they could be brought out to vias that could be jumpered to two other Prop pins.
edit:
Perhaps have jumpers to the two freed-up pins: if shorted, they control the DQMs; if open, the DQMs are pulled down by weak pulldowns as per Steve's suggestion.
Chip:
Can you provide those UDQM/LDQM pins terminated in vias with 0.024" holes? Then those of us who want to try byte mode can just link the vias with Kynar wire. I think it would be a big mistake not to provide for byte access in some form, so that experiments can be performed both ways.
Bill,
I shared that link as more info based on Chip's mentioning our discussion about it. My tile-based P2 rendering plan doesn't involve using that ordering. I was planning to use linear order on my 32x32 (or larger) tiles for P2 rendering.
I should have been clearer about that. I'm not sure the Z-order thing helps us unless we could make the waitvid instruction access data in that way, and even then it doesn't make sense for how the vid stuff works in P2. It's something Chip and I will likely discuss in more detail for future ideas.
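For clarity, "linear order within 32x32 tiles" for an 8bpp buffer works out to something like this (a sketch with an assumed 640-pixel-wide screen, not Roy's actual layout):

```c
#include <stdint.h>

#define TILE     32
#define SCREEN_W 640                       /* assumed example width */
#define TILES_X  (SCREEN_W / TILE)

/* Byte offset of pixel (x, y): tiles stored one after another,
   plain row-major order inside each tile. */
static uint32_t tiled_addr(uint32_t x, uint32_t y)
{
    uint32_t tile_index = (y / TILE) * TILES_X + (x / TILE);
    uint32_t in_tile    = (y % TILE) * TILE  + (x % TILE);
    return tile_index * (TILE * TILE) + in_tile;
}
```

The flip side, which Bill raises next, is that one display scan line is then scattered across TILES_X different tiles, 32 bytes from each, so the raster read is no longer a single linear run.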
Anything but a simple linear ordering of scan lines in memory will complicate drivers to no end, and significantly limit the resolutions possible to display.
Even worse, it will consume extra cogs to decode the non-linear format in the SDRAM, to convert it to a scan line buffer, which, if modified, would have to be re-swizzled into the tile format before writing.
In degenerate cases, where only one or two pixels are changed in a 32x32 tile, performance will be incredibly slow. They will also significantly slow down pixel address calculations.
To use Chip's idea of read-modify-write for pixel plotting, the "gpu" cog would have to control the memory, and coordinate access with a display refresh cog.
Basically, saving two pins brings quite a few slowdowns, and complicates matters significantly for 8bpp modes. 16bpp is not affected, so that is what I will tend to use (even though it uses twice the bandwidth).
Of course, on my boards I will implement the byte select lines
I think what makes the most sense for the Parallax demo/proto board is to incorporate Steve's suggestion and have jumpers, so that the end user can decide between two more I/O pins or byte masks.
Can you please specify weak pull-downs on the DQMs with 0603-size resistors?
That way people who may find a real need for controlling DQM can have it via some wires.
Your P2 protoboard will after all be an everything for everyone board, right?
Thanks.
We've got that board really tightly laid out. I don't know if there's room for a few 0603's. Then we'd need to take those pins down to the connector, unless someone wanted to solder wires, as you said. That ~6-clock penalty for software masking vs controlling UDQM/LDQM only represents about a 10% penalty for a byte write, by the time you get the code wrapped in the other instructions that make it all play. My gut feeling is that it's not worth all the extra consideration.
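Putting the thread's own numbers side by side shows why the two comparisons land in different places: Bill's estimate compares the raw inner write, while Chip's ~10% figure compares whole wrapped driver calls (the ~60-clock wrapped figure below is derived from Chip's "6 clocks is about 10%", not stated anywhere).

```c
#include <stdio.h>

int main(void)
{
    int with_dqm    = 12;     /* Bill's guess: hardware byte masking         */
    int without_dqm = 24;     /* Bill's guess: software read-merge-write     */
    printf("raw byte write:      %2d vs %2d clocks (%.1fx)\n",
           with_dqm, without_dqm, (double)without_dqm / with_dqm);

    double wrapped = 6.0 / 0.10;   /* 6 extra clocks ~= 10% => ~60-clock call */
    printf("wrapped driver call: %2.0f vs %2.0f clocks (~10%% extra)\n",
           wrapped, wrapped + 6.0);
    return 0;
}
```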
Please at least allow us the option of using byte access. Wires are fine between vias, and allow us to be able to cut the ground traces to UDQM/LDQM if you cannot provide 0603 pulldowns. Otherwise you preclude other uses for the SDRAM, such as emulators etc.
I'm actually not sure about getting rid of the UDQM/LDQM pins.
What about the CKE pin? It's used to skip clocks and put the SDRAM into power-down/self-refresh mode. It allows the Prop2 Module to go into a ~4mA mode while preserving SDRAM contents. That's important, I think. What do you guys think? Could we do without it? The SDRAM data sheet says that it can be tied high if you don't want the functionality.
Why not leave the access at 16 bits and use the field mover for byte-level access? The field mover is at most 2 clocks if you have to call SETF; it's 1 clock otherwise to put any byte into any other byte location.
pedward,
The issue is that if you want to write one byte of a 16-bit word, you need to read the original 16-bit word first, then mod whichever byte, then write the result back out. With the two lines, you can have the SDRAM chip just mod whichever byte, so you can write out a byte without the read/modify/write operation.
Even still, accessing SDRAM a byte at a time is more than 100x slower than accessing SDRAM 64 quads at a time. I think choosing to do something 100x or more slower because it's easier is silly.
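A toy cost model of that burst-vs-byte gap (my numbers, not Roy's): all the per-transaction overhead, row activate, CAS latency, precharge and driver setup, is lumped into one assumed constant, and if a QUAD is four longs then 64 quads is exactly one 512-word row, so one transaction moves 1024 bytes.

```c
#include <stdio.h>

int main(void)
{
    const double per_word    = 1.0;               /* clocks per word once a row is streaming */
    const double overheads[] = { 10, 25, 50 };    /* assumed setup clocks per transaction    */

    for (int i = 0; i < 3; i++) {
        double oh    = overheads[i];
        double burst = (oh + 512 * per_word) / 1024.0;  /* clocks per byte, full-row burst   */
        double byte  = 2 * oh + 4;                      /* clocks per byte, read-merge-write */
        printf("overhead %2.0f: burst %.2f vs byte %5.1f clocks/byte  (%.0fx)\n",
               oh, burst, byte, byte / burst);
    }
    return 0;
}
```

With only 10 clocks of overhead the gap is already around 50x, and once the real per-access driver cost climbs past roughly 25 clocks it crosses the 100x Roy describes.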
My gut feeling is that it would be useful in non-burst cases, possibly including read-modify-write. Consider when PASM code may need to compute (taking more than 1 cycle) another column address to read/write; suspending the clock would be a lot cheaper than restarting the whole transaction.
I just looked at the link you posted.
It kills the concept of a simple, linear bitmap. Your idea is good for tile based game drivers, but inappropriate for fast bitmap graphics.
1080p raster refresh bandwidth:
1920x1080p - 8bpp @ 30Hz = 62.2 MB/sec
1920x1080p - 16bpp @ 30Hz = 124.4 MB/sec
1920x1080p - 32bpp @ 30Hz = 248.8 MB/sec
160 MHz * 0.89 efficiency * 2 bytes per word = 284.8 MB/sec
So at 24/32bpp there is essentially no time to update the bitmap; however, 16bpp will still be fine.
Please show us PASM code for writing an 8-bit pixel in SDRAM using bursts where it would be faster than writing it as a byte.
Plotting graphics (NOT tile oriented game graphics) is pixel oriented, and forcing non-linear bitmaps for complex caching is silly.
A fighter jet is 100x faster than a car... but is not practical for going to the store.
Here is a link to line drawing algorithms: http://en.wikipedia.org/wiki/Line_drawing_algorithm
Not a good fit to tiling/caching.