What is the fastest way to copy a big chunk of hub to another location in hub?
Probably RDFAST/WRFAST inside a REP. Because the hub ram is single-ported, you will always have to temporarily copy it to cog/lut memory.
Edit: single-ported ram isn't really the issue, as long as the copy isn't overlapped (which it wouldn't be, in this case). It's the RAM's organization: you can't address two different spots in the ram at the same time. However, since the ram is split into 16 banks, it would technically be possible to directly copy ram from one bank to another bank. Except that a cog has access to only one bank at a time.
I'm starting to wonder if I should lobby Chip for a special instruction to convert 24-bit color to 16-bit color... It might help a lot (or not, still thinking about it).
I don't think you can do RDFAST and WRFAST at the same time, can you?
Maybe have to stream HUB to LUT and then from LUT to HUB?
No, you can't do it at the same time, but you can do it sequentially:
Repeat as necessary:
1. RDFAST from hub to local buffer
2. WRFAST from local buffer to different hub location
3. Update read/write pointers
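As a rough model of the three steps above, here is a Python sketch that copies a block inside a bytearray standing in for hub RAM, going through a small local buffer. The name `hub_copy` and the `buf_bytes` parameter are made up for illustration, and the sketch says nothing about real RDFAST/WRFAST timing; it only shows the read, write, advance-pointers structure.

```python
def hub_copy(hub, src, dst, length, buf_bytes=64):
    """Copy `length` bytes inside `hub` from src to dst via a local buffer.

    Assumes the source and destination ranges do not overlap.
    `buf_bytes` is a hypothetical stand-in for the cog/LUT buffer size.
    """
    remaining = length
    while remaining > 0:
        n = min(buf_bytes, remaining)
        local = bytes(hub[src:src + n])   # 1. read hub into local buffer
        hub[dst:dst + n] = local          # 2. write buffer to new hub location
        src += n                          # 3. update read/write pointers
        dst += n
        remaining -= n
    return hub
```

A bigger buffer means fewer read/write switchovers per block, which is the point made later in the thread about using an interim array.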
Actually, this gives me an idea! Since RDFAST and WRFAST are using a streaming buffer, suppose you had an instruction like CPFAST which did the following:
1. Fill streaming buffer (as if preparing for RDFAST)
2. Write streaming buffer to different hub location (as if WRFAST had been called)
3. Repeat for whatever size was indicated with FBLOCK(?).
Now you have the fastest hub block copy you can get without implementing direct bank-to-bank copying.
Is that true ?
The streaming buffer has to wait for egg-beater alignment, which also varies with data-address lower bits.
I think if copying a larger block, you are better to use an interim memory array to reduce the effect of those align-phase delays.
Ahh, that's true. It would only be guaranteed to be faster if you were copying to the same bank, at which point there would be no delay. Otherwise, you will always have an additional 16 clock cycles per round, which would certainly add up.
Okay then. Just stick with sequential RDFAST/WRFAST in a tight loop.
Also, I suppose you don't really need to use RDFAST/WRFAST either. Wouldn't SETQ2 with RDLONG/WRLONG be just as fast?
Edit: never mind. Reading through Chip's document, the SETQ2 relies on the FIFO. And its behavior for WRLONG is different. It appears that there's no way to block copy from LUT to HUB.
If I did it right, should allow use on DE2, P123 or nano setup.
Nano is a pain because the Prop Plug gets in the way.
Not really sure if it's worth bothering with nano support.
Still, I think the nano might be able to show 2-color text...
Anybody know if P86 through P91 on DE2 and nano header connect to Prop pins?
Don't have enough pins to do I2C so I connected sda and scl to P90 and P91.
Is that a problem?
BTW: I put pads for uSD card, SQI flash chip and i2c eeprom on the bottom of the board, just in case... But, uSD and SQI flash share same control and data pins, so can only use one or the other...
Also, I forgot to say that I think I've decided on 16-bit color interface to LCD.
Also, I suppose you don't really need to use RDFAST/WRFAST either. Wouldn't SETQ2 with RDLONG/WRLONG be just as fast?
Edit: never mind. Reading through Chip's document, the SETQ2 relies on the FIFO. And its behavior for WRLONG is different. It appears that there's no way to block copy from LUT to HUB.
Actually, all day was spent getting SETQ2+WRLONG to perform a block move from lut to hub.
These block moves can work while RDFAST/WRFAST/hubexec (FIFO) are in operation.
I'm starting to wonder if I should lobby chip for a special instruction to convert 24-bit color to 16-bit color... It might help a lot (or not, still thinking about it).
Would it just convert 8:8:8 to 5:5:5, and swap words in D?
I don't think a special instruction for RGB transformation is needed, now that I think about it... That can be done slowly before showing an image.
I'm thinking that the 8bpp bitmap color palette can be converted to 16-bit color and then stored in LUT 0..255 before showing the image. Wouldn't want to slow down the pixel output doing the conversion in real time anyway...
What I guess I do need is a way to quickly read 8-bit pixel data from HUB, get its color from LUT and then put it on OUTA P0..15. Maybe things to do that already exist, have to see...
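To make the palette idea concrete, here is a small Python model: the 24-bit palette is converted to 16-bit once, up front, and each 8-bit pixel is then just a table lookup. The function names and the RGB565 bit layout (red in the top 5 bits) are assumptions for illustration, not anything taken from the P2 documentation.

```python
def build_lut(palette_rgb888):
    """Convert a 256-entry 24-bit palette to 16-bit entries once, before display.

    Assumes each entry is laid out as 0x00RRGGBB and the output is RGB565.
    """
    lut = []
    for rgb in palette_rgb888:
        r, g, b = (rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF
        lut.append(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3))
    return lut


def pixels_to_pins(pixels, lut):
    """Map 8bpp pixel bytes to the 16-bit values that would go out on P0..15."""
    return [lut[p] for p in pixels]
```

With this split, the per-pixel work is a single indexed read, so the slow conversion never sits in the pixel output path.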
Actually, a horizontal line on 4.3" lcd is typically only 480 pixels.
That can fit in cog ram easily and can be loaded during horizontal refresh.
So, just need to quickly read from LUT to OUTA based on bytes in COG...
I guess what I'd try is prepping LUT with all OUTA pins states for vsync, etc.
Then, copy lower 256 longs to upper 256 longs of LUT (so that 9th bit in RDLUT source doesn't matter).
Then, I think can just do:
What I guess I do need is a way to quickly read 8-bit pixel data from HUB, get its color from LUT and then put it on OUTA P0..15. Maybe things to do that already exist, have to see...
Certainly does exist already - it's the original purpose of the LUT and the Streamer. BTW, the actual "Streamer" is the engine that automatically paces data out to the pins. The Streamer is not the HubRAM FIFO engine ... but it does slave the FIFO for engaging HubRAM transfers.
Comments
Edit: never mind. Reading through Chip's document, the SETQ2 relies on the FIFO. And its behavior for WRLONG is different. It appears that there's no way to block copy from LUT to HUB.
"The lookup RAM must be read and written using RDLUT/WRLUT instructions."
Guess need to use COG RAM as a buffer.
Would it just convert 8:8:8 to 5:5:5, and swap words in D?
long = (0,R,G,B) -> word = (5 bits red, 6 bits green, 5 bits blue)
(r>>3) & %11111
(g>>2) & %111111
(b>>3) & %11111
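Spelled out as a runnable sketch in Python (the function name is made up, and putting red in bits 15..11 is the usual RGB565 convention rather than something stated in this thread), the shifts above combine like this:

```python
def rgb888_to_rgb565(rgb):
    """Pack a 0x00RRGGBB long into a 16-bit 5:6:5 word.

    Keeps the top 5 bits of red, top 6 of green and top 5 of blue.
    """
    r = (rgb >> 16) & 0xFF
    g = (rgb >> 8) & 0xFF
    b = rgb & 0xFF
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
```

For example, pure green `0x00FF00` becomes `0x07E0`, since only the six green bits in the middle of the word are set.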
I have some thoughts.
As you know, most cameras can directly produce formatted 16 bit RGB565 images... coming in 8 bits at a time. To record it... you need 15 pins. A P2 camera is going to need external memory... if you are using 16 pins for color and a few more for controls... you are eating a lot of pins.
I don't know if it is worth the effort. The price of VGA-enabled, small LCDs is dropping. Have you thought about using a shift register for the LCD?
Most of the time, I only need 8 bit gray scale. So, I like to use 16 bit YUV format coming out of the camera... as it is the default on most cheap cameras and I can use one of the two bytes as gray scale... (sometimes it is the first, sometimes it is the second byte.) I just ignore the other byte and do something else with the time I save.
I think I understand that you want 16 bit data coming out of the P2 and going to what amounts to a 24 bit input on the LCD? If you decide to go in that direction, I wish you would add a switch... so that a user could switch between 16 bit RGB and 8 bit gray... sending the same byte to each of the R,G,B inputs.
I'm thinking that the 8bpp bitmap color palette can be converted to 16-bit color and then stored in LUT 0..255 before showing the image. Wouldn't want to slow down the pixel output doing the conversion in real time anyway...
What I guess I do need is a way to quickly read 8-bit pixel data from HUB, get its color from LUT and then put it on OUTA P0..15. Maybe things to do that already exist, have to see...
A 480-pixel line can fit in cog RAM easily and can be loaded during horizontal refresh.
So, just need to quickly read from LUT to OUTA based on bytes in COG...
I guess what I'd try is prepping LUT with all OUTA pins states for vsync, etc.
Then, copy lower 256 longs to upper 256 longs of LUT (so that 9th bit in RDLUT source doesn't matter).
Then, I think can just do:
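REP (four times)
  RDLUT OUTA, FourPixels   'look up pin states for the low byte
  ANDN  OUTA, clkmask      'clear clock bit (toggles the clock if the LUT entries hold it high)
  ROR   FourPixels, #8     'rotate the next pixel into the low byte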
For the inner loop. Can this work?
If it does, then should be OK, even with 50 MHz clock.
If actual clock is 100 or 200 MHz, would be a lot easier...
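To check the logic of that loop (not its timing), here is a Python model. It assumes RDLUT uses the low 9 bits of the source as the address, as implied by the note about the 9th bit, and that the LUT entries carry the clock bit high; `emit_four_pixels` and its parameters are hypothetical names.

```python
MASK32 = 0xFFFFFFFF


def emit_four_pixels(four_pixels, lut512, clkmask):
    """Model the 4-pass loop and return every value written to OUTA."""
    outa_trace = []
    for _ in range(4):
        outa = lut512[four_pixels & 0x1FF]   # RDLUT: 9-bit address (assumed)
        outa_trace.append(outa)
        outa &= ~clkmask & MASK32            # ANDN: clear the clock bit
        outa_trace.append(outa)
        # ROR #8: rotate the next pixel byte into bits 7..0
        four_pixels = ((four_pixels >> 8) | (four_pixels << 24)) & MASK32
    return outa_trace
```

Because the address is 9 bits wide, bit 0 of the next pixel byte leaks into the index. Mirroring the lower 256 longs into the upper 256 is what makes the lookup come out the same either way.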
With P1, I had the video generator toggling the pixel clock during refresh and load next row of pixels then...
I think for P2, I'll have to use a second cog to toggle the clock during refresh.
Or, maybe the LCD will tolerate a period of inactivity...
Actually, guess I'll have cogs do every other line. Sorta like Chip did for high-res VGA on P1.
You'll be able to have a smart pin do the toggling, I'd think.