Looking at this: "To facilitate handshaking between cogs sharing lookup RAM, the SETRDL/SETWRL instructions can be used to set up lookup RAM read and write events."
Wondering how that is going to work...
I think those instructions work with the special registers at the end of RAM.
Are they changed to work with LUT too?
I guess I should wait until it's done to ask questions about it...
Same old names, but totally new meanings. The old hub RAM events are gone, replaced by LUT events.
I have a question: on the FPGA images, are the non-smart pins still there, or are they completely disabled? That is, do you only get 8 pins, all of them smart pins, period?
My understanding is the Smart-Pin cell clips onto the standard pin, so the fallback would be a standard pin cell, not no pin at all.
Besides, for testing it is best not to drop pins as well; keep what is removed simple.
Certainly Streamer testing is going to need many pins available.
I think I got the v9 docs finished, for now. I covered the new 1/2/4-bit streamer modes and smart pin modes today. I'm looking at the PIX instructions to document them, but I can't make sense of them at the moment. I thought I had made some minor change to the USB smart pin mode, but I don't see it. It was something to help Rayman.
How do you envision the right way to set the time between USB outgoing packets?
Looks like I need a spacing of 2 to 6 USB clocks of J between packets.
I haven't had time, yet, to make an actual spacer instruction. Each time you write out a state command like J (IDLE), that takes one bit clock to execute. For low-speed, you could do 2 to 6 of those commands. For full-speed, I need to make an instruction.
Actually, looking back, I see that I was doing it wrong by waiting for idle, which meant 7 or 8 bits of idle.
What I need to do is wait for EOP and then send a couple idles.
Then, I'll have the spacing I want.
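Not speaking for Chip's actual implementation, but here's a minimal C sketch of the gap logic as described: wait for EOP, then emit 2 to 6 bit times of J. The bit rates are the standard USB ones (1.5 Mb/s low speed, 12 Mb/s full speed); wait_for_eop and emit_j_state are hypothetical stand-ins for whatever the smart pin mode actually provides.

```c
#include <stdio.h>

/* Hypothetical sketch of the inter-packet gap described above:
   after EOP is seen, emit 2..6 bit times of J (idle) before the
   next packet. The helpers are stand-ins, not real smart pin calls. */

#define LOW_SPEED_BPS   1500000.0   /* 1.5 Mb/s */
#define FULL_SPEED_BPS 12000000.0   /* 12 Mb/s  */

static void wait_for_eop(void) { /* poll the smart pin for EOP here */ }
static void emit_j_state(void) { /* write one J (IDLE) state command */ }

static void inter_packet_gap(double bps, int j_bits)
{
    wait_for_eop();                   /* don't wait for full idle (7-8 bits) */
    for (int i = 0; i < j_bits; i++)  /* 2..6 J states, one bit time each */
        emit_j_state();
    printf("gap = %d bits = %.1f ns\n", j_bits, j_bits * 1e9 / bps);
}

int main(void)
{
    inter_packet_gap(LOW_SPEED_BPS, 2);   /* low speed: 2 bits ~ 1333 ns */
    inter_packet_gap(FULL_SPEED_BPS, 6);  /* full speed: 6 bits = 500 ns */
    return 0;
}
```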
Right now, the ports can be updated/read on every clock, but there is no way to get the system clock out on a pin.
So, you can output a signal of clk/2 and you can have the streamer transact on every clock (DDR) or every other clock (edge synchronous), or any lower frequency.
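To put rough numbers on that, a quick back-of-envelope in C, assuming an 80 MHz system clock purely for illustration:

```c
#include <stdio.h>

/* Worked numbers for the rates described above, assuming an 80 MHz
   system clock (illustrative only). Max pin clock out is clk/2; the
   streamer can transact every clock (DDR) or every other clock. */
int main(void)
{
    double sysclk = 80e6;
    printf("pin clock out (clk/2): %.0f MHz\n", sysclk / 2 / 1e6);
    printf("DDR: %.0f transfers/s per pin (every clock)\n", sysclk);
    printf("edge-synchronous: %.0f transfers/s per pin (every other clock)\n",
           sysclk / 2);
    return 0;
}
```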
This sounds close - can you give some code examples of how you start the Streamer, then the CLK out, to maintain clock phase and counts?
I think a 74AUP1G57 (XNOR+RC) can regenerate a 2x clock from a DDR signal, if a system-clock out proves impossible.
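For what it's worth, the doubler idea is easy to sanity-check in software: XORing a square wave with a delayed copy of itself pulses on every edge, which is twice the frequency (the XNOR variant is just the inverted output). A toy C model, with a one-step delay standing in for the RC:

```c
#include <stdio.h>

/* Quick model of the XNOR+RC doubler idea above: gating a square wave
   against an RC-delayed copy of itself produces a pulse on every edge,
   i.e. twice the frequency. One sample per time step; 'delayed' plays
   the role of the RC. */
int main(void)
{
    int delayed = 0;
    for (int t = 0; t < 16; t++) {
        int ddr_clk = (t / 4) & 1;        /* input square wave, period 8 */
        int doubled = ddr_clk ^ delayed;  /* pulses at both edges */
        printf("t=%2d in=%d out=%d\n", t, ddr_clk, doubled);
        delayed = ddr_clk;                /* one-step (RC-like) delay */
    }
    return 0;
}
```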
Yes, DDR makes a big difference here and reduces my claim.
If a clock enable (a pin that's high for a full cycle when the streamer has fresh data) were possible, that would still future-proof things somewhat, because we could externally AND it with a system clock. But we can do a lot with what we have, using DDR.
Thinking some more about Start/Phase issues, can the Streamer use the same START mechanism you have for multiple Smart-Pins?
That way, you could configure a smart pin as SysCLK/2, complete with preset phase (IIRC the P2 NCO now has phase?), then configure the Streamer and then tell both to GO! on the same SysCLK.
Chip selects can precede the CLK & Data, so only CLK/Data need the careful sync/phase.
If someone wanted CLK, !CLK, Data, I think that is possible with 2 smart pins?
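If the NCO phase preset works the usual way, the CLK/!CLK pair falls out naturally: two NCOs with the same frequency word, the second preloaded half a period ahead, give complementary outputs. A toy C model of that idea (illustrative only, not P2 register-level code):

```c
#include <stdio.h>
#include <stdint.h>

/* Toy NCO model for the CLK/!CLK idea above: two accumulators with the
   same frequency word, the second preset half a period on, yield
   complementary MSB outputs. A word of 0x80000000 toggles the MSB
   every clock, i.e. SysCLK/2. */
int main(void)
{
    uint32_t freq = 0x80000000u;           /* SysCLK/2 frequency word */
    uint32_t clk_acc = 0, nclk_acc = freq; /* second NCO preset half a period */
    for (int t = 0; t < 8; t++) {
        printf("t=%d  CLK=%u  !CLK=%u\n", t,
               (unsigned)(clk_acc >> 31), (unsigned)(nclk_acc >> 31));
        clk_acc  += freq;
        nclk_acc += freq;
    }
    return 0;
}
```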
You can feed the streamer an instruction to effectively stall some number of clocks, then feed it another to take care of business when the first instruction finishes. Then, you are free to orchestrate some smart pin activity that will coincide with the streamer business.
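Here's a toy C model of that sequencing, just to show the clock accounting; the command names are made up, not real P2 mnemonics. The point is that a stall of known length makes the follow-up command's start clock deterministic, so smart pin activity can be lined up against it.

```c
#include <stdio.h>

/* Toy model of the sequence described above: the streamer runs one
   command while holding one queued command, so issuing a known-length
   stall first makes the follow-up command start at a deterministic
   clock. All names here are illustrative. */

static long clock_now = 0;

static long issue(const char *name, long duration)
{
    long start = clock_now;   /* command begins when the previous one ends */
    clock_now += duration;
    printf("clk %6ld..%6ld  %s\n", start, clock_now - 1, name);
    return start;
}

int main(void)
{
    issue("stall (delay) command", 100);          /* burn exactly 100 clocks */
    long t = issue("real transfer command", 429792);
    printf("start the smart pin clock so it toggles from clk %ld\n", t);
    return 0;
}
```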
OK, sounds like it needs some one-off tuning, then it could work.
Code would issue a Delay_Streamer, then a Queue_Start_Streamer, then a Start_Clock.
One LCD spec I looked at, with the simplest single-use HyperRAM, would need 429792 clocks per frame.
The read command is the leading 48 bits / 6 bytes; after that, the HyperRAM simply runs for 429792-6 clocks.
I think you only need to repeat the read pass inside 64 ms to meet refresh.
Read code looks to be very compact here?
Also, during the read time, the COG is free to collect the next write blocks, ready for fastest delivery in the write window.
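A quick sanity check on those numbers in C (the 80 MHz figure is an assumed example clock, not from the thread):

```c
#include <stdio.h>

/* Sanity check of the numbers above: one frame is 429792 clocks, and
   the whole read pass must repeat within the 64 ms refresh window. */
int main(void)
{
    const double frame_clocks = 429792.0;
    const double refresh_s    = 0.064;            /* 64 ms */
    printf("min clock to refresh in time: %.2f MHz\n",
           frame_clocks / refresh_s / 1e6);       /* ~6.72 MHz */

    const double clk = 80e6;                      /* assumed example clock */
    double frame_s = frame_clocks / clk;
    printf("at 80 MHz: frame = %.2f ms (%.0f frames per 64 ms window)\n",
           frame_s * 1e3, refresh_s / frame_s);
    return 0;
}
```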
I'm sorry it's been taking me so long to get this update out.
The problem has been long compile times on the A9 and bad Fmax results.
I noticed the LUT write sharing was creating a critical path because I was doing a magnitude comparison on what was potentially the late-arriving LUT exec address. I got rid of the possibility of the LUT instruction fetch triggering a 'read LUT' event, so Fmax should go back up.
It's still taking over 1.5 hours to compile, though, which makes things very tedious.
At this point, I must break and meet with Treehouse, OnSemi, and pjv, and I'll be unable to work on this again until Friday. So, no update for a few more days. I'm sorry about this. I was really hoping I could get this done before this meeting.
Hey Chip, what kind of machine do you use to compile? I think Quartus would only benefit from a 1-core 200 GHz processor; parallel compilation seems to be a very hard nut.
Maybe you could have a farm of 8 or so machines with SSDs and loads of RAM and launch them in parallel.
Any chance of some documentation on the pixel instructions?
I don't think I'd miss the old hub RAM events...
At the very end of RAM are the debug interrupt hooks. Right below them used to be those read/write event longs - they are gone now.
Similarly with the REP instructions, Chip. It'd be good to add them to the list.
Nothing urgent, but they're not there at the moment.
Maybe the MSB of the address can be used to indicate a debug hook?
Fine like it is, just contemplating...
It's a fitful, herky-jerky process that probably wouldn't inspire any confidence, if witnessed.
So, maybe it's OK as is.
Sorry, I have not followed a lot of the discussion due to a lack of time.
Can the streamer modes (1/2/4/8/16/32 bit) to/from pins output a strobe signal for output, and be externally clocked for input?
If so, what is the maximum output clock rate that will work, and the maximum strobed input data rate?
Thanks,
Bill
Thanks for all your efforts, Chip.
If that doesn't fix Fmax, maybe disable LUT exec when LUT sharing is enabled?
Or shelve the whole LUT sharing thing. I'd rather have faster chips than LUT sharing.