P2 DVI/VGA driver

rogloh · 2020-01-27 13:33

It's an interesting idea AJL. Though the way it is implemented there is that we would have to do that extra hub burst read after every single memory request is serviced which is its main downside. I guess it could be used to insert new COG requestors on the fly which is handy. Maybe an atomic table switch operation is required so the whole table is always consistent before reading it.

Something else could possibly be used to trigger a new poll table update. I've been sort of wondering if the mailbox for the (currently spare) COG ID of the HyperRAM driver itself could be used to adjust parameters internal to the driver and used to add/remove COGs from the polling list. The issue there is that if all 8 COGs get polled instead of just 7 COGs it would increase the polling loop time to be just over that nice egg-beater timing window of 40 clocks in the REP loops, and could increase by 8 more clocks. I'd imagine in most cases the number of new COGs requesting access is not going to change around all that rapidly so we may not need to optimise the code update speed. In any case if you simply kill a requesting COG the HyperRAM driver wouldn't know about this and will still poll the COG that was killed until its table somehow gets changed. If it turns out to be a problem it might be simpler to just poll them all the time and live with the extra jitter, in the short term at least.

Also looking at how things currently work and how I fragment the longer non-video burst transfers I've realized I should be able to improve the overall bandwidth fairness at some point later by tracking timestamps and burst transfer sizes per COG and temporarily delaying access using token/leaky bucket approaches whenever they exceed their configure rate/burst limit. Basically this is some simple traffic shaping. This option may be useful in some cases where you don't want one COG doing massive transfers all the time and killing the performance for all other COGs transferring smaller amounts. It could possibly also let you set weights per COG where they yield some fraction of their request opportunities to other COGs when they try to burst too much.

I'm almost done coding up a major unrolled prototype and hope I haven't broken too much before I start testing it out. I just need to finish the code that updates all the polling loops based on a set of active cogs passed in at init time and an optional priority COG which can be overridden later. I have also been able to eliminate several JMPs in the code path which is good.

Once it is functional again I hope to get this first version out soon with the video driver and we can all optimise and improve its operation over time with various ideas.

I also wonder if a variant of this driver including caching capabilities could be developed later too allowing some type of XMM execution etc, or if that sort of function is best done outside this COG by another requestor. We still have that LUTRAM which may be able to hold tags etc. Right now I'm only using the first 8 LUT registers for maintaining the transfer count per COG with nothing else using the LUTRAM yet, but maybe more code can go there if all this unrolled stuff bloats too much. Though given the transfer performance of the HyperRAM and how the results get streamed directly into hub, maybe caching may not make any sense anyway. I also may wish to have the HyperFLASH transfer code in there too for the Parallax module, as the memory access code can be different, certainly for writes it will be. That could need the space from the LUTRAM. We'll see...

@rogloh, Your poll3 seems to have an extra rep in there (rep #9, #0).

Thanks AJL, a good catch - cut/paste error and now I fixed it. It would have got me.

Update: Damn all this poller loop unrolling, COG code is now up to 468 longs and it's not yet complete! I think I'll need to use the LUT RAM for code soon or go back to what I had before with ALTI.

cgracey · 2020-01-27 14:56

Rogloh, if all that unrolling only saves, say, 24 clocks per loop, then maybe it's not worth doing. I was thinking it would result in a double-digit percentage of speed increase.

jmg · 2020-01-27 20:49

rogloh wrote: »

Right now looking at a scope the total read overhead per burst seems to be about 0.78us in the overall 3.75us time with !CS low when running the P2 at 200MHz (100MHz HyperRAM) and I'd like to improve it a little more if I can...

I notice this today, ISSI mention OctalRAM to 133MHz
http://www.issi.com/WW/pdf/OctalRAM-Brochure.pdf

Densities: 256Mb and 128Mb
Availability: Sampling Now
Applications: Automotive, IoT, Industrial/Medical DRAM technology based solution with Hidden Refresh and OPI(Octal Peripheral Interface) Protocol
Very Low Signal Count: 11(12) pins for Functions (CS#, SCLK..) and 8 IOs. (Optional ERR# for 128Mb)
Up to 200MHz DDR Operation: 400MB/s @ 1.70~1.95V &
Up to 133MHz DDR Operation: 266MB/s @ 2.70~3.6V,
Variable Latency or Fixed Latency, Burst Read/Write Operation & features Read Data Training (16-bit Pattern for Training Purpose).
Automotive Temperature: Up to A2= -40°C to 105°C for 256Mb & Up to A3 = -40°C to 125°C for 128Mb
Optional On Chip ECC for 128Mb: 1-bit Correction, 2-bit detection
Small Foot Print: 6mm x 8mm 24-ball TFBGA (5x5 ball array)

Not sure what Read Data Training (16-bit Pattern for Training Purpose) means, but it might be useful on P2, if that emits a known test 'ROM' like stream ?

rogloh · 2020-01-27 22:38

cgracey wrote: »

Rogloh, if all that unrolling only saves, say, 24 clocks per loop, then maybe it's not worth doing. I was thinking it would result in a double-digit percentage of speed increase.

Yeah it may not be as good as hoped due to the complexity of updates and excess cog register usage making it a PITA to code and test, but I think I have found a possible compromise solution I might now try instead that should get thing going sooner...

It's another variant to what AJL suggested with only one copy of the polling loop but doesn't need the hub and is therefore probably somewhat faster to update. It also still keeps the polling loop tight but uses a cyclic copy of the poll sequence after a successful transfer by the round robin COGs. Plus it uses skipf to only do the mininum arount of updating based on the number of active RR COGs. We are basically trading the original ALTI instruction overhead we had before in order to do the update (and only when needed up to the number of RR COGs). Stealing from Peter to pay Paul. But this now lets us reduce the polling loops down to only poll the COGs needed as we can now have an arbitrary sequence.

Here's the general idea...

cyclepoll       skipf   pattern     ' e.g. pattern would be 111100_0 for 3 active RR cogs
                mov     temp, rr1             
                mov     rr1, rr2
                mov     rr2, rr3
                mov     rr3, rr4
                mov     rr4, rr5
                mov     rr5, rr6
                mov     rr6, rr7
patch           mov     rr1-0, temp             'D is patched with correct last COG in list
                rep     pollcount, #0           'pollcount = 2 + number of COGs to poll
                setq    #16-1                   'setup for reading 16 longs
                rdlong  req0, mbox              'read all mailbox requests/data from hub
prioritytest    tjs     req0-0, priority_jmp    'priority COG checked first
rr1             tjs     req1, cog1_handler      'then cog check order 1,2,3,4,5,6,7 etc
rr2             tjs     req2, cog2_handler
rr3             tjs     req3, cog3_handler
rr4             tjs     req4, cog4_handler
rr5             tjs     req5, cog5_handler
rr6             tjs     req6, cog6_handler
rr7             tjs     req7, cog7_handler

                ' at end of RR cog memory transfers, it jumps back here
                jmp     #cyclepoll

If the number of active RR COGs has to change dynamically you would adjust the skip pattern as well as patch the D field of the instruction at "patch" with the last COG in the list and setup the rr1 through rr'n' instruction block stopping once you have coded up all active RR COG handlers.

Another similar variant can be coded if all COGs are RR with no priority video COGs too. I'll hope to try this idea today.

rogloh · 2020-01-27 22:48

jmg wrote: »

I notice this today, ISSI mention OctalRAM to 133MHz
http://www.issi.com/WW/pdf/OctalRAM-Brochure.pdf

Up to 133MHz DDR Operation: 266MB/s @ 2.70~3.6V,

This will be very nice for 252MHz P2 operation with DVI. With any luck it is a drop in replacement of the device on the Parallax board...

Electrodude · 2020-01-27 23:08

@cgracey I'm under the impression that the {RD,WR}{WORD,BYTE} and WMLONG instructions use atomic read-modify-write operations in the HUB. In future versions of the silicon, would it be practical to add atomic swap and compare-and-swap instructions, like x86's XCHG and CMPXCHG instructions? You'd only need one D,S opcode for both operations - use SETQ to specify the comparison value and make it a compare-and-swap, and don't use SETQ to make it an unconditional swap.

In this case, having compare-and-swap would allow the HyperRAM server cog to only have to check a single mailbox - client cogs would make their request via a compare-and-swap, and if the compare failed, the client cog would know that its command wasn't accepted yet and that it should keep trying. The high-priority video cog would use an unconditional swap to replace any pending command with its own high-priority command, and then it would use further swaps to put any command it stole into a lower-priority mailbox; no other cog would ever write to the lower-priority mailbox, to ensure that the video cog is always able to place a command there. The server cog would use a compare-and-swap to atomically read the command and clear the mailbox, so that no commands are dropped if the video cog overrides a command within a hub cycle of the server cog reading a command and then clearing the mailbox. I'm writing PC software in C that uses these methods through the liburcu library, and it works very well - many threads can all simultaneously send messages to each other without any conventional locks.

Also, locks wouldn't be necessary with this mechanism: every long of hubram could serve as a lock. However, I suspect locks would still be simpler to use.

cgracey · 2020-01-27 23:19

Electrodude, WRxxxx/WMLONG have byte-level write granularity. There is no read-modify-write going on. WMLONG just withholds the byte-write signal on $00 bytes. Otherwise, yeah, all kinds of things would be possible.

Electrodude · 2020-01-28 05:26

OK, thanks. I'm really having fun with these instructions on x86 - multithreaded code is so much less painful with them.

cgracey · 2020-01-28 05:38

> @Electrodude said:
> OK, thanks. I'm really having fun with these instructions on x86 - multithreaded code is so much less painful with them.

On the Propeller chips, we do have atomic byte/word/long reads and writes, but the granularity doesn't go below a byte. By using SETQ+RDLONG/WRLONG, you could have effective granularity of many longs, since all cogs read/write the next long on each clock.

Electrodude · 2020-01-28 14:15

But there no practical way in a future silicon to allow atomically writing a new value to a long and simultaneously returning the old value that was replaced?

It's not a big deal, though. I haven't found a potential use for it yet on the P2 except as an optimization for busy mailboxes. It's proved very useful on a PC, where you have an OS that can interrupt processes at any time, but the P2 has none of those problems, so it's not necessary, and SETQ+RDLONG/WRLONG suffices.

evanh · 2020-01-28 20:32

The SETQ+RD/WRLONG burst read/write instruction has a caveat I believe - the FIFO has priority. If the FIFO is operating then it can pause the burst instruction in mid operation.

rogloh · 2020-01-28 21:42

After the memory driver rewrite I found my changes broke my code as it was a little too much to get totally perfect in one go which happens from time to time. Normally I like to work incrementally and test smaller changes but sometimes you have to make more extensive changes to remain self-consistent and this then opens things up to new bugs etc.

I've fixed one nasty little thing that crept in with some of my changes, but still have something else before it is 100% again.

Here was something subtle I found that can bite you...this code didn't work:

                    getnib  request, addrhi, #7     'get request
                    add     request, #service-8     'compute service jump address
                    tjnf    addrhi, request         'jump to service if not configuring

but this does...you'd think they would jump to the same location, but they apparently don't.

                    getnib  request, addrhi, #7     'get request
                    alts    request, #service-8     'compute service jump address
                    tjnf    addrhi, request         'jump to service if not configuring

EDIT: Actually it's not that subtle, it was just my own late night tiredness that confused me. Looks like I was jumping to my jump table, instead of reading it and jumping to the address in the jump table.

evanh · 2020-01-28 23:08

Yep, done that one myself too many times. #service is head of a jump-table (containing only addresses, no instructions), and request is the index into the table.

The problem with just adding the two together is with the way register-direct branching works. It produces a branch into the table rather than where the table is pointing. If the table was a collection of fixed sized instructions, eg: long JMPs, then it would've worked. But that'd also be a double branch in execution time of course.

rogloh · 2020-01-29 04:20

I think I found a very flexible HyperRAM memory request polling structure that offers multiple capabilities and is fairly simple to generate and update during operation and won't need a lot of COGRAM with unrolling etc. It uses SKIPF again now but doesn't suffer from the problem I had encountered earlier because the way the skipping works now means that the skipped instructions won't take effect in any time critical cog handlers, only in the optional (slower) management path, which can easily be corrected for with some extra nops.

It supports multiple priority levels and now does the round robin handling in just 3 instructions plus it gives us another management channel into the HyperRAM COG using either ATN or interrupts if required, which could be good for dynamic control such as the addition of new COGs to the active RR polling list, or servicing the HyperFlash INT# signal etc. Having the potential to provide multiple priority levels could be rather good for some applications where you have multiple real-time requirements to try to fulfil. Eg. having say 3 effective priority levels you could do this...

1) Video COG - highest priority
2) Audio COG - this COG does short periodic accesses for randomly accessing audio wavetable samples in HyperRAM etc and is still somewhat real time so it doesn't want to be starved out by other COGs due to the changing round robin poll order. As soon as the system is idle without a video request set it gets its chance. I know local buffering can help here, but I'm thinking of situations where you want to keep audio delay to a minimum, or samples are more randomly accessed where the buffering may be of limited use.
3) all other COGs - can be round robin polled and allocated the remaining bandwidth.

1) Video1 COG - highest priority
2) Video2 COG (in some low resolution cases if there is enough bandwidth and its burst is small this may still work and not disrupt the Video1 COG)
3) Other COGs - round robin shared

etc.

Here's the basic idea:

cyclepoll                   incmod  n, rrcount             ' rrcount is number of RR COGs - 1
                            bmask   pattern, n
                            shl     pattern, #1

                            rep     pollcount, #0         ' pollcount = 4+total cog poll count
                            setq    #16-1
                            rdlong  req0, mbox
pri0                        tjs     req0, priority0_jmp    'priority0 COG checked first
pri1                        tjs     req1, priority1_jmp    'priority1 COG next
'...etc  up to the number  of priority levels - the above code order then mostly remains static 
                            skipf   pattern              'this skip pattern effectively cycles the RR order
                            jatn    #control                ' make use of this for management (or use jint, jevt etc)
rrlist                      tjs     req2, cog2_handler      '2 copies of RR COG loop, uses skip pattern 
                            tjs     req3, cog3_handler
                            tjs     req4, cog4_handler
                            tjs     req5, cog5_handler
                            tjs     req6, cog6_handler
                            tjs     req7, cog7_handler
                            tjs     req2, cog2_handler
                            tjs     req3, cog3_handler
                            tjs     req4, cog4_handler
                            tjs     req5, cog5_handler
                            tjs     req6, cog6_handler
                            tjs     req7, cog7_handler ' last one is redundant and can be omitted
                            
control     
            nop ' as many nop's as maximum skip pattern bits can be set or just always using 7 is safe
            nop
            nop 
            nop
            nop
            nop
            nop
            'do control action here, eg. modify RR / priority list or get statistics or change parameters etc
        
...
            'all service handlers jump back to cyclepoll

When it is spawned the memory driver COG would be passed a list of client COGs to exclude and a priority list of COGs in the initial order desired. The exclude list would contain COG IDs already known to not ever need access to HyperRAM such as any USB, I2C or PS/2 driver COGs etc, and this can help optimise and speed up the polling by eliminating excess instructions. Once operational a new COG can still come in and override the rest as the new highest priority COG just as the video COG already does today and the priority polling instruction sequence is then adjusted.

This approach saves up to 6 instructions (12 clocks) during service processing while only introducing 2 instructions more polling latency for providing the extra flexibility and simplicity compared to my previous post. I think it is probably worth it in most cases. In many cases the polling loop time is quantized by the egg-beater hub window interval anyway and for smaller numbers of COGs polled the extra two instruction overhead may not have any affect at all as it would be contained within the slack time waiting for the hub window.

If the RR COG count is zero, it can be customized out of the sequence altogether when the polling loop code is constructed, and the pollcount adjusted to suit. The "JATN" can follow the "TJS" instructions directly if it is used in the sample code above.

The special case of a single COG ever needing to access HyperRAM in a P2 system (e.g. from your "main" COG or a MicroPython COG etc), is a degenerate case probably with two optimal solutions -
1) if there is room in the client COG just build the HyperRAM access code into the COG that is using it, to eliminate the hub transfer request overhead and allow it to be used directly and exclusively, OR
2) use an absolute mimimal polling structure dedicated to servicing only one mailbox like this...

rep #3,#0
setq #2-1
rdlong req, mbox
tjs req, #service  ' this tight poll loop timing should converge to normally taking 16 clocks
jatn #control  ' optional if you still want a management / interrupt path
'service

'control etc

AJL · 2020-01-29 05:14

At control simply do a SKIPF #0

This should cancel the skip pattern and revert to normal operation; you minimise code space used, and you eliminate the risk of unintentionally eliminating one of the nops while editing.

rogloh · 2020-01-29 05:16

The problem is the first 0-6 instructions of control can be skipped. ie. the SKIPF #0 is itself skipped, LOL. NOPs solve it and are not that big a deal.

evanh · 2020-01-29 06:43

uh-oh, one rule is the last instruction of a REP block cannot reliably be a branch.

EDIT: It does something weird on branch. Oddly, this detail is not mentioned in the docs. I thought it was.

EDIT2: I documented it in the tricks'n'traps - https://forums.parallax.com/discussion/comment/1459273/#Comment_1459273

EDIT3: Actually, it could be done by building in compensation for the block length subtraction that occurs.

rogloh · 2020-01-29 06:53

Hmm, not good if true. Can we put the setq there?

Edit: actually it's important to keep them combined, so could we do this..

setq #16-1
rdlong req0, mbox
rep pollcount, #0
tjs ... optional for priority cogs
skipf pattern
...
tjs .. for rr cogs
tjs ..
setq #16-1
rdlong req0, mbox

Where the rep loop includes enough instructions to also reload the mailboxes at the end

evanh · 2020-01-29 07:22

I've realised that if you really want to, you can have a branch last by including a compensating offset into that branch address.

I think it would mean adding the REP block length, in longwords, to the #branch. Or maybe the block length minus one, EDIT: Oops, that's the SETQ D value.
ie: The D value of the REP instruction itself. Needs tested ... yep, that's it, add the block length, which is the D value.

Here's the basic test code I used. Like this, misc1 accumulates to 8. If I delete a NOP ahead of the TJZ then misc1 accumulates to the correct value of 5.

ORG
_main
		mov	bcdlen, #8
		loc	pa, #@titlemsg
		call	#puts

		rep	#3, #0
		nop
		nop
		tjz	misc1, #testj
		nop
		nop
		nop


		loc	pa, #@fallmsg
		call	#puts

		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
testj
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1

		mov	pa, misc1
		call	#itod
		call	#putnl
		jmp	#$


ORGH
titlemsg	byte	13,10,"REP branch testing",13,10,0
fallmsg		byte	"Fell through!",13,10,0

rogloh · 2020-01-29 08:56

Yeah that's not a good outcome, glad you mentioned it before I went down that path @evanh. I think I'll try the rdlong burst at the end of the REP loop. It's not a branch so should be okay (I hope).

TonyB_ · 2020-01-29 10:23

evanh wrote: »

uh-oh, one rule is the last instruction of a REP block cannot reliably be a branch.

But does this rule only apply to relative jumps using instructions described in the spreadsheet as jump to S**?

evanh · 2020-01-29 11:17

Huh, that's an unexpected result. Good question there Tony. An absolute register-direct behaves correctly! The following misc1 accumulates to 5.

		loc	pa, #testj

		rep	#3, #0
		nop
		nop
		tjz	misc1, pa
		nop
		nop
		nop

		loc	pa, #@fallmsg
		call	#puts

		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
testj
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1
		add	misc1, #1

evanh · 2020-01-29 11:29

And replacing the TJZ with a JMP PA gives a correct 5, and JMP #\testj also gives correct 5, but JMP #testj is wrong 8 again.

So, it only affects relative branching.

AJL · 2020-01-29 13:27

rogloh wrote: »

The problem is the first 0-6 instructions of control can be skipped. ie. the SKIPF #0 is itself skipped, LOL. NOPs solve it and are not that big a deal.

If you have at least one rr COG then the SKIPF #0 at control works every time.
If you have no rr COGs then you patch the SKIPF to be 'jatn #control' and the first instruction of control is never skipped and the SKIPF #0 does no harm.
Having two 'jatn's in a row also does no harm either as only one or none of the the jumps will be taken, depending on the time of arrival of the ATN signal in the loop, meaning you don't need to special-case pollcount.

rogloh · 2020-01-29 21:38

evanh wrote: »

And replacing the TJZ with a JMP PA gives a correct 5, and JMP #\testj also gives correct 5, but JMP #testj is wrong 8 again.

So, it only affects relative branching.

Good. I can avoid the problem and simplify the loop creation to be what I had intended originally, as I am not using that a relative branch case.

AJL wrote: »

If you have at least one rr COG then the SKIPF #0 at control works every time.

I don't know about this. I think the way my skip sequence currently works is that the very first instruction after the JATN can get skipped. In fact the skipped instructions all have to start and end before any real TJS begin. If I found another useful (to the polling loop) non-jumping instruction to squeeze in after the JATN your way could work, but I don't really need any more work done in the loop there. In fact I sort of only put the JATN there to fill an otherwise unusable spot (but it could become quite useful).

If you have no rr COGs then you patch the SKIPF to be 'jatn #control' and the first instruction of control is never skipped and the SKIPF #0 does no harm.

Yes, and I've already made it work that way during code creation. The no RR COGs case gets rid of all that SKIPF stuff, but I can still generate a single JATN there if required.

evanh · 2020-04-29 07:11

Roger,
Just compiled your code and Fastspin is flagging a warning about this:

                            subr    c, $1ff

It's from line 1706 of p2videodrv.spin2. I haven't attempted to understand its purpose but I doubt you intended to subtract from INB register. I've taken the liberty to make it #$1ff in my copy.

rogloh · 2020-04-29 07:30

Yes that is a known bug we encountered in the first beta, I have fixed this in my new as yet unreleased codebase. Your fix is correct.

rogloh · 2020-05-14 08:41

Here is a very cheap way to get HDMI output from the P2 at high resolutions via VGA. I found it gave good signal quality into my Dell 2405 monitor. It doesn't seem to do any real processing or timing conversion because the signal output timing/resolution seems to be preserved from the input. It includes audio input encoding too but I haven't tried that because this monitor doesn't have audio capabilities to test it. In my quick testing of it I was able to get 1080p60 from the P2 to my monitor via DVI this way as well as 1024x768, 1280x1024, 720p, 480p etc. It seems to work okay with my driver and its a handy thing to have, plus they are cheap! ~$10 USD from a local dealer nearby, probably even cheaper online. Its small plastic case only felt lukewarm after an hour or so of running, hopefully it should last for a while.

One thing I noticed is that the black level appears to be very dark grey compared to what I had before with direct VGA. This might be a mapping issue from VGA into HDMI/DVI colour encoding (not sure). My monitor has a "Video Mode" option which normally might do something there but it doesn't let me select it for the DVI input for some reason. An HDTV may behave differently. When I can visit elsewhere I will have to try it on some other HDTVs.

This was the device I used:

https://www.simplecom.com.au/simplecom-cm201-full-hd-1080p-vga-to-hdmi-converter-with-audio.html

Tubular · 2020-05-14 08:46

Super neat.

So for output of 480p/720p/1080p is it just auto selecting based on the input resolution? Or is there some kind of switch that lets
you dictate what output res you want?

rogloh · 2020-05-14 10:33

No auto-selection, it just seems to generate the output at the same resolution as coming in from the input side - ie. no conversion. It must somehow generate the HDMI clock based on the source input timing with a PLL. So despite the documentation only mentioning 480p, 720p and 1080p specifically, it appears to do other resolutions as well. This is according to the signal listed on screen by the Dell monitor over the DVI input. Perhaps an HDTV will report it differently, TBD.

P2 DVI/VGA driver

Comments