All PASM2 gurus - help optimizing a text driver over DVI?

rogloh · 2019-12-06 23:54

I spent time looking at this yesterday, and overnight I found a way to share an XOR instruction for the PAL colour flip. So with dynamic CY/CI/CQ adjustment and the PAL fixes we still have 3 instructions left.

I was able to retain the use of C odd/even field state in my test code above to decide when to modify the call that flips the CQ colour space parameter for PAL. C is already getting set for odd fields 1, 3 from my field counter (which I needed regardless), but I want to do this CQ XOR flip on fields 2,3.

So I patch the call that sends the vertical back porch lines so that it either enters at blank (no XOR) or one instruction earlier at blank_pal, where I now do the XOR flip. In the code snippet below the call therefore goes to [blank, blank_pal, blank_pal, blank] for fields [1,2,3,4] respectively, so the XOR happens on fields 2 and 3. It is achieved by modifying the calling code based on C with another XOR instruction that decides whether blank or blank_pal will be called.

Unfortunately, this optimization is sensitive to the distance between patchvbp and blank_pal being an even number of longs. However, this can be remedied by padding with other data items whenever this distance changes. I've coded it in a way where it should only affect PAL, not all code that shares this common blanking call, which makes it better for my testing, as this padding doesn't have to be redone EVERY time I make modifications even when I am not running PAL.

Seems to work on the screen when I tried it. Saving instructions is a good challenge with this code, and is getting harder each time.

                            testb   fieldcount, #0 wc       'field interlace state
            if_c            xor     patchvbp, #1            'modify code on fields 1, 3
...
patchvbp                    callpa  #V_BP-0, #blank-0       'send vertical back porch lines
...

blank_pal                   xor     cq, palflipcq           'NEEDS TO BE EVEN DIST. FROM patchvbp
blank                       call    #hsync                  'do hsync at the start of the line
                            xcont   m_vi, hsync0            'generate blank line pixels
            _ret_           djnz    pa, #blank              'repeat to generate multiple lines
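
As a rough way to check the sequence described above, here is a small Python model (not PASM2, and not part of the driver). It assumes the XOR patch is applied after the back porch call within each field, and that the call's branch offset can be treated as a plain integer whose bit 0 gets flipped. Both are assumptions made for illustration, and `vbp_call_targets` and `patched_offset` are hypothetical names.

```python
def vbp_call_targets(num_fields=4):
    """Return which routine the patched call enters on each field (1-based)."""
    target = "blank"                      # assumed patch state before field 1
    targets = []
    for field in range(1, num_fields + 1):
        targets.append(target)            # the call uses the current patch state
        if field & 1:                     # C is set on odd fields (1, 3)
            # XOR with #1 toggles between the two entry points
            target = "blank_pal" if target == "blank" else "blank"
    return targets


def patched_offset(offset_to_blank):
    """Flip bit 0 of an integer branch offset, as the XOR with #1 would."""
    return offset_to_blank ^ 1


print(vbp_call_targets())   # ['blank', 'blank_pal', 'blank_pal', 'blank']
print(patched_offset(9))    # 8: one long earlier, i.e. blank_pal
print(patched_offset(8))    # 9: one long later, which would miss blank_pal
```

The last line shows why parity matters in this model: flipping bit 0 only steps back by one long when the unpatched value has bit 0 set, which is why padding is needed whenever the distance changes.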

cgracey · 2019-12-07 01:19

It would be neat, in a future P3, if we could look ahead to the next instruction and execute upcoming branches in parallel with current logic/math operations.

If we could make clean divisions between functional groups of instructions, they could be mixed and matched maybe 2..4 deep, without any fancy speculative-execution circuits.

I wish foundries had more than dual-port RAMs. The more read and write ports, the more can be done in parallel. As it is now, if you want a 3-read/1-write port RAM, you need to instantiate 3 dual-port RAMs and dedicate a port on each to the common 1-write cause.
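
A toy Python model of the replication scheme described above may help picture it. It is only a sketch: the class name `ReplicatedRam3R1W` and the depth of 512 are made up for illustration, and a real design would of course be hardware, not software.

```python
class ReplicatedRam3R1W:
    """Toy 3-read/1-write memory built from three identical banks."""

    def __init__(self, depth):
        # Each bank stands in for one dual-port RAM: one port reads, one writes.
        self.banks = [[0] * depth for _ in range(3)]

    def write(self, addr, value):
        # The single write is broadcast to every bank so they stay identical.
        for bank in self.banks:
            bank[addr] = value

    def read3(self, addr_a, addr_b, addr_c):
        # Each bank serves one independent read address.
        return (self.banks[0][addr_a],
                self.banks[1][addr_b],
                self.banks[2][addr_c])


ram = ReplicatedRam3R1W(depth=512)
ram.write(5, 111)
ram.write(6, 222)
ram.write(7, 333)
print(ram.read3(5, 6, 7))   # (111, 222, 333)
```

The cost is visible in the constructor: three full copies of the storage are needed just to get two extra read ports.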

rogloh · 2019-12-07 01:27

Would that need a form of branch prediction+unwinding if the incorrect branch gets taken? Or would early branching only be done for simple unconditional branches?

rogloh · 2019-12-07 01:29

You might also consider combining branching with ALU operations in a VLIW type of system (wider instruction).

evanh · 2019-12-07 01:45

I'd put priority on having more cogs.

evanh · 2019-12-07 02:01

cgracey wrote: »

I wish foundries had more than dual-port RAMs. The more read and write ports, the more can be done in parallel. As it is now, if you want a 3-read/1-write port RAM, you need to instantiate 3 dual-port RAMs and dedicate a port on each to the common 1-write cause.

The reason why is probably because no one else, these days at least, uses SRAM for the general register set.

rogloh · 2019-12-07 02:05

I'd like more internal hub RAM and native 10b video output, so that hi-res DVI/HDMI resolutions higher than VGA are then enabled. But others could want more COGs. P1 and, to a much lesser extent, P2 (with video) have seemed to me to be mainly memory limited, not processor limited, though they are microcontrollers after all and we are pushing for a lot here. In my own applications, which have typically been video focussed to date, I've usually run out of memory before COGs. Given the raw speed of the P2 cogs, I think this will continue. Thankfully the extra pins and HyperRAM can now help increase the memory significantly.

cgracey · 2019-12-07 02:24

rogloh wrote: »

I'd like more internal hub RAM and native 10b video output, so that hi-res DVI/HDMI resolutions higher than VGA are then enabled. But others could want more COGs. P1 and, to a much lesser extent, P2 (with video) have seemed to me to be mainly memory limited, not processor limited, though they are microcontrollers after all and we are pushing for a lot here. In my own applications, which have typically been video focussed to date, I've usually run out of memory before COGs. Given the raw speed of the P2 cogs, I think this will continue. Thankfully the extra pins and HyperRAM can now help increase the memory significantly.

Has anyone demonstrated video streaming from a HyperRAM, yet?

cgracey · 2019-12-07 02:25

evanh wrote: »

cgracey wrote: »

I wish foundries had more than dual-port RAMs. The more read and write ports, the more can be done in parallel. As it is now, if you want a 3-read/1-write port RAM, you need to instantiate 3 dual-port RAMs and dedicate a port on each to the common 1-write cause.

The reason why is probably because no one else, these days at least, uses SRAM for the general register set.

Are you saying they just use flops and muxes? That only works if your register set is small, like maybe 16 registers, in my experience.

rogloh · 2019-12-07 02:26

I think yes for static image tests, both Rayman and ozpropdev have had things working from HyperRAM. My driver is all ready for it, but needs a front end arbiter COG to feed it data. I'd really like to take some HyperRAM code and hack something up myself soon but have been tied up with the PAL fixes and other things.

evanh · 2019-12-07 02:40

cgracey wrote: »

Are you saying they just use flops and muxes? That only works if your register set is small, like maybe 16 registers, in my experience.

I'd assume so, given every popular CPU design out there uses a small set. That said, it would seem SIMD like extensions do use much more than the base general registers.

I see that quad-port RAMs have existed historically - https://www.idt.com/document/apn/253-introduction-multi-port-memories-0
And there's recognition of an increasing need for them - https://www.researchgate.net/publication/260563060_Modular_Multi-ported_SRAM-based_Memories

evanh · 2019-12-07 02:58

Ah, and looking at the IDT schematic, any read-only ports would only need a single "bit line". So what you want would be seven transistors and nine lines instead of the ten transistors and twelve lines they've got there.

EDIT: Modern primary caches will be a good candidate for multi-ported SRAMs. They are now a tiny portion of the total CPU silicon and are a vital repository for super-scalar operations. I'm guessing certain fabs have primitives just for this.

potatohead · 2019-12-07 05:41

cgracey wrote: »

evanh wrote: »

cgracey wrote: »

I wish foundries had more than dual-port RAMs. The more read and write ports, the more can be done in parallel. As it is now, if you want a 3-read/1-write port RAM, you need to instantiate 3 dual-port RAMs and dedicate a port on each to the common 1-write cause.

The reason why is probably because no one else, these days at least, uses SRAM for the general register set.

Are you saying they just use flops and muxes? That only works if your register set is small, like maybe 16 registers, in my experience.

And other constructs. I was on a flight recently with someone from a big chip company we all know. Her title:

Register Designer / Architect

Chip, this way of having the CPU be a big pile of registers is unique out there right now.

We got to chatting and frankly, how the P2 works was rather difficult to communicate. It's very, very different from how the major players are doing things.

rogloh · 2019-12-07 06:26

Hey @potatohead if you get a chance, try out the latest PAL fix and see if your PVM now recognizes it as valid PAL and shows the colour bars + olive burst colour at the bottom of screen. Binary to use is in that last posted zip file.

rogloh · 2019-12-07 07:11

Continued from the fastspin thread where I first raised this issue:

rogloh wrote: »

By adding debug printing of the addresses of my key data structures, I've found that when padding with longs at the end of the fastspin image, only the VAR structures in memory that I pass into my driver get moved around, while the runtime buffers used remain at the same locations (as I expected). So it appears something is being messed up upon startup of my driver. I imagine the wrong parameters must be being read in. I need to dump out to memory what I read in from my driver at startup to see why this is happening. It would appear to be more SPIN oriented right now, unless I do something weird with the addresses I read from at the start of my code.

Update: Looking like it may possibly be some type of bug in my code...this code doesn't loop correctly when reading its parameters from certain egg beater offset addresses for some reason and appears to stay in some inner loop sending sync pulses. Very weird. Might be a setq+rdlong type of problem somewhere, still homing in, or perhaps it gets a 0 somewhere and this effectively loops infinitely in a rep/djnz loop.

Update2: will continue any further discussion of this issue in my main driver thread so as not to pollute this thread any more unless it comes back to fastspin which at this point looks like it may not...I have found an infinite loop candidate in one case but don't quite yet know how it comes to be from parameter address offsets.

I just located the obscure issue I encountered with padding of the binary image size causing timing failures for PAL/NTSC. Turns out that it was a nasty piece of incomplete/untested code I'd somehow left lying around in this image and it conspired with the addresses of parameters being passed in to create an infinite loop! I was originally trying to allow cases where you could have non-zero front porches in PAL/NTSC to allow more customization of the display and centering the image in the screen etc, but I did it the wrong way and hadn't yet tested it out. I must have thought the alti would somehow modify the following instruction to extract the D field of patchvfp and test it for non-zero cases, but it doesn't do that and it must test "a" (a temp variable) with the register at the address of whatever the D field of patchvfp references (it would have been 0 which is my status address pointer and this varies with the input address being passed in).

The fact that this lined up with the hub interval of 8 longs really confused me as well.

Total messed up ugliness and this was quite hard to locate. I need to be more careful with what goes into this code before it gets saved off. It was probably one of those things you add to your code just before you finish up and then leave it and forget to test it after coming back from lunch/dinner/night etc. Then it comes back to haunt you later.

The bad code was this:


patchvfp                    callpa  #V_FP-0, #blank         'send vertical front porch lines
...

                            alti    patchvfp, #%100_000
                            test    a wz
            if_z            mov     patchvfp, writestat     'no front porch (just equalization)

and it was just changed to the normal/simple way to do this, which fixed the issue.

                            mov     a, patchvfp
                            shr     a, #9
                            test    a, #$1ff wz
            if_z            mov     patchvfp, writestat     'no front porch (just equalization)
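
For anyone following along without the hardware, the shift-and-mask in the fixed version can be modelled in a few lines of Python. This is only a sketch: the shift of 9 and the $1ff mask come from the code above, while `d_field`, `with_d_field` and the value of `base` are hypothetical (the upper bits of `base` are a placeholder, not a real instruction encoding).

```python
D_SHIFT = 9          # D field starts at bit 9 (the shr a, #9 above)
FIELD_MASK = 0x1FF   # 9-bit field (the test a, #$1ff above)


def d_field(instruction):
    """Extract the 9-bit D field from a 32-bit instruction word."""
    return (instruction >> D_SHIFT) & FIELD_MASK


def with_d_field(instruction, value):
    """Return the instruction word with its D field replaced by value."""
    cleared = instruction & ~(FIELD_MASK << D_SHIFT)
    return (cleared | ((value & FIELD_MASK) << D_SHIFT)) & 0xFFFFFFFF


base = 0xFFFC0055    # placeholder word: arbitrary upper bits, S field = $055

front_porch_3 = with_d_field(base, 3)
front_porch_0 = with_d_field(base, 0)

print(d_field(front_porch_3))         # 3 -> front porch lines present
print(d_field(front_porch_0) == 0)    # True -> the patch case (no front porch)
```

The point of the fix is that the zero test looks at the field value itself, not at whatever register that value happens to address.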

rjo__ · 2019-12-07 07:12

cgracey wrote: »

rogloh wrote: »

I'd like more internal hub RAM and native 10b video output, so that hi-res DVI/HDMI resolutions higher than VGA are then enabled. But others could want more COGs. P1 and, to a much lesser extent, P2 (with video) have seemed to me to be mainly memory limited, not processor limited, though they are microcontrollers after all and we are pushing for a lot here. In my own applications, which have typically been video focussed to date, I've usually run out of memory before COGs. Given the raw speed of the P2 cogs, I think this will continue. Thankfully the extra pins and HyperRAM can now help increase the memory significantly.

Has anyone demonstrated video streaming from a HyperRAM, yet?

Ray did it. And Brian demonstrated that 16bit VGA is very practical. If you put the two ideas together... very nice.

forums.parallax.com/discussion/169926/hyperram-flash-as-vga-screen-buffer-now-xga-720p-1080p-rev-b#latest

At first I didn't believe it.

Speaking of multi-ported RAM: you can't read and write to HyperRAM simultaneously, and two boards take up a lot of real estate. All of the display issues I'm seeing could be solved by a P2x4 with a common clock. Good for the whole family.
Big profits too!

potatohead · 2019-12-07 07:33

If it were me, on P3?

Blow it out to 64 bit. Logically extend things so that Cogs are large, and the on-chip RAM also larger. Keep it operating in the Propeller way, with everything blown out to the larger scale.

Add MMU facility on one variant to take advantage of very large off chip RAM.

Then we get:

COG = Ultra Fast
On Chip RAM = Fast
Off Chip RAM = Not as Fast

Maybe have a discussion about what that MMU can / should actually do. Another one about protecting regions of external RAM, perhaps per COG style.

In short, keep with this idea of doing multiple things at once, sans an operating system. People can, and I bet will do that anyway for P2 and future chips, but they won't have to.

Something in me says computing is going to tighten down considerably in the next decade. Should we get lucky, and P2 sales permits another adventure?

Target a process node that would get us a respect

rogloh wrote: »

Hey @potatohead if you get a chance, try out the latest PAL fix and see if your PVM now recognizes it as valid PAL and shows the colour bars + olive burst colour at the bottom of screen. Binary to use is in that last posted zip file.

Will do tomorrow. I'm home, where that display is.

Wuerfel_21 · 2019-12-07 07:40

potatohead wrote: »

If it were me, on P3?

Blow it out to 64 bit.

Eh, 64 bit is too unwieldy and doesn't start getting really interesting unless you get close to that 4GB address space limit.
Making the opcodes (optionally!) 64 bit wide seems interesting though.
A P3 feature I'd like to see would be some way to do 3-operand instructions (i.e. ALTR #something without the ALTR), maybe limited to a set of special registers to fit in the opcode.

Anyways, P3 ramblings belong in that other thread. Or that other thread. Or t.. oh.

cgracey · 2019-12-07 08:42

potatohead wrote: »

cgracey wrote: »

evanh wrote: »

cgracey wrote: »

I wish foundries had more than dual-port RAMs. The more read and write ports, the more can be done in parallel. As it is now, if you want a 3-read/1-write port RAM, you need to instantiate 3 dual-port RAMs and dedicate a port on each to the common 1-write cause.

The reason why is probably because no one else, these days at least, uses SRAM for the general register set.

Are you saying they just use flops and muxes? That only works if your register set is small, like maybe 16 registers, in my experience.

And other constructs. I was on a flight recently with someone from a big chip company we all know. Her title:

Register Designer / Architect

Chip, this way of having the CPU be a big pile of registers is unique out there right now. We got to chatting and frankly, how the P2 works was rather difficult to communicate. It's very, very different from how the major players are doing things.

We all know her? I don't know anybody that fits that description.

I'm thinking it's more human-friendly to have everything be registers, but if compilers are doing all the machine-code generation, a small set of registers is probably more efficient, considering die area and power.

evanh · 2019-12-07 08:49

Spud was meaning the company, not her. Just the way he wrote her title on a new line made the fullstop seem to vanish.

cgracey · 2019-12-07 08:57

evanh wrote: »

Spud was meaning the company, not her. Just the way he wrote her title on a new line made the fullstop seem to vanish.

Oh, Duh. I was blowing through the punctuation as I read it, "...from a big chip company. We all know her. Title: Register Designer/Architect."

This is fascinating to read about 14nm technology, if anyone wants to go on a 15-minute mental odyssey:

https://hal.archives-ouvertes.fr/hal-01541171/document

They must place dummy poly and wires on all sides of actual circuit elements in the fine layers, in order to assure manufacturing consistency. This leads to regular patterns with interruptions to get functionality. Another world. And the metal layer width/thicknesses ranges from mice to elephants.

evanh · 2019-12-07 09:02

cgracey wrote: »

I'm thinking it's more human-friendly to have everything be registers, but if compilers are doing all the machine-code generation, a small set of registers is probably more efficient, considering die area and power.

Speed of multitasking context switching is a limiting factor; in traditional general architectures it curtails the number of general use registers. That has not been a concern with the Propeller, where the system engineer decides how each cog gets used and generally doesn't attempt to switch tasks in any major fashion.

So that's a clear line in the sand on what one should expect of even bigger propellers. At the very least it's one cog per program/process/environment. No sharing of processors between tasks.

evanh · 2019-12-07 09:09

Conveniently, that would also retain the individual program's per cog control of whether interrupts are in operation or not.

cgracey · 2019-12-07 09:20

evanh wrote: »

cgracey wrote: »

I'm thinking it's more human-friendly to have everything be registers, but if compilers are doing all the machine-code generation, a small set of registers is probably more efficient, considering die area and power.

Speed of multitasking context switching is a limiting factor; in traditional general architectures it curtails the number of general use registers. That has not been a concern with the Propeller, where the system engineer decides how each cog gets used and generally doesn't attempt to switch tasks in any major fashion.

So that's a clear line in the sand on what one should expect of even bigger propellers. At the very least it's one cog per program/process/environment. No sharing of processors between tasks.

That's a good way to think about it. I made everything into registers so that, finally, I wouldn't be under the tyranny of fixed-named, limited sets of working registers which always need storing and reloading to do something different. It's fatiguing to deal with, and eventually results in stultifying dread. What else could maybe be blown wide open to alleviate stress on the programmer?

evanh · 2019-12-07 09:35

cgracey wrote: »

What else could maybe be blown wide open to alleviate stress on the programmer?

Well, there is a design tension. One direction is to make the cogs large, fast and hot, limiting their numbers. The other direction is smaller and relatively slower cogs that can be packed in, so there can be more of them. We saw it playing out early on with the prop2 development. I feel you'd sort of decided which side you wanted when you changed to 2-clocks per instruction execute.

Wuerfel_21 · 2019-12-07 09:39

evanh wrote: »

cgracey wrote: »

I'm thinking it's more human-friendly to have everything be registers, but if compilers are doing all the machine-code generation, a small set of registers is probably more efficient, considering die area and power.

Speed of multitasking context switching is a limiting factor; in traditional general architectures it curtails the number of general use registers.

Compared to how long most modern multitasking designs take to switch tasks, the P2, even when dumping and restoring all ~500 registers, isn't actually that ill-suited to single-cog multitasking (and it ofc gets a lot faster if you give each thread its own registers or reduce the range that is dumped/restored).

evanh · 2019-12-07 09:45

Give each task its own thread (hardware context). Do OSes actually try to restrict to that?

cgracey · 2019-12-07 09:46

evanh wrote: »

cgracey wrote: »

What else could maybe be blown wide open to alleviate stress on the programmer?

Well, there is a design tension. One direction is to make the cogs large, fast and hot, limiting their numbers. The other direction is smaller and relatively slower cogs that can be packed in, so there can be more of them. We saw it playing out early on with the prop2 development. I feel you'd sort of decided which side you wanted when you changed to 2-clocks per instruction execute.

Yes, it takes three dual-port RAMs to achieve single-clock execution because there's no 3R/1W memory instance available. That was the main driver to dropping to two-clock execution, since it only requires a single dual-port RAM. We made the cogs big, anyway, after that. On P3, it would be good to go back to single-clock execution. If we use a much smaller process, three dual-port RAMs for cog memory won't be a problem. Imagine (fantasy) 16 cogs executing single-clock instructions at 8GHz. That would be 128 BIPS. Right now, at 250MHz, we have 1 BIPS. Two orders of magnitude speed improvement - you could really FEEL that.
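
The arithmetic behind those figures can be checked with a quick Python sketch. The cog count of 8 for the current P2 is not stated above and is filled in here as an assumption; `instructions_per_second` is a hypothetical helper name.

```python
def instructions_per_second(cogs, clock_hz, clocks_per_instruction):
    """Aggregate instruction rate across all cogs."""
    return cogs * clock_hz // clocks_per_instruction


# Current P2: assumed 8 cogs, 250 MHz, two clocks per instruction
p2_now = instructions_per_second(8, 250_000_000, 2)

# Fantasy P3: 16 cogs, 8 GHz, single-clock execution
p3_fantasy = instructions_per_second(16, 8_000_000_000, 1)

print(p2_now)                 # 1000000000   (1 BIPS)
print(p3_fantasy)             # 128000000000 (128 BIPS)
print(p3_fantasy // p2_now)   # 128, roughly two orders of magnitude
```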

evanh · 2019-12-07 09:54

Lol, too much. The prop2 is plenty fast.

rogloh · 2019-12-07 10:04

Imagine (fantasy) 16 cogs executing single-clock instructions at 8GHz.

I wonder what sort of power draw that would require...
