Propeller II update - BLOG

Bill Henning · 2014-01-20 15:02

I could not agree more!!!

It is a vicious cycle. It takes a lot of time to make a highly optimizing compiler that can use most or all instructions, which drives development costs up. Limited budgets mean compiler drivers leave out anything not essential, losing potential performance.

Processor designers then go "oh, no one is using those instructions! Let's drop them!" because they ignore the shrinking assembly programmer pool, which has mostly disappeared from general purpose PCs (but is fortunately alive for microcontrollers).

Then the compiler guys (driven by low development costs) drop more instructions. Rinse and repeat...

Meanwhile the assembly programmers on resource constrained micros make great use of the "compiler useless" instructions - getting MUCH better performance than compiled code.

cgracey wrote: »

It's been my suspicion that this phenomenon of compilers driving CPU design has created a wasteland for would-be assembly language programmers. Can anyone corroborate this?

ozpropdev · 2014-01-20 15:03

ctwardell wrote: »

Do we have any documentation yet for SETXCH?

It seems like there is a lot of hope that the D Port is going to be some kind of magical panacea, but we only have one so its usage will need to be negotiated among various objects.

C.W.

Here's the section on Port D from the latest "Prop docs"

PORT D INTER-COG EXCHANGE
-------------------------

Port A, associated with PINA/OUTA/DIRA, connects to external pins 0..31.    *** SAME
Port B, associated with PINB/OUTB/DIRB, connects to external pins 32..63.   *** SAME
Port C, associated with PINC/OUTC/DIRC, connects to external pins 64..91.   *** SAME
Port D, associated with PIND/OUTD/DIRD, connects to internal pins 96..127.  *** DIFFERENT!!!

The internal pins of port D differ from the external pins of ports A/B/C in regard to both
outputs and inputs:

    Each cog generates its port D outputs in the same pattern it generates its port A/B/C
    outputs:

        OUTD is OR'd with SERA/SERB/CTRA/CTRB/XFR/TRACE outputs 127..96, then those 32 bits
        get AND'd with DIRD to form the port D outputs.

    The difference is that all the cogs' port D outputs are not OR'd together before going
    to a set of 32 I/O pins. Instead, each cog's port D outputs are kept separated, and
    every cog can determine which other cogs' port D outputs it wants to see in its own
    PIND input, which also feeds SERA/SERB/CTRA/CTRB/XFR inputs 127..96.


The SETXCH instruction is used to set the PIND input filter:

    SETXCH  D/#             - Set PIND input filter to %DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA

        %DDDDDDDD = filter for PIND[31..24]

            %xxxxxxx1 = cog 0's port D output [31..24] will be OR'd into PIND[31..24] input
            %xxxxxx1x = cog 1's port D output [31..24] will be OR'd into PIND[31..24] input
            %xxxxx1xx = cog 2's port D output [31..24] will be OR'd into PIND[31..24] input
            %xxxx1xxx = cog 3's port D output [31..24] will be OR'd into PIND[31..24] input
            %xxx1xxxx = cog 4's port D output [31..24] will be OR'd into PIND[31..24] input
            %xx1xxxxx = cog 5's port D output [31..24] will be OR'd into PIND[31..24] input
            %x1xxxxxx = cog 6's port D output [31..24] will be OR'd into PIND[31..24] input
            %1xxxxxxx = cog 7's port D output [31..24] will be OR'd into PIND[31..24] input

        %CCCCCCCC = filter for PIND[23..16]

            %xxxxxxx1 = cog 0's port D output [23..16] will be OR'd into PIND[23..16] input
            %xxxxxx1x = cog 1's port D output [23..16] will be OR'd into PIND[23..16] input
            %xxxxx1xx = cog 2's port D output [23..16] will be OR'd into PIND[23..16] input
            %xxxx1xxx = cog 3's port D output [23..16] will be OR'd into PIND[23..16] input
            %xxx1xxxx = cog 4's port D output [23..16] will be OR'd into PIND[23..16] input
            %xx1xxxxx = cog 5's port D output [23..16] will be OR'd into PIND[23..16] input
            %x1xxxxxx = cog 6's port D output [23..16] will be OR'd into PIND[23..16] input
            %1xxxxxxx = cog 7's port D output [23..16] will be OR'd into PIND[23..16] input

        %BBBBBBBB = filter for PIND[15..8]

            %xxxxxxx1 = cog 0's port D output [15..8] will be OR'd into PIND[15..8] input
            %xxxxxx1x = cog 1's port D output [15..8] will be OR'd into PIND[15..8] input
            %xxxxx1xx = cog 2's port D output [15..8] will be OR'd into PIND[15..8] input
            %xxxx1xxx = cog 3's port D output [15..8] will be OR'd into PIND[15..8] input
            %xxx1xxxx = cog 4's port D output [15..8] will be OR'd into PIND[15..8] input
            %xx1xxxxx = cog 5's port D output [15..8] will be OR'd into PIND[15..8] input
            %x1xxxxxx = cog 6's port D output [15..8] will be OR'd into PIND[15..8] input
            %1xxxxxxx = cog 7's port D output [15..8] will be OR'd into PIND[15..8] input

        %AAAAAAAA = filter for PIND[7..0]

            %xxxxxxx1 = cog 0's port D output [7..0] will be OR'd into PIND[7..0] input
            %xxxxxx1x = cog 1's port D output [7..0] will be OR'd into PIND[7..0] input
            %xxxxx1xx = cog 2's port D output [7..0] will be OR'd into PIND[7..0] input
            %xxxx1xxx = cog 3's port D output [7..0] will be OR'd into PIND[7..0] input
            %xxx1xxxx = cog 4's port D output [7..0] will be OR'd into PIND[7..0] input
            %xx1xxxxx = cog 5's port D output [7..0] will be OR'd into PIND[7..0] input
            %x1xxxxxx = cog 6's port D output [7..0] will be OR'd into PIND[7..0] input
            %1xxxxxxx = cog 7's port D output [7..0] will be OR'd into PIND[7..0] input


To input only cog 0's port D output into PIND, you would use the filter value $01_01_01_01.
To input the logical OR of cog 0's and cog 1's port D outputs into PIND, you would use
$03_03_03_03. In most cases, it may be desirable to just see one other cog's full port D
output in a PIND input, but many other arrangements are possible. SETBYTE and GETBYTE
instructions can be used to efficiently move bytes via OUTD/PIND windows.

After SETXCH, PIND can be read for newly-filtered data on the third clock:

        SETXCH  #$00000001      'change filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from old filter
        MOV     X,PIND          'data from new filter


Writes to an OUTD are readable from a PIND on the third clock, as well.
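
The filter rules above can be modelled in a few lines of Python. This is only a sketch of the documented bit layout, not a model of the hardware: the function name `pind_input` and the sample output values are made up for illustration, the outputs are assumed to be already gated by each cog's DIRD, and the three-clock latency is ignored.

```python
NUM_COGS = 8

def pind_input(setxch_filter, cog_outputs):
    """Return the PIND value a cog would see for a given SETXCH filter.

    setxch_filter: 32-bit value laid out as %DDDDDDDD_CCCCCCCC_BBBBBBBB_AAAAAAAA
    cog_outputs:   eight 32-bit port D outputs, indexed by cog number
    """
    pind = 0
    for lane in range(4):  # lane 0 = PIND[7..0] ... lane 3 = PIND[31..24]
        lane_filter = (setxch_filter >> (8 * lane)) & 0xFF
        lane_mask = 0xFF << (8 * lane)
        for cog in range(NUM_COGS):
            if lane_filter & (1 << cog):  # bit n of a lane filter selects cog n
                pind |= cog_outputs[cog] & lane_mask
    return pind

# Hypothetical port D outputs for cogs 0, 1 and 5; all other cogs drive zero.
outputs = [0] * NUM_COGS
outputs[0] = 0x11223344
outputs[1] = 0x0000FF00
outputs[5] = 0xA0000000

only_cog0 = pind_input(0x01010101, outputs)     # see cog 0 only
cog0_or_cog1 = pind_input(0x03030303, outputs)  # OR of cogs 0 and 1
mixed = pind_input(0x20000201, outputs)         # a different cog per byte lane
```

The `mixed` case picks cog 5 for the top byte, nothing for the next byte, cog 1 for bits 15..8 and cog 0 for the low byte, which is the kind of "other arrangement" the doc mentions.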

Heater. · 2014-01-20 15:03

Chip,

That is one way to look at it.

Certainly back in the day the hundreds of instructions added to the Z80 over the original 8080 instruction set were put there for the assembler language programmer. Never used by compilers.

Similarly the Motorola 6809 was a joy to program in assembler over the previous 6800. A lot of instructions and addressing modes and general "orthogonality" added to make the assembler language programmer happy.

In the here and now you have described all those new P2 instructions as "friends who want to help me". Which is no doubt true. I'm also guessing that a lot of them will not be used by compilers. Except as bolt on intrinsics/functions/inline assembler that will be added to support the most useful of those instructions.

Tor · 2014-01-20 15:39

On the other hand.. instructions were also added to processor architectures with the intention to 'help' compiler writers.
I have written emulators for two different minicomputer architectures (one 16-bit and one 32-bit); I used to write assembly programs for both back in the day. It was a pleasure to write assembly; whenever the job got too stressful I would relax by writing some assembly that did something useful, like adding features that compilers could not provide natively. Or just speed up stuff. I noticed however how every new iteration of both of those CPU types added more and more instructions over the years. These were described as intended to support compilers. Instructions that would set up stack frames, for example, meant to replace the sequence of instructions used until then. Complex instructions some of them, some of them useful for assembly programmers, but not very fun, in a sense. A strcpy() instruction, with extra arguments? Not attractive really. Other instructions way too complex to be interesting. But supposedly good for compilers.

Then years later I started to write my emulators.. I had no hardware anymore, limited documentation, but I still had lots of software, even compilers, including the compiler used to compile some of those compilers. So I traced literally millions of disassembly lines as I figured out how the instructions worked. And I found that:
1) Even the latest compiler versions, written years after those compiler-helping instructions had been added, didn't use them. The stack-frame instruction wasn't used by a single one of them, for example. The compilers stuck to nearly the same set of instructions used by the very first versions (so the latest and greatest Fortran-77 compiler version used nearly the same set of instructions as the oldest Fortran IV compiler for the code they generated. Very different code, but same set of instructions with only a few exceptions).
2) The complexity of those instructions really messed up the architectures, particularly of the 16-bit one which had started out as a lean, mean, risc-like architecture in some sense. Extremely easy to decode, one or two cycles of the clock. After more and more complex instructions were added the decoding complexity increased. My emulator suffered a lot, obviously, being implemented in software, but there's no way it could have been done efficiently in hardware either. So it ended up as a bigger and bigger micro-coded core. And for little gain, a lot of the work done by the hardware guys of that vendor was simply wasted. Their own compiler developers didn't use the instructions. And the compilers were good. The programs were very space-efficient, and got the most out of the CPUs.
3) On the other hand, the 32-bit architecture I worked on and which was also very nice to program for in assembly had a bewildering array of address modes. 28 different ones in total I think. And tons of interesting instructions for assembly writers. And a lot of instructions meant for compilers, many of them not used by compilers. But I realised when implementing my emulator that it would have been a horror to implement all of those nice instructions and addressing modes in hardware too. I've seen an expression used by Chip: Critical path. I believe I understand what it means.. I'm sure this architecture which was meant to support compilers, and to some extent assembly programmers too, really hit the critical path.
No wonder that when I ran a heavy-duty Fortran program through f2c, recompiled it in C on a Sun-4, in 1990 I think, it ran 50 times faster..

-Tor

Cluso99 · 2014-01-20 16:47

Many of the new P2 instructions will probably only be used in assembly language. But that is precisely where they are required - to perform something fast!

Some instructions will aid the higher level languages. They have been asked for by the compiler writers and friends.

The P2 will likely have both assembly drivers and higher level (Spin, C, etc) programs. This is no different than P1.

PORTD
How quickly we forget. Thanks for reposting the usage and SETXCH. It is way better than just a 32-bit internal PortD. It is really like 8x PortDs, each one individual to each cog, but with the option to combine them into one. WTG Chip.

ctwardell · 2014-01-20 17:37

ozpropdev wrote: »

Here's the section on Port D from the latest "Prop docs"...

Thanks for posting the Port D info.

C.W.

ctwardell · 2014-01-20 18:16

Non-Hub Flags...

Port D will be useful for a lot of things, but I think it would be nice to also have a set of flags that can be accessed as a non-hub operation by all the cogs.

The use case I have in mind is a flag to indicate that something is ready to be acted on, basically "You Have Mail!".

The issue with the Port D setup is that a requesting COG can "raise the mail flag", but the responding COG cannot "lower the mail flag".

The responding COG could raise its "I got the mail flag" and then the requesting COG could "lower the mail flag".

The responding COG would have to watch for the requesting COG to "lower the mail flag" in order to prevent a false trigger from the original request.

That seems like a lot of work compared to having a commonly available location where the request COG could set the flag and the responding COG could clear it.
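
The raise/acknowledge/lower sequence described above can be sketched as a small Python model. It assumes each cog can drive only its own flag bit (as with Port D outputs); the names `Link`, `requester_step` and `responder_step` are invented for the illustration, and the two cogs are simply stepped in turn rather than run concurrently.

```python
class Link:
    """Two one-way flags: each side can change only the flag it drives."""
    def __init__(self):
        self.mail = False  # driven only by the requesting cog
        self.ack = False   # driven only by the responding cog

def requester_step(link, state):
    if state["pending"] and not link.mail and not link.ack:
        link.mail = True           # raise the mail flag for a new request
        state["pending"] -= 1
    elif link.mail and link.ack:
        link.mail = False          # responder got it, so lower the mail flag

def responder_step(link, state):
    if link.mail and not link.ack:
        state["handled"] += 1      # act on the request exactly once
        link.ack = True            # raise "I got the mail"
    elif not link.mail and link.ack:
        link.ack = False           # mail flag went low, so re-arm for the next one

link = Link()
state = {"pending": 3, "handled": 0}
for _ in range(20):                # interleave the two cogs
    requester_step(link, state)
    responder_step(link, state)
```

Because the responder only acts when mail is high and its own acknowledge is low, the original request cannot trigger it a second time, but every message costs four flag transitions.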

I realize we could just use a location in the hub, but with HUBEXEC it seems like a big performance hit to use hub cycles to poll flags.

What I would like to have is a shared long providing 32 flags that can be set, cleared, and tested by non-hub instructions.

I would also like to see the locks increased as I mentioned earlier, 32 would be nice.

Having the same number of locks and flags would be useful in the use cases I have in mind, but isn't required.

Just throwing this in the mix since you can't get it if you don't ask for it.

C.W.

cgracey · 2014-01-20 18:22

ctwardell wrote: »

Non-Hub Flags...

Port D will be useful for a lot of things, but I think it would be nice to also have a set of flags that can be accessed as a non-hub operation by all the cogs.

The use case I have in mind is a flag to indicate that something is ready to be acted on, basically "You Have Mail!".

The issue with the Port D setup is that a requesting COG can "raise the mail flag", but the responding COG cannot "lower the mail flag".

The responding COG could raise its "I got the mail flag" and then the requesting COG could "lower the mail flag".

The responding COG would have to watch for the requesting COG to "lower the mail flag" in order to prevent a false trigger from the original request.

That seems like a lot of work compared to having a commonly available location where the request COG could set the flag and the responding COG could clear it.

I realize we could just use a location in the hub, but with HUBEXEC it seems like a big performance hit to use hub cycles to poll flags.

What I would like to have is a shared long providing 32 flags that can be set, cleared, and tested by non-hub instructions.

I would also like to see the locks increased as I mentioned earlier, 32 would be nice.

Having the same number of locks and flags would be useful in the use cases I have in mind, but isn't required.

Just throwing this in the mix since you can't get it if you don't ask for it.

C.W.

I'll see what I can do. This is pretty simple, but if you don't hear anything from me, please remind me.

dr hydra · 2014-01-20 18:26

The stm32f429 discovery looks cool...but what is the point...a $25 development board that needs a $500 IDE studio...and yes I know they have free lite versions...but 32k that is way too small:(

Bill Henning · 2014-01-20 18:26

Sounds like a simple internal "PORTE" - would be handy.

Regarding locks... I also vote to increase them if feasible; I use them extensively in Morpheus to coordinate multi-cog access to the external sram. With four tasks per cog, I can see a use for more than eight locks.

ctwardell · 2014-01-20 18:29

cgracey wrote: »

I'll see what I can do. This is pretty simple, but if you don't hear anything from me, please remind me.

Thanks Chip.

C.W.

jmg · 2014-01-20 18:38

ctwardell wrote: »

That seems like a lot of work compared to having a commonly available location where the request COG could set the flag and the responding COG could clear it.

This needs a rule for what happens when both actions occur on the same clock,
- and is the originator able to remove a set flag ?

ctwardell · 2014-01-20 18:46

jmg wrote: »

This needs a rule for what happens when both actions occur on the same clock,
- and is the originator able to remove a set flag ?

Chip or someone else may have a better suggestion but here is my thought:

Any COG can set or clear.

If conflicting values are set on the same clock:

If the current value is set, then clear wins.
If the current value is cleared, then set wins.

The use cases I have in mind would occur within a lock so this wouldn't be an issue, but in the general case some rule is needed.
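
As a sketch of that rule (the function name and argument names are made up here, and this says nothing about how the hardware would implement it), the next flag value for one clock could be computed like this:

```python
def resolve_flag(current, set_req, clear_req):
    """Next value of a shared flag for one clock.

    On a same-clock conflict the flag changes state:
    a set flag is cleared, a cleared flag is set.
    """
    if set_req and clear_req:
        return not current
    if set_req:
        return True
    if clear_req:
        return False
    return current  # no request this clock, so the flag holds its value

conflict_when_set = resolve_flag(True, True, True)
conflict_when_clear = resolve_flag(False, True, True)
```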

C.W.

evanh · 2014-01-20 19:04

Heater. wrote: »

It is this phenomenon that was the motivation for the RISC idea. They analysed the output of compilers and the run time usage of different instructions and addressing modes etc and realized it might be a good idea to throw away the unused or little used instructions and dedicate the silicon resources to speeding up execution of what is used. Use those transistors for registers and pipelines rather than instruction decoders.

That's a fair call but it does go both ways. Compiler designers also take advantage of new hardware focuses. Code optimising is all about finding such improvements.

Of course as transistors became plentiful it turned out you could keep the backwards compatible CISC and have those registers and pipelines. At cost of increased power consumption of course.

Backwards compatibility was extremely important for running the same binaries that were copies in the first place. The PC came to dominate on this point and it's still an important feature of the PC. There are still new systems being delivered on modern hardware and running MS-DOS.

The rules have changed a bit in some areas. With a monopoly comes rule setters that can potentially dictate how and when purchases are made and thereby force an upgrade path.

pedward · 2014-01-20 19:42

David Betz wrote: »

While it is true that fetching a single value from hub memory will be faster with a RDLONG instruction than with a DCACHEX followed by RDLONGC, that may not be true for fetching numerous contiguous values from hub if they are all in one cache line. I guess which makes the most sense depends on the context.

But if you are dealing with a volatile memory location, you'd never use RDXXXXC. My point is that you would use RDXXXX for volatile variables and RDXXXXC for non-volatile variables (in the C context of the term volatile).

David Betz · 2014-01-20 20:21

pedward wrote: »

But if you are dealing with a volatile memory location, you'd never use RDXXXXC. My point is that you would use RDXXXX for volatile variables and RDXXXXC for non-volatile variables (in the C context of the term volatile).

You can safely read volatile locations as long as you flush the cache first and you're sure the other COG won't be writing at the same time as you're reading, but that is usually true of a producer/consumer relationship.

Cluso99 · 2014-01-20 21:18

Chip seems happy to discuss locks, portD and other options. There are probably simple and good ways to do intercog transfers/flags/locks. How about a new thread to discuss this???

jmg · 2014-01-20 21:50

ctwardell wrote: »

If the current value is set, then clear wins.

That could give problems with positive logic flags, and a lost message could result.

If the rule was sticky-logic case won, (set trumps clear), then a lost message would not result.

Of course, you could also claim such a boundary case was perilously close to overrun, but the handler may be FIFO based, and
set on Load and clear on unload...(or clear on partly empty)
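
The lost-message case can be shown with a small Python sketch. Everything here is illustrative: the policy names "toggle" (a conflict flips the flag, so clear wins when it is already set) and "set_wins" are labels invented for the two rules under discussion, and the scenario assumes a producer posts a second message on the very clock the consumer clears the first.

```python
def next_flag(current, set_req, clear_req, policy):
    """Next flag value for one clock under a given conflict policy."""
    if set_req and clear_req:
        if policy == "set_wins":
            return True            # sticky: set trumps clear
        if policy == "toggle":
            return not current     # a set flag is cleared, a clear flag is set
        raise ValueError(policy)
    if set_req:
        return True
    if clear_req:
        return False
    return current

def messages_seen(policy):
    """Count how many of two posted messages the consumer gets to see."""
    flag = next_flag(False, True, False, policy)  # clock 1: message 1 posted
    seen = 1 if flag else 0                       # consumer notices message 1
    flag = next_flag(flag, True, True, policy)    # clock 2: clear collides with set
    if flag:
        seen += 1                                 # message 2 is visible only if the flag survived
    return seen

toggle_seen = messages_seen("toggle")
set_wins_seen = messages_seen("set_wins")
```

With the toggle rule the flag ends up clear and the second message is never noticed; with set-wins the flag stays high and the consumer sees it on its next poll.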

pedward · 2014-01-20 23:15

David Betz wrote: »

You can safely read volatile locations as long as you flush the cache first and you're sure the other COG won't be writing at the same time as you're reading, but that is usually true of a producer/consumer relationship.

The point is:

a) You don't need to invalidate the cache, which is sloppy and can impact performance considerably
b) Another COG will NEVER be writing at the same time as you are reading, by virtue of the hub slicing

So, because of rules a) and b), you can short circuit the whole cache and atomicity problem by simply using RDXXXX. You can never have a non-atomic read or write, but you can have race conditions because you need to lock global memory blocks; this is true for any system. You can sometimes avoid a lock by using a semaphore that is in the same block as the data you want. The RD instruction can pull 32 bytes in 1 read, so you could load a fairly large structure in 1 read.

pik33 · 2014-01-20 23:29

cgracey wrote: »

It's been my suspicion that this phenomenon of compilers driving CPU design has created a wasteland for would-be assembly language programmers. Can anyone corroborate this?

Some time ago I wrote a program which had to synthesize sound on the PC as the sum of 512 simple software generators. The processor I used was an AMD Duron @ something less than 1 GHz.
After all needed preparation processes I had about 10 ns to do one (simple) loop.

No compiler could go lower than 50 ns with this loop.

So I did this in asm and after some code optimization (I had to select registers and interleave instructions in the right way) the goal was achieved. The assembler made this possible.

1% of the code uses 99% of the time, so why not use asm when needed? I have a software synthesizer which can do well on PC. I tried to port it to Raspberry Pi. The Pi is too slow and the sound is not good. So, why not use asm in time-critical parts of the program, which is now something about 100 lines of code and the full code is over 6000 lines? I am sure it will work well then so I will do it as soon as I pass all exams I have to pass in January (no time to program now...)

cgracey · 2014-01-21 02:11

I've got AUGD and AUGS working (instead of a single AUGI), as well as 6 instructions which can compute table lookup addresses according to hub exec space (16-bit addressing for long instructions expands to 18 bits for RDxxxx instructions). I'm compiling what might be the next release now. If it works, I'll get the docs updated and get this posted soon. I think it all turned out really nice and clean.
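
One way to read the 16-bit to 18-bit expansion, sketched in Python: if a hub exec address counts longs, the matching byte address is four times larger, which needs two more bits. This is an interpretation of the sentence above, not a description of the six new instructions (which have not been documented yet); the function name is made up.

```python
def long_to_byte_address(long_addr):
    """Convert a 16-bit long-granular address to a byte address."""
    if not 0 <= long_addr < (1 << 16):
        raise ValueError("long address must fit in 16 bits")
    return long_addr << 2  # 4 bytes per long

top = long_to_byte_address(0xFFFF)  # highest long address
bits_needed = top.bit_length()      # width of the resulting byte address
```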

ozpropdev · 2014-01-21 02:17

cgracey wrote: »

I've got AUGD and AUGS working (instead of a single AUGI), as well as 6 instructions which can compute table lookup addresses according to hub exec space (16-bit addressing for long instructions expands to 18 bits for RDxxxx instructions). I'm compiling what might be the next release now. If it works, I'll get the docs updated and get this posted soon. I think it all turned out really nice and clean.

Sounds great Chip! Nice work

Heater. · 2014-01-21 02:25

dr hydra,

The stm32f429 discovery looks cool...but what is the point...a $25 development board that needs a $500 IDE studio

Not me.

Not sure about that board exactly but I program my STM32F4 board for free using GCC.

Cluso99 · 2014-01-21 02:37

Fingers crossed Chip. Nice job

potatohead · 2014-01-21 06:03

Excellent. Thank you Chip. I am in NOLA this week... I will be last to the party this time. Can't wait!

David Betz · 2014-01-21 06:11

cgracey wrote: »

I've got AUGD and AUGS working (instead of a single AUGI), as well as 6 instructions which can compute table lookup addresses according to hub exec space (16-bit addressing for long instructions expands to 18 bits for RDxxxx instructions). I'm compiling what might be the next release now. If it works, I'll get the docs updated and get this posted soon. I think it all turned out really nice and clean.

Cool! So now it is possible with AUGD and AUGS to write a 96 bit P2 instruction! Does this mean we can call P2 a VLIW processor? :-)

Seriously, congratulations on getting all of this working and I'm looking forward to hearing more about the 6 new instructions!

Gadgetman · 2014-01-21 06:33

One example of code that was never used in Compilers...

Take the IX and IY registers of the Z80...
LD A,(IX+24) - Loads the accumulator with the contents of memory pointed to by 'value of IX + 24'

Not useful?
What about languages that understand 'Records', where SomeRecord.Field3 starts 24 bytes past the beginning of 'SomeRecord'?
(Yes, the offset was limited to -128..+127, so a bit limited in the size of the record. It is an 8-bit CPU after all. And yes, I've worked with Ada... )
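
A rough Python model of that addressing mode, to show why it maps so directly onto record fields. The record address, field offset and stored value are hypothetical; the only Z80 facts relied on are that the displacement is a signed byte and that addresses are 16 bits wide.

```python
def indexed_address(ix, displacement):
    """Effective address of a Z80 (IX+d) operand; d is a signed 8-bit value."""
    if not -128 <= displacement <= 127:
        raise ValueError("displacement must fit in a signed byte")
    return (ix + displacement) & 0xFFFF  # addresses wrap at 64 KB

memory = bytearray(65536)
RECORD_BASE = 0x8000    # hypothetical address of SomeRecord
FIELD3_OFFSET = 24      # Field3 starts 24 bytes into the record
memory[RECORD_BASE + FIELD3_OFFSET] = 0x5A

ix = RECORD_BASE                                # IX points at the record
a = memory[indexed_address(ix, FIELD3_OFFSET)]  # models LD A,(IX+24)
```

A compiler that keeps the record's base address in IX can reach any field in the first 127 bytes with a single instruction, which is exactly the access pattern of `SomeRecord.Field3`.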

One of my hobbies is to dig into old code in ROM on vintage computers.
Some of them contain a lot of 'store one register, then load the same value' code, or similar, showing that the code is compiled from a high-level language and that the same variable was used in two consecutive operations.
They also never really use the full set of registers.
On one, I started scribbling on the printout, and realised that I could easily remove 10 - 15% of the code.

A lot of compilers are rushed to market, and never really optimised afterwards. Then the focus is on bug fixes, and yeah, you're not likely to do something that may introduce new bugs, are you?

Heater. · 2014-01-21 07:07

I'm no compiler writer but I can imagine that:

When they started on Z80 compilers there was no Z80; they were probably targeting 8085. Think PL/M from Intel. Or BDS C. So supporting those extra instructions was not on the radar.
When the Z80 became popular you still had to support 8080 machines, so there was not much incentive.
When your compiler is running on a small machine like the old 8-bitters you are struggling to fit the thing in and make it fast enough. Adding in piles of extra stuff and optimizations was not really an option.
By the time the Z80 dominated the world one could already see that 16-bitters were going to take over. No point in optimizing that old stuff.

Oh yeah and optimization is hard!

Baggers · 2014-01-21 08:31

That's great news Chip

can't wait to play with the new codes!

Bill Henning · 2014-01-21 08:47

Excellent news! Time for me to dust off my DE2-115...

Will it still fit in the DE0-Nano's? I'd love to try the UART at as high a bit rate as I can...

cgracey wrote: »

I've got AUGD and AUGS working (instead of a single AUGI), as well as 6 instructions which can compute table lookup addresses according to hub exec space (16-bit addressing for long instructions expands to 18 bits for RDxxxx instructions). I'm compiling what might be the next release now. If it works, I'll get the docs updated and get this posted soon. I think it all turned out really nice and clean.
