Prop2 FPGA files!!! - Updated 2 June 2018 - Final Version 32i

David Betz · 2015-10-12 12:44

Cluso99 wrote: »

I can see how hubexec might look easier for new users familiar with other micros. But if you remember, cog (PASM) was not difficult.

What percentage of P1 users ever go beyond Spin? I was under the impression that it was relatively uncommon for most Propeller users to use PASM other than what is included in drivers that they download from OBEX. This is as it should be. I would be willing to bet that very few Arduino users or even commercial users of the AVR, PIC, or ARM processors write much assembly code. We here in this forum are probably not representative of normal MCU users.

Dave Hein · 2015-10-12 12:52

PLEASE DO NOT REMOVE HUBEXEC!!!! Yes, that was me yelling. Sorry if I hurt anybody's ears. It concerns me greatly that the chief designer (only designer) of the P2 would even consider removing hubexec. I think things could have been simplified if hubexec used the same addressing scheme that is used with cog/LUT memory. It seems like using long addressing throughout would simplify the hardware and the software. In fact, I would think that completely eliminating un-aligned accesses would simplify both instruction access and data access from hub RAM. We don't need un-aligned accesses.

Hubexec will be very useful on the P2. It will be used extensively with C programs, and I'm sure many people will write large PASM programs that will use it as well. It would be used even more if P2 Spin has a mode where it directly compiles into PASM. Just think about it. Spin programmers could actually write code that runs almost as fast as PASM programs.

The only problem that I have with hubexec is that it uses the streamer instead of an instruction cache. The streamer will have to be restarted every time a jump is done. This reduces the efficiency of code that does a lot of jumps. This really messes up the performance of tight loops running in hubexec mode. It also messes up the timing of hubexec code that calls cog code in an attempt to improve efficiency. The streamer is restarted when the cog routine returns instead of just continuing after the CALL was made. Also, data cannot be streamed while running in hubexec mode (or at the very least, data and instruction streaming will have to alternate).

An instruction cache would resolve this. Maybe there can be some tweaks made to the streamer to improve looping and calls, but a cache would be the ultimate fix.

Hubexec could really use an instruction cache.

Seairth · 2015-10-12 12:57

ozpropdev wrote: »

Chip
It seems the "bug radar" is being jammed here with "chaff"
Not sure if you saw an earlier post highlighting an issue with INB operations.
See "INB issue" here

*BUMP*

I was able to reproduce the same issue. There's definitely something going on with INB when the cog first starts.

I am getting $3FFFFFFF initially. Then, after some period, it goes to $7FFFFFFF. On my 1-2-3, it's taking ~2700 iterations of the loop before it seems to start working. Actually, it doesn't seem to matter if I'm reading INB repeatedly or not. It's the delay that seems to be important.

David Betz · 2015-10-12 13:01

Dave Hein wrote: »

PLEASE DO NOT REMOVE HUBEXEC!!!!

I don't think Chip has even considered removing hub exec. I was just objecting to the idea that it should be relegated to an "advanced feature". It seems to me that COG/LUT execution is the advanced feature since it involves loading another memory with code before you can use it. Hub execution treating COG addresses as registers seems like the simplest way to think of P2 programming.

ozpropdev · 2015-10-12 13:11

@Seairth
Yep, I found later that using waitx was effective as a workaround too.

Dave Hein · 2015-10-12 13:32

cgracey wrote: »

... The trouble is mainly coming from addressing complexities. If we could establish some basic rules, maybe based on cog/lut vs. hub context, perhaps we could simplify things greatly.

cgracey wrote: »

I think trying to orient programming towards hub exec has been the downer. Living in the cog is more fun.

Based on what Chip said, maybe it's only the addressing difference between cog and hub exec that complicates things. That's why I suggest eliminating un-aligned hubexec. The program counter would continue to increment by 1 when going from cogexec to hubexec mode. A PC address of $400 would map to a byte address of $1000 in hub RAM. The only drawback I can see is that the first $1000 bytes of hub RAM can't be used for hubexec. I don't see that as a major problem. This memory could be used to hold a cog image that would be loaded into cog RAM. Most hubexec programs will call routines loaded in cog RAM for speed improvement.
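The mapping proposed here can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name is made up, and treating every PC below $400 as cog/LUT space is an assumption drawn from the "$400 maps to $1000" figure, not a description of the actual hardware.

```python
HUBEXEC_FIRST_PC = 0x400  # assumed: first long-addressed PC that falls in hub RAM


def pc_to_hub_byte_address(pc):
    """Map a long-addressed PC to a hub byte address (hypothetical helper)."""
    if pc < HUBEXEC_FIRST_PC:
        # Under the proposal, lower PCs refer to cog/LUT memory instead.
        raise ValueError("PC below $400 is not a hubexec address")
    # Long addressing: the two low-order byte-address bits are implied zeros.
    return pc << 2


print(hex(pc_to_hub_byte_address(0x400)))  # 0x1000
print(hex(pc_to_hub_byte_address(0x401)))  # 0x1004
```

With this scheme the PC still steps by 1 per instruction in both modes, and hub bytes $0000..$0FFF are never reachable as code, which is the drawback noted in the post.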

I think some small tweaks can be done to the instruction streamer to improve performance. I think one simple tweak is for it to remember where it left off when calling routines in cog memory. This way it wouldn't have to be reloaded when returning from the cog memory routine. Another tweak would be to retain a certain number of longs in the streamer after execution. This way the streamer wouldn't have to be reloaded if just jumping back a few instructions. And one more tweak would be to not reload the streamer if jumping ahead by just a few instructions.

A cache would really be the best approach, but that may be too much to ask for at this point.

Bill Henning · 2015-10-12 15:31

I am not awake enough to be diplomatic...

Removing hubexec would be the worst possible move, as hubexec will give good high-level language performance.

Personally, I'd address code in longs, with implied low order bits, in instructions. LOC etc can always add the 2 zeros, and data access for {RD|WR}{BYTE|WORD|LONG} (and streaming equivalents) in bytes.

Regarding streamer... I think it is GREAT for streaming data for video or other high-bandwidth data.

It is far better than LMM for code, but it is nowhere near as effective as a code cache.

Rule of thumb on most processors is every sixth instruction is a branch, which will flush the streamer.

It takes 8 instruction cycles (16 clocks) to refill the streamer.

Best guess: streamed hubexec code will (on average) be 1/2 the speed of cog code, which is not bad at all. (lut code will be 66% the speed of cog code)
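For what it's worth, the back-of-envelope arithmetic behind an estimate like this can be sketched as below. The model is an assumption, not a measurement: it charges one instruction cycle per instruction and a full refill for every branch, using the "every sixth instruction" and "8 instruction cycles" figures quoted above.

```python
def relative_speed(instructions_per_branch, refill_cycles):
    """Fraction of cog speed if every branch costs a full streamer refill."""
    useful = instructions_per_branch                 # cycles spent executing
    total = instructions_per_branch + refill_cycles  # plus the refill stall
    return useful / total


print(round(relative_speed(6, 8), 2))  # 0.43
```

That lands a little under one half. Since not every branch is taken in real code, a figure nearer 1/2 seems plausible, but that part is a guess.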

A 16-long instruction cache would run at ~100% cog speed for loops <= 16 longs, which are frequent for things like mem*(), str*(), etc., but not for main UI code loops.

What would likely help quite a bit is making the instruction stream a bit smarter... only flushing what is needed, so if the target of the branch is already in the streamer, go to it immediately - heck Chip may already be doing this! This would automatically "cache" loops up to 16 instructions long.
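A toy model of that idea, purely as a sketch: it assumes a 16-long window that starts at the last refill address and only moves forward when sequential execution runs off its end. The real streamer is a FIFO and will behave differently in detail, and all names here are hypothetical.

```python
WINDOW_LONGS = 16  # assumed streamer depth, from the "16 instructions" figure


def count_refills(trace, smart):
    """Count streamer refills for a sequence of long-addressed PCs."""
    refills = 0
    base = None  # first address held in the window
    prev = None  # previously executed PC
    for pc in trace:
        sequential = prev is not None and pc == prev + 1
        in_window = base is not None and base <= pc < base + WINDOW_LONGS
        if sequential:
            if not in_window:
                base = pc  # streaming kept up; the window moves forward
        elif smart and in_window:
            pass  # branch target already buffered, no refill needed
        else:
            refills += 1  # flush and restart the streamer at the target
            base = pc
        prev = pc
    return refills


def loop_trace(length, times, start=0x400):
    """PCs for a loop of `length` longs executed `times` times."""
    return [start + i for _ in range(times) for i in range(length)]


print(count_refills(loop_trace(10, 100), smart=False))  # 100
print(count_refills(loop_trace(10, 100), smart=True))   # 1
print(count_refills(loop_trace(20, 100), smart=True))   # 100
```

In this model a 10-long loop pays for one refill in total when the target check is enabled, while a 20-long loop gets no benefit, which matches the "loops up to 16 instructions" intuition.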

David had a good idea - perhaps educational material should first document hubexec mode, as it is closer to conventional processors, and does not limit size. lut/cog execution as an "advanced" topic for drivers, libraries etc. makes sense.

Regarding C/C++ .. David, more people will use it on a P2 with a lot of memory.

Seairth · 2015-10-12 15:48

Seairth wrote: »

What pins are PB0-PB3 connected to? The schematic PDF is only providing the FPGA pins.

Actually, is there a full listing somewhere for the P2 pins?

I should have asked this here. However, I threw together some code to search for those pin outputs. Here's what I found:

inb[27] : PB3
inb[26] : PB2
inb[25] : PB1
inb[24] : PB0

note: active low (0 = pressed, 1 = not pressed)
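For anyone decoding these on the host side, here is a small Python sketch of the mapping above. The bit positions and the active-low sense come from the list; the function and table names are made up for illustration.

```python
# Bit positions found by probing INB on the 1-2-3 board (see list above).
PB_BITS = {"PB0": 24, "PB1": 25, "PB2": 26, "PB3": 27}


def pressed_buttons(inb):
    """Return the names of pressed buttons; inputs are active low (0 = pressed)."""
    return [name for name, bit in PB_BITS.items() if not (inb >> bit) & 1]


print(pressed_buttons(0x7FFFFFFF))  # []
print(pressed_buttons(0x7EFFFFFF))  # ['PB0']
print(pressed_buttons(0x70FFFFFF))  # ['PB0', 'PB1', 'PB2', 'PB3']
```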

I'd still like a full listing of the 1-2-3 to the P2 pins, though.

cgracey · 2015-10-12 15:53

Hub exec is not going away.

potatohead · 2015-10-12 16:06

@jmg, the real question to ask is whether or not there will be that code that assembly language people write.

Having it standards compliant and all of that at this stage doesn't really mean much. Besides, we can write simple filters that'll move a lot of the code if it's really needed.

The reality is a lot of people ended up on P1 and did some neat stuff precisely because of the cool programming environment. If you want that to continue, then we need another cool programming environment.

And I would advise that you abandon the idea of some grand unification at this stage; it's not going to happen.

In fact, on P2 one of the things I expect we will see is a few different kinds of programming environments. Having code that does cool stuff is a nice problem to have.

potatohead · 2015-10-12 16:11

As for learning, let's whittle this syntax down and resolve addressing for real.

Then we can start thinking about how people get in. Just as a brief note, I found myself wondering where code was actually running. More clarity on that should be a driver on our addressing solution.

David Betz · 2015-10-12 16:11

potatohead wrote: »

@jmg, the real question to ask is whether or not there will be that code that assembly language people write.

Having it standards compliant and all of that at this stage doesn't really mean much. Besides, we can write simple filters that'll move a lot of the code if it's really needed.

The reality is a lot of people ended up on P1 and did some neat stuff precisely because of the cool programming environment. If you want that to continue, then we need another cool programming environment.

And I would advise that you abandon the idea of some grand unification at this stage; it's not going to happen.

In fact, on P2 one of the things I expect we will see is a few different kinds of programming environments. Having code that does cool stuff is a nice problem to have.

Sure, we need a cool programming environment, but does it have to be based on Spin? Can't we come up with a way to make a C or C++ based environment "cool"? It's true that having lots of cool programming environments is a nice problem to have. My question is whether we have enough resources to generate more than one. It took many years to create all of the environments we now have on P1. Do we want to wait that long for P2 environments?

potatohead · 2015-10-12 16:15

Yes. I want to say more on this... later today.

Seairth · 2015-10-12 16:29

Please, feel free to say more. But could you move it to another thread that's specifically about that topic? I'm afraid that Chip is missing bug reports (like the one that @ozpropdev mentioned earlier). And, in my opinion, getting verilog bugs fixed is much more important right now.

jmg · 2015-10-12 16:33

potatohead wrote: »

@jmg,
And I would advise that you abandon the idea of some grand unification at this stage; it's not going to happen.

? Whose post are you replying to ?
I have never used the words "grand unification"

What I have done is give examples of small steps to make new users' lives easier and to make code more portable.

User Name · 2015-10-12 16:47

Coley wrote:
IMHO Hub exec mode is the icing on the cake and shouldn't be the driving force.
I use all sorts of micros depending on the task at hand, the Propeller is by far the easiest and most fun to use.
I really hope P2 can continue that tradition.

+1

David Betz wrote:
I was under the impression that it was relatively uncommon for most Propeller users to use PASM other than what is included in drivers that they download from OBEX.

It is FAR FAR FAR more likely for a Propeller user to use ASM than it is for an ARM user to use ASM. I will resist any effort to make the Propeller another ARM chip. At least 75% of all Propeller code I write is PASM because of the Propeller's irreplaceable and almost unique facility to allow a user to create precise custom high-speed signalling easily.

David Betz · 2015-10-12 16:49

User Name wrote: »

Coley wrote:
IMHO Hub exec mode is the icing on the cake and shouldn't be the driving force.
I use all sorts of micros depending on the task at hand, the Propeller is by far the easiest and most fun to use.
I really hope P2 can continue that tradition.

+1

David Betz wrote:
I was under the impression that it was relatively uncommon for most Propeller users to use PASM other than what is included in drivers that they download from OBEX.

It is FAR FAR FAR more likely for a Propeller user to use ASM than it is for an ARM user to use ASM. I will resist any effort to make the Propeller another ARM chip. At least 75% of all Propeller code I write is PASM because of the Propeller's irreplaceable and almost unique facility to allow a user to create precise custom high-speed signalling easily.

I don't doubt that that is true but are you a typical Propeller user or are you one of the few who provide OBEX code that the rest use?

jmg · 2015-10-12 16:50

Dave Hein wrote: »

Based on what Chip said, maybe it's only the addressing difference between cog and hub exec that complicates things. That's why I suggest eliminating un-aligned hubexec. The program counter would continue to increment by 1 when going from cogexec to hubexec mode. A PC address of $400 would map to a byte address of $1000 in hub RAM.

Complicates what ?
Chip has made the code relocatable, which is work now hidden from users.
The fine granularity of HUB is needed for BYTE data, and sure CODE does not have to be other than long aligned, but long aligned Code is a subset of what is working now.
Because a BYTE address is needed for Data, I'm guessing Chip has kept that for Code to keep the hardware simpler. (Plus, it does avoid wasting space with packers, but that is a less important point)

Dave Hein wrote: »

The only drawback I can see is that the first $1000 bytes of hub RAM can't be used for hubexec. I don't see that as a major problem. This memory could be used to hold a cog image that would be loaded into cog RAM. Most hubexec programs will call routines loaded in cog RAM for speed improvement.

This was why I suggested LUT:HUB could go to the top of the memory map.

Dave Hein wrote: »

I think some small tweaks can be done to the instruction streamer to improve performance. I think one simple tweak is for it to remember where it left off when calling routines in cog memory. This way it wouldn't have to be reloaded when returning from the cog memory routine. Another tweak would be to retain a certain number of longs in the streamer after execution. This way the streamer wouldn't have to be reloaded if just jumping back a few instructions. And one more tweak would be to not reload the streamer if jumping ahead by just a few instructions.

A cache would really be the best approach, but that may be too much to ask for at this point.

The question then becomes what size cache - and any cache still needs to be reloaded.
A Cache-reload would still need to wait for the COG-slot, as that is fundamental to the high memory bandwidth.
Given the good P2 support for block moves, I expect Software will eventually create some nifty resizable-caches, where it does load a block into COG and runs that.

The streamer design seems a nice, relatively simple, way to give every COG full memory bandwidth flows.
Sure, not quite a code-cache in conventional thinking, but there are 16 of these, so they cannot be overly large.

jmg · 2015-10-12 17:00

David Betz wrote: »

I would be willing to bet that very few Arduino users or even commercial users of the AVR, PIC, or ARM processors write much assembly code. We here in this forum are probably not representative of normal MCU users.

Yes, the general larger MCU use model, is to 'use ASM only when you have to'.
(Of course, that also gives laments of the hundreds of files the frameworks create and the tens of kbytes of overhead, seen in other forums...)

The P2 is a little different, and I expect first-usage will follow other MCUs with a large HUB compiler model, but there are 16 COGS and they are finite in size.
That means there will be more ASM code in a P2, than in other MCUs and that ASM code resource needs to be 'user harvestable'.

potatohead · 2015-10-12 17:56

Okie Dokie.

Dave Hein · 2015-10-12 18:13

jmg wrote: »

Complicates what ?
Chip has made the code relocatable, which is work now hidden from users.
The fine granularity of HUB is needed for BYTE data, and sure CODE does not have to be other than long aligned, but long aligned Code is a subset of what is working now.
Because a BYTE address is needed for Data, I'm guessing Chip has kept that for Code to keep the hardware simpler. (Plus, it does avoid wasting space with packers, but that is a less important point)

I was responding to Chip's comment about the complications of hubexec. Yes, Chip has made the code relocatable and hidden the complexities from users, but it seems like he regretted how much effort it required.

The fine addressing granularity is required for data, but it's not necessary for execution, so why complicate execution with it? And non-aligned accesses aren't required for data or execution.

David Betz · 2015-10-12 18:24

Dave Hein wrote: »

jmg wrote: »

Complicates what ?
Chip has made the code relocatable, which is work now hidden from users.
The fine granularity of HUB is needed for BYTE data, and sure CODE does not have to be other than long aligned, but long aligned Code is a subset of what is working now.
Because a BYTE address is needed for Data, I'm guessing Chip has kept that for Code to keep the hardware simpler. (Plus, it does avoid wasting space with packers, but that is a less important point)

I was responding to Chip's comment about the complications of hubexec. Yes, Chip has made the code relocatable and hidden the complexities from users, but it seems like he regretted how much effort it required.

The fine addressing granularity is required for data, but it's not necessary for execution, so why complicate execution with it? And non-aligned accesses aren't required for data or execution.

Chip has expressed a desire to intermix packed data with instructions and that makes non-aligned instructions necessary.

Roy Eltham · 2015-10-12 18:35

Cluso,
Why do you think having all 16 cogs running hubexec will be any hotter than 15 cogs running cogexec, and 1 running hubexec?

If anything, I would expect 16 cogs running hubexec to run slightly cooler than 16 cogs running cogexec. Mainly because the execution units will be stalled more often waiting on fifo refills on branches. The execution units are likely the hot points. The hub memory stuff is spread out around many separate memories, making it less likely to be a hot point because it's spread out.

I think the main thing that caused the P2-hot to be hot was the pipelining to get instructions to 1 cycle per. Since it had more of the cog execution stuff going at the same time.

potatohead · 2015-10-12 18:38

The massive busses didn't help either. This one has pins associated with COGS for some DAC ops... for that reason.

Heat sink it and ONWARD! ;p

Roy Eltham · 2015-10-12 21:01

I seriously doubt it will need a heat sink. Unless you try to overclock it significantly.

You'll want to run it off a decent lipo (or other lithium based battery) for most unplugged cases, instead of the old school small dry cell(s).

Cluso99 · 2015-10-12 21:13

Roy Eltham wrote: »

Cluso,
Why do you think having all 16 cogs running hubexec will be any hotter than 15 cogs running cogexec, and 1 running hubexec?

If anything, I would expect 16 cogs running hubexec to run slightly cooler than 16 cogs running cogexec. Mainly because the execution units will be stalled more often waiting on fifo refills on branches. The execution units are likely the hot points. The hub memory stuff is spread out around many separate memories, making it less likely to be a hot point because it's spread out.

I think the main thing that caused the P2-hot to be hot was the pipelining to get instructions to 1 cycle per. Since it had more of the cog execution stuff going at the same time.

This is because one of the P2-HOT issues was that the hub was a singular block that was being accessed regularly. Chip broke this down to 16 blocks, with each block being a staged lowest 4bit address (ie each consecutive address can be accessed by a cog in parallel - hence the egg-beater definition). So now we can have every cog accessing a separate hub block in parallel, and will be doing so on almost every clock in hubexec. So the full 16 blocks of hub will be being accessed on each clock, one clock for each cog.
Sure, there will be a bit of a hub slot delay with hubexec, but not that much.
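To make the "egg-beater" picture concrete, here is a rough Python sketch. It assumes the low 4 bits of a long address select the block, and that each cog's view rotates by one block per clock; the rotation rule is an assumption for illustration, not taken from the Verilog.

```python
SLICES = 16  # hub RAM blocks, as described above
COGS = 16


def hub_slice(long_address):
    """Block holding a given long: selected by the lowest 4 address bits."""
    return long_address & 0xF


def slice_for_cog(cog, clock):
    """Block a cog can reach on a given clock (assumed simple rotation)."""
    return (cog + clock) % SLICES


def all_slices_busy(clock):
    """True if the 16 cogs together touch every block on this clock."""
    touched = sorted(slice_for_cog(cog, clock) for cog in range(COGS))
    return touched == list(range(SLICES))


print(hub_slice(0x123))                              # 3
print(all(all_slices_busy(t) for t in range(32)))    # True
```

Under these assumptions no two cogs ever share a block on the same clock, so with all 16 cogs streaming code from hub, every block is active on every clock. That is the power concern being described.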

So, the power being used will be significantly higher with hubexec. Just how much, I have absolutely no idea. Cog and Lut exec will not access hub regularly, as happens now with the P1.

BTW I do understand that there were other issues with P2-HOT.

potatohead · 2015-10-12 21:47

Roy Eltham wrote: »

I seriously doubt it will need a heat sink. Unless you try to overclock it significantly.

You'll want to run it off a decent lipo (or other lithium based battery) for most unplugged cases, instead of the old school small dry cell(s).

I don't think so either. Just being flippant

cgracey · 2015-10-12 22:17

I just posted a new file at the top of this thread.

I made some assembler changes which are going to make programming a lot easier, regarding addresses and branching.

Notice two small differences here?

dat
		orgh	0
'
' launch cogs 15..0 with blink program
' cogs that don't exist won't blink
'
		org

:loop		coginit	cognum,#`blink
		djns	cognum,:loop

cognum		long	15
'
' blink
'
		org

blink		cogid	x		'which cog am I?
		setb	dirb,x		'make that pin an output
		notb	outb,x		'flip its output state
		add	x,#16		'add to my id
		shl	x,#18		'shift up to make it big
		waitx	x		'wait that many clocks
		jmp	blink		'do it again

x		res	1		'variable at cog register 8

I also added word and long alignment directives.

Seairth · 2015-10-12 22:26

I already don't like the tick mark. It is both visually subtle and easy to confuse with the comment mark.

As for the removal of "@", I'm somewhat intrigued. Personally, I was fine with the "@", so I don't know what the value of removing it is.

Electrodude · 2015-10-12 22:33

Is there still a way to do an indirect absolute jump?
