>This is faster than mux'ing
Do you get the feeling that the RAM will now be much faster, going down from 128-bit to 32-bit etc., so that 250 MHz is now possible?
>So calculating determinism is not going to be straightforward.
Did you not see my post above? Who is asking you to do the calculations in your head? The compiler will have a simulator/trace mode.
Is anyone adding this simulator/trace to the compiler?
It's much better if a person (at least moderately bright) can understand the consequences of timing in their head.
Even with this scheme, the Propeller will still be quite a slow chip compared to its competitors...
12.8 GB/s exceeds anything available in the ARM hardware I deal with on a daily basis. For specific applications it could clobber the competition.
WRT determinism...the determinism I care most about - cog to pin timing - isn't broken at all by the new hub scheme.
WRT the new scheme...much of our I/O is multiplexed - raster scans on TV screen, LED matrix displays, keyboard scanning, etc. Why not multiplexed memory? Who cares if it isn't linear-contiguous?
>It's much better if a person (at least moderately bright) can understand the consequences of timing in their head.
Are you referring to the new %xxxx1111 hub-address round robin, or do you want to implement a different one that is easier to calculate in your head?
There are many variables you have to take into account when you calculate the timings in a REP loop, but it's not a random lottery (unless your code doesn't really care anyway):
1: how far apart are the two hub addresses, and is A before B or B before A?
2: is one hub pointer auto-incrementing/decrementing?
3: how many regular 2-cycle cog instructions are there before/between/after the two rdlongs?
If you know all that you can calculate it, but I don't recommend doing it in your head.
A JavaScript simulator could show a REP16 result.
Byte access will be best done by doing long or block reads/writes, and then using the getbyte/setbyte instructions to manipulate/isolate byte sized units. Even on P1 HUB access was best done in longs.
A loop reading contiguous addresses will be able to do so with each hub cycle giving one read, performing (I believe) 7 instructions + 1 rdlong. I'm not sure why you guys think it won't? Your first read will be a 1-16 clock wait, then you read your value, then you perform 16 clocks of instructions, and when you read again the window will be at your previous address + 1 long (4 bytes), exactly where you want it. If you read bytes/words, then you do 14 cycles of instructions and your next read will get the same long masked to the byte/word you care about. Then every 4 (or 2) reads, you'll have an extra 2-cycle wait to get the next long-aligned address. You could unroll your loop to utilize those extra cycles if needed.
Reading/writing longs/words/bytes is very similar to how it worked before (assuming sequential hub access ordering). The bonus block reads/writes give us the extra performance win. It's really not more complex to use, or non-deterministic.
In fact, for the block read/write stuff it's more deterministic, because there is never a wait for the hub, you always start with whatever window you are on.
There would be special instructions RDBLOC/WRBLOC to handle the transfers of 16 longs. Regardless of the initially-available window, it would always take 16 clocks (+1 for the memory read delay).
So will it be possible to execute RDBLOC/WRBLOC and then execute other instructions that don't use the HUB or the addresses being read/written while the transfer happens, or do these instructions stall the cog for 16 clocks?
>If you read bytes/words, then you do 14 cycles of instructions and your next read will get the same long masked to the byte/word you care about.
Did you intend 14? As I understood it, RDBLOCK uses 2 clocks for the opcode itself and then 16 clocks to transfer the data, thus a total of 18 clocks.
Or can you have e.g. 4 consecutive RDBLOCKs for a burst read of 64 longs in 64 clocks?
This subject is slightly beyond my pay grade. I think I get the global picture, but I am pinching myself because of what I think it might imply for some of my aspirations. Could you comment on what you think this scheme will do for acquisition bandwidth of 8 bit image data (as from digital cameras, which typically generate 8 bit pixels from 6-48MHz). What will this do for video display capabilities?
What effects does this scheme have on external memory storage and access?
When we initiate a block transfer does this stall the PASM code until it is done?
My understanding is that RDBLOCK will take 2 clocks to setup the transfer, then the very next hub slot will start the transfer of a long, and continue for 16 clocks total, by which time all 16 longs will have transferred. The cog effectively stalls while transferring, so effectively the instruction takes 18 clocks. Alignment is automatically handled.
When accessing the hub memory randomly, like storing a byte from FDX (uart) and then adjusting the head pointer, determinism is almost impossible to calculate. However, in this example determinism is not required either, providing the hub is accessed within certain time limits. The worst case can be calculated, although it's a bit more complex.
>Maybe this isn't the solution we want? It should be usable without relying on a tool.
It will all be OK, we will learn a few patterns of HUB address alignment and number of instructions that will produce 0-1 cycle stalls.
All printed examples will use this handful of code-arrangement patterns, and we'll visually recognize them right away.
Put in the comment section: don't insert additional code here, don't change the order of hub variables here.......
Cog N starts a transfer as follows...
REPS #20,#1
WRBYTE <flagptr>,#1 'set "we are running" flag
NOP 'REPS requires 2 instructions before the loop
WRBLOCK <hubptr>,<cogptr> 'copy 16 longs
ADD <hubptr>,#(16*4) 'ptr++
ADD <cogptr>,#16 'ind++
and cog R would...
loop RDBYTE <flagptr>, tmp WZ
if_z jmp #loop 'wait for ready
' need to wait a total of ~16 clocks (allows for non-alignment of WRBLOCK)
REPS #20,#3
nop
nop
RDBLOCK <hubptr>,<cogptr>
ADD <hubptr>,#(4*16)
ADD <cogptr>,#16
You wrote "RDBLOCK <hubptr>,<cogptr>". I don't remember what Chip proposed, but, if RDBLOCK is done the same way as other RDxxxx instructions, the destination field is the cog address and you'd have to add a 32-bit constant with the 16 shifted left into the destination address field position. Will that work with the REPS?
dMajo,
4 consecutive RDBLOCKs would take 4*(2+16)=8+64=72 clocks for 4*16 longs.
I don't think so.
If it is 16 clocks total then it is 4*16=64.
If instead it takes 2+16 then it will miss the window at the next transfer, thus it will become 4*(2+16+14 wait/stall)=4*32=128.
Or ... as you intend, 72 clocks (it takes whatever it has in front of it) ... what is still worse, the first transfer will e.g. transfer hubmem 0 to 15, the next hubmem 2 to 15 then 0 to 1, the next hubmem 4 to 15 then 0 to 3 ... and so on ... with non-aligned transfers ... if this is a data array it will mess up your indexes.
>...if RDBLOCK is done the same way as other RDxxxx instructions, the destination field is the cog address and you'd have to add a 32-bit constant with the 16 shifted left into the destination address field position. Will that work with the REPS?
Yes, I think you are correct - I have reversed the cog and hub addresses - it's 4:40am here so I'll blame being sleepy ;) But I am not following the constant/shift parts. The <hubptr> and <cogptr> are cog registers, so I am not using self-modifying instructions.
Yes I am tired. I would need to add to the RDBLOCK instruction and that might require a nop too.
>Byte access will be best done by doing long or block reads/writes, and then using the getbyte/setbyte instructions to manipulate/isolate byte sized units. Even on P1 HUB access was best done in longs.
I'm not sure about this statement... For me, on P1, if I want a byte, I'd just do rdbyte. How is it more efficient to read a long and then extract a byte from it?
However, at least with this scheme you can regain determinism should you really need it (but at the cost of some additional code complexity and of course the loss of any speed advantage this scheme might have had).
Nope, determinism is not 'lost' as you seem to believe - it depends on what your code is doing, and determinism has always depended on user code design.
Now the factors are just a little different.
I can see many ways to code and know exactly how many cycles will be needed, and ways to gain much better than 16N when moving data to or from the HUB.
(And I'm fairly sure that includes HUB to/from Pins, when Chip replies to my earlier questions.)
There are also some simple improvements to Pointer Handling that can work very well with this scheme, to remove additional code complexity and give higher bandwidths to FIFOs/buffers.
>It will all be OK, we will learn a few patterns of HUB address alignment and number of instructions that will produce 0-1 cycle stalls.
Hate to say it, but such recipes (algorithms) and some nicely presented theory of operation should be in a manual (that very few hobbyists will read).
I'm guessing at this point, that when we have an fpga .bit file, we can empirically discover the recipes based on the perceived theory of operation.
If we had a nicely written white paper (Chip? Roy?) on what the theory is, then we could test against it. Forum snippets are great, but scattered at best.
>If instead it takes 2+16 then it will miss the window at the next transfer ... with non-aligned transfers ... if this is a data array it will mess up your indexes.
I am fairly certain it will be 4*(2+16), because the beauty of this RDBLOCK is that it starts at any hub window (whatever the low 4 bits of the address are) and it will copy each long from there, wrapping when it gets to $xxxxxF, until it has copied 16 longs. The fact that the next RDBLOCK starts another 2 clocks later just means the copy starts at a 2-long-higher address and wraps 2 longs sooner. Hope you understand this, else re-read Chip's post.
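The wrap-around behaviour described above can be sketched in JavaScript (the thread already proposes JavaScript simulators); the 16-window model and the window-equals-slot assumption are mine, not confirmed silicon behaviour:

```javascript
// Model: a block read may begin at any hub window 0..15 and wraps from
// window 15 back to 0, but each long still lands in its matching slot,
// so the block arrives intact regardless of the starting window.
function rdblockModel(hub, startWindow) {
  const block = new Array(16);
  for (let i = 0; i < 16; i++) {
    const w = (startWindow + i) & 0xF; // window visible on this clock
    block[w] = hub[w];                 // long w is copied to slot w
  }
  return block;
}

const hub = Array.from({ length: 16 }, (_, i) => 100 + i);
// Starting at window 0 or mid-rotation at window 5 gives the same block.
console.log(rdblockModel(hub, 0).join() === rdblockModel(hub, 5).join()); // true
```

So under this model the start window only changes the transfer order, never the contents, which is the point being argued.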
>we can empirically discover the recipes based on the perceived theory of operation.
it's not a theory, it's science.
Once I know the round robin slot allocation, as it may not be from $xxxx0 to $xxxxF, but instead 0-8-1-9-2-A.......
I will write a JavaScript simulator where you can adjust the number of cog instructions in between two RD/WRLONGs.
And also whether one auto-increments, and change the B pointer's hub location relative to the A location (±8 spots), and it will simulate a REP16 and then graph the clock stalls for each loop iteration.
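As a starting point for such a simulator, here is a minimal JavaScript helper; the assumptions (16 windows, window = low 4 bits of the long address, one window advance per clock, sequential $0..$F slot order) are exactly the ones still in question above:

```javascript
// Clocks a cog must stall before the window for a given hub long address
// comes round, given the current clock. Assumes sequential slot order
// $xxxx0..$xxxxF advancing one window per clock.
function stallClocks(clock, hubLongAddr) {
  const window = clock % 16;          // which hub long is visible right now
  const wanted = hubLongAddr & 0xF;   // window that address lives in
  return (wanted - window + 16) % 16; // clocks until it comes round again
}

console.log(stallClocks(0, 5));  // 5: wait five clocks for window 5
console.log(stallClocks(5, 5));  // 0: already aligned
console.log(stallClocks(15, 0)); // 1: wraps from window 15 back to 0
```

A full simulator would call this between each pair of simulated RD/WRLONGs and accumulate the stalls per loop iteration.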
>I'm not sure about this statement... For me, on P1, if I want a byte, I'd just do rdbyte. How is it more efficient to read a long and then extract a byte from it?
Ray, you can usually read a long and do operations on all four bytes in that long in less time than it takes to do 4 rdbytes with interleaved code. In fact, you can often get down to every other window, or even every window, depending on how complex your byte operations are. If you just want to read a single byte, then sure, rdbyte is easiest and similarly efficient.
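For illustration, the GETBYTE/SETBYTE-style extraction Roy describes looks like this in JavaScript (the function names just mirror the instruction names; this is a model, not the instruction spec):

```javascript
// Extract byte lane n (0..3) from a 32-bit long.
function getbyte(long, n) {
  return (long >>> (8 * n)) & 0xFF;
}

// Replace byte lane n of a 32-bit long with b.
function setbyte(long, n, b) {
  const shift = 8 * n;
  return ((long & ~(0xFF << shift)) | ((b & 0xFF) << shift)) >>> 0;
}

// One rdlong-equivalent delivers all four bytes:
const long = 0x44332211;
console.log(getbyte(long, 0).toString(16)); // "11"
console.log(getbyte(long, 3).toString(16)); // "44"
console.log(setbyte(long, 1, 0xAA).toString(16)); // "4433aa11"
```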
>If it is 16 clocks total then it is 4*16=64 ... or ... 72 clocks ... with non-aligned transfers ... it will mess your indexes.
dMajo,
I think you are missing a key part of the RDBLOC instruction. It can start its 16-long read at whatever hub window is currently lined up with the cog, and it reads a long from each of the next 16 windows (wrapping around from 15 to 0 if it started in the middle). At the end it will have read 16 longs in 18 clocks, and if you start another RDBLOC as the next instruction it will read 16 longs in 18 clocks total also. So 4 consecutive RDBLOC instructions would read 64 longs in a total of 72 clocks.
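The clock counts in that explanation reduce to simple arithmetic; a quick JavaScript check (2 setup clocks + 16 transfer clocks per block is Roy's figure above, not something I've verified against hardware):

```javascript
// Total clocks for n back-to-back RDBLOC/WRBLOC transfers:
// each costs 2 setup clocks plus 16 clocks moving one long per window.
function blockTransferClocks(nBlocks) {
  const SETUP = 2;      // opcode/setup clocks per block instruction
  const TRANSFER = 16;  // 16 longs, one per clock
  return nBlocks * (SETUP + TRANSFER);
}

console.log(blockTransferClocks(1)); // 18 clocks for 16 longs
console.log(blockTransferClocks(4)); // 72 clocks for 64 longs
```

Note there is no per-block re-alignment stall term: the transfer starts on whatever window is current, which is why consecutive blocks cost 18 clocks each rather than 32.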
>As I understood rdblock uses 2 clocks for the opcode itself and then 16 clocks to transfer the data thus a total of 18 clocks.
In that post I was not talking about RDBLOC, I was talking about normal RDLONG/WORD/BYTE stuff, which can take 2 clocks if used with PTRx. So you get it plus 7 other instructions (for a total of 16 clocks) per hub cycle. So if you want to read consecutive longs you can do 7 instructions plus the rdlong, and stay "in sync" with the hub window you care about. If you are using RDBYTE, then you need to do one less instruction, so you end up pointing at the same window again 4 times, and then the 5th byte will have a 2-cycle stall, but then the next 3 will be in sync, and so on. For RDWORD, it's 2 reads in sync then a stall (again only doing 6 instructions + the RDxxxx). Or you can unroll your loop and use the wait slots to get another instruction in there every 4th read for bytes, and every other read for words.
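The cadence arithmetic in that explanation can be checked mechanically; a tiny sketch, taking the thread's assumed figures (2-clock RDxxxx with PTRx, 2-clock plain instructions, 16 clocks per full hub rotation) at face value:

```javascript
// How many plain 2-clock instructions fit alongside one read so that the
// loop consumes exactly one full hub rotation and stays in sync.
function fillerInstructions(readClocks, rotationClocks) {
  return (rotationClocks - readClocks) / 2;
}

console.log(fillerInstructions(2, 16)); // 7 instructions + 1 rdlong per window
```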
>Could you comment on what you think this scheme will do for acquisition bandwidth of 8 bit image data ... What effects does this scheme have on external memory storage and access? ... When we initiate a block transfer does this stall the PASM code until it is done?
First of all, yes, when you do a RDBLOC or WRBLOC it will take 18 clocks to do the whole thing and nothing else happens on that cog. It will read/write 16 longs (64 bytes) of data per 18 clocks, so for something like capturing data from a camera module, you should be able to set up a loop that reads a "line" from the camera module into cog memory, and then quickly dump that to HUB memory during the hblank time. If you want to handle really fast camera modules, then you may need to set up multiple cogs to take turns reading from the module and writing the results out to HUB, interleaved such that the end result is a normal linear buffer in HUB memory.
Moving data to/from external memory should be about 4 times faster with this scheme than it was with the previous one. (It's even 2 times faster than what we had on the old P2 design with its RDWIDE (8-long transfers).)
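For a rough feel of what that means for the camera case, back-of-envelope arithmetic (the 200 MHz system clock is a hypothetical figure chosen only for illustration; 64 bytes per 18 clocks is from Roy's post above):

```javascript
// Sustained block-transfer bandwidth in MB/s: 64 bytes move every 18 clocks.
function blockBandwidthMBps(clockHz) {
  return (64 / 18) * clockHz / 1e6;
}

console.log(Math.round(blockBandwidthMBps(200e6))); // ~711 MB/s
```

Even at far lower clocks that comfortably covers the 6-48 MHz 8-bit pixel streams asked about, bandwidth-wise; the harder part is the per-pixel capture loop, as Roy notes.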
>it's not a theory, it's science.
Tony, science uses experiment to prove theory. Theory of operation is a phrase referring to how something works. Science is a voyage of discovery through hypothesis (perceived theory or just a guess) and experiment which eventually proves theory. Working through that, we get understanding. That is what science brings us on a basic level. Understanding.
>I will write a JavaScript simulator where you can adjust the number of cog instructions in between two RD/WRLONGs
I think there is a need for advancement control, but that should be done at the Pointer, as COGs will all be running different code.
It is there that a JavaScript simulator could help confirm to users that they have it right.
>Where, for 90% of real world embedded processor cases, x=1 or 2.
I look at compiler generated assembly code all the time. It's not nearly as bad as you are making it out to be. In a typical stream of 16 instructions you'll find on average 2-3 branches, and MOST of the time those branches, if taken, skip over only 1 or 2 instructions. So you often end up executing more than half of those 16 instructions, quite often 2/3rds+. So perf won't be bad at all on typical code. And if we end up with 2 or 4 "cache" lines for hub exec mode, then we'll probably end up a lot closer to full speed for a lot of code. And obviously, the code will be able to be much larger than 512 instructions.
If your code is small enough to fit entirely in a cog, then great, but if it doesn't fit, then you are going to have to do something that hits HUB. Once you do that you are never going to be going 100% speed.
>It's much better if a person (at least moderately bright) can understand the consequences of timing in their head.
So you (obviously being more than moderately bright) feel it is really so overwhelming?
Maybe this isn't the solution we want? It should be usable without relying on a tool.