Turbo Bytecode Blowout - Interpret hub bytes in 8+ clocks!

jmg · 2017-04-08 21:18

evanh wrote: »

Doing it in hardware without a large cache per Cog would be terrible at best.

Nope, tho I have no idea what your 'terrible' or 'large cache' means here, it is hardly an engineering number !!.

For some real-world numbers on actual HyperRAM speed, lets look around :
https://warmcat.com/embedded/hardware/lattice/hyperbus/hyperram/2016/09/19/hyperbus-implementation-tips-on-ice5.html

and some timings from the data I posted earlier
http://forums.parallax.com/discussion/comment/1398973/#Comment_1398973

Data shows access times of the order of ~ 20 edges (at 100MHz that is 100ns, at 40Mhz 250ns).
After that, data can burst, and 20 bytes gives data equal time to header/preamble.
ie at payloads as small as 20 bytes, header cost is 50% of bandwidth.
With a payload of 32 bytes, that is ~38%, and 64 bytes it is ~ 24%

Not as fast as parallel SRAM, but a whole lot less pins, and hardly 'terrible', nor is 64 bytes what most would call a "large cache" ? - quite the opposite.

The open questions around P2 are more how does the streamer play with HyperRAM ?
I think you can define a Clock count, and Data Size, but can that clock count be DDR or edges, not pulses ?
(ie if DDR cell is not implemented, a simple divide by 2 looks to be needed on CLK pin ?)

Other detail, is how does Direction in streamer change, when you flip from command.address write, to Data In ?
Is that first edge synchronous ?
Most compact code will be 2 chunks: a Header, then Data, but 3 chunks may be possible.

Header write needs to be first 6 edges (P2 drives BUS), but the next latency edges depend on clock speed, and bus is ignored, so anywhere here bus direction can change. eg One 3 chunk split could be 10H+(20-10)L+N.Data ?
The devil is in the details.

potatohead · 2017-04-08 21:54

evanh wrote: »

potatohead wrote: »

Can't we use the events to do this with a data fetch and queue COG or two? Not as fast as hardware, but a lot should be possible.

Ya, that's why I've been saying no need to ask for XIP. Doing it in hardware without a large cache per Cog would be terrible at best. And obviously, with the Prop, a large set of caches means the end of HubRAM, which is not the same objective any longer.

Seems like some hinting could be done too. Where program flow is known, indicate the need to improve on the throughput of the external storage. Will be interesting to see what people end up doing.

jmg · 2017-04-08 22:13

potatohead wrote: »

Seems like some hinting could be done too. Where program flow is known, indicate the need to improve on the throughput of the external storage. Will be interesting to see what people end up doing.

Yes, and the new SKIP instructions could reduce the number of address-changes in code, which could mean less need to write a new address (and helps in hubexec too).
However, even the cost of a change of address at ~20 edges, is not high, just 5 x 32b opcodes.

evanh · 2017-04-08 22:51

jmg wrote: »

evanh wrote: »

Doing it in hardware without a large cache per Cog would be terrible at best.

Nope, tho I have no idea what your 'terrible' or 'large cache' means here, it is hardly an engineering number !!.

Those were useless links. Nothing dealing with interleaving of multiple Cogs.

evanh · 2017-04-08 22:57

Every clock is a new address without caching.

jmg · 2017-04-09 00:31

evanh wrote: »

Every clock is a new address without caching.

Wow... Really, another sweeping, incomprehensible claim ?

evanh wrote: »

Those were useless links. Nothing dealing with interleaving of multiple Cogs.

Ok, sorry I had assumed readers would know the very basics about HyperRAM, clearly that was optimistic.

Here is a link you should read.
http://www.cypress.com/file/183506/download.

That shows that every clock is clearly not 'a new address without caching", it takes many clocks to send the command and address fields, then some more clocks to match latency, then data arrives on the pins, one byte per edge.

That is the hardware level reality, which is the topic of conversation.
ie How can the P2 streamer best interact with HyperRAM, to give the best practical bandwidth ?

potatohead · 2017-04-09 00:45

Well, how about some sample code as a thought experiment?

Seems tough to explore this without some hardware to test on. Maybe someone can make a few boards.

And there are a few cases:

One is big program, single or paired COG. That one seems fairly low controversy.

Another is multiple COGS running a code body, or code bodies from XMM.

Then come overlays and other deterministic type use cases. We know what is needed, so conflicts can be minimized and latency can be interleaved with other activity. Things like fetching a new computemail module, or UX routine at given time, or user menu pick. Considerable swap and or fetch time may well be just fine here.

Or,

We use it as data mostly, maybe in tandem with overlays. SPIN byte code is gonna be compact. We may find few people really need XMM for programs as much as they do data, or again, overlays.

jmg · 2017-04-09 01:25

potatohead wrote: »

Well, how about some sample code as a thought experiment?

Seems tough to explore this without some hardware to test on. Maybe someone can make a few boards.

A few boards are already around, but actual HyperRAM test info is rarer...

http://forums.parallax.com/discussion/164923/new-boards-in-for-my-project-p1-max10m25

http://forums.parallax.com/discussion/164540/usb-usd-hyperram-flash-camera-test-board-for-p123/p1

http://forums.parallax.com/discussion/comment/1377007/#Comment_1377007

and other FPGA boards, with HyperRAM
https://shop.trenz-electronic.de/en/search?sSearch=HyperRAM

evanh · 2017-04-09 03:23

A single Cog is no problem, and never was a problem. I've not ever commented on that.

I'm solely addressing multiple concurrent Cogs. Which would be a requirement if hardware XIP was being asked for. My only criticism is of expectations that XIP could work under this criteria. It would be rubbish without caching.

jmg · 2017-04-09 08:56

evanh wrote: »

A single Cog is no problem, and never was a problem. I've not ever commented on that.
I'm solely addressing multiple concurrent Cogs.

Hmm...., then seems we have two quite separate topics going here.
Multiple COGs are not mandatory for fast operation of HyperRAM, and it seems smartest to first get one COG working well, using the streamer (as that is the best suited P2 resource) and getting real numbers, before making leaps as to what actual value your rather vague 'rubbish' might have.

evanh · 2017-04-09 09:47

I'm only addressing the issue around hardware implementation of XIP.

Conga · 2017-04-11 10:24

I'm still thinking about the XBYTE feature's selection of bits to match in hardware, and how it determines the number of LUT RAM longs used.

Matching from MSB --- from 8 bits (all) to 1 bit --- seems better for bytecode:
emulating existing ISAs, or
any byte VM that a programmer might want to define.

Matching from LSB seems better for state machines.
Now I see what Chip was suggesting.
He did not try hard to "sell" the idea, and I'm afraid that state machines did not get a fair hearing.

*Any* selection of bits to match (up to 8 bits) can be implemented with a suitably populated table in LUT RAM, duplicating entries when two bit patterns are supposed to be equivalent.

Given the above, I think that the hardware support should allow matching at both ends
while helping minimize usage of LUT RAM per table.

If this is accepted as the driver for the design of the XBYTE feature,
then the most useful patterns to support in hardware would be:
- 8 bits (match all);
- 7, 6, 5 bits from MSB *and* from LSB;
- 4 bits from MSB or from LSB, since
there's only one value left in the "magic" address range ($1F8 to $1FF).

Conga · 2017-04-11 10:36

Actually adding a few more "magic" addresses that trigger XBYTE is possible:
PA and PB are the obvious candidates:
- they are contiguous to the other 8 "magic" addresses, and
- they cannot be used to hold code (when using XBYTE) since they are rewritten for each byte executed.

cgracey · 2017-04-11 10:48

Good thoughts, Conga. I wish we had one more bit somewhere. Then, we could just pick from MSBs or LSBs.

Conga · 2017-04-11 13:37

cgracey wrote: »

Good thoughts, Conga. I wish we had one more bit somewhere. Then, we could just pick from MSBs or LSBs.

What's wrong with extending the "magic" range to include $1F0 to $1F7 ?

Only $1F1 to $1F7 are actually needed, and even less would be still useful.

Bill Henning · 2017-04-11 16:42

Awesome!

cgracey wrote: »

jmg wrote: »

Roy Eltham wrote: »

Other bytecode setups I have seen often have the lowest bits being some form of index or offset, so they would prefer the MSBs being the code and the LSBs being the data.

CIL and Java have the small constants and groups, all adjacent coded, so that means LSBs set the index.
However, pushing the bytecode itself to higher bits, makes the tables more sparse, and larger ?

If you don't care about trying to compress the bytecode table, you can always just use the whole byte as the index.

cgracey · 2017-04-11 22:56

Conga wrote: »

I'm still thinking about the XBYTE feature's selection of bits to match in hardware, and how it determines the number of LUT RAM longs used.

Matching from MSB --- from 8 bits (all) to 1 bit --- seems better for bytecode:
emulating existing ISAs, or
any byte VM that a programmer might want to define.

Matching from LSB seems better for state machines.
Now I see what Chip was suggesting.
He did not try hard to "sell" the idea, and I'm afraid that state machines did not get a fair hearing.

*Any* selection of bits to match (up to 8 bits) can be implemented with a suitably populated table in LUT RAM, duplicating entries when two bit patterns are supposed to be equivalent.

Given the above, I think that the hardware support should allow matching at both ends
while helping minimize usage of LUT RAM per table.

If this is accepted as the driver for the design of the XBYTE feature,
then the most useful patterns to support in hardware would be:
- 8 bits (match all);
- 7, 6, 5 bits from MSB *and* from LSB;
- 4 bits from MSB or from LSB, since
there's only one value left in the "magic" address range ($1F8 to $1FF).

I found a way to handle both MSBs and LSBs.

One problem we have had is that shifting the bytecode to get the MSBs in index position and then adding it to the SETQ base value takes so long that we start forming a critical path. What is better to do is OR the index and base together, at some split point, thereby getting rid of the adder chain.

So, the LSB of the SETQ value (which never got used, actually) now selects between MSB (1) and LSB (0) index orientation. The three LSBs from the RET to $1F8..$1FF determine where the split is, between SETQ base address and the shifted/masked bytecode index.

Here are four sets of instructions that kick off XBYTE with different parameters:

	push	#$1F8		'select 8-bit lookup
_ret_	setq	#$100		'select $100 LUT base and full-byte lookup

	push	#$1FC		'select 4-bit lookup
_ret_	setq	#$1F0+1		'select $1F0 LUT base and MSBs lookup

	push	#$1FC		'select 4-bit lookup
_ret_	setq	#$1F0+0		'select $1F0 LUT base and LSBs lookup

	push	#$1FF		'select 1-bit lookup
_ret_	setq	#$1FE+1		'select $1FE LUT base and MSB lookup

	push	#$1FF		'select 1-bit lookup
_ret_	setq	#$1FE+0		'select $1FE LUT base and LSB lookup

jmg · 2017-04-11 23:06

cgracey wrote: »

I found a way to handle both MSBs and LSBs.

One problem we have had is that shifting the bytecode to get the MSBs in index position and then adding it to the SETQ base value takes so long that we start forming a critical path. What is better to do is OR the index and base together, at some split point, thereby getting rid of the adder chain.

Cool, so this is smaller and faster and more flexible too ?

cgracey · 2017-04-12 00:01

jmg wrote: »

cgracey wrote: »

I found a way to handle both MSBs and LSBs.

One problem we have had is that shifting the bytecode to get the MSBs in index position and then adding it to the SETQ base value takes so long that we start forming a critical path. What is better to do is OR the index and base together, at some split point, thereby getting rid of the adder chain.

Cool, so this is smaller and faster and more flexible too ?

That's right. It only lets you set the LUT base at integral positions (i.e. 1-bit lookup tables must be based at even addresses, while 8-bit lookup tables can only start at $000 or $100).

Roy Eltham · 2017-04-12 02:58

Chip,
This is excellent! Now we have both ways!

Conga · 2017-04-12 08:05

cgracey wrote: »
Here are four sets of instructions that kick off XBYTE with different parameters:
...
	push	#$1F4		'select 4-bit lookup
_ret_	setq	#$1F0+1		'select $1F0 LUT base and MSBs lookup

	push	#$1F4		'select 4-bit lookup
_ret_	setq	#$1F0+0		'select $1F0 LUT base and LSBs lookup
...

Thank Chip, this is great!

I guess you intended to write #$1FC instead of #$1F4 for 4-bit lookup.

By the way, is it enough to push *once* when changing the number of bits to lookup
or we need to repeat 8 times the push of a "magic" address?

In other words, what exactly is required to make the internal stack provide the desired value for *any* number of unmatched RETs and POPs?

cgracey · 2017-04-12 16:32

Conga wrote: »
cgracey wrote: »
Here are four sets of instructions that kick off XBYTE with different parameters:
...
	push	#$1F4		'select 4-bit lookup
_ret_	setq	#$1F0+1		'select $1F0 LUT base and MSBs lookup

	push	#$1F4		'select 4-bit lookup
_ret_	setq	#$1F0+0		'select $1F0 LUT base and LSBs lookup
...
Thank Chip, this is great!

I guess you intended to write #$1FC instead of #$1F4 for 4-bit lookup.

By the way, is it enough to push *once* when changing the number of bits to lookup
or we need to repeat 8 times the push of a "magic" address?

In other words, what exactly is required to make the internal stack provide the desired value for *any* number of unmatched RETs and POPs?

Yes, that should be $1FC. Thanks for noticing.

It's only necessary to push once, since the stack is not popped if the return address is $1F8.. $1FF.

cgracey · 2017-04-15 00:00

I was writing the brancher code snippet for the interpreter and I had a nice realization.

SKIPF patterns are used by XBYTE. Normally, you would probably not want to have any '1' bits (1=ignore) after a conditional RET instruction, because a fork is created if the branch takes. XBYTE solves this, actually, by resetting the EXECF bit pattern as part of its behavior. So you can have a conditional RET instruction without any worries about it executing and still having '1' bits left in the SKIPF pattern. If the RET takes and the stack address is $1F8..$1FF, then it executes the hidden XBYTE instruction, which does an RFBYTE, a RDLUT, and then an EXECF (which is a branch plus SKIPF). This resets the SKIPF pattern to the new bytecode's pattern. So, no problem putting early conditional RET's into your code snippets!

Writing this code brought up the concern before I realized there wasn't actually any problem. Note that there are conditional RET's in the middle of the snippet. There's no lingering SKIPF problem if they execute! Life is simple, afterall.

'
' Branches
'
br_byte		rfbyte	pa		'	a b c d e f			a,b,c: byte add, z, nz
br_word		rfword	pa		'	| | | | | | g h i j k l		d,e,f: byte sub, z, nz
br_long		rflong	pa		'	| | | | | | | | | | | | m n o	g,h,i: word add, z, nz
		test	x	wz	'	| b c | e f | h i | k l | n o	j,k,l: word sub, z, nz
		popa	x		'pop	| b c | e f | h i | k l | n o	m,n,o: long rel, z, nz
	if_z	ret			'	| b | | e | | h | | k | | n |
	if_nz	ret			'	| | c | | f | | i | | l | | o
		add	pb,pa		'	a b c | | | g h i | | | m n o
		sub	pb,pa		'	| | | d e f | | | j k l | | |
	_ret_	rdfast	#0,pb		'branch	a b c d e f g h i j k l m n o

You see in that 10-instruction snippet, there are actually 15 different routines that would have taken 75 instructions to code discretely. And no clock cycle gets wasted!

jmg · 2017-04-15 01:13

cgracey wrote: »

You see in that 10-instruction snippet, there are actually 15 different routines that would have taken 75 instructions to code discretely. And no clock cycle gets wasted!

Nifty, but starting to look like some tool that extracts your column-tags into masks could be useful to have ?
Maybe a 'MaskName=' in vertical above the columns, would give enough for the tool to work with ?

cgracey · 2017-04-15 02:49

jmg wrote: »

cgracey wrote: »

You see in that 10-instruction snippet, there are actually 15 different routines that would have taken 75 instructions to code discretely. And no clock cycle gets wasted!

Nifty, but starting to look like some tool that extracts your column-tags into masks could be useful to have ?
Maybe a 'MaskName=' in vertical above the columns, would give enough for the tool to work with ?

Yes, something that could produce a CON label with the 32-bit EXECF value (22-bit SKIPF pattern plus 10-bit address).

Cluso99 · 2017-04-15 09:44

cgracey wrote: »
I was writing the brancher code snippet for the interpreter and I had a nice realization.

SKIPF patterns are used by XBYTE. Normally, you would probably not want to have any '1' bits (1=ignore) after a conditional RET instruction, because a fork is created if the branch takes. XBYTE solves this, actually, by resetting the EXECF bit pattern as part of its behavior. So you can have a conditional RET instruction without any worries about it executing and still having '1' bits left in the SKIPF pattern. If the RET takes and the stack address is $1F8..$1FF, then it executes the hidden XBYTE instruction, which does an RFBYTE, a RDLUT, and then an EXECF (which is a branch plus SKIPF). This resets the SKIPF pattern to the new bytecode's pattern. So, no problem putting early conditional RET's into your code snippets!

Writing this code brought up the concern before I realized there wasn't actually any problem. Note that there are conditional RET's in the middle of the snippet. There's no lingering SKIPF problem if they execute! Life is simple, afterall.
'
' Branches
'
br_byte		rfbyte	pa		'	a b c d e f			a,b,c: byte add, z, nz
br_word		rfword	pa		'	| | | | | | g h i j k l		d,e,f: byte sub, z, nz
br_long		rflong	pa		'	| | | | | | | | | | | | m n o	g,h,i: word add, z, nz
		test	x	wz	'	| b c | e f | h i | k l | n o	j,k,l: word sub, z, nz
		popa	x		'pop	| b c | e f | h i | k l | n o	m,n,o: long rel, z, nz
	if_z	ret			'	| b | | e | | h | | k | | n |
	if_nz	ret			'	| | c | | f | | i | | l | | o
		add	pb,pa		'	a b c | | | g h i | | | m n o
		sub	pb,pa		'	| | | d e f | | | j k l | | |
	_ret_	rdfast	#0,pb		'branch	a b c d e f g h i j k l m n o
You see in that 10-instruction snippet, there are actually 15 different routines that would have taken 75 instructions to code discretely. And no clock cycle gets wasted!

Brilliant work Chip.

jmg,
We don't need the compiler to be that intelligent. There will not be that many to use this kind of code, and those that do should be smart enough to work it out. There are far more important things to add to the compiler first!!!

jmg · 2017-04-15 10:17

Cluso99 wrote: »

jmg,
We don't need the compiler to be that intelligent. There will not be that many to use this kind of code, and those that do should be smart enough to work it out. There are far more important things to add to the compiler first!!!

Did you mean to say 'the assembler' ?
It's not really 'intelligence' to read down a column vertically, if that is what you meant ?.
The user provides all the intelligence there, the Assembler just spins what they wrote 90', to create the masks.

Cluso99 · 2017-04-15 10:26

jmg wrote: »

Cluso99 wrote: »

jmg,
We don't need the compiler to be that intelligent. There will not be that many to use this kind of code, and those that do should be smart enough to work it out. There are far more important things to add to the compiler first!!!

Did you mean to say 'the assembler' ?

Technically while pnut is currently only an assembler, on P1 there are no assemblers, only compilers, even tho we are assembling. Who cares

It's not really 'intelligence' to read down a column vertically, if that is what you meant ?.
The user provides all the intelligence there, the Assembler just spins what they wrote 90', to create the masks.

It is still resources to put that into the compiler/assembler. Those resources (probably Roy) would be better spent adding other much more deserving features that will be much more useful.

Heater. · 2017-04-15 10:42

Cluso99,

...on P1 there are no assemblers, only compilers...

There is one. propasm by Cliff Biffle:
https://github.com/cbiffle/propasm

It's the tool I first used to program the Propeller. Back in the dark ages when there was no BST or other cross-platform tools for the Propeller.

jmg · 2017-04-15 11:00

Cluso99 wrote: »

It is still resources to put that into the compiler/assembler. Those resources (probably Roy) would be better spent adding other much more deserving features that will be much more useful.

Yes, but I was actually thinking in terms of Chip's time, not anyone else's.
Right now, Chip is making/changing a lot of manual masks.
It is tedious to manually create those bit patterns, and every edit needs more manual time. That's a lot of Chip's very valuable time.
The Column 90' extract I suggested, is coded only once, and thereafter automatically extracts the mask patterns, in an edit-safe manner.

Turbo Bytecode Blowout - Interpret hub bytes in 8+ clocks!

Comments