Fast Bytecode Interpreter

David Betz · 2017-05-12 00:21

Cluso99 wrote: »

LUT size has been locked in because code executing from hub starts at long address $400 (byte address $1000). ie there is a hole for cog and LUT.

Well, I suppose the LUT size could be tripled and the base of hub memory moved to $800. I imagine that much memory wouldn't fit though.
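As a rough illustration of the address arithmetic being discussed, here is a minimal C sketch. It assumes cog registers occupy long addresses $000..$1FF and the LUT starts at $200; the names `exec_region`, `lut_longs` and `long_to_byte_addr` are made up for this example and are not part of any P2 tool.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: cog registers at $000..$1FF, LUT from $200 up to hub_base. */
enum { COG_BASE = 0x000, LUT_BASE = 0x200 };

typedef enum { REGION_COG, REGION_LUT, REGION_HUB } region_t;

/* Classify where an instruction fetch at long address pc would come from. */
region_t exec_region(uint32_t pc, uint32_t hub_base)
{
    assert(hub_base > LUT_BASE);
    if (pc < LUT_BASE)
        return REGION_COG;
    if (pc < hub_base)
        return REGION_LUT;
    return REGION_HUB;
}

/* Size of the LUT "hole" in longs for a given start of hub execution. */
uint32_t lut_longs(uint32_t hub_base)
{
    assert(hub_base > LUT_BASE);
    return hub_base - LUT_BASE;
}

/* A long address is a byte address divided by four. */
uint32_t long_to_byte_addr(uint32_t long_addr)
{
    return long_addr * 4;
}
```

With a hub base of $400 this gives a 512-long LUT and a byte address of $1000; moving the base to $800 gives 1536 longs, which is three times the current size.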

Dave Hein · 2017-05-12 00:53

jmg wrote: »

Seairth wrote: »

Yes, if the PC is between $200 and $3FF, the cog is fetching instructions (at full speed) from LUT instead of COG. Like you suggest, you could execute from the LUT and use the COG ram purely as data registers. Combine that with shared LUT mode, where the paired cog could dynamically swap out executable code, and you end up with some really interesting execution options!

I wonder how elastic the LUT size is ?

If we are wildly optimistic for a moment, and presume the routed device has spare space after the 512k RAM is included, how easy is it to increase the LUT size to the next notch ?

If there is any spare space after the 512K RAM is included, I'd rather see it filled up with even more hub RAM. It seems like that would be much easier to do than increasing the LUT size.

cgracey · 2017-05-12 06:48

I just found a bug in XBYTE. If the next instruction in the pipeline following the _RET_/RET to $1F8..$1FF had an immediate D field, it wouldn't read the LUT byte. Someone had said earlier that they had some funny problem with XBYTE. I imagine this was it. I just discovered this in optimizing the D mux.

100MHz for the next FPGA release is going to be no problem. I'm almost wondering if we could get 120MHz.

evanh · 2017-05-12 07:01

cgracey wrote: »

100MHz for the next FPGA release is going to be no problem. I'm almost wondering if we could get 120MHz.

200MHz final silicon, here we come!

Rayman · 2017-05-12 11:08

108 MHz would be nice... Good for USB and SXGA and 1280 x 960 resolution ...
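The arithmetic behind the 108 MHz figure can be sketched in C as below. The horizontal and vertical totals used in the usage note are assumptions taken from the commonly published 60 Hz timings for these modes, and 12 MHz is the USB full-speed bit rate. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pixel clock = total pixels per line * total lines per frame * frames per second. */
uint32_t pixel_clock_hz(uint32_t h_total, uint32_t v_total, uint32_t refresh_hz)
{
    assert(h_total > 0 && v_total > 0 && refresh_hz > 0);
    return h_total * v_total * refresh_hz;
}

/* True when clock_hz divides evenly down to rate_hz. */
int is_integer_multiple(uint32_t clock_hz, uint32_t rate_hz)
{
    assert(rate_hz > 0);
    return clock_hz % rate_hz == 0;
}
```

Assuming totals of 1800 x 1000 for 1280 x 960 and 1688 x 1066 for SXGA, both at 60 Hz, `pixel_clock_hz` gives 108,000,000 Hz and 107,964,480 Hz respectively, and 108 MHz is exactly 9 x 12 MHz.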

Seairth · 2017-05-12 11:34

cgracey wrote: »

I just found a bug in XBYTE. If the next instruction in the pipeline following the _RET_/RET to $1F8..$1FF had an immediate D field, it wouldn't read the LUT byte. Someone had said earlier that they had some funny problem with XBYTE. I imagine this was it. I just discovered this in optimizing the D mux.

All the more reason to truly lock the design. That could have been a subtle bug to find. Moreover, it shows that the P2 will need a lot of testing, which will be much more successful if the design is not a moving target.

Here's a suggestion: create a new Google spreadsheet or document called "P3 ideas" and make it editable. If anyone has a new idea or suggestion for the P2, we all are responsible for redirecting that person to add the idea/suggestion to that document instead. That way, the person's idea is not being shot down just because the P2 itself is locked down, and we capture the idea in a place that will be more discoverable than sifting back through forum posts will be.

potatohead · 2017-05-12 13:49

Seconded.

David Betz · 2017-05-12 14:45

The trouble is, a design freeze has been announced several times already. How do we know that a new one will really stick?

AJL · 2017-05-12 14:52

David Betz wrote: »

The trouble is, a design freeze has been announced several times already. How do we know that a new one will really stick?

That is entirely within the control of Parallax. Even if the flow of suggestions were somehow to be blocked, the change can (and probably will) continue until Chip is happy to call it finished.

Seairth · 2017-05-12 14:58

David Betz wrote: »

The trouble is, a design freeze has been announced several times already. How do we know that a new one will really stick?

All it takes is Chip saying "add it to the P3 Ideas document" for every suggestion that comes up. And if Chip floats a new idea, it's up to the rest of us to say "add it to the P3 Ideas document" instead of discussing it. And if you are wondering if even this will happen, we shall see. I'm sure that jmg will be testing Chip's resolve as soon as that document is created.

AJL · 2017-05-12 23:29

Seairth wrote: »

David Betz wrote: »

The trouble is, a design freeze has been announced several times already. How do we know that a new one will really stick?

All it takes is Chip saying "add it to the P3 Ideas document" for every suggestion that comes up. And if Chip floats a new idea, it's up to the rest of us to say "add it to the P3 Ideas document" instead of discussing it. And if you are wondering if even this will happen, we shall see. I'm sure that jmg will be testing Chip's resolve as soon as that document is created.

I was not just addressing the matter of new ideas, but the general idea of a design freeze; that includes refraining from fixing things that don't work.

potatohead · 2017-05-12 23:33

They need to be fixed.

AJL · 2017-05-13 00:12

potatohead wrote: »

They need to be fixed.

While that statement still holds, the design isn't ready to be frozen.

The feature set should be frozen now, which is where the "ideas for P3" repository can be used as a parking lot.

I fully understand the desire to continue to improve the design while fixing errors. That's where I rely on my managers to say "good enough" and deliver. Without a manager to make that call, Chip needs to exercise that rigour for himself.

evanh · 2017-05-13 00:36

Nah, sorry, can't agree on that AJL. Real progress is happening. It'll be finished.

potatohead · 2017-05-13 01:03

Yes. And we really do need a round of tests. SPIN and C tools are a good first round. Code done with those, plus developing some libraries and objects, should help with the rest.

Bug fixes are necessary right now. I feel there will be some. Given the emphasis on quality and performance, we need to do this.

With you on features. It's jam packed with good stuff! No need for more.

AJL · 2017-05-13 04:40

evanh wrote: »

Nah, sorry, can't agree on that AJL. Real progress is happening. It'll be finished.

I'm not saying that it is ready to ship (not my place). My comments were in response to the statement about design freeze.

As this work isn't happening in response to a contract from a particular customer, Chip has the freedom to fix "everything". If this was a contracted delivery with a declared shipping date, he would have to meet that, even at the expense of leaving things "broken".
I see many comments that seem to miss this important distinction.
Without a declared end date, projects run the risk of never being finished. The cost of such a date is that the end product may contain known problems. You generally can't have both a declared end date and a perfect product.

evanh · 2017-05-13 06:02

It's getting finished, no worries there. Perfect is the objective ... so there isn't a fixed end date.

KeithE · 2017-05-13 15:32

> Perfect is the objective

But think about this - there is no such thing as a perfect chip. Name one that you think is perfect, and I'm sure that we can find plenty of room for improvement in multiple dimensions.

Cluso99 · 2017-05-13 16:33

Depends on the definition of perfect.

The errata I have seen on some chips are quite large, and the chips often take many revisions to fix. Some of the published workarounds just say "don't use xxx", which is not really a workaround.

I don't recall seeing any errata on the P1, nor am I aware of any bugs. There is a PLL issue if you don't connect all the power and ground pins but I would consider this as a user issue, not a bug.

That being said, the P2 is considerably more complex, and there hasn't really been any concerted testing efforts since the P2HOT. There will most likely be bugs. As long as they don't break everything, they will likely just block a specific piece working. Because the P2 is so flexible there will be other ways around any such problems.

Even in a worst case where the smart pins didn't work, we can still drive them directly. We have 16 cogs after all. Maybe there would be a few things it wouldn't do, but it wouldn't break the chip.

Many of us just wanted a P1 with more HUB RAM, faster, and more I/O. The P2 kills that hands down.

Heater. · 2017-05-13 19:17

KeithE,

...there is no such thing as a perfect chip. Name one that you think is perfect...

The 555 of course...

potatohead · 2017-05-13 19:19

P1 is damn good. 6809 is too.

Dave Hein · 2017-05-13 20:37

Cluso99 wrote: »

...
That being said, the P2 is considerably more complex, and there hasn't really been any concerted testing efforts since the P2HOT. There will most likely be bugs. As long as they don't break everything, they will likely just block a specific piece working. Because the P2 is so flexible there will be other ways around any such problems.

Even in a worst case where the smart pins didn't work, we can still drive them directly. We have 16 cogs after all. Maybe there would be a few things it wouldn't do, but it wouldn't break the chip.
...

There hasn't been any formal testing, but people have been testing a few things since P2-Hot. If everyone would re-run what they've done on previous versions it would go a long way toward testing out new FPGA images. I'm hoping that any changes after the next version will be small and incremental. So I would encourage everyone to run whatever they have on each new FPGA image from now on.

Chip did build a test chip containing Smart pins. This should have provided an effective test bed for the analog circuitry, and whatever digital circuitry he included in the test chip.

I feel that the P2 is getting very close. There is light at the end of the tunnel. Hopefully it's not the headlights of a high speed train coming straight at us.

It would be nice if people could restrain themselves from proposing their own pet ideas. There are several things that I would have liked to see in the P2, but I feel it would be reckless to propose them at this point. I think the bit ops are a good example of something that wasn't absolutely necessary for the P2. This ended up consuming a couple of weeks, and caused a bit of reshuffling of the instruction set. There are other ways to do bit operations at the expense of a few extra cycles.

I am all for having discussion on the P3, but can it wait a few months until the P2 is sent off to the foundry? I don't think this forum is capable of having a completely separate discussion on the P3 without it spilling over into the P2. It would be more productive for everyone to test the P2 than to discuss new features on the P3 right now.

evanh · 2017-05-13 23:43

I'm hanging out for a Prop2 loader that works on Linux. PNut.exe plays havoc with the DTR line, preventing the Prop2 from accepting the program download.

ersmith · 2017-05-14 00:00

evanh wrote: »

I'm hanging out for a Prop2 loader that works on Linux. PNut.exe plays havoc with the DTR line, preventing the Prop2 from accepting the program download.

Dave Hein's loadp2 program works great on Linux; that's what I've been using. He also posted a p2asm assembler, which I think works well but I haven't used as much (I mostly use spin2cpp/fastspin for my P2 development).

Eric

evanh · 2017-05-14 00:35

Cool, how long has that existed for?

Hey Dave,
Give yourself a signature with that as the link.

EDIT: Found it - http://forums.parallax.com/discussion/comment/1409237/#Comment_1409237

evanh · 2017-05-14 01:25

Top notch! I just used PNut to build the .obj and then used loadp2 to download it. Thanks heaps Dave. /me is happy.

Dave Hein · 2017-05-14 02:17

I'm glad the loader works for you. I've been doing a little more work on GCC for the P2, and I hope to post an update soon. The loader hasn't changed much, except that I added a -v option to enable a verbosity mode. If the -v option isn't specified in the new version it disables the prints so it runs silently.

I've modified p2asm to generate an object file, and I wrote a linker called p2link to produce an executable binary file. The mods to p2asm and p2link are based on the work I did on the Taz C compiler. All the tools are tied together with a bash script called p2gcc. It's actually working out pretty well. I can compile a C program and load it on the P2 board by typing "p2gcc -r -t hello.c". The -r will cause loadp2 to run, and the -t option is passed to loadp2 to run the terminal emulator. Eventually I'll update spinsim and tie that into p2gcc so that programs can be compiled and run on the simulator by typing "p2gcc -sim hello.c".

cgracey · 2017-09-02 00:19

Here is the hub move/fill routine for the new Spin interpreter.

At 120MHz, it's moving 64KB of data within the hub in 500us. That's with a read-then-write transfer buffer of 32 longs. I can almost double that speed by going to a 256-long buffer, but there won't be that much free space in the interpreter. As it is, we are a little better than half of theoretical full speed with a 32-long buffer.

This thing took me three days to write. Because there are a few levels of performance possible in the Prop2, optimizing general-purpose things like this gets complicated.
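To put the "a little better than half of theoretical full speed" remark into numbers, here is a small C sketch. It assumes that the ideal case costs one clock to read each long and one clock to write it, with no setup overhead. That assumption and the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal clocks for a hub-to-hub move, assuming one read clock plus one
   write clock per long and no per-block setup cost. */
uint64_t ideal_move_clocks(uint64_t nbytes)
{
    uint64_t longs = nbytes / 4;
    return 2 * longs;
}

/* Convert a clock count to microseconds at the given clock frequency. */
double clocks_to_us(uint64_t clocks, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)clocks * 1e6 / clock_hz;
}

/* Fraction of the ideal rate achieved by a measured transfer time. */
double move_efficiency(uint64_t nbytes, double clock_hz, double measured_us)
{
    assert(measured_us > 0.0);
    return clocks_to_us(ideal_move_clocks(nbytes), clock_hz) / measured_us;
}
```

Under that assumption, 64KB is 16384 longs, or 32768 clocks, which is about 273us at 120MHz. Measured against 500us, that works out to roughly 55% of the ideal rate.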

'
'
' BYTEMOVE(dst,src,cnt)		z = 0
' WORDMOVE(dst,src,cnt)		z = 0
' LONGMOVE(dst,src,cnt)		z = 0
'
' BYTEFILL(dst,val,cnt)		z = 1
' WORDFILL(dst,val,cnt)		z = 1
' LONGFILL(dst,val,cnt)		z = 1
'
mov_fil		popa	y		'	a b c d e f	pop src/val		a: BYTEMOVE
		popa	z		'	a b c d e f	pop dst			b: WORDMOVE
					'						c: LONGMOVE
		tjz	x,#.exit	'	a b c d e f	if cnt=0, exit
					'						d: BYTEFILL
		shl	x,#1		'	| b | | e |	if word, cnt*2		e: WORDFILL
		shl	x,#2		'	| | c | | f	if long, cnt*4		f: LONGFILL

		cmp	y,z	wc	'	a b c | | |	reverse move?
	if_c	add	y,x		'	a b c | | |
	if_c	add	z,x		'	a b c | | |

		movbyts	y,#%%0000	'	| | | d | |	byte fill
		movbyts	y,#%%1010	'	| | | | e |	word fill

		rep	#2,#32		'	| | | d e f	set fill pattern
		altd	pa,altd_fill	'	| | | d e f
		mov	buff,y		'	| | | d e f

		shr	x,#1	wc	'handle any stray byte
	if_c	mov	a,#1
	if_c	callpa	#%0111_0111,#.m

		shr	x,#1	wc	'handle any stray word
	if_c	mov	a,#2
	if_c	callpa	#%1011_1011,#.m

.loop		cmpsub	x,#32	wc	'handle longs in blocks of up to 32
	if_c	mov	a,#32
	if_nc	mov	a,x
	if_nc	mov	x,#0
		mov	pb,a
		sub	pb,#1
		shl	a,#2
		callpa	#%1100_1100,#.m
		jmp	#.loop


.m		cmp	y,z	wc	'move/fill routine, reverse move?

if_nz_and_c	sub	y,a		'if reverse move, pre-dec pointers
if_nz_and_c	sub	z,a

		skipf	pa		'set skip pattern for rdxxxx/wrxxxx

if_nz		setq	pb		'rdxxxx for move
if_nz		rdlong	buff,y
if_nz		rdword	buff,y
if_nz		rdbyte	buff,y

		setq	pb		'wrxxxx for move/fill
		wrlong	buff,z
		wrword	buff,z
		wrbyte	buff,z

if_z_or_nc	add	y,a		'if forward move/fill, post-inc pointers
if_z_or_nc	add	z,a

	_ret_	tjz	x,#.done	'if not done, return to caller


.done		pop	x		'done, pop call stack
.exit	_ret_	popa	x		'pop top of stack into x, return to xbyte loop
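For readers who don't read PASM, the move path of the listing above can be modelled loosely in C. This is only a sketch of the control flow as I read it: start from the end and work backwards when the source is below the destination, handle a stray byte, then a stray word, then longs in blocks of up to 32 through a buffer. The names `hub_move`, `move_chunk` and `BLOCK_LONGS` are hypothetical, and the fill path is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_LONGS = 32 };   /* transfer buffer size in longs, as in the post */

/* Move one chunk of a bytes through a local buffer (read all, then write all).
   Reverse moves pre-decrement the pointers, forward moves post-increment. */
static void move_chunk(uint8_t **dst, const uint8_t **src, size_t a, int reverse)
{
    uint8_t buff[BLOCK_LONGS * 4];

    assert(a <= sizeof buff);
    if (reverse) {
        *src -= a;
        *dst -= a;
    }
    memcpy(buff, *src, a);   /* read phase */
    memcpy(*dst, buff, a);   /* write phase */
    if (!reverse) {
        *src += a;
        *dst += a;
    }
}

/* Overlap-safe move of nbytes from src to dst. */
void hub_move(void *dst, const void *src, size_t nbytes)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    int reverse = (uintptr_t)s < (uintptr_t)d;   /* source below destination */
    size_t longs;

    if (nbytes == 0)
        return;
    if (reverse) {           /* start from the end and work backwards */
        s += nbytes;
        d += nbytes;
    }
    if (nbytes & 1)          /* stray byte */
        move_chunk(&d, &s, 1, reverse);
    if (nbytes & 2)          /* stray word */
        move_chunk(&d, &s, 2, reverse);
    for (longs = nbytes >> 2; longs > 0; ) {
        size_t n = longs > BLOCK_LONGS ? BLOCK_LONGS : longs;
        move_chunk(&d, &s, n * 4, reverse);
        longs -= n;
    }
}
```

Because each chunk is fully read before it is written, and chunks are visited in the direction that never overwrites unread source data, the result should match `memmove` for overlapping ranges.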

ozpropdev · 2017-09-02 01:14

Nice work Chip!

BTW I couldn't help but notice your reference to 120MHz.
That sounds very encouraging.

Dave Hein · 2017-09-02 01:44

Chip, that looks great. However, I wonder if you could get a speed improvement by separating xxxxFILL from xxxxMOVE. Also, because the P2 does unaligned reads and writes, the BYTE and WORD operations could be almost as fast as the LONG operations. It just requires performing 0 to 3 BYTE accesses at the beginning, and then doing LONG accesses after that. This is the code I used for memset() in p2gcc, which is basically the same as BYTEFILL() in Spin.

' This code resides in cog RAM
__LONGFILL
        wrfast  #0, r0
        rep     #1, r2
        wflong  r1
        ret

// This code resides in hub RAM
// ptr, val and num are in r0, r1 and r2
void memset(void *ptr, int val, int num)
{
    __asm__("       mov     r3, #3");
    __asm__("       and     r3, r2 wz");
    __asm__(" if_z  jmp     #label1");
    __asm__("label2 wrbyte  r1, r0");
    __asm__("       add     r0, #1");
    __asm__("       djnz    r3, #label2");
    __asm__("label1 setbyte r1, r1, #1");
    __asm__("       setword r1, r1, #1");
    __asm__("       shr     r2, #2 wz");
    __asm__(" if_nz call    #\\__LONGFILL");
}

I could speed it up a bit by moving all the code to cog RAM, but p2gcc C code currently calls functions that only reside in hub RAM.
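The same idea can be written in portable C for anyone who wants to experiment off-chip. This is a sketch, not the p2gcc implementation: it writes the 0 to 3 leading bytes, replicates the byte value across a long, and then stores whole longs. The name `fill_bytes` is hypothetical, and `memcpy` stands in for the P2's unaligned long write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill num bytes at ptr with the low byte of val. */
void *fill_bytes(void *ptr, int val, size_t num)
{
    uint8_t *p = ptr;
    uint8_t b = (uint8_t)val;
    size_t head = num & 3;                 /* 0 to 3 leading byte writes */
    uint32_t pattern;
    size_t longs;

    assert(ptr != NULL || num == 0);

    while (head--)
        *p++ = b;

    /* Replicate the byte into all four byte lanes of a long. */
    pattern = (uint32_t)b * 0x01010101u;

    /* Remaining count is a whole number of longs; the address may be unaligned. */
    for (longs = num >> 2; longs > 0; longs--) {
        memcpy(p, &pattern, sizeof pattern);
        p += sizeof pattern;
    }
    return ptr;
}
```

Calling `fill_bytes(buf + 1, 0x5A, 11)` would write three single bytes followed by two longs, leaving the bytes on either side of the range untouched.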
