Yes, if the PC is between $200 and $3FF, the cog fetches instructions (at full speed) from LUT instead of COG. As you suggest, you could execute from the LUT and use the COG RAM purely as data registers. Combine that with shared LUT mode, where the paired cog could dynamically swap out executable code, and you end up with some really interesting execution options!
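To make the address ranges concrete, here is a small C sketch that classifies a PC value by where the cog would fetch from. Only the $200..$3FF LUT range comes from the post above; treating everything below $200 as COG RAM and everything from $400 up as hub execution is my assumption, and the names exec_region and exec_region_for_pc are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum { EXEC_COG, EXEC_LUT, EXEC_HUB } exec_region;

/* Classify a program counter value by instruction source.
 * $200..$3FF -> LUT is stated in the thread.
 * Below $200 -> COG and $400 and up -> hub are assumptions. */
exec_region exec_region_for_pc(uint32_t pc)
{
    if (pc < 0x200)
        return EXEC_COG;   /* assumed: cog register RAM */
    if (pc < 0x400)
        return EXEC_LUT;   /* from the post: LUT RAM */
    return EXEC_HUB;       /* assumed: hub execution */
}
```

With this split, code placed at $200 and above in LUT leaves the whole COG RAM range free for data, which is the arrangement suggested above.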
I wonder how elastic the LUT size is?
If we are wildly optimistic for a moment, and presume the routed device has spare space after the 512K RAM is included, how easy is it to increase the LUT size to the next notch?
If there is any spare space after the 512K RAM is included, I'd rather see it filled up with even more hub RAM. It seems like that would be much easier to do than increasing the LUT size.
I just found a bug in XBYTE. If the next instruction in the pipeline following the _RET_/RET to $1F8..$1FF had an immediate D field, it wouldn't read the LUT byte. Someone had said earlier that they had some funny problem with XBYTE. I imagine this was it. I just discovered this in optimizing the D mux.
100MHz for the next FPGA release is going to be no problem. I'm almost wondering if we could get 120MHz.
All the more reason to truly lock the design. That could have been a subtle bug to find. Moreover, it shows that the P2 will need a lot of testing, which will be much more successful if the design is not a moving target.
Here's a suggestion: create a new Google spreadsheet or document called "P3 ideas" and make it editable. If anyone has a new idea or suggestion for the P2, we are all responsible for redirecting that person to add the idea/suggestion to that document instead. That way, the person's idea is not being shot down just because the P2 itself is locked down, and we capture the idea in a place that will be more discoverable than sifting back through forum posts.
The trouble is, a design freeze has been announced several times already. How do we know that a new one will really stick?
That is entirely within the control of Parallax. Even if the flow of suggestions were somehow blocked, the changes can (and probably will) continue until Chip is happy to call it finished.
All it takes is Chip saying "add it to the P3 Ideas document" for every suggestion that comes up. And if Chip floats a new idea, it's up to the rest of us to say "add it to the P3 Ideas document" instead of discussing it. And if you are wondering if even this will happen, we shall see. I'm sure that jmg will be testing Chip's resolve as soon as that document is created.
I was not just addressing the matter of new ideas, but the general idea of a design freeze; that includes refraining from fixing things that don't work.
While that statement still holds, the design isn't ready to be frozen.
The feature set should be frozen now, which is where the "ideas for P3" repository can be used as a parking lot.
I fully understand the desire to continue to improve the design while fixing errors. That's where I rely on my managers to say "good enough" and deliver. Without a manager to make that call, Chip needs to exercise that rigour for himself.
Yes. And we really do need a round of tests. SPIN and C tools are a good first round. Code written with those, such as libraries and objects, should help with the rest.
Bug fixes are necessary right now. I feel there will be some. Given the emphasis on quality and performance, we need to do this.
With you on features. It's jam packed with good stuff! No need for more.
Nah, sorry, can't agree on that AJL. Real progress is happening. It'll be finished.
I'm not saying that it is ready to ship (not my place). My comments were in response to the statement about design freeze.
As this work isn't happening in response to a contract from a particular customer, Chip has the freedom to fix "everything". If this were a contracted delivery with a declared shipping date, he would have to meet that, even at the expense of leaving things "broken".
I see many comments that seem to miss this important distinction.
Without a declared end date, projects run the risk of never being finished. The cost of such a date is that the end product may contain known problems. You generally can't have both a declared end date and a perfect product.
But think about this - there is no such thing as a perfect chip. Name one that you think is perfect, and I'm sure that we can find plenty of room for improvement in multiple dimensions.
The errata I have seen for some chips are quite extensive, and it often takes many revisions to fix them. Some of the published workarounds just say "don't use xxx", which is not really a workaround.
I don't recall seeing any errata on the P1, nor am I aware of any bugs. There is a PLL issue if you don't connect all the power and ground pins but I would consider this as a user issue, not a bug.
That being said, the P2 is considerably more complex, and there hasn't really been any concerted testing effort since the P2HOT. There will most likely be bugs. As long as they don't break everything, they will likely just block a specific piece from working. Because the P2 is so flexible, there will be other ways around any such problems.
Even in a worst case where the smart pins didn't work, we can still drive them directly. We have 16 cogs after all. Maybe there would be a few things it wouldn't do, but it wouldn't break the chip.
Many of us just wanted a P1 with more HUB RAM, faster, and more I/O. The P2 kills that hands down.
There hasn't been any formal testing, but people have been testing a few things since P2-Hot. If everyone would re-run what they've done on previous versions it would go a long way toward testing out new FPGA images. I'm hoping that any changes after the next version will be small and incremental. So I would encourage everyone to run whatever they have on each new FPGA image from now on.
Chip did build a test chip containing Smart pins. This should have provided an effective test bed for the analog circuitry, and whatever digital circuitry he included in the test chip.
I feel that the P2 is getting very close. There is light at the end of the tunnel. Hopefully it's not the headlights of a high speed train coming straight at us.
It would be nice if people could restrain themselves from proposing their own pet ideas. There are several things that I would have liked to see in the P2, but I feel it would be reckless to propose them at this point. I think the bit ops are a good example of something that wasn't absolutely necessary for the P2. This ended up consuming a couple of weeks, and caused a bit of reshuffling of the instruction set. There are other ways to do bit operations at the expense of a few extra cycles.
I am all for having discussion on the P3, but can it wait a few months until the P2 is sent off to the foundry? I don't think this forum is capable of having a completely separate discussion on the P3 without it spilling over into the P2. It would be more productive for everyone to test the P2 than to discuss new features on the P3 right now.
I'm hanging out for a Prop2 loader that works on Linux. PNut.exe plays havoc with the DTR line, preventing the Prop2 from accepting the program download.
Dave Hein's loadp2 program works great on Linux; that's what I've been using. He also posted a p2asm assembler, which I think works well, but I haven't used it as much (I mostly use spin2cpp/fastspin for my P2 development).
I'm glad the loader works for you. I've been doing a little more work on GCC for the P2, and I hope to post an update soon. The loader hasn't changed much, except that I added a -v option to enable a verbose mode. Without -v, the new version suppresses its prints and runs silently.
I've modified p2asm to generate an object file, and I wrote a linker called p2link to produce an executable binary file. The mods to p2asm and p2link are based on the work I did on the Taz C compiler. All the tools are tied together with a bash script called p2gcc. It's actually working out pretty well. I can compile a C program and load it on the P2 board by typing "p2gcc -r -t hello.c". The -r will cause loadp2 to run, and the -t option is passed to loadp2 to run the terminal emulator. Eventually I'll update spinsim and tie that into p2gcc so that programs can be compiled and run on the simulator by typing "p2gcc -sim hello.c".
Here is the hub move/fill routine for the new Spin interpreter.
At 120MHz, it's moving 64KB of data within the hub in 500us. That's with a read-then-write transfer buffer of 32 longs. I can almost double that speed by going to a 256-long buffer, but there won't be that much free space in the interpreter. As it is, we are a little better than half of theoretical full speed with a 32-long buffer.
This thing took me three days to write. Because there are a few levels of performance possible in the Prop2, optimizing general-purpose things like this gets complicated.
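As a sanity check on those numbers, here is a small C sketch of the arithmetic. It assumes that "theoretical full speed" means one clock to read and one clock to write each long (2 clocks per long moved); that figure is my reading, not something stated in the post, and the function names are invented for illustration.

```c
#include <stdint.h>

/* Clocks that elapse in `us` microseconds at `mhz` MHz. */
static uint64_t clocks_elapsed(uint64_t mhz, uint64_t us)
{
    return mhz * us;
}

/* Average clocks spent per long when `bytes` bytes move in `us` microseconds. */
double clocks_per_long(uint64_t bytes, uint64_t mhz, uint64_t us)
{
    return (double)clocks_elapsed(mhz, us) / (double)(bytes / 4);
}

/* Fraction of an assumed ideal rate, given the ideal clocks per long. */
double fraction_of_ideal(uint64_t bytes, uint64_t mhz, uint64_t us,
                         double ideal_clocks_per_long)
{
    return ideal_clocks_per_long / clocks_per_long(bytes, mhz, us);
}
```

For 64KB (65536 bytes, 16384 longs) in 500us at 120MHz, that is 60000 clocks, or about 3.66 clocks per long. Against the assumed 2-clock ideal, that comes to roughly 55%, which matches "a little better than half".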
'
'
'  BYTEMOVE(dst,src,cnt)        z = 0
'  WORDMOVE(dst,src,cnt)        z = 0
'  LONGMOVE(dst,src,cnt)        z = 0
'
'  BYTEFILL(dst,val,cnt)        z = 1
'  WORDFILL(dst,val,cnt)        z = 1
'  LONGFILL(dst,val,cnt)        z = 1
'
mov_fil                 popa    y                       ' a b c d e f   pop src/val     a: BYTEMOVE
                        popa    z                       ' a b c d e f   pop dst         b: WORDMOVE
                                                        '                               c: LONGMOVE
                        tjz     x,#.exit                ' a b c d e f   if cnt=0, exit
                                                        '                               d: BYTEFILL
                        shl     x,#1                    ' | b | | e |   if word, cnt*2  e: WORDFILL
                        shl     x,#2                    ' | | c | | f   if long, cnt*4  f: LONGFILL
                        cmp     y,z             wc      ' a b c | | |   reverse move?
        if_c            add     y,x                     ' a b c | | |
        if_c            add     z,x                     ' a b c | | |
                        movbyts y,#%%0000               ' | | | d | |   byte fill
                        movbyts y,#%%1010               ' | | | | e |   word fill
                        rep     #2,#32                  ' | | | d e f   set fill pattern
                        altd    pa,altd_fill            ' | | | d e f
                        mov     buff,y                  ' | | | d e f

                        shr     x,#1            wc      'handle any stray byte
        if_c            mov     a,#1
        if_c            callpa  #%0111_0111,#.m

                        shr     x,#1            wc      'handle any stray word
        if_c            mov     a,#2
        if_c            callpa  #%1011_1011,#.m

.loop                   cmpsub  x,#32           wc      'handle longs in blocks of up to 32
        if_c            mov     a,#32
        if_nc           mov     a,x
        if_nc           mov     x,#0
                        mov     pb,a
                        sub     pb,#1
                        shl     a,#2
                        callpa  #%1100_1100,#.m
                        jmp     #.loop

.m                      cmp     y,z             wc      'move/fill routine, reverse move?
        if_nz_and_c     sub     y,a                     'if reverse move, pre-dec pointers
        if_nz_and_c     sub     z,a
                        skipf   pa                      'set skip pattern for rdxxxx/wrxxxx
        if_nz           setq    pb                      'rdxxxx for move
        if_nz           rdlong  buff,y
        if_nz           rdword  buff,y
        if_nz           rdbyte  buff,y
                        setq    pb                      'wrxxxx for move/fill
                        wrlong  buff,z
                        wrword  buff,z
                        wrbyte  buff,z
        if_z_or_nc      add     y,a                     'if forward move/fill, post-inc pointers
        if_z_or_nc      add     z,a
        _ret_           tjz     x,#.done                'if not done, return to caller
.done                   pop     x                       'done, pop call stack
.exit   _ret_           popa    x                       'pop top of stack into x, return to xbyte loop
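For anyone who finds the skip patterns hard to follow, here is a rough C model of the move path only, as I read the listing: pick a direction from the pointer comparison, peel off a stray byte and a stray word, then move longs through a bounce buffer in blocks of up to 32. This is a sketch under my own assumptions, not the interpreter's code. The names hub_move, chunk and BLOCK_LONGS are hypothetical, the fill variants are left out, and the SKIPF/SETQ mechanics are replaced by plain memcpy calls.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_LONGS = 32, BLOCK_BYTES = BLOCK_LONGS * 4 };

/* Move one chunk through a bounce buffer: read it all, then write it all.
 * Reverse moves pre-decrement the pointers, forward moves post-increment. */
static void chunk(uint8_t **dst, const uint8_t **src, size_t len, int reverse)
{
    uint8_t buff[BLOCK_BYTES];

    if (reverse) { *src -= len; *dst -= len; }
    memcpy(buff, *src, len);
    memcpy(*dst, buff, len);
    if (!reverse) { *src += len; *dst += len; }
}

/* Overlap-safe move of n bytes, modelled on the routine's structure. */
void hub_move(void *dstp, const void *srcp, size_t n)
{
    uint8_t *dst = dstp;
    const uint8_t *src = srcp;
    int reverse = (uintptr_t)src < (uintptr_t)dst;  /* like "cmp y,z wc" */

    if (n == 0)
        return;
    if (reverse) { src += n; dst += n; }            /* start from the end */

    if (n & 1) chunk(&dst, &src, 1, reverse);       /* stray byte */
    if (n & 2) chunk(&dst, &src, 2, reverse);       /* stray word */

    for (size_t longs = n >> 2; longs != 0; ) {     /* blocks of up to 32 longs */
        size_t k = longs > BLOCK_LONGS ? BLOCK_LONGS : longs;
        chunk(&dst, &src, k * 4, reverse);
        longs -= k;
    }
}
```

Because each chunk is fully read before it is written, and chunks are visited in one direction only, overlapping regions should come out the same as with memmove().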
Chip, that looks great. However, I wonder if you could get a speed improvement by separating xxxxFILL from xxxxMOVE. Also, because the P2 does unaligned reads and writes, the BYTE and WORD operations could be almost as fast as the LONG operations. It just requires performing 0 to 3 BYTE accesses at the beginning, and then doing LONG accesses after that. This is the code I used for memset() in p2gcc, which is basically the same as BYTEFILL() in Spin.
' This code resides in cog RAM
__LONGFILL
        wrfast  #0, r0
        rep     #1, r2
        wflong  r1
        ret

// This code resides in hub RAM
// ptr, val and num are in r0, r1 and r2
void memset(void *ptr, int val, int num)
{
    __asm__(" mov r3, #3");
    __asm__(" and r3, r2 wz");
    __asm__(" if_z jmp #label1");
    __asm__("label2 wrbyte r1, r0");
    __asm__(" add r0, #1");
    __asm__(" djnz r3, #label2");
    __asm__("label1 setnib r1, r1, #1");
    __asm__(" setbyte r1, r1, #1");
    __asm__(" setword r1, r1, #1");
    __asm__(" shr r2, #2 wz");
    __asm__(" if_nz call #\\__LONGFILL");
}
I could speed it up a bit by moving all the code to cog RAM, but p2gcc C code currently calls functions that only reside in hub RAM.
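The same idea can be written in portable C, which may make the structure easier to see: write the 0 to 3 leftover bytes first, replicate the fill byte across a 32-bit pattern, then store whole longs. This is only a sketch and not the p2gcc source. The name byte_fill is made up, and memcpy stands in for the unaligned long writes the P2 does natively.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill num bytes at ptr with the low byte of val. */
void byte_fill(void *ptr, int val, size_t num)
{
    uint8_t *p = ptr;
    uint8_t b = (uint8_t)val;

    /* 0 to 3 single-byte writes, so the remainder is a whole number of longs. */
    for (size_t head = num & 3; head != 0; head--)
        *p++ = b;

    /* Replicate the byte into all four byte lanes of a long. */
    uint32_t pattern = b * 0x01010101u;

    /* Long-sized stores; memcpy keeps unaligned addresses legal in C. */
    for (size_t longs = num >> 2; longs != 0; longs--) {
        memcpy(p, &pattern, sizeof pattern);
        p += sizeof pattern;
    }
}
```

The long loop corresponds to the __LONGFILL call; on the P2 that part runs from cog RAM, which is where most of the speed comes from.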
Comments
200MHz final silicon, here we come!
P1 is damn good. 6809 is too.
Dave Hein's loadp2 program works great on Linux --that's what I've been using. He also posted a p2asm assembler, which I think works well but I haven't used as much (I mostly use spin2cpp/fastspin for my P2 development).
Eric
Hey Dave,
Give yourself a signature with that as the link.
EDIT: Found it - http://forums.parallax.com/discussion/comment/1409237/#Comment_1409237
BTW I couldn't help but notice your reference to 120MHz.
That sounds very encouraging.