First, hub memory will be composed of 16 instances of 32768 x 8 RAM, for 512KB. This is going to yield a hub data path of 128 bits. Only the RAMs that are needed on a given cycle will be activated, saving power. ... Though the cog memory map is still 512x32, cog RAM will be physically organized as 128 x 128, so we can read or write four contiguous registers with RDQUAD/WRQUAD instructions. ... This also gives cogs running at 200MHz (100 MIPS) a hub memory bandwidth of 200MB/s, which is enough to do any kind of VGA that we have the internal hub memory to support, at any color depth (up to 24bpp).
Come, let us calculate together: 200MHz / 16 cogs * 16-byte transfers (per hub slot) = 200MB/s. Have I got that right? Absolutely fantastic!!!
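For anyone who wants to check the arithmetic, here is a minimal sketch that uses only the figures quoted above (16 RAMs of 32768 x 8, 200MHz, 16 cogs). The one assumption is that each cog gets one hub slot per 16 clocks, round-robin, as on the P1; the variable names are just for illustration.

```python
# Hub geometry and per-cog bandwidth from the quoted figures.
RAM_INSTANCES = 16           # 16 RAM instances
RAM_DEPTH = 32768            # each RAM is 32768 x 8
RAM_WIDTH_BITS = 8
CLOCK_HZ = 200_000_000       # 200MHz system clock
COGS = 16                    # assumption: one hub slot per cog, round-robin

hub_bytes = RAM_INSTANCES * RAM_DEPTH * RAM_WIDTH_BITS // 8
data_path_bits = RAM_INSTANCES * RAM_WIDTH_BITS
bytes_per_slot = data_path_bits // 8
slots_per_second_per_cog = CLOCK_HZ // COGS
bandwidth_per_cog = slots_per_second_per_cog * bytes_per_slot

print(hub_bytes // 1024, "KB hub RAM")
print(data_path_bits, "bit hub data path")
print(bandwidth_per_cog // 1_000_000, "MB/s per cog")
```

So yes, under that assumption the numbers work out to 512KB, a 128-bit path and 200MB/s per cog.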
Chip, it's so cool that you're willing to work hard to keep the video features. I know some folks don't need it. But it brought many of us to the Prop. Thanks!
Guess we'll be losing the P2's color lookup table (CLUT)? Maybe we can alter colors on the fly. Is there any way to keep the CLUT, or what are people's thoughts?
That was based on 32 bit (1 long) hub bus, same as P1.
Now Chip expanded that to 128 bits. BIG difference.
In order to avoid arguing with half the forum here, I pm'd him a very simple hubexec that suits this that would be ~50MIPS (simple instructions) for incredibly few gates (relatively speaking). With 32 entry slot mapping, it would get close to 100MIPS. WAY better C performance.
Non-FCACHE LMM on this latest design tops out at 12.5 MIPS; FCACHE can help a lot when the compiler can use it effectively.
Non-FCACHE QLMM, if used effectively, would be roughly 25 MIPS (lose one slot to add pc,#16).
I am extremely tired of some people criticizing me on a non-technical basis, so I did not post (NOT talking about you).
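A rough sketch of where the LMM figures above come from, assuming a 200MHz clock and one hub window per cog every 16 clocks. The 12.5 MIPS figure is one fetched instruction per hub window. Reading the 25 MIPS QLMM estimate as two useful instructions per window is my interpretation only, since the post does not spell out the accounting, and `loop_mips` is a made-up helper name.

```python
CLOCK_HZ = 200_000_000   # 200MHz, from the thread
COGS = 16                # assumption: one hub window per cog every 16 clocks

def windows_per_second(clock_hz=CLOCK_HZ, cogs=COGS):
    """Hub windows available to a single cog each second."""
    return clock_hz / cogs

def loop_mips(instructions_per_window):
    """MIPS of a fetch loop that retires N useful instructions per hub window."""
    return windows_per_second() * instructions_per_window / 1e6

lmm_mips = loop_mips(1)    # plain LMM: one instruction per window
qlmm_mips = loop_mips(2)   # assumed reading of the "roughly 25 MIPS" estimate

print(lmm_mips, qlmm_mips)
```

The ~50 MIPS hubexec estimate is not modelled here because the proposal itself was not posted.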
Bill, that's really quite contrary to how we've all been doing things here. Besides if Chip uses it, then we are all going to know it with the first FPGA release, and all you've done is delay the arguments until then.
Does anyone have all the sources from OBEX that can be greped for all the different PASM mnemonics so that we can count what ops are used a lot and what are not?
It's not definitive of course as some rarely occurring opcode may actually be executed a lot in reality. But gives an idea at least.
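Something like the following could do the counting once the OBEX sources are in one place. It is only a sketch: `MNEMONICS` is a small illustrative subset rather than the full P1 instruction set, `count_mnemonics` is a made-up name, and it counts any matching word outside a `'` comment, so a label or Spin identifier that happens to match a mnemonic would be counted too.

```python
import re
from collections import Counter

# Assumption: illustrative subset of P1 mnemonics, not the complete list.
MNEMONICS = {
    "ABS", "ABSNEG", "ADD", "CMP", "CMPSUB", "DJNZ", "JMP", "MOV",
    "MUXC", "MUXNC", "MUXNZ", "MUXZ", "NEG", "RDLONG", "SUB", "WRLONG",
}

def count_mnemonics(source_text):
    """Count static occurrences of known mnemonics in PASM source text."""
    counts = Counter()
    for line in source_text.splitlines():
        code = line.split("'", 1)[0]   # drop everything after a ' comment
        for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code):
            if token.upper() in MNEMONICS:
                counts[token.upper()] += 1
    return counts

sample = """
loop    rdlong  t1, ptr        ' fetch next value
        cmpsub  t1, limit wc
        muxc    outa, mask
        djnz    count, #loop   ' cmpsub in a comment is ignored
"""
counts = count_mnemonics(sample)
print(counts.most_common())
```

As said above, this only gives static counts; it says nothing about how often an instruction actually executes.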
But I have HAD IT with personal attacks, or criticism without understanding what I propose, and not doing a technical analysis of it.
I LOVE TO GET TECHNICAL, ANALYTIC CRITICISM/ARGUMENTS - I LEARN FROM THAT! SO DOES EVERYONE!
But I have had it with "hand waving", "it is not like the P1", or criticism without a well founded technical argument to support it.
If moderators were doing their jobs well, they should have come down hard on personal attacks, and non-technical criticism. But they are not.
You guys have NO idea how close I came to leaving the P2 forum forever in the past week. Or at least for a year, like last time.
I actually wonder if some forum members would prefer it if I disappeared.
Instructions that I think we could drop from the P1 set to make room for required new ones:
These can be done with the separate instructions:
CMPSUB
Probably don't need all flavors of MUX, just keep MUXC/MUXZ and drop MUXNC/MUXNZ.
I disagree strongly with removing CMPSUB and MUXNC/MUXNZ. These are seminal instructions that are very useful for atomic operations and dealing with bit tests. I've written some code that could *not* have been written without these instructions; specifically, interfacing gray-code encoders requires them.
I agree that some of the arithmetic instructions are less used, specifically the absolute values. Perhaps we could remove all ABS instructions in asm, since they can easily be done as 2 instructions:
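ABS:
        NEG D, S WC WZ NR          ' flags only: C = S is negative, Z = S is zero; D is not written
  IF_C_AND_NZ NEG D, S             ' S negative: D = -S (for S >= 0, D is left as it was)
ABSNEG:
        NEG D, S WC WZ NR          ' flags only, as above
  IF_NC_AND_NZ NEG D, S            ' S positive: D = -S (for S <= 0, D is left as it was)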
There are other instructions that can be broken down into "microcode" instead of dedicated instruction slots.
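To see how the two-instruction ABS replacement behaves, here is a small Python model of it. It assumes the P1 flag behaviour for NEG (C = sign bit of S, Z = S is zero) and treats registers as 32-bit unsigned values; `two_step_abs` and `abs32` are illustrative names, not anything from the tools.

```python
MASK = 0xFFFFFFFF
SIGN = 0x80000000

def neg32(s):
    """32-bit two's-complement negate."""
    return (-s) & MASK

def abs32(s):
    """Reference result of ABS D,S on a 32-bit value."""
    return neg32(s) if s & SIGN else s & MASK

def two_step_abs(d, s):
    """Model of: NEG D,S WC WZ NR  /  IF_C_AND_NZ NEG D,S."""
    c = bool(s & SIGN)         # assumed: C = S is negative
    z = (s & MASK) == 0        # assumed: Z = S is zero
    if c and not z:            # second instruction executes only for negative S
        d = neg32(s)
    return d                   # otherwise D keeps its previous value

for s in (0, 1, 5, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFB, 0xFFFFFFFF):
    print(hex(s), hex(two_step_abs(s, s)), hex(abs32(s)))
```

In this model the pair matches ABS when D and S are the same register. With a separate destination and a non-negative S, D is never written, so the general case would need an extra MOV (or a different pair).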
I vote yes on the mode to make the shifter work bidirectionally across 8 pins. We will use the heck out of that.
Re: composite loss
PAL might get tough again, but we have a lot of P1 software that drives a TV sans the chroma circuit.
It worked well, but for the coarse signals. With DACs this will be fine. No tear shed here. A 200MHz or so display COG can nail a TV signal. I'll end up doing one. I like TV. Another COG can do color lookups and integrate data from buffers.
If games are done, we have an awful lot of sprite COGS and the ability to partition the display among them still. Nuts crazy things are possible.
We don't lose much apart from the nice color space and fast pixel clocks. We can still do commercial quality video.
Bill,
I for one do not want you to leave, and while I may disagree with you at times, I still respect your methods and contributions. However, bucking the system to avoid arguments from a few is unfair to the system.
pedward,
Then keep MUXNC/MUXNZ and drop MUXC/MUXZ? Do you need both? CMPSUB can be done with a CMP and a conditional SUB; is it really required to be in one instruction? Remember the instructions are inherently twice as fast (2 clocks instead of 4).
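The CMP plus conditional SUB idea can be checked with a quick model. This sketch assumes CMPSUB D,S subtracts only when D >= S (unsigned) and that CMP D,S WC sets C on an unsigned borrow (D < S); both function names are made up for the comparison, and only the resulting D value is modelled.

```python
MASK = 0xFFFFFFFF

def cmpsub(d, s):
    """Assumed CMPSUB D,S result: subtract only if D >= S (unsigned)."""
    d &= MASK
    s &= MASK
    return (d - s) & MASK if d >= s else d

def cmp_then_sub(d, s):
    """Model of: CMP D,S WC  /  IF_NC SUB D,S."""
    d &= MASK
    s &= MASK
    c = d < s                  # assumed: C = unsigned borrow from the compare
    if not c:                  # IF_NC SUB D,S
        d = (d - s) & MASK
    return d

pairs = [(10, 3), (3, 10), (7, 7), (0, 0), (0xFFFFFFFF, 1), (1, 0xFFFFFFFF)]
for d, s in pairs:
    print(d, s, cmpsub(d, s), cmp_then_sub(d, s))
```

The D result is the same either way. What the model leaves out is the flags: after the pair, C reflects the compare rather than "a subtraction happened", so code that tests C afterwards would need its condition inverted. At 2 clocks per instruction the pair costs the same 4 clocks as a single P1 instruction, at the price of one extra long of code.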
However, bucking the system to avoid arguments from a few is unfair to the system.
Why is it bucking the system?
I know that at the very least you and jazzed make suggestions in person to Chip, being close enough to physically visit. I am sure there are others.
That is exactly the same as a PM.
Your request, that I only post on the forum, is illogical and unfair.
Regarding avoiding arguments:
I LOVE GOOD TECHNICAL ARGUMENTS
Moderators are not really doing their job stamping on personal attacks.
Quite a few people do not bother to read technical proposals, and have illogical, unsupported knee-jerk reactions, which influence others and result in many uninformed "Me too!" postings.
Such non-technical arguments should also be stamped on.
FYI,
Just like I do not have the right to request that those close enough to visit Chip and make suggestions, or email, phone or pm suggestions to him...
I'm not convinced that QLMM is going to be all that big a help. There are issues like how to handle branches and LMM macros. Do you have a complete QLMM proposal?
This thread is about the new chip we are going to build in the 180nm process.
Thanks Chip,
Your posts sound good.
Backward PASM code timing compatibility is not an issue. If instructions can be executed in one or two clock cycles, that's far more important than PASM timing compatibility.
However, P1 instruction compatibility is pretty important. If all the OBEX code and all the other code written before that people rely on for revenue can be ported, then it's a non-issue. DEC/ENC were never usable anyway, so those can easily be removed. MUL/DIV should be implemented of course.
Thanks for agreeing to not remove RDWORD/WRWORD.
If people want color NTSC or PAL they can use P1.
Any chance that 1MB HUB RAM would fit? Perry's suggestion makes sense looking forward.
Any chance that RD*C instructions can be used? If not that's fine.
Bill,
I'm actually quite far from Chip, but I do visit from time to time (once or twice a year). There have been times that Chip has contacted me about features, but they got shared here pretty quickly. Also, usually when I visit, we talk about almost everything else besides the chips. Anyway, I mostly just want to hear your idea because I am interested in how it works, since LMM/HUBEXEC type stuff is going to be key in making C/C++ perform well and I think that is very important for Parallax.
Yes it does. My posts above give the minimum expected performance, because until a given bit of code is tried, the % improvement is unknown.
As I came up with LMM, FCACHE, and FLIB (renamed "kernel extensions"), I know that you realize I am well aware of how good FCACHE can be.
With something like an FFT, on a small enough array that it fits in the cog, it can closely approach native MIPS for the fcached part.
With stack based recursive code, accessing local variables, it has very little effect.
hubexec is much faster than the base LMM performance, and saves a lot of hub memory.
QLMM has the potential of roughly 2x the performance of hubexec, at the expense of having to write a VLIW style GCC code generator. Money better reserved for P1+ / P2 shuttle runs in my opinion.
At one point, Ken mentioned that improving compiled C performance was high on customers' "want" list.
I see this thread has degenerated already. Moderators, please remove perceived personal attacks and posts about perceived personal attacks (including this one).
I say break it if necessary. As much as we would love to have code compatibility, sometimes it's just not possible. I have always felt that the biggest mistake that Microsoft made was trying to stay compatible with DOS when Windows was developed. They should have made a clean break.
I'll post FPGA images for the DE0-Nano and DE2-115 boards as soon as we have something working.
I'm looking forward to that. :cool:
So, what can I do with 16 cogs? Now I have to think of a project to do. LOL
Can I count on you and others to stamp on personal attacks on me, and technically unsupported criticism on my posts?
If so, I will be happy to post it.
I love technical criticism, it leads to improvements!
I know, and I would strongly prefer to post all such suggestions, but they get lost in unwarranted non-technical criticisms... which I am tired of responding to, but have to, as otherwise people not as technically versed will fall for it and add "me too!"... and the vicious circle continues.
I know what Rayman's after here, and it would be quite useful for connecting to a lot of LCD panels.
I too think CMPSUB is important.
Is there an estimation what the power budget might be?.... this chip might just be the one I'm needing.
Cheers,
Peter (pjv)
Had no idea that color space was so expensive! Ditch it.
I'm still catching up, but I vote now on code compatibility.
The idea that we take the best features from our efforts and respect the power budget is a winner!
I'm stoked about this. Chip, you are awesome. Really.
I WOULD FAR PREFER TO DO IT ON THE OPEN FORUM
Bean
Thank you.
The power considerations are firm in mind, as is the vision of lots of COGS to do stuff we packed into single ones.
I want fast big programs. Any suggestion toward that end makes sense to me. That is a big exception to P1. Let's be rid of it.
--Steve
Yep, and it works very well: the C version of the heater_FFT runs almost as fast as the hand-crafted PASM version!
How that translates to more general purpose usage I have no idea but it's most encouraging.
C.W.
Perhaps I should have vented earlier.