My thoughts were mainly that it could be simpler to implement and explain, with the possible benefit of fewer caveats. Currently an extra clock is being inserted in RDLUT, which would not be required.
As I said originally, all I was after was to see if it were possible to execute from LUT space. I was happy with any caveats required. Chip went a step further than this since it was easy.
The S & D addresses could be modified (extended) by AUGD/AUGS/AUGDS, which could set the A9 bit to select the upper 512 longs, permitting the upper cog RAM (i.e. the LUT) to be used with standard instructions (i.e. as a pair of instructions).
Sounds like such a paging scheme could cause a lot of fun around interrupts?
Also, does the P2 really need all memory available as discrete registers (the highest-cost RAM)?
Lots of SW uses more memory as arrays than as discrete VARs,
and code can fetch from this memory too, which frees up the more valuable register-capable RAM.
As it works out, there are no caveats regarding concurrent LUT execution and LUT r/w. The LUT instructions are fetched on "go" cycles and LUT r/w operations occur on non-"go" cycles. RDLUT takes three clocks, not the usual two.
Hi Chip
I believe that any address/data conflicts could be addressed, including streamer-related ones, by splitting the CLUT into two 256-long halves, isolated by independent sets of muxes.
Then, execution and streaming could run in parallel, although in mux-segregated address spaces and subject to specific but not very restrictive rules.
Henrique
That could certainly work, but we would need two memories - one for each half. When you go to read or write, you tie up the whole memory. I think what we have right now is going to be fine because there probably won't be a strong need to stream from the LUT and execute from it, too. I think those will be fairly different types of applications that would use things differently.
I could have initiated the read on the "go" cycle (before we know its condition), instead of the next "get" cycle, but it would be disastrous in cases where the instruction wasn't supposed to execute and it wound up interfering with streaming that was going on, causing a glitch. Also, it would have conflicted with possible LUT instruction fetching. I think what we have now is just right. The only way to get around these problems, if they are problems, is to use a dual-port RAM, just like the cog RAM. That would explode the area, though (+3 mm2), for a marginal improvement in function.
On a slightly related note, I just noticed that there weren't any INDx registers in the 8/13 document. Did we lose indirect registers in the new design?
Yes, they are gone. We have an ALTDS instruction now that substitutes D and S fields in the next instruction. ALTDS also increments/decrements those fields in its D register, with S supplying the inc/dec controls. It was a really cheap way around what could be a huge hardware situation, like in Prop2-Hot.
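To make the ALTDS idea concrete, here is a rough Python sketch of "substitute the D and S fields of the next instruction, then step the pointer". It is only an illustration under assumptions that are not confirmed in this thread: the pointer register is assumed to hold D in bits 17..9 and S in bits 8..0 (the same positions as in an instruction word), and the step amounts are passed as plain integers instead of being decoded from S. The names substitute_ds and step_pointer are made up for this example.

```python
FIELD_MASK = 0x1FF        # 9-bit register address field
DS_MASK = 0x3FFFF         # D and S fields together, bits 17..0 (assumed layout)
WORD_MASK = 0xFFFFFFFF    # 32-bit instruction word


def substitute_ds(next_instr: int, pointer: int) -> int:
    """Return next_instr with its D and S fields replaced by those in pointer."""
    return (next_instr & ~DS_MASK & WORD_MASK) | (pointer & DS_MASK)


def step_pointer(pointer: int, d_step: int, s_step: int) -> int:
    """Add d_step/s_step to the D and S fields of pointer, wrapping at 9 bits.

    Hypothetical: the real inc/dec control encoding in S is not modelled here.
    """
    d = ((pointer >> 9) + d_step) & FIELD_MASK
    s = (pointer + s_step) & FIELD_MASK
    return (pointer & ~DS_MASK & WORD_MASK) | (d << 9) | s


# Usage: point at D=5, S=7, patch an instruction, then post-increment D only.
pointer = (5 << 9) | 7
patched = substitute_ds(0xF0000000 | (1 << 9) | 2, pointer)
pointer = step_pointer(pointer, d_step=1, s_step=0)
print(hex(patched), hex(pointer))
```

The point of the sketch is only that the substitution is a cheap mask-and-merge on the fetched instruction, which is why it avoids the dedicated INDx hardware.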
As it works out, there are no caveats regarding concurrent LUT execution and LUT r/w. The LUT instructions are fetched on "go" cycles and LUT r/w operations occur on non-"go" cycles. RDLUT takes three clocks, not the usual two.
Aha - got it thanks Chip.
Does that mean it would then insert an extra clock (ie 4 clocks) to get back in "sync" with "go" cycles or does the "go" cycle get shifted?
The "go" cycle can be delayed by any number of clocks (e.g. by WAITCNT). The "get" cycle reads D and S registers and the "go" cycle writes the ALU result and reads another instruction. Normal execution goes: get,go,get,go,get,go, etc. If an instruction delays, it looks like this: get,gox,gox,gox,...go.
WRLUT issues the write on "get", then does "go".
RDLUT issues the read on "get", captures the result on "gox", then does "go". It needs that "gox" cycle to capture the data before routing it through the result mux, which also takes some time.
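The get/gox/go description above can be written down as a tiny Python model. It only reproduces the cycle sequences stated in this thread (normal and WRLUT: get, go; RDLUT: get, gox, go; a delayed instruction: get followed by any number of gox, then go). The function name cycle_sequence and the "delayed" label are made up for the example.

```python
def cycle_sequence(kind: str, stall_clocks: int = 0) -> list[str]:
    """Return the cycle names one instruction passes through.

    kind: "normal", "WRLUT", "RDLUT", or "delayed" (hypothetical label for
    something like WAITCNT that holds off "go" for stall_clocks clocks).
    """
    if kind in ("normal", "WRLUT"):
        extra = 0                 # write is issued on "get", then straight to "go"
    elif kind == "RDLUT":
        extra = 1                 # one "gox" to capture the LUT read data
    elif kind == "delayed":
        extra = stall_clocks      # "go" held off by any number of clocks
    else:
        raise ValueError(f"unknown instruction kind: {kind}")
    return ["get"] + ["gox"] * extra + ["go"]


for kind in ("normal", "WRLUT", "RDLUT"):
    seq = cycle_sequence(kind)
    print(f"{kind:7s} {len(seq)} clocks: {','.join(seq)}")
print("delayed", ",".join(cycle_sequence("delayed", stall_clocks=3)))
```

This shows why RDLUT is three clocks while WRLUT stays at two: only the read needs the extra "gox" before "go".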
I am not qualified to talk here...
just wondered if writing the result on the first gox would make a difference?
Well, in WRLUT there is no "gox", only "get" then "go". So, we would have to make this a 3-clock instruction to even have a "gox". I don't see what good it would do, but I probably don't understand what you are thinking about.
In my thinking, they would always be kept isolated from each other.
Each one would be capable of acting as a peripheral storage tank, apart from having two mux groups connected to its data/address/control buses: one driven by the streamer logic state machine, and the other by the ALU Get/Result bus logic state machine.
But, in fact, I was only trying to imagine the way you designed their data, address and control buses.
As you described that they would be tied up for reading and writing, it's probably a common bus, with the streamer SM logic on one side and the ALU Get/Result bus SM logic on the other; simpler and faster than I supposed it would be.
And yes, I must recognize that streaming to/from the LUT and executing from it are different applications; there would be no need to run them simultaneously.
Yes, killed my DE2-115 a while back. Have a couple of other boards... BeMicro-A9 (etc)... P1V working fine, but no time right now to play. The LEDs on the 123 indicate that Jac's P1V is active, but Proptool can't identify the Prop. Was working last weekend... after giving it time to find itself.
When your guys test, make sure they do a serial loopback a la Ozpropdev.
I don't know what the delay between releasing the image and releasing the code will be for the P2... but I read your comment about supporting multiple boards. If I were you I would worry about the BeMicro-A9 and P123 A7 & A9 at this point. Once you release the sources, it will give everybody something fun to do. If there is someone who is really inconvenienced by this... let them speak up now :)!!!!
Chip just said he's still on the DE2 himself, so I think you're pretty safe for that one for a while. On the other hand, DE0 will be spare-time dependent I suspect.
I'm hoping there will be a DE2-115 image since I don't have either a 1-2-3 board or a BeMicro-A9.
After all the PLL drama, I've got Prop2 comfortably running on the Prop 1-2-3 FPGA -A7 board.
This is a much simpler setup than the DE2-115. Everything is on one board, there's no need for a PropPlug, just one USB connection. Download is 15x faster, and there's no need to cycle power, just flip a PGM/RUN switch.
Assuming this means an image is imminent: I've got a DE2-115, but I can pick up a BeMicroCV-A9 if need be, although they seem to have jumped up in price from $149 to $210 (what's the go there?).
BTW, do you have a link to the documents, even if they are (and will be) a work in progress?
Chip,
I noted on the ALTDS thread you mentioned the new instruction bit order...
CCCC OOOOOOO CZI DDDDDDDDD SSSSSSSSS
Is there any specific reason you have reversed the CZ bits (of CZI)?
On the P1 they are .... ZCRI ....
If there is no reason, wouldn't it be better to keep the same sequence ZCI so that we don't have to remember to reverse the order of ZC in the P2?
BTW I feel your pain with the PLL problem on the A7/A9 FPGAs. What a PITA!
The C is first because it just seemed like a better way to go, plus it made the Verilog more orderly, because of the way I had arranged things. I had been thinking about changing it in the back of my mind for quite a while.
I think that the Prop2 is such a different animal than the Prop1 that nobody's going to think much of that difference, in light of everything else that has changed. Just my feeling about it. The new way seems cleaner to me.
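For anyone writing tools against the new bit order, the layout quoted above (CCCC OOOOOOO CZI DDDDDDDDD SSSSSSSSS) adds up to 4+7+3+9+9 = 32 bits. Below is a small Python sketch that packs and unpacks those fields. It assumes the string is written MSB first, so C lands in bit 20 and Z in bit 19; the names Fields, encode and decode are just for this example.

```python
from typing import NamedTuple


class Fields(NamedTuple):
    cond: int    # CCCC, bits 31..28
    opcode: int  # OOOOOOO, bits 27..21
    c: int       # C, bit 20 (assumed: C now comes before Z)
    z: int       # Z, bit 19
    i: int       # I, bit 18
    d: int       # DDDDDDDDD, bits 17..9
    s: int       # SSSSSSSSS, bits 8..0


def decode(word: int) -> Fields:
    """Split a 32-bit instruction word into its fields."""
    return Fields(
        cond=(word >> 28) & 0xF,
        opcode=(word >> 21) & 0x7F,
        c=(word >> 20) & 1,
        z=(word >> 19) & 1,
        i=(word >> 18) & 1,
        d=(word >> 9) & 0x1FF,
        s=word & 0x1FF,
    )


def encode(f: Fields) -> int:
    """Pack fields back into a 32-bit instruction word."""
    return (
        (f.cond & 0xF) << 28 | (f.opcode & 0x7F) << 21
        | (f.c & 1) << 20 | (f.z & 1) << 19 | (f.i & 1) << 18
        | (f.d & 0x1FF) << 9 | (f.s & 0x1FF)
    )


f = Fields(cond=0b1111, opcode=0b0000001, c=1, z=0, i=1, d=0x1FF, s=0x003)
print(f"{encode(f):032b}")
```

With the fields named like this, the C/Z swap relative to the P1 only matters in one place (the two shift constants), which fits the feeling that few people will notice the difference.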
That sounds encouraging. Are you likely to release an A7 image before the new A9 board is ready? Might still be worth picking up an A7 board after all.
Edit: I just realized that you're probably talking about the 1-2-3 A7 board with the whisker wire. Is there any other way to get a clock to the P2? How have people been doing it with P1v on the 1-2-3 A7 board?
Comments
Treehouse did some preliminary synthesis work today so we can get a reality check on size and timing.
This is a cell-count/silicon-area report on a single Prop2 cog.
You see that the memories are listed at the top (cog_ram_/cog_lut_) and their areas total 386,471 um2. The total cog, including those memories, is 424,322 um2. That means the cog logic area is only 37,851 um2, or less than 1/25 mm2! The overall cog size will grow to ~500k um2 after we upgrade the cog_lut_ from 256 to 512 words.
In practice that 37,851 um2 logic area needs to be divided by 65% (multiplied by ~1.5) to allow initially empty spaces in the cell array for the clock tree and signal buffers to go into later. That's still really small, though. A cog is only ~13,000 cells!
The clock period was set to 10ns (100MHz) for this run, which is very low, but allowed us to check the basic flow out without getting encumbered in possibly failed-timing reports. Speed checks are next.
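The arithmetic behind those figures can be checked in a few lines of Python. Only the numbers quoted in the post are used; the variable names are made up for the example.

```python
# Figures quoted from the synthesis report
ram_area_um2 = 386_471      # cog_ram_ + cog_lut_ memories
total_cog_um2 = 424_322     # whole cog, memories included
utilization = 0.65          # fraction of the cell array initially filled
clock_period_ns = 10

# Logic area is what is left after the memories
logic_area_um2 = total_cog_um2 - ram_area_um2
logic_area_mm2 = logic_area_um2 / 1_000_000    # 1 mm2 = 1,000,000 um2

# Dividing by 65% leaves room for the clock tree and signal buffers
placed_logic_um2 = logic_area_um2 / utilization

clock_mhz = 1_000 / clock_period_ns

print(f"logic area: {logic_area_um2:,} um2 = {logic_area_mm2:.4f} mm2")
print(f"under 1/25 mm2: {logic_area_mm2 < 1 / 25}")
print(f"placed logic area: {placed_logic_um2:,.0f} um2 (x{1 / utilization:.2f})")
print(f"clock: {clock_mhz:.0f} MHz")
```

So even with the 65% fill allowance, the logic comes to roughly 58,000 um2, still small next to the memories.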
Thanks for the get/gox/go info. Makes it quite simple to understand.
Thanks for the update Chip!
If (A), what's the use case? If (B) what's the value of having S be an immediate value?
Disregard my earlier post... board 2 now just sits there like a zombie:)
Are you saying that you are using the new -A7 board exclusively?
I'm sure Chip won't do that anyway.
You might want to review the conversation on ALTDS here - http://forums.parallax.com/discussion/156242/question-about-altds-implementation-in-new-chip/p1
It's an old topic that recently got revived by Seairth wondering about its effectiveness. This then evolved into a question about how to do double indirection. I added my two cents' worth by trying to make a working example of double indirection.
The end result seems to be asymmetrical methods, depending on whether one is reading or writing.
15x speedup sounds delicious... Is the programming and pin usage exactly the same for the A9 as the A7? I wonder if I should dig into the BeMicroCV-A9 schematic and the 1-2-3 schematic and see how hard it would be to drop a Propeller on the BeMicroCV-A9 for programming... Is that worth the trouble?
===Jac
Edit: I just realized that you're probably talking about the 1-2-3 A7 board with the whisker wire. Is there any other way to get a clock to the P2? How have people been doing it with P1v on the 1-2-3 A7 board?
The A9 delays have likely pushed back the time-lines.
I think the issue is around the PLL paths?
There is a 50MHz canned osc, so there will be non-PLL clock solutions. That may be fine for most testing.
Adafruit have a small ~$7 Si5351 PCB, which would give a PLL solution up to 200MHz.
How similar is the FPGA PLL to the P2 PLL, and do you have a proven OnSemi PLL cell you are able to use?
Seems that could be a test coverage problem area?