... The loader on it is way faster than the Altera arrangement that all other boards seem to use.
That is surprising, as Altera loaders usually use an FT240 USB FIFO (good to 1 MByte/sec) and a CPLD for the JTAG state engine.
That HW means they should be able to spin things close to FS USB ceilings?
If they are s l o w, maybe their SW is poorly written, or they have some silly default settings?
What FT part does the Parallax PCB use?
If Parallax have a solution that is much faster, then users could pay for that - another target for the compact P1 board mentioned elsewhere.
I'd like your feedback on your expectations about documentation. What do you think the minimum needs would be?
Ken Gracey
Same as what Chip did with the DE0 and DE2 early releases: there was a ~3 page doc with some photos of where to hook things up, and screenshots showing how to program the firmware in. That kind of thing goes a long way when starting out with something new, as does advice about the minimum Altera downloads (just the programmer, etc.).
Chip's text docs (instructions with some demo code) are fine.
A Google doc might work, but it'd be good to export it into the master zip at the time of release. Peter's P2 Google doc got a bit unwieldy as it grew large, but let's see how this goes.
There's likely to be several releases over the coming weeks. If possible could we have a master zip with the relevant contents (including Pnut, px, rbf/pof etc) with some kind of date stamp in the filename, at least on the master zip file, but perhaps on utils like Pnut too? I'm not worried about the date being out by a few days, so much as it's a consistent stamp across the various docs and utils.
If dates are tricky, perhaps walnut varieties (or something)
I'd suggest that, when it comes to Google Docs, new documents are created and linked from the master document. Not too many, but certainly the instruction set can have its own, etc.
Before getting too far, I detoured and got the execution from LUT working. We are going to switch to a 512x32 LUT so that it matches the cog RAM size. Now, $000..$7FC is cog RAM execution, $800..$FFC is LUT RAM execution, and $1000..$FFFFF is hub RAM execution.
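For anyone who wants to sanity-check the map above, here is a minimal Python sketch of it. The function names `exec_region` and `long_index` are made up for illustration; only the three byte-address ranges come from the post.

```python
def exec_region(addr: int) -> str:
    """Return where instructions are fetched from for a given byte address.

    Ranges are the ones quoted in the post:
    $000..$7FC cog RAM, $800..$FFC LUT RAM, $1000..$FFFFF hub RAM.
    """
    if not 0 <= addr <= 0xFFFFF:
        raise ValueError("address is outside the range quoted in the post")
    if addr < 0x800:
        return "cog"
    if addr < 0x1000:
        return "lut"
    return "hub"


def long_index(addr: int) -> int:
    """For a cog or LUT byte address, return the long index (0..511) within that RAM."""
    if exec_region(addr) == "hub":
        raise ValueError("hub addresses are not indexed as 512-long RAMs here")
    return (addr >> 2) & 0x1FF  # 4 bytes per long, 512 longs per RAM


if __name__ == "__main__":
    for a in (0x000, 0x7FC, 0x800, 0xFFC, 0x1000):
        print(f"${a:05X} -> {exec_region(a)}")
```

Because the addresses are in bytes, $7FC is long 511 of cog RAM and $800 is long 0 of the LUT.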
A few notable developments:
- I changed over to Altera's special memory instances, instead of my generic Verilog inferred memories, and the compiled design shrank by 3,000 LEs in the eggbeater memory!
- Treehouse confirmed that multi-cycle paths are quite hard to define for the ASIC flow, so I inserted 32 flops between the ALU and the final result mux, and cancelled all multi-cycle paths. It caused a ~10% speed decrease, but I'm working on improving that. Not having multi-cycle paths makes the chip synthesis way simpler - and safer!
- Treehouse has my latest files just now and they will start early synthesis efforts so that we can get a reality check on the overall design size and speed.
I've been working 24 hours and I'm going to rest now.
Note: based on our experience in working on the "unofficial" P2 docs a few years back, Google Docs would get really sluggish when the document got large. Maybe they've improved the performance, but beware.
Hah. That threw me off for a moment. Addresses are in bytes, not longs. *whew* So... how do you execute from the LUT?
[Calendar for June through September, marking the days from June 21 to September 10]
Interrupts and increased cog RAM. This is stuff that we've been begging for for years, and somehow it gets added in the last few weeks of a 400-week design project. I'll have to go back and look at my wish list again to see if something else can be added.
What are the caveats on LUT execute and random access?
(The assumption is that the LUT is not quite as ported/capable as COG RAM?)
Can the larger LUT be used partially as HW LUT (e.g. 256) with the remaining 256 used for data buffers?
It'll only be a single bus and LUT functionality would have had priority previously so it'll be no change there. Instruction fetches will alternate with data naturally I'd guess.
I'd imagine a LUT size can be anything that fits, with the remaining LUTRAM spare for other uses.
To execute from the LUT RAM, you just jump into its range and the cog will fetch instructions from it, instead of the cog RAM. All D and S registers are still in cog RAM, of course - just the instructions are coming from the LUT. This can be thought of as hub exec without any timing penalties.
If you are executing from the LUT, you would not want the streamer to use it, also, as there would be a conflict. However, you can read and write the LUT while executing from it by using RDLUT/WRLUT and SETQ2+RDLONG (block load).
Chip,
What's the timing/issues with using WRLUT/RDLUT to manipulate code in the LUT as it's running from LUT?
For example, with self modifying code using WRLUT, how many gap slots are needed between the WRLUT and the execution point in LUT memory? I'm guessing at least 1 or 2?
Can a RDLUT source LUT location be the same as the current execution point in LUT memory? (or the next execution point?) I'm guessing this doesn't matter.
I expect that self-modifying code will not work when executing from LUT because any modification would address the normal cog ram addresses (register space), just the same as hubexec. So basically any hubexec restrictions will also be restrictions when executing from LUT.
I had not expected to be able to use the LUT as "LUT" or RAM (via RDLUT/WRLUT or SETQ2+RDLONG) when executing from the LUT because it is single ported ram. Of course I haven't examined the pipeline stages in P2 (I don't think Chip has released an updated version yet???).
This is a "biggie" though for larger cog programs running "full speed"
Cluso99,
Notice I specifically said using WRLUT, not the normal self modifying instructions. I'm pretty darn sure you will be able to use WRLUT from code executing from LUT memory. Just curious about the timing.
I'm also sure that we will be able to self modify code in hubexec using WRLONG. It'll just be old-school style self modifying code where you have to write the whole instruction, and you have to deal with timing and other issues (fifo streamer being a kind of cache).
It is like cog execution, where you have to have one instruction in-between the write instruction and the written instruction.
All I know is that we are going to see some clever code that uses the LUT for program space and the COG registers entirely for data space. Like the SPIN2 interpreter, maybe. Or I/O drivers with large cog-local buffers.
Not only that, but it suddenly makes the use of cog address space for special registers less critical. I wonder (just aloud, not necessarily seriously) if it would make sense to add a PTRC and PTRD. Since PTRA and PTRB could be used with CALLx, this would allow two additional pointers that would not be related to call stacks.
On a slightly related note, I just noticed that there weren't any INDx registers in the 8/13 document. Did we lose indirect registers in the new design?
I had been presuming that the LUT (for code execution) would not permit access to the LUT while running code from it. I was thinking just a simple execution model with more caveats. Seems Chip went further.
From what I understand, the pipeline clock cycles (without considering the overlapping instructions) are...
2 clocks per instruction, such that one clock is I+R and the second clock is S+D
where I = instruction fetch, R = write result, S = fetch Source contents, D = fetch Destination contents
Remember, the normal Cog RAM is 2 port access, where one port is always read and the other r/w.
So when the instruction address is located in LUT (extended cog ram) or HUB (via the cache/buffer), the "I" fetch will come from the LUT or HUB. If it is coming from the HUB then there may be delays, but not from the LUT.
However, if an instruction is going to write to the LUT (ie WRLUT or SETQ2+RDLONG) then the "R" cycle for writing to the LUT will clash with the "I" cycle for a subsequent instruction creating a "STALL". I presume the pipeline "I" fetch will stall ???
If an instruction is going to read from the LUT (ie RDLUT) then I presume this will take place as an "S" or "D" cycle (in the same S+D clock), so there should be no problems here.
If the instruction has had the "S" or "D" address(es) modified with an AUGxx instruction (presuming these still exist), then again, I presume this will take place as an "S" or "D" cycle (in the same S+D clock), so there should be no problems here (for the fetching part).
BUT, if the instruction writes back the result to the LUT, we have the same problem as previously, where the "R" cycle for writing to the LUT will clash with the "I" cycle for a subsequent instruction creating a "STALL". I presume the pipeline "I" fetch will stall ???
If either of these problems is still an issue, then (at least for the time being) I would rather have a caveat that these situations are undefined and cannot be used.
As it works out, there are no caveats regarding concurrent LUT execution and LUT r/w. The LUT instructions are fetched on "go" cycles and LUT r/w operations occur on non-"go" cycles. RDLUT takes three clocks, not the usual two.
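As a rough illustration of why alternating cycles avoids a clash on a single-ported LUT, here is a toy Python model. It is only a sketch under assumptions that are not in the post: every instruction is treated as two clocks, the first clock is its "go" cycle, and a WRLUT data access lands on the following non-"go" cycle. The three-clock RDLUT and any pipeline overlap are not modelled, and `lut_port_schedule` is a made-up name.

```python
def lut_port_schedule(program):
    """Return a per-clock list of (cycle kind, LUT port use) for code running from LUT.

    Toy assumptions (not from the post): two clocks per instruction,
    instruction fetch on the "go" clock, WRLUT data access on the
    non-"go" clock that follows it.
    """
    schedule = []
    for op in program:
        # "go" cycle: the single LUT port is used for the instruction fetch
        schedule.append(("go", f"fetch {op}"))
        # non-"go" cycle: the port is free for a LUT data access, if any
        use = f"data access for {op}" if op == "WRLUT" else "idle"
        schedule.append(("non-go", use))
    return schedule


if __name__ == "__main__":
    for clock, (kind, use) in enumerate(lut_port_schedule(["ADD", "WRLUT", "MOV"])):
        print(clock, kind, use)
```

In this model each clock has exactly one user of the LUT port, which is the property Chip describes; the real hardware timing may differ in detail.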
Chip,
How did you implement the new LUT for code use?
If it is 16 of 512x32 SP RAM then it would use 3.0mm2 (+1.5mm2) whereas if it were DP RAM then it would use 4.7mm2 (+3.2mm2).
Since we have ~5mm2 free based on 50% gates required, then there may be advantages to having all DP RAM ???
I don't follow how the LUT is being used for CLUT (no need to explain as I will see when we get to use it).
But there may be benefits/simplifications in the CLUT having its own path for reading when used as a CLUT.
When used as Extended Cog RAM, by using AUGxx we can then access the LUT as a pure extension of normal Cog RAM, with the ability to use self-modifying code, and to use any standard instruction (mov/movi/and/add/ etc). This way, we would not require an extra SETQ2 instruction, nor would we require RDLUT/WRLUT as the standard RDxxxx/WRxxxx and SETQ would just work.
This would also permit CALLx instructions using the PTRx to use anywhere in the 4KB cog space.
Please, if it does not simplify the design / reduce the risks, or will take more than a couple of hours, do not consider it !!!
For reference from a previous post...
16 of 8192x32 SP RAM    16 x 1.57  mm2 = 25.1 mm2
16 of  512x32 DP RAM    16 x 0.292 mm2 =  4.7 mm2
16 of  256x32 SP RAM    16 x 0.095 mm2 =  1.5 mm2
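The arithmetic behind those totals, and behind the "+1.5mm2" and "+3.2mm2" deltas, can be checked with a few lines of Python. The per-instance areas and the 3.0 mm2 total for 512x32 SP RAM are the figures quoted in this thread; the variable and function names are just for illustration.

```python
# Per-instance macro areas in mm2, as quoted in the post
AREA_MM2 = {
    "8192x32 SP": 1.57,
    "512x32 DP": 0.292,
    "256x32 SP": 0.095,
}
INSTANCES = 16  # "16 of ..." in the post


def total_area(kind: str) -> float:
    """Total area in mm2 for 16 instances of the given RAM macro."""
    return INSTANCES * AREA_MM2[kind]


old_lut = total_area("256x32 SP")  # the current 256x32 SP LUT
new_lut_sp = 3.0                   # quoted total for 512x32 SP; no per-instance figure given
new_lut_dp = total_area("512x32 DP")

if __name__ == "__main__":
    print(f"hub RAM total:        {total_area('8192x32 SP'):.1f} mm2")
    print(f"256x32 SP LUT total:  {old_lut:.1f} mm2")
    print(f"512x32 DP LUT total:  {new_lut_dp:.1f} mm2")
    print(f"extra for 512x32 SP:  {new_lut_sp - old_lut:.1f} mm2")
    print(f"extra for 512x32 DP:  {new_lut_dp - old_lut:.1f} mm2")
```

The deltas are relative to the existing 256x32 SP LUT, which is why the DP option shows up as about 3.2 mm2 of additional area rather than 4.7.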
I think DP only matters if you expect S and D to also be in the same space.
Above, Chip said S and D are always in the lower 512 longs
(i.e. LUT execution is a special, local form of hub exec that bypasses the slot queue).
If you expand the LUT map area to DP, then the opcodes would all naturally expand by an extra bit per S and D, which sounds like quite a large change?
Not necessarily (and we don't want this change as it's too big), and this is why I posted this.
The S & D addresses could be modified (extended) by AUGD/AUGS/AUGDS which could set the A9 bit to the upper 512 longs, permitting the upper cog ram (ie LUT) to be used with standard instructions (ie a pair of instructions).
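To make the suggestion concrete, here is a small Python sketch of the address arithmetic. This is a proposal in the thread, not a confirmed feature, and how an AUGx prefix would actually carry the A9 bit is an assumption; `effective_register` is a hypothetical name.

```python
def effective_register(field9: int, a9: int = 0) -> int:
    """Combine a 9-bit S or D field with a hypothetical A9 bit from an AUGx prefix.

    Results 0..511 would address normal cog RAM; 512..1023 would address
    the upper 512 longs (the LUT) under the proposed scheme.
    """
    if not 0 <= field9 <= 0x1FF:
        raise ValueError("S/D field must fit in 9 bits")
    if a9 not in (0, 1):
        raise ValueError("A9 is a single bit")
    return (a9 << 9) | field9


if __name__ == "__main__":
    print(effective_register(0x1FF))     # last long of cog RAM, no prefix
    print(effective_register(0x000, 1))  # first long of the LUT, with prefix
```

Without the prefix the instruction behaves as it does today, which is why the opcode format itself would not need to grow.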
Aha, got it. Thanks, Chip.
Does that mean it would then insert an extra clock (ie 4 clocks) to get back in "sync" with "go" cycles or does the "go" cycle get shifted?
Sounds like such a paging scheme could cause a lot of fun around interrupts?
Also, does the P2 really need all memory available as discrete registers (the highest cost RAM)?
Lots of SW uses more memory as arrays than as discrete VARs, and code can fetch from this memory too, which frees up the more valuable register-capable RAM.
Comments
The most important 'documentation' is a full working set of files, including board test examples.
See other posts here: the main issues seem to be in the details of things like pin mapping and partial product test.
The average user of these is not going to be inexperienced electronically, but may be relatively new to Quartus flows.
There is an FTDI chip upstream that can talk to either the P1 or FPGA
81 days since the end of Spring.
12 days to the beginning of Fall.
52 days to November 1st.
I guess not quite quick enough if you saw it! :-)
Get some rest and come back to it fresh and relaxed