Yes, it would then look up something else, or be a table of instructions, e.g., a recording of a replay, where it would have bits for movement directions or anything. But I guess GETNIB would work fine too. Isn't there one for bytes too, IIRC? That would be helpful as well,
for getting a tile from a map to then draw to the screen, or other such uses, not just tied to gaming.
Edit: MOVF or something like that; I can't remember offhand.
Yes, there are GETBYTE, GETWORD, and some others too.
GETNIB would save having to ROR the source, but you would need to AND the dest to mask the upper bits off. Though, if it were a table, that could be unnecessary.
The MOVF/SETF (byte mover) is now obsolete. It has been replaced with the following instructions:
GETNIB SETNIB GETBYTE SETBYTE GETWORD SETWORD ROLNIB SWBYTES STBYTES ESWAP4 ESWAP8 ROLBYTE ROLWORD
Some very useful instructions in that group.
GETNIB, GETBYTE, and GETWORD all zero the upper bits of the destination. No ANDing required.
Of course. I was confusing it with SETNIB, where you put it into an existing field. I have yet to understand how some of these work.
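To make the GETNIB/SETNIB distinction concrete, here is a minimal Python model (not PASM). The helper names and the explicit field-index argument are assumptions made for illustration; the thread does not spell out how the real instructions select the field. The only behaviour taken from the thread is that the GET forms zero the upper bits of the destination, while SETNIB writes into an existing field.

```python
MASK32 = 0xFFFFFFFF

def get_field(src, index, width):
    """Return field `index` (0 = least significant) of `width` bits.
    All upper bits of the result are zero, so no separate AND is needed."""
    return (src >> (index * width)) & ((1 << width) - 1)

def set_field(dst, value, index, width):
    """Write the low `width` bits of `value` into field `index` of `dst`,
    leaving every other bit of `dst` unchanged."""
    shift = index * width
    mask = ((1 << width) - 1) << shift
    return ((dst & ~mask) | ((value << shift) & mask)) & MASK32

# Hypothetical wrappers named after the instructions discussed above.
def getnib(src, n):
    return get_field(src, n, 4)

def getbyte(src, n):
    return get_field(src, n, 8)

def getword(src, n):
    return get_field(src, n, 16)

def setnib(dst, value, n):
    return set_field(dst, value, n, 4)

# Example: a replay table packing eight 4-bit movement codes per long.
packed = 0x12345678
moves = [getnib(packed, n) for n in range(8)]  # lowest nibble first
```

With this packing, each long of a replay table or tile map can be walked one nibble at a time without shifting and masking by hand.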
Awesome set of instructions. Shame I was too busy with work at the time they came out and missed seeing that part of the thread. Then again, it does fill up quickly in here.
Well, since there is no such thing as a stupid question around here, I think I'll take my shot at a stupid answer :)
I think the description in Prop2_Docs from 11/27 is a little thin, since it stops just when things are getting interesting.
Superficially, it looks like FRAC first applies a fixed scale to the fraction, but the resulting GETDIVQ and GETDIVR results are not really described. Are they the components of a binary fraction?
To start a 32-bit fraction calculation, use FRAC:
FRAC D/#,S/# - Begin calculating the unsigned fraction of D/# over S/#, where
D/# and S/# are unsigned 32-bit values and D/# is less than S/#.
Use GETDIVQ to get the result.
After starting the divider, you'll have 17 clock cycles to execute other code, if you
wish, before GETDIVQ/GETDIVR will return the quotient/remainder long(s) of the result:
GETDIVQ D - Get quotient result
GETDIVR D - Get remainder result
In single-task mode, GETDIVQ/GETDIVR will stall the pipeline until the result is ready.
In multi-task mode, GETDIVQ/GETDIVR will jump to themselves until the result is ready,
freeing clocks for other tasks.
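One plausible reading of the FRAC description, sketched in Python. This is an assumption, not something the quoted docs confirm: it treats FRAC as dividing D scaled by 2^32 by S, so that GETDIVQ would return a 32-bit binary fraction (quotient / 2^32 approximates D/S) and GETDIVR the remainder of that scaled division. The requirement that D be less than S is what keeps such a quotient within 32 bits. The function name `frac` and its return shape are hypothetical.

```python
def frac(d, s):
    """Model of an ASSUMED FRAC behaviour: (d << 32) divided by s.

    Returns (quotient, remainder). The quotient is read as a 32-bit
    binary fraction: quotient / 2**32 is approximately d / s.
    """
    if not (0 <= d < s <= 0xFFFFFFFF):
        raise ValueError("FRAC requires unsigned 32-bit values with D < S")
    return divmod(d << 32, s)

q, r = frac(1, 4)      # one quarter
approx = q / 2**32     # 0.25
```

Under this reading, 1/2 comes out as $80000000 and 1/3 as $55555555 with a remainder of 1.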
Ugh! I missed that part - just assumed the destination was rotated.
If he is wanting the source to be rotated, that is not possible as that would mean 2 writebacks to the cog in stage 4.
So this is what he was after...
Which means combining these instructions...
AND D,#mask
where "2" uses mask=%11, "4" uses mask=%1111, "8" uses mask=%11111111
MOV tmp,S
AND tmp,#mask
OR D,tmp
followed by
ROR S,#n
where n=2/4/8
That's right. We only have one cog RAM register-write possibility per instruction, so the operation on S would have to become an operation on D in a separate instruction. As clock cycles go, this wouldn't be any worse - it would just take another instruction.
I'm trying to get my camera up and running on the P2. Right now, for everything I don't know how to do with the Prop2, I am contemplating doing it with a Prop1 and then just frankensteining the hardware together.
Of course, the point is to get everything on the P2 as quickly as possible. How do I set up CTRB to output NCO on pin x (port A)? I understand how to assign FRQB; it is the mode bits that I can't fathom.
Thanks,
Rich
Rich, sorry, those docs aren't done yet. You can find the LSB-justified bit pattern that does NCO in the latest docs file. Just put the pin number you want the output on into the D and/or I fields (via SETD/SETI). Pin numbers are seven bits, but it will be necessary to set relative bits +7 and +8 to either %01, %10, or %11 to get output. Remember to set the related DIR bit(s). You'll need to experiment (I'm not at my work computer now).
Just a question regarding P2 power usage.
While there is some 'down time' waiting for the next shuttle run, I know you've been tweaking and adding features at a rip-roaring rate.
However, I've not seen any mention of any sort of power-gating being implemented.
This may already be in place; however, if not, I think it would be very, very worthwhile to minimize power requirements as much as is reasonably possible.
The synthesis tools add clock gating automatically in cases where adequate setup times exist. In our last synthesis run, 89% of the flipflops were clock-gated, as opposed to always-clocked with enable inputs. This chip might dissipate a couple of watts at 1.8V.
@CHIP
In Automated Testing Systems we often have the need to measure isolated voltages.
This can easily be done with external Sigma/Delta encoders like http://www.analog.com/AD7401A
This part delivers a 10 MHz bitstream, which gives a 16-bit ADC when filtered by a SINC3 filter
(Verilog code shown in the datasheet). But then you need an FPGA to decode it, which is costly.
So I was wondering if the P2 ADC HW might be used to take the EXTERNAL bitstream instead of the internal one,
to implement an isolated ADC.
IIRC the internal ADC works with a 1st-order S/D encoder, whereas this SINC3 gives a MUCH better signal-to-noise ratio.
Looking at the Verilog code, it might even be possible to use one COG @200MHz to do it in SW,
but some HW assist, if available, would be much better.
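For readers who have not met a SINC3 filter, here is a minimal Python sketch of the general technique: three cascaded integrators running at the bit rate, followed by three differentiators running at the decimated rate. It is not the datasheet's Verilog and not P2 code. The function name and the `ratio` parameter are hypothetical, and the output is left unscaled.

```python
def sinc3_decimate(bits, ratio):
    """Third-order sinc (CIC) decimator for a 1-bit delta-sigma stream.

    bits  : iterable of 0/1 samples at the modulator clock rate
    ratio : decimation ratio (one output per `ratio` input bits)

    Returns a list of unscaled outputs; full scale is ratio**3.
    """
    acc1 = acc2 = acc3 = 0      # integrators, updated on every bit
    prev1 = prev2 = prev3 = 0   # previous inputs of the three differentiators
    out = []
    for i, b in enumerate(bits, start=1):
        acc1 += b
        acc2 += acc1
        acc3 += acc2
        if i % ratio == 0:      # decimated rate: run the differentiators
            c1 = acc3 - prev1
            prev1 = acc3
            c2 = c1 - prev2
            prev2 = c1
            c3 = c2 - prev3
            prev3 = c2
            out.append(c3)
    return out

# A constant all-ones stream settles at full scale after the first few outputs.
settled = sinc3_decimate([1] * (16 * 8), 16)[3:]
```

The inner loop is only three additions per input bit, which is why doing it in software on one cog looks plausible; the first few outputs after start-up are transient and should be discarded.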
The CTRs have modes to sum up 1's in order to realize a 1st-order delta-sigma conversion. I don't know if there will be adequate bandwidth to do any 2nd-order conversions, unless you can use every Nth bit, or MSBs of short accumulations.
I got the hub execution working last night (5:30am). It only uses one cache line, but it should be easily expanded to four lines, using a least-recently-used algorithm.
Since hub execution occurs whenever a task's 16-bit program counter is beyond $01FF, it turns out that there is no need to store and recall a hub-mode bit. The only rule is: if an instruction is being fetched above $01FF, it needs to be read from the icache, which might entail an 8-long hub fetch. This got rid of all kinds of state hardware (like hub mode being stored in bit 18 of stack data). When things turn out right, they are always simple. I feel really good about how this is shaping up. There's no bandwidth pinch anywhere, either, which I was concerned about.
When hub code is cached up, it runs exactly as fast as it would from the cog. I made a pin-toggling program that runs partly in the cog and partly in the hub, and on the scope you can see every 50ns cycle and when the cache fetching occurs. I feel relieved. This was a big feature add, but it's something very valuable. This will work wonders for the Spin interpreter's efficiency. Being able to bust beyond the cog's RAM is a fantastic feeling.
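The fetch rule described here can be sketched as a small Python model. Several things in it are assumptions for illustration only: the program counter is treated as a long index, cache lines are aligned on 8-long boundaries, and hub memory is a plain list indexed by that same counter. The class and function names are hypothetical. `lines=1` corresponds to the single cache line that is working now, and `lines=4` to the proposed least-recently-used expansion.

```python
from collections import OrderedDict

COG_TOP = 0x01FF    # addresses at or below this execute from cog RAM
LINE_LONGS = 8      # one cache line holds 8 longs

class ICache:
    def __init__(self, hub, lines=1):
        self.hub = hub               # list of longs standing in for hub RAM
        self.lines = lines           # number of cache lines
        self.cache = OrderedDict()   # tag -> 8 longs, least recently used first
        self.hub_fetches = 0         # count of 8-long hub reads

    def fetch(self, pc):
        tag = pc // LINE_LONGS
        if tag in self.cache:
            self.cache.move_to_end(tag)          # mark as most recently used
        else:
            if len(self.cache) >= self.lines:
                self.cache.popitem(last=False)   # evict least recently used
            base = tag * LINE_LONGS
            self.cache[tag] = self.hub[base:base + LINE_LONGS]
            self.hub_fetches += 1
        return self.cache[tag][pc % LINE_LONGS]

def fetch_instruction(pc, cog_ram, icache):
    """No hub-mode bit is needed: the PC value alone selects the source."""
    if pc <= COG_TOP:
        return cog_ram[pc]
    return icache.fetch(pc)
```

In this model a straight-line run of eight hub instructions costs one hub fetch, and a loop that fits in the cached lines costs none after the first pass, which matches the observation that cached hub code runs as fast as cog code.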
Wow! That is wonderful news! Congratulations!!
Are you ready to release the new instruction encodings yet so we can think about updating propgcc?
Hub execution will give you 200 MHz LMM execution; that, coupled with the ability to bring in DRAM data, means big things.
I assume there's no d-cache... so this is only for making static read-only code faster, right?
That's right. If you need self-modifying code, you could load it into the cog RAM and execute it there. After I get this done, I want to make automatic transfers between hub, cog, aux, and pins.
You could say so. You just can't execute instructions that would try to alter those resources while the transfer is going on. For example, if you start a hub-to-AUX transfer, you'd better not do hub or AUX instructions while it's busy, or the transfer would get corrupted.
Comments
BTW, do you know what FRAC D/#,S/# does?
Cluso99, what does FRAC D/#,S/# do?
Rich
I thought it was probably described somewhere in a thread I had missed or that maybe we only had one counter on the FPGA.
Baggers…Happy Holidays!!!
Rich
Here's a page with a picture of clock gating - see Figure 1:
http://www.synopsys.com/Tools/Implementation/RTLSynthesis/Pages/PowerCompiler.aspx
Would it be possible to execute inline assembler in Spin, or even a kind of LMM assembler?
I think so, as far as releasing the instruction encodings goes. There will be a few instruction additions coming, but I don't see any big changes.
Yes. That's going to be important to Spin - to be able to execute PASM and even background PASM in other tasks of the same cog.
Also a quick question... Would the Spin interpreter be able to run from ROM, thus freeing even more cog RAM?