@evanh said:
For 16.16 fixed-point, this is pretty quick in the scheme of things:
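        getqx   pa              ' pa = QX, low 32 bits of the product (assumes a preceding QMUL)
        getqy   fp              ' fp = QY, high 32 bits of the product
        rolword fp, pa, #1      ' fp = {QY[15:0], QX[31:16]}, the 16.16 result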
The debug window has a mysterious delay when scrolling. The last line is not cleared immediately but instead keeps a copy of the line just scrolled upward. Half a second later it is eventually cleared or overwritten with new content.
The debug window is not refreshed when it is obscured by another window and then brought to the front again. Instead, it stays black.
It would be nice if text from the debug window could be copied/pasted into an editor. This would make it much easier to share or document debugging output, for example here in the forum.
@ManAtWork said:
I think you've already mentioned that but, unfortunately, I haven't managed to find out how this works. How is the log feature activated?
In your code, just define DEBUG_LOG_SIZE as a value > 0... set it to the maximum number of bytes you'd like to limit the log file to.
CON
  DEBUG_LOG_SIZE = 1024
When using Propeller Tool, the log file will be stored as the "DEBUG.log" file in your My Documents > Propeller Tool folder.
Comments
Which pipeline trick are you meaning? I just added 3 instructions, I think, to do the job. It's in the "packf" routine.
Okay, of course, it's done at packing time. The one I did was integrated into the divide and not generic. And I keep forgetting, I didn't have the luxury of excess lower bits.
I think I remember this. Its purpose was to give a rounded quotient, right?
Yes. It matches the IEEE de-biasing results, but applied to 32-bit integers instead of floats. It was mainly for use with the enhanced muldiv65() routine, which itself was an extension of your original muldiv64().
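A minimal sketch of a rounded integer quotient, assuming "de-biasing" here means IEEE-style round-to-nearest with ties going to the even quotient. The name `div_round_even` is hypothetical, and this is not the muldiv65() code itself, just a host-side C model of the rounding rule:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: unsigned divide, round to nearest, ties to even. */
uint32_t div_round_even(uint32_t n, uint32_t d)
{
    assert(d != 0);
    uint32_t q = n / d;        /* truncated quotient */
    uint32_t r = n % d;        /* remainder, 0 <= r < d */
    uint32_t gap = d - r;      /* compare r with d - r so 2*r never overflows */

    /* Round up when past the halfway point, or exactly halfway with q odd. */
    if (r > gap || (r == gap && (q & 1u)))
        q++;
    return q;
}
```

With this rule 5/2 stays at 2 while 7/2 becomes 4, so exact halves do not always push the result upward.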
Ah, yes. I keep remembering and forgetting about this. Do you think it would be good to modify the MULDIV64() to operate differently? I remember someone pointing out that the error could be quite large in some cases.
Yes, the cost for the basic (+ divisor>>1) was minimal ... but it doesn't combat the nature of integers losing resolution at an alarming rate. I was surprised how much of an improvement fixed-point can be, even with multiplication. Ada made that clear not long ago.
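For reference, the basic (+ divisor>>1) adjustment can be modelled on a host like this. It is only a sketch: `muldiv_trunc` and `muldiv_round` are hypothetical names, not the Spin2 MULDIV64() implementation, and the quotient is assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating (a * b) / d with a 64-bit intermediate product. */
uint32_t muldiv_trunc(uint32_t a, uint32_t b, uint32_t d)
{
    assert(d != 0);
    return (uint32_t)(((uint64_t)a * b) / d);
}

/* Same, but biased by half the divisor so the quotient rounds to nearest.
   The sum cannot overflow 64 bits: (2^32-1)^2 + (2^31-1) < 2^64. */
uint32_t muldiv_round(uint32_t a, uint32_t b, uint32_t d)
{
    assert(d != 0);
    uint64_t p = (uint64_t)a * b + (d >> 1);
    return (uint32_t)(p / d);
}
```

The bias costs one add, but it only halves the worst-case error of a single operation; it does nothing about resolution lost across a chain of integer operations.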
So that's a feature request, btw, for the Cordic if the Prop2 ever goes for a re-spin: providing a new instruction, GETQZ say, that grabs QY[15:0] and QX[31:16] for doing faster 16.16 fixed point.
Byte -$80 to $FF ?
Word -$8000 to $FFFF?
how?
confused
Mike
It checks to make sure values will fit into bytes/words.
It accommodates byte/word signed values that can be later sign-extended back to 32 bits.
Also, it allows unsigned values up to the byte/word size.
I will probably change it in the next release to work by placing the word FIT after BYTE/LONG.
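The ranges asked about above (-$80 to $FF for a byte, -$8000 to $FFFF for a word) are the union of the signed and unsigned ranges for each size. A small C model of that check, with hypothetical function names and no claim to match the compiler's internals:

```c
#include <stdbool.h>
#include <stdint.h>

/* A value fits if it is a valid signed OR a valid unsigned byte/word. */
bool fits_byte(int32_t v) { return v >= -0x80 && v <= 0xFF; }
bool fits_word(int32_t v) { return v >= -0x8000 && v <= 0xFFFF; }

/* Recover a signed value from a stored byte by sign-extending to 32 bits. */
int32_t sign_extend_byte(uint8_t b)
{
    return (int32_t)(b ^ 0x80) - 0x80;
}
```

So -1 is stored as $FF and can later be sign-extended back to -1, while 255 is also stored as $FF and read back unsigned. The stored bits are the same; only the later interpretation differs.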
Hmm, not too sure about performing unsigned bounding. Unsigned is naturally circular. That's how carry/borrow functions. Masking suits it better than bounding.
Chip,
Scratch the above feature request. I've just realised that GETQX and GETQY are not bound to the usual hub-op granularity timing. It isn't clear in the spreadsheet that they aren't affected by hub-slot alignment.
For 16.16 fixed-point, this is pretty quick in the scheme of things:
And even 24.8 fixed-point is just as quick:
And 8.24 is one more instruction:
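As a rough host-side model of which product bits each of these formats keeps (not the PASM2 listings themselves, and saying nothing about instruction counts), assuming QX holds the low 32 bits and QY the high 32 bits of an unsigned 32x32 product. `fx_mul` is a hypothetical name:

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned fixed-point multiply with frac_bits fractional bits (1..31). */
uint32_t fx_mul(uint32_t a, uint32_t b, unsigned frac_bits)
{
    assert(frac_bits >= 1 && frac_bits <= 31);
    uint64_t p = (uint64_t)a * b;
    uint32_t qx = (uint32_t)p;            /* low half of the product  */
    uint32_t qy = (uint32_t)(p >> 32);    /* high half of the product */

    /* Keep product bits [frac_bits+31 : frac_bits]. For 16.16 this is
       {QY[15:0], QX[31:16]}. */
    return (qy << (32 - frac_bits)) | (qx >> frac_bits);
}
```

Calling it with frac_bits of 16, 8 and 24 gives the 16.16, 24.8 and 8.24 cases respectively. Bits above the kept window are silently dropped, so overflow is the caller's problem.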
Think of that in a loop though. Remember, it's not single multiply ops, it's vectors of sums of products. So add an ALTR + ADD and whoops it no longer fits into an 8-cycle loop and you have to waste half the cordic throughput.
Isn't there more to it though? It's never going to fit in an eight clock window.
Yes, with current P2 you also need to work around QMUL being unsigned-only, which is a much worse issue than extracting the fixed point result. Other than that, no.
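One common way around an unsigned-only multiplier is to multiply magnitudes and restore the sign afterwards. The sketch below models that for 16.16 in C. It is an assumption about how one might do it, not the routine used in this thread, and `fx16_mul_signed` is a hypothetical name:

```c
#include <stdint.h>

/* Signed 16.16 multiply built on an unsigned 32x32 -> 64 multiply. */
int32_t fx16_mul_signed(int32_t a, int32_t b)
{
    /* Magnitudes, computed in unsigned arithmetic so INT32_MIN is safe. */
    uint32_t ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;

    uint64_t p = (uint64_t)ua * ub;       /* unsigned product           */
    uint32_t r = (uint32_t)(p >> 16);     /* keep product bits [47:16]  */

    if ((a < 0) != (b < 0))
        r = 0u - r;                       /* negate when signs differ   */

    /* Assumes the usual two's complement conversion to int32_t. */
    return (int32_t)r;
}
```

Because the shift is applied to the magnitude, this truncates toward zero rather than toward minus infinity. The extra sign handling on both sides of the multiply is the overhead that makes the unsigned-only limitation costly inside a tight loop.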
If you want to pipeline these operations, then just store the GETQX and GETQY values and process them later.
I imagine something like this. The second loop needs 15 clock cycles, first loop padded to suit the second:
Chip,
For inline Pasm code within a Spin method, is it guaranteed to be a fresh copy of the routine in cogRAM for each time the assembled code is called? I ask because I'm considering doing self-modify. And it would be shorter if the cogRAM copy gets reset for each call.
A search for "inline" in Spin2_interpreter.spin2 will provide the answer.
I should ask Eric too. If he's not happy to support it then I better not. I've kind of already flagged even the self-modifying option as too much overhead with minimal speed up. The results are good as is.
Yes, the inline PASM code gets reloaded into cog RAM each time it is executed.
Thanks Chip.
but what about interrupts, you had some sample staying resident?
That's not inline code then. It's loaded differently. Flexspin doesn't support it so I'm vague on how. It'll be a DAT section using regload() or something.
The inline PASM code loads wherever the ORG says to. If you ORG $080, the code will be loaded into register $080, upwards. If your next ORG is $000 (default), your code loads into register $000, upwards. It doesn't affect the old code at $080 unless it overlaps it.
Huh, never expected a value after ORG to have an effect for inlined. Obviously important for regload()ed sections though.
Some improvement requests for the Propeller Tool:
That annoys me too. I think this may happen only on certain systems or OS versions.
I've seen that also. What OS are you experiencing this? I saw it on Win 7, but not Win 10 when I tested. The solution (I've already found) may slow the display slightly, so I was leery of implementing it.
Noted. In the meantime, you can use the DEBUG log feature, but should be careful about limiting the log size and it is certainly not as convenient in a situation with an impromptu need for such text.
Yes, I also use Win7
I think you've already mentioned that but, unfortunately, I haven't managed to find out how this works. How is the log feature activated?
In your code, just define DEBUG_LOG_SIZE as a value > 0... set it to the maximum number of bytes you'd like to limit the log file to.
When using Propeller Tool, the log file will be stored as the "DEBUG.log" file in your My Documents > Propeller Tool folder.
Yes, the DEBUG window is just a visual thing designed to run as fast as possible. It simply scrolls the bitmap each time a new line is printed. It is up to Windows to get around to repainting the window. This is much faster than redrawing all the text, and it enables things to go really fast, to not incur much of a DEBUG cache delay.
Like Jeff said, you can log the first N bytes of DEBUG data by setting that DEBUG_LOG_SIZE to a size limit, so that the DEBUG data goes into a DEBUG.log file.