For cog exec code initially residing in hub RAM, all that matters is that those instructions share a common hub byte offset, so that when they are loaded into the cog RAM, everything is long-aligned. It also does not matter for hub-exec purposes whether instructions fall on absolute long boundaries, or not. It is true that it takes one more clock to begin a hub exec instruction stream if it is not absolutely long-aligned. However, this can be avoided by long-aligning your hub exec code in the assembler. This is a small price to pay for allowing data structures of mixed word lengths in hub memory. There is no reason that I see to enforce long-alignment rules in hub memory. All that would do is introduce unnecessary strictures.
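For a concrete sense of what that "small price" is, here is a throwaway C sketch (my illustration only, not Parallax tool code) of the alignment arithmetic: padding hub exec code up to the next long boundary costs at most three bytes.

#include <stdint.h>
#include <stdio.h>

/* How many padding bytes an assembler would need to emit to long-align code
   starting at a given hub byte address (hypothetical helper, for illustration). */
static uint32_t pad_to_long(uint32_t hub_addr)
{
    return (4 - (hub_addr & 3)) & 3;
}

int main(void)
{
    for (uint32_t a = 0x1000; a <= 0x1004; a++)
        printf("hub %05X needs %u pad byte(s)\n", a, pad_to_long(a));
    return 0;
}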
Given there is a speed penalty for not being aligned, will the Assembler have an automatic/default method whereby code is long aligned? Seems this is not the sort of thing you would want a novice to trigger accidentally - a sudden speed change in code they did not intentionally modify.
Ease up a little there JMG. The assembler isn't written yet ... It'll just be an extra directive or similar. It's something that can be dealt with at any time.
Thanks for clearing up the hub instruction boundary issue.
I will definitely be keeping all of my code on long boundaries in hub. IMHO that gives consistency and I would not want it any other way.
Executing hubexec from non long boundaries just seems plain silly IMHO.
However, if others have use for it otherwise then fine.
It is true that it takes one more clock to begin a hub exec instruction stream if it is not absolutely long-aligned.
Given there is a speed penalty for not being aligned, will the Assembler have an automatic/default method whereby code is long aligned? Seems this is not the sort of thing you would want a novice to trigger accidentally - a sudden speed change in code they did not intentionally modify.
One clock at the start of the COG load, then the egg-beater streaming architecture just delivers full speed.
At least that's how I understand it ... ;-)
Sorry to enter this discussion in the middle and without having read the history, but I don't see any real advantage to being able to execute instructions at non-aligned addresses, especially if it costs an extra cycle. Nor do I see any real advantage to having rdlong or rdword be able to fetch from non-aligned addresses. Is this something that falls out of the current implementation and we get for free? If there is any noticeable cost to this, it can probably be removed.
One thing that's been bugging me about the non-maskable super interrupt for debugging is that it can't step through REP sections. There is too much context to save and restore within REPs to make it practical. It could be done, but it would cost a bunch of extra hardware.
I had an idea, though, and it tests out just fine: Have the debug interrupt routine look at the instruction at the LINK return address and if it's a REP instruction, just synthesize it in the single-stepper! This is pretty simple and I see that I am able to sense REP's, no problem. The debug interrupt will let you step through all INT0/INT1/INT2 code, regardless of whether interrupts are being stalled or allowed via STALLI/ALLOWI.
In a similar fashion to how REPs can be synthesized during single-stepping, so can ALTDS/AUGS/AUGD. This means the single-stepper can go anywhere!
To make sure this feature isn't routinely abused in people's applications, it will be made good for debugging only, and will not serve any other purposes well. This way, it should always be able to be employed without disrupting people's existing code.
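To sketch what that check amounts to (this is only an illustrative C model with made-up names and a made-up opcode mask, not actual P2 debug code), the stepper just peeks at the return address and branches on whether a REP is sitting there:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the debug-ISR check described above. read_cog_long(),
   is_rep() and hw_single_step() are placeholder names; the opcode mask below
   is NOT the real P2 REP encoding. */
static uint32_t cog_ram[512];                 /* stand-in for cog register RAM */

static uint32_t read_cog_long(uint32_t addr) { return cog_ram[addr & 0x1FF]; }
static void hw_single_step(uint32_t addr)    { printf("step one instruction at %03X\n", addr); }
static int  is_rep(uint32_t inst)            { return (inst & 0xFF000000u) == 0xCD000000u; }

static void debug_step(uint32_t return_addr)
{
    uint32_t inst = read_cog_long(return_addr);
    if (is_rep(inst)) {
        /* Synthesize the REP in the stepper: run the block body under debugger
           control for the requested count instead of stepping the hardware REP. */
        printf("REP at %03X: emulating the repeat block in software\n", return_addr);
    } else {
        hw_single_step(return_addr);
    }
}

int main(void)
{
    cog_ram[0x100] = 0xCD000005;   /* pretend a REP sits at the return address */
    debug_step(0x100);
    debug_step(0x101);
    return 0;
}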
Sorry to enter this discussion in the middle and without having read the history, but I don't see any real advantage to being able to execute instructions at non-aligned addresses, especially if it costs an extra cycle. Nor do I see any real advantage to having rdlong or rdword be able to fetch from non-aligned addresses. Is this something that falls out of the current implementation and we get for free? If there is any noticeable cost to this, it can probably be removed.
If I emulate any old CPU, that gives me a big advantage.
The same goes if I have big tables with mixed word and byte data, and so on.
I said "any real advantage". I'm not sure emulating old CPUs is a big target market for Propeller chips although I know we hobbyists like to do it sometimes. Anyway, you're probably right that rdlong and rdword are more useful than the ability to execute instructions from non-aligned addresses. My point was that they can be left out if there is a significant cost to having them.
I had an idea, though, and it tests out just fine: Have the debug interrupt routine look at the instruction at the LINK return address and if it's a REP instruction, just synthesize it in the single-stepper! This is pretty simple and I see that I am able to sense REP's, no problem. The debug interrupt will let you step through all INT0/INT1/INT2 code, regardless of whether interrupts are being stalled or allowed via STALLI/ALLOWI.
In a similar fashion to how REPs can be synthesized during single-stepping, so can ALTDS/AUGS/AUGD. This means the single-stepper can go anywhere!
Sounds good (I thought you had meant exactly that in an earlier comment), but it's nice to read it has been tested with REP and the double opcodes.
This can be a SW-optional feature - in many designs the coarser-grained step is OK.
Most debuggers have Step-into and Step-over (some also have Step-ret and Step-loop), so Step-into can drill into a REP and Step-over need not bother with the extra work.
Sorry to enter this discussion in the middle and without having read the history, but I don't see any real advantage to being able to execute instructions at non-aligned addresses, especially if it costs an extra cycle.
That is why an assembler directive is there, to avoid that extra cycle. I think the HW does the fetch 'for free', because of the DATA handling mentioned below (ignoring the time hit for now).
Nor do I see any real advantage to having rdlong or rdword be able to fetch from non-aligned addresses.
Really? Being able to R/W DATA as bytes, words, and longs is built into the opcodes, and that means being able to read any record in memory. Users do not want to have to repack records just because some fields are not naturally aligned, and I think making some data I/O opcodes byte-granular and others not is going to take more hardware.
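As a concrete (if trivial) illustration of that record-access point, here is plain C that pulls word and long fields out of a packed, unaligned byte stream without any repacking - essentially what byte-granular RDWORD/RDLONG would give you directly on hub data (the record layout here is made up):

#include <stdint.h>
#include <stdio.h>

/* Assemble little-endian word/long fields from arbitrary byte offsets,
   the way an unaligned RDWORD/RDLONG would read them out of hub RAM. */
static uint16_t get_u16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_u32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

int main(void)
{
    /* A packed record: one type byte, then a 16-bit id, then a 32-bit value. */
    uint8_t stream[] = { 0x07, 0x34, 0x12, 0x78, 0x56, 0x34, 0x12 };
    printf("type=%u id=%04X value=%08X\n",
           stream[0], get_u16(stream + 1), get_u32(stream + 3));
    return 0;
}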
Yeah, I was probably wrong about the data operations. I admitted that in a later message. Even if you don't care about CPU emulation, being able to access unaligned words and longs makes it much easier to parse byte streams with unaligned fields. Now all we need are some endian swapping instructions. :-)
How about instructions for:
htonl, htons, ntohl, ntohs
We do have an endian-swapping instruction: MOVBYTS D,S/#
S is treated as four 2-bit fields which select one of four D bytes into their S-respective positions. For example:
MOVBYTS D,#%00_01_10_11 - endian-swap bytes in D
MOVBYTS D,#%01_00_11_10 - endian-swap words in D
MOVBYTS D,#%00_00_00_00 - copy lower D byte to all bytes in D
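If it helps to see those selector fields spelled out, here is a small C model of the byte-select behavior described above (my own illustration, not Parallax code); the three calls mirror the examples:

#include <stdint.h>
#include <stdio.h>

/* Model of the MOVBYTS selection rule as described above: each 2-bit field of
   S picks which byte of D ends up in that byte position of the result. */
static uint32_t movbyts(uint32_t d, uint32_t s)
{
    uint32_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t sel  = (s >> (2 * i)) & 3;        /* selector for result byte i */
        uint32_t byte = (d >> (8 * sel)) & 0xFF;   /* chosen source byte from D  */
        result |= byte << (8 * i);
    }
    return result;
}

int main(void)
{
    uint32_t d = 0x11223344;
    printf("%08X\n", movbyts(d, 0x1B));  /* %00_01_10_11 -> 44332211, endian-swap the long   */
    printf("%08X\n", movbyts(d, 0x4E));  /* %01_00_11_10 -> 33441122, swap the 16-bit words  */
    printf("%08X\n", movbyts(d, 0x00));  /* %00_00_00_00 -> 44444444, replicate the low byte */
    return 0;
}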
I have to say all these Interrupts are nice, BUT, time is dragging on! Now a few weeks have been devoted to this and still further parts are being added.
When some of my projects drag on my wife always reminds me that they don't have to be perfect but they do have to 'be'.
I suppose if I had an FPGA image out, there would be less concern about things like this.
I almost finished the debug super interrupt today, but got interrupted by a funeral trip we've been planning. Now, I'm working on the laptop with Quartus while my wife drives and kids cycle through their singing, shrieking, and quarreling behind us. I'm really frustrated that I don't have my FPGA board and scope because I left things in a broken state at the last minute.
Hey Chip, what you're proposing will be really neat, and worth the effort. Have a safe trip.
Out of interest, what's your compile time like?
I'm still compiling for the Cyclone IV on the DE2-115 board. For one cog + hub, the compile time is about 5 minutes.
I was able to find out what my problem was by looking at the cog Verilog in Quartus. I had to add a bit to the link instruction generator, but I forgot to take another bit away, so the link instruction turned out to be something else that wasn't having any branch effect. I am relieved I found that! I hate the feeling of things being broken.
Chip, have a safe trip. Hope the kids have plenty to keep them occupied - we've done lots of long trips when our kids were just kids. Now they all have their own kids.
True there would be less pressure if we had an fpga image to test. But there is still pressure as interrupts weren't even on the plan a month ago. And it seems never ending requests continue. It would be easier if we were off testing and the smart pins were done. While I do like the interrupt scheme that has evolved, I am still concerned at the time it has taken.
While travelling, I decided to target the Prop2 Quartus compilation to the newer Cyclone V-A9 device which we are using on the Prop-123 board.
The design was taking forever to compile, and once the fitter started running, it reported 25k registers! It turned out that 16k of them were in the dual-port cog RAM, which didn't make any sense at first. Turns out it was NOT inferring a dual-port RAM from the Verilog code, but was building a huge array of flops!
I searched the web and found an Altera example of inferable dual-port RAM in Verilog. There was some rather subtle difference in how it worked from mine.
Here is what I was doing at first, and then what I had to change it to, in order to make it infer as a dual-port RAM...
This makes a giant array of flops:
always @(posedge clkx)
begin
  if (wex) ram[ax] <= dx;
  qx <= ram[ax];    // read runs every clock; during a write, qx gets the old data
end

always @(posedge clky)
begin
  if (wey) ram[ay] <= dy;
  qy <= ram[ay];
end
This makes a dual-port RAM:
always @(posedge clkx)
  if (wex)
    begin
      ram[ax] <= dx;
      qx <= dx;        // on a write, the read register is fed the new data
    end
  else
    qx <= ram[ax];     // otherwise, a plain registered read

always @(posedge clky)
  if (wey)
    begin
      ram[ay] <= dy;
      qy <= dy;
    end
  else
    qy <= ram[ay];
Get the difference? It makes perfect sense, but I sure wasn't anticipating it.
Quartus just finished compiling a one-cog Prop2 for Cyclone V-A9 and the Fmax came in at 98MHz, whereas it was hitting 111MHz on the Cyclone IV. This confirms my previous experience that the Cyclone V is slower than the Cyclone IV, though some thought it should have been about 11% faster. The compile time was in-line with what my faster desktop does. I think those horrendous compile times I had initially reported were due to the un-inferable-ness of my dual-port RAMs on the Cyclone V, which had worked on the Cyclone IV.
I'll kick off a whole-chip compile now and report the device utilization and compile-time in the morning.
What does that fix do to A9 compile times and resource usage (and MHz reports)?
Well, the design is not viable, at all, with flops for cog RAM, so there's no comparison. Given time to fully compile, it would be twice as big and half the speed, probably.
P.S. Giving this further thought, it would be more like 5 times as big and 1/5 as fast. Trying to make RAM from flops is horribly inefficient.
I just started the whole-chip compile and with all 16 cogs and the hub there are 49k flops. It finished the synthesis in only 9 minutes on my older laptop. The fitter is going to take some time, now.
This is like watching a good suspense movie. I've got my popcorn at the ready and am sitting near the screen. Discovering that it wasn't inferring the dual-port RAM was like dodging a bullet. Thank goodness for the wherewithal to search the web and find a relevant Altera example (sometimes when one is clever enough to do most things on one's own, one can neglect to take advantage of the resources out there). Sometimes it's good to hit a wall, I guess. Anyway, I'm amazed that work can continue to any extent on the road in a car filled with traveling family. That's going all out!
So, for those of us that aren't familiar with FPGA design, if the synthesis process is the step that takes the Verilog code from written form to gates (logical connections), is the "fitter" process the part that lays out all the circuitry? If so, I can imagine that arranging things in a somewhat optimized pattern could be time consuming for the computer (not to mention the designer waiting on it). And if I recall correctly, slightly different but presumably functionally-equivalent results can happen from one or both of those processes. That sounds a little bit scary to the uninitiated like myself, but I guess one learns to trust the tools, at least for the most part.
If I look at it from the hardware side, byte boundaries are the only way to go.
If I look at it from the programming side, all necessary LONG, WORD, and BYTE aligning comes down to simple assembler DIRECTIVES.
That's right. There are ~5- to ~20 clock delays for reloading the instruction FIFO after branches. Once loaded, contiguous execution goes full speed.
And there is time for PNut to see the right operators.
There's not going to be a Chipmas this year
Those look different to me: your original code captures the old RAM value into qx during a write, while the new code captures the newly written value.