The 6502 implemented LE instead of the BE used by its predecessor, the 6800. The main reason was that it saves cycles in operations that require adding 16-bit numbers (including address indexing) on an 8-bit bus. With LE you can fetch the least significant byte first and start working with it immediately while the next byte is being fetched.
The 16-bit (over 8-bit bus) addressing example still requires both bytes before the address can be used. I suspect a bigger adder circuit would resolve whatever the concern is here. Could do with some more detail.
A bigger adder doesn't help if you can only fetch 8 bits at a time. If you look at the addition example, you add two 16-bit values (assume one of them is already in a 16-bit register) by first adding the least significant bytes, then the most significant bytes plus any carry from the previous operation. The point with LE is that the 6502 (and the 8088) could do that first addition in parallel with fetching the most significant byte. The 6800 would fetch the most significant byte, then the least significant byte, *then* start the addition the normal way (add the least significant bytes, then the most significant bytes plus carry).
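To make the byte-by-byte sequence concrete, here is a minimal C sketch of that addition (the function name and the split-byte arguments are just for illustration; a real 8-bit CPU does this in its ALU, not in C):

#include <stdint.h>

/* add two 16-bit values one byte at a time, the way an 8-bit CPU must:
   least significant bytes first, then most significant bytes plus carry */
uint16_t add16_bytewise(uint8_t a_lo, uint8_t a_hi,
                        uint8_t b_lo, uint8_t b_hi)
{
    uint16_t lo    = (uint16_t)a_lo + b_lo;          /* first add: LSBs         */
    uint8_t  carry = (uint8_t)(lo >> 8);             /* carry out of the LSBs   */
    uint8_t  hi    = (uint8_t)(a_hi + b_hi + carry); /* second add: MSBs + carry */
    return (uint16_t)(hi << 8) | (uint8_t)lo;
}

With a little-endian operand, a_lo/b_lo are the bytes that arrive from memory first, so the first add can overlap the fetch of the high bytes; with a big-endian operand both fetches have to complete before the first add can even start.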
All human-readable languages, and even all computing formats, write/display numbers in BE order.
I take it you prefer to work with BE... and so do I. I just wish the USA would do the same with their date format, instead of the mixed endianness in use now.
LE has no real technical advantages over BE.
Unfortunately this is not always the case, at least for architectures where the number of bytes that can be fetched per cycle is less than the word sizes used. In the old days of the 6502 and the Intel and Zilog 8-bitters the advantage to LE, from a technical point of view, was not insignificant.
Edit: I had a look and it's mentioned on Wikipedia: https://en.wikipedia.org/wiki/Endianness#Calculation_order
although they don't get into the specific optimization the 6502 used, its ability to do the first addition in parallel with fetching the next byte.
Elsewhere I saw it stated like this for the indexing case:
The 6502 can fetch the next byte of a complex opcode while it is currently executing. Take for example an indexed load, something like LDA $1000,x. Instruction and parameter take three bytes. This is a two-step instruction: add the index to the address, then load. The 6502 interleaves loading the second byte of the address with the indexing addition, which speeds up the load by one cycle. This is possible because the byte order is little-endian.
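A rough C model of that interleaving (purely illustrative, not how the silicon is wired): only the low address byte and the index register are needed for the first step, so it can overlap the fetch of the high byte, whose only remaining job is to absorb the carry.

#include <stdint.h>

/* step 1: can run while the high address byte is still being fetched */
typedef struct { uint8_t lo; uint8_t carry; } partial_sum;

partial_sum index_low(uint8_t base_lo, uint8_t x)
{
    uint16_t s = (uint16_t)base_lo + x;
    partial_sum p = { (uint8_t)s, (uint8_t)(s >> 8) };
    return p;
}

/* step 2: once the high byte has arrived, fold in the carry */
uint16_t index_high(partial_sum p, uint8_t base_hi)
{
    return (uint16_t)(((base_hi + p.carry) << 8) | p.lo);
}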
A bigger (16b+16b) adder would just do a single addition once both bytes are loaded. Obviously this takes more circuitry ... but also executes faster.
If you want to start picking apart old tech like that, then back then there wasn't anything like burst transactions either; DRAM random access was all equal down to the byte... so fetching the second byte first is just as quick.
Obviously we're talking historical reasons for LE here. If you can get to all the bytes simultaneously, which is generally the case these days, LE gives no performance enhancement over BE (there's only the dubious, but not completely bogus 'cast' argument left). But Intel can't change their processors, they've based everything on being backwards compatible all the way back - even to the 8080, with the help of an assembly translator. So they'll stay LE. ARM and MIPS can be configured either way. SPARC is BE due to Sun's previous processor, the MC68k. And so on and so forth.
For new processors without baggage, and if all bytes can be fetched in the same cycle, there's not much technical reason to choose LE. But by then you may have programmers who have read hex in LE for so long that they're comfortable with it. Or they want to exchange binary data with an Intel CPU (never mind padding and alignment issues).
But yes I prefer BE if I can choose.
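Since the 'cast' argument came up above, it's easy to demonstrate in C (the output depends on the host's byte order, and memcpy stands in for an actual pointer cast to keep it well-defined):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t wide = 0x01020304;
    uint16_t narrow;
    memcpy(&narrow, &wide, sizeof narrow);   /* the first two bytes in memory   */
    /* little-endian host prints 0304, big-endian host prints 0102 */
    printf("%04x\n", (unsigned)narrow);
    return 0;
}

On LE the low-order part of a wider integer sits at the same address as the whole thing, so reading through a narrower type gives the truncated value for free; on BE it gives the high-order part instead.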
It may come as a surprise to you but the Propeller is both BE and LE!
When you define a long in a DAT section the bytes sit in memory one way around; when you use a literal constant in your Spin code the bytes sit in memory the other way around.
Normally we don't get to see this in any Spin/PASM programming we do as we don't look at those literal constants sitting in the Spin byte codes. But you can see it if you look at the listings produced by BST.
Object DAT Blocks
|===========================================================================|
0018(0000) 04 03 02 01 | aLong long $01020304
|===========================================================================|
|===========================================================================|
Spin Block start with 0 Parameters and 0 Extra Stack Longs. Method 1
PUB start
Local Parameter DBASE:0000 - Result
|===========================================================================|
6 x := $01020304
Addr : 001C: 3B 01 02 03 04 : Constant 4 Bytes - 01 02 03 04 - $01020304 16909060
Addr : 0021: 41 : Variable Operation Global Offset - 0 Write
Addr : 0022: 32 : Return
Did I say "byte" codes? I suspect this is done for the same reason as described above for the 6502, it's quicker and easier for Spin's stack based byte code interpreter to work with numbers if they are the "right" way around.
... But Intel can't change their processors, they've based everything on being backwards compatible all the way back ...
Hmm, we probably shouldn't be talking about the PC, but since you're insisting... The PC industry, including Intel, has a long history of rearranging things. ISA got ditched for PCI, for example. 32-bit mode was a pretty big change that took a long time to drag everyone along... 64-bit mode is an even bigger shift in the programming model, but that was hardly even a bump in the road. The reason? There is a large amount of user software insulation these days. You could say general computing demands it. The Mac switched from BE to LE at the drop of a hat and still supported legacy programs. Intel probably added new instructions for that deal.
Intel doesn't see any reason to switch is all. Who still debugs with simple debuggers on Windozes any longer?
Could we move this BE/LE discussion off to its own thread instead of having it buried in the middle of Chip's thread for releasing and discussing the P2 FPGA image releases?
I, the master of irrelevancy, don't see how this is at all related to the P2 or the FPGA unless we are going to flip-flop endianness at this point in the P2 design.
I just posted an updated link to the latest file at the top of this thread.
Cool! Now I have 11 LEDs flashing on my DE2-115 board. Need to get gas working with the new instruction set. Has the instructions.txt file been updated to match the new image?
I also got it working...
The phasing of the LEDs is kinda interesting...
I suppose LED10 should blink almost twice as fast as LED0. I guess it looks right.
Here's some oscilloscope screenshots of the cog_1k_program.spin output with latest release DE2-115.
Zooming in on the burst...
Yellow is P4 and Green is P0. (I've got bandwidth limit on so it looks smoother).
Huh. I think higher-numbered cogs should be blinking slower than lower-numbered cogs, as the delay is ((cogid+16)<<18). Try taking it to a single cog (comment out all of the coginits) and see if LED 0 or LED 10 lights up.
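Plugging numbers into that formula (a quick C check, nothing P2-specific about it):

#include <stdio.h>

int main(void)
{
    /* delay = (cogid + 16) << 18 clocks, per the code below */
    for (int cogid = 0; cogid <= 10; cogid++)
        printf("cog %2d: %ld clocks\n", cogid, ((long)cogid + 16) << 18);
    return 0;
}

Cog 0 waits 4,194,304 clocks per toggle and cog 10 waits 6,815,744, so LED10 should blink roughly 1.6x slower than LED0, not faster.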
Here's the all_cogs_blink demo (for 2015-09-29 image), rewritten to use REP instead of JMP. This uses the new @ syntax and #0 to indicate infinite repeats. Because REP does not work in hub exec mode (bummer), this requires the code to be copied to cog memory before executing. Note also that each cog now COGINIT's the next cog. This was just to shorten the code I was looking at.
where exactly does the started COG start another one?
dat
        orgh    1
' each cog:
' 1. Copies code to cog memory and jumps to cog exec mode
' 2. launches the next cog
' 3. then blinks for eternity
' any cogs missing from the FPGA won't blink
init    coginit #16,#init               'start the next available cog, also at init
        mov     ptrb, #cog_code         'hub address of the code to copy
        setq    #(x-cog_entry)>>2       'set block length for the next rdlong
        rdlong  cog_entry, ptrb         'block-copy the code from hub into cog RAM
        jmp     #cog_entry              'carry on in cog exec mode

cog_code
        org     8 << 2
cog_entry
blink   rep     @:end,#0                'repeat the block below forever
        cogid   x                       'which cog am I?
        setb    dirb,x                  'make that pin an output
        notb    outb,x                  'flip its output state
        add     x,#16                   'add to my id
        shl     x,#18                   'shift it up to make it big
        waitx   x                       'wait that many clocks
:end
x       res     1
where exactly does the started COG start another one?
The very first line. You will notice that coginit starts the next available cog (#16) at instruction "#init". It then goes on to set up the cog to run in cog exec mode. In the meantime, the next cog is doing exactly the same thing (starting the next cog, setting up, etc...).
Yes. Now I see it. I was looking at the code of the COG, thinking of how you start one COG out of PASM.
Your comments are misleading.
' 1. Copies code to cog memory and jumps to cog exec mode
' 2. launches the next cog
' 3. then blinks for eternity
should be
' 1. launches the next cog
' 2. Copies code to cog memory and jumps to cog exec mode
' 3. then blinks for eternity
Intriguing. Since I don't have an FPGA I haven't tried to read any of the sources previously. Mike, you say launch then copy... this means that the SETQ + RDLONG pair somehow does a block copy, correct? Is that how a burst copy is done? What defines the length?
New image and Pnut working nicely!
Edit: Tested on Prop123-A7 and DE2-115 Ok
Did I say "byte" codes? I suspect this is done for the same reason as described above for the 6502, it's quicker and easier for Spin's stack based byte code interpreter to work with numbers if they are the "right" way around.
Hmm, we could probably not be talking about the PC but since you're insisting ... The PC industry, including Intel, have a long history of rearranging things. ISA got ditched for PCI for example. 32bit mode was a pretty big change that took a long time to drag everyone along ... 64bit mode is an even bigger shift in the programming model but that was hardly even a bump in the road, the reason? ... there is a large amount of user software insulation these days. You could say general computing demands it. The Mac switched from BE to LE on the drop of a hat and still supported legacy programs. Intel probably added new instructions for that deal.
Intel doesn't see any reason to switch is all. Who still debugs with simple debuggers on Windozes any longer?
I, the master of irrelevancy, don't see how this is at all related to the P2 or the FPGA unless we are going to flip-flop endianess at this point in the P2 design.
Thank you!!
I'll try it on my DE2 when I get home from work (shhh, don't tell anybody, there be P2s at work!!!)
Wanted to show photo of 11 LEDs flashing...
Edit: I posted without reading the whole thread.
This is great news!
I believe it's an index for one of the 4 registers: PTRA, PTRB, ADRA, ADRB.
That's correct.
That's for the constant-address version. There's also CALLD D,S which gives 512 possibilities for D.
I need to make the assembler handle the D,S/@ version such that it allows nearby @ addresses with all 512 D's.
Thanks!
Mike