Mangling expectations to free up hobby time. Love it. Hey, doing a little of that is good for the soul, and happy, healthy souls just work better and more consistently - except for the odd late-night, gotta-finish-it time.
I found the whole single instruction thing fascinating. It is taking a lot to not go down that road.
Yes, Heater zeroed in on what I thought was interesting as did jmg.
Phil has it right, of course. Having waitcnt would improve a lot of cases, particularly on multi core devices.
There is a discussion on Y Combinator citing how ugly and wasteful x86 is, and how big code was for RISC. Those guys would likely be surprised at Heater benchmarking RISC-V and how x86 dropped well into the sweet spot on code size.
I'm eager to learn more about how RISC-V maps into superscalar spaces. Does the instruction set actually matter in terms of peak speed and overall efficiency?
I think it might; or, if it does not, maybe having a smaller, more regular problem set may make incremental gains happen more consistently and/or at a lower cost compared to Intel.
On that Y Combinator thread, the multi-core difficulty came up, as did lamenting over the fact that 4 GHz seems a hard limit on single-thread, sequential execution.
BTW, I just researched this to get the very fastest single-thread-performance CPU I could. It clocks reliably at 4.3 GHz, and is a dual-physical-core CPU with hyperthreading, presenting 4 cores to the OS.
More cores limit peak single-thread, sequential CPU performance, and often limit it by a lot.
My interest is mechanical CAD and CAM, both of which work on large data sets, easily in the GB range, and which need to be resolved sequentially, due to the nature of solid models and the geometry kernels and dependencies.
When I get the toy desktop, I'll cite the CPU.
Lots of people buy expensive i7 chips, or Xeons with big caches and high clocks, only to see tepid real gains - unless they run a lot of stuff hard and concurrently, and/or their task is one where parallel processing is understood well enough for common applications to take advantage of it.
For a whole lot of stuff, neither of those things are true.
So, this idea of multi-core is a necessary path, due to hard limits on sequential execution being in place for over 10 years now. I've been watching for that 5 GHz CPU and we just never seem to get it, except for the odd overclock here and there.
(And I am gonna attempt that for this use case, which just needs all the sequential execution it can get.)
The Propeller idea of making multi-core, concurrent multiprocessing easy still makes a whole lot of sense. Nothing touches PASM for this.
Sure, at a larger scale, we have good tools and the assembly doesn't matter so much, but at the embedded scale, having something be both easy and not high level, in the sense the author meant here, is still very attractive, particularly if no OS is needed to make it happen.
I know, kind of a ramble, but that was the intent of this topic anyway.
Rambling appreciated. I have to mull it over a bit.
In one RISC-V presentation the case is made that the instruction set does not matter: both RISC and CISC can have similar code sizes, similar performance, and so on.
Their point is that compilers, operating systems, algorithm selection, quality of programmer, process technology, hardware implementation architecture, and a bunch of other parameters have so much effect on performance that the details of the actual instruction set are hardly worth worrying about. The whole system matters much more than that particular detail.
RISC-V of course does not help at all with the issues the original blog poster was talking about w.r.t. crypto problems.
Yes. The desire to limit information leakage is a very interesting problem.
I thought about the P2: getting it to skip a branch, or ignore a mode change based on fuse state, is a plausible attack. Very regular and consistent timing, along with a published sequence, might make doing such a thing trivial.
Now I'm tempted to go flogging on a P1 to see what it does with the kinds of irregular inputs, clocks, and such described by the ChipWhisperer...
Years ago, I would get my old Atari machines to do some pretty interesting things doing that.
The what ?
Only there is nothing to crack on the P1. No challenge.
To me the solution to the varying-time problem was always to make it more varying, by a random factor - not to try to change it to always use the same amount of time, which is nearly impossible, not to mention the even more impossible goal of using the exact same amount of current. Instead, use a hardware random source (even pseudo-random should do) to deliberately insert variation into the processing time and current consumption.
-Tor
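As a sketch of what Tor describes, in C for brevity - the rng() source and the wrapper are invented names for illustration, and the delay granularity is an assumption:

    /* Hypothetical sketch: wrap a sensitive operation in a random-length */
    /* busy-wait so its total duration varies from run to run.            */
    extern unsigned rng(void);               /* assumed hardware RNG      */

    void sensitive_op_with_jitter(void (*op)(void))
    {
        volatile unsigned n = rng() & 0xFF;  /* random 0..255 extra spins */
        op();                                /* the real work             */
        while (n--)                          /* deliberate timing jitter  */
            ;
    }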
AFAIK adding a random delay is only a partial solution. It'll reduce the signal-to-noise ratio of the side-channel, but won't eliminate the leaking signal altogether.
Marty
I want to build a one instruction computer.
One opcode, "SUBLEQ": subtract and branch if less than or equal to zero.
Three operands: subtrahend, minuend, and jump target.
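The core of such a machine is tiny. Here's a minimal sketch of the interpreter in C - the word size, memory layout, and halt-on-negative-branch convention are just assumptions, since implementations vary:

    /* One SUBLEQ instruction is the triple (a, b, c):                 */
    /*   mem[b] -= mem[a]; if the result is <= 0, jump to c.           */
    /* Assumed convention: a negative branch target halts the machine. */
    void subleq_run(int *mem, int pc)
    {
        while (pc >= 0) {
            int a = mem[pc], b = mem[pc + 1], c = mem[pc + 2];
            mem[b] -= mem[a];
            pc = (mem[b] <= 0) ? c : pc + 3;
        }
    }

Everything else - I/O, indirection, even a plain ADD - has to be synthesized from that one operation.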
Always interesting to see minimal work, but here the 'one' claim is a large stretch - more an exercise in semantics and self-delusion.
As soon as they say "In the standard implementation, negative numbers are used for Input/Output Ports or to Halt the machine in the case of a branch destination." or "The key difference is that negative numbers now refer to indirect addressing."
Oops, you have used a contained BIT or BITs to change what the opcode does, so you actually have more than one opcode - and the revealing aspect is in the truly massive CYCLES and CODE sizes needed to do anything...
Call and Return

  Case               Instructions Executed   Memory Accesses   Code Size in Words
  Original SUBLEQ    18                      90                55
  Improved SUBLEQ     7                      39                22

Push and Pop

  Case               Instructions Executed   Memory Accesses   Code Size in Words
  Original SUBLEQ    20                      100               60
  Improved SUBLEQ    10                       54               30

Block Memory Move (50 Words Moved)

  Case               Instructions Executed   Memory Accesses   Code Size in Words
  Original SUBLEQ    468                     2340              82
  Improved SUBLEQ    364                     1974              66
That perceived silicon saving in the core shifts to being consumed in the Icc and the CODE memory needed.
I kind of, sort of, agree with you.
On the other hand one could argue that the "one instruction" rule is being adhered to.
If we look at some code like:
SUBLEQ A B C
We see we have the opcode "SUBLEQ", which need never be written because it's always the same, and three operands or parameters. So we end up writing just:
A B C
Well, the "one instruction" rule does not dictate what those operands are or where they come from. SUBLEQ only requires that the operands are some kind of values for which the subtraction operation is valid - integers, floats, whatever - and that the result of such a subtraction can be compared against zero.
They could be literal values, which would not be very useful. They could be addresses of values, which is more interesting. They could be interpreted as some I/O. Or perhaps this indirect addressing idea.
Compare and contrast with other instruction sets that have, say, a MOV instruction and then a pile of literal, immediate, direct, indirect, and indexed addressing modes to get at the operands.
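To see what that costs in practice, here is the conventional way an ADD and a MOV get synthesized from SUBLEQ triples, in the SUBLEQ A B C notation above. Z is an assumed scratch word holding zero, and "next" just names the address of the following triple (no real branch intended):

    SUBLEQ A Z next    ; ADD: Z = 0 - A = -A
    SUBLEQ Z B next    ;      B = B - (-A) = B + A
    SUBLEQ Z Z next    ;      restore Z to zero

    SUBLEQ B B next    ; MOV: B = 0
    SUBLEQ A Z next    ;      Z = -A
    SUBLEQ Z B next    ;      B = A
    SUBLEQ Z Z next    ;      restore Z to zero

Nine words of code and a dozen memory accesses for one ADD is exactly the kind of blow-up in jmg's tables above.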
In the extreme, a minimal machine would have only one-bit-wide variables and a single XOR instruction - or whatever it takes to make a Turing-complete computer. We don't want to go there!
Yes, I'm sure you are right. Making the instruction set ultra-simple leads to bigger code size, more instruction fetching, and slower execution.
In that we see that ideas like SUBLEQ push the old RISC vs CISC debate to one extreme. At the other extreme, we might speculate that the CPU should execute my C++ source code directly. Why not?
Anyway, this is all more of an intellectual puzzle than a suggestion for real hardware solutions.
...and this page from Nature hints that somebody is looking into making a carbon nanotube subneg computer: http://www.nature.com/nature/journal/v501/n7468/fig_tab/nature12502_F1.html Makes sense I guess, as they can concentrate on making the most minimal computing unit, and then suddenly they would actually have a Turing-complete, working nanotube computer.
-Tor
(Surfing the net instead of working on my subleq virtual machine because I can't find a Mickey Mouse-type power cable for my notebook computer)
That is impressive, but even here the 'one opcode' is very quickly morphed - just look at the ASM example in that link:
sub i, imm(-10)
loop:
dec i
bltz i, loop
Of course, I have always liked the idea of assemblers that are smarter than the underlying opcodes (but not by too much - that's what compilers are for).
Plenty of scope for macros and the tedious label handling or, as done above, improved clarity, while still giving the programmer low-level control.
I would imagine that if one had a subleq machine, one would want to get away from actually writing subleq code as soon as possible - say, with heavy use of macros in an assembler, as you showed above.
It's that natural pattern:
If you have to program in hex you soon find yourself writing an assembler.
If you have to program in assembler you soon create a compiler for an HLL like C.
If you have to program in C you soon want a higher level language like JavaScript.
SUBLEQ is a far too complicated language. We need a computer that runs RSSB (Reverse Subtract and Skip if Borrow): http://esolangs.org/wiki/RSSB. It's like SUBLEQ, but it only takes one operand instead of three. I just wrote a VM for it, but haven't tested it yet because I still need to write an assembler for it.
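Going by the esolangs page, the execute step is only a few lines of C. This is a sketch under assumptions: signed words, no special-cased addresses (the real spec maps a few low addresses to the instruction pointer, the accumulator, and I/O), and 'borrow' taken to mean a negative result:

    /* One RSSB step: acc = mem[x] - acc, stored back to both the      */
    /* accumulator and mem[x]; skip the next instruction on a borrow.  */
    void rssb_step(int *mem, int *pc, int *acc)
    {
        int x = mem[(*pc)++];    /* fetch the single operand           */
        int r = mem[x] - *acc;   /* reverse subtract                   */
        mem[x] = r;
        *acc = r;
        if (r < 0)               /* borrow: skip the next instruction  */
            (*pc)++;
    }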
Don't browse http://esolangs.org too much - you'll never have any free time again.
RSSB looks like it might be more efficient than SUBLEQ. Only one operand means each instruction can be just a 32-bit value (let's talk a grown-up 32-bit machine). The hello world in SUBLEQ is 13 instructions if I'm counting properly; hello world in RSSB is 14 instructions - excluding the actual text "Hello world" in each case. BUT the SUBLEQ instructions would be three 32-bit words each instead of just one: 13 x 3 = 39 words against 14. That's roughly a third the memory used, a third the memory accesses, and hence up to three times the speed for RSSB.
Am I analyzing this correctly?
SUBLEQ requires two operands for the subtract, plus a target jump address, for every instruction - even though a lot of the time you don't want to jump anywhere but just proceed to the next instruction.
The subtraction is backwards: accumulator from memory rather than memory from accumulator.
The jump is taken if the specified condition is not met.
I see a Forth engine in RSSB. Except that is of course impossible.
NEVER use the words 'impossible' and 'Forth' in the same sentence on this forum. Someone is bound to take it as a challenge...
After reading the article and the comments in this thread, I still don't see any way in which x86 is a high-level language. I see how x86 is non-deterministic when executing, but nothing about it is high level. Maybe I'm getting stuck on the initial statement, as "x86 is a high-level language" simply doesn't have a reasonable interpretation.
Clearly the guy is not so stupid as to literally mean that x86 assembler is a high-level language like C or Haskell or whatever.
His "high level" comment is to be taken in the context of the problem he is addressing. As I said in post #11 here, you are right: execution-time determinism is what it is all about. Making crypto-secure timing is impossible in a high-level language; it turns out it's also impossible in assembler. Ergo, x86 is equivalent to a high-level language in that respect.
However, whilst we are here:
Think of the AAA instruction in x86. That is many instructions in a RISC architecture but only one in x86. Sounds like a higher level thing to me.
What about the string move, compare and other string operations in x86? Also something you have to write with many instructions for in other machines.
What about the x86 RdRand instruction? A whole pseudo random number generator in one instruction. Sounds like a high level thing to me.
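Or the AES instructions?
It's crazy stuff.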
Here's my RSSB assembler and VM: https://github.com/electrodude/rssb
It comes with a Makefile, but any C99 compiler should work. It doesn't have any dependencies other than the C standard library.
Pass it an RSSB assembly file as its first and only operand. It ships with helloworld.rssb.
I literally just finished it - it probably still has lots of fun bugs. Compile-time expressions can have labels, integer constants, $, add, subtract, negate, multiply, integer divide, and parentheses.
It wasn't a complete waste of time - I can now say I've written an assembler!
EDIT: Unfortunately, this assembler probably won't work on a Prop, although the VM should work fine. It uses malloc'd copies of labels instead of just pointers into the existing string, and the symbol table is very wasteful with memory (an array of 256 pointers to more instances of itself, mostly filled with NULLs).
That's nifty, but I think RSSB imposes the caveat that data-width has to equal opcode-operand width, and that is set by address reach.
Almost any size is possible, with 8b being on the small side; 12b may be more useful, and then 16/20/24/28/32 are all possible with a QuadSPI memory - and the SKIP opcode nature works well with SPI parts.
A question becomes: how much silicon area does the low memory efficiency cost, and could that silicon area have better supported a core with more opcodes?
Focusing on connecting QuadSPI, the 'next step' would be a 16-opcode part, with 4 bits of each read used for the opcode.
Still tiny - would this save 12% of code space on a 32b system, for a net gain?
Maybe something as teensy as a 1 (or 16) opcode part could make sense in a P1V to allow more COGs?
I see ZPU uses 8-bit opcodes, and claims to be one of the smallest cores with a C compiler, but that may be large.
PicoBlaze says 6b opcode, but that does not fit QuadSPI quanta so well.
LatticeMico8 says 18b opcodes, well suited to FPGA memory, but not so well matched to XIP-SPI.
The Zilog Z80 is an 8-bit CPU, but with all the variations and modes of the instructions, you end up with around 750 distinct opcodes.
Which is kind of difficult to fit in 8 bits, so they used 'prefix' bytes for some instructions.
In fact, depending on which instruction you were trying to use, you could end up with a 32-bit opcode.
(Bit operations on memory accessed using the IX or IY register, typically.)
http://wikiti.brandonw.net/index.php?title=Z80_Instruction_Set
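For instance (from memory, so verify the exact bytes against that table): SET 0,(IX+5) assembles to the four bytes DD CB 05 C6 - a DD prefix selecting IX, a CB prefix selecting the bit-operation group, the displacement byte, and only then the byte saying which bit operation to perform.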
Because it has efficient block commands (copy, search, read from/write to I/O), it was also quite popular as a storage controller.
(The first HDD controllers in IBM PC XTs all had a Z80/Z180 chip. The same with the storage controller on the ND-series 32-bit minicomputers.)
ZPU is a stack-based instruction set architecture designed with as few instructions as possible, for use as a processor core in an FPGA. There is a GCC targeted at it, so it's actually useful.
Zog is a ZPU VM for the Propeller I made a while back. That was the first time we could run 32-bit code compiled from C on the Propeller.
ZiCog is the Z80 emulator for the Propeller. It was my first ever Prop program and allowed C code to be compiled on the Prop itself, using BDS C under CP/M.
There is also Pullmoll's excellent Z80 emulator for the Prop qz80.
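I stumbled over the Flat Assembler a while ago: http://flatassembler.net/
Be careful. Might be addictive to you too.
FASM is the most powerful assembler I ever used.
Enjoy!
Mike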
Oh no. Now I'll have to learn x86 assembly... (and then steal all of FASM's features for my propeller compiler)
Actually you can use FASM as your Propeller compiler. I just play around with it, but other people have made macros to create various dialects of assemblers: 6502, Z80, ARM...
Using those macros, the binary output can be defined by data bytes (aka db, dw, dd and such).
Enjoy!
Mike
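To make that concrete, here is a hypothetical sketch in FASM macro syntax - the subleq macro and the labels are invented for illustration, not taken from any of those dialect packages. An 'opcode' becomes nothing more than a macro emitting data words:

    macro subleq a, b, c
    {
        dd a, b, c              ; emit the triple as three 32-bit words
    }

            subleq A, Z, next   ; Z = Z - A, branch to next if result <= 0
    next:
    A       dd 5
    Z       dd 0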