64 bit floats on the P2 ?
Heater.
A while ago Chip expressed interest in getting a tiny Javascript engine running on the P2.
I love this idea because there is an ocean of programmers out there who would never dream of getting into C or assembler or Spin or whatever, but they are comfortable with JS. The Espruino and other tiny JS engines show what can be done with JS on a micro-controller.
There is one fly in the ointment. All numbers in JS are IEEE 754 64 bit floats.
Turns out that even if MCUs like the STM32F4 have floating point units, those tiny JS engines don't use them: 32 bits is not enough. (JS also uses those floats to hold 53-bit integers.)
Now, the P2 has some funky cordic hardware support for floating point. I was just wondering if it might be possible to get 64 bit float support with that cordic hardware. Perhaps with some tweaking?
Aside: Shame the P2 is not a 64 bit machine. Having eight 32 bit cores was pretty radical for an MCU when the P1 came out. Sixteen 64 bit cores would be the radical route today. Plus it would allow for huge COG memory which would make everybody happy.
Ah well...
Comments
BTW don't you mean huge hub memory??? We are constrained by die space - only going to be 512KB of Hub and 16x 4KB Cogs.
64 bit on the desktop has absolutely zippo to do with FPUs! AMD's nice x86_64 implementation does bring a refreshingly fuller general register set that x86_32 had always sorely lacked ... but, other than that, the real impetus was flat memory addressing range beyond 4GB and nothing more. No one wanted to start dealing with segmentation again.
Something that a library might be able to effectively package up.
I'm not going to worry about annoying practical details like transistor count. We can have billions of transistors nowadays. Transistors are almost free.
The limit on COG space is not so much transistors as the 9 bit address fields embedded into every instruction.
64 bit instructions would allow for 25 bit wide address fields. That's enough for 32 million COG registers/instructions!
Of course a COG would not be that big but it would allow for bigger COG space. Bigger native, independent, high speed code.
A lot of that address space could be mapped to HUB RAM of course.
Anyway, that's just me fantasizing. What I was really wondering is how to get some accelerated 64 bit float action from the cordic hardware support for those JS engines.
Or is that me fantasizing as well....
General registers would suddenly take a back seat to the huge address range of the 64 bit instructions. May as well remove CogExec altogether. Reducing CogRAM down to, say, 128 registers. Extending LUTRAM instead ... and build LUTRAM out of MRAM, same as HubRAM. Maybe even ditch HubRAM altogether.
Why is 32b not enough? Any proof?
Sure, that tweaking would be to resolve to the CORDIC resolution, which may not even cover all possible 32b real digits anyway.
There is a trade-off of speed with precision in polynomial floating-point operations, and that can also vary with the values.
1) All numbers in JS are IEEE 754 64 bit floats. That's what the language spec. calls for.
2) If your JS engine only supports 32 bit floats it is not standards compliant.
3) You cannot use a 32 bit FPU to help with 64 bit float operations.
4) Ergo, JS engines on MCUs with 32 bit FPUs don't use them. They do it in an emulation library.
Now, one could build a JS engine that used 32 bit floats and gain the advantage of those FPUs. Not language spec. compliant, but what the hell. Turns out this breaks too many things. Significantly, it means a program cannot use exact integers up to 2 to the power 53. Which is expected quite a lot.
Meanwhile, ARM Holdings was bought by a Japanese company:
http://www.theguardian.com/politics/2016/jul/18/tech-giant-arm-holdings-sold-to-japanese-firm-for-24bn
“ARM is the greatest achievement of my life. This is a sad day for ARM and a sad day for technology in the UK,” he said. “It is the last technology company that is relevant in the UK. There will now be strategic decisions taken in Japan that may or may not help ARM in the UK.”
Let's forget the 64 bit Propeller thing. It was only an aside to my post, not anything I would expect to happen.
But what about the 64 bit float math in the cordic engine?
Probably also very optimistic of me. And not worth thinking about unless it is IEEE 754 compliant.
In the CORDIC we have:
* 64-to-32-bit square root
* 32-by-32-bit multiply with 64-bit product
* 64-over-32-bit divide with 32-bit quotient and remainder
We don't have 64x64 multiply or 64/64 divide.
I have a hard time imagining using floating-point types for maintaining bit masks. I think you'd be more likely to use arrays. Admittedly, it's not as efficient, but I think we all agree that this isn't necessarily what's important to someone who wants to use JavaScript. Also, while bit manipulation is always an important aspect of microcontroller development, I wonder if P2's smart pins will make this a bit less critical in many circumstances (at least for the performance-critical parts).
I'm not quite following. I don't see how you can effectively use the CORDIC engine with a 64-bit float.
Ergo, notice how you have now painted yourself into a corner, by discarding silicon you have paid for, which could give much higher speed, simply because of constraints self-imposed by a poor language choice.
Cool, got any links to code that uses/demands 2 to the power 52 integers?
I've never needed 2^52 integers myself.
I have been very interested in properly using the silicon that exists, like Chip mentions above,
* 64-to-32-bit square root
* 32-by-32-bit multiply with 64-bit product
* 64-over-32-bit divide with 32-bit quotient and remainder
and cannot fathom why anyone would choose to fence that off?
Still, as you admitted before, you can always keep adding more languages to the mix, to Fix the Compromises you first imposed. Of course, the original purity is long gone, & the system is then a long way from beginner-friendly.
I found some JS discussions where they admitted it was much slower than MicroPython.
What are the performance factors? There is the obvious longer pipeline to program around, for starters. Are there manufacturing thermal/critical path concerns?
The CORDIC is the critical path, already, since two 40-bit adders must resolve, in series, in the CORDIC stages which compensate for K gain.
I hope you realize you made me drool with that.
With that process, and 64 bit / 64k cogs, how big a hub could we have???
P.S. Sad about ARM, everything is a sell-out these days, such short-term vs long-term gains. I'm glad Parallax is a "family" company.
Thanks for info. I think I had seen you say that in other topics too.
Javascript is a rubbish language anyway IMHO. It's the main problem destroying the web. It needs to be banned from web browsers for sure.
Remove the CCCC bits and extend the S & D by 2 bits each.
Now, we would require a "fast single clock" instruction to precede instruction(s) that require CCCC testing (ie if_xxx).
COND #%cccc_nnnn (post-edit: missed out extra nn bits)
cccc = condition to test for in the following instructions
nnnn = the number of following instructions to test for this condition (max 15/16)
A Xilinx application note: double precision uses about 2.5x the logic of single precision functions
http://www.xilinx.com/support/documentation/application_notes/xapp552-cordic-floating-point-operations.pdf
and this CORDIC implementation of IEEE 754 double precision for the e^-x function only (but it refers to other implementations at the back)
http://rssi.ncsa.illinois.edu/proceedings/papers/posters/08_Pottathuparambil.pdf
If you aren't doing this already, would carry-save adders throughout the CORDIC help at all? If you can't use them everywhere, I'm not sure they would be worth it for only the final two adders, unless you could sneak the carry propagation logic in somewhere else, outside of the critical path.
For anyone who doesn't know, each full adder in a carry-save adder has its carry output connected to the carry input of the next bit of the next adder instead of the next bit of the same adder. At the end, you just add the carry word to the sum word. This helps with propagation time when you're adding lots of numbers up.
The FPGA or ASIC compiler will actually pick an optimal topology, based on speed requirements.
Edit: Well, maybe not. Couldn't the same effect be achieved with a conditional relative branch that would just branch around the non-conditional instruction?