64 bit floats on the P2 ?
Heater.
A while ago Chip expressed interest in getting a tiny Javascript engine running on the P2.
I love this idea because there is an ocean of programmers out there who would never dream of getting into C or assembler or Spin or whatever, but they are comfortable with JS. The Espruino and other tiny JS engines show what can be done with JS on a micro-controller.
There is one fly in the ointment. All numbers in JS are IEEE 754 64 bit floats.
Turns out that even if MCUs like the STM32F4 have floating point units, those tiny JS engines don't use them: 32 bits is not enough. (JS also uses those floats to hold 53-bit integers.)
Now, the P2 has some funky cordic hardware support for floating point. I was just wondering if it might be possible to get 64 bit float support with that cordic hardware. Perhaps with some tweaking?
Aside: Shame the P2 is not a 64 bit machine. Having eight 32 bit cores was pretty radical for an MCU when the P1 came out. Sixteen 64 bit cores would be the radical route today. Plus it would allow for huge COG memory which would make everybody happy.
Ah well...
Comments
BTW don't you mean huge hub memory??? We are constrained by die space - only going to be 512KB of Hub and 16x 4KB Cogs.
64 bit on the desktop has absolutely zippo to do with FPUs! AMD's nice x86_64 implementation does bring a refreshingly fuller general register set that x86_32 had always sorely lacked ... but, other than that, the real impetus was flat memory addressing range beyond 4GB and nothing more. No one wanted to start dealing with segmentation again.
Something that a library might be able to effectively package up.
I'm not going to worry about annoying practical details like transistor count. We can have billions of transistors nowadays. Transistors are almost free.
The limit on COG space is not so much transistors as the 9 bit address fields embedded into every instruction.
64 bit instructions would allow for 25 bit wide address fields. That's enough for 32 million COG registers/instructions!
Of course a COG would not be that big but it would allow for bigger COG space. Bigger native, independent, high speed code.
A lot of that address space could be mapped to HUB RAM of course.
Anyway, that's just me fantasizing. What I was really wondering is how to get some accelerated 64 bit float action from the cordic hardware support for those JS engines.
Or is that me fantasizing as well....
General registers would suddenly take a back seat to the huge address range of the 64 bit instructions. May as well remove CogExec altogether. Reducing CogRAM down to, say, 128 registers. Extending LUTRAM instead ... and build LUTRAM out of MRAM, same as HubRAM. Maybe even ditch HubRAM altogether.
Why is 32b not enough? Any proof?
Sure, that tweaking would be to resolve to the CORDIC resolution, which may not even cover all possible 32b real digits anyway.
There is a trade-off of speed with precision in polynomial floating-point operations, and that can also vary with the values.
1) All numbers in JS are IEEE 754 64 bit floats. That's what the language spec. calls for.
2) If your JS engine only supports 32 bit floats it is not standards compliant.
3) You cannot use a 32 bit FPU to help with 64 bit float operations.
4) Ergo, JS engines on MCUs with 32 bit FPUs don't use them. They do it in an emulation library.
Now, one could build a JS engine that used 32 bit floats and gain the advantage of those FPUs. Not language spec. compliant, but what the hell. Turns out this breaks too many things. Significantly, it means a program cannot use exact integers up to 2 to the power 53. Which is expected quite a lot.
Meanwhile, ARM Holdings was bought by a Japanese company:
http://www.theguardian.com/politics/2016/jul/18/tech-giant-arm-holdings-sold-to-japanese-firm-for-24bn
“ARM is the greatest achievement of my life. This is a sad day for ARM and a sad day for technology in the UK,” he said. “It is the last technology company that is relevant in the UK. There will now be strategic decisions taken in Japan that may or may not help ARM in the UK.”
Let's forget the 64 bit Propeller thing. It was only an aside to my post, not anything I would expect to happen.
But what about the 64 bit float math in the cordic engine?
Probably also very optimistic of me. And not worth thinking about unless it is IEEE 754 compliant.
In the CORDIC we have:
* 64-to-32-bit square root
* 32-by-32-bit multiply with 64-bit product
* 64-over-32-bit divide with 32-bit quotient and remainder
We don't have 64x64 multiply or 64/64 divide.
I have a hard time imagining using floating-point types for maintaining bit masks. I think you'd be more likely to use arrays. Admittedly, it's not as efficient, but I think we all agree that this isn't necessarily what's important to someone who wants to use JavaScript. Also, while bit manipulation is always an important aspect of microcontroller development, I wonder if P2's smart pins will make this a bit less critical in many circumstances (at least for the performance-critical parts).
I'm not quite following. I don't see how you can effectively use the CORDIC engine with a 64-bit float.
Ergo, notice how you have now painted yourself into a corner, by discarding silicon you have paid for, which could give much higher speed, simply because of constraints self-imposed by a poor language choice.
Cool, got any links to code that uses/demands 2 to the power 52 integers?
I've never needed 2^52 integers myself.
I have been very interested in properly using the silicon that exists, like Chip mentions above,
* 64-to-32-bit square root
* 32-by-32-bit multiply with 64-bit product
* 64-over-32-bit divide with 32-bit quotient and remainder
and cannot fathom why anyone would choose to fence that off?
Still, as you admitted before, you can always keep adding more languages to the mix, to Fix the Compromises you first imposed. Of course, the original purity is long gone, & the system is then a long way from beginner-friendly.
I found some JS discussions where they admitted it was much slower than MicroPython.
What are the performance factors? There is the obvious longer pipeline to program around, for starters. Are there manufacturing thermal/critical path concerns?
The CORDIC is the critical path, already, since two 40-bit adders must resolve, in series, in the CORDIC stages which compensate for K gain.
I hope you realize you made me drool with that.
With that process, and 64 bit / 64k cogs, how big a hub could we have???
P.S. Sad about ARM, everything is a sell-out these days, such short-term vs long-term gains. I'm glad Parallax is a "family" company.
Thanks for info. I think I had seen you say that in other topics too.
Javascript is a rubbish language anyway IMHO. It's the main problem destroying the web. It needs to be banned from web browsers for sure.
Remove the CCCC bits and extend the S & D by 2 bits each.
Now, we would require a "fast single clock" instruction to precede instruction(s) that require CCCC testing (ie if_xxx).
COND #%cccc_nnnn (post-edit: missed out extra nn bits)
cccc = condition to test for in the following instructions
nnnn = the number of following instructions to test for this condition (max 15/16)
A Xilinx application note: double precision uses about 2.5x the logic of single precision functions
http://www.xilinx.com/support/documentation/application_notes/xapp552-cordic-floating-point-operations.pdf
and this CORDIC implementation of IEEE 754 double precision for the e^-x function only (but it refers to other implementations at the back)
http://rssi.ncsa.illinois.edu/proceedings/papers/posters/08_Pottathuparambil.pdf
If you aren't doing this already, would carry-save adders throughout the CORDIC help at all? If you can't use them everywhere, I'm not sure they would be worth it for only the final two adders, unless you could sneak the carry propagation logic in somewhere else, outside of the critical path.
For anyone who doesn't know, each full adder in a carry-save adder has its carry output connected to the carry input of the next bit of the next adder instead of the next bit of the same adder. At the end, you just add the carry word to the sum word. This helps with propagation time when you're adding lots of numbers up.
The FPGA or ASIC compiler will actually pick an optimal topology, based on speed requirements.
Edit: Well, maybe not. Couldn't the same effect be achieved with a conditional relative branch that would just branch around the non-conditional instruction?