RISC V ?

Heater. · 2018-10-11 05:09

But that looks like it's about the size of a RISC V core anyway. Judging by the size estimates at the bottom of this page: https://github.com/cliffordwolf/picorv32

KeithE · 2018-10-11 17:20

Heater. wrote: »

But that looks like it's about the size of a RISC V core anyway. Judging by the size estimates at the bottom of this page: https://github.com/cliffordwolf/picorv32

Duh - you are correct. Hmm - maybe J1a can be radically shrunk. I would have to synthesize to be sure, but I'm guessing this might be related to the stacks being implemented as registers.

stack2 #(.DEPTH(15)) dstack(.clk(clk), .rd(st1), .we(dstkW), .wd(st0), .delta(dspI));
stack2 #(.DEPTH(17)) rstack(.clk(clk), .rd(rst0), .we(rstkW), .wd(rstkD), .delta(rspI));

That's (15+1)*16 + (17+1)*16 = 544 registers if I count correctly. (The +1 is for the head register.)
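For anyone who wants to check that count, here is a quick Python sketch. The function name is made up for illustration, and the 16-bit width is simply the one used in the sum above:

```python
def stack_flops(depth: int, width: int = 16) -> int:
    """Flip-flops for a register-based stack: `depth` entries plus one head register."""
    return (depth + 1) * width


dstack = stack_flops(15)  # data stack, instantiated with DEPTH(15)
rstack = stack_flops(17)  # return stack, instantiated with DEPTH(17)
total = dstack + rstack

print(f"dstack={dstack} rstack={rstack} total={total}")
```

It only counts storage bits. The muxing around the stacks is not included, so real synthesis results will differ.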

Ale · 2018-10-14 10:58

So ignoring the RAM you could fit 4.5 of those on the UP5K part.

No need to do that, I think. The UltraPlus has 4 blocks of 16K x 16 single-port RAM, and the J1 is a 16-bit implementation, so one could have 4 of them, each with its own 16 kWords of memory.

There is Ataradov's RISC V implementation; it is pipelined too and not much bigger than the picorv32.
In terms of LUTs it seems smaller than the 8051 we use at work.

Cluso99 · 2018-10-14 13:13

There is a home brew micro built from 12 FLASH/EEPROMS and an SRAM. Quite an interesting design. It's in the home brew computer ring (not sure of the precise name but I'm sure google will find it). Some of the EPROMs are just emulating latches.

I am sure this could be modified to make it RISC V.

Cluso99 · 2018-10-14 13:21

FWIW I have just been blown away looking for a video card. The Nvidia RTX 2080 Ti has >4000 CUDA cores.
Didn't realise just how many cores these graphics cards are packing!!!

Heater. · 2018-10-15 03:55

Mind bending isn't it.

And now Nvidia is working on a RISC V core to put in there and manage all of those little guys.

Roy Eltham · 2018-10-15 06:53

Cluso,
The "cuda core" numbers are a bit misleading. While it is true that they have that many scalar floating point operation units, it's a bit of a stretch to call them cores in the same sense that you are thinking of cores. For RTX, these "cuda cores" are grouped together into Streaming Multiprocessors (SMs). One SM has 4 "Warp Scheduler" groups; each group has a warp scheduler and dispatch unit, a register file, 16 int32 units, 16 fp32 units, 4 load/store units, 2 tensor cores (again not really a true core), and a Special Function Unit (this is for things like trig, and what not). The SM also has a block of shared memory (shared by all 4 groups, and also serving as the L1 cache to main memory; it's split between those two things), 4 texture sampling units, and an RT Core (again not a true core). Then, 12 SMs are grouped together into a GPC (Graphics Processing Cluster); the GPC also has some raster engine stuff and is hooked to the L2 cache and the memory controllers (that connect to the external GDDR). Any given GPU will have at least 1 GPC; the high end ones have 6 of them.

Nvidia counts the fp32 units as cuda cores, so with 64 of those per SM and 72 SMs, you get 4608 "cuda cores" in the top end cards.

So, it's probably more correct to call an SM a core, or perhaps one of the warp groups inside the SM a core. If you go with SM, then the top card has 72 of them, if you go with the warp group, then it has 288 of them.
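The counting above can be laid out as a small Python sketch. The constants are just the per-unit figures quoted in this post for the top end part, and the variable names are made up for illustration:

```python
# Per-unit figures as quoted in the post above (top end RTX part).
FP32_PER_GROUP = 16   # fp32 units per warp scheduler group
GROUPS_PER_SM = 4     # warp scheduler groups per SM
SMS_PER_GPC = 12      # SMs per GPC
GPCS = 6              # GPCs on the high end chips

fp32_per_sm = FP32_PER_GROUP * GROUPS_PER_SM  # what gets marketed as "cuda cores" per SM
sms = SMS_PER_GPC * GPCS                      # "cores" if you count SMs
cuda_cores = fp32_per_sm * sms                # the marketing number
warp_groups = sms * GROUPS_PER_SM             # "cores" if you count warp groups

print(f"{fp32_per_sm} fp32/SM, {sms} SMs, {cuda_cores} cuda cores, {warp_groups} warp groups")
```

Lower end cards ship with fewer SMs enabled, so their totals scale down accordingly.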

To relate this to Prop2, a warp group would be a cog with its register file being cog/lut memory. The shared memory would be like hub. So, an SM is like a 4 cog Prop2, but then a GPC has 12 of those and they all connect to an external shared memory (really 12 banks of external memory that can all be accessed simultaneously by different SMs/GPCs). They don't really have an equivalent to the smart pins, nor any real I/O besides to the external memory.

Dolu1990 · 2018-10-15 13:06

About the riscv softcore contest :
I played a bit, got Zephyr running and the required RV32I compliance test passing on a 4-stage VexRiscv, and it requires "many" CSRs:
mstatus + mie + mie + catch misaligned branch/load/store + the following:
misaExtensionsInit = 0,
misaAccess = CsrAccess.READ_ONLY,
mtvecAccess = CsrAccess.WRITE_ONLY,
mepcAccess = CsrAccess.READ_WRITE,
mscratchGen = true,
mcauseAccess = CsrAccess.READ_ONLY,
mbadaddrAccess = CsrAccess.READ_ONLY,
mcycleAccess = CsrAccess.NONE,
minstretAccess = CsrAccess.NONE,
ecallGen = true,
ebreakGen = true,
wfiGenAsNop = true,

It nearly doubles the size of the CPU on iCE40, as in those FPGAs if you just want a FF you lose the LUT.
minimal
iCE40 -> 1423 LC
Artix 7 -> 728 LUT 533 FF
Cyclone V -> 544 ALMs
Cyclone IV -> 997 LUT 461 FF

An RV32I configuration with reduced CSRs, enough to emulate all the missing features, would require:
iCE40 -> 1114 LC
Artix 7 -> 586 LUT 403 FF
Cyclone V -> 427 ALMs
Cyclone IV -> 818 LUT 329 FF
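
To put the two listings side by side, here is a small Python sketch. The numbers are copied from the post as written; the dictionary names are made up, and treating the first listing as the larger CSR set is an assumption based on the figures themselves:

```python
# Resource counts copied from the two listings above.
first_listing = {
    "iCE40 LC": 1423,
    "Artix 7 LUT": 728,
    "Artix 7 FF": 533,
    "Cyclone V ALM": 544,
    "Cyclone IV LUT": 997,
    "Cyclone IV FF": 461,
}
reduced_csr = {
    "iCE40 LC": 1114,
    "Artix 7 LUT": 586,
    "Artix 7 FF": 403,
    "Cyclone V ALM": 427,
    "Cyclone IV LUT": 818,
    "Cyclone IV FF": 329,
}

# Absolute and relative savings of the reduced-CSR configuration.
savings = {k: first_listing[k] - reduced_csr[k] for k in first_listing}
percent = {k: 100.0 * savings[k] / first_listing[k] for k in first_listing}

for k in first_listing:
    print(f"{k:15s} {first_listing[k]:5d} -> {reduced_csr[k]:5d}  saves {savings[k]:4d} ({percent[k]:.1f}%)")
```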

Heater. · 2018-10-15 13:57

I'm surprised that the competition does not allow the RV32E for the small category. That is "E" as in "embedded", intended for small systems. It only has 16 registers and does not require all those counters.

Dolu1990 · 2018-10-15 14:03

RV32E isn't ready yet as there is no compiler support. Zephyr only requires one timer (64 bits); the compliance tests don't require any counter.

Dolu1990 · 2018-10-15 14:39

I mean, the current Zephyr port requires one 64 bit counter / comparator ^^
Could be changed without much pain.

Heater. · 2018-10-15 15:23

I was wondering about the status of GCC support for RV32E. I found a commit log in GCC that says: "RISC-V: Add RV32E support.". I guess none of that is upstreamed yet.

https://github.com/riscv/riscv-gcc

Heater. · 2018-10-28 05:09

I found some time to get back to my picorv32 RISC V on DE0 Nano. https://github.com/ZiCog/xoro

Biggest changes are:

1) Replaced the tri-state data bus with a bus interface module as has been suggested here a few times.
2) Got the LEDs output, timer, xoroshiro128+ PRNG modules working again.
3) Firmware now exercises LEDs, timer and PRNG. Also now includes the fftbench benchmark.
4) Updated the picorv32 core to Clifford Wolf's latest version.

About that benchmark, fftbench is running in 23688us using hardware multiply and a 100MHz clock.

I was wondering how that compares to the P2 running fftbench on the DE0 Nano. The closest I have come to finding P2 results is ersmith's comparison of P2 interpreters https://forums.parallax.com/discussion/166671/p2-interpreter-comparison which shows an fftbench result of 138096us for p1gcc LMM and 60964us for riscvjit_p2.

On the face of it that makes RISC V about twice as fast as the P2 but I'm sure I'm not comparing like for like here as we don't have a real P2 GCC yet.

The P2 would no doubt pull off an FFT much faster with its CORDIC engine and such. But as I always say, fftbench is not really about performing an FFT, consider it as just an example of typical code full of loops, logic and integer maths.

I'm probably done with this for a while now, got to get back to my RISC V in SpinalHDL effort https://github.com/ZiCog/sodor-spinal

P.S. If all this is clutter in the P2 thread it could be moved to general discussion.

potatohead · 2018-10-28 06:33

(I just erased a wall of text, thinking I can get this said simply)

P2 won't compare easily to things like RISC V. It's got purpose built silicon, much like say, SGI added to their workstations, for example. Or, like the Amiga had early on.

With SGI, general purpose compute was never a top priority. Literally everything else was! Multi-CPU? How about 1024 CPU single OS image computing? Video, sound, and other data streaming, like digital TV? Nailed it.

The Amiga was similar, for its time. The robust hardware brought things to people roughly a decade ahead of time. True for both SGI and Commodore.

One can compare an SGI to a SUN workstation, and on general purpose compute, the SUN was superior. On just about every other metric? No contest, generally it was SGI.

The Amiga vs Atari ST played out almost the same way, but with a much smaller general compute difference.

The P2 is going to be one of these well differentiated bits of technology. The purpose built silicon is aimed right at a lot of tasks that may not always make sense, or that could add value, usability to a solution, and or ways of taking reality in, processing it, and streaming that right back out to the person wanting to understand or control better too.

And, like other efforts of this nature, general purpose compute really isn't a priority, literally everything else is!

Should be a whole lot of fun!

Back in the day, when I would work with people trying to sort this out, "do I get a Sun, or SGI, or HP?"

What I would do is look at what it is they need to do. Then evaluate how they are doing it, understand that well, and then see whether it could be done better, more quickly, in a more robust way, whatever, and where that was true, they would consistently buy the purpose built hardware. SGI. Where that was not true, they rarely did, not unless they ended up somehow valuing those other capabilities in some way or other.

The cost difference was dramatic then too. One could sometimes get two Sun machines for the cost of one SGI machine. But, a very well equipped SUN, comparable to the purpose built silicon in the SGI (if that could even be done at the time, and it couldn't always), ended up a lot more expensive and or complex.

I think it's going to be this way with the P2. It's not the cheapest one out there, nor will it be the peak performer on general purpose compute, but its price / feature / performance ratio will be excellent. It's gonna come down to what people value. For people doing fairly standard engineering on fairly typical problems, meh.

However, if there is an opening or potential for more, something distinctive, interactive, etc... the P2 is going to shine pretty bright.

TL;DR: RISC V won't be directly comparable to the P2. Simple benchmarks just are not going to tell the tale, nor will spec sheet numbers.

Aaaaand what did I get?

Another freaking wall of text. Oh well. Good night all. Spud is going to check out and get some sleep. Granddaughter Pooky is here, and I'll be up early exploring the world with her tomorrow, after making some eggs, and watching a good bit of "Paw Patrol."

jmg · 2018-10-28 06:53

Heater. wrote: »

P.S. If all this is clutter in the P2 thread it could be moved to general discussion.

With P2 maturing into Silicon, maybe this, and P1V topics, can go into a separate category, called something like
P1V (and other cores) in FPGA ?

Heater. · 2018-10-28 07:58

jmg,

That sounds like a good idea.

Just now we have P1 and P2 available for FPGA. People have done all kinds of mods to the P1 HDL and experiments with both. Not to mention people just experimenting with snippets of P2 HDL that have been presented and so on. Parallax has even created FPGA boards. Then there is me barging in with my RISC V ramblings.

So perhaps, maybe, it's time for the forum to have an FPGA category. A playground for such things. Especially as now, as you say, the P2 is actually a real chip.

You never know, if it got a lot of attention it might inspire Parallax to create a cheapish FPGA board using Lattice chips and the Open Source tool chains. Which would be nice.

jmg · 2018-10-28 08:15

Heater. wrote: »

So perhaps, maybe, it's time for the forum to have an FPGA category. A playground for such things. Especially as now, as you say, the P2 is actually a real chip.

Thinking about this more, maybe the P2 side could also split into P2 Silicon area covering Engineering samples / Boards / Languages / Loaders etc threads (ie real stuff, relating to real boards/chips), and
a P2-R&D area, that has the FPGA and discussions of features. ie More theory/gestation, or relating to FPGA boards

Heater. · 2018-10-28 08:42

jmg,

Ahggrrr no.... I don't think we should be encouraging more P2 development. I like to think of the P2 as done. I'd like to actually get a P2 in my hands before they put me up the chimney.

I'd go for an FPGA area that could accommodate P2/3 R&D, P1 mods, and general FPGA/logic design ramblings.

jmg · 2018-10-28 08:53

Heater. wrote: »

jmg,

Ahggrrr no.... I don't think we should be encouraging more P2 development. I like to think of the P2 as done. I'd like to actually get a P2 in my hands before they put me up the chimney.

I'd go for an FPGA area that could accommodate P2/3 R&D, P1 mods, and general FPGA/logic design ramblings.

I think you missed the point.
Yes, P2 is largely done, and the split into an archive of P2 R&D is more historic record, and actually done to avoid confusing new users, who arrive into P2 Thread today, and see a sea of FPGA discussions and talk around features.
A P2 R&D area may well include early P3 discussions, but that is also rather separate from P1V.

Heater. · 2018-10-28 08:58

Ah, I see what you mean. Makes sense.

It kind of implies another forum category for P1V and other FPGA discussion. Generally I don't like to see hundreds of categories on forums. Is there enough such traffic to justify more than one?

Heater. · 2018-10-28 09:07

potatohead,

(I just erased a wall of text, thinking I can get this said simply)

I do love to find your walls of text in the morning and read them over my first coffee of the day. Let me reply in kind...

P2 won't compare easily to things like RISC V.

It was never my intention with this thread to compare RISC V and P2 or suggest they are in competition. I always imagined them as being very different beasts designed with totally different goals in mind. In short, general purpose computing vs micro-controller, low latency/deterministic timing vs maximum throughput, and so on.

What I have imagined is a synergy between the two. (Did I really use that word?) I did open with "Q: What's the quickest way to get GCC working for the P2? A: Drop a RISC V processor core in there."

Not a serious suggestion but one can dream of a full up Linux running RISC V with Propeller cogs around it for the IO hard work. Not in this life time though I think.

Having said that, and having looked at the RISC V a bit more closely. I might be coming closer to making a comparison. Why?...

Because the RISC V is not really a thing. It's only a specification. An instruction set architecture spec. That spec is intentionally open ended. The base specification gets you enough to build a CPU that has compilers and tools available, but there is a lot of scope for extensions. One can mold it how one likes. It need not have high latency pipelines and caches. It could have Propeller like direct connectivity between CPU and IO. It could be surrounded with peripheral hardware, like the CORDIC engine, streamers, etc, together with custom instructions to talk to them directly. The picorv32, for example, has a co-processor interface exactly for that purpose, currently used for the optional mul and div instructions.

So in a way I'm now curious as to how one would add such Propeller like features to a RISC V core and make it more, well, Propeller like. An easy place to start would be custom RISC V instructions for direct bit banging IO, another would be instructions for waiting on events. That moves the RISC V into P1 territory already. Of course the first step is getting 8 RISC V cores working together like Propeller COGs.

Anyway, this is all vague hand waving. Certainly I don't have the chops to push these ideas very far. Reality is that the P1 exists, the P2 almost does, meanwhile there is only one RISC V micro-controller on the market and it is not at all Propeller like. They are chalk and cheese.

I did not quite get your point about SGI and Amiga. Certainly they made use of custom silicon to get an advantage over the competition. However both were built on general purpose processors. SGI started out with Motorola 68000 and later moved to MIPS. Amiga was Motorola.

TL;DR: RISC V won't be directly comparable to the P2. Simple benchmarks just are not going to tell the tale, nor will spec sheet numbers.

Quite so. Benchmarks should be taken with a large pound of salt. Taken in isolation they are not worth the bench they are marked on.

They can be a general guide but one should be mindful of the actual problems one needs to solve in one's application.
The fftbench won't tell you anything about how well a P2 generates VGA or HDMI, for example. The application is the best benchmark!

jmg · 2018-10-28 09:13

Heater. wrote: »

Ah, I see what you mean. Makes sense.

It kind of implies another forum category for P1V and other FPGA discussion. Generally I don't like to see hundreds of categories on forums. Is there enough such traffic to justify more than one?

My screen shows all forum areas, with posts sorted by most recent.
That means having 10 or 11 or 12 categories makes no difference to my landing screen, but it does make search easier within any forum area.
Search on this forum is not great...

P1V and RISC V seem to be ticking along, tho P1V will maybe fade as P2 Silicon gets deployed.

As you mention above :
"You never know, if it got a lot of attention it might inspire Parallax to create a cheapish FPGA board using Lattice chips and the Open Source tool chains. Which would be nice."
Yes, it could be interesting to track FPGA trends, and when it might make sense to do a P1V board.
In the short term, I'd expect Parallax to be very busy ramping support for P2 - too busy to do any FPGA boards.
It's clear that over the (long) time P2 has been in gestation, FPGAs have gotten larger and cheaper, and they have proven their worth in P2 development.
P1V & RISC V efforts may even spawn some hybrid, eg using a P1V like the TPU in the freescale cores, or maybe someone will add an eggbeater to 8 RISC V cores... ?

Ale · 2018-10-28 09:23

Paw Patrol

I'm done with the 3 first seasons, many times, too many times.
The picorv needs some 5 clocks per instruction; the P1 LMM gets what, maybe 8? What about a RISC V with an external memory interface? That is a decisive point. A pipelined RISC V executing from internal block RAM gets close to 1 cycle per instruction. The P2 will be 2 clocks, with 512 kBytes of opcode space. Even if you get the RISC V to do one op every 2 cycles, the memory architecture will play a definite role; it cannot be left outside the equation.

Heater. · 2018-10-28 10:14

If you are done with Paw Patrol perhaps check out Shaun The Sheep: https://www.shaunthesheep.com/watch

I generally agree but a couple of points:

To be fair the picorv takes from 3 to 15 clocks depending on instruction and how one configures the core. 3 to 6 might be more typical. I have no idea how that translates to average CPI for actual executing code.

Pipelining can no doubt reduce CPI but at the cost of LE's/silicon. As would multiple dispatch and speculative execution. But that buggers up timing determinism even more.

The RISC V spec says nothing about the memory architecture. I was thinking my Sodor effort would only ever be a Harvard architecture. Nice and simple. Low logic overhead. Good speed.

ersmith · 2018-10-28 11:15

Heater. wrote: »

About that benchmark, fftbench is running in 23688us using hardware multiply and a 100MHz clock.

I was wondering how that compares to the P2 running fftbench on the DE0 Nano. The closest I have come to finding P2 results is ersmith's comparison of P2 interpreters https://forums.parallax.com/discussion/166671/p2-interpreter-comparison which shows an fftbench result of 138096us for p1gcc LMM and 60964us for riscvjit_p2.

We have a Spin version of the fftbench, and we have fastspin for P2, so:

fastspin 3.9.7 -O2 on P1: 62507 us
fastspin 3.9.7 -O2 on P2: 41515 us (80 MHz FPGA)

The P1 fastspin results are pretty comparable to P1 GCC results (I have 65246 us in my table for gcc LMM -O2) so fastspin seems to be in roughly the same ballpark performance wise.

It's curious that P2 isn't much further ahead of P1 on this benchmark. I think some of the core loops fit in fcache on the P1, which really helps performance. fastspin doesn't implement fcache for P2, relying solely on hubexec.

OTOH the real P2 silicon will be clocked 3x faster, so that'll make a huge difference.

I've attached the zip file of the source for fftbench.spin. I've left the P1 version alone as much as possible, just using an #ifdef to substitute SimpleSerial.spin for FullDuplexSerialPlus.spin if compiling for P2.

Heater. · 2018-10-28 14:46

Bloody hell, I just noticed I have a bug in the fft_bench in my github repo https://github.com/ZiCog/fftbench. It does not include the bit-reversal stage in the elapsed time!

Perhaps it's not so bad, including the bit-reversal only adds about 13% to the execution time. Looks like all the other language versions in the repo do it correctly so I will have to fix the C version to match. (Looks like a hack I might have made when messing around with OMP and then forgotten about)

@eric

In order to keep things straight I lowered the clock on my picorv32 to 80MHz, fixed the above bug and re-ran the fft_bench. The result is:

1024 point bit-reversal and butterfly run time = 16924us

Which makes the picorv32 about 2.4 times faster than fastspin on the P2. Which I find a bit surprising. Would the GCC compiler do better? I guess that fcache would help.

Indeed the P2 will be 3 times faster. But it's interesting to get a clock for clock comparison. 3 times is huge, but only enough to get its nose ahead of picorv32 at 80MHz.
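
Here is that comparison as a quick Python sketch. The two run times are the 80MHz figures quoted in this thread; the projection assumes run time scales linearly with clock, which is a simplification, and the names are made up for illustration:

```python
FASTSPIN_P2_US = 41515   # fastspin -O2 on P2, 80 MHz FPGA (ersmith's figure)
PICORV32_US = 16924      # picorv32 at 80 MHz, bit-reversal included
FPGA_CLOCK_MHZ = 80
SILICON_SPEEDUP = 3      # "clocked 3x faster" on real P2 silicon (assumed linear)

# Clock-for-clock ratio at 80 MHz.
ratio = FASTSPIN_P2_US / PICORV32_US

# Projected P2 silicon run time, and how far ahead of the 80 MHz picorv32 that puts it.
p2_silicon_us = FASTSPIN_P2_US / SILICON_SPEEDUP
lead = PICORV32_US / p2_silicon_us

# Total clock cycles the picorv32 spends on the benchmark (us * cycles per us).
picorv32_cycles = PICORV32_US * FPGA_CLOCK_MHZ

print(f"picorv32 vs fastspin P2 at 80 MHz: {ratio:.2f}x")
print(f"projected P2 silicon: {p2_silicon_us:.0f} us, {lead:.2f}x ahead of picorv32 at 80 MHz")
print(f"picorv32 cycles: {picorv32_cycles}")
```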

Heater. · 2018-10-28 15:09

Wow, when I enable the picorv32 barrel shifter I get down to this:

1024 point bit-reversal and butterfly run time = 16013us

At the cost of 240 LE's.

potatohead · 2018-10-28 16:56

That right there is purpose built silicon in action Heater.

Re: earlier SGI, Amiga point

As systems, that purpose built hardware made a difference.

Consider microcontrollers as systems, and what is in the chip matters beyond general compute.

On a single thread benchmark, those machines rarely were at the top. But, on a task, say processing a football game video stream to overlay the first goal line, the difference was a lot more clear.

ersmith · 2018-10-28 18:44

Heater. wrote: »

Bloody hell, I just noticed I have a bug in the fft_bench in my github repo https://github.com/ZiCog/fftbench. It does not include the bit-reversal stage in the elapsed time!

Interesting, I hadn't noticed that either. That means fastspin is now comfortably ahead of PropGCC on P1, woo-hoo! Of course PropGCC correctly handles a *lot* more of the C language than fastspin does.

In order to keep things straight I lowered the clock on my picorv32 to 80MHz, fixed the above bug and re-ran the fft_bench. The result is:

1024 point bit-reversal and butterfly run time = 16924us

Which makes the picorv32 about 2.4 times faster than fastspin on the P2. Which I find a bit surprising. Would the GCC compiler do better? I guess that fcache would help.

If the PropGCC authors add fcache to P2 it might help, but right now AFAIK no compilers use fcache on P2. It shouldn't be hard for me to add it to fastspin -2, but it's not nearly as big a win on P2 as it is on P1 so it hasn't been a priority.

Does your picorv32 have hardware multiply? If so that's probably a big reason for the difference -- even on P2 multiplying 32 bit numbers is relatively slow, as it goes through the CORDIC unit. (You can build 32 bit multiplies out of the native 16 bit multiplies, and that would make things a bit faster, but it's still a bottleneck.)

Also, as you know, the RISC-V instruction set is very well designed. It's easy to write compilers for it and to implement it in silicon in a performant way, so RISC-V chips should generally perform well on benchmarks.

Heater. · 2018-10-28 20:12

potatohead,

That right there is purpose built silicon in action Heater.

I know what you are getting at but it's not quite what I had in mind when thinking of "purpose built silicon".

Speeding up the typical instructions a compiler may generate for a CPU to run, like ADD, MUL, DIV, SHIFT, is far away from the extensive support for graphics and sound etc of the SGI or Amiga that was provided by hardware external to the processor. The Motorola and MIPS cores they used knew nothing of that. There is a reason we still have graphics cards from Nvidia and such.

The P2 is a case in point, it has a ton of features to do all kinds of things super quickly. There will never be a C, or other language, compiler that can generate code to make use of all that.

It might be a long road to make all those P2 features accessible to users with libraries, objects, whatever.
