Later on you might want to try enabling the compressed instructions and see how it affects the build size. It doesn't seem to take too many gates, and the memory in these devices is pretty limited. It seemed to work effectively for some small examples, so it's worth a look.
I'm trying to port a bootloader (audioboot loader, yes, audio) to the riscv32. I added an input port, modified the output port, added resettable timers... let's see.
(Original from: https://github.com/ChrisMicro/AttinySound)
Along those lines, can't the Amazon Dash Buttons be configured with audio? (Electric Imp uses light.)
I had been thinking that it might be fun to drive I2S from the RISC V to get audio output. Any idea what the easiest way would be to get decent audio input? What are you planning to do?
KeithE,
That last firmware I posted works OK without MUL.
I just thought there was still some mystery as to why enabling sieve caused a problem? It still doesn't make sense to me, but this was the only thing that stuck out to me.
My guess is that if the processor does not have a MUL it should generate an illegal instruction TRAP if it sees one. The TRAP handler would then do the MUL in software. As it stands this firmware does not handle TRAPs or interrupts and there is no code in there to perform a software MUL.
Or, with the right options the compiler should not generate a MUL instruction anyway, but rather call a multiply routine that it builds in.
Anyway, my plan is to always have MUL and DIV instructions in future.
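For reference, the software fallback is essentially a shift-and-add loop. Here is a minimal C sketch (my own naming, not the actual libgcc source) of what a __mulsi3-style helper does:

```c
#include <stdint.h>

/* Shift-and-add multiply: one partial product per set bit of b.
 * Illustrative stand-in for a __mulsi3-style software multiply;
 * results wrap modulo 2^32 just like the MUL instruction. */
static uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1)
            result += a;   /* add the current partial product */
        a <<= 1;           /* move to the next bit position */
        b >>= 1;
    }
    return result;
}
```

As far as I know, building with -march=rv32i (no M extension) makes gcc call a libgcc helper along these lines rather than emit a MUL instruction, which is the compiler option route mentioned above.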
Heater, a 6809 Color Computer worked very reliably at about 1500 baud. That system even had filenames, and the ability to seek. Used it with C15 tapes at the time.
The Apple II's one-bit scheme is not as forgiving, but one would not notice much in normal use. A bit more touchy about signal level, that is about it. CPU clocked at 1.024 MHz.
Other system choices were way more conservative, in the 300 baud range, Atari, C64. The Atari had fancy hardware. I don't know if a faster loader was ever done, but multiple ones existed for C64, and 2400 baud was possible, but questionable on garden-variety tape decks, which is what the official peripheral was.
A simple trigger, timed loops, etc... should yield 1500 baud pretty easily, higher with the precision audio we have these days.
At one point, just as a curio, I had a library of audio for a couple machines stored on hi fi VHS, with index marks. Power on, ask for the index, press or type load, and off it would go! At that quality, fast loaders worked well. Seek times were in the half minute range. Spiffy.
I had an Atari 800, and the cassette tape (Atari 410 program recorder?) was very unreliable. I was so happy once we got an Atari 810 disk drive. Maybe my 410 had problems - http://www.atarimagazines.com/v2n1/tapetopics.html
No, they were crappy. I fitted mine with a better head and did alignment on it. It performed much better after that.
Super slow though. IMHO, the very worst cassette system. The Apple and CoCo were actually useful. Just get or make short length tapes and go. I had few problems.
The Atari one was a crap shoot. Would drop a C60 in there and save, save, save...
I was amazed my colleague got it working with such a simple circuit.
Reliable but slow. Did not get used much, after code was working nicely it ended up in EPROM. Besides, we had paper tape!
Surely with an FPGA at one end and a PC or whatever at the other we can get the data rate up to 10 or 20KHz. Perhaps some kind of quadrature encoding on two channels.
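For a rough feel of how those trigger-and-timed-loop loaders work, here is a toy period-keyed encoder/decoder in C. The sample counts and threshold are invented for illustration and not taken from any real machine:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy period-keyed tape scheme: a '1' is a short full cycle of square
 * wave, a '0' a long one.  Values below are made up for illustration. */
#define SHORT_HALF 4   /* samples per half-wave for a '1' */
#define LONG_HALF  8   /* samples per half-wave for a '0' */

/* Emit one square-wave cycle per bit; returns number of samples written. */
static size_t encode(const uint8_t *bits, size_t nbits, int8_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < nbits; i++) {
        size_t half = bits[i] ? SHORT_HALF : LONG_HALF;
        for (size_t j = 0; j < 2 * half; j++)
            out[n++] = (j < half) ? 100 : -100;
    }
    return n;
}

/* Decode by measuring run lengths between sign changes (zero crossings);
 * the first half-wave of each cycle decides the bit. */
static size_t decode(const int8_t *s, size_t n, uint8_t *bits)
{
    size_t nbits = 0, run = 1, half = 0;
    for (size_t i = 1; i <= n; i++) {
        if (i < n && (s[i] < 0) == (s[i - 1] < 0)) {
            run++;               /* still in the same half-wave */
        } else {
            if (half++ % 2 == 0) /* classify on the first half of each cycle */
                bits[nbits++] = (run <= (SHORT_HALF + LONG_HALF) / 2);
            run = 1;
        }
    }
    return nbits;
}
```

A real loader of the CoCo variety would also track speed and level drift on the fly, which is what made it tolerant of cheap tape transports.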
When I made my memory system for picorv32 I made it as 4 banks of byte wide memory making 32 bit wide memory. Why? Because the picorv32 has 32 bit data buses and individual write strobes for selecting bytes in a word to be written. In my naivety I imagined 4 byte wide banks was the way to do this. One strobe per bank. Besides, I had a devil of a job getting Quartus to use its memory blocks and not synthesize the RAM out of millions of logic cells.
But that meant I needed to split the firmware binary into 4 separate files, one for each memory bank. I tweaked the python bin2hex program to do that. It's a pain.
So just now, breaking all the rules of logic design, I was sucking on a beer and tweaking the memory Verilog in Quartus. No test bench or simulation first.
I changed the memory definition to a single 32 bit wide RAM.
I added a bunch of logic to get the right bytes written to the right place according to the strobe signals.
Poof! It works fine!
Like so:
if (mem_valid & enable) begin
    case (mem_wstrb)
        4'b0001 : begin
            mem[mem_addr >> 2] <= ((mem[mem_addr >> 2] & 32'hffffff00)) | mem_wdata[ 7: 0];
        end
        4'b0010 : begin
            mem[mem_addr >> 2] <= ((mem[mem_addr >> 2] & 32'hffff00ff)) | mem_wdata[15: 8];
        end
        4'b0100 : begin
            mem[mem_addr >> 2] <= ((mem[mem_addr >> 2] & 32'hff00ffff)) | mem_wdata[23:16];
        end
        4'b1000 : begin
            mem[mem_addr >> 2] <= ((mem[mem_addr >> 2] & 32'h00ffffff)) | mem_wdata[31:24];
        end
        4'b0011 : begin
            mem[mem_addr >> 2] <= ((mem[mem_addr >> 2] & 32'hffff0000)) | mem_wdata[15: 0];
        end
        4'b1100 : begin
            mem[mem_addr >> 2] <= ((mem[mem_addr >> 2] & 32'h0000ffff)) | mem_wdata[31:16];
        end
        4'b1111 : begin
            mem[mem_addr >> 2] <= mem_wdata[31: 0];
        end
        default : ;
    endcase
    rdy <= 1;
end else begin
I guess I was just not ready for the fact that this memory can be read and written at the same time. Unlike any actual RAM chips I have used in the past.
Now this is great as it will make the firmware easier to build in future.
Code is pushed to the repo. Hope it does not break anyone's builds out there.
Even those Lattice iCE40 RAMs have one read port and one write port. Higher end parts often have true dual-port RAM, so you can easily create shared memories for two clients. Some picorv32 sample code uses a similar coding style to yours above, but even simpler. I was wondering why you split those up at first. (scripts/icestorm/example.v line 51 on)
I didn't look in detail at what yosys does to create the register file when you have ENABLE_REGS_DUALPORT enabled. It seems to create parallel memories to get the two read ports. I was thinking that I could save some RAM by disabling that, but it cost quite a bit of logic. Didn't explore further.
Of course, while I am busy doing all that and'ing and or'ing to get stuff into the right byte in memory, Clifford just writes to the relevant wires, like so:
Next time you are wondering why I'm doing something stupid, please do say, it probably means I'm doing something stupid.
This is all a new adventure to me.
It's always amazing when the synthesizer figures it out so well isn't it?
Edited to add: for your original style it makes sense that masking and ORing in the data from the read port would add logic. Versus just gating a write enable to the appropriate memories.
A coworker once needed to emulate a 125-bit memory, with 5-bit data write lanes. I think that he ended up using a generate statement and instantiating 2-bit wide block RAMs. He had to toss out 1 bit out of every 6. Not a common case, so the synthesizer didn't know what to make of it. (As far as I can remember)
Only, I did not think of it. It's not what a softie would do in C.
A hardware designer would not think "Oh, I'll put a shift register or MUXes in there to move those bits up or down". No, he just connects the wires to the right place!
Hmm...now that I think about it, it would often be useful if we could do that in C.
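Half-serious aside: C can get surprisingly close with a union, although which lane maps to which bits depends on the host's endianness, which is exactly the detail the Verilog part-select pins down explicitly. A sketch (names are mine):

```c
#include <stdint.h>

/* View one 32-bit word as four byte lanes and assign a lane directly:
 * no shifting or masking in sight, much like connecting the wires.
 * Caveat: lane-to-bit mapping is endianness-dependent. */
typedef union {
    uint32_t word;
    uint8_t  lane[4];
} word32;

/* Endianness-independent equivalent using shifts, for comparison. */
static uint32_t set_lane(uint32_t w, unsigned lane, uint8_t v)
{
    uint32_t shift = 8u * lane;
    return (w & ~(0xFFu << shift)) | ((uint32_t)v << shift);
}
```

On a little-endian host the two agree lane for lane; on a big-endian one the union's lanes come out reversed, which is why portable C code usually sticks to the shift-and-mask form.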
Now I'm wondering if a string of "if"s would be more efficient than my case statement. Another day perhaps.
There's only one way to know for sure. Also, would "if"s be as efficient as a "case(1)"? (Not sure if yours is case(1) now.)
Edited to add: I know that this whole thread is about C and sort of dates back to ZoG, but the experience makes me wonder how much less painful it would be to use an Oberon compiler? (And maybe the Oberon RISC) Or maybe something else. If you want a simple embedded platform targetable by something higher-level than assembly, but can't use micro python/javascript/... what interesting choices do you have that are somewhat simple? And could be understood and maintained by one person?
Yes, but that is a lot of editing and building and waiting....
As it happens, now that I have jacked the RAM up to 64 KBytes the LEs used have gone back up to 1,767. What I gained I lost!
In the absence of knowing what is going on here I might not think about optimizing code much in the future. This compiler is obviously smarter than me!
Sitting here in the car and trying to look at your latest. I would try changing how you handle mem_wstrb. Why bother looking for patterns like 1100 in the case? Maybe try that "if" after all. If (mem_wstrb[0] ) etcetera. No need to worry if something weird like 1110 pops up - just write the three bytes. Sorry it's hard to type on my phone.
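That per-strobe-bit "if" idea can be sketched in C: just copy every byte lane whose strobe bit is set, so odd patterns like 4'b1110 fall out naturally instead of needing to be filtered. Names here are made up; this only models the intent:

```c
#include <stdint.h>

/* Merge write data into a word one byte lane at a time, driven by a
 * 4-bit write strobe: lane i is copied iff wstrb bit i is set. */
static uint32_t apply_wstrb(uint32_t old, uint32_t wdata, unsigned wstrb)
{
    uint32_t out = old;
    for (unsigned lane = 0; lane < 4; lane++) {
        if (wstrb & (1u << lane)) {
            uint32_t mask = 0xFFu << (8u * lane);
            out = (out & ~mask) | (wdata & mask);
        }
    }
    return out;
}
```

In hardware the loop would unroll into four independent byte-enable paths, one per lane, with no cross-lane logic at all.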
Also finds this forum comment "... As the other commenter pointed out, the question is whether ARM will be as popular 10+ years down the road. The chip Oberon was on was very popular back then. Not any more. So, they had to re-write significant portions of it to support new hardware.
Wirth decided it would be better to make a simple, custom processor that follows the spirit of the system. That by itself would be a fun project. He could put it on any current and future FPGA, something very unlikely to disappear. He then ported Oberon to it. Now, it's future-proof."
and this topical note "Wirth explains in one of his articles that designing the compiler and the CPU together simplifies both."
It is interesting, but I'll hold off for now as it looks like a whole new world to explore involving a qemu image to get that compiler. Plus I really did want to learn a little about embedded RISC V cores. I worked on chips which used embedded 8051 and then BA22 (related to OpenRISC) cores, so I'm always interested in seeing what's out there.
Edited to add: the verilog sources are certainly short. He's got a simple SPI and UART in there too.
And another edit: I couldn't resist taking a quick look at the RISC5 RTL. Here are some Lattice iCE40 numbers - not super small, probably because of the mult/div and float stuff. If that were taken out then the compiler would need some touch up.
Can you post your edited RISC5Top.OStation.v to get this working on Lattice ? - I get errors I'm unsure how to equivalence.
Yes, the float stuff is unexpected on a 'simple core', guess it shows how far we have come....
I presume the risc5.v processor is not the same as the riscv ISA.
Correct, both are 32b RISC, but not binary compatible.
From the numbers above RISC5 is larger than RISCV, but that larger RISC5 includes FPDivider.v, FPMultiplier.v, Divider.v, Multiplier.v
Project Oberon is a very interesting read. A great example of minimalistic design. Something we should be reminded of in this modern world.
Yes, and has a good place in education, as you can see the source of all of it, in one place.
Might yet fit on P2, or P2 + HyperFLASH, or P2 + (HyperFLASH+HyperRAM) ?
Sorry - I only did the RISC5 core and below, and I didn't have to edit anything in there as far as I can remember. Just wanted to know how much the CPU consumed without worrying about everything else. The video module has a DCM (digital clock multiplier) in it which would need to be changed to a Lattice equivalent. And the IOBUFs in the top level either need to be changed to the Lattice equivalent, or to high-level code for a tristate bus, depending on the sophistication of the tools. And this design depends on an off-chip SRAM.
I don't think that this "works fine". Maybe it synthesizes, but working is something different.
For byte 1 for example you need to shift the data to the left by 8:
((mem[mem_addr >> 2] & 32'hffff00ff)) | mem_wdata[15: 8] << 8;
But I think the whole idea is just wrong. If you read and write a block RAM at the same time you don't get a read-modify-write of the RAM content. These RAMs are synchronous; the data at the output comes one clock later.
I don't know what the synthesizer does with the "mem[mem_addr >> 2]" on the right side. I expect that the three bytes of the word that should be preserved just get overwritten when you do a byte write.
I think you have a point there. I woke up this morning thinking that my current firmware never actually writes bytes anywhere so that is not being tested. Well, except the UART which gets a char written in the low byte of a 32 bit write.
Anyway the code has changed a bit since that post. For example writing byte 1 now looks like this:
mem[mem_addr >> 2][15: 8] <= mem_wdata[15: 8];
Which I think you will agree is putting the right 8 bits in the right place.
Not sure about the read-modify-write timing. It does sound a bit off.
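For what it's worth, the part-select style can be modeled in plain C to see why it needs no shift: bits [15:8] of the write data stay in their own lane. A sketch (the function name is mine, and this software model sidesteps the synchronous-read timing question entirely):

```c
#include <stdint.h>

/* Model of a byte-1 write into a 32-bit word: keep bytes 0, 2 and 3 of
 * the old value, take bits [15:8] of the write data already in position.
 * This is what mem[mem_addr >> 2][15:8] <= mem_wdata[15:8] expresses. */
static uint32_t write_byte1(uint32_t old, uint32_t wdata)
{
    return (old & 0xFFFF00FFu) | (wdata & 0x0000FF00u);
}
```

The earlier mask/OR version went wrong precisely because it OR'ed in mem_wdata[15:8] without shifting it up into that lane.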
I will get to compressed ISA and such in good time. And many other things.
https://retrocomputing.stackexchange.com/questions/150/what-format-is-used-for-coco-cassette-tapes
That machine employed a coarse DAC (6-bit, I think) and an ADC to sample incoming audio and do system sound. The CPU is stock clocked under 1 MHz.
That tape system would track speed and level errors pretty nicely.
The Apple 2 used one bit audio to get about the same speed.
ftp://ftp.apple.asimov.net/pub/apple_II/documentation/hardware/io/The Apple II Cassette Interface.txt
All that older gear assumed a 3 to 5 kHz roll-off. Valid assumption at the time.
I bet that synthesizes to the same configuration anyway.
I don't understand that.
Changes now in the repo.
New bigger firmware attached.
Thing is, as far as I can tell there are only seven valid cases.
You can write one byte to any byte position in a word.
You can write a 16 bit word to the low or high half of the word.
You can write the whole word.
Anything else is a screw-up. It just feels natural to me to filter them out.
Probably I should not worry about that. Perhaps it saves an LE or two.
Just now I have to move on to firmware fixing....
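Those seven cases are easy to enumerate. A little C sketch of the filter (hypothetical helper, just to pin down the pattern list):

```c
/* The seven strobe patterns a picorv32-style bus should produce:
 * four single bytes, two half-words, one full word. */
static int wstrb_valid(unsigned wstrb)
{
    switch (wstrb) {
    case 0x1: case 0x2: case 0x4: case 0x8:  /* single byte */
    case 0x3: case 0xC:                      /* half-word   */
    case 0xF:                                /* full word   */
        return 1;
    default:
        return 0;
    }
}
```

Whether the filter saves anything over the per-bit "if" approach (which simply tolerates the other nine patterns) is exactly the sort of thing only the synthesis report can settle.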
Project Oberon certainly is worth tracking, and also might fit onto P2, with modest external memory.
Over at http://www.projectoberon.com/
I see they have verilog for the other RISC5 (may have been smarter to call that RISC-W, or RISC-O ?)
https://www.inf.ethz.ch/personal/wirth/ProjectOberon/SourcesVerilog/RISC5.v
but it may not take much to port the code generator to RISC V ?
and this about Oberon compiler work for RISC V ?
http://lists.inf.ethz.ch/pipermail/oberon/2016/009980.html
http://oberon.wikidot.com/rop2compiler
However I'm not up for learning yet another language this year and will continue digging the RISC V furrow.
I forgot there was also this, which is a compiler that supports Wirth's RISC5
http://www.astrobe.com/forum/viewtopic.php?f=3&t=518&sid=0bfdbee8aa4646b5e09cd760eb513438
This calls for some testing...