Well, you are right. Those byte writes do not work.
Looks like I'll be going back to separate byte banks.
I could use the Quartus mega wizard thing to conjure up a 32 bit wide RAM with byte enables. I notice it has options for having the q output reflect new or old data when doing a write while reading. But that is getting a bit Quartus specific.
I think you have a point there. I woke up this morning thinking that my current firmware never actually writes bytes anywhere so that is not being tested. Well, except the UART which gets a char written in the low byte of a 32 bit write.
Anyway the code has changed a bit since that post. For example writing byte 1 now looks like this:
mem[mem_addr >> 2][15: 8] <= mem_wdata[15: 8];
Which I think you will agree is putting the right 8 bits in the right place.
Not sure about the read-modify-write timing. It does sound a bit off.
This calls for some testing...
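For anyone following along, here is a rough software model of the byte-lane write discussed above. It is only a sketch in Python, not the actual Verilog: the class name `WordMemory` and its methods are made up for illustration, and it assumes the convention shown in the byte 1 example, where the write data already carries each byte in its own lane and a 4-bit strobe (like `mem_wstrb`) selects which lanes are stored.

```python
class WordMemory:
    """Word-addressed RAM with per-byte write strobes (a software model only)."""

    def __init__(self, words):
        self.mem = [0] * words

    def write(self, addr, wdata, wstrb):
        """Update only the byte lanes whose bit is set in the 4-bit wstrb."""
        index = addr >> 2                  # byte address -> word index
        word = self.mem[index]
        for lane in range(4):
            if wstrb & (1 << lane):
                mask = 0xFF << (8 * lane)
                # Lane N of wdata replaces lane N of the stored word.
                word = (word & ~mask) | (wdata & mask)
        self.mem[index] = word & 0xFFFFFFFF

    def read(self, addr):
        return self.mem[addr >> 2]


ram = WordMemory(16)
ram.write(0x04, 0x11223344, 0b1111)    # full word write
ram.write(0x05, 0x0000AB00, 0b0010)    # byte 1 only: bits 15..8
result = ram.read(0x04)
print(f"{result:08X}")
```

The second write touches only bits 15..8 of the word, which is the same effect the `[15: 8]` assignment is meant to have. The model says nothing about read-modify-write timing in real block RAM.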
I know next to nothing about Verilog but why didn't you write it like this:
mem[mem_addr[31: 2]][15: 8] <= mem_wdata[15: 8];
Will that not work? Will it synthesize differently than with the shift operator?
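As a quick sanity check on the question above, the two forms select the same word index for any 32-bit address. The snippet below is a Python sketch only; `bit_slice` is a made-up helper that mimics a Verilog part-select, and it says nothing about whether a given synthesizer treats the two spellings identically.

```python
def bit_slice(value, hi, lo):
    """Return bits hi..lo of value, like Verilog's value[hi:lo]."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


addresses = [0x00000000, 0x00000005, 0x0000FFFC, 0xFFFFFFFF]
same = all((a >> 2) == bit_slice(a, 31, 2) for a in addresses)
print(same)
```

As far as I know, the only difference in Verilog is the width of the result (32 bits with two leading zeros for the shift, 30 bits for the slice), which should not matter when the value is used as a memory index.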
It's kind of crufty and poorly documented, but it's able to run C code like the fibo and xxtea benchmarks. Performance is actually not too bad, considering that I haven't really tried to optimize yet:
As you can see code size is not a strong point of the RISC-V architecture. The C extension for compressed instructions would help with that, but I don't think there's room in the COG to implement it. It does seem that register based virtual machines (like the Prop1 CMM or this one) have a considerable performance advantage over stack based ones like Spin and the ZPU. The gcc LMM performance results are vastly improved by FCACHE, of course; without it performance would probably be about 4x slower.
RAM can be inferred from register-array assignments in Verilog, but Quartus is very sensitive to how much is asked. If the Verilog exceeds a rather low threshold of complexity, Quartus reverts to using registers in the logic fabric, which always results in disaster.
Maybe you could implement this for the P2 using XBYTE and maybe (gasp) SKIP? That way we could have a C compiler right away!
RISC-V instructions are longs, so I don't think XBYTE will help much. Manual SKIP instructions might help. I haven't really looked at P2 much lately, kind of waiting for things to settle down.
Eric
Ah yes, XBYTE will be useless for this. It would be interesting to have a RISC-V emulator for P2. I wonder how its performance would compare with a CMM implementation for P2. One thing I asked for a while back is some hardware assist in expanding compressed CMM instructions into full 32 bit P2 instructions. That would create something like the ARM Thumb mode that should produce better code density. I never really worked out the details though. Do you think something like that would be feasible?
Yep, I had quite a struggle getting Quartus to use memory blocks for my byte wide RAM banks.
But now it seems it uses twice as many memory blocks as I actually need. I'll have to play with this more.
I guess the sensible thing to do is to use the mega wizard to generate the RAM I want.
You can use the wizard. It will generate a file that you instantiate and connect to. I resorted to that, eventually, because Quartus' ability to infer dual-port RAMs from Verilog is pretty limited. Better to just have the wizard generate the file and use IFDEFs to pick from among different vendors' memory interfaces.
[...One thing I asked for a while back is some hardware assist in expanding compressed CMM instructions into full 32 bit P2 instructions. That would create something like the ARM Thumb mode that should produce better code density. I never really worked out the details though. Do you think something like that would be feasible?
I thought about this for a few minutes and the only way this would work nicely is to have some kind of translation scheme. I wish it was as simple as limiting the number of operand bits to three per D and S and getting rid of flag-write control and conditional execution, but then there are the special registers and lots of other little issues.
What about if there was a way to say that the next 32 bit value was to be executed as a full 32 bit P2 instruction. Then you wouldn't have to cram everything into the smaller instruction format. I'll have to look at how CMM on the P1 works to see if something like that would be possible.
I'm curious about Spin, it's faster than riscvemu and zog for fibo() but slower than both for xxtea.
I think spin has very fast function calls, which is basically all that fibo tests, but for computations it's not quite so speedy. I ran fftbench through riscvemu and it looks pretty similar to xxtea in terms of how things stack up:
gcc CMM -Os: 567318 us
riscvemu: 637602 us
zog 1.6: 1376254 us
spin interpreter: 1465321 us
I think for computation heavy workloads register based vms have better performance than stack based. Of course, the stack based ones (generally) come in with smaller code.
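To put the fftbench numbers above on one scale, here is a small Python sketch that normalizes each time to the gcc CMM result. The timings are the ones quoted in the post; the dictionary and variable names are just for illustration.

```python
results_us = {
    "gcc CMM -Os": 567318,
    "riscvemu": 637602,
    "zog 1.6": 1376254,
    "spin interpreter": 1465321,
}

baseline = results_us["gcc CMM -Os"]
# Ratio of each run time to the gcc CMM run time, rounded to two decimals.
relative = {name: round(t / baseline, 2) for name, t in results_us.items()}

for name, ratio in relative.items():
    print(f"{name:18s} {ratio:.2f}x")
```

On these figures riscvemu is about 12% slower than gcc CMM, while the two stack-based interpreters take roughly 2.4 to 2.6 times as long.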
[...One thing I asked for a while back is some hardware assist in expanding compressed CMM instructions into full 32 bit P2 instructions. That would create something like the ARM Thumb mode that should produce better code density. I never really worked out the details though. Do you think something like that would be feasible?
I thought about this for a few minutes and the only way this would work nicely is to have some kind of translation scheme. I wish it was as simple as limiting the number of operand bits to three per D and S and getting rid of flag-write control and conditional execution, but then there are the special registers and lots of other little issues.
I think I'd do something like 4 bits opcode, 5 bits D, 5 bits S for the common math cases, which would allow 16 user registers and access to the 16 special registers (if the user registers went at the end of memory this would be dead easy). For branches we'd want something like 4 bits condition, 10 bit offset. We'd also need, as David suggested, an escape to allow full 32 bit instructions to be executed.
RISC-V handles this nicely by having 2 bits in the opcode that are always 11 for 32 bit instructions and are anything else for compressed 16 bit opcodes (and those 2 bits are in the first byte/word (little endian) so they're easy to pick up). I'm not sure how we could handle this in P2, there really aren't any free bits in the instruction encoding. One way would be David's suggestion that some special prefix could indicate that the next 32 bits are a normal P2 instruction, so the compressed instructions would be 16 bits and "normal" ones would become 48 bits in compressed mode (or 40 bits if we could reduce the prefix to just 1 byte, although things are then only byte aligned).
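The RISC-V length check described above fits in one line. This is a Python sketch covering only the 16-bit and 32-bit cases from the post; the function name is made up, and the longer instruction formats that the RISC-V spec reserves are deliberately ignored.

```python
def rv_instruction_length(first_halfword):
    """Return the instruction length in bytes, given its first 16 bits.

    Only the 16-bit and 32-bit cases are handled; longer encodings
    reserved by the RISC-V spec are not modelled.
    """
    return 4 if (first_halfword & 0b11) == 0b11 else 2


# 0x00000013 is the standard NOP (addi x0, x0, 0); 0x0001 is the compressed NOP.
lengths = [rv_instruction_length(0x0013), rv_instruction_length(0x0001)]
print(lengths)
```

Because the two bits sit in the first halfword fetched, an emulator can decide how much more to fetch before decoding anything else.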
I guess we could encode compressed instructions like:
00 = compressed math instruction (4 bit operation, 5 bit each src and dest; the 4 bit operation is used in a lookup for the most common 16 combinations of opcode/immediate/condition code bits)
01 = compressed branch (4 bit condition, 10 bit signed offset)
10 = partially compressed instruction (next 14 bits are combined with following 16 bits to form 30 bits of a full PASM instruction; the other 2 bits are defaulted somehow, e.g. we have a default set of condition codes)
11 = uncompressed prefix; next 6 bits are ignored, and following 32 bits are a full regular instruction
All this would only take place in hubexec mode and only when some "mode" bit has been set in a register.
But this is probably way too complicated for this late stage, though. Let's get the P2 out the door...
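To make the proposed 00/01/10/11 scheme above a bit more concrete, here is a Python sketch of a decoder for it. Nothing like this exists in the P2. The bit layout is entirely an assumption for illustration: the tag is taken from the top two bits of the first byte, fields follow in the order listed in the post, and bytes are read most significant first purely for readability.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode(stream, pos=0):
    """Decode one instruction at stream[pos]; return (description, next_pos)."""
    tag = stream[pos] >> 6
    if tag == 0b11:
        # One prefix byte (low 6 bits ignored), then a full 32-bit instruction.
        full = int.from_bytes(stream[pos + 1:pos + 5], "big")
        return ("full", full), pos + 5
    half = int.from_bytes(stream[pos:pos + 2], "big")
    body = half & 0x3FFF                  # the 14 bits after the tag
    if tag == 0b00:
        # 4-bit operation, 5-bit destination, 5-bit source.
        return ("math", body >> 10, (body >> 5) & 0x1F, body & 0x1F), pos + 2
    if tag == 0b01:
        # 4-bit condition, 10-bit signed offset.
        return ("branch", body >> 10, sign_extend(body, 10)), pos + 2
    # tag == 0b10: these 14 bits plus the following 16 bits give 30 bits.
    low = int.from_bytes(stream[pos + 2:pos + 4], "big")
    return ("partial", (body << 16) | low), pos + 4


# math op=3 dst=17 src=2, branch cond=15 offset=-4, then a prefixed full instruction.
stream = bytes([0x0E, 0x22, 0x7F, 0xFC, 0xC0, 0xDE, 0xAD, 0xBE, 0xEF])
decoded = []
pos = 0
while pos < len(stream):
    item, pos = decode(stream, pos)
    decoded.append(item)
print(decoded)
```

With a one-byte prefix, a full instruction costs 40 bits, which matches the byte-aligned variant mentioned in the post.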
So, ignoring fibo(), which as Eric says is only really testing function call overhead, we see that Zog is about as fast as Spin and about the same size code for doing the things we might normally do on an MCU:
Yes, you might say the FFT is an extreme mathematical thing, but it is not really, just loops and logical operations, like normal code.
From some perspective this makes Spin on the P1 redundant. Which I find a bit surprising.
Of course Spin is not C. Which I'm sure is easier for beginners to get along with. And Spin has great ease of use advantages when mixing PASM into a project and running multiple COGS.
After some hours of fighting with the MegaWizard, nothing works.
Seems so simple. Perhaps the byte enable signals I ordered are not the right endian or some such. I don't know. It's tedious fighting with Quartus.
Might help if I read the docs more closely.
Gone back to my old scheme where Quartus infers the memory and it all works under simulation as well. At a cost of losing half the available memory.
Chip,
After some hours of fighting with the MegaWizard, nothing works.
When you're in the mood, perhaps you can get the simulation model for the generated RAM. Not sure if it's going to work with Icarus, but perhaps it's a path forward. If Icarus won't deal with it, there's ModelSim. Or if you post more details here Chip might see something since he's used the tool before.
Incidentally, for those wishing to play with RISC-V on a Prop1, I've written an emulator:
Did you start writing that after Heater started this thread? If so you're pretty fast. It's always amazing what can fit into a P1 cog's code space.
Would you call this rv32im without the counter instructions (rdcycle[h],rdtime[h], rdinstret[h])?
I'm curious about how you build the programs with /opt/riscv/bin/riscv32-unknown-elf-gcc and use -march=rv32im to get the right configuration. Some of the RISC V builds create four separate sets of tool directories (i, im, ic, imc) which seems non-optimal. Do you know why they do that?
Also do you manually extract the code sizes that you quoted from the .lst files? I wasn't getting exact matches.
Nah, I'm going to move on. Plenty more to do here. I'll revisit the memory thing at some point. I'm disinclined to fight with closed source solutions. When I do I think I have to start with a minimalist exercise of the MegaWizard generated memory, no CPU etc, so that I see how it works.
Eric is a turbo nerd you know. It's only last week or so he made a JIT version of Zog!
He's definitely a very smart and productive (and nice) guy!
Incidentally, for those wishing to play with RISC-V on a Prop1, I've written an emulator:
Did you start writing that after Heater started this thread? If so you're pretty fast. It's always amazing what can fit into a P1 cog's code space.
Yes. I'd already been working in the internals of an interpreter (a JIT ZPU interpreter, based on Heater's very nicely documented ZOG) so I had a fresh understanding of interpreting on the COG. It is sometimes amazing what will fit, it's true!
Would you call this rv32im without the counter instructions (rdcycle[h],rdtime[h], rdinstret[h])?
Pretty much. mulh and mulhsu aren't implemented yet but I haven't missed them so far (I will if I try to use long long, I'm sure!)
I'm curious about how you build the programs with /opt/riscv/bin/riscv32-unknown-elf-gcc and use -march=rv32im to get the right configuration. Some of the RISC V builds create four separate sets of tool directories (i, im, ic, imc) which seems non-optimal. Do you know why they do that?
I suspect they do that to make it easier to find the right libraries, which isn't an issue for my test programs (they use no libraries). I'm using the SiFive toolchain, which defaults to using rvc instructions, so I had to use the -march option to avoid this. I'm not quite sure how SiFive's tools differ from the standard ones, but there may be some other issues.
Also do you manually extract the code sizes that you quoted from the .lst files? I wasn't getting exact matches.
I used "nm -n" on the ELF executables and then subtracted the next global from the one I was interested in. It's certainly possible that I messed up somewhere in that, and/or that our tools produce slightly different results. For that matter, different versions of PropGCC will also give slightly different results.
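The "nm -n" subtraction described above can be automated along these lines. This is a Python sketch, not the script actually used: the function name and the sample listing are made up, and the addresses are not from the real benchmarks. It assumes the usual three-column `nm` output (address, type letter, name) and treats upper-case type letters as globals.

```python
def symbol_sizes(nm_output):
    """Estimate each global symbol's size as the gap to the next global."""
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Expect: address, type letter, name. Upper-case type means global.
        if len(parts) == 3 and parts[1].isupper():
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()                      # nm -n already sorts; this is a safeguard
    sizes = {}
    for (addr, name), (next_addr, _) in zip(symbols, symbols[1:]):
        sizes[name] = next_addr - addr
    return sizes


# Made-up listing in the style of `nm -n`; not taken from the real benchmarks.
sample = """\
00002000 T _start
00002040 T fibo
00002090 t local_helper
00002098 T main
00002120 T _etext
"""
sizes = symbol_sizes(sample)
print(sizes)
```

Note that a local symbol between two globals is counted as part of the earlier global, which is one way two people could end up with slightly different numbers for the same binary.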
Eric is a turbo nerd you know. It's only last week or so he made a JIT version of Zog!
He's definitely a very smart and productive (and nice) guy!
Aw, now I'm blushing. I think the community here really inspires all of us to do our best work. We certainly have a great forum and some interesting discussions! This site may be one of Parallax's most valuable resources (next to Chip's brain, of course!)
Take the glory whilst you can Eric. You deserve it.
It certainly is amazing what goes on around here. From C compilers, to BASIC, to Forth, to weird interpreters/emulators, not to mention all the hardware things.
I think we all take inspiration from Chip of course.
Let's not forget the genius of Ken, who keeps the lights on around here!
Now, if you could see your way to getting the riscv emulator working on the P2 that would answer my opening post here. We don't need a riscv core, we can do it in a COG!
Ah yes, XBYTE will be useless for this. It would be interesting to have a RISC-V emulator for P2. I wonder how its performance would compare with a CMM implementation for P2. One thing I asked for a while back is some hardware assist in expanding compressed CMM instructions into full 32 bit P2 instructions. That would create something like the ARM Thumb mode that should produce better code density. I never really worked out the details though. Do you think something like that would be feasible?
Even without XBYTE the extra code space LUT exec gives has to be worth some gains on P2.
XBYTE may allow smaller pieces of the instruction set to be packed a little better, giving more room.
If you want a 16b compressed variant, that could be done as 8b or smaller operator, and 8b of param.
Split like that, I think XBYTE can help?
Expansion tag for Full 32b, or 24b, could be useful.
Yes but remember that this is RISC-V. We aren't inventing our own instruction set here. Certainly XBYTE will have big advantages for custom byte code instruction sets.
Yes, but all Opcode sets I've looked at do cluster their opcodes - they are not random in their encoding.
I guess so but then you'd be fetching the instruction one byte at a time. It looks like RISC-V instructions are multiples of 16 bits so it might still be faster to fetch 16 bits at a time. Maybe there could be an XWORD instruction that picks up 16 bits and dispatches off of the high or low bits.
While trying to do an audio bootloader, I decided to write a UART receiver.
I may make a small board to plug into the DE10-Lite 40-pin connector, without flying cables, because my setup didn't work: I got the input to oscillate... too unstable.
I had never written a UART receiver before, so I decided to write one. This code works, at least in the simulator; the real thing does not seem to work on the FPGA. So, on to Signal Tap (Heater: if you have a free RAM block, I'd recommend you explore the embedded logic analyzer; it is really easy to use, and an eye opener!).
The bad thing about the UART is that the USB-Blaster on this board stops being recognized when another FTDI-chip enabled dongle is also plugged in. My propplug and the intronix logic port must be unplugged when I want to program the FPGA.
I'll see what I can do for the RAM.
Anecdote: I decided to restructure the project and split the sources into several directories, and Quartus would crash saying something about some suppressed messages. I re-did the project and removed the files one after the other. At some point I thought that maybe it didn't find the block RAM files (firmwarex.hex)... It didn't find them, and it didn't complain either; it just crashed...
Here is the code:
Note: the bit clock is only started when the starting edge is detected (bad for Signal Tap). I'll change it to a normal always-on clock; I thought it was a neat idea... It is based on Heater's transmitter.
I removed the tri-stated bus when looking for more than 50 MHz clock on the Max10. I had to settle for a 50 MHz clock anyways.
/**
* Uart receiver
*
*
* Memory Map
*
* XXXX_XX40 : Uart Transmitter
* XXXX_XX50 : Uart receiver
*
* XXXX_XX50 : received byte
*
* Write
* 31 16
* +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
* | | | | | | | | | | | | | | | | |
* +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
*
*
* 15 0
* +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
* | | | | | | | | | | | | | | | | |
* +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
*
*
* Read
* 31 16
* +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
* | | | | | | | | | | | | | | | | |
* +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
*
*
* 15 8 7 0
* +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
* | | | | | | | | E | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
* +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
*
*/
//`include "timescale.vh"

module uart_rx #(
    parameter [31:0] BAUD_DIVIDER = 434   // 50 MHz / 115200 baud (868 for a 100 MHz clock)
) (
    // Bus interface
    input  wire        clk,
    input  wire        resetn,
    input  wire        enable,
    input  wire        mem_valid,
    output wire        mem_ready,
    input  wire        mem_instr,
    input  wire [3:0]  mem_wstrb,
    input  wire [31:0] mem_wdata,
    input  wire [31:0] mem_addr,
    output wire [31:0] mem_rdata,
    output wire        bit_clock,

    // Serial interface
    input  wire        serial_in          // The serial input.
);

    // Internal Variables
    reg [7:0]  shifter;
    reg [7:0]  buffer;
    reg [7:0]  state;
    reg [3:0]  bitCount;
    reg [19:0] bitTimer;
    reg        bufferEmpty;               // TRUE when no received character is waiting.
    reg        rdy;
    reg        old_serial_in;
    reg        old2_serial_in;
    reg        started;

    assign bit_clock = bitTimer < (BAUD_DIVIDER / 2) ? 1'b1 : 1'b0;

    // Sample point: three quarters of the way through each bit period.
    `define BAUD_DIVIDER_34 ((BAUD_DIVIDER * 3) / 4)

    // UART RX Logic
    always @ (posedge clk or negedge resetn) begin
        if (!resetn) begin
            state       <= 0;
            buffer      <= 0;
            bufferEmpty <= 1'b1;          // empty
            shifter     <= 0;
            bitCount    <= 0;
            bitTimer    <= 0;
            rdy         <= 0;
            started     <= 0;
        end else begin
            // Bus access: a read (no write strobes) marks the buffer as empty again.
            if (mem_valid & enable) begin
                if ((|mem_wstrb == 1'b0) && (bufferEmpty == 1'b0)) begin
                    bufferEmpty <= 1'b1;
                end
                rdy <= 1;
            end else begin
                rdy <= 0;
            end

            // Two-stage history of the serial input, used for edge detection.
            old_serial_in  <= serial_in;
            old2_serial_in <= old_serial_in;

            if ((old2_serial_in == 1'b1) && (old_serial_in == 1'b0)) // start condition
            begin
                if (started == 0)
                begin
                    bitTimer <= 0;        // start timer
                    state    <= 2'h0;
                    started  <= 1'b1;
                end
            end

            // Generate bit clock timer for 115200 baud from 50MHz clock
            if (bitTimer == BAUD_DIVIDER) begin
                bitTimer <= 0;
            end

            if (started)
            begin
                bitTimer <= bitTimer + 1;
                case (bitTimer)
                    20'd0:
                        case (state)
                            // Start of frame: set up the bit counter
                            0 : begin
                                //if (bufferEmpty == 1) begin
                                shifter  <= buffer;   // left over from the transmitter
                                bitCount <= 8;
                                // Start bit
                                state    <= 1;
                                //end
                            end
                            // Receiving
                            1 : begin
                                if (bitCount > 0) begin
                                    // Data bits
                                    bitCount <= bitCount - 1;
                                end else begin
                                    // Stop bit
                                    state <= 2;
                                end
                            end
                            // End of frame: latch the received byte
                            2 : begin
                                state       <= 0;
                                buffer      <= shifter;
                                bufferEmpty <= 1'b0;  // full
                                started     <= 1'b0;
                            end
                            default : ;
                        endcase
                    `BAUD_DIVIDER_34:
                        // Sample the line and shift the new bit in at the top.
                        if (state == 2'd1)
                        begin
                            shifter <= { old_serial_in, shifter[7:1] };
                        end
                    BAUD_DIVIDER:
                        bitTimer <= 0;
                endcase
            end
        end
    end

    // Drive the bus outputs (zero when not enabled; the tri-stated bus was removed).
    assign mem_rdata = enable ? { bufferEmpty, buffer } : 'b0;
    assign mem_ready = rdy;

endmodule
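A couple of the numbers and one piece of the logic in the module above can be checked quickly in software. The Python below is only a sketch with made-up function names. It assumes, on my reading of the state machine (which may be wrong), that the receiver takes nine samples per frame: the start bit first, then the eight data bits, least significant first, each shifted in at the top of the 8-bit shifter.

```python
def baud_divider(clock_hz, baud):
    """Clock cycles per bit, rounded to the nearest integer."""
    return round(clock_hz / baud)


def receive(samples):
    """Shift samples in at the MSB, as `{ old_serial_in, shifter[7:1] }` does."""
    shifter = 0
    for bit in samples:
        shifter = ((bit & 1) << 7) | (shifter >> 1)
    return shifter


def frame_samples(byte):
    """Start bit followed by eight data bits, least significant bit first."""
    return [0] + [(byte >> i) & 1 for i in range(8)]


divider = baud_divider(50_000_000, 115200)   # BAUD_DIVIDER for a 50 MHz clock
sample_point = (divider * 3) // 4            # same formula as BAUD_DIVIDER_34
received = receive(frame_samples(0x41))      # 'A'
print(divider, sample_point, hex(received))
```

Under that assumption the sampled start bit is pushed out of the shifter by the eighth data bit, so the byte that lands in `buffer` is the data alone.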
Comments
There is something weird going on here.
I changed the byte writes as per your suggestion to add the shifts. This does actually work, despite worries about read-modify-write cycles.
Here we see individual bytes of an integer being incremented by 1, 2, 3, 4 respectively, independently. I checked the generated code; it is actually doing loads and stores on bytes.
However, the strange part is that Quartus is using twice as many memory bits to do this. I had to reduce my RAM size from 16K words to 9K words to make it fit, which uses the same 87% of memory bits.
Grrr...
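For reference, the byte-increment test described above behaves like this in software. It is a Python sketch, not the actual firmware: the function name is made up, and it assumes the usual little-endian RISC-V byte order, so byte 0 is the low byte of the integer.

```python
import struct


def increment_bytes(word, steps=(1, 2, 3, 4)):
    """Add steps[i] to byte i of a 32-bit word, with no carry between bytes."""
    lanes = bytearray(struct.pack("<I", word))     # little-endian byte lanes
    for i, step in enumerate(steps):
        lanes[i] = (lanes[i] + step) & 0xFF        # each byte wraps on its own
    return struct.unpack("<I", bytes(lanes))[0]


value = 0
history = []
for _ in range(3):
    value = increment_bytes(value)
    history.append(f"{value:08X}")
print(history)
```

If the byte stores were leaking into neighbouring lanes, the pattern would not advance by 1, 2, 3 and 4 in each byte on every pass.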
Incidentally, for those wishing to play with RISC-V on a Prop1, I've written an emulator:
https://github.com/totalspectrum/riscvemu.git
Edited to add: https://www.altera.com/en_US/pdfs/literature/hb/qts/qts-qps-handbook.pdf see chapter 12
Wow, you beat me to it. A riscv emulator for the Prop 1/2 was on the end of my TODO list!
riscv has Zog on the run, for speed anyway.
Gone back to my old scheme where Quartus infers the memory and it all works under simulation as well. At a cost of losing half the available memory.
Will revisit the issue later.
Edited to add: did you look in Chapter 12 of https://www.altera.com/en_US/pdfs/literature/hb/qts/qts-qps-handbook.pdf?