Seriously, is this the speed? — Parallax Forums

# Seriously, is this the speed?

Posts: 16

while(1)
{
high(23);
low(23);
}

This gives me a 31.6kHz output. How can this be so? Even with a rubbish compiler, I'd expect a bit better, and the default clock speed is supposed to be 80MHz. I'm missing the point of something here!

• Posts: 2,936
edited 2021-07-06 13:52

@AmbientPower, It's not the compiler, it's the functions being used. Both the "high()" and "low()" functions from the Simple library are very slow - they do a lot of stuff that isn't strictly necessary, but make them extra foolproof.

```
const uint32_t PIN_23 = 1 << (23 - 1);

DIRA |= PIN_23;
while (1) {
    OUTA |= PIN_23;
    OUTA &= !PIN_23;
}
```

or even faster:

```
const uint32_t PIN_23 = 1 << (23 - 1);

DIRA |= PIN_23;
while (1) {
    OUTA ^= PIN_23;
}
```

I don't remember if these get compiled into a single PASM instruction or not. If not, you can force it like so:

```
const uint32_t PIN_23 = 1 << (23 - 1);

DIRA |= PIN_23;
while (1) {
    __asm__ volatile ("xor outa, %0" : : "r" (PIN_23));
}
```

All of the above was taken from https://github.com/parallaxinc/PropWare/blob/develop/PropWare/gpio/port.h

• Posts: 16

@DavidZemon said:
@AmbientPower, It's not the compiler, it's the functions being used. Both the "high()" and "low()" functions from the Simple library are very slow - they do a lot of stuff that isn't strictly necessary, but make them extra foolproof.

Thanks. That makes sense, although it's still not stellar.

The first example was 104kHz, the second 83.3kHz (slower because of the loop, I'm sure), and the last 167kHz.

The -1 is incorrect in the bit position algorithm by the way!

Thanks for that. I'm still testing, but I think I'm going to do better with an ARM micro or ESP32. The parallel cores were what attracted me, but I think even with the overhead of timebase processing, I can still run faster in a traditional device. With any peripherals having to be bit bashed, it means that I2C or SPI is going to be much slower than on a device with on-chip peripherals.

• Posts: 2,936

That is sometimes true, for sure. With an 80 MHz clock, and needing to bit bash all of the serial protocols, you do have a limited transfer speed.

But, it's surprising what you can get with some clever inline assembly. We can reach up to 4 MHz SPI clocks and... I don't remember how high for I2C, but I'm pretty sure 1 MHz was easily within reach.

I ran some simple benchmarks for SD card read-write, focused entirely on comparing PropWare against the Simple libraries. I didn't record the raw read and write speeds, but in 3.978 seconds PropWare was able to mount the FAT filesystem, open two files (one for read, one for write) and then copy a 25.9 KB file over character-by-character. That amounts to an average copy speed of 52 Kbps, or an average communication speed of 104 Kbps. And all of that runs in just one cog, leaving another 7 cogs to do whatever else is needed, uninterrupted by the SPI/SD card comms.

UART performance testing showed that PropWare can handle up to 2.680 Mbps sustained throughput. Again, this runs entirely in one cog, allowing the other 7 to do whatever else you want.

You might want to look at PropWare's "serial" folder, which holds efficient bit bashing routines for UART, SPI, and I2C: https://github.com/parallaxinc/PropWare/tree/develop/PropWare/serial

SPI routine for sending an arbitrary block of 8-bit words at a 4 MHz clock with a sustained write of 3.33 Mbps

UART routine for sending an arbitrary block of 8-bit words at 4.444 Mbaud with sustained throughput of 2.680 Mbps

• Posts: 1,786

But, it's surprising what you can get with some clever inline assembly. We can reach up to 4 MHz SPI clocks and... I don't remember how high for I2C, but I'm pretty sure 1 MHz was easily within reach.

20 MHz (minus small pause every 32 bits) is the max that's theoretically possible. Not sure if viable in inline ASM though (does GCC do fcached inline ASM?).

• Posts: 16

@DavidZemon said:
That is sometimes true, for sure. With an 80 MHz clock, and needing to bit bash all of the serial protocols, you do have a limited transfer speed.

But, it's surprising what you can get with some clever inline assembly. We can reach up to 4 MHz SPI clocks and... I don't remember how high for I2C, but I'm pretty sure 1 MHz was easily within reach.

Which is why I'm a bit confused about my toggling the I/O bit using inline assembler only giving 167kHz. Doing SPI, you have the clock to do and stuff in-between!

• Posts: 2,936

@Wuerfel_21 said:

But, it's surprising what you can get with some clever inline assembly. We can reach up to 4 MHz SPI clocks and... I don't remember how high for I2C, but I'm pretty sure 1 MHz was easily within reach.

20 MHz (minus small pause every 32 bits) is the max that's theoretically possible. Not sure if viable in inline ASM though (does GCC do fcached inline ASM?).

Yea I'd swear I had an example of 20 MHz SPI somewhere, but couldn't find it for the life of me when I was searching through PropWare's code lol. GCC can do fcache + inline ASM, and that's exactly how the majority of PropWare's serial routines are written.

@AmbientPower said:
Which is why I'm a bit confused about my toggling the I/O bit using inline assembler only giving 167kHz. Doing SPI, you have the clock to do and stuff in-between!

Looks like you'll need fcache + inline assembly to get any faster. Are you running CMM or LMM? Switching to LMM would probably give you a huge speed boost, but for something like this, CMM + fcache + inline assembly is a fantastic combination (majority of the code is not speed critical, but small tight functions need to be highly optimized).

• Posts: 235
edited 2021-07-06 18:44

In Tachyon forth, I tested a 1,000,000 loop that raises and lowers pin 10:-
`: TEST LAP 1000000 0 DO 10 HIGH 10 LOW LOOP LAP .LAP ;`
This executed at an 80 MHz clock in 5.4 s, so the period of the pin 10 signal was about 5.4 µs, i.e. 185.2 kHz.

• Posts: 3,419
edited 2021-07-06 20:13

Are you really running at 80 MHz, or are you maybe running in RCFAST/RCSLOW without a crystal?

Your results are in kHz and should be in MHz; something in your setup is wrong.

Did you set the clock frequency to 80 MHz?

Can you maybe create a listing file of the generated assembly?

Mike

• Posts: 14,756
edited 2021-07-06 21:31

@DavidZemon said:

@Wuerfel_21 said:

But, it's surprising what you can get with some clever inline assembly. We can reach up to 4 MHz SPI clocks and... I don't remember how high for I2C, but I'm pretty sure 1 MHz was easily within reach.

20 MHz (minus small pause every 32 bits) is the max that's theoretically possible. Not sure if viable in inline ASM though (does GCC do fcached inline ASM?).

Yea I'd swear I had an example of 20 MHz SPI somewhere, but couldn't find it for the life of me when I was searching through PropWare's code lol.

There are some notes in https://forums.parallax.com/discussion/173495/putty-prop-plug-what-is-the-highest-baud-rate-feasible
Peak speeds of 20 MHz are possible with peripheral hardware assist, as said in #6.
The next steps down are an added NOP for SysCLK/8, or SysCLK/10, which allows WAIT to be used for granular control and can also support fractional baud.

For more sustained burst slave UART use, COG local, I think something like 12M.8.M.2 allows 1 byte per microsecond, and gives more stop bit time to store the byte, and re-sync on the next start edge. Both ends will need to agree on how many bytes per burst.

I'm not sure what that drops to if the buffer is moved into HUB, but it will slow down from COG buffer speeds.

Code snippet for P1 20MHz SPI Master read of SD card is here, which uses a spare pin for CTR adder gate.
https://forums.parallax.com/discussion/comment/1466234/#Comment_1466234 and also post #7 describes how that works.
https://forums.parallax.com/discussion/comment/1482752/#Comment_1482752 has both read and write code, 20MHz SPI master

• Posts: 16

@msrobots said:
Are you really running at 80 MHz, or are you maybe running in RCFAST/RCSLOW without a crystal?

Your results are in kHz and should be in MHz; something in your setup is wrong.

Did you set the clock frequency to 80 MHz?

Can you maybe create a listing file of the generated assembly?

Mike

The documentation says that the default is 80MHz, and I'm using the Quickstart profile to upload.

• Posts: 16

@DavidZemon said:

Looks like you'll need fcache + inline assembly to get any faster. Are you running CMM or LMM? Switching to LMM would probably give you a huge speed boost, but for something like this, CMM + fcache + inline assembly is a fantastic combination (majority of the code is not speed critical, but small tight functions need to be highly optimized).

It is in CMM at the moment. I'll re-test in LMM, thanks.

• Posts: 2,424

Hi @AmbientPower

Welcome to the forums!

Sorry if you already covered this.... but would you like to post the entire code file you are testing with? Just in case there's a config / clock option (or such like) out of place, one of us could quickly sanity check the code to make sure you're not missing out on something important.

• Posts: 16
edited 2021-07-07 10:05

@VonSzarvas said:
Hi @AmbientPower

Welcome to the forums!

Sorry if you already covered this.... but would you like to post the entire code file you are testing with? Just in case there's a config / clock option (or such like) out of place, one of us could quickly sanity check the code to make sure you're not missing out on something important.

```
//const unsigned long int PIN_23 = 1 << 23;
#define PIN_20 1<<20
#define PIN_21 1<<21
#define PIN_22 1<<22
#define PIN_23 1<<23

const unsigned char IO[]={16,17,18,19,20,21,22,23};

#include "simpletools.h"                      // Include simple tools

void blink(unsigned char IOLine,unsigned int del)
{
while(1)
{
high(IO[IOLine]);                           // P26 LED on
pause(del);
low(IO[IOLine]);
pause(del);
}
}

{
}

{
}

{
}
{
}

{
while(1)
{
high(20);   //31.6kHz
low(20);
}
}

{
DIRA |= PIN_21;
while(1)
{
OUTA |= PIN_21;   //104kHz
OUTA &= !PIN_21;
}
}

{
DIRA |= PIN_22;
while(1)
OUTA ^= PIN_22;   //83.3kHz
}

{
DIRA |= PIN_23;
while(1)
__asm__ volatile ("xor outa, %0" : : "r" (PIN_23)); //167kHz
}

int main()                                    // Main function
{
//print("Hello!");                            // Display test message

}
```
• Posts: 2,424

Looks good. That's the .c file. Could you post the .side file too, as that includes the compiler options.

ps. I've added the code display tags in your above post. Three backticks on the lines above and below the code.

• Posts: 16
edited 2021-07-07 11:58

@VonSzarvas said:
Looks good. That's the .c file. Could you post the .side file too, as that includes the compiler options.

ps. I've added the code display tags in your above post. Three backticks on the lines above and below the code.

IOtest.c

compiler=C
memtype=cmm main ram compact
optimize=-Os
-m32bit-doubles
-fno-exceptions
defs::-std=c99
-lm
BOARD::QUICKSTART

• Posts: 6,322

There are a couple of problems with the line "OUTA &= !PIN_21;". In C, the "!" operator is the logical complement. The operator that you really should use is "~", which is the bitwise complement. Also, the symbol PIN_21 is defined as "#define PIN_21 1<<21". The pre-processor will convert "OUTA &= ~PIN_21;" to "OUTA &= ~1<<21;". The "~" operator has higher precedence than the "<<" operator, so this will result in an unintended value. When doing a #define like this it is best to put parentheses around the value, such as "#define PIN_21 (1<<21)". The pre-processor will then generate "OUTA &= ~(1<<21);".

• Posts: 2,936

@"Dave Hein" said:
There are a couple of problems with the line "OUTA &= !PIN_21;". In C, the "!" operator is the logical complement. The operator that you really should use is "~", which is the bitwise complement.

Doh! Good catch. Been a while since I've written any embedded

• Posts: 1,235
edited 2021-07-07 15:20

@bob_g4bby said:
In Tachyon forth, I tested a 1,000,000 loop that raises and lowers pin 10:-
`: TEST LAP 1000000 0 DO 10 HIGH 10 LOW LOOP LAP .LAP ;`
This executed at an 80 MHz clock in 5.4 s, so the period of the pin 10 signal was about 5.4 µs, i.e. 185.2 kHz.

1 MASK 1000000 0 LAP DO T LOOP LAP .LAP --> 80,000,272 cycles = 1.000sec ok

= 500kHz

and using the single character
H - high
L - low
T - toggle
F - float
P - pulse (HL)
words

1 million pulses HL:

1 MASK 1000000 0 LAP DO P LOOP LAP .LAP --> 96,000,272 cycles = 1.200sec ok

= 600 kHz but with different pulse / pause widths

don't have the scope attached

with special SPI words sending 32 bits at 3.7 - almost 4MHz
SPIWR32 ( long -- long ) send 32-bits 8.6us

• Posts: 2,936

If you want to know how to write something in Forth, all you gotta do is write it in C and complain that it isn't ___ enough

• Posts: 235
edited 2021-07-07 19:34

Well it's just a bit of fun to compare execution rates, programming in forth isn't compulsory - yet

• Posts: 235
edited 2021-07-07 19:32

deleted

• Posts: 6,322

The Forth numbers are interesting, but they don't really apply to the original post since it refers to the P1. Forth numbers might be useful if they were for the P1. It might also be useful to know the toggle speed in PASM, which I think would be 5 MHz. You could even toggle the pin using a counter. What's the highest frequency using a counter on the P1?

• Posts: 1,786

What's the highest frequency using a counter on the P1?

40 MHz in NCO mode, 128 MHz in PLL mode

• Posts: 1,235

@"Dave Hein" said:
The Forth numbers are interesting, but they don't really apply to the original post since it refers to the P1. Forth numbers might be useful if they were for the P1. It might also be useful to know the toggle speed in PASM, which I think would be 5 MHz. You could even toggle the pin using a counter. What's the highest frequency using a counter on the P1?

my numbers are for Tachyon 5.7 on the P1

• Posts: 6,322

One correction about PASM. The minimal toggle loop would take 12 cycles per loop. With an 80 MHz system clock the pin frequency would be 6.667 MHz.

• Posts: 14,756

@"Dave Hein" said:
One correction about PASM. The minimal toggle loop would take 12 cycles per loop. With an 80 MHz system clock the pin frequency would be 6.667 MHz.

I think you meant it toggles at 6.667 MHz (for a 3.333 MHz frequency).
It also depends on what sort of loop you create.
The manual shows a XOR+JMP which is 8 cycles per toggle, but has no exit means.
If you add a XOR + DJNZ for counted toggles this applies, still 8 cycles.
DJNZ requires a different amount of clock cycles depending on whether or not it has to jump. If it must jump it takes 4 clock cycles, if no jump occurs it takes 8 clock cycles. Since loops utilizing DJNZ need to be fast, it is optimized in this way for speed.
If you need to exit on a pin state from another COG, a 3rd test line is needed, for 12 sysclks per toggle.

Those are compile-time locked speeds. If you add WAITCNT for run-time control of the loop delay, that is then 4+(6+)+4, and the pin frequency can be any value 40 MHz/N, where N >= 14.

• Posts: 6,322

No, I meant what I said. The pin frequency is 6.667 MHz. The toggle rate would be 13.333 MHz. This is based on the 3-instruction loop:

loop
jmp #loop

Of course with this loop the duty cycle is not 50%. A 50% duty cycle would require an instruction between the XORs, which would make it a 16-cycle loop. In that case the pin frequency would be 5 MHz, and the toggle rate would be 10 MHz.

• Posts: 4,968

This program:

```
#include <propeller.h>

#define PIN 23

void main() {
for(;;) {
}
}
```

compiles to the following loop in FlexC:

```
_main
or  dira, imm_8388608_