I recommend that you simply rely on the fuses for authentication and don't change behavior based on the fuse values. For chips that aren't fused, you simply sign code with a 128-bit string of zeros; the boot methods can then remain static and not change depending on the fuse values.
This allows ISP by loading a bootloader into the chip via serial and reprogramming the flash, without having an OTA/download type bootloader (no external peripherals or network available).
The P2 must fail safe when it comes to code protection. The poly fuses are only so secure because they are on-die, but between the fuse location randomization and the buried poly layer, it *should* be reasonably safe.
The next step would be to add booby traps to the silicon that destroy the chip when it's decapped.
A developer can further protect their product from tampering by issuing a unique key to every chip, so that every device has a unique firmware hash. In that case it would be impossible to guess the key from another device, because every decapped chip would be different; an attacker would have to determine the random number generator parameters used to generate the keys, which is pretty darn hard.
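A minimal sketch of the per-chip key idea, assuming the 128-bit key is drawn from the operating system's CSPRNG and the firmware is signed with HMAC-SHA256. The name `provision_device` is hypothetical, and the real programming flow for the fuses is not defined anywhere in this thread.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

KEY_BYTES = 16  # 128-bit fuse key, as discussed above


def provision_device(firmware: bytes) -> Tuple[bytes, bytes]:
    """Return (fuse_key, signature) for a single device.

    Assumption: the key comes from the OS CSPRNG, so there are no
    generator parameters for an attacker to recover.
    """
    fuse_key = secrets.token_bytes(KEY_BYTES)
    signature = hmac.new(fuse_key, firmware, hashlib.sha256).digest()
    return fuse_key, signature


firmware = b"example firmware image"
key_a, sig_a = provision_device(firmware)
key_b, sig_b = provision_device(firmware)
# Same firmware, different chips: the keys and the signatures differ,
# so a key recovered from one decapped chip does not help with another.
```

The trade-off is bookkeeping: the developer has to keep each device's key in order to sign later updates for that device.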
Hi pedward,
Why add features that can backfire and blow up in one's face? It would be a liability for an OSS programmer who doesn't want security features. Plus, the P2 design is supposed to be open. What is the point of adding destructive traps?
Kind regards, Samuel Lourenço
I think there is a misunderstanding. The P2 may be developed in front of the world, but the intent is to produce a chip that will make money for Parallax. Code protection is necessary to appease the customers who want to lock people out of their designs. Having code protection with a zero key is a good compromise between security in the design and allowing people to tinker with the P2 without fear of forgetting a secret key.
Smartphone makers have proven time and again that no system, however securely it is designed, will remain secure. However, what they have also shown is that a bootloader signed in the way the P2 is intended to work is reasonably secure.
In summary this is how one could implement a secure bootloader:
The stage 1 bootloader is signed with an HMAC-derived SHA-256 hash. The bootloader checks a user region of fuses to see whether the value read matches what the bootloader was written for. If the fuses don't match, it looks for a chain loader in flash that is also signed to run on this chip, authenticates it, and executes it. The chain loader checks the user region of fuses to see whether they match; if they do, it checks whether it is a chain loader, and if it is, it replaces the primary bootloader with itself and resets.
Now that the bootloader upgrade has been executed (with recovery fallback support), it loads the program from flash, checks the signature, decrypts it, then runs it.
In theory you could skip the user program authentication and just start decryption. However, it is better to authenticate the program so that a corrupted flash chip or program is detected, instead of just assuming it is good and letting the chip run off into an unknown state when it executes corrupted data. Checking also prevents an attack by a third party that injects random data into the stream, hoping for a desired outcome, to break out of the user program into their own code.
To upgrade bootloaders (and user program) and prevent downgrade attacks, the upgrader needs to set a fuse bit and write a new bootloader and program to the flash. These can be in a different region than the currently running program (bank 1 vs bank 0, etc). This allows for recoverable field upgrades of programs. The bootloader is something that is written by the end user and can implement a retry mechanism, or just be very simple.
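The downgrade check described above could be sketched roughly as follows. Everything about the layout is an assumption made for illustration: each flash bank holds a hypothetical `(revision, payload, signature)` record, a blown user fuse reads as 1, and the number of blown revision fuses is the minimum acceptable revision. Decryption is left out.

```python
import hashlib
import hmac
from typing import Iterable, Optional, Tuple

ZERO_KEY = bytes(16)  # default 128-bit key for unfused chips

Image = Tuple[int, bytes, bytes]  # (revision, payload, signature), hypothetical


def fuse_revision(user_fuses: int) -> int:
    """Count blown bits. Fuses only go from 0 to 1, so this never decreases."""
    return bin(user_fuses).count("1")


def sign(key: bytes, revision: int, payload: bytes) -> bytes:
    """Sign the revision together with the payload so neither can be swapped."""
    header = revision.to_bytes(4, "little")
    return hmac.new(key, header + payload, hashlib.sha256).digest()


def select_image(banks: Iterable[Image], key: bytes, user_fuses: int) -> Optional[bytes]:
    """Return the first payload that authenticates and is not a downgrade."""
    required = fuse_revision(user_fuses)
    for revision, payload, signature in banks:
        if not hmac.compare_digest(sign(key, revision, payload), signature):
            continue  # corrupted, or not signed for this chip
        if revision < required:
            continue  # authentic, but older than the fuses allow
        return payload
    return None


banks = [
    (1, b"old program", sign(ZERO_KEY, 1, b"old program")),  # bank 0
    (2, b"new program", sign(ZERO_KEY, 2, b"new program")),  # bank 1
]
before_upgrade = select_image(banks, ZERO_KEY, user_fuses=0b01)
after_upgrade = select_image(banks, ZERO_KEY, user_fuses=0b11)
```

Before the extra fuse bit is set, bank 0 still boots, which is the recovery fallback. Once the bit is blown, only the newer image is accepted.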
The last time Chip spoke about the fuses, there were 171 available, giving 128 available for a key, and 43 for the developer (potentially 32 for user revision bits, and 11 for configuration bits). I don't know if the new process at OnSemi will allow for more or fewer fuses, or if they have a standard IP block to implement that.
I've got the Prop2 ROM code all optimized now.
..
I may add a text-only loader, in case serial and flash fail to load and/or authenticate. For now, this program takes $12E longs out of a possible $1F8. That's 60% full.
Looks good. Any idea what size an SD or even USB loader would need?
No. SD is possibly simple if we look for a starting string.
Why a starting string? If you adopt some rules, it is simpler still:
1) The SD needs MBR
2) The boot partition is always number 1, without any file system
3) The first bytes of partition 1 indicate the number of dwords to load and the execution type (cog/hub)
With these rules you simply jump into the partition table of the MBR at sector 0. Retrieve the absolute starting sector of partition 1 (3 bytes at fixed offsets in sector 0: 447, 448, 449; this is the CHS form, and the 4-byte LBA form of the same entry sits at offsets 454 to 457) and start reading the data sequentially. No file system, just a plain image.
Optionally, for increased security, the partition type of partition 1 (byte at offset 450) can be read first and checked for the right type (e.g. A2h, 12h, 27h, ...), and/or the boot signature (2 bytes at offsets 510 and 511) can be verified.
If the application needs storage, it will be in partition 2 under a file system, and thus also readable on any PC. With the right partition 1 type, the boot image will be safe from possible corruption when the SD is inserted into PCs.
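The MBR lookup above might look roughly like this. It is a sketch: it reads the 4-byte LBA start field (offsets 454 to 457) rather than the 3-byte CHS field the post mentions, and the choice of `0xA2` as the boot partition type is just one of the example values given above, not a decision anyone in the thread has made.

```python
import struct

SECTOR_SIZE = 512
PART1_OFFSET = 446          # first partition entry in a classic MBR
BOOT_PARTITION_TYPE = 0xA2  # hypothetical choice from the examples above


def find_boot_partition(sector0: bytes, expected_type: int = BOOT_PARTITION_TYPE) -> int:
    """Return the starting LBA sector of partition 1, or raise ValueError."""
    if len(sector0) != SECTOR_SIZE:
        raise ValueError("sector 0 must be 512 bytes")
    if sector0[510:512] != b"\x55\xaa":
        raise ValueError("missing MBR boot signature")
    entry = sector0[PART1_OFFSET:PART1_OFFSET + 16]
    # status, CHS start (3 bytes), type, CHS end (3 bytes), LBA start, sector count
    status, chs_start, ptype, chs_end, lba_start, num_sectors = struct.unpack(
        "<B3sB3sII", entry
    )
    if ptype != expected_type:
        raise ValueError("partition 1 is not a boot image partition")
    return lba_start


# Build a fake sector 0 whose partition 1 starts at sector 2048.
sector = bytearray(SECTOR_SIZE)
sector[PART1_OFFSET:PART1_OFFSET + 16] = struct.pack(
    "<B3sB3sII", 0x00, b"\x00\x00\x00", BOOT_PARTITION_TYPE, b"\x00\x00\x00", 2048, 4096
)
sector[510:512] = b"\x55\xaa"
start_sector = find_boot_partition(bytes(sector))
```

From `start_sector` onward the loader would read the image sequentially; the header layout inside the partition (dword count, cog/hub flag) is not specified here, so it is not parsed.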
I'm thinking about this serial hex loading mechanism. This would be really useful for distributed systems which all receive a common data stream. Some fuse bits could be used to set an ID and mask, so that each Prop2 could pick out its own program in a broadcasted data stream. Another fuse sets indefinite timeout for the serial text loader. No SPI Flash chip needed!
The matter of loader signing is important. Even if the key (fuse 0..127) is 0 (default), signing the loader with a zero key gives assurance that whatever data was loaded is actually a program. This is important for Prop2 systems which may have critical hardware connected (CNC, for example).
Combining hex text loading with loader signing means no SPI Flash chip AND code protection!
This all needs to be made very simple.
By the way Chip, will it be possible to use an SPI flash for open designs?
Wonder how long it would take to do brute force attack...
Guess you'd just read the contents of SPI flash chip you want to crack.
The "initialization area" of P1 rom (assuming same for P2) is not really obfuscated, so seems you'd only need to check the first block to see if you got it right...
Since this is an "offline" attack, you'd only be limited by processor speed, I think.
BTW: I don't really know anything about this, just read Wikipedia a bit...
Code isn't stored encrypted unless you have a chain loader that does that. The code that the P2 ROM loads is unencrypted, but signed. If you encrypt the code on the flash with the 128bit key in the P2 fuses, you are not going to brute force anything.
Furthermore, you can't make the P2 execute unsigned code, because the ROM computes a keyed hash over whatever it loads: loosely, it hashes the code together with the 128-bit key, then hashes that result together with the key again, to get the signed value stored in flash. If the signed value stored in flash doesn't match the calculated hash, the code won't run. This keyed-hash authentication scheme is called HMAC, and here it uses the SHA-256 hashing algorithm to generate the hashes.
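For reference, here is a sketch of the standard HMAC-SHA256 construction (RFC 2104), written out by hand and checked against Python's `hmac` module. Whether the P2 ROM follows this construction byte for byte is an assumption; the function names are illustrative only.

```python
import hashlib
import hmac

ZERO_KEY = bytes(16)  # 128 zero bits, the default key for unfused chips
BLOCK_SIZE = 64       # SHA-256 block size in bytes


def hmac_sha256_manual(key: bytes, message: bytes) -> bytes:
    """HMAC as defined in RFC 2104: H((K ^ opad) || H((K ^ ipad) || m))."""
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")
    inner_pad = bytes(b ^ 0x36 for b in key)
    outer_pad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(inner_pad + message).digest()
    return hashlib.sha256(outer_pad + inner).digest()


def authenticate(code: bytes, signature: bytes, fuse_key: bytes = ZERO_KEY) -> bool:
    """Return True only if the signature matches the code under this key."""
    expected = hmac.new(fuse_key, code, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)


code = b"example program image"
signature = hmac_sha256_manual(ZERO_KEY, code)
```

With the zero key anyone can produce a valid signature, so the check only proves the data is an intact program; with a secret fuse key it also proves who signed it.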
The weak link then becomes a side-channel attack: power analysis while the 128-bit key is read from the fuses. This may be very possible, since a fused link will have a lower resistance than a blown link, so the current drawn may fluctuate.
In this case, having a lot of clock jitter in the RC clock may help obfuscate the timing of the fuse bit read. Alternatively, if the bits could be read in parallel, the cumulative current draw of the fused links would mask the true arrangement of the fuses. Reading 32 bits at a time into a register would give sufficient masking, I think.
When the fuses are read, they are all turned on, so there is no current difference. If you could tell from the power that either a zero or one bit was being mixed and rotated into a register, you may be able to tell, but there is no DC component to those operations.
I've been working on the serial/text passive loader. It's coming along well.
I got rid of the clock-setting (switch to fast crystal), since I realized it presented some difficult pitfalls for the user. I'm just sticking with pure auto-bauding. It seems to work to 250k baud. That means 115.2k baud will be certain.
In order to make it faster at 115.2k baud, I'm giving it a base-64 load option, where normal text characters convey 6 bits each. That makes for 75% data efficiency, as opposed to 33% for hex (FF FF FF FF FF FF...). HMAC signatures are optional, in case the fuse key is 0. I don't think the non-HMAC hex mode can be any simpler:
" PropHex <amask> <adata> <bmask> <bdata> <bytes> ~"
The amask/adata/bmask/bdata hex values let you address a Prop2 that has certain I/O pin states (0 0 0 0 = any). The bytes are hex values that load into hub starting at 0. A "~" terminates the load and does a 'COGINIT #0,#0' to start up the downloaded program. White space (including cr/lf) is allowed between all hex data. Every space ($20) refreshes the auto-baud, so as long as there is good decoupling capacitance on the power supply pins, it will continuously adapt to changing temperature and voltage, while running off just the internal RC oscillator.
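A rough host-side model of the hex record format ` PropHex <amask> <adata> <bmask> <bdata> <bytes> ~` described in this thread. Two things are assumptions rather than anything stated here: that a chip matches when `(pins & mask) == data` for both pin groups, and that the closing `~` is separated from the last byte by white space. `parse_prophex` is a hypothetical name.

```python
from typing import Optional


def parse_prophex(record: str, pins_a: int = 0, pins_b: int = 0) -> Optional[bytes]:
    """Return the payload if the record addresses this chip, else None."""
    tokens = record.split()  # any white space, including cr/lf, separates fields
    if len(tokens) < 6 or tokens[0] != "PropHex" or tokens[-1] != "~":
        raise ValueError("not a complete PropHex record")
    amask, adata, bmask, bdata = (int(t, 16) for t in tokens[1:5])
    # Assumed matching rule: masked pin states must equal the data fields.
    if (pins_a & amask) != adata or (pins_b & bmask) != bdata:
        return None
    # Each remaining token is one hex byte, loaded into hub starting at 0.
    return bytes(int(t, 16) for t in tokens[5:-1])


for_any_chip = parse_prophex(" PropHex 0 0 0 0 FF 00 1A ~")
for_chip_1 = parse_prophex("PropHex 3 1 0 0 AA\r\n55 ~", pins_a=0b01)
not_for_chip_2 = parse_prophex("PropHex 3 1 0 0 AA\r\n55 ~", pins_a=0b10)
```

Under the assumed rule, masks and data of `0 0 0 0` match every chip, which agrees with the "0 0 0 0 = any" note above.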
Sounds good to me.
Does this use Smart pins cells, or is it entirely SW based ?
Is there a means to report the autobaud divider chosen ?
In the Autobaud schemes I've used, echo of that gives info on the actual SysCLK, which can allow higher baud choices.
For those who need faster loaders, this can also be a two stage process, right ?
The ROM should be lowest-common-denominator simple, as external clock sources will vary widely.
If the loader is published, users can adapt it into an External-Clock loader, and they need only load that faster loader during the slower ROM_Stage boot.
User-Stage can use Pin-cells, fractional baud and go into megabaud of download speeds.
With 2MBytes of SPI memory very cheap, it is reasonable to expect downloads of that size.
I've not yet seen it mentioned whether this is QuadSPI device tolerant.
That is, does it define the extra Quad pins during load, and issue a reset command to force single-SPI, so that a connected Quad device will behave as expected?
The as yet undefined SPI loading will handle such tricks, no problem.
The small first stage simple SPI loader will be able to switch up to QuadSPI for second stage. The developer can choose any extra pins needed.
EDIT: Ah, you needed to word that differently JMG. The reset revert is the key question - which should have been right at the top of your post. ... Obviously, none of that has been done yet. SPI boot is still to be defined.
It's all in the Boot ROM, and I find 87 hits in a search for SPI on the first page, but it still looks a little sparse on that important QuadSPI tolerant question.
The small first stage simple SPI loader will be able to switch up to QuadSPI for second stage. The developer can choose any extra pins needed.
Actually, no.
A RESET of the Prop may arrive while the flash is in any mode, and the pins connected for Quad also have a use in Single SPI, so they do need to be defined during the earliest BOOT phase.
This means a 4-pin allocation is not enough.
Is there even a singular, simple and clear QuadSPI soft reset process? I see there is no reset pin.
Kye always said soft resetting an SD card was a bad idea. If QuadSPI has similar soft reset problem as SD cards then it's probably a use at your own risk situation.
SD cards at least have the advantage of being removable devices. QuadSPI is probably a lost cause if there is no clear path here.
The one sure way is to power-cycle the flash chip before reading it to force a hard reset. This, of course, needs extra components and another Prop pin for power control.
Is there even a singular, simple and clear QuadSPI soft reset process? I see there is no reset pin.
Yes, a quick look at the data sheets reveals a Continuous Read Reset (0xFF, or 0xFFFF) that exits Quad/Dual modes.
Same command for Winbond, and Fremont. Winbond says 16 Clks will exit any mode to Default SPI.
They show IO0 only, but I would tend to define all IO0..IO3 as Hi.
So, no matter what state the slave device was in, a CS toggle then 16 CLKs with all data pins high will always result in any QuadSPI slave being soft reset?
SD requires a number of clocks to get out of some modes (just using SPI mode). I have modded each of the SD fat drivers to do this. They have been clearly documented and all work fine.
Cool, sounds promising. So, Cluso, all you do is a one simple sequence and the SD card reverts back to single bit SPI mode as if freshly inserted?
Are these mod'd drivers in OBEX or somewhere else? The sequence sounds pretty close to what I just described for the QuadSPI devices. I'm wondering now if one sequence will deal with both interfaces in one hit ...
So, no matter what state the slave device was in, a CS toggle then 16 CLKs with all data pins high will always result in any QuadSPI slave being soft reset?
Yes, that resets both DualSPI and QuadSPI out of any sticky modes, and back to 1-bit commands.
Seems to be universal on the devices I checked, as it is a common problem they want to solve.
The ROM code sends four $FFFFFFFF commands before reading the SPI flash. Kye had said to do this, in order to undo any strange modes before sending an $03000000 command to begin the read.
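As a sketch, the byte stream that sequence puts on the data-out line can be modelled like this. It assumes each 32-bit word is shifted out MSB first, so `$03000000` is the standard SPI flash read opcode `0x03` followed by a 24-bit address of zero. Chip-select handling between words is not modelled, and `boot_read_preamble` is a hypothetical name.

```python
import struct

MODE_RESET_WORD = 0xFFFFFFFF    # all-ones clocks, to exit any sticky mode
READ_COMMAND_WORD = 0x03000000  # read opcode 0x03 plus 24-bit address 0


def boot_read_preamble(resets: int = 4) -> bytes:
    """Return the bytes clocked out before the first data byte is read."""
    words = [MODE_RESET_WORD] * resets + [READ_COMMAND_WORD]
    # ">I" packs each word big-endian, i.e. most significant byte first.
    return b"".join(struct.pack(">I", word) for word in words)


preamble = boot_read_preamble()
```

Four words give 128 clocks of all-ones, well beyond the 16 clocks quoted above for leaving Dual/Quad modes.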
That should more than sort out mode exit, which leaves handling of the IO0..IO3 pins. Some of these dual-map as HOLD#/IO3 and WP#/IO2, so they should be defined high (i.e. not floating) in SingleSPI modes.
That means 6 pins need to be controlled for a QuadSPI tolerant connection.
Or, maybe we could have P2 start with all pins in weak pullup mode?
Either would work, with the all pins in weak pullup being most predictable.
It is probably not a good idea to start/reset a MCU to floating pins anyway.
Weak pullups seem common, and if weak enough, they can be pulled down with a resistor for those pins that need to reset low.
e.g. a 6.6k external pulldown needs 500uA of drive for Hi, and yet will pull a 30uA weak pullup down to < 200mV
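The arithmetic behind those figures, as a small worked check. It assumes a 3.3V I/O rail (implied by 500uA through 6.6k) and treats the weak pullup as a constant 30uA current source, which is the worst case for the pulled-down level.

```python
VDD = 3.3               # volts, assumed I/O rail
R_PULLDOWN = 6.6e3      # ohms, external pulldown
I_WEAK_PULLUP = 30e-6   # amps, weak pullup modelled as a current source

# Current the pin must source to hold the line high against the pulldown.
drive_current = VDD / R_PULLDOWN

# Voltage the weak pullup develops across the pulldown while the pin is idle.
idle_voltage = I_WEAK_PULLUP * R_PULLDOWN

print(f"drive for Hi: {drive_current * 1e6:.0f} uA")
print(f"idle level:   {idle_voltage * 1e3:.0f} mV")
```

A larger resistor reduces the drive current but raises the idle level proportionally, so the 200mV target sets the upper bound on the pulldown value.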
am I missing something?
Mike
That's right.
We'd have to look and see. It's very hard to anticipate, I think.
" PropHex <amask> <adata> <bmask> <bdata> <bytes> ~"
Then, clocking the $FFFFFFFF in would undo SQI or SDI modes...