Thoughts about a Text Storage Format?
Hi, perhaps someone knows something like this or has thoughts about it?
The question is about a way to store a text in an editor in a simple (and fast) way, that you can handle a text of about 3000 Lines and 100 columns and do neither need much Ram nor have many movements (writes) on a SD card, if you insert or delete lines or even chars. (At the moment my editor holds all the text in RAM and would need 3000*100 bytes for this text.)
Traditionally this is solved in a Forth editor with screens (of 16 lines of 64 columns or bigger), only one screen is visible at a time. The writer must manually handle the screens if he must insert text, that does not fit onto a screen. But here the text shall appear to the user as a continuous text.
SD card space can be handled quite wastefully, so perhaps only a part of each sector (4096 bytes) could be used and only some of these would be held in Ram?
I have seen a text editor in Forth which uses linked lists for the lines. But I think this helps only for speed.
The other direction of thought is to hold the text in a compacted form in Ram. The text has about 75% white spaces and when it is compiled it is boiled down to 1/20th of space, so there is plenty of potential...
Ideas, links? Thanks in advance!
Christof
Comments
Christof Eb.,
When you say text, do you mean ASCII, Extended ASCII (Escape Sequences), UTF, or something else?
Yes, source code is 7bit ASCII.
have a look here for some discussion: https://cdacamar.github.io/data structures/algorithms/benchmarking/text editors/c++/editor-data-structures/
(in the past I used gap buffer to implement a texteditor with an additional 'disk' buffer x blocks before and after the current cursor using a temp file to store the data and reading/writing data as needed. None too fast in those days but I could handle text files larger then memory available)
Thank You!
Interesting read!
At the moment I think, I could use the simple approach to still hold all the text in Ram but instead of fixed length lines have crlf delimited lines. In parallel to the text buffer have a table, which holds all beginnings of lines for fast access.
If you insert a character near the beginning then the whole file needs to be rewritten. I assume you are trying to avoid that. So what if you use one sector per line. Then you would need to rewrite only one sector. But I think this scheme will result in more sectors written on average than a standard text file. The SD card may have to erase much more than one sector to write one sector. This is called write amplification.
I guess only for lines that are shown on the screen? Otherwise 3000 lines * 4 bytes = 12kB just for pointers. Maybe Just keep some pointers around as starting points? A block of 256 pointers would give lines 1-256, or 2-512, 4-1024, 8-2048, 16-4096. When the table is filled every other pointer gets thrown away and the others packed into the first half of the array.
This becomes a bigger problem the more I think about it.
You really don't need quick access to individual lines in a text editor. You just need to remember the beginning position of the first line on screen and of the first line below the screen. Mind that with line-wrapping, the top of the visible screen may not line up with the beginning of a line.
Scanning the entire file (for any reasonable size of text file) to jump to a particular line should not be terribly slow.
One approach that seems like it'd work for non-resident text is to use a binary tree where leaf nodes contain (pointers to) snippets of text of up to a certain size.
Inserting or deleting bytes in such a structure requires at most 3 or 4 writes (depending on implementation - I think anyways), but seeking to a particular byte position is quite slow. It's possible to minimize the seeking needed by storing the full node path for cursor and screen positions (etc). But then those need to all be kept in sync when the nodes are updated. Another method has each branch node store the total size of all children to speed up seeking, though in this scheme the writes needed increase with tree depth due to having to update all the way up to the root node (clever people will realize the same thing works for counting lines, at expense of adding another field to each node).
Also, please be aware that SD card sectors are 512 bytes each
Thanks again for the inputs! Gives me something to think about. And it becomes more clear that it will be better to find an efficient compact way to hold the whole text in RAM, than to have the major part of the text only on SD card.
Well, not entirely.... I think, I should try to write an include mechanism for Taqoz....