Skip to content

perf: reduce chapter parsing time by ~10% via CSS caching and loop optimizations#1528

Closed
jpirnay wants to merge 2 commits into
crosspoint-reader:masterfrom
jpirnay:perf-htmlslim-parser
Closed

perf: reduce chapter parsing time by ~10% via CSS caching and loop optimizations#1528
jpirnay wants to merge 2 commits into
crosspoint-reader:masterfrom
jpirnay:perf-htmlslim-parser

Conversation

@jpirnay
Copy link
Copy Markdown
Contributor

@jpirnay jpirnay commented Mar 28, 2026

Summary

  • What is the goal of this PR? Reduce the time spent parsing and laying out a chapter. Profiling identified index_split_007.html (~60 KB, 72 pages) from a representative epub as a worst-case target, with a baseline of ~2164 ms.
  • What changes are included?
    • CSS style resolution cache per chapter (ChapterHtmlSlimParser)
    • Inter-word gap pre-computation in both line-breaking paths (ParsedText)
    • ASCII fast path in the characterData XML callback

Additional Context

Profiling methodology

Chapter-level and per-paragraph timing was added temporarily (and removed before this PR) to measure each phase:

Phase Baseline Final Δ
Total 2164 ms ~1950 ms −214 ms (−10%)
Layout (line-break + SD writes) ~1501 ms ~1501 ms
XML + misc ~663 ms ~452 ms −211 ms

The layout figure is itself a compound cost: ~720 ms is SD card writes (~10 ms/page × 72 pages), ~780 ms is actual line-break computation. Neither was addressed in this PR.

The xml+misc figure was further broken down by callback:

Callback Time
startElement pure logic ~95 ms
characterData ~65 ms
endElement ~17 ms
expat internals ~282 ms

What worked

CSS style resolution cache (−171 ms) — the largest gain. resolveStyle and parseInlineStyle were being called for every HTML element (~1000 per chapter), each doing string normalization, splitting, and multiple hash lookups. Per-chapter unordered_map caches keyed on tag|class and styleAttr eliminate repeat work. The maps are members of ChapterHtmlSlimParser and are discarded with the parser after each chapter.

Inter-word gap pre-computation (−23 ms)getSpaceAdvance/getKerning were called inside the inner loop of both computeLineBreaks (DP) and computeHyphenatedLineBreaks (greedy). Moving them to a single pre-pass vector avoids redundant codepoint scanning and renderer calls. Applied to both paths; the hyphenated path also keeps the vector in sync when hyphenateWordAtIndex inserts a new word.

ASCII fast path in characterData (−11 ms) — for bytes > 0x20 && < 0x80, skip all multi-byte sequence checks (NBSP, NNBSP, BOM) and go directly to buffer append. Safe because all multi-byte UTF-8 sequences start with a byte ≥ 0x80.

What was investigated but did not help

  • Async SD writes — the ~720 ms of SD write time embedded in page completion is the single largest remaining cost. Addressing it would require making completePageFn asynchronous, which is a significant architectural change outside the scope of this PR. (Edit: simple buffering the write calls does not help as SdFat likely already buffers internally to 512-byte sector boundaries, so our many small writePod calls were accumulating in SdFat's sector buffer and only flushing when a sector boundary was crossed.)
  • Expat parser internals (~282 ms) — not addressable without replacing the XML parser.
  • Additional callback optimizations — profiling showed endElement (~17 ms) and the remaining startElement logic (~95 ms) have no obvious low-risk wins beyond what the CSS cache already provides.

AI Usage

Did you use AI tools to help write this code? YES

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: da8fc4e3-d65f-447c-a961-f313a19be002

📥 Commits

Reviewing files that changed from the base of the PR and between 9f397e7 and 21da536.

📒 Files selected for processing (1)
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: cppcheck
  • GitHub Check: build

📝 Walkthrough

Walkthrough

Precomputes inter-word gaps in ParsedText line-breaking/hyphenation to remove repeated renderer kerning/space calls, adds per-chapter caches for stylesheet and inline-style resolution in ChapterHtmlSlimParser, and introduces an ASCII fast path in character tokenization.

Changes

Cohort / File(s) Summary
Line-breaking optimization
lib/Epub/Epub/ParsedText.cpp
Added a precomputed interWordGaps vector used by computeLineBreaks and computeHyphenatedLineBreaks to avoid repeated renderer.getSpaceAdvance/renderer.getKerning calls inside O(n²) loops. When hyphenation splits a word, inserts a 0 gap placeholder to keep indices aligned with modified words/wordWidths.
HTML parsing & style caching
lib/Epub/Epub/parsers/ChapterHtmlSlimParser.h, lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
Added cssStyleCache_ and inlineStyleCache_ unordered_maps to cache resolveStyle(name, class) and parsed inline styles; cache only populated for fully-resolved styles. Applied cached styles when evaluating image display and sizing.
Tokenization fast path & logging
lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
Introduced ASCII fast path for characterData bytes 0x21..0x7F to append directly to partWordBuffer, reducing UTF-8/whitespace handling on the hot path. Changed parse-time logging to compute and store elapsed ms in a uint32_t totalTimeMs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • znelson
  • daveallie
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main changes: performance improvements via CSS caching and loop optimizations, with the specific ~10% improvement metric.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, detailing the goal, specific changes, profiling methodology, measured results, and investigations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
lib/Epub/Epub/ParsedText.cpp (2)

168-181: Extract shared inter-word-gap precompute into a helper to avoid drift.

The gap-building logic is now duplicated in two paths. A shared helper will keep behavior aligned and reduce maintenance risk.

♻️ Suggested refactor
+namespace {
+std::vector<int> buildInterWordGaps(const GfxRenderer& renderer, int fontId,
+                                    const std::vector<std::string>& words,
+                                    const std::vector<EpdFontFamily::Style>& wordStyles,
+                                    const std::vector<bool>& continuesVec) {
+  std::vector<int> gaps(words.size(), 0);
+  for (size_t j = 1; j < words.size(); ++j) {
+    gaps[j] = !continuesVec[j]
+                  ? renderer.getSpaceAdvance(fontId, lastCodepoint(words[j - 1]), firstCodepoint(words[j]),
+                                             wordStyles[j - 1])
+                  : renderer.getKerning(fontId, lastCodepoint(words[j - 1]), firstCodepoint(words[j]),
+                                        wordStyles[j - 1]);
+  }
+  return gaps;
+}
+}  // namespace

Then replace both local precompute blocks with:

-std::vector<int> interWordGaps(...);
-...
+std::vector<int> interWordGaps = buildInterWordGaps(renderer, fontId, words, wordStyles, continuesVec);

Also applies to: 293-307

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/Epub/Epub/ParsedText.cpp` around lines 168 - 181, The duplicated
inter-word-gap precompute should be extracted into a helper function to keep
behavior consistent; implement a function (e.g., computeInterWordGaps) that
takes fontId, totalWordCount, words, continuesVec, wordStyles and renderer and
returns the std::vector<int> interWordGaps by using
lastCodepoint(words[j-1])/firstCodepoint(words[j]) and calling
renderer.getSpaceAdvance or renderer.getKerning depending on continuesVec[j],
then replace both local blocks that build interWordGaps (the one using
continuesVec and renderer.getSpaceAdvance/getKerning) with calls to this helper
to avoid drift.

338-340: Guard the gap-vector sync invariant with assertions.

This insertion strategy is valid but depends on strict vector alignment invariants. Add explicit assert() checks after split+insert so future changes fail fast if alignment breaks.

🛡️ Suggested hardening
+#include <cassert>
...
       if (availableWidth > 0 &&
           hyphenateWordAtIndex(currentIndex, availableWidth, renderer, fontId, wordWidths, allowFallbackBreaks)) {
         // Keep interWordGaps in sync: insert placeholder for the new remainder word.
         // The remainder is always the first word on the next line so this slot is never read.
         interWordGaps.insert(interWordGaps.begin() + currentIndex + 1, 0);
+        assert(interWordGaps.size() == wordWidths.size());
+        assert(wordWidths.size() == words.size());
+        assert(wordStyles.size() == words.size());
+        assert(continuesVec.size() == words.size());

Based on learnings: In this codebase, assertions are always enabled and should be used to enforce invariants (side-effect free, inexpensive).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/Epub/Epub/ParsedText.cpp` around lines 338 - 340, After inserting the
placeholder gap, add lightweight assert checks to enforce the inter-word gap
alignment: immediately after interWordGaps.insert(...) assert that currentIndex
+ 1 is a valid index (currentIndex + 1 < interWordGaps.size()) and that the
newly inserted slot equals 0 (interWordGaps[currentIndex + 1] == 0); also add an
assertion that the interWordGaps vector remains aligned with the corresponding
word/token container used in this routine (e.g., the words/wordSlots vector used
by ParsedText) so sizes match after the insert.
lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp (1)

297-307: Reuse the already-resolved cssStyle in the <img> path.

For img, cssStyle already contains the same stylesheet + inline-style merge, so these extra cache probes cannot change behavior. Removing them trims per-image hot-path work and gives you one place to apply the cache policy consistently.

♻️ Suggested simplification
-      if (self->cssParser) {
-        std::string imgCacheKey("img|");
-        imgCacheKey += classAttr;
-        auto imgIt = self->cssStyleCache_.find(imgCacheKey);
-        if (imgIt == self->cssStyleCache_.end())
-          imgIt = self->cssStyleCache_.emplace(imgCacheKey, self->cssParser->resolveStyle("img", classAttr)).first;
-        CssStyle imgDisplayStyle = imgIt->second;
-        if (!styleAttr.empty()) {
-          auto it = self->inlineStyleCache_.find(styleAttr);
-          if (it == self->inlineStyleCache_.end())
-            it = self->inlineStyleCache_.emplace(styleAttr, CssParser::parseInlineStyle(styleAttr)).first;
-          imgDisplayStyle.applyOver(it->second);
-        }
+      {
+        const CssStyle& imgDisplayStyle = cssStyle;
         if (imgDisplayStyle.hasDisplay() && imgDisplayStyle.display == CssDisplay::None) {
           self->skipUntilDepth = self->depth;
           self->depth += 1;
           return;
         }
       }
@@
-                std::string imgCacheKey("img|");
-                imgCacheKey += classAttr;
-                auto imgStyleIt = self->cssParser ? self->cssStyleCache_.find(imgCacheKey) : self->cssStyleCache_.end();
-                if (self->cssParser && imgStyleIt == self->cssStyleCache_.end())
-                  imgStyleIt =
-                      self->cssStyleCache_.emplace(imgCacheKey, self->cssParser->resolveStyle("img", classAttr)).first;
-                CssStyle imgStyle = self->cssParser ? imgStyleIt->second : CssStyle{};
-                // Merge inline style (e.g. style="height: 2em") so it overrides stylesheet rules
-                if (!styleAttr.empty()) {
-                  auto it = self->inlineStyleCache_.find(styleAttr);
-                  if (it == self->inlineStyleCache_.end())
-                    it = self->inlineStyleCache_.emplace(styleAttr, CssParser::parseInlineStyle(styleAttr)).first;
-                  imgStyle.applyOver(it->second);
-                }
+                CssStyle imgStyle = cssStyle;

Also applies to: 352-364

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp` around lines 297 - 307, The
img handling duplicates work by re-resolving and re-caching styles even though
cssStyle already holds the resolved stylesheet+inline merge; update the <img>
path in ChapterHtmlSlimParser (where imgCacheKey, cssStyleCache_,
cssParser->resolveStyle("img", classAttr), inlineStyleCache_,
CssParser::parseInlineStyle and imgDisplayStyle.applyOver are used) to reuse the
existing cssStyle variable instead of creating imgCacheKey and probing
cssStyleCache_/inlineStyleCache_ again—remove the extra cache lookups and
resolve calls, and apply the same single cache/application policy already used
for other elements so style resolution for images is done in one place
consistently (also apply the same change in the similar block referenced at the
later img path).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp`:
- Around line 189-197: The current code memoizes whatever
cssParser->resolveStyle(name, classAttr) returns, but that can be a transient
low-heap fallback (empty/default) and should not be cached; change the miss path
so you call cssParser->resolveStyle(...) into a temporary, then only insert into
cssStyleCache_ if the returned CssStyle is authoritative (e.g., non-default/has
rules) — otherwise skip emplacing and just use the temporary result so future
calls can re-resolve after memory recovers; reference cssStyleCache_,
cssParser->resolveStyle(name, classAttr) and the local cssStyle variable when
making this change.

---

Nitpick comments:
In `@lib/Epub/Epub/ParsedText.cpp`:
- Around line 168-181: The duplicated inter-word-gap precompute should be
extracted into a helper function to keep behavior consistent; implement a
function (e.g., computeInterWordGaps) that takes fontId, totalWordCount, words,
continuesVec, wordStyles and renderer and returns the std::vector<int>
interWordGaps by using lastCodepoint(words[j-1])/firstCodepoint(words[j]) and
calling renderer.getSpaceAdvance or renderer.getKerning depending on
continuesVec[j], then replace both local blocks that build interWordGaps (the
one using continuesVec and renderer.getSpaceAdvance/getKerning) with calls to
this helper to avoid drift.
- Around line 338-340: After inserting the placeholder gap, add lightweight
assert checks to enforce the inter-word gap alignment: immediately after
interWordGaps.insert(...) assert that currentIndex + 1 is a valid index
(currentIndex + 1 < interWordGaps.size()) and that the newly inserted slot
equals 0 (interWordGaps[currentIndex + 1] == 0); also add an assertion that the
interWordGaps vector remains aligned with the corresponding word/token container
used in this routine (e.g., the words/wordSlots vector used by ParsedText) so
sizes match after the insert.

In `@lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp`:
- Around line 297-307: The img handling duplicates work by re-resolving and
re-caching styles even though cssStyle already holds the resolved
stylesheet+inline merge; update the <img> path in ChapterHtmlSlimParser (where
imgCacheKey, cssStyleCache_, cssParser->resolveStyle("img", classAttr),
inlineStyleCache_, CssParser::parseInlineStyle and imgDisplayStyle.applyOver are
used) to reuse the existing cssStyle variable instead of creating imgCacheKey
and probing cssStyleCache_/inlineStyleCache_ again—remove the extra cache
lookups and resolve calls, and apply the same single cache/application policy
already used for other elements so style resolution for images is done in one
place consistently (also apply the same change in the similar block referenced
at the later img path).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 574ebba3-75c4-49c3-aeac-f75678807ea2

📥 Commits

Reviewing files that changed from the base of the PR and between 3448430 and 9f397e7.

📒 Files selected for processing (3)
  • lib/Epub/Epub/ParsedText.cpp
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.h
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: cppcheck
  • GitHub Check: build
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 988
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:649-661
Timestamp: 2026-02-19T12:17:05.421Z
Learning: In ChapterHtmlSlimParser.cpp, when computing footnote word indices in endElement() for footnote links, the wordIndex must be cumulative across the 750-word mid-paragraph flush boundary. The correct calculation is: `wordIndex = wordsExtractedInBlock + currentTextBlock->size()`, not just `currentTextBlock->size()`. This ensures footnotes attach to the page containing their actual anchor word, even after layoutAndExtractLines(false) has extracted and removed earlier words from the block.
📚 Learning: 2026-02-19T12:17:05.421Z
Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 988
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:649-661
Timestamp: 2026-02-19T12:17:05.421Z
Learning: In ChapterHtmlSlimParser.cpp, when computing footnote word indices in endElement() for footnote links, the wordIndex must be cumulative across the 750-word mid-paragraph flush boundary. The correct calculation is: `wordIndex = wordsExtractedInBlock + currentTextBlock->size()`, not just `currentTextBlock->size()`. This ensures footnotes attach to the page containing their actual anchor word, even after layoutAndExtractLines(false) has extracted and removed earlier words from the block.

Applied to files:

  • lib/Epub/Epub/ParsedText.cpp
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.h
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
📚 Learning: 2026-03-17T15:27:17.468Z
Learnt from: znelson
Repo: crosspoint-reader/crosspoint-reader PR: 1413
File: lib/EpdFont/EpdFont.cpp:34-36
Timestamp: 2026-03-17T15:27:17.468Z
Learning: In crosspoint-reader/crosspoint-reader, `EpdFont::getGlyph()` falls back to `getGlyph(REPLACEMENT_GLYPH)` (U+FFFD) before returning `nullptr`. All fonts in this project are guaranteed to include a U+FFFD replacement glyph, so the `!glyph` null branch in `EpdFont::getTextBounds()` (and similar rendering paths) is unreachable in practice. Do not flag stale-state issues in that branch (e.g., leftover `lastBaseAdvanceFP`/`lastBaseTop` after a null glyph) as bugs in future reviews.

Applied to files:

  • lib/Epub/Epub/ParsedText.cpp
📚 Learning: 2026-03-20T19:51:12.696Z
Learnt from: znelson
Repo: crosspoint-reader/crosspoint-reader PR: 1447
File: src/activities/reader/EpubReaderActivity.cpp:642-660
Timestamp: 2026-03-20T19:51:12.696Z
Learning: In crosspoint-reader/crosspoint-reader (EpubReaderActivity.cpp, prepareSection), when `epub->getTocIndexForSpineIndex(currentSpineIndex)` returns a negative value (pre-TOC spine with no TOC entry), `prepareSection` skips the entire chapter walk and leaves `chapterPageInfo` unset. The status bar correctly falls back to `section->pageCount` in this scenario. This means the `pageTocIndex == -1` path inside the `render()` per-page TOC update block is handled correctly by simply leaving `chapterPageInfo` unchanged — no explicit clear is needed.

Applied to files:

  • lib/Epub/Epub/ParsedText.cpp
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.h
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
📚 Learning: 2026-02-27T22:49:59.600Z
Learnt from: ngxson
Repo: crosspoint-reader/crosspoint-reader PR: 1218
File: src/activities/ActivityManager.cpp:254-265
Timestamp: 2026-02-27T22:49:59.600Z
Learning: In this codebase, assertions are always enabled (no NDEBUG). Use assert() to crash on programmer errors and surface logic bugs during development and in production builds. Do not rely on asserts for runtime error handling; they should enforce invariants that must always hold. Keep asserts side-effect free and inexpensive, and avoid relying on them for user-visible failures. Include <cassert> where appropriate and document the invariant being tested.

Applied to files:

  • lib/Epub/Epub/ParsedText.cpp
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
📚 Learning: 2026-03-02T10:14:16.036Z
Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 1245
File: lib/Epub/Epub/Section.cpp:277-308
Timestamp: 2026-03-02T10:14:16.036Z
Learning: Guideline: Strengthen serialization::readString to defend against unbounded growth when reading from disk data. Implement and enforce a maximum allowed length (e.g., a configured or reasonable constant) and validate the incoming length before resizing or allocating. Audit all call sites (e.g., BookMetadataCache, TextBlock, KOReaderCredentialStore, Section cache readers) to ensure they do not rely on unbounded len-based resizing. If the readString API must remain, add internal safeguards (bounds checks, length validation, and error handling) so per-call-site validations are not required. Ensure Section cache files remain versioned (SECTION_FILE_VERSION) and parameter mismatches invalidate caches, but do not rely on unsafe allocations; prefer safe, bounded reads with explicit errors on invalid data.

Applied to files:

  • lib/Epub/Epub/ParsedText.cpp
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
📚 Learning: 2026-03-28T11:06:29.611Z
Learnt from: pablohc
Repo: crosspoint-reader/crosspoint-reader PR: 1488
File: src/activities/home/HomeActivity.cpp:92-95
Timestamp: 2026-03-28T11:06:29.611Z
Learning: When reviewing crosspoint-reader code, avoid flagging a missing `renderer.displayBuffer()` call immediately after `GUI.drawPopup()` / `BaseTheme::drawPopup()`: `BaseTheme::drawPopup()` already calls `renderer.displayBuffer()` before returning, so the popup is guaranteed to be flushed to the e-ink panel before subsequent blocking work begins. Conversely, do not require a `renderer.displayBuffer()` call after `fillPopupProgress()`; it intentionally does not flush, so intermediate progress-bar updates may not appear unless the update granularity warrants an explicit flush elsewhere.

Applied to files:

  • lib/Epub/Epub/ParsedText.cpp
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
📚 Learning: 2026-03-21T19:30:33.664Z
Learnt from: znelson
Repo: crosspoint-reader/crosspoint-reader PR: 1454
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:101-105
Timestamp: 2026-03-21T19:30:33.664Z
Learning: In `lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp`, the `blockStyleStack` is a fixed-capacity array (`MAX_BLOCK_STYLE_DEPTH`, currently 8). On overflow, `pushBlockStyle()` only logs an error and increments `blockStyleStackSize` without writing a new entry; `getAccumulatedBlockStyle()` returns the deepest stored accumulated entry as a fallback. This is intentional — real EPUBs are expected to have ≤ 8 block-style nesting levels. Inline styles use a `vector`-based stack and can nest more deeply. The overflow threshold or container type can be revisited if 8 proves insufficient in practice. Do not flag the overflow fallback behavior as a bug in future reviews.

Applied to files:

  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.h
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
📚 Learning: 2026-03-02T21:18:25.387Z
Learnt from: Uri-Tauber
Repo: crosspoint-reader/crosspoint-reader PR: 1245
File: lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp:137-150
Timestamp: 2026-03-02T21:18:25.387Z
Learning: In ChapterHtmlSlimParser.cpp, pendingAnchorId is a single string (not a vector) because EPUB footnote targets are virtually always block-level elements (<aside>, <p>, <div>) that trigger startNewTextBlock() and flush the anchor before overwrite can occur. Multiple consecutive inline IDs would resolve to the same completedPageCount anyway, and vector allocation overhead is unnecessary for ESP32-C3. Only inline IDs in practice are footnote back-references (<a id="ref1">), which are never navigation targets.

Applied to files:

  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.h
  • lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
🔇 Additional comments (1)
lib/Epub/Epub/ParsedText.cpp (1)

168-181: Nice optimization with preserved spacing semantics.

Good change: precomputing interWordGaps removes repeated renderer/codepoint work from hot loops while keeping the same space-vs-kerning decision model.

Also applies to: 199-200, 293-307, 322-323

Comment thread lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets faster chapter parsing/layout by reducing repeated CSS resolution work, minimizing renderer calls in line-breaking loops, and adding a faster ASCII path for XML character data handling.

Changes:

  • Add per-chapter caches for resolved CSS styles and parsed inline styles in ChapterHtmlSlimParser.
  • Pre-compute inter-word gaps in both DP and hyphenation line-breaking paths to reduce repeated renderer/codepoint work.
  • Add an ASCII fast path in the XML characterData callback.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
lib/Epub/Epub/parsers/ChapterHtmlSlimParser.h Adds per-chapter unordered_map caches for stylesheet and inline-style resolution.
lib/Epub/Epub/parsers/ChapterHtmlSlimParser.cpp Uses the caches in startElement, adds an ASCII fast path in characterData, and minor logging refactor.
lib/Epub/Epub/ParsedText.cpp Pre-computes inter-word spacing/kerning vectors to reduce inner-loop cost during line breaking.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +306 to +308
if (imgIt == self->cssStyleCache_.end())
imgIt = self->cssStyleCache_.emplace(imgCacheKey, self->cssParser->resolveStyle("img", classAttr)).first;
CssStyle imgDisplayStyle = imgIt->second;
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The image-path style caching unconditionally stores the result of CssParser::resolveStyle("img", classAttr). But resolveStyle can intentionally return an empty CssStyle when free heap is below MIN_FREE_HEAP_FOR_CSS (see CssParser.cpp), and caching that empty result would make the style permanently empty for the rest of the chapter even if heap later recovers. Consider mirroring the earlier caching logic here (only cache when resolved.defined.anySet(), otherwise use the transient resolved value without inserting) so low-heap fallbacks don’t get memoized.

Suggested change
if (imgIt == self->cssStyleCache_.end())
imgIt = self->cssStyleCache_.emplace(imgCacheKey, self->cssParser->resolveStyle("img", classAttr)).first;
CssStyle imgDisplayStyle = imgIt->second;
CssStyle imgDisplayStyle;
if (imgIt == self->cssStyleCache_.end()) {
CssStyle resolved = self->cssParser->resolveStyle("img", classAttr);
if (resolved.defined.anySet())
imgIt = self->cssStyleCache_.emplace(imgCacheKey, resolved).first;
imgDisplayStyle = resolved;
} else {
imgDisplayStyle = imgIt->second;
}

Copilot uses AI. Check for mistakes.
Comment on lines +360 to +364
auto imgStyleIt = self->cssParser ? self->cssStyleCache_.find(imgCacheKey) : self->cssStyleCache_.end();
if (self->cssParser && imgStyleIt == self->cssStyleCache_.end())
imgStyleIt =
self->cssStyleCache_.emplace(imgCacheKey, self->cssParser->resolveStyle("img", classAttr)).first;
CssStyle imgStyle = self->cssParser ? imgStyleIt->second : CssStyle{};
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as above: this imgStyle cache path always emplaces the result of resolveStyle into cssStyleCache_. If resolveStyle returned an empty style due to low heap, that fallback becomes permanently cached and will prevent later correct CSS height/width/display resolution for images of this class. Suggest applying the same ‘only cache when resolved.defined.anySet()’ policy (or otherwise avoid caching low-heap fallback results) here too.

Suggested change
auto imgStyleIt = self->cssParser ? self->cssStyleCache_.find(imgCacheKey) : self->cssStyleCache_.end();
if (self->cssParser && imgStyleIt == self->cssStyleCache_.end())
imgStyleIt =
self->cssStyleCache_.emplace(imgCacheKey, self->cssParser->resolveStyle("img", classAttr)).first;
CssStyle imgStyle = self->cssParser ? imgStyleIt->second : CssStyle{};
CssStyle imgStyle{};
if (self->cssParser) {
auto imgStyleIt = self->cssStyleCache_.find(imgCacheKey);
if (imgStyleIt != self->cssStyleCache_.end()) {
imgStyle = imgStyleIt->second;
} else {
CssStyle resolved = self->cssParser->resolveStyle("img", classAttr);
if (resolved.defined.anySet()) {
imgStyle = self->cssStyleCache_.emplace(imgCacheKey, resolved).first->second;
} else {
imgStyle = resolved;
}
}
}

Copilot uses AI. Check for mistakes.
@znelson
Copy link
Copy Markdown
Member

znelson commented Apr 21, 2026

After a careful look, I'm uneasy with this. Yes, some performance improvement, but unordered_map requires nodes to be heap allocated, and I worry about heap fragmentation as a result. If the only improvement is performance, I'm going to hold off until we have a clear owner for these improvements.

@znelson znelson closed this Apr 21, 2026
@jpirnay jpirnay deleted the perf-htmlslim-parser branch May 7, 2026 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants