perf: add literal-pattern fast path to split() #19708
mattn wants to merge 4 commits into vim:master
When the pattern passed to split() is a single plain byte (not a regexp metacharacter), bypass vim_regcomp/vim_regexec entirely and scan with vim_strchr() instead. This avoids regex compilation and matching overhead for the very common case of splitting on a literal character such as "," or ":".
Generalize the fast path from single-byte literals to any pattern that contains no regexp metacharacters. Use mb_ptr2len() to safely skip multi-byte characters when scanning for metacharacters, and strstr() for the actual splitting.
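The single-byte case from the first commit can be illustrated outside Vim with a minimal standalone sketch. It assumes standard `strchr()` as a stand-in for `vim_strchr()`; `count_fields` is a hypothetical helper that only counts fields and is not part of the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): count the fields of `s` when
 * it is split on the single byte `sep`.  Empty fields are counted, so
 * "a,,b" has three fields.
 */
size_t count_fields(const char *s, char sep)
{
    size_t n = 1;
    const char *hit;

    if (sep == '\0')
        return 1;   /* strchr() would match the terminator; treat as no separator */

    /* Each separator found starts one more field; no regexp engine involved. */
    while ((hit = strchr(s, sep)) != NULL) {
        n++;
        s = hit + 1;
    }
    return n;
}
```

Unlike split() without {keepempty}, this sketch does not drop empty leading or trailing fields.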
How about adding a condition that the previous char is not
    for (p = pat; *p != NUL; p += mb_ptr2len(p))
        if (*p < 0x80
                && vim_strchr((char_u *)".^$~[]\\*?+|{}()", *p) != NULL)
            return FALSE;
that will make
EDIT: I guess that will require
    while (*str != NUL || keepempty)
    {
        p = (char_u *)strstr((char *)str, (char *)pat);
        end = p == NULL ? str + STRLEN(str) : p;
can we avoid the strlen() inside the loop?
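One possible way to address this, sketched in standalone C rather than against Vim's sources: compute the end of the input once before the loop and reuse it when strstr() finds no further match. The names `split_literal` and `piece_fn` are hypothetical, and the sketch always reports empty pieces (roughly the {keepempty} behaviour).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: receives one piece as (start, length). */
typedef void (*piece_fn)(const char *start, size_t len, void *ctx);

/*
 * Split `str` on the literal separator `sep` and report every piece,
 * including empty ones.  Returns the number of pieces, or 0 if `sep`
 * is empty.  `emit` may be NULL when only the count is wanted.
 */
size_t split_literal(const char *str, const char *sep, piece_fn emit, void *ctx)
{
    const char *str_end = str + strlen(str);   /* computed once, outside the loop */
    size_t seplen = strlen(sep);
    size_t count = 0;

    if (seplen == 0)
        return 0;

    for (;;) {
        const char *hit = strstr(str, sep);
        const char *end = hit != NULL ? hit : str_end;  /* no strlen() here */

        if (emit != NULL)
            emit(str, (size_t)(end - str), ctx);
        count++;
        if (hit == NULL)
            break;
        str = hit + seplen;   /* continue after the separator */
    }
    return count;
}
```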
    patlen = (int)STRLEN(pat);
    while (*str != NUL || keepempty)
    {
        p = (char_u *)strstr((char *)str, (char *)pat);
Hm, does strstr() handle non utf-8 multibyte chars correctly?
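A small standalone sketch of why this question matters, under the assumption that the text is in a double-byte encoding such as Shift_JIS, where a trail byte can fall in the ASCII range. The byte values below are the Shift_JIS and UTF-8 encodings of katakana "ア" as I understand them; `bytewise_contains` is a hypothetical helper, not code from the PR.

```c
#include <string.h>

/* Returns 1 if a byte-wise strstr() finds `sep` anywhere in `text`, else 0. */
int bytewise_contains(const char *text, const char *sep)
{
    return strstr(text, sep) != NULL;
}

/* Assumed Shift_JIS bytes for katakana "ア": lead 0x83, trail 0x41 ('A'). */
static const char sjis_a[] = "\x83\x41";

/* UTF-8 bytes for the same character: every byte is >= 0x80. */
static const char utf8_a[] = "\xe3\x82\xa2";
```

With these inputs, searching for "A" matches inside the Shift_JIS character but not inside the UTF-8 one, since UTF-8 never uses ASCII values for lead or trail bytes. A byte-wise scan is therefore only safe for encodings with that property, unless matches are checked against character boundaries.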
Pull request overview
This PR optimizes the Vimscript split() builtin by adding a fast path for purely-literal separator patterns, avoiding regex compilation/execution for common cases while leaving regexp and default-whitespace behavior on the existing code path.
Changes:
- Add is_literal_pat() helper to detect patterns with no regexp metacharacters (with multibyte-safe scanning).
- Implement a literal-separator split loop using strstr() and byte-length advancement instead of vim_regcomp()/vim_regexec().
        && *str != NUL && p != NULL
        && end < p + patlen))
static int
is_literal_pat(char_u *pat)
{
    char_u *p;

    if (pat == NULL || *pat == NUL)
        return FALSE;

    // Check that no character in the pattern has regexp meaning.
    // Use mb_ptr2len() to skip over multi-byte characters safely so that
    // trail bytes are never mistaken for ASCII metacharacters.
    for (p = pat; *p != NUL; p += mb_ptr2len(p))
        if (*p < 0x80
                && vim_strchr((char_u *)".^$~[]\\*?+|{}()", *p) != NULL)
            return FALSE;

    return TRUE;
}
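For experimenting with the check outside Vim, here is a minimal standalone sketch of the same idea. It assumes UTF-8 input only; `utf8_len` is a hypothetical stand-in for mb_ptr2len() and `is_literal` mirrors is_literal_pat() with plain C types, so neither is the patch's actual code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for mb_ptr2len(): byte length of the UTF-8
 * character starting at p, judged from the lead byte only.
 */
static size_t utf8_len(const unsigned char *p)
{
    if (*p < 0x80) return 1;
    if ((*p & 0xE0) == 0xC0) return 2;
    if ((*p & 0xF0) == 0xE0) return 3;
    if ((*p & 0xF8) == 0xF0) return 4;
    return 1;   /* invalid lead byte: step over a single byte */
}

/* Returns 1 if `pat` is non-empty and contains none of the listed
 * regexp metacharacters, else 0. */
int is_literal(const char *pat)
{
    const unsigned char *p = (const unsigned char *)pat;

    if (p == NULL || *p == '\0')
        return 0;

    while (*p != '\0') {
        size_t n = utf8_len(p);

        /* Only ASCII bytes can be metacharacters. */
        if (*p < 0x80 && strchr(".^$~[]\\*?+|{}()", *p) != NULL)
            return 0;

        /* Advance one character, never past the terminator of a
         * truncated multi-byte sequence. */
        while (n-- > 0 && *p != '\0')
            p++;
    }
    return 1;
}
```

The metacharacter list is deliberately conservative: characters such as "+" or "(" are rejected even though they are only special after a backslash in Vim's default 'magic' mode, which keeps the check simple at the cost of sending some literal patterns down the regexp path.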
split() with a literal separator (e.g. ",", ":", "abc") is an extremely common pattern in Vim script, yet it currently goes through the full regexp compile-and-match path every time. This patch adds a fast path that detects patterns containing no regexp metacharacters and uses strstr() to scan instead, skipping vim_regcomp()/vim_regexec() entirely. Multi-byte characters are handled safely via mb_ptr2len().
Regexp patterns and the default whitespace pattern are unaffected and still take the existing code path.
Benchmark: 200,000 iterations per case
- ',' (literal 1-char)
- 'abc' (literal multi-char)
- ',\+' (regexp)