std::string的find挺慢的

搞了个比较极端的测试: 64KB长的字符串里找不到:

#include <stdio.h>
#include <time.h>
// MinGW does not build against glibc, it builds against msvcrt. As such, it uses libmsvcrtXX.a instead.
// gcc and glibc are two separate products. So still no memmem().
#define _GNU_SOURCE
#include <string.h>
#include <string>
using namespace std;
int main() {
    enum { N = 1 << 16 };
    char*    p = new char[N + 1];
    for (int i = 0; i < N; i++) p[i] = (i & 15) + 'a';
    p[N] = 0;
    string    s = p;
    enum { M = 1 << 20 };
    int    n = 0;
#if 0
    const char    NEEDLE[] = "abcde";
#else
    const char    NEEDLE[] = "0123456789";
#endif
    time_t    start = time(0);
    for (int i = 0; i < M; i++)
#if 0
        //n += (int)memmem(p, N, NEEDLE, 10);
        n += (int)strstr(p, NEEDLE);
#else
        n += s.find(NEEDLE);
#endif
    printf("%dn", n); printf("%dn", time(0) - start);
}

tdm-gcc-10.3.0.exe (mingw32), -O3, 比strstr慢太多了。

Big O describes the upper bound of an algorithm. Big Omega describes the lower bound of an algorithm. Big Theta describes the tight bound of an algorithm. What’s the Difference Between Big O, Big Omega, and Big Theta? | jarednielsen.com

草堆里找针。n是草堆的长度，m是针的长度，k是字母表里字符的个数。自动机如果每个状态都有个State next[256]，太占地方了，完整的UNICODE 32位，16位不能表示所有字符。好像是把state + char和起来做key. f is a constant introduced by SIMD (单指令多数据) operations. C++又走上邪路了。:-) 我的意思是：要么别优化，因为不常用；要么换算法。还是检查mn个字符，只是每次利用SIMD指令检查多个，不能说不好，但不够标准库的水准啊。

python有个ply包，里面的lex是完全python写的，不是去调用C的lex，但用了数次re.compile，核心代码几百行的样子:

self.lexre = None             # Master regular expression. This is a list of
                                      # tuples (re, findex) where re is a compiled
                                      # regular expression and findex is a list
                                      # mapping regex group numbers to rules
        self.lexretext = None         # Current regular expression strings
        self.lexstatere = {}          # Dictionary mapping lexer states to master regexs
        self.lexstateretext = {}      # Dictionary mapping lexer states to regex strings
        self.lexstaterenames = {}     # Dictionary mapping lexer states to symbol names
        self.lexstate = 'INITIAL'     # Current lexer state
        self.lexstatestack = []       # Stack of lexer states
        self.lexstateinfo = None      # State information
        self.lexstateignore = {}      # Dictionary of ignored characters for each state
        self.lexstateerrorf = {}      # Dictionary of error functions for each state
        self.lexstateeoff = {}        # Dictionary of eof functions for each state
        self.lexreflags = 0           # Optional re compile flags
        self.lexdata = None           # Actual input data (as a string)
        self.lexpos = 0               # Current position in input text
        self.lexlen = 0               # Length of the input text
        self.lexerrorf = None         # Error rule (if any)
        self.lexeoff = None           # EOF rule (if any)
        self.lextokens = None         # List of valid tokens
        self.lexignore = ''           # Ignored characters
        self.lexliterals = ''         # Literal characters that can be passed through
        self.lexmodule = None         # Module
        self.lineno = 1               # Current line number
        self.lexoptimize = False      # Optimized mode

原文链接: https://www.cnblogs.com/funwithwords/p/15763808.html

欢迎关注

微信关注下方公众号，第一时间获取干货硬货；公众号内回复【pdf】免费获取数百本计算机经典书籍

原创文章受到原创版权保护。转载请注明出处：https://www.ccppcoding.com/archives/183405

非原创文章文中已经注明原地址，如有侵权，联系删除

关注公众号【高性能架构探索】，第一时间获取最新文章

转载文章受原作者版权保护。转载请注明原作者出处！

std::string的find挺慢的

相关推荐