【go原始碼分析】strings.go 裡的那些騷操作

原始碼分析 Go語言 · 發表 2018-09-28 20:44:49

摘要： go version go1.11 darwin/amd64 strings.go 檔案中定義了近40個常用的字串操作函式（公共函式）。以下是主要的幾個函式。函式簡介 ...

go version go1.11 darwin/amd64

strings.go 檔案中定義了近40個常用的字串操作函式（公共函式）。以下是主要的幾個函式。

函式	簡介
`Index(s, substr string) int`	返回`substr` 在`s` 中第一次出現的位置，不存在返回`-1` ；採用`RabinKarp` 演算法
`Split(s, sep string) []string`	根據`sep` 把字串`s` 進行切分，返回切分後的陣列
`Join(a []string, sep string) string`	跟`Split` 功能剛好相反
`Repeat(s string, count int) string`	返回字串`s` 重複`count` 次得到的字串
`Trim(s string, cutset string) string`	返回去除首尾存在於`cutset` 的字元的切片
`Replace(s, old, new string, n int) string`	字串替換
`EqualFold(s, t string) bool`	判斷兩個字串代表的資料夾是否相等（忽略大小寫）

以及ToUpper ToLower Title （Title 函式把單詞轉換成標題形式，不是ToTitle ）。

還有一些以上函式派生出的其他函式。比如：Contains 基本是通過Index 函式實現的；與Index 原理一致的LastIndex 函式；與Trim 有關的TrimLeft TrimRight 等。

接下來，本文會對Index Trim Join Repeat Replace 函式進行分析。

ps:len 返回的是字串的位元組數，不是字元數。字元數請使用utf8.RuneCountInString

`Index` : RabinKarp 演算法實現

Index(s, substr string) int ,返回substr 在s 中第一次出現的位置，不存在返回-1 ；採用RabinKarp 演算法

Index 函式會先對substr 的長度n 進行判斷，對特殊情況做快速處理。

其次，如果長度n 以及len(s) 足夠小，則使用BruteForce 演算法：即暴力匹配，拿substr 與s[i:i+n] 進行比較，如果相等，返回i ，其中i = (from 0 to len(s) - n) ...

最後，會先嚐試暴力匹配，如果匹配失敗次數超過臨界點，則換成RabinKarp 演算法。

（為了方便閱讀，文中不放全部程式碼，只展示核心程式碼與部分結構程式碼）

Index

func Index(s, substr string) int {
n := len(substr)// len 返回的是位元組數
switch {
case n == 0:
return 0
case n == 1:
// substr 是單位元組字串，則直接單位元組進行比較
return IndexByte(s, substr[0])
case n == len(s):
if substr == s {
return 0
}
return -1
case n > len(s):
return -1
case n <= bytealg.MaxLen:
// Use brute force when s and substr both are small
if len(s) <= bytealg.MaxBruteForce {
return bytealg.IndexString(s, substr)
}

// 這裡有一大段省略的程式碼
// 迴圈嘗試 substr == s[i:i+n]
// 如果失敗次數過多，則使用 bytealg.IndexString(s, substr)
}

// 這裡有一大段省略的程式碼
// 迴圈嘗試 substr == s[i:i+n]
// 如果失敗次數過多，則使用 indexRabinKarp(s[i:], substr)

t := s[:len(s)-n+1]
for i < len(t) {

// ... 省略程式碼

// 如果失敗次數過多，則使用 RabinKarp 演算法
if fails >= 4+i>>4 && i < len(t) {
// 這裡使用 s[i:] 作為引數
// 是因為前面的 s[:i] 都已經嘗試過了
j := indexRabinKarp(s[i:], substr)
if j < 0 {
return -1
}
return i + j
}
}

return -1
}

在看indexRabinKarp 函式之前，我們先了解一下RabinKarp 演算法。

RobinKarp 演算法是由 Robin 和 Karp 提出的字串匹配演算法。該演算法在實際應用中有較好的表現。

演算法的核心步驟：

const primeRK = 16777619 // 大素數
對substr 構造hash 值。n = len(substr) ，hash = (substr[0]*pow(primeRK, n-1) + substr[1]*pow(primeRK, n-2) + ... + substr[n-1]*pow(primeRK, 0)) % anotherBiggerPrime
對s 的每n 個子串按照相同邏輯構造hash 值，判斷與substr 的hash 是否相等；如果hash 相等，則比較子串是否真的與substr 相等
重複第三步，直到找到，或者未找到。

ps:

該演算法之所以快，是因為s[i+1, i+n+1] 的hash 值可以由s[i, i+n] 的hash 值計算出。即h(i+1) = ((h(i) - s[i] * pow(primeRK, n-1)) * primeRK + s[i+n+1]) % anotherBiggerPrime
另外，go 計算hash 時並沒有% anotherBiggerPrime ，而是定義了hash 為uint32 型別，利用整型溢位實現了對2**32 取模的效果。（一般來說是對另一個大素數取模，顯然這裡不是，不過畢竟這麼大的數也沒啥大影響）

該演算法預處理時間為O(n) ，n 為len(substr) ，執行最壞時間為O((n-m+1)m) ，m 為len(s) 。最壞情況為每個子串的hash 都與substr 的一樣。在平均情況下，執行時間還是很好的。

除了RabinKarp 演算法外，還要一些其他的字串匹配演算法。《演算法》導論中介紹了另外兩種優秀的演算法，分別是有限自動機 與Knuth-Morris-Pratt 演算法（即KMP 演算法），這兩個演算法的執行時間都為O(m) 。

下面是indexRabinKarp 函式

indexRabinKarp

func indexRabinKarp(s, substr string) int {
// Rabin-Karp search
// hashss 是 substr 根據上述方法計算出的 hash 值
// pow 是 primeRK 的 len(substr) 次冪
hashss, pow := hashStr(substr)
n := len(substr)

// 計算 s[:n] 的 hash 值
var h uint32
for i := 0; i < n; i++ {
h = h*primeRK + uint32(s[i])
}
if h == hashss && s[:n] == substr {
return 0
}

for i := n; i < len(s); {
// 計算下一個子串的 hash 值
h *= primeRK
h += uint32(s[i])
h -= pow * uint32(s[i-n])
i++

// 如果 hash 相等 且子串相等，則返回對應下標
if h == hashss && s[i-n:i] == substr {
return i - n
}
}
return -1
}

hashStr 函式跟計算s[:n] 的邏輯一致。不過不得不提一下pow 的計算方法。

hashStr

func hashStr(sep string) (uint32, uint32) {
hash := uint32(0)
for i := 0; i < len(sep); i++ {
hash = hash*primeRK + uint32(sep[i])
}

// 下面我用 python 的乘方元素符 ** 代表乘方
// 我們已 len(sep) 為 6 為例來看此函式
// 6 的二進位制是 110
// 每次迴圈，pow 和 sq 分別為
// i: 110pow: 1sq: rk
// i: 11pow: 1sq: rk ** 2
// i: 1pow: 1 * (rk ** 2)sq: rk ** 4
// i: 0pow: 1* (rk ** 2) * (rk ** 4)sq: rk ** 8
// pow: 1* (rk ** 2) * (rk ** 4) = 1 * (rk ** 6) 即是 pow(rk, 6)
var pow, sq uint32 = 1, primeRK
for i := len(sep); i > 0; i >>= 1 {
if i&1 != 0 {
pow *= sq
}
sq *= sq
}
return hash, pow
}

以上是Index 函式的實現邏輯。

`Trim` : 出神入化的位操作

Trim(s string, cutset string) string 返回去除首尾存在於cutset 的字元的切片。

執行 fmt.Println(strings.Trim("hello world", "hld"))
輸出 ello wor

Trim 的本質邏輯也比較簡單：

根據cutset 構造一個函式，該函式簽名為func(rune) bool ，接受一個rune 型別的值，返回該值是否在cutset 中
然後呼叫TrimLeft TrimRight ；這兩個函式呼叫了indexFunc ，其邏輯也比較簡單，不再贅述

函式makeCutsetFunc(cutset string) func(rune) bool 就是剛才提到的構造判斷rune 值是否在cutset 中的函式的函式。

makeCutsetFunc

func makeCutsetFunc(cutset string) func(rune) bool {
// 如果 cutset 是單個字元，則直接返回一個簡單函式，
// 該函式判斷入參 r 是否與 cutset[0] 相等
if len(cutset) == 1 && cutset[0] < utf8.RuneSelf {
return func(r rune) bool {
return r == rune(cutset[0])
}
}

// 如果 cutset 全是 ASCII 碼
// 則使用構造的 as (asciiSet型別)判斷 rune 是否在 cutset 中
if as, isASCII := makeASCIISet(cutset); isASCII {
return func(r rune) bool {
return r < utf8.RuneSelf && as.contains(byte(r))
}
}

// 呼叫 IndexRune 方法判斷 r 是否在 cutset 中
// IndexRune 其實就是 Index 的變種
return func(r rune) bool { return IndexRune(cutset, r) >= 0 }
}

其中，最有意思的要數makeASCIISet 函式，該函式用了一個[8]uint32 陣列實現了 128 個ASCII 碼的hash 表。

asciiSet

// asciiSet 一共 32 個位元組，一共 256 位，
// 其中低 128 位分別代表了 128 個 ascii 碼
type asciiSet [8]uint32

// makeASCIISet creates a set of ASCII characters and reports whether all
// characters in chars are ASCII.
func makeASCIISet(chars string) (as asciiSet, ok bool) {
for i := 0; i < len(chars); i++ {
c := chars[i]
// const utf8.RuneSelf = 0x80
// 小於 utf8.RuneSelf 的值是 ASCII 碼
// 大於 utf8.RuneSelf 的值是其他 utf8 字元的部分
if c >= utf8.RuneSelf {
return as, false
}

// ASCII 的範圍是 0000 0000 - 0111 1111
// c >> 5 的範圍是 000 - 011，即最大為 3
// 31 的二進位制是 0001 1111
// 1 << uint(c&31) 的結果剛好也在 uint 範圍內

as[c>>5] |= 1 << uint(c&31)
}
return as, true
}

// contains reports whether c is inside the set.
// 為了相容入參 c 為 byte 型別， c >> 5 < 8
// 所以 asciiSet 型別為 [8]uint32，陣列長度為 8
// 否則如果只考慮 128 個 ascii 碼的話，[4]uint32 就夠了
func (as *asciiSet) contains(c byte) bool {
return (as[c>>5] & (1 << uint(c&31))) != 0
}

以上是Trim 函式及其位操作。

從`Join` `Repeat` `Replace` 看字串如何生成

這三個函式的邏輯都很簡單，不再贅述。

頻繁申請記憶體是很耗費時間的，所以在生成某個字串時，如果能夠預知字串的長度，就能直接申請對應長度的記憶體，然後呼叫copy(dst, src []Type) int 函式把字元複製到對應位置，最後把[]byte 強轉成字串型別即可。

Join

func Join(a []string, sep string) string {
// 省略了部分特殊情況處理的程式碼

// 計算目標字串總長度
n := len(sep) * (len(a) - 1)
for i := 0; i < len(a); i++ {
n += len(a[i])
}

// 申請記憶體
b := make([]byte, n)
bp := copy(b, a[0])
// 複製內容
for _, s := range a[1:] {
bp += copy(b[bp:], sep)
bp += copy(b[bp:], s)
}
// 返回資料
return string(b)
}

Repeat

func Repeat(s string, count int) string {
// 特殊情況處理
if count < 0 {
panic("strings: negative Repeat count")
} else if count > 0 && len(s)*count/count != len(s) {
panic("strings: Repeat count causes overflow")
}

b := make([]byte, len(s)*count)
bp := copy(b, s)

// 2倍，4倍，8倍的擴大，直到 bp 不小於目標長度
for bp < len(b) {
copy(b[bp:], b[:bp])
bp *= 2
}
return string(b)
}

Replace

func Replace(s, old, new string, n int) string {
// 省略一下特殊情況程式碼

// 計算新字串的長度
t := make([]byte, len(s)+n*(len(new)-len(old)))
w := 0
start := 0
for i := 0; i < n; i++ {
j := start
if len(old) == 0 {
if i > 0 {
_, wid := utf8.DecodeRuneInString(s[start:])
j += wid
}
} else {
j += Index(s[start:], old)
}
// 查詢舊字串的位置，複製
w += copy(t[w:], s[start:j])
w += copy(t[w:], new)
start = j + len(old)
}
w += copy(t[w:], s[start:])
return string(t[0:w])
}

以上是關於生成字串時避免多次分配記憶體的高效做法。

Y_xx

【go原始碼分析】strings.go 裡的那些騷操作

Index : RabinKarp 演算法實現

Trim : 出神入化的位操作

從Join Repeat Replace 看字串如何生成

您可能也會喜歡…

`Index` : RabinKarp 演算法實現

`Trim` : 出神入化的位操作

從`Join` `Repeat` `Replace` 看字串如何生成