12. Handling Multibyte Characters

This section looks at how Go handles multibyte characters such as Japanese text.

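A string is in effect a read-only slice of bytes. It is not guaranteed to hold a value in any defined format such as Unicode or UTF-8.
Go source code is always UTF-8.
A string literal that contains no byte-level escapes always holds valid UTF-8.
rune is an alias for int32 and represents a Unicode code point.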
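
The points above can be sketched as follows. This is a minimal illustration, not part of the original text: the helper describe and the variable names are hypothetical, and utf8.ValidString from the unicode/utf8 package is used to check whether a string holds valid UTF-8.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// describe is a hypothetical helper: it reports the byte length of s
// and whether s holds valid UTF-8.
func describe(s string) (int, bool) {
    return len(s), utf8.ValidString(s)
}

func main() {
    // A plain literal in UTF-8 source code holds valid UTF-8.
    plain := "日本語"
    // Byte-level escapes can put arbitrary bytes into a string.
    raw := "\xff\xfe"

    n, ok := describe(plain)
    fmt.Println(n, ok) // 9 true
    n, ok = describe(raw)
    fmt.Println(n, ok) // 2 false

    // A rune is an int32 value holding a Unicode code point.
    var r rune = '日'
    fmt.Printf("%T %d %U\n", r, r, r) // int32 26085 U+65E5
}
```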

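Go gives UTF-8 special treatment when looping over a string.
A conventional for loop indexes the string one byte at a time, whereas a range loop decodes one UTF-8-encoded rune per iteration. The index it yields is the byte position at which that rune starts.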

package main

import (
    "fmt"
)

const nihongo = "日本語"

func main() {

    // A conventional for loop processes the string byte by byte.
    for i := 0; i < len(nihongo); i++ {
        // %x prints the byte in hexadecimal.
        fmt.Printf("%x starts at byte position %d\n", nihongo[i], i)
    }

    // e6 starts at byte position 0
    // 97 starts at byte position 1
    // a5 starts at byte position 2
    // e6 starts at byte position 3
    // 9c starts at byte position 4
    // ac starts at byte position 5
    // e8 starts at byte position 6
    // aa starts at byte position 7
    // 9e starts at byte position 8

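    // A range loop advances rune by rune; index is the byte position of each rune.
    for index, runeValue := range nihongo {
        // %#U prints the Unicode code point (U+XXXX) followed by the character itself.
        fmt.Printf("%#U starts at byte position %d\n", runeValue, index)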
    }

    // U+65E5 '日' starts at byte position 0
    // U+672C '本' starts at byte position 3
    // U+8A9E '語' starts at byte position 6

}
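
To make the behavior of the range loop more concrete, the same byte positions can be computed by decoding the string by hand. The following is a sketch under the assumption that utf8.DecodeRuneInString is an acceptable stand-in for what range does internally. The function name runeOffsets is hypothetical and does not appear in the original text.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// runeOffsets (hypothetical helper) returns the byte position at which
// each rune of s starts, decoding by hand instead of using range.
func runeOffsets(s string) []int {
    var offsets []int
    for i := 0; i < len(s); {
        // DecodeRuneInString returns the first rune of its argument
        // and the number of bytes that rune occupies.
        _, size := utf8.DecodeRuneInString(s[i:])
        offsets = append(offsets, i)
        i += size
    }
    return offsets
}

func main() {
    fmt.Println(runeOffsets("日本語")) // [0 3 6]
}
```

Each of the three characters takes three bytes in UTF-8, which is why the positions advance by three.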

When counting characters, len() returns the number of bytes,
whereas RuneCountInString from the unicode/utf8 package returns the number of runes.

package main

import (
    "fmt"
    "unicode/utf8"
)

const nihongo = "日本語"

func main() {

    // len() counts bytes, so this prints 9.
    fmt.Printf("bytes %d\n", len(nihongo))

    // RuneCountInString() counts runes, so this prints 3.
    fmt.Printf("runes %d\n", utf8.RuneCountInString(nihongo))

}

When working with strings that contain multibyte characters, index, count, and slice by rune rather than by byte.
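
One common case is taking the first few characters of a string. The sketch below assumes that converting to []rune is acceptable for the string sizes involved. The helper firstRunes is hypothetical and only illustrates the idea.

```go
package main

import "fmt"

// firstRunes (hypothetical helper) returns the first n runes of s.
func firstRunes(s string, n int) string {
    r := []rune(s) // decode the UTF-8 bytes into code points
    if n < 0 {
        n = 0
    }
    if n > len(r) {
        n = len(r)
    }
    return string(r[:n])
}

func main() {
    s := "日本語"

    // Slicing by byte position can cut a character in half.
    fmt.Printf("%q\n", s[:2]) // "\xe6\x97"

    // Slicing by rune keeps whole characters.
    fmt.Println(firstRunes(s, 2)) // 日本
}
```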
