This article is the Day 9 entry of the 😺TECHSCORE Advent Calendar 2019😺.

🐈 OCR with Clojure + AWS Textract! 🐈

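What I find tedious about reading English text in a paper book is looking words up in a dictionary.
With a paper book, every lookup means picking up a phone or similar device and typing the word in by hand, which is a real chore.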

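Fortunately I own a document scanner, so I decided to digitize the pages.
The idea is to read the text out of the scanned images and turn it into plain text, so that I can look words up quickly with the apps I already use on my PC.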

Keeping the plain-text versions in one place should also make them easy to search, which seems handy.

🐈 AWS Textract 🐈

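AWS seems to have a service for everything, and that includes an OCR-style service called Textract.
Besides plain text recognition, Textract can also read forms and tables, but this time I only use plain text recognition.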

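With the AWS CLI a shell script would be perfectly sufficient, but I wanted to add a step that groups lines into paragraphs, so I wrote a small amount of Clojure code.
Apparently you can also write this in a language called Java, but I figured a more mainstream language would be better, so I went with Clojure.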

The API is simple: you just point it at the location of an object in S3, AWS's storage service.
It returns JSON, which you then interpret as you see fit.
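
As a minimal sketch of that call, the request can be built as plain data and the response checked for failure before it is interpreted. The helper names `detect-text-request` and `anomaly?` are hypothetical and not part of the article's code; the sketch assumes the Cognitect aws-api convention that a failed operation returns a map containing a `:cognitect.anomalies/category` key instead of throwing.

```clojure
(defn detect-text-request
  "Builds the op map that would be passed to aws/invoke for DetectDocumentText.
   bucket and s3-key identify the image already uploaded to S3."
  [bucket s3-key]
  {:op :DetectDocumentText
   :request {:Document {:S3Object {:Bucket bucket
                                   :Name s3-key}}}})

(defn anomaly?
  "True when a response map looks like an aws-api error rather than a result
   (assumption: errors carry a :cognitect.anomalies/category key)."
  [response]
  (contains? response :cognitect.anomalies/category))
```

A caller could then pass `(detect-text-request "my-bucket" "scan.png")` to `aws/invoke` and skip the text extraction when `anomaly?` returns true for the result.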

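The JSON is structured as a flat list of blocks for pages, lines, and words, and each block's Relationships property describes how it relates to the others.
From these I extract the line blocks and generate plain text.
While I'm at it, since text is hard to read when every line runs together, I group the lines into paragraphs and put a blank line between them.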
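
To make the block structure concrete, here is a small hand-written sample in the shape of a parsed response, together with two hypothetical helpers (`line-texts` and `child-texts`, not from the article's code). Every id, text, and coordinate in the sample is made up for illustration; a real response carries more fields per block.

```clojure
;; Hand-written sample; all ids, texts, and coordinates are invented.
(def sample-response
  {:Blocks
   [{:BlockType "PAGE" :Id "page-1"
     :Relationships [{:Type "CHILD" :Ids ["line-1" "line-2"]}]}
    {:BlockType "LINE" :Id "line-1" :Text "Hello world"
     :Geometry {:BoundingBox {:Left 0.1 :Top 0.10 :Width 0.3 :Height 0.02}}
     :Relationships [{:Type "CHILD" :Ids ["word-1" "word-2"]}]}
    {:BlockType "LINE" :Id "line-2" :Text "Goodbye"
     :Geometry {:BoundingBox {:Left 0.1 :Top 0.13 :Width 0.2 :Height 0.02}}
     :Relationships [{:Type "CHILD" :Ids ["word-3"]}]}
    {:BlockType "WORD" :Id "word-1" :Text "Hello"}
    {:BlockType "WORD" :Id "word-2" :Text "world"}
    {:BlockType "WORD" :Id "word-3" :Text "Goodbye"}]})

(defn line-texts
  "Returns the text of every LINE block, in the order the blocks appear."
  [response]
  (->> (:Blocks response)
       (filter #(= "LINE" (:BlockType %)))
       (mapv :Text)))

(defn child-texts
  "Follows the CHILD relationships of a block and returns the children's text."
  [response block]
  (let [by-id (into {} (map (juxt :Id identity)) (:Blocks response))]
    (->> (:Relationships block)
         (filter #(= "CHILD" (:Type %)))
         (mapcat :Ids)
         (mapv #(:Text (by-id %))))))
```

Only the LINE blocks and their bounding boxes are needed for the plain-text output described here; the relationships matter when you want to walk from a page down to individual words.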

```clojure
(ns textract.core
  (:gen-class)
  (:require [clojure.data.json :as json]
            [clojure.string :as string]
            [clojure.java.io :as io]
            [cognitect.aws.client.api :as aws]
            [clj-time.core :as tm]
            [clj-time.format :as tf]))

;; S3 bucket used as temporary storage for the uploaded image.
(def ^:dynamic *bucket* "YOUR-BUCKET-NAME")

;; Timestamp format used to build the S3 object key.
(def key-format (tf/formatter "yyyyMMdd-HHmmss" (tm/default-time-zone)))

(def uniq-num (atom 1))

(defn base-line
  "Returns the reference position of a block (vertical center of its bounding box)."
  [block]
  (let [{top :Top height :Height} (-> block :Geometry :BoundingBox)]
    (+ top
       (/ height 2))))

(defn extract-line-blocks
  "Extracts only the blocks that represent lines, sorted from top to bottom."
  [content]
  (->>
    (:Blocks content)
    (filter #(= "LINE" (:BlockType %)))
    (sort-by base-line)))

(defn get-min-dist
  "Returns the smallest distance (along the y axis) between consecutive blocks."
  [blocks]
  (let [base-lines (map base-line blocks)]
    (if (= 1 (count base-lines))
      ;; With a single line there is no distance to measure; the value is never
      ;; compared against anything in that case.
      (first base-lines)
      (apply min (map - (rest base-lines) base-lines)))))

(defn make-near?
  "Returns a function that decides whether two blocks are close enough
   to belong to the same paragraph."
  [blocks]
  (let [dist (* 1.5 (get-min-dist blocks))]
    (fn [a b]
      (if (and a b)
        (< (Math/abs (- (base-line a) (base-line b)))
           dist)
        true))))

(defn paragraphize
  "Groups line blocks into paragraphs."
  [blocks]
  (if (empty? blocks)
    []
    (let [near? (make-near? blocks)]
      (loop [result []
             last-part []
             coll blocks]
        (let [[cur follow] coll
              ;; Closes the current paragraph with cur as its last line.
              new-part #(conj result
                              (conj last-part cur))]
          (if follow
            (if (near? cur follow)
              (recur result
                     (conj last-part cur)
                     (rest coll))
              (recur
                (new-part)
                []
                (rest coll)))
            (new-part)))))))

(defn textize
  "Turns paragraphs into text, with a blank line between paragraphs."
  [paragraphs]
  (->>
    paragraphs
    (map #(map :Text %))
    (map #(string/join "\n" %))
    (string/join "\n\n")))

(defn textract
  "Runs OCR through the AWS API on a document image stored in S3."
  [client s3-key]
  (aws/invoke
    client
    {:op :DetectDocumentText
     :request {:Document
               {:S3Object {:Bucket *bucket*
                           :Name s3-key}}}}))

(defn s3-put
  "Uploads a file to S3."
  [client file-path s3-key]
  ;; with-open closes the stream once the synchronous upload has finished.
  (with-open [body (io/input-stream file-path)]
    (aws/invoke
      client
      {:op :PutObject
       :request {:Bucket *bucket*
                 :Key s3-key
                 :Body body}})))

(defn s3-rm
  "Deletes a file from S3."
  [client s3-key]
  (aws/invoke
    client
    {:op :DeleteObject
     :request {:Bucket *bucket*
               :Key s3-key}}))

(defn -main
  ([]
   (println "Usage: textract <INput-png> <OUTput-json> <OUTput-text>")
   (println "                <INput-json> <OUTput-text>"))
  ([input-png output-json output-text]
   (let [s3-key (str (tf/unparse key-format (tm/now))
                     "-"
                     (swap! uniq-num inc)
                     ".png")
         s3-client (aws/client {:api :s3 :region "ap-southeast-1"})
         tx-client (aws/client {:api :textract :region "ap-southeast-1"})]
     (s3-put s3-client input-png s3-key)
     (try
       (let [result (textract tx-client s3-key)
             text (-> result extract-line-blocks paragraphize textize)]
         (run! io/make-parents [output-text output-json])
         (spit output-json (json/write-str result))
         (spit output-text text)
         (println text))
       (finally
         ;; Remove the temporary object even if OCR or file output fails.
         (s3-rm s3-client s3-key)))))
  ([input-json output-text]
   ;; Re-generate the text from a previously saved JSON response.
   (let [text (-> input-json
                  slurp
                  (json/read-str :key-fn keyword)
                  extract-line-blocks
                  paragraphize
                  textize)]
     (io/make-parents output-text)
     (spit output-text text)
     (println text))))

; dependencies [[org.clojure/clojure "1.10.1"]
;               [org.clojure/data.json "0.2.7"]
;               [com.cognitect.aws/api       "0.8.391"]
;               [com.cognitect.aws/endpoints "1.1.11.670"]
;               [com.cognitect.aws/s3        "770.2.568.0"]
;               [com.cognitect.aws/textract  "747.2.533.0"]
;               [clj-time "0.15.2"]]
```
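
The paragraph grouping in the listing is easier to follow with numbers. The sketch below works on bare baseline values rather than Textract blocks and uses `reduce` instead of the `loop` in the listing, so it illustrates the same threshold rule (1.5 times the smallest gap) without being the article's implementation. The function name `group-by-gap` and the sample values are made up.

```clojure
(defn group-by-gap
  "Splits sorted baselines into groups. A new group starts wherever the gap
   to the previous baseline is not below 1.5 times the smallest gap."
  [baselines]
  (if (< (count baselines) 2)
    ;; Zero or one baseline: nothing to compare.
    (if (seq baselines) [(vec baselines)] [])
    (let [gaps (map - (rest baselines) baselines)
          threshold (* 1.5 (apply min gaps))]
      (reduce (fn [groups [prev cur]]
                (if (< (- cur prev) threshold)
                  ;; Close enough: append to the current group.
                  (conj (pop groups) (conj (peek groups) cur))
                  ;; Too far apart: start a new group.
                  (conj groups [cur])))
              [[(first baselines)]]
              (map vector baselines (rest baselines))))))

;; Baselines 0.10, 0.12, 0.14 are about 0.02 apart, then a jump of 0.06.
;; Smallest gap is about 0.02, so the threshold is about 0.03.
(group-by-gap [0.10 0.12 0.14 0.20 0.22])
;; => [[0.1 0.12 0.14] [0.2 0.22]]
```

One caveat of this rule: if two lines share exactly the same baseline, the smallest gap is 0, the threshold becomes 0, and every line ends up in its own group. The listing appears to behave the same way in that case.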

🐈 Summary 🐈

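I actually read a few pages this way and did not notice any misrecognized text, and being able to look a word up just by hovering the cursor over it is convenient.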

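I was also reminded that in Clojure (or rather, in Lisp) the REPL (interactive environment) is really handy.
Because you write code one function at a time, it is easy to experiment while editing the code.
An interactive REPL is an especially good match for functional languages.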

🐈 The Cool (Parenthesized) Cat 🐈

Here is a picture.

When I ran OCR on the picture of the cool parenthesized cat above, it found a message on the paw pads: "ad 4". What on earth could that mean...?