Instance segmentation with OpenCV, made easy | source code included

OpenCV · Published 2018-12-03 02:18:47


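Whether you are taking a call from a hotel room, working in an office building, or would simply rather not show your home office, a conference-call blur feature keeps meeting attendees focused on you. The feature is especially useful for people who work from home and want to protect their family members' privacy.

To build this feature, Microsoft relies on computer vision, deep learning, and instance segmentation.

A previous post showed how to use YOLO and OpenCV for object detection; today we use Mask R-CNN to build the video blur feature.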

Instance segmentation with OpenCV

https://youtu.be/puSN8Dg-bdI

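The first part of this tutorial briefly introduces instance segmentation. After that, we use instance segmentation and OpenCV to:

detect and segment the user in the video stream;
blur the background;
add the user back onto the stream.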

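What is instance segmentation?

Figure 1: The difference between object detection and instance segmentation.

As the figure shows, object detection (left, Object Detection) draws a box around each object. Instance segmentation (right, Instance Segmentation) instead tries to determine which pixels belong to each object. The figure makes the difference between the two clear.

Performing object detection requires us to:

compute the bounding box (x, y)-coordinates of each object;
associate a class label with each bounding box.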

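Object detection therefore tells us nothing about the shape of an object; we only obtain a set of bounding-box coordinates. Instance segmentation, on the other hand, computes a pixel-wise mask for every object in the image.

Even when objects share the same class label, such as the two dogs in the figure above, the instance segmentation algorithm still reports three distinct objects in total: two dogs and one cat.

Instance segmentation gives a finer-grained understanding of the objects in an image, such as exactly which (x, y)-coordinates each object occupies. It also makes it easy to separate foreground objects from the background.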
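To make the difference concrete, here is a minimal NumPy sketch. The L-shaped mask and the helper bounding_box are made up for illustration and are not part of the tutorial's script. It shows that a pixel-wise mask can always be reduced to a bounding box, while the box alone cannot recover the object's shape.

```python
import numpy as np

def bounding_box(mask):
    """Return (startX, startY, endX, endY) enclosing the True pixels of a mask."""
    ys, xs = np.nonzero(mask)
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)

# A hypothetical 5x6 mask of an L-shaped object (True = object pixel).
mask = np.zeros((5, 6), dtype=bool)
mask[1:4, 1] = True   # vertical stroke
mask[3, 1:5] = True   # horizontal stroke

box = bounding_box(mask)
box_area = (box[2] - box[0]) * (box[3] - box[1])  # pixels covered by the box
object_area = int(mask.sum())                     # pixels covered by the object
```

Here the box covers twice as many pixels as the object itself; that extra area is background that a box-based blur would leave sharp.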

This article uses Mask R-CNN for instance segmentation.

Project structure

The project tree:

$ tree --dirsfirst
.
├── mask-rcnn-coco
│   ├── frozen_inference_graph.pb
│   ├── mask_rcnn_inception_v2_coco_2018_01_28.pbtxt
│   └── object_detection_classes_coco.txt
└── instance_segmentation.py

1 directory, 4 files

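The project consists of one directory (containing three files) and one Python script:

mask-rcnn-coco/: the Mask R-CNN model directory, which contains three files:
- frozen_inference_graph.pb: the Mask R-CNN model weights, pre-trained on the COCO dataset;
- mask_rcnn_inception_v2_coco_2018_01_28.pbtxt: the Mask R-CNN model configuration. If you want to build and train a model on your own dataset, consult online resources on how to modify this file;
- object_detection_classes_coco.txt: a text file listing the 90 classes in the dataset, one class per line.
instance_segmentation.py: the background blur script and the core of this article. We walk through the code in detail and evaluate its performance.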

Implementing instance segmentation with OpenCV

Let's implement instance segmentation with OpenCV. Open the instance_segmentation.py file and insert the following code:

# import the necessary packages
from imutils.video import VideoStream
import numpy as np
import argparse
import imutils
import time
import cv2
import os

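The script starts by importing the packages it needs, so your environment must have them installed. This article uses OpenCV 3.4.3; if your installed version differs, you may need to update it. Installing everything in an isolated virtual environment is strongly recommended, for example with conda.

Next, parse the command-line arguments: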

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--mask-rcnn", required=True,
    help="base path to mask-rcnn directory")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
    help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
    help="minimum threshold for pixel-wise mask segmentation")
ap.add_argument("-k", "--kernel", type=int, default=41,
    help="size of gaussian blur kernel")
args = vars(ap.parse_args())

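Each command-line argument is described below:

--mask-rcnn: base path to the Mask R-CNN directory (required);
--confidence: minimum probability used to filter out weak detections (default 0.5);
--threshold: minimum threshold for the pixel-wise mask segmentation (default 0.3);
--kernel: size of the Gaussian blur kernel (default 41).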
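One detail worth guarding against: cv2.GaussianBlur only accepts kernel dimensions that are positive and odd, so an even --kernel value would fail later inside the frame loop. The sketch below is an assumption-laden variant, not the original script. It uses a hypothetical helper, odd_positive_int, as the argparse type so that a bad value is rejected at startup.

```python
import argparse

def odd_positive_int(value):
    """argparse type (hypothetical helper): accept only positive odd integers."""
    number = int(value)
    if number <= 0 or number % 2 == 0:
        raise argparse.ArgumentTypeError(
            "kernel size must be a positive odd integer, got %r" % value)
    return number

ap = argparse.ArgumentParser()
ap.add_argument("-k", "--kernel", type=odd_positive_int, default=41,
    help="size of gaussian blur kernel")

# Parse an explicit argument list here so the sketch runs without a shell.
args = vars(ap.parse_args(["--kernel", "21"]))
```

Passing `--kernel 40` on the command line would then make argparse print a usage error and exit before the model is loaded.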

Next, load the dataset labels and the Mask R-CNN model:

# load the COCO class labels our Mask R-CNN was trained on
labelsPath = os.path.sep.join([args["mask_rcnn"],
    "object_detection_classes_coco.txt"])
LABELS = open(labelsPath).read().strip().split("\n")

# derive the paths to the Mask R-CNN weights and model configuration
weightsPath = os.path.sep.join([args["mask_rcnn"],
    "frozen_inference_graph.pb"])
configPath = os.path.sep.join([args["mask_rcnn"],
    "mask_rcnn_inception_v2_coco_2018_01_28.pbtxt"])

# load our Mask R-CNN trained on the COCO dataset (90 classes)
# from disk
print("[INFO] loading Mask R-CNN from disk...")
net = cv2.dnn.readNetFromTensorflow(weightsPath, configPath)

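The label file lives in the mask-rcnn-coco directory; once its path is built, the labels are loaded. weightsPath and configPath are built the same way.

With these two paths, the dnn module initializes the neural network. Mask R-CNN must be loaded into memory before any video frames are processed, and it only needs to be loaded once.

Next, build the blur kernel and start the webcam video stream: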

# construct the kernel for the Gaussian blur and initialize whether
# or not we are in "privacy mode"
K = (args["kernel"], args["kernel"])
privacy = False
 
# initialize the video stream, then allow the camera sensor to warm up
print("[INFO] starting video stream...")
vs = VideoStream(src=0).start()
time.sleep(2.0)

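The blur kernel tuple is also set from the command line. The project has two modes, "normal mode" and "privacy mode", so the boolean privacy tracks which one is active; the code above initializes it to False.

The webcam stream is started with VideoStream(src=0).start(), followed by a two-second pause that lets the camera sensor warm up.

With all variables and objects initialized, we can start processing frames from the webcam: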

# loop over frames from the video file stream
while True:
    # grab the frame from the threaded video stream
    frame = vs.read()

    # resize the frame to have a width of 600 pixels (while
    # maintaining the aspect ratio), and then grab the image
    # dimensions
    frame = imutils.resize(frame, width=600)
    (H, W) = frame.shape[:2]

    # construct a blob from the input image and then perform a
    # forward pass of the Mask R-CNN, giving us (1) the bounding
    # box coordinates of the objects in the image along with (2)
    # the pixel-wise segmentation for each specific object
    blob = cv2.dnn.blobFromImage(frame, swapRB=True, crop=False)
    net.setInput(blob)
    (boxes, masks) = net.forward(["detection_out_final",
        "detection_masks"])

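On each iteration we grab a frame and resize it to a fixed width while keeping the aspect ratio. We also extract the frame dimensions, which are needed later for scaling. We then build a blob and run a forward pass through the network.

The outputs are boxes and masks. The masks are what we ultimately need, but the data contained in boxes is required as well.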
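The way the script indexes these two arrays implies their layout, which the sketch below reproduces with synthetic data. The values, the detection count, and the 15x15 mask size are assumptions made for illustration; only the index positions (1 for the class ID, 2 for the confidence, 3:7 for the box) are taken from the tutorial's code.

```python
import numpy as np

# Synthetic stand-ins for the two network outputs (all values are made up).
num_detections, num_classes = 3, 90
boxes = np.zeros((1, 1, num_detections, 7), dtype="float32")
masks = np.random.rand(num_detections, num_classes, 15, 15).astype("float32")

# Row layout as used by the script:
# [unused here, classID, confidence, startX, startY, endX, endY]
# with the four box values normalized to the range [0, 1].
boxes[0, 0, 0] = [0, 0, 0.92, 0.25, 0.10, 0.75, 0.90]

class_id = int(boxes[0, 0, 0, 1])
confidence = float(boxes[0, 0, 0, 2])

# One low-resolution mask per detection *and* per class; the script picks
# the mask that belongs to the predicted class of detection i.
low_res_mask = masks[0, class_id]
```

Because each mask is much smaller than the object it describes, it has to be resized to the bounding-box size before it can be applied to the frame.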

Next, sort the indexes and initialize variables:

    # sort the indexes of the bounding boxes by their corresponding
    # prediction probability (in descending order)
    idxs = np.argsort(boxes[0, 0, :, 2])[::-1]

    # initialize the mask, ROI, and coordinates of the person for the
    # current frame
    mask = None
    roi = None
    coords = None

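The bounding-box indexes are sorted by their prediction probability, on the assumption that the person with the highest detection probability is our user. We then initialize mask, roi, and the bounding-box coordinates.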
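As a quick illustration of the sorting idiom, the sketch below uses made-up confidence values. np.argsort returns indexes in ascending order, and the [::-1] slice reverses them so that the most confident detection comes first.

```python
import numpy as np

# Hypothetical confidences for four detections.
confidences = np.array([0.30, 0.95, 0.10, 0.70])

# Ascending argsort, then reversed to get descending order.
idxs = np.argsort(confidences)[::-1]
best = int(idxs[0])  # index of the most confident detection
```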

Loop over the indexes and filter the results:

    # loop over the indexes
    for i in idxs:
        # extract the class ID of the detection along with the
        # confidence (i.e., probability) associated with the
        # prediction
        classID = int(boxes[0, 0, i, 1])
        confidence = boxes[0, 0, i, 2]

        # if the detection is not the 'person' class, ignore it
        if LABELS[classID] != "person":
            continue

        # filter out weak predictions by ensuring the detected
        # probability is greater than the minimum probability
        if confidence > args["confidence"]:
            # scale the bounding box coordinates back relative to the
            # size of the image and then compute the width and the
            # height of the bounding box
            box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
            (startX, startY, endX, endY) = box.astype("int")
            coords = (startX, startY, endX, endY)
            boxW = endX - startX
            boxH = endY - startY

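We loop over idxs and, for each index, extract the classID and the confidence from boxes. The first filter is "person": any other object class is skipped and the loop moves on to the next index. The second filter makes sure the prediction confidence exceeds the threshold set through the command-line argument.

If a detection passes both tests, we scale the bounding-box coordinates back to the image size and then extract the coordinates and the width and height of the object.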
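The scaling step can be sketched on its own as follows. The clipping to the frame bounds is an addition that the original script does not perform; it is included here on the assumption that a predicted box can extend slightly past the image edge, in which case the sliced region and the resized mask would no longer have the same shape. The function name scale_box is hypothetical.

```python
import numpy as np

def scale_box(norm_box, W, H):
    """Scale a normalized (startX, startY, endX, endY) box to pixel coordinates."""
    box = np.asarray(norm_box, dtype="float64") * np.array([W, H, W, H])
    (startX, startY, endX, endY) = box.astype("int")
    # Clip to the frame (assumption: not part of the original script).
    startX, endX = np.clip([startX, endX], 0, W)
    startY, endY = np.clip([startY, endY], 0, H)
    return int(startX), int(startY), int(endX), int(endY)

# A made-up box whose right edge falls outside a 600x400 frame.
coords = scale_box([0.25, 0.125, 1.25, 0.875], W=600, H=400)
boxW = coords[2] - coords[0]
boxH = coords[3] - coords[1]
```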

Compute the mask and extract the ROI:

            # extract the pixel-wise segmentation for the object,
            # resize the mask such that it's the same dimensions of
            # the bounding box, and then finally threshold to create
            # a *binary* mask
            mask = masks[i, classID]
            mask = cv2.resize(mask, (boxW, boxH),
                interpolation=cv2.INTER_NEAREST)
            mask = (mask > args["threshold"])

            # extract the ROI and break from the loop (since we make
            # the assumption there is only *one* person in the frame
            # who is also the person with the highest prediction
            # confidence)
            roi = frame[startY:endY, startX:endX][mask]
            break

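The code above extracts the mask, resizes it, and then applies the threshold to create the binary mask. An example is shown in the figure below:

Figure 2: A binary mask computed with OpenCV and instance segmentation in front of a webcam.

In the figure, white pixels are assumed to be the person (the foreground) and black pixels the background. With the mask in hand, roi is computed through NumPy array slicing. The loop then breaks, because the person with the highest probability has been found.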
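The thresholding and boolean-indexing steps can be tried on a tiny synthetic frame. This sketch assumes the mask has already been resized to the box size, and every value in it is made up. Note that indexing with a boolean mask does not return a rectangular crop: it returns a flat list of the selected pixels.

```python
import numpy as np

# A tiny fake 4x4 BGR frame and a hypothetical 2x2 bounding box.
frame = np.arange(4 * 4 * 3, dtype="uint8").reshape(4, 4, 3)
(startX, startY, endX, endY) = (1, 1, 3, 3)

# Soft mask values, already at box size; 0.3 is the script's default
# --threshold.
soft_mask = np.array([[0.9, 0.2],
                      [0.4, 0.1]])
mask = soft_mask > 0.3

# Select only the pixels inside the box where the mask is True.
roi = frame[startY:endY, startX:endX][mask]
```

Here roi has shape (2, 3): two selected pixels with three channels each, in row-major order.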

Next, initialize the output frame and, if we are in "privacy mode", compute the blur:

    # initialize our output frame
    output = frame.copy()

    # if the mask is not None *and* we are in privacy mode, then we
    # know we can apply the mask and ROI to the output image
    if mask is not None and privacy:
        # blur the output frame
        output = cv2.GaussianBlur(output, K, 0)

        # add the ROI to the output frame for only the masked region
        (startX, startY, endX, endY) = coords
        output[startY:endY, startX:endX][mask] = roi

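The output frame starts out as a plain copy of the original frame.

If both of the following hold:

we have a non-empty mask;
we are in "privacy mode";

then the whole frame is blurred and the masked, unblurred person is copied back onto the output frame.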
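The compositing step can be sketched with NumPy alone. In this sketch a constant gray image stands in for the result of cv2.GaussianBlur, and the function name composite is hypothetical. The key point is that slicing output returns a view, so assigning through the boolean mask modifies output in place.

```python
import numpy as np

def composite(frame, blurred, mask, coords):
    """Copy the masked pixels of frame onto a blurred copy of the frame."""
    (startX, startY, endX, endY) = coords
    output = blurred.copy()
    region = output[startY:endY, startX:endX]  # a view into output
    region[mask] = frame[startY:endY, startX:endX][mask]
    return output

frame = np.full((4, 4, 3), 200, dtype="uint8")   # "sharp" pixels
blurred = np.full((4, 4, 3), 50, dtype="uint8")  # stand-in for the blur
mask = np.array([[True, False],
                 [True, True]])
output = composite(frame, blurred, mask, (1, 1, 3, 3))
```

Pixels where the mask is True keep their sharp value; everything else, including the unmasked corner of the box, stays blurred.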

Next, display the output and handle keypresses:

    # show the output frame
    cv2.imshow("Video Call", output)
    key = cv2.waitKey(1) & 0xFF

    # if the `p` key was pressed, toggle privacy mode
    if key == ord("p"):
        privacy = not privacy

    # if the `q` key was pressed, break from the loop
    elif key == ord("q"):
        break

# do a bit of cleanup
cv2.destroyAllWindows()
vs.stop()

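The keypress is captured, and two keys trigger different behavior:

"p": pressing this key turns "privacy mode" on or off;
"q": pressing this key breaks out of the loop and quits the script.

On exit, the code above closes any open windows and stops the video stream.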
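The key-handling logic can be isolated into a small pure function, which makes it easy to check without a camera or a window. The function handle_key is a hypothetical refactoring, not part of the original script. cv2.waitKey returns -1 when no key is pressed, and masking with 0xFF keeps only the low byte, so "no key" becomes 255 and matches neither branch.

```python
def handle_key(key_code, privacy):
    """Return (new_privacy, should_quit) for a raw cv2.waitKey-style code."""
    key = key_code & 0xFF
    if key == ord("p"):
        return (not privacy, False)   # toggle privacy mode
    if key == ord("q"):
        return (privacy, True)        # request exit from the loop
    return (privacy, False)           # any other key, or no key at all

toggled = handle_key(ord("p"), False)
quit_requested = handle_key(ord("q"), True)
no_key = handle_key(-1, True)
```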

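Instance segmentation results

Now that the OpenCV instance segmentation script is implemented, let's see it in action.

Open a terminal and run the following command: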

$ python instance_segmentation.py --mask-rcnn mask-rcnn-coco --kernel 41
[INFO] loading Mask R-CNN from disk...
[INFO] starting video stream...

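Figure 3: A demo of a "privacy filter" for web chat.

With "privacy mode" enabled, we can:

use OpenCV instance segmentation to find the person detection with the highest probability (most likely the person closest to the camera);
blur the background of the video stream;
overlay the segmented, unblurred person onto the video stream.

A video demo follows (requires access to YouTube):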

https://youtu.be/puSN8Dg-bdI

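Watching the video, you will notice immediately that we are not getting true real-time performance: only a few frames are processed per second. Why is that?

The next section answers that question.

Limitations, drawbacks, and potential improvements

The first limitation is the most obvious one: this OpenCV instance segmentation implementation is too slow to run in real time. On a CPU it processes only a few frames per second. True real-time instance segmentation requires a GPU.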

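The problem is that, at the time of writing, OpenCV's dnn module does not officially support NVIDIA GPUs.

Once OpenCV officially supports NVIDIA GPUs in the dnn module, building real-time (or even faster than real-time) deep learning applications will become much easier. For now, the instance segmentation tutorial in this article serves only as a demonstration.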
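To put a number on "a few frames per second" for your own machine, a rough timing sketch follows. The name measure_fps and the callable process_frame are hypothetical; process_frame stands in for one full iteration of the frame loop (read, forward pass, composite), and time.sleep is used here only as a placeholder workload.

```python
import time

def measure_fps(process_frame, num_frames=20):
    """Call process_frame num_frames times and return the average frames/second."""
    start = time.perf_counter()
    for _ in range(num_frames):
        process_frame()
    elapsed = time.perf_counter() - start
    return num_frames / elapsed

# Placeholder workload: pretend each frame takes about 10 ms to process.
fps = measure_fps(lambda: time.sleep(0.01), num_frames=5)
```

Averaging over several frames smooths out per-frame jitter; timing the forward pass separately from the rest of the loop would show where the time actually goes.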

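Another possible improvement concerns how the segmented person is overlaid on the blurred background. Compared with the video blur feature in Microsoft's Office 365, this implementation looks less "smooth". Alpha blending can be used to approximate that effect.

A simple yet effective update to the instance segmentation pipeline might be to:

use morphological operations to increase the size of the mask;
apply a small amount of Gaussian blur to the mask itself to help smooth it;
scale the mask values to the range [0, 1];
create an alpha layer from the scaled mask;
overlay the smoothed mask plus person on the blurred background.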
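The steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the method used by Microsoft or by the original script: a 3x3 dilation stands in for the morphological operation, a 3x3 box blur is swapped in for the Gaussian blur, and all function names and image values are hypothetical.

```python
import numpy as np

def dilate3x3(mask):
    """Grow a boolean mask by one pixel in every direction (3x3 dilation)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    grown = np.zeros_like(mask)
    for dy in range(3):
        for dx in range(3):
            grown |= padded[dy:dy + h, dx:dx + w]
    return grown

def box_blur3x3(image):
    """Average each pixel with its 8 neighbours (stand-in for a Gaussian blur)."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    total = np.zeros((h, w), dtype="float64")
    for dy in range(3):
        for dx in range(3):
            total += padded[dy:dy + h, dx:dx + w]
    return total / 9.0

def alpha_composite(person, background, mask):
    """Blend person over background using a softened version of mask as alpha."""
    # Dilate, convert to floats in {0, 1}, then blur: values stay in [0, 1].
    alpha = box_blur3x3(dilate3x3(mask).astype("float64"))
    alpha = alpha[..., np.newaxis]  # broadcast over the color channels
    blended = alpha * person + (1.0 - alpha) * background
    return blended.astype("uint8")

person = np.full((7, 7, 3), 200, dtype="uint8")      # sharp frame
background = np.full((7, 7, 3), 40, dtype="uint8")   # blurred frame
mask = np.zeros((7, 7), dtype=bool)
mask[3, 3] = True
output = alpha_composite(person, background, mask)
```

Pixels well inside the mask take the person's value, pixels far outside take the background's, and the pixels in between receive a weighted mix, which removes the hard edge of the binary mask.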

Alternatively, compute the contours of the mask and then apply contour approximation to help create a "smoother" mask.

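Summary

After reading this article, you should know how to implement instance segmentation with OpenCV, deep learning, and Python. Broadly, instance segmentation:

detects each object in the image;
computes a pixel-wise mask for each object.

Note that even when objects belong to the same class, instance segmentation should return a unique mask for each one.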

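About the author

Adrian Rosebrock: machine learning, artificial intelligence, image processing.

This article was translated by the Alibaba Cloud Yunqi Community.

Original title: "Instance segmentation with OpenCV". Translator: 海棠. Reviewer: Uncle_LLD.

This is an abridged translation; see the original article for full details.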