Detecting Human Actions in a Live Video Feed
Identify body movements by sending a series of a person's pose data from video frames to an action-classification model.
Overview
This sample app recognizes a person's body movements, called actions, by analyzing a series of video frames with Vision and predicting the name of the movement with an action classifier. The action classifier in this sample recognizes three exercises:
- Jumping jacks
- Lunges
- Burpees
- Note: See Creating an Action Classifier Model for information on creating your own action classifier.
The app presents a live, full-screen video feed from the device's camera. When the app recognizes one or more people in a frame, it overlays a wireframe body pose on each person. At the same time, the app predicts the prominent person's current action; typically, that's the person closest to the camera.
At launch, the app configures the device's camera to produce video frames and then directs those frames through a series of methods chained together with Combine. These methods work together to analyze the frames and make action predictions by performing the following sequence of steps:
- Locate all human body poses in each frame.
- Isolate the prominent pose.
- Aggregate the prominent pose’s position data over time.
- Make action predictions by sending the aggregate data to the action classifier.
Configure the Sample Code Project
This sample app uses a camera, so you can’t run it in Simulator — you need to run it on an iOS or iPadOS device.
Start a Video Capture Session
The app’s VideoCapture class configures the device’s camera to generate video frames by creating an AVCaptureSession.
When the app first launches, or when the user rotates the device or switches between cameras, the video capture configures a camera input, a frame output, and the connection between them in its configureCaptureSession() method.
// Set the video camera to run at the action classifier's frame rate.
let modelFrameRate = ExerciseClassifier.frameRate
let input = AVCaptureDeviceInput.createCameraInput(position: cameraPosition,
frameRate: modelFrameRate)
let output = AVCaptureVideoDataOutput.withPixelFormatType(kCVPixelFormatType_32BGRA)
let success = configureCaptureConnection(input, output)
return success ? output : nil
The createCameraInput(position:frameRate:) method selects the front- or rear-facing camera and configures its frame rate so it matches that of the action classifier.
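The sample doesn't show that method's body here; the following is a minimal sketch of how such a method might select a camera and lock its frame rate. The extension and implementation details are assumptions for illustration, not the sample's exact code.

```swift
import AVFoundation

extension AVCaptureDeviceInput {
    /// Creates a camera input whose frame rate matches the action classifier.
    /// - Note: This body is an illustrative assumption, not the sample's code.
    static func createCameraInput(position: AVCaptureDevice.Position,
                                  frameRate: Double) -> AVCaptureDeviceInput? {
        // Select the front- or rear-facing wide-angle camera.
        guard let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                   for: .video,
                                                   position: position) else {
            return nil
        }

        do {
            // Lock the device to adjust its frame-duration range.
            try device.lockForConfiguration()
            let frameDuration = CMTime(value: 1,
                                       timescale: CMTimeScale(frameRate))
            device.activeVideoMinFrameDuration = frameDuration
            device.activeVideoMaxFrameDuration = frameDuration
            device.unlockForConfiguration()

            return try AVCaptureDeviceInput(device: device)
        } catch {
            return nil
        }
    }
}
```

Setting the minimum and maximum frame durations to the same value pins the camera to a single frame rate rather than a range.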
- Important: If you replace the ExerciseClassifier.mlmodel file with your own action classifier model, set the frameRate property to match the Frame Rate training parameter you used in the Create ML developer tool.
The AVCaptureVideoDataOutput.withPixelFormatType(_:) method creates an AVCaptureVideoDataOutput that produces frames with a specific pixel format.
The configureCaptureConnection(_:_:) method configures the relationship between the capture session’s camera input and video output by:
- Selecting a video orientation
- Deciding whether to horizontally flip the video
- Enabling image stabilization when applicable
if connection.isVideoOrientationSupported {
// Set the video capture's orientation to match that of the device.
connection.videoOrientation = orientation
}
if connection.isVideoMirroringSupported {
connection.isVideoMirrored = horizontalFlip
}
if connection.isVideoStabilizationSupported {
if videoStabilizationEnabled {
connection.preferredVideoStabilizationMode = .standard
} else {
connection.preferredVideoStabilizationMode = .off
}
}
The method keeps the app operating in real time, and avoids building up a frame backlog, by setting the video output’s alwaysDiscardsLateVideoFrames property to true.
// Discard newer frames if the app is busy with an earlier frame.
output.alwaysDiscardsLateVideoFrames = true
See Setting Up a Capture Session for more information on how to configure capture sessions and connect their inputs and outputs.
Create a Frame Publisher
The video capture publishes frames from its capture session by creating a PassthroughSubject in its createVideoFramePublisher() method.
// Create a new passthrough subject that publishes frames to subscribers.
let passthroughSubject = PassthroughSubject<Frame, Never>()
// Keep a reference to the publisher.
framePublisher = passthroughSubject
A passthrough subject is a concrete implementation of Subject that adapts imperative code to work with Combine. It immediately publishes the frame instance you pass to its send(_:) method, if it has a subscriber at that time.
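As a standalone illustration (not part of the sample), the following shows that a passthrough subject delivers a value only to subscribers that exist when send(_:) runs:

```swift
import Combine

let subject = PassthroughSubject<Int, Never>()

// No subscriber exists yet, so this value is discarded.
subject.send(1)

let subscription = subject.sink { value in
    print("Received \(value)")
}

// A subscriber now exists, so this value is delivered immediately.
subject.send(2) // Prints "Received 2".
```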
Next, the video capture registers itself as the video output’s delegate so it receives the
video frames from the capture session by calling the output’s
setSampleBufferDelegate(_:queue:)
method.
// Set the video capture as the video output's delegate.
videoDataOutput.setSampleBufferDelegate(self, queue: videoCaptureQueue)
The video capture forwards each frame it receives to its framePublisher by passing the frame to its send(_:) method.
extension VideoCapture: AVCaptureVideoDataOutputSampleBufferDelegate {
func captureOutput(_ output: AVCaptureOutput,
didOutput frame: Frame,
from connection: AVCaptureConnection) {
// Forward the frame through the publisher.
framePublisher?.send(frame)
}
}
Build a Publisher Chain
The sample processes each video frame, and its derivative data, with a series of methods that it connects together into a chain of Combine publishers in the VideoProcessingChain class.
Each time the video capture creates a new frame publisher, it notifies the main view controller, which then assigns the publisher to the video-processing chain’s upstreamFramePublisher property:
func videoCapture(_ videoCapture: VideoCapture,
didCreate framePublisher: FramePublisher) {
updateUILabelsWithPrediction(.startingPrediction)
// Build a new video-processing chain by assigning the new frame publisher.
videoProcessingChain.upstreamFramePublisher = framePublisher
}
Each time the property’s value changes, the video-processing chain creates a new daisy chain of publishers by calling its buildProcessingChain() method.
The method creates each new publisher by calling one of the following Publisher methods: compactMap(_:), map(_:), scan(_:_:), or filter(_:).
For example, the publisher that subscribes to the initial frame publisher is a Publishers.CompactMap that converts each Frame (a type alias of CMSampleBuffer) it receives into a CGImage by calling the video-processing chain’s imageFromFrame(_:) method.
// Create the chain of publisher-subscribers that transform the raw video
// frames from upstreamFramePublisher.
frameProcessingChain = upstreamFramePublisher
// ---- Frame (aka CMSampleBuffer) -- Frame ----
// Convert each frame to a CGImage, skipping any that don't convert.
.compactMap(imageFromFrame)
// ---- CGImage -- CGImage ----
// Detect any human body poses (or lack of them) in the frame.
.map(findPosesInFrame)
// ---- [Pose]? -- [Pose]? ----
The next sections explain the remaining publishers in the chain and the methods they use to transform their inputs.
Analyze Each Frame for Body Poses
The next publisher in the chain is a Publishers.Map
that receives each
CGImage
from the previous publisher (the compact map) by subscribing to it.
The map publisher locates any human body poses in the frame by using the video-processing chain’s findPosesInFrame(_:) method. The method invokes a VNDetectHumanBodyPoseRequest by creating a VNImageRequestHandler with the image and submitting the video-processing chain’s humanBodyPoseRequest property to the handler’s perform(_:) method.
- Important: Improve your app’s efficiency by creating and reusing a single
VNDetectHumanBodyPoseRequest
instance.
// Create a request handler for the image.
let visionRequestHandler = VNImageRequestHandler(cgImage: frame)
// Use Vision to find human body poses in the frame.
do { try visionRequestHandler.perform([humanBodyPoseRequest]) } catch {
assertionFailure("Human Pose Request failed: \(error)")
}
When the request completes, the method creates and returns a Pose array that contains one pose for every VNHumanBodyPoseObservation instance in the request’s results property.
let poses = Pose.fromObservations(humanBodyPoseRequest.results)
The Pose structure in this sample serves three main purposes:
- Calculating the observation’s area within a frame (see “Isolate A Body Pose”)
- Storing the observation’s multiarray (see “Retrieve the Multiarray”)
- Drawing an observation as a wireframe of points and lines (see “Present the Poses to the User”)
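A simplified sketch of a structure that could serve those purposes follows; apart from area, multiArray, and fromObservations(_:), which the surrounding text names, the details are assumptions for illustration, not the sample's code.

```swift
import Vision
import CoreML
import CoreGraphics

struct Pose {
    /// The observation's landmarks with a nonzero confidence.
    let landmarks: [VNRecognizedPoint]

    /// An estimate of the pose's area within the frame.
    let area: CGFloat

    /// The observation's keypoints, formatted for an action classifier.
    let multiArray: MLMultiArray?

    init?(_ observation: VNHumanBodyPoseObservation) {
        guard let points = try? observation.recognizedPoints(.all) else {
            return nil
        }
        landmarks = points.values.filter { $0.confidence > 0 }

        // Estimate the pose's area from its landmarks' bounding box.
        let xs = landmarks.map { $0.location.x }
        let ys = landmarks.map { $0.location.y }
        if let minX = xs.min(), let maxX = xs.max(),
           let minY = ys.min(), let maxY = ys.max() {
            area = (maxX - minX) * (maxY - minY)
        } else {
            area = 0
        }

        // Save the observation's multiarray for the action classifier.
        multiArray = try? observation.keypointsMultiArray()
    }

    /// Creates a pose for each observation, skipping any that fail.
    static func fromObservations(_ observations: [VNHumanBodyPoseObservation]?) -> [Pose]? {
        observations?.compactMap { Pose($0) }
    }
}
```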
For more information about using a
VNDetectHumanBodyPoseRequest
,
see Detecting Human Body Poses in Images.
Isolate a Body Pose
The next publisher in the chain is a map that chooses a single pose from the array of poses by using the video-processing chain’s isolateLargestPose(_:) method. This method selects the most prominent pose by passing a closure to the pose array’s max(by:) method.
private func isolateLargestPose(_ poses: [Pose]?) -> Pose? {
    return poses?.max { pose1, pose2 in pose1.area < pose2.area }
}
The closure compares the poses’ area estimates, with the goal of consistently selecting the same person’s pose over time, when multiple people are in frame.
- Important: Get the most accurate predictions from an action classifier by using whatever
technique you think best tracks a person from frame to frame, and use the multiarray from
that person’s
VNHumanBodyPoseObservation
result.
Retrieve the Multiarray
The next publisher in the chain is a map that publishes the MLMultiArray from the pose’s multiArray property by using the video-processing chain’s multiArrayFromPose(_:) method.
private func multiArrayFromPose(_ item: Pose?) -> MLMultiArray? {
return item?.multiArray
}
The Pose initializer copies the multiarray from its VNHumanBodyPoseObservation parameter by calling the observation’s keypointsMultiArray() method.
// Save the multiarray from the observation.
multiArray = try? observation.keypointsMultiArray()
Gather a Window of Multiarrays
The next publisher in the chain is a Publishers.Scan that receives each multiarray from its upstream publisher and gathers them into an array by providing two arguments:
- An empty multiarray-optional array ([MLMultiArray?]) as the scan publisher’s initial value
- The video-processing chain’s gatherWindow(previousWindow:multiArray:) method as the scan publisher’s transform
// ---- MLMultiArray? -- MLMultiArray? ----
// Gather a window of multiarrays, starting with an empty window.
.scan([MLMultiArray?](), gatherWindow)
// ---- [MLMultiArray?] -- [MLMultiArray?] ----
A scan publisher behaves similarly to a map, but it also maintains state. This scan publisher’s state is an array of multiarray optionals that’s initially empty. As the scan publisher receives multiarray optionals from its upstream publisher, it passes its previous state and the incoming multiarray optional as arguments to its transform.
private func gatherWindow(previousWindow: [MLMultiArray?],
multiArray: MLMultiArray?) -> [MLMultiArray?] {
var currentWindow = previousWindow
// If the previous window size is the target size, it
// means sendWindowWhenReady() just published an array window.
if previousWindow.count == predictionWindowSize {
// Advance the sliding array window by stride elements.
currentWindow.removeFirst(windowStride)
}
// Add the newest multiarray to the window.
currentWindow.append(multiArray)
// Publish the array window to the next subscriber.
// The currentWindow becomes this method's next previousWindow when
// it receives the next multiarray from the upstream publisher.
return currentWindow
}
The method:
1. Copies the previousWindow parameter to currentWindow
2. Removes windowStride elements from the front of currentWindow, if it’s full
3. Appends the multiArray parameter to the end of currentWindow
4. Returns currentWindow, which becomes the new state of the scan publisher and the next value for previousWindow when the scan publisher receives the next value from its upstream publisher and invokes the method
The video-processing chain considers a window to be full if it contains predictionWindowSize elements. When the window is full, this method removes (in step 2) the oldest elements to make room for newer elements, effectively sliding the window forward in time.
The Exercise Classifier’s calculatePredictionWindowSize() method determines the value of the prediction window size at runtime by inspecting the model’s modelDescription property.
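One way such a method might read the window size from the model's description is sketched below; the single-input assumption and this helper's exact shape are illustrative, not the sample's code.

```swift
import CoreML

/// Reads the prediction window size from a model's input description.
/// - Note: Assumes the model has a single multiarray input whose first
///   dimension is the number of frames per prediction window.
func calculatePredictionWindowSize(of model: MLModel) -> Int {
    guard let input = model.modelDescription.inputDescriptionsByName.values.first,
          let constraint = input.multiArrayConstraint,
          let windowSize = constraint.shape.first else {
        return 0
    }
    return windowSize.intValue
}
```

Deriving the size from the model, rather than hard-coding it, lets the app keep working when you swap in a classifier trained with a different window length.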
Monitor the Window Size
The next publisher in the chain is a Publishers.Filter, which only publishes an array window when the gateWindow(_:) method returns true.
// Only publish a window when it grows to the correct size.
.filter(gateWindow)
// ---- [MLMultiArray?] -- [MLMultiArray?] ----
The method returns true if the window array contains exactly the number of elements defined in predictionWindowSize. Otherwise, the method returns false, which instructs the filter publisher to discard the current window and not publish it.
private func gateWindow(_ currentWindow: [MLMultiArray?]) -> Bool {
return currentWindow.count == predictionWindowSize
}
This filter publisher, in combination with its upstream scan publisher, publishes an array of multiarray optionals ([MLMultiArray?]) once per each number of frames defined in windowStride.
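A toy, self-contained illustration of this cadence, using made-up sizes rather than the sample's values: with a window size of 4 and a stride of 2, the scan-plus-filter pair publishes a full window every 2 frames once the first window fills.

```swift
import Combine

let predictionWindowSize = 4
let windowStride = 2

let frames = PassthroughSubject<Int, Never>()

let subscription = frames
    // Gather a sliding window of frame numbers.
    .scan([Int]()) { previousWindow, frame in
        var window = previousWindow
        if window.count == predictionWindowSize {
            // Slide the window forward by the stride.
            window.removeFirst(windowStride)
        }
        window.append(frame)
        return window
    }
    // Only publish full windows.
    .filter { $0.count == predictionWindowSize }
    .sink { window in
        print("Full window: \(window)")
    }

// Frames 1...8 produce [1, 2, 3, 4], [3, 4, 5, 6], and [5, 6, 7, 8].
(1...8).forEach { frames.send($0) }
```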
Predict the Person’s Action
The next publisher in the chain makes an ActionPrediction from the multiarray window by using the predictActionWithWindow(_:) method as its transform.
// Make an activity prediction from the window.
.map(predictActionWithWindow)
// ---- ActionPrediction -- ActionPrediction ----
The method’s input array contains nil multiarray optionals, where each nil element represents a frame in which Vision wasn’t able to find any human body poses. An action classifier requires a valid, non-nil multiarray for every frame. To remove the nil elements in the array, the method creates a new multiarray, filledWindow, by:
- Copying each valid element in currentWindow
- Replacing each nil element in currentWindow with an emptyPoseMultiArray
var poseCount = 0
// Fill the nil elements with an empty pose array.
let filledWindow: [MLMultiArray] = currentWindow.map { multiArray in
if let multiArray = multiArray {
poseCount += 1
return multiArray
} else {
return Pose.emptyPoseMultiArray
}
}
The empty pose multiarray has:
- Every element set to zero
- The same value for its shape property as a multiarray from a human body-pose observation
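The sample doesn't show the empty multiarray's construction here; a sketch follows. The [1, 3, 18] shape (one frame, three values per joint, 18 joints) matches what keypointsMultiArray() typically produces for a body-pose observation, but treat both the shape and the data type as assumptions.

```swift
import CoreML

// Create a multiarray with a body-pose observation's shape:
// 1 frame x 3 values (x, y, confidence) x 18 joints.
let emptyPoseMultiArray = try? MLMultiArray(shape: [1, 3, 18],
                                            dataType: .float32)

if let multiArray = emptyPoseMultiArray {
    // A new multiarray's contents are undefined, so zero every element.
    for index in 0..<multiArray.count {
        multiArray[index] = 0
    }
}
```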
As the method iterates through each element in currentWindow, it tallies the number of non-nil elements with poseCount.
If the value of poseCount is too low, the method directly creates a noPersonPrediction action prediction.
// Only use windows with at least 60% real data to make a prediction
// with the action classifier.
let minimum = predictionWindowSize * 60 / 100
guard poseCount >= minimum else {
return ActionPrediction.noPersonPrediction
}
Otherwise, the method merges the array of multiarrays into a single, combined multiarray by
calling the
MLMultiArray(concatenating:axis:dataType:)
initializer.
// Merge the array window of multiarrays into one multiarray.
let mergedWindow = MLMultiArray(concatenating: filledWindow,
axis: 0,
dataType: .float)
The method generates an action prediction by passing the combined multiarray to the action classifier’s predictActionFromWindow(_:) helper method.
// Make a genuine prediction with the action classifier.
let prediction = actionClassifier.predictActionFromWindow(mergedWindow)
// Return the model's prediction if the confidence is high enough.
// Otherwise, return a "Low Confidence" prediction.
return checkConfidence(prediction)
The method checks the prediction’s confidence by passing the prediction to the checkConfidence(_:) helper method, which returns the same prediction if its confidence is high enough; otherwise, it returns a lowConfidencePrediction.
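A sketch of such a confidence gate follows; the ActionPrediction stand-in, the 0.6 threshold, and the property names are assumptions, not the sample's code.

```swift
// Minimal stand-ins so the sketch compiles on its own; the sample's
// ActionPrediction type differs.
struct ActionPrediction {
    let label: String
    let confidence: Double?
    static let lowConfidencePrediction = ActionPrediction(label: "Low Confidence",
                                                          confidence: nil)
}

/// Returns the prediction unchanged when its confidence clears a minimum
/// threshold; otherwise substitutes a "Low Confidence" prediction.
/// - Note: The 0.6 threshold is an assumption for illustration.
func checkConfidence(_ prediction: ActionPrediction) -> ActionPrediction {
    let minimumConfidence = 0.6
    let isHighConfidence = (prediction.confidence ?? 0) >= minimumConfidence
    return isHighConfidence ? prediction : .lowConfidencePrediction
}
```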
Present the Prediction to the User
The final component in the chain is a subscriber that notifies the video-processing chain’s delegate with the prediction using the sendPrediction(_:) method.
// Send the action prediction to the delegate.
.sink(receiveValue: sendPrediction)
The method sends the action prediction and the number of frames the prediction represents (windowStride) to the video-processing chain’s delegate, the main view controller.
// Send the prediction to the delegate on the main queue.
DispatchQueue.main.async {
self.delegate?.videoProcessingChain(self,
didPredict: actionPrediction,
for: windowStride)
}
Each time the main view controller receives an action prediction, it updates the app’s UI with the prediction and confidence in a helper method.
func videoProcessingChain(_ chain: VideoProcessingChain,
didPredict actionPrediction: ActionPrediction,
for frameCount: Int) {
if actionPrediction.isModelLabel {
// Update the total number of frames for this action.
addFrameCount(frameCount, to: actionPrediction.label)
}
// Present the prediction in the UI.
updateUILabelsWithPrediction(actionPrediction)
}
The main view controller also updates its actionFrameCounts property for action labels that come from the model, which it later sends to the Summary View Controller when the user taps the Summary button.
Present the Poses to the User
The app visualizes the result of each human body-pose request by drawing the poses on top
of the frame in which Vision found them.
Each time the video-processing chain’s findPosesInFrame(_:) creates an array of Pose instances, it sends the poses to its delegate, the main view controller.
// Send the frame and poses, if any, to the delegate on the main queue.
DispatchQueue.main.async {
self.delegate?.videoProcessingChain(self, didDetect: poses, in: frame)
}
The main view controller’s drawPoses(_:onto:) method uses the frame as the background by first drawing the frame.
// Draw the camera image first as the background.
let imageRectangle = CGRect(origin: .zero, size: frameSize)
cgContext.draw(frame, in: imageRectangle)
Next, the method draws the poses by calling their drawWireframeToContext(_:applying:) method, which draws each pose as a wireframe of lines and circles.
// Draw all the poses Vision found in the frame.
for pose in poses {
// Draw each pose as a wireframe at the scale of the image.
pose.drawWireframeToContext(cgContext, applying: pointTransform)
}
The main view controller presents the finished image to the user by assigning it to its full-screen image view.
// Update the UI's full-screen image view on the main thread.
DispatchQueue.main.async { self.imageView.image = frameWithPosesRendering }