Detecting Human Actions in a Live Video Feed

Identify body movements by sending a series of video frames with a person’s pose data to an action-classification model.

Overview

This sample app recognizes a person’s body movements, called actions, by analyzing a series of video frames with Vision and predicting the name of the movement with an action classifier. The action classifier in this sample recognizes three exercises:

  • Jumping jacks
  • Lunges
  • Burpees

A flow diagram that illustrates the purpose of an action classifier, starting with a person performing jumping jacks in front of a device camera and ending with a prediction label. From the top of the diagram, the camera produces video frames. The Vision framework uses the frames to generate a window of body-position data. The action classifier consumes the data window and predicts the label: Jumping Jacks.

The app presents a live, full-screen video feed from the device’s camera. When the app recognizes one or more people in a frame, it overlays a wireframe body pose on each person. At the same time, the app predicts the prominent person’s current action; typically, that’s the person closest to the camera.

A diagram that represents the sample app’s main view. The image prominently shows a person performing jumping jacks. The app draws a body wireframe of circles connected by lines over the person’s arms, legs, and torso. Two text labels at the bottom of the view display Jumping Jacks and 98.7%.

At launch, the app configures the device’s camera to produce video frames and then directs them through a series of methods chained together with Combine. These methods work together to analyze the frames and make action predictions by performing the following sequence of steps:

  1. Locate all human body poses in each frame.
  2. Isolate the most prominent pose.
  3. Aggregate the prominent pose’s position data over time.
  4. Make an action prediction by sending the aggregated data to the action classifier.

A flow diagram that demonstrates the path of video frames through the sample app, starting with the device camera and continuing through the video capture, the video-processing chain, and the main view controller, and ending with a model of the app’s interface. The interface shows a person, overlaid with a wireframe on the arms, legs, and torso, performing jumping jacks above two labels: “Jumping Jacks” and “98.7%”.

Configure the Sample Code Project

This sample app uses a camera, so you can’t run it in Simulator; you need to run it on an iOS or iPadOS device.

Start the Video Capture Session

The app’s VideoCapture class configures the device’s camera to generate video frames by creating an AVCaptureSession.

When the app first launches, or when the user rotates the device or switches between cameras, the video capture configures a camera input, a frame output, and the connection between them in its configureCaptureSession() method.

// Set the video camera to run at the action classifier's frame rate.
let modelFrameRate = ExerciseClassifier.frameRate

let input = AVCaptureDeviceInput.createCameraInput(position: cameraPosition,
                                                   frameRate: modelFrameRate)

let output = AVCaptureVideoDataOutput.withPixelFormatType(kCVPixelFormatType_32BGRA)

let success = configureCaptureConnection(input, output)
return success ? output : nil

The createCameraInput(position:frameRate:) method selects the front- or rear-facing camera and configures its frame rate so it matches that of the action classifier.

  • Important: If you replace the ExerciseClassifier.mlmodel file with your own action classifier model, set the frameRate property to match the Frame Rate training parameter you used in the Create ML developer tool.
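The sample’s implementation of createCameraInput(position:frameRate:) isn’t reproduced in this overview. The following sketch shows one possible shape for it; the extension, its error handling, and the frame-rate check are assumptions rather than the sample’s exact code.

import AVFoundation

extension AVCaptureDeviceInput {
    // Sketch: create a camera input that runs at a specific frame rate.
    static func createCameraInput(position: AVCaptureDevice.Position,
                                  frameRate: Double) -> AVCaptureDeviceInput? {
        guard let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                   for: .video,
                                                   position: position) else {
            return nil
        }

        do {
            // Only lock the frame duration when the active format supports the rate.
            let supportsRate = device.activeFormat.videoSupportedFrameRateRanges
                .contains { ($0.minFrameRate...$0.maxFrameRate).contains(frameRate) }

            if supportsRate {
                try device.lockForConfiguration()
                let frameDuration = CMTime(value: 1, timescale: CMTimeScale(frameRate))
                device.activeVideoMinFrameDuration = frameDuration
                device.activeVideoMaxFrameDuration = frameDuration
                device.unlockForConfiguration()
            }

            return try AVCaptureDeviceInput(device: device)
        } catch {
            return nil
        }
    }
}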

The AVCaptureVideoDataOutput.withPixelFormatType(_:) method creates an AVCaptureVideoDataOutput that produces frames with a specific pixel format.
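Its body is also omitted here; a minimal sketch, assuming a small AVCaptureVideoDataOutput extension, could look like this:

import AVFoundation

extension AVCaptureVideoDataOutput {
    // Sketch: build a video data output that emits frames in one pixel format.
    static func withPixelFormatType(_ pixelFormatType: OSType) -> AVCaptureVideoDataOutput {
        let output = AVCaptureVideoDataOutput()
        output.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: pixelFormatType]
        return output
    }
}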

The configureCaptureConnection(_:_:) method configures the relationship between the capture session’s camera input and video output by:

  • Selecting a video orientation
  • Deciding whether to horizontally flip the video
  • Enabling image stabilization when applicable

if connection.isVideoOrientationSupported {
    // Set the video capture's orientation to match that of the device.
    connection.videoOrientation = orientation
}

if connection.isVideoMirroringSupported {
    connection.isVideoMirrored = horizontalFlip
}

if connection.isVideoStabilizationSupported {
    if videoStabilizationEnabled {
        connection.preferredVideoStabilizationMode = .standard
    } else {
        connection.preferredVideoStabilizationMode = .off
    }
}

The method keeps the app operating in real time, and avoids building up a frame backlog, by setting the video output’s alwaysDiscardsLateVideoFrames property to true.

// Discard newer frames if the app is busy with an earlier frame.
output.alwaysDiscardsLateVideoFrames = true

See Setting Up a Capture Session for more information on how to configure capture sessions and connect their inputs and outputs.

Create a Frame Publisher

The video capture publishes frames from its capture session by creating a PassthroughSubject in its createVideoFramePublisher() method.

// Create a new passthrough subject that publishes frames to subscribers.
let passthroughSubject = PassthroughSubject<Frame, Never>()

// Keep a reference to the publisher.
framePublisher = passthroughSubject

A passthrough subject is a concrete implementation of Subject that adapts imperative code to work with Combine. It immediately publishes the instance you pass to its send(_:) method, if it has a subscriber at that time.
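The following standalone snippet, which isn’t part of the sample, illustrates that behavior: values sent before anything subscribes are dropped.

import Combine

let subject = PassthroughSubject<Int, Never>()

subject.send(1) // Dropped; nothing has subscribed yet.

let cancellable = subject.sink { value in
    print("Received \(value)")
}

subject.send(2) // Prints "Received 2" because a subscriber now exists.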

Next, the video capture registers itself as the video output’s delegate so it receives the video frames from the capture session by calling the output’s setSampleBufferDelegate(_:queue:) method.

// Set the video capture as the video output's delegate.
videoDataOutput.setSampleBufferDelegate(self, queue: videoCaptureQueue)

The video capture forwards each frame it receives to its framePublisher by passing the frame to the publisher’s send(_:) method.

extension VideoCapture: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput frame: Frame,
                       from connection: AVCaptureConnection) {

        // Forward the frame through the publisher.
        framePublisher?.send(frame)
    }
}

Build a Publisher Chain

The sample processes each video frame, and its derivative data, with a series of methods that it connects together into a chain of Combine publishers in the VideoProcessingChain class.

Each time the video capture creates a new frame publisher, it notifies the main view controller, which then assigns the publisher to the video-processing chain’s upstreamFramePublisher property:

func videoCapture(_ videoCapture: VideoCapture,
                  didCreate framePublisher: FramePublisher) {
    updateUILabelsWithPrediction(.startingPrediction)
    
    // Build a new video-processing chain by assigning the new frame publisher.
    videoProcessingChain.upstreamFramePublisher = framePublisher
}

Each time the property’s value changes, the video-processing chain creates a new daisy chain of publishers by calling its buildProcessingChain() method.
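One straightforward way to trigger that rebuild is a property observer. The sketch below assumes the property stores the FramePublisher type from the previous section; the sample’s actual declaration may differ.

// Sketch: rebuild the publisher chain whenever a new frame publisher arrives.
var upstreamFramePublisher: FramePublisher? {
    didSet { buildProcessingChain() }
}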

Flow diagram of the video-processing chain that consumes video frames and produces information for the main view controller. The first two items in the chain are Convert to CGImage and Find Poses. The final item in the chain is Send Prediction, which the diagram separates from the Find Poses item with a vertical ellipsis that indicates an indeterminate number of chain items in between. An arrow, labeled CGImage plus poses, goes from the chain item Find Poses to the main view controller. Another arrow, labeled action prediction, goes from the chain item Send Prediction to the main view controller.

The buildProcessingChain() method creates each new publisher by calling one of the following Publisher methods: compactMap(_:), map(_:), scan(_:_:), filter(_:), or sink(receiveValue:).

For example, the publisher that subscribes to the initial frame publisher is a Publishers.CompactMap that converts each Frame (a type alias of CMSampleBuffer) it receives into a CGImage by calling the video-processing chain’s imageFromFrame(_:) method.

// Create the chain of publisher-subscribers that transform the raw video
// frames from upstreamFramePublisher.
frameProcessingChain = upstreamFramePublisher
    // ---- Frame (aka CMSampleBuffer) -- Frame ----

    // Convert each frame to a CGImage, skipping any that don't convert.
    .compactMap(imageFromFrame)

    // ---- CGImage -- CGImage ----

    // Detect any human body poses (or lack of them) in the frame.
    .map(findPosesInFrame)

    // ---- [Pose]? -- [Pose]? ----

The next sections explain the remaining publishers in the chain and the methods they use to transform their inputs.

Analyze Each Frame for Body Poses

The next publisher in the chain is a Publishers.Map that receives each CGImage from the previous publisher (the compact map) by subscribing to it. The map publisher locates any human body poses in the frame by using the video-processing chain’s findPosesInFrame(_:) method. The method invokes a VNDetectHumanBodyPoseRequest by creating a VNImageRequestHandler with the image and submitting the video-processing chain’s humanBodyPoseRequest property to the handler’s perform(_:) method.

// Create a request handler for the image.
let visionRequestHandler = VNImageRequestHandler(cgImage: frame)

// Use Vision to find human body poses in the frame.
do { try visionRequestHandler.perform([humanBodyPoseRequest]) } catch {
    assertionFailure("Human Pose Request failed: \(error)")
}

When the request completes, the method creates and returns a Pose array that contains one pose for every VNHumanBodyPoseObservation instance in the request’s results property.

let poses = Pose.fromObservations(humanBodyPoseRequest.results)
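A sketch of what such a factory could look like, assuming Pose has a failable initializer that accepts a VNHumanBodyPoseObservation (the initializer is an assumption for illustration):

import Vision

extension Pose {
    // Sketch: convert each observation into a Pose, or return nil when there are none.
    static func fromObservations(_ observations: [VNHumanBodyPoseObservation]?) -> [Pose]? {
        observations?.compactMap { observation in Pose(observation) }
    }
}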

The Pose structure in this sample serves three main purposes:

  • Calculating the observation’s area within a frame (see “Isolate a Body Pose”)
  • Storing the observation’s multiarray (see “Retrieve the Multiarray”)
  • Drawing an observation as a wireframe of points and lines (see “Present the Poses to the User”)

For more information about using a VNDetectHumanBodyPoseRequest, see Detecting Human Body Poses in Images.

Isolate a Body Pose

The next publisher in the chain is a map that chooses a single pose from the array of poses by using the video-processing chain’s isolateLargestPose(_:) method. This method selects the most prominent pose by passing a closure to the pose array’s max(by:) method.

private func isolateLargestPose(_ poses: [Pose]?) -> Pose? {
    return poses?.max(by: { pose1, pose2 in pose1.area < pose2.area })
}

The closure compares the poses’ area estimates, with the goal of consistently selecting the same person’s pose over time, when multiple people are in frame.
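The sample’s exact area computation isn’t shown here. One plausible estimate, sketched below, uses the bounding box of the observation’s recognized landmarks in normalized image coordinates; treat this as an assumption about how an area property could be computed, not the sample’s implementation.

import Vision

// Sketch: estimate a pose's relative size from the bounding box of its landmarks.
func estimatedArea(of observation: VNHumanBodyPoseObservation) -> CGFloat {
    guard let points = try? observation.recognizedPoints(.all) else { return 0 }

    // Keep only the landmarks Vision actually detected.
    let locations = points.values
        .filter { $0.confidence > 0 }
        .map(\.location)

    guard let minX = locations.map(\.x).min(),
          let maxX = locations.map(\.x).max(),
          let minY = locations.map(\.y).min(),
          let maxY = locations.map(\.y).max() else { return 0 }

    return (maxX - minX) * (maxY - minY)
}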

  • Important: Get the most accurate predictions from an action classifier by using whatever technique you think best tracks a person from frame to frame, and use the multiarray from that person’s VNHumanBodyPoseObservation result.

Retrieve the Multiarray

The next publisher in the chain is a map that publishes the MLMultiArray from the pose’s multiArray property by using the video-processing chain’s multiArrayFromPose(_:) method.

private func multiArrayFromPose(_ item: Pose?) -> MLMultiArray? {
    return item?.multiArray
}

The Pose initializer copies the multiarray from its VNHumanBodyPoseObservation parameter by calling the observation’s keypointsMultiArray() method.

// Save the multiarray from the observation.
multiArray = try? observation.keypointsMultiArray()

Gather a Window of Multiarrays

The next publisher in the chain is a Publishers.Scan that receives each multiarray from its upstream publisher and gathers them into an array by providing two arguments:

  • An empty array of multiarray optionals ([MLMultiArray?]) as the scan publisher’s initial value
  • The video-processing chain’s gatherWindow(previousWindow:multiArray:) method as the scan publisher’s transform

// ---- MLMultiArray? -- MLMultiArray? ----

// Gather a window of multiarrays, starting with an empty window.
.scan([MLMultiArray?](), gatherWindow)

// ---- [MLMultiArray?] -- [MLMultiArray?] ----

A scan publisher behaves similarly to a map, but it also maintains a state. The following scan publisher’s state is an array of multiarray optionals that’s initially empty. As the scan publisher receives multiarray optionals from its upstream publisher, the scan publisher passes its previous state and the incoming multiarray optional as arguments to its transform.

private func gatherWindow(previousWindow: [MLMultiArray?],
                          multiArray: MLMultiArray?) -> [MLMultiArray?] {
    var currentWindow = previousWindow

    // If the previous window size is the target size, it
    // means sendWindowWhenReady() just published an array window.
    if previousWindow.count == predictionWindowSize {
        // Advance the sliding array window by stride elements.
        currentWindow.removeFirst(windowStride)
    }

    // Add the newest multiarray to the window.
    currentWindow.append(multiArray)

    // Publish the array window to the next subscriber.
    // The currentWindow becomes this method's next previousWindow when
    // it receives the next multiarray from the upstream publisher.
    return currentWindow
}

The method:

  1. Copies the previousWindow parameter to currentWindow
  2. Removes windowStride elements from the front of currentWindow, if it’s full
  3. Appends the multiArray parameter to the end of currentWindow
  4. Returns currentWindow, which becomes the new state of the scan publisher and the next value for previousWindow when the scan publisher receives the next value from its upstream publisher and invokes the method

The video-processing chain considers a window to be full if it contains predictionWindowSize elements. When the window is full, this method removes (in step 2) the oldest elements to make room for newer elements, effectively sliding the window forward in time.

The Exercise Classifier’s calculatePredictionWindowSize() method determines the value of the prediction window size at runtime by inspecting the model’s modelDescription property.
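A sketch of how that inspection might work, assuming the classifier’s pose input is a multiarray whose first dimension is the number of frames per prediction (the lookup strategy here is an assumption):

import CoreML

// Sketch: derive the prediction window size from the model's multiarray input.
func calculatePredictionWindowSize(of model: MLModel) -> Int {
    let inputs = model.modelDescription.inputDescriptionsByName.values

    // Find the model's pose-window input, which is a multiarray.
    guard let constraint = inputs.compactMap({ $0.multiArrayConstraint }).first else {
        return 0
    }

    // The first dimension of the input is the number of frames per prediction.
    return constraint.shape.first?.intValue ?? 0
}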

Monitor the Window Size

The next publisher in the chain is a Publishers.Filter, which only publishes an array window when the gateWindow(_:) method returns true.

// Only publish a window when it grows to the correct size.
.filter(gateWindow)

// ---- [MLMultiArray?] -- [MLMultiArray?] ----

The method returns true if the window array contains exactly the number of elements defined in predictionWindowSize. Otherwise, the method returns false, which instructs the filter publisher to discard the current window and not publish it.

private func gateWindow(_ currentWindow: [MLMultiArray?]) -> Bool {
    return currentWindow.count == predictionWindowSize
}

This filter publisher, in combination with its upstream scan publisher, publishes an array of multiarray optionals ([MLMultiArray?]) once every windowStride frames.

Predict the Person’s Action

The next publisher in the chain makes an ActionPrediction from the multiarray window by using the predictActionWithWindow(_:) method as its transform.

// Make an activity prediction from the window.
.map(predictActionWithWindow)

// ---- ActionPrediction -- ActionPrediction ----

The method’s input array contains multiarray optionals, where each nil element represents a frame in which Vision wasn’t able to find any human body poses. An action classifier requires a valid, non-nil multiarray for every frame. To remove the nil elements in the array, the method creates a new array, filledWindow, by:

  • Copying each valid element in currentWindow
  • Replacing each nil element in currentWindow with an emptyPoseMultiArray

var poseCount = 0

// Fill the nil elements with an empty pose array.
let filledWindow: [MLMultiArray] = currentWindow.map { multiArray in
    if let multiArray = multiArray {
        poseCount += 1
        return multiArray
    } else {
        return Pose.emptyPoseMultiArray
    }
}

The empty pose multiarray has:

  • Every element set to zero
  • The same value for its shape property as a multiarray from a human body-pose observation
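A zero-filled placeholder like that could be built as follows. This is a sketch: the 1 x 3 x 18 shape matches the multiarray that keypointsMultiArray() returns for a body-pose observation, but treat the construction details as assumptions.

import CoreML

extension Pose {
    // Sketch: a zero-filled stand-in for one frame's body-pose multiarray.
    static let emptyPoseMultiArray: MLMultiArray = {
        guard let multiArray = try? MLMultiArray(shape: [1, 3, 18], dataType: .double) else {
            fatalError("Unable to create an empty pose multiarray.")
        }

        // Set every element to zero.
        for index in 0..<multiArray.count {
            multiArray[index] = 0
        }

        return multiArray
    }()
}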

As the method iterates through each element in currentWindow, it tallies the number of non-nil elements with poseCount.

If the value of poseCount is too low, the method directly creates a noPersonPrediction action prediction.

// Only use windows with at least 60% real data to make a prediction
// with the action classifier.
let minimum = predictionWindowSize * 60 / 100
guard poseCount >= minimum else {
    return ActionPrediction.noPersonPrediction
}

Otherwise, the method merges the array of multiarrays into a single, combined multiarray by calling the MLMultiArray(concatenating:axis:dataType:) initializer.

// Merge the array window of multiarrays into one multiarray.
let mergedWindow = MLMultiArray(concatenating: filledWindow,
                                axis: 0,
                                dataType: .float)

The method generates an action prediction by passing the combined multiarray to the action classifier’s predictActionFromWindow(_:) helper method.

// Make a genuine prediction with the action classifier.
let prediction = actionClassifier.predictActionFromWindow(mergedWindow)

// Return the model's prediction if the confidence is high enough.
// Otherwise, return a "Low Confidence" prediction.
return checkConfidence(prediction)

The method checks the prediction’s confidence by passing the prediction to the checkConfidence(_:) helper method, which returns the same prediction if its confidence is high enough; otherwise, it returns a lowConfidencePrediction.
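A sketch of such a confidence gate follows; the confidence property access and the 0.6 threshold are assumptions for illustration.

// Sketch: return the model's prediction only when its confidence clears a minimum.
private func checkConfidence(_ actionPrediction: ActionPrediction) -> ActionPrediction {
    // An illustrative threshold; the sample's actual value may differ.
    let minimumConfidence = 0.6

    let isLowConfidence = actionPrediction.confidence < minimumConfidence
    return isLowConfidence ? ActionPrediction.lowConfidencePrediction : actionPrediction
}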

Present the Prediction to the User

The final component in the chain is a subscriber that notifies the video-processing chain’s delegate with the prediction by using the sendPrediction(_:) method.

// Send the action prediction to the delegate.
.sink(receiveValue: sendPrediction)

The method sends the action prediction and the number of frames the prediction represents (windowStride) to the video-processing chain’s delegate, the main view controller.

// Send the prediction to the delegate on the main queue.
DispatchQueue.main.async {
    self.delegate?.videoProcessingChain(self,
                                        didPredict: actionPrediction,
                                        for: windowStride)
}

Each time the main view controller receives an action prediction, it updates the app’s UI with the prediction and confidence in a helper method.

func videoProcessingChain(_ chain: VideoProcessingChain,
                          didPredict actionPrediction: ActionPrediction,
                          for frameCount: Int) {

    if actionPrediction.isModelLabel {
        // Update the total number of frames for this action.
        addFrameCount(frameCount, to: actionPrediction.label)
    }

    // Present the prediction in the UI.
    updateUILabelsWithPrediction(actionPrediction)
}

The main view controller also updates its actionFrameCounts property for action labels that come from the model, which it later sends to the Summary View Controller when the user taps the Summary button.
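A sketch of that tally, assuming actionFrameCounts is a dictionary keyed by action label:

// Sketch: accumulate the number of frames observed for each action label.
// Assumes actionFrameCounts is declared as [String: Int].
private func addFrameCount(_ frameCount: Int, to actionLabel: String) {
    let totalFrames = (actionFrameCounts[actionLabel] ?? 0) + frameCount
    actionFrameCounts[actionLabel] = totalFrames
}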

Present the Poses to the User

The app visualizes the result of each human body-pose request by drawing the poses on top of the frame in which Vision found them. Each time the video-processing chain’s findPosesInFrame(_:) method creates an array of Pose instances, it sends the poses to its delegate, the main view controller.

// Send the frame and poses, if any, to the delegate on the main queue.
DispatchQueue.main.async {
    self.delegate?.videoProcessingChain(self, didDetect: poses, in: frame)
}

The main view controller’s drawPoses(_:onto:) method uses the frame as the background by drawing the frame first.

// Draw the camera image first as the background.
let imageRectangle = CGRect(origin: .zero, size: frameSize)
cgContext.draw(frame, in: imageRectangle)

Next, the method draws the poses by calling their drawWireframeToContext(_:applying:) method, which draws each pose as a wireframe of lines and circles.

// Draw all the poses Vision found in the frame.
for pose in poses {
    // Draw each pose as a wireframe at the scale of the image.
    pose.drawWireframeToContext(cgContext, applying: pointTransform)
}

The main view controller presents the finished image to the user by assigning it to its full-screen image view.

// Update the UI's full-screen image view on the main thread.
DispatchQueue.main.async { self.imageView.image = frameWithPosesRendering }
