Detecting Human Actions in a Live Video Feed
Identify body movements by sending a series of a person's pose data from video frames to an action-classification model.
Overview
This sample app recognizes a person's body movements, called actions, by analyzing a series of video frames with Vision and predicting the name of the movement with an action classifier. The action classifier in this sample recognizes three exercises:
- Jumping jacks
- Lunges
- Burpees
- Note: See Creating an Action Classifier Model for information on creating your own action classifier.
The app presents a live, full-screen video feed from the device's camera. When the app recognizes one or more people in a frame, it overlays a wireframe body pose on each person. At the same time, the app predicts the prominent person's current action; typically, that's the person closest to the camera.
At launch, the app configures the device's camera to produce video frames and then directs those frames through a series of methods chained together with Combine. These methods work together to analyze the frames and make action predictions by performing the following sequence of steps:
- Locate all human body poses in each frame.
- Isolate the prominent pose.
- Aggregate the prominent pose’s position data over time.
- Make action predictions by sending the aggregate data to the action classifier.
Configure the Sample Code Project
This sample app uses a camera, so you can’t run it in Simulator — you need to run it on an iOS or iPadOS device.
Start a Video Capture Session
The app’s VideoCapture class configures the device’s camera to generate video frames by creating an AVCaptureSession.
When the app first launches, or when the user rotates the device or switches between cameras, the video capture configures a camera input, a frame output, and the connection between them in its configureCaptureSession() method.
// Set the video camera to run at the action classifier's frame rate.
let modelFrameRate = ExerciseClassifier.frameRate
let input = AVCaptureDeviceInput.createCameraInput(position: cameraPosition,
frameRate: modelFrameRate)
let output = AVCaptureVideoDataOutput.withPixelFormatType(kCVPixelFormatType_32BGRA)
let success = configureCaptureConnection(input, output)
return success ? output : nil
The createCameraInput(position:frameRate:) method selects the front- or rear-facing camera and configures its frame rate so it matches that of the action classifier.
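The sample doesn't show that method's body here; the following is a minimal sketch of how such a method might select a camera and lock its frame rate. The extension and implementation details are assumptions for illustration, not the sample's exact code.

```swift
import AVFoundation

extension AVCaptureDeviceInput {
    /// Creates a camera input whose frame rate matches the action classifier.
    /// - Note: This body is an illustrative assumption, not the sample's code.
    static func createCameraInput(position: AVCaptureDevice.Position,
                                  frameRate: Double) -> AVCaptureDeviceInput? {
        // Select the front- or rear-facing wide-angle camera.
        guard let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                   for: .video,
                                                   position: position) else {
            return nil
        }

        do {
            // Lock the device to adjust its frame-duration range.
            try device.lockForConfiguration()
            let frameDuration = CMTime(value: 1,
                                       timescale: CMTimeScale(frameRate))
            device.activeVideoMinFrameDuration = frameDuration
            device.activeVideoMaxFrameDuration = frameDuration
            device.unlockForConfiguration()

            return try AVCaptureDeviceInput(device: device)
        } catch {
            return nil
        }
    }
}
```

Setting the minimum and maximum frame durations to the same value pins the camera to a single frame rate rather than a range.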
- Important: If you replace the ExerciseClassifier.mlmodel file with your own action classifier model, set the frameRate property to match the Frame Rate training parameter you used in the Create ML developer tool.
The AVCaptureVideoDataOutput.withPixelFormatType(_:) method creates an AVCaptureVideoDataOutput that produces frames with a specific pixel format.
The configureCaptureConnection(_:_:) method configures the relationship between the capture session’s camera input and video output by:
- Selecting a video orientation
- Deciding whether to horizontally flip the video
- Enabling image stabilization when applicable
if connection.isVideoOrientationSupported {
// Set the video capture's orientation to match that of the device.
connection.videoOrientation = orientation
}
if connection.isVideoMirroringSupported {
connection.isVideoMirrored = horizontalFlip
}
if connection.isVideoStabilizationSupported {
if videoStabilizationEnabled {
connection.preferredVideoStabilizationMode = .standard
} else {
connection.preferredVideoStabilizationMode = .off
}
}
The method keeps the app operating in real time, and avoids building up a frame backlog, by setting the video output’s alwaysDiscardsLateVideoFrames property to true.
// Discard newer frames if the app is busy with an earlier frame.
output.alwaysDiscardsLateVideoFrames = true
See Setting Up a Capture Session for more information on how to configure capture sessions and connect their inputs and outputs.
Create a Frame Publisher
The video capture publishes frames from its capture session by creating a PassthroughSubject in its createVideoFramePublisher() method.
// Create a new passthrough subject that publishes frames to subscribers.
let passthroughSubject = PassthroughSubject<Frame, Never>()
// Keep a reference to the publisher.
framePublisher = passthroughSubject
A passthrough subject is a concrete implementation of Subject that adapts imperative code to work with Combine. It immediately publishes the frame instance you pass to its send(_:) method, if it has a subscriber at that time.
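As a standalone illustration (not part of the sample), the following shows that a passthrough subject delivers a value only to subscribers that exist when send(_:) runs:

```swift
import Combine

let subject = PassthroughSubject<Int, Never>()

// No subscriber exists yet, so this value is discarded.
subject.send(1)

let subscription = subject.sink { value in
    print("Received \(value)")
}

// A subscriber now exists, so this value is delivered immediately.
subject.send(2) // Prints "Received 2".
```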
Next, the video capture registers itself as the video output’s delegate so it receives the
video frames from the capture session by calling the output’s
setSampleBufferDelegate(_:queue:)
method.
// Set the video capture as the video output's delegate.
videoDataOutput.setSampleBufferDelegate(self, queue: videoCaptureQueue)
The video capture forwards each frame it receives to its framePublisher by passing the frame to its send(_:) method.
extension VideoCapture: AVCaptureVideoDataOutputSampleBufferDelegate {
func captureOutput(_ output: AVCaptureOutput,
didOutput frame: Frame,
from connection: AVCaptureConnection) {
// Forward the frame through the publisher.
framePublisher?.send(frame)
}
}
Build a Publisher Chain
The sample processes each video frame, and its derivative data, with a series of methods that it connects together into a chain of Combine publishers in the VideoProcessingChain class.
Each time the video capture creates a new frame publisher, it notifies the main view controller, which then assigns the publisher to the video-processing chain’s upstreamFramePublisher property:
func videoCapture(_ videoCapture: VideoCapture,
didCreate framePublisher: FramePublisher) {
updateUILabelsWithPrediction(.startingPrediction)
// Build a new video-processing chain by assigning the new frame publisher.
videoProcessingChain.upstreamFramePublisher = framePublisher
}
Each time the property’s value changes, the video-processing chain creates a new daisy chain of publishers by calling its buildProcessingChain() method.
The method creates each new publisher by calling one of the following Publisher methods: compactMap(_:), map(_:), scan(_:_:), or filter(_:).
For example, the publisher that subscribes to the initial frame publisher is a Publishers.CompactMap that converts each Frame (a type alias of CMSampleBuffer) it receives into a CGImage by calling the video-processing chain’s imageFromFrame(_:) method.
// Create the chain of publisher-subscribers that transform the raw video
// frames from upstreamFramePublisher.
frameProcessingChain = upstreamFramePublisher
// ---- Frame (aka CMSampleBuffer) -- Frame ----
// Convert each frame to a CGImage, skipping any that don't convert.
.compactMap(imageFromFrame)
// ---- CGImage -- CGImage ----
// Detect any human body poses (or lack of them) in the frame.
.map(findPosesInFrame)
// ---- [Pose]? -- [Pose]? ----
The next sections explain the remaining publishers in the chain and the methods they use to transform their inputs.
Analyze Each Frame for Body Poses
The next publisher in the chain is a Publishers.Map
that receives each
CGImage
from the previous publisher (the compact map) by subscribing to it.
The map publisher locates any human body poses in the frame by using the video-processing chain’s findPosesInFrame(_:) method. The method invokes a VNDetectHumanBodyPoseRequest by creating a VNImageRequestHandler with the image and submitting the video-processing chain’s humanBodyPoseRequest property to the handler’s perform(_:) method.
- Important: Improve your app’s efficiency by creating and reusing a single
VNDetectHumanBodyPoseRequest
instance.
// Create a request handler for the image.
let visionRequestHandler = VNImageRequestHandler(cgImage: frame)
// Use Vision to find human body poses in the frame.
do { try visionRequestHandler.perform([humanBodyPoseRequest]) } catch {
assertionFailure("Human Pose Request failed: \(error)")
}
When the request completes, the method creates and returns a Pose array that contains one pose for every VNHumanBodyPoseObservation instance in the request’s results property.
let poses = Pose.fromObservations(humanBodyPoseRequest.results)
The Pose structure in this sample serves three main purposes:
- Calculating the observation’s area within a frame (see “Isolate A Body Pose”)
- Storing the observation’s multiarray (see “Retrieve the Multiarray”)
- Drawing an observation as a wireframe of points and lines (see “Present the Poses to the User”)
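A simplified sketch of a structure that could serve those purposes follows; apart from area, multiArray, and fromObservations(_:), which the surrounding text names, the details are assumptions for illustration, not the sample's code.

```swift
import Vision
import CoreML
import CoreGraphics

struct Pose {
    /// The observation's landmarks with a nonzero confidence.
    let landmarks: [VNRecognizedPoint]

    /// An estimate of the pose's area within the frame.
    let area: CGFloat

    /// The observation's keypoints, formatted for an action classifier.
    let multiArray: MLMultiArray?

    init?(_ observation: VNHumanBodyPoseObservation) {
        guard let points = try? observation.recognizedPoints(.all) else {
            return nil
        }
        landmarks = points.values.filter { $0.confidence > 0 }

        // Estimate the pose's area from its landmarks' bounding box.
        let xs = landmarks.map { $0.location.x }
        let ys = landmarks.map { $0.location.y }
        if let minX = xs.min(), let maxX = xs.max(),
           let minY = ys.min(), let maxY = ys.max() {
            area = (maxX - minX) * (maxY - minY)
        } else {
            area = 0
        }

        // Save the observation's multiarray for the action classifier.
        multiArray = try? observation.keypointsMultiArray()
    }

    /// Creates a pose for each observation, skipping any that fail.
    static func fromObservations(_ observations: [VNHumanBodyPoseObservation]?) -> [Pose]? {
        observations?.compactMap { Pose($0) }
    }
}
```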
For more information about using a
VNDetectHumanBodyPoseRequest
,
see Detecting Human Body Poses in Images.
Isolate a Body Pose
The next publisher in the chain is a map that chooses a single pose from the array of poses by using the video-processing chain’s isolateLargestPose(_:) method. This method selects the most prominent pose by passing a closure to the pose array’s max(by:) method.
private func isolateLargestPose(_ poses: [Pose]?) -> Pose? {
    return poses?.max { pose1, pose2 in pose1.area < pose2.area }
}
The closure compares the poses’ area estimates, with the goal of consistently selecting the same person’s pose over time, when multiple people are in frame.
- Important: Get the most accurate predictions from an action classifier by using whatever
technique you think best tracks a person from frame to frame, and use the multiarray from
that person’s
VNHumanBodyPoseObservation
result.
Retrieve the Multiarray
The next publisher in the chain is a map that publishes the MLMultiArray from the pose’s multiArray property by using the video-processing chain’s multiArrayFromPose(_:) method.
private func multiArrayFromPose(_ item: Pose?) -> MLMultiArray? {
return item?.multiArray
}
The Pose initializer copies the multiarray from its VNHumanBodyPoseObservation parameter by calling the observation’s keypointsMultiArray() method.
// Save the multiarray from the observation.
multiArray = try? observation.keypointsMultiArray()
Gather a Window of Multiarrays
The next publisher in the chain is a Publishers.Scan that receives each multiarray from its upstream publisher and gathers them into an array by providing two arguments:
- An empty multiarray-optional array ([MLMultiArray?]) as the scan publisher’s initial value
- The video-processing chain’s gatherWindow(previousWindow:multiArray:) method as the scan publisher’s transform
// ---- MLMultiArray? -- MLMultiArray? ----
// Gather a window of multiarrays, starting with an empty window.
.scan([MLMultiArray?](), gatherWindow)
// ---- [MLMultiArray?] -- [MLMultiArray?] ----
A scan publisher behaves similarly to a map, but it also maintains state. This scan publisher’s state is an array of multiarray optionals that’s initially empty. As the scan publisher receives multiarray optionals from its upstream publisher, it passes its previous state and the incoming multiarray optional as arguments to its transform.
private func gatherWindow(previousWindow: [MLMultiArray?],
multiArray: MLMultiArray?) -> [MLMultiArray?] {
var currentWindow = previousWindow
// If the previous window size is the target size, it
// means sendWindowWhenReady() just published an array window.
if previousWindow.count == predictionWindowSize {
// Advance the sliding array window by stride elements.
currentWindow.removeFirst(windowStride)
}
// Add the newest multiarray to the window.
currentWindow.append(multiArray)
// Publish the array window to the next subscriber.
// The currentWindow becomes this method's next previousWindow when
// it receives the next multiarray from the upstream publisher.
return currentWindow
}
The method:
1. Copies the previousWindow parameter to currentWindow
2. Removes windowStride elements from the front of currentWindow, if it’s full
3. Appends the multiArray parameter to the end of currentWindow
4. Returns currentWindow, which becomes the new state of the scan publisher and the next value for previousWindow when the scan publisher receives the next value from its upstream publisher and invokes the method
The video-processing chain considers a window to be full if it contains predictionWindowSize elements. When the window is full, this method removes (in step 2) the oldest elements to make room for newer elements, effectively sliding the window forward in time.
The Exercise Classifier’s calculatePredictionWindowSize() method determines the value of the prediction window size at runtime by inspecting the model’s modelDescription property.
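One way such a method might read the window size from the model's description is sketched below; the single-input assumption and this helper's exact shape are illustrative, not the sample's code.

```swift
import CoreML

/// Reads the prediction window size from a model's input description.
/// - Note: Assumes the model has a single multiarray input whose first
///   dimension is the number of frames per prediction window.
func calculatePredictionWindowSize(of model: MLModel) -> Int {
    guard let input = model.modelDescription.inputDescriptionsByName.values.first,
          let constraint = input.multiArrayConstraint,
          let windowSize = constraint.shape.first else {
        return 0
    }
    return windowSize.intValue
}
```

Deriving the size from the model, rather than hard-coding it, lets the app keep working when you swap in a classifier trained with a different window length.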
Monitor the Window Size
The next publisher in the chain is a Publishers.Filter, which only publishes an array window when the gateWindow(_:) method returns true.
// Only publish a window when it grows to the correct size.
.filter(gateWindow)
// ---- [MLMultiArray?] -- [MLMultiArray?] ----
The method returns true if the window array contains exactly the number of elements defined in predictionWindowSize. Otherwise, the method returns false, which instructs the filter publisher to discard the current window and not publish it.
private func gateWindow(_ currentWindow: [MLMultiArray?]) -> Bool {
return currentWindow.count == predictionWindowSize
}
This filter publisher, in combination with its upstream scan publisher, publishes an array of multiarray optionals ([MLMultiArray?]) once per each number of frames defined in windowStride.
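A toy, self-contained illustration of this cadence, using made-up sizes rather than the sample's values: with a window size of 4 and a stride of 2, the scan-plus-filter pair publishes a full window every 2 frames once the first window fills.

```swift
import Combine

let predictionWindowSize = 4
let windowStride = 2

let frames = PassthroughSubject<Int, Never>()

let subscription = frames
    // Gather a sliding window of frame numbers.
    .scan([Int]()) { previousWindow, frame in
        var window = previousWindow
        if window.count == predictionWindowSize {
            // Slide the window forward by the stride.
            window.removeFirst(windowStride)
        }
        window.append(frame)
        return window
    }
    // Only publish full windows.
    .filter { $0.count == predictionWindowSize }
    .sink { window in
        print("Full window: \(window)")
    }

// Frames 1...8 produce [1, 2, 3, 4], [3, 4, 5, 6], and [5, 6, 7, 8].
(1...8).forEach { frames.send($0) }
```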
Predict the Person’s Action
The next publisher in the chain makes an ActionPrediction from the multiarray window by using the predictActionWithWindow(_:) method as its transform.
// Make an activity prediction from the window.
.map(predictActionWithWindow)
// ---- ActionPrediction -- ActionPrediction ----
The method’s input array contains nil multiarray optionals, where each nil element represents a frame in which Vision wasn’t able to find any human body poses. An action classifier requires a valid, non-nil multiarray for every frame. To remove the nil elements in the array, the method creates a new multiarray, filledWindow, by:
- Copying each valid element in currentWindow
- Replacing each nil element in currentWindow with an emptyPoseMultiArray
var poseCount = 0
// Fill the nil elements with an empty pose array.
let filledWindow: [MLMultiArray] = currentWindow.map { multiArray in
if let multiArray = multiArray {
poseCount += 1
return multiArray
} else {
return Pose.emptyPoseMultiArray
}
}
The empty pose multiarray has:
- Every element set to zero
- The same value for its shape property as a multiarray from a human body-pose observation
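The sample doesn't show the empty multiarray's construction here; a sketch follows. The [1, 3, 18] shape (one frame, three values per joint, 18 joints) matches what keypointsMultiArray() typically produces for a body-pose observation, but treat both the shape and the data type as assumptions.

```swift
import CoreML

// Create a multiarray with a body-pose observation's shape:
// 1 frame x 3 values (x, y, confidence) x 18 joints.
let emptyPoseMultiArray = try? MLMultiArray(shape: [1, 3, 18],
                                            dataType: .float32)

if let multiArray = emptyPoseMultiArray {
    // A new multiarray's contents are undefined, so zero every element.
    for index in 0..<multiArray.count {
        multiArray[index] = 0
    }
}
```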
As the method iterates through each element in currentWindow, it tallies the number of non-nil elements with poseCount.
If the value of poseCount is too low, the method directly creates a noPersonPrediction action prediction.
// Only use windows with at least 60% real data to make a prediction
// with the action classifier.
let minimum = predictionWindowSize * 60 / 100
guard poseCount >= minimum else {
return ActionPrediction.noPersonPrediction
}
Otherwise, the method merges the array of multiarrays into a single, combined multiarray by
calling the
MLMultiArray(concatenating:axis:dataType:)
initializer.
// Merge the array window of multiarrays into one multiarray.
let mergedWindow = MLMultiArray(concatenating: filledWindow,
axis: 0,
dataType: .float)
The method generates an action prediction by passing the combined multiarray to the action classifier’s predictActionFromWindow(_:) helper method.
// Make a genuine prediction with the action classifier.
let prediction = actionClassifier.predictActionFromWindow(mergedWindow)
// Return the model's prediction if the confidence is high enough.
// Otherwise, return a "Low Confidence" prediction.
return checkConfidence(prediction)
The method checks the prediction’s confidence by passing the prediction to the checkConfidence(_:) helper method, which returns the same prediction if its confidence is high enough; otherwise, it returns a lowConfidencePrediction.
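A sketch of such a confidence gate follows; the ActionPrediction stand-in, the 0.6 threshold, and the property names are assumptions, not the sample's code.

```swift
// Minimal stand-ins so the sketch compiles on its own; the sample's
// ActionPrediction type differs.
struct ActionPrediction {
    let label: String
    let confidence: Double?
    static let lowConfidencePrediction = ActionPrediction(label: "Low Confidence",
                                                          confidence: nil)
}

/// Returns the prediction unchanged when its confidence clears a minimum
/// threshold; otherwise substitutes a "Low Confidence" prediction.
/// - Note: The 0.6 threshold is an assumption for illustration.
func checkConfidence(_ prediction: ActionPrediction) -> ActionPrediction {
    let minimumConfidence = 0.6
    let isHighConfidence = (prediction.confidence ?? 0) >= minimumConfidence
    return isHighConfidence ? prediction : .lowConfidencePrediction
}
```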
Present the Prediction to the User
The final component in the chain is a subscriber that notifies the video-processing chain’s delegate with the prediction using the sendPrediction(_:) method.
// Send the action prediction to the delegate.
.sink(receiveValue: sendPrediction)
The method sends the action prediction and the number of frames the prediction represents (windowStride) to the video-processing chain’s delegate, the main view controller.
// Send the prediction to the delegate on the main queue.
DispatchQueue.main.async {
self.delegate?.videoProcessingChain(self,
didPredict: actionPrediction,
for: windowStride)
}
Each time the main view controller receives an action prediction, it updates the app’s UI with the prediction and confidence in a helper method.
func videoProcessingChain(_ chain: VideoProcessingChain,
didPredict actionPrediction: ActionPrediction,
for frameCount: Int) {
if actionPrediction.isModelLabel {
// Update the total number of frames for this action.
addFrameCount(frameCount, to: actionPrediction.label)
}
// Present the prediction in the UI.
updateUILabelsWithPrediction(actionPrediction)
}
The main view controller also updates its actionFrameCounts property for action labels that come from the model, which it later sends to the Summary View Controller when the user taps the Summary button.
Present the Poses to the User
The app visualizes the result of each human body-pose request by drawing the poses on top
of the frame in which Vision found them.
Each time the video-processing chain’s findPosesInFrame(_:) creates an array of Pose instances, it sends the poses to its delegate, the main view controller.
// Send the frame and poses, if any, to the delegate on the main queue.
DispatchQueue.main.async {
self.delegate?.videoProcessingChain(self, didDetect: poses, in: frame)
}
The main view controller’s drawPoses(_:onto:) method uses the frame as the background by first drawing the frame.
// Draw the camera image first as the background.
let imageRectangle = CGRect(origin: .zero, size: frameSize)
cgContext.draw(frame, in: imageRectangle)
Next, the method draws the poses by calling their drawWireframeToContext(_:applying:) method, which draws each pose as a wireframe of lines and circles.
// Draw all the poses Vision found in the frame.
for pose in poses {
// Draw each pose as a wireframe at the scale of the image.
pose.drawWireframeToContext(cgContext, applying: pointTransform)
}
The main view controller presents the finished image to the user by assigning it to its full-screen image view.
// Update the UI's full-screen image view on the main thread.
DispatchQueue.main.async { self.imageView.image = frameWithPosesRendering }