Save to file
While it's always possible to decode and re-encode the video stream into any format and configuration you want, it's simpler and more efficient to use the video stream directly. Just note that each method has its own upsides and downsides:
- Codec and configuration: Re-encoding the video allows you to use any codec and configuration. Because a PC is usually more powerful than a mobile device, it might achieve the same quality at a lower bitrate (thus smaller files). When muxing the video stream, the codec and configuration are always the same as captured.
- Quality: Most video encoding processes are lossy. The quality of the re-encoded video is lower than the original. When muxing the video stream, the quality is unchanged.
- Processing power and memory: Re-encoding the video requires more CPU/GPU power and memory. Muxing the video stream requires almost no CPU power or memory.
- Code complexity: Because the video resolution can change with device orientation, re-encoding the video is more complex. Muxing the video stream doesn't require any additional code to handle that.
The video stream contains only encoded video frames in one of several codecs (H.264/AVC, H.265/HEVC, or AV1). A few video players (e.g. VLC) can play such raw streams directly, but most others require a container format (e.g. MP4): the video stream plus some metadata.
To convert the video stream into a video file (muxing), you need a muxer library, like mp4-muxer or webm-muxer.
The APIs of mp4-muxer and webm-muxer are designed around the WebCodecs API. However, the WebCodecs API is not required when you already have an encoded video stream, like we do.
No matter which container and muxer library you use, some codec-specific conversions are required.
H.264
The words "H.264" and "AVC" are usually used interchangeable. However, they are actually two different specifications. The part for encoding and decoding video frames is identical, but the part for saving the video stream is different.
The ITU-T H.264 standard (https://www.itu.int/rec/T-REC-H.264) defines a stream format for storing multiple H.264 packets. This format doesn't have an official name, but because it's in the Annex B section of the specification, it's commonly referred to as "Annex B format", and used to store raw H.264 streams.
On the other hand, the ISO/IEC MPEG-4 AVC standard (https://www.iso.org/obp/ui/en/#iso:std:iso-iec:14496:-10:ed-10:v1:en) uses varies C structure-like format to store the packets. This format is commonly referred to as "AVC format", and used by various containers.
Android MediaCodec API (which Scrcpy uses) produces an Annex B format stream, to save the stream into a container format, it needs to be converted to AVC format.
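As a rough, standalone illustration of the difference (the NAL unit bytes below are made up and truncated), the same NAL unit is prefixed by a start code in Annex B format, and by its big-endian length in AVC format:
// Annex B format: each NAL unit is preceded by a 00 00 00 01 (or 00 00 01) start code.
const annexB = new Uint8Array([
  0x00, 0x00, 0x00, 0x01, // start code
  0x65, 0x88, 0x84, 0x21, // NAL unit bytes (made up, truncated)
]);

// AVC format: each NAL unit is preceded by its length as a big-endian integer
// (usually 4 bytes; the exact size is declared in the AVCDecoderConfigurationRecord).
const avc = new Uint8Array([
  0x00, 0x00, 0x00, 0x04, // length = 4 bytes
  0x65, 0x88, 0x84, 0x21, // the same NAL unit bytes
]);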
Configuration
As mentioned in configuration packet, the H.264 configuration packet contains the Sequence Parameter Set (SPS) and Picture Parameter Set (PPS), which contain information like codec profile, resolution, cropping, framerate, etc.
The configuration packets need to be stored in two places:
- In the video metadata, as an AVCDecoderConfigurationRecord structure
- Prepended to the next frame data, in the video stream
Configuration packets can occur multiple times in the stream, when the encoder is restarted. ONLY the first one needs to be converted to an AVCDecoderConfigurationRecord and stored in the metadata. But ALL of them must be prepended to their next frame data.
Subsequent configuration packets can differ from the first one, even using a different resolution or codec profile. Most players can handle this correctly.
If frame metadata is enabled, the SPS and PPS will be in the configuration packets. Tango has a method to extract them from the configuration packet. If they can't be found in the specified buffer, an error will be thrown.
import { h264SearchConfiguration } from "@yume-chan/scrcpy";
for await (const packet of videoPacketStream) {
  if (packet.type === "configuration") {
    const { sequenceParameterSet, pictureParameterSet } =
      h264SearchConfiguration(packet.data);
    console.log(sequenceParameterSet, pictureParameterSet);
  }
}
Then you can use the following function to convert them into an AvcDecoderConfigurationRecord:
- JavaScript
- TypeScript
// https://ffmpeg.org/doxygen/0.11/avc_8c-source.html#l00106
function h264ConfigurationToAvcDecoderConfigurationRecord(
  sequenceParameterSet,
  pictureParameterSet,
) {
  const buffer = new Uint8Array(
    11 + sequenceParameterSet.byteLength + pictureParameterSet.byteLength,
  );
  buffer[0] = 1; // configurationVersion
  buffer[1] = sequenceParameterSet[1]; // AVCProfileIndication (copied from the SPS)
  buffer[2] = sequenceParameterSet[2]; // profile_compatibility
  buffer[3] = sequenceParameterSet[3]; // AVCLevelIndication
  buffer[4] = 0xff; // reserved '111111'b + lengthSizeMinusOne = 3 (4-byte NALU lengths)
  buffer[5] = 0xe1; // reserved '111'b + numOfSequenceParameterSets = 1
  // 16-bit SPS length, then the SPS itself
  buffer[6] = sequenceParameterSet.byteLength >> 8;
  buffer[7] = sequenceParameterSet.byteLength & 0xff;
  buffer.set(sequenceParameterSet, 8);
  // numOfPictureParameterSets = 1, 16-bit PPS length, then the PPS itself
  buffer[8 + sequenceParameterSet.byteLength] = 1;
  buffer[9 + sequenceParameterSet.byteLength] = pictureParameterSet.byteLength >> 8;
  buffer[10 + sequenceParameterSet.byteLength] = pictureParameterSet.byteLength & 0xff;
  buffer.set(pictureParameterSet, 11 + sequenceParameterSet.byteLength);
  return buffer;
}
// https://ffmpeg.org/doxygen/0.11/avc_8c-source.html#l00106
function h264ConfigurationToAvcDecoderConfigurationRecord(
  sequenceParameterSet: Uint8Array,
  pictureParameterSet: Uint8Array,
) {
  const buffer = new Uint8Array(
    11 + sequenceParameterSet.byteLength + pictureParameterSet.byteLength,
  );
  buffer[0] = 1; // configurationVersion
  buffer[1] = sequenceParameterSet[1]!; // AVCProfileIndication (copied from the SPS)
  buffer[2] = sequenceParameterSet[2]!; // profile_compatibility
  buffer[3] = sequenceParameterSet[3]!; // AVCLevelIndication
  buffer[4] = 0xff; // reserved '111111'b + lengthSizeMinusOne = 3 (4-byte NALU lengths)
  buffer[5] = 0xe1; // reserved '111'b + numOfSequenceParameterSets = 1
  // 16-bit SPS length, then the SPS itself
  buffer[6] = sequenceParameterSet.byteLength >> 8;
  buffer[7] = sequenceParameterSet.byteLength & 0xff;
  buffer.set(sequenceParameterSet, 8);
  // numOfPictureParameterSets = 1, 16-bit PPS length, then the PPS itself
  buffer[8 + sequenceParameterSet.byteLength] = 1;
  buffer[9 + sequenceParameterSet.byteLength] = pictureParameterSet.byteLength >> 8;
  buffer[10 + sequenceParameterSet.byteLength] = pictureParameterSet.byteLength & 0xff;
  buffer.set(pictureParameterSet, 11 + sequenceParameterSet.byteLength);
  return buffer;
}
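For example, combining it with the extraction code above (remember that only the first configuration packet needs this conversion):
import { h264SearchConfiguration } from "@yume-chan/scrcpy";

let avcDecoderConfigurationRecord: Uint8Array | undefined;

for await (const packet of videoPacketStream) {
  if (packet.type === "configuration" && !avcDecoderConfigurationRecord) {
    const { sequenceParameterSet, pictureParameterSet } =
      h264SearchConfiguration(packet.data);
    avcDecoderConfigurationRecord =
      h264ConfigurationToAvcDecoderConfigurationRecord(
        sequenceParameterSet,
        pictureParameterSet,
      );
  }
}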
How to use the above AvcDecoderConfigurationRecord data depends on the container format and muxer library. Usually there will be a metadata, configuration, or description field for the video stream.
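Many libraries also ask for a WebCodecs-style codec string alongside the description. For H.264 it can be derived from the same three SPS bytes copied into the record above; the helper below is a small sketch of that, not part of any library:
// "avc1." followed by profile_idc, the constraint flags byte, and level_idc in hex,
// i.e. SPS bytes 1-3 (the same bytes stored in the AVCDecoderConfigurationRecord).
function h264CodecString(sequenceParameterSet: Uint8Array): string {
  return (
    "avc1." +
    Array.from(sequenceParameterSet.subarray(1, 4))
      .map((byte) => byte.toString(16).padStart(2, "0"))
      .join("")
  );
}

// For example, profile_idc 0x64 (High), constraints 0x00, level_idc 0x28 (level 4.0)
// produces "avc1.640028".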
Frames
When frame metadata is enabled, each data packet contains exactly one encoded frame.
For the first data packet after a configuration packet, the configuration data needs to be prepended to the frame data. This uses the original Annex B format data, not the AvcDecoderConfigurationRecord.
- JavaScript
- TypeScript
let configuration;

for await (const packet of videoPacketStream) {
  if (packet.type === "configuration") {
    configuration = packet.data;
    // Also convert it to `AVCDecoderConfigurationRecord` and store in metadata
    continue;
  }

  if (packet.type === "data") {
    let buffer;
    if (configuration) {
      // Prepend the pending configuration (Annex B SPS/PPS) to this frame
      buffer = new Uint8Array(configuration.byteLength + packet.data.byteLength);
      buffer.set(configuration);
      buffer.set(packet.data, configuration.byteLength);
      configuration = undefined;
    } else {
      buffer = packet.data;
    }
    console.log(buffer);
  }
}
let configuration: Uint8Array | undefined;

for await (const packet of videoPacketStream) {
  if (packet.type === "configuration") {
    configuration = packet.data;
    // Also convert it to `AVCDecoderConfigurationRecord` and store in metadata
    continue;
  }

  if (packet.type === "data") {
    let buffer: Uint8Array;
    if (configuration) {
      // Prepend the pending configuration (Annex B SPS/PPS) to this frame
      buffer = new Uint8Array(configuration.byteLength + packet.data.byteLength);
      buffer.set(configuration);
      buffer.set(packet.data, configuration.byteLength);
      configuration = undefined;
    } else {
      buffer = packet.data;
    }
    console.log(buffer);
  }
}
Then the data (maybe with the configuration prepended) needs to be converted into an AVCSample structure.
- JavaScript
- TypeScript
import { annexBSplitNalu } from "@yume-chan/scrcpy";
function nalStreamToAvcSample(buffer) {
  const nalUnits = [];
  let totalLength = 0;

  // Split the Annex B stream into NAL units (start codes removed)
  for (const unit of annexBSplitNalu(buffer)) {
    nalUnits.push(unit);
    totalLength += unit.byteLength + 4;
  }

  // Re-assemble each NAL unit with a 32-bit big-endian length prefix
  const sample = new Uint8Array(totalLength);
  let offset = 0;
  for (const nalu of nalUnits) {
    sample[offset] = nalu.byteLength >> 24;
    sample[offset + 1] = nalu.byteLength >> 16;
    sample[offset + 2] = nalu.byteLength >> 8;
    sample[offset + 3] = nalu.byteLength & 0xff;
    sample.set(nalu, offset + 4);
    offset += 4 + nalu.byteLength;
  }
  return sample;
}
import { annexBSplitNalu } from "@yume-chan/scrcpy";
function nalStreamToAvcSample(buffer: Uint8Array) {
  const nalUnits: Uint8Array[] = [];
  let totalLength = 0;

  // Split the Annex B stream into NAL units (start codes removed)
  for (const unit of annexBSplitNalu(buffer)) {
    nalUnits.push(unit);
    totalLength += unit.byteLength + 4;
  }

  // Re-assemble each NAL unit with a 32-bit big-endian length prefix
  const sample = new Uint8Array(totalLength);
  let offset = 0;
  for (const nalu of nalUnits) {
    sample[offset] = nalu.byteLength >> 24;
    sample[offset + 1] = nalu.byteLength >> 16;
    sample[offset + 2] = nalu.byteLength >> 8;
    sample[offset + 3] = nalu.byteLength & 0xff;
    sample.set(nalu, offset + 4);
    offset += 4 + nalu.byteLength;
  }
  return sample;
}
Again, how to save the AVCSample into the video stream depends on the container format and muxer library.
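As one possible end-to-end sketch using mp4-muxer: the code below assumes mp4-muxer's addVideoChunkRaw(data, type, timestamp, duration, meta) API with microsecond timestamps, and that data packets carry pts and keyframe fields (i.e. frame metadata is enabled). Verify both against the versions you use; this is an illustration, not the library's documented usage.
import { Muxer, ArrayBufferTarget } from "mp4-muxer";
import { h264SearchConfiguration } from "@yume-chan/scrcpy";

const muxer = new Muxer({
  target: new ArrayBufferTarget(),
  // Use the actual size from the video metadata instead of hard-coded values
  video: { codec: "avc", width: 1920, height: 1080 },
});

let configuration: Uint8Array | undefined;
let description: Uint8Array | undefined;
let firstPts: bigint | undefined;

for await (const packet of videoPacketStream) {
  if (packet.type === "configuration") {
    configuration = packet.data;
    if (!description) {
      // Only the first configuration packet becomes the AVCDecoderConfigurationRecord
      const { sequenceParameterSet, pictureParameterSet } =
        h264SearchConfiguration(packet.data);
      description = h264ConfigurationToAvcDecoderConfigurationRecord(
        sequenceParameterSet,
        pictureParameterSet,
      );
    }
    continue;
  }

  // Prepend the pending configuration (Annex B) to the next frame, then convert to an AVCSample
  let data = packet.data;
  if (configuration) {
    const merged = new Uint8Array(configuration.byteLength + data.byteLength);
    merged.set(configuration);
    merged.set(data, configuration.byteLength);
    data = merged;
    configuration = undefined;
  }
  const sample = nalStreamToAvcSample(data);

  firstPts ??= packet.pts;
  muxer.addVideoChunkRaw(
    sample,
    packet.keyframe ? "key" : "delta",
    Number(packet.pts! - firstPts!), // scrcpy PTS values are in microseconds
    0, // duration in microseconds (0 if unknown)
    description
      ? { decoderConfig: { codec: "avc1.640028", description } } // derive the codec string from the SPS (see the helper above)
      : undefined,
  );
  description = undefined; // the decoder config only needs to be attached once
}

muxer.finalize();
// With ArrayBufferTarget, the finished MP4 file is in muxer.target.buffer after finalize()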
H.265
The H.265 stream format is very similar to H.264's. It uses the same Annex B format, and the process for handling packets is also the same.
Configuration
The H.265 configuration packet contains the Video Parameter Set (VPS), Sequence Parameter Set (SPS), and Picture Parameter Set (PPS).
As with H.264, it also needs to be converted and stored in the video metadata, and prepended to the next frame data to be stored in the video stream. However, the functions for doing that are different because the data is different.
First extract them from a packet:
import { h265SearchConfiguration } from "@yume-chan/scrcpy";
for await (const packet of videoPacketStream) {
  if (packet.type === "configuration") {
    const { videoParameterSet, sequenceParameterSet, pictureParameterSet } =
      h265SearchConfiguration(packet.data);
    console.log(videoParameterSet, sequenceParameterSet, pictureParameterSet);
  }
}
Then convert them into an HEVCDecoderConfigurationRecord:
- JavaScript
- TypeScript
import { h265ParseSequenceParameterSet, h265ParseVideoParameterSet } from "@yume-chan/scrcpy";
function h265ConfigurationToHevcDecoderConfigurationRecord(
  videoParameterSet,
  sequenceParameterSet,
  pictureParameterSet,
) {
  const {
    profileTierLevel: {
      generalProfileTier: {
        profile_space: general_profile_space,
        tier_flag: general_tier_flag,
        profile_idc: general_profile_idc,
        profileCompatibilitySet: generalProfileCompatibilitySet,
        constraintSet: generalConstraintSet,
      },
      general_level_idc,
    },
    vps_max_layers_minus1,
    vps_temporal_id_nesting_flag,
  } = h265ParseVideoParameterSet(videoParameterSet.rbsp);

  const {
    chroma_format_idc,
    bit_depth_luma_minus8,
    bit_depth_chroma_minus8,
    vuiParameters: { min_spatial_segmentation_idc = 0 } = {},
  } = h265ParseSequenceParameterSet(sequenceParameterSet.rbsp);

  const buffer = new Uint8Array(
    23 +
      5 * 3 +
      videoParameterSet.data.length +
      sequenceParameterSet.data.length +
      pictureParameterSet.data.length,
  );

  /* unsigned int(8) configurationVersion = 1; */
  buffer[0] = 1;
  /*
   * unsigned int(2) general_profile_space;
   * unsigned int(1) general_tier_flag;
   * unsigned int(5) general_profile_idc;
   */
  buffer[1] =
    (general_profile_space << 6) |
    (Number(general_tier_flag) << 5) |
    general_profile_idc;
  /* unsigned int(32) general_profile_compatibility_flags; */
  buffer[2] = generalProfileCompatibilitySet[0];
  buffer[3] = generalProfileCompatibilitySet[1];
  buffer[4] = generalProfileCompatibilitySet[2];
  buffer[5] = generalProfileCompatibilitySet[3];
  /* unsigned int(48) general_constraint_indicator_flags; */
  buffer[6] = generalConstraintSet[0];
  buffer[7] = generalConstraintSet[1];
  buffer[8] = generalConstraintSet[2];
  buffer[9] = generalConstraintSet[3];
  buffer[10] = generalConstraintSet[4];
  buffer[11] = generalConstraintSet[5];
  /* unsigned int(8) general_level_idc; */
  buffer[12] = general_level_idc;
  /*
   * bit(4) reserved = '1111'b;
   * unsigned int(12) min_spatial_segmentation_idc;
   */
  buffer[13] = 0xf0 | (min_spatial_segmentation_idc >> 8);
  buffer[14] = min_spatial_segmentation_idc;
  /*
   * bit(6) reserved = '111111'b;
   * unsigned int(2) parallelismType;
   */
  buffer[15] = 0xfc;
  /*
   * bit(6) reserved = '111111'b;
   * unsigned int(2) chromaFormat;
   */
  buffer[16] = 0xfc | chroma_format_idc;
  /*
   * bit(5) reserved = '11111'b;
   * unsigned int(3) bitDepthLumaMinus8;
   */
  buffer[17] = 0xf8 | bit_depth_luma_minus8;
  /*
   * bit(5) reserved = '11111'b;
   * unsigned int(3) bitDepthChromaMinus8;
   */
  buffer[18] = 0xf8 | bit_depth_chroma_minus8;
  /* bit(16) avgFrameRate; */
  buffer[19] = 0;
  buffer[20] = 0;
  /*
   * bit(2) constantFrameRate;
   * bit(3) numTemporalLayers;
   * bit(1) temporalIdNested;
   * unsigned int(2) lengthSizeMinusOne;
   */
  buffer[21] =
    ((vps_max_layers_minus1 + 1) << 3) |
    (Number(vps_temporal_id_nesting_flag) << 2) |
    3;
  /* unsigned int(8) numOfArrays; */
  buffer[22] = 3;

  let i = 23;
  for (const nalu of [videoParameterSet, sequenceParameterSet, pictureParameterSet]) {
    /*
     * bit(1) array_completeness;
     * unsigned int(1) reserved = 0;
     * unsigned int(6) NAL_unit_type;
     */
    buffer[i] = nalu.nal_unit_type;
    i += 1;
    /* unsigned int(16) numNalus; */
    buffer[i] = 0;
    i += 1;
    buffer[i] = 1;
    i += 1;
    /* unsigned int(16) nalUnitLength; */
    buffer[i] = nalu.data.length >> 8;
    i += 1;
    buffer[i] = nalu.data.length;
    i += 1;
    buffer.set(nalu.data, i);
    i += nalu.data.length;
  }

  return buffer;
}
import type { H265NaluRaw } from "@yume-chan/scrcpy";
import { h265ParseSequenceParameterSet, h265ParseVideoParameterSet } from "@yume-chan/scrcpy";
function h265ConfigurationToHevcDecoderConfigurationRecord(
  videoParameterSet: H265NaluRaw,
  sequenceParameterSet: H265NaluRaw,
  pictureParameterSet: H265NaluRaw,
) {
  const {
    profileTierLevel: {
      generalProfileTier: {
        profile_space: general_profile_space,
        tier_flag: general_tier_flag,
        profile_idc: general_profile_idc,
        profileCompatibilitySet: generalProfileCompatibilitySet,
        constraintSet: generalConstraintSet,
      },
      general_level_idc,
    },
    vps_max_layers_minus1,
    vps_temporal_id_nesting_flag,
  } = h265ParseVideoParameterSet(videoParameterSet.rbsp);

  const {
    chroma_format_idc,
    bit_depth_luma_minus8,
    bit_depth_chroma_minus8,
    vuiParameters: { min_spatial_segmentation_idc = 0 } = {},
  } = h265ParseSequenceParameterSet(sequenceParameterSet.rbsp);

  const buffer = new Uint8Array(
    23 +
      5 * 3 +
      videoParameterSet.data.length +
      sequenceParameterSet.data.length +
      pictureParameterSet.data.length,
  );

  /* unsigned int(8) configurationVersion = 1; */
  buffer[0] = 1;
  /*
   * unsigned int(2) general_profile_space;
   * unsigned int(1) general_tier_flag;
   * unsigned int(5) general_profile_idc;
   */
  buffer[1] =
    (general_profile_space << 6) |
    (Number(general_tier_flag) << 5) |
    general_profile_idc;
  /* unsigned int(32) general_profile_compatibility_flags; */
  buffer[2] = generalProfileCompatibilitySet[0]!;
  buffer[3] = generalProfileCompatibilitySet[1]!;
  buffer[4] = generalProfileCompatibilitySet[2]!;
  buffer[5] = generalProfileCompatibilitySet[3]!;
  /* unsigned int(48) general_constraint_indicator_flags; */
  buffer[6] = generalConstraintSet[0]!;
  buffer[7] = generalConstraintSet[1]!;
  buffer[8] = generalConstraintSet[2]!;
  buffer[9] = generalConstraintSet[3]!;
  buffer[10] = generalConstraintSet[4]!;
  buffer[11] = generalConstraintSet[5]!;
  /* unsigned int(8) general_level_idc; */
  buffer[12] = general_level_idc;
  /*
   * bit(4) reserved = '1111'b;
   * unsigned int(12) min_spatial_segmentation_idc;
   */
  buffer[13] = 0xf0 | (min_spatial_segmentation_idc >> 8);
  buffer[14] = min_spatial_segmentation_idc;
  /*
   * bit(6) reserved = '111111'b;
   * unsigned int(2) parallelismType;
   */
  buffer[15] = 0xfc;
  /*
   * bit(6) reserved = '111111'b;
   * unsigned int(2) chromaFormat;
   */
  buffer[16] = 0xfc | chroma_format_idc;
  /*
   * bit(5) reserved = '11111'b;
   * unsigned int(3) bitDepthLumaMinus8;
   */
  buffer[17] = 0xf8 | bit_depth_luma_minus8;
  /*
   * bit(5) reserved = '11111'b;
   * unsigned int(3) bitDepthChromaMinus8;
   */
  buffer[18] = 0xf8 | bit_depth_chroma_minus8;
  /* bit(16) avgFrameRate; */
  buffer[19] = 0;
  buffer[20] = 0;
  /*
   * bit(2) constantFrameRate;
   * bit(3) numTemporalLayers;
   * bit(1) temporalIdNested;
   * unsigned int(2) lengthSizeMinusOne;
   */
  buffer[21] =
    ((vps_max_layers_minus1 + 1) << 3) |
    (Number(vps_temporal_id_nesting_flag) << 2) |
    3;
  /* unsigned int(8) numOfArrays; */
  buffer[22] = 3;

  let i = 23;
  for (const nalu of [videoParameterSet, sequenceParameterSet, pictureParameterSet]) {
    /*
     * bit(1) array_completeness;
     * unsigned int(1) reserved = 0;
     * unsigned int(6) NAL_unit_type;
     */
    buffer[i] = nalu.nal_unit_type;
    i += 1;
    /* unsigned int(16) numNalus; */
    buffer[i] = 0;
    i += 1;
    buffer[i] = 1;
    i += 1;
    /* unsigned int(16) nalUnitLength; */
    buffer[i] = nalu.data.length >> 8;
    i += 1;
    buffer[i] = nalu.data.length;
    i += 1;
    buffer.set(nalu.data, i);
    i += nalu.data.length;
  }

  return buffer;
}
Frames
Handling H.265 frame data is the same as for H.264.
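In other words, the prepend-and-convert code shown for H.264 can be reused as-is, since annexBSplitNalu splits on Annex B start codes, which H.264 and H.265 share:
// Works for H.265 frames too; the resulting sample is stored the same way,
// just alongside an HEVCDecoderConfigurationRecord instead of an AVC one.
const hevcSample = nalStreamToAvcSample(frameData);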
AV1
AV1 is much simpler to handle.
Its configuration is in the first data packet, and usually it doesn't need to be stored in the metadata, so the configuration packets can be ignored.
The data packets only need to be saved as-is into the video stream.
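A minimal loop therefore looks like this; writeSample is a placeholder for whatever write call your muxer or file API provides:
declare function writeSample(data: Uint8Array, keyframe: boolean): void; // hypothetical placeholder

for await (const packet of videoPacketStream) {
  if (packet.type === "data") {
    // AV1 data packets can be written unchanged, no Annex B/AVC-style conversion is needed
    writeSample(packet.data, packet.keyframe ?? false);
  }
  // "configuration" packets can simply be skipped for AV1
}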
Start recording anytime
The above examples assume you are saving the video stream from the beginning. Because a video file must start with a configuration and a keyframe, it's a little different if you want to start and stop recording at an arbitrary point.
Before v3.0, it was not possible to manually request a configuration packet and a keyframe. The best possible approach is to keep the latest configuration, plus the frames since the latest keyframe, and use them when recording starts. Because Android only produces a keyframe every N frames, and the framerate is variable, this records several extra seconds.
v3.0 added the reset video command. You can send that control message, then wait for the next configuration packet and keyframe to begin recording.
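A sketch of the pre-v3.0 buffering approach (assuming frame metadata is enabled so data packets carry a keyframe flag; the packet shape below only mirrors the one used throughout this page, and saveToFile is a placeholder for the muxing code above):
type Packet =
  | { type: "configuration"; data: Uint8Array }
  | { type: "data"; keyframe?: boolean; data: Uint8Array };

declare function saveToFile(packet: Packet): void; // hypothetical placeholder

let latestConfiguration: Packet | undefined;
let pendingFrames: Packet[] = [];
let recording = false;

// Call this when the user clicks "start recording"
function startRecording() {
  recording = true;
  if (latestConfiguration) saveToFile(latestConfiguration);
  for (const frame of pendingFrames) saveToFile(frame);
  pendingFrames = [];
}

for await (const packet of videoPacketStream as AsyncIterable<Packet>) {
  if (recording) {
    saveToFile(packet);
    continue;
  }

  if (packet.type === "configuration") {
    latestConfiguration = packet;
    // Frames buffered before a new configuration can't be decoded with it, so drop them
    pendingFrames = [];
  } else if (packet.keyframe) {
    // Keep only the frames since the latest keyframe
    pendingFrames = [packet];
  } else {
    pendingFrames.push(packet);
  }
}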