

Title:
3D STREAMING AND RECONSTRUCTION
Document Type and Number:
WIPO Patent Application WO/2021/245332
Kind Code:
A1
Abstract:
There is provided an apparatus comprising means for: receiving, at a local site, a first video frame of a scene captured from a first viewpoint; selecting a first predicted frame of one or more predicted frames; transmitting an indication on a prediction mode used to obtain the selected first predicted frame; determining a prediction error between the first video frame and the first predicted frame; encoding the prediction error; transmitting the encoded prediction error to a decoder at one or more remote sites; transmitting data indicating the first viewpoint to the decoder at one or more remote sites; updating a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.

Inventors:
VALLI SEPPO (FI)
Application Number:
PCT/FI2021/050397
Publication Date:
December 09, 2021
Filing Date:
June 01, 2021
Assignee:
TEKNOLOGIAN TUTKIMUSKESKUS VTT OY (FI)
International Classes:
H04N19/597; H04N19/103; H04N19/176; H04N19/46; H04N19/61
Foreign References:
US20180199039A12018-07-12
Other References:
YEA S ET AL: "View synthesis prediction for multiview video coding", SIGNAL PROCESSING. IMAGE COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 24, no. 1-2, 1 January 2009 (2009-01-01), pages 89 - 100, XP025884347, ISSN: 0923-5965, [retrieved on 20081029], DOI: 10.1016/J.IMAGE.2008.10.007
Attorney, Agent or Firm:
LAINE IP OY (FI)
Claims:
CLAIMS:

1. An apparatus comprising at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least:

- receiving, at a local site, a first video frame of a scene captured from a first viewpoint;

- selecting a first predicted frame of one or more predicted frames;

- transmitting an indication on a prediction mode used to obtain the selected first predicted frame;

- determining a prediction error between the first video frame and the first predicted frame;

- encoding the prediction error;

- transmitting the encoded prediction error to a decoder at one or more remote sites;

- transmitting data indicating the first viewpoint to the decoder at one or more remote sites; and

- updating a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.

2. The apparatus of claim 1, wherein selecting the first predicted frame of one or more predicted frames comprises selecting of one or more of an inter frame prediction; an intra frame prediction; a viewpoint dependent prediction, wherein a projection to the current 3D reconstruction is determined such that the projection corresponds to the first video frame and a viewpoint of the projection corresponds to the first viewpoint.

3. The apparatus of claim 2, wherein the projection to the current 3D reconstruction is determined by comparing projections to the current 3D reconstruction from different viewpoints to the first video frame; and/or using a known first viewpoint determined based on a tracking algorithm.

4. The apparatus of any preceding claim, wherein updating the current 3D reconstruction of the scene based on the first predicted frame and the prediction error comprises summing the first predicted frame and the prediction error to obtain a reconstructed frame; and updating the current 3D reconstruction using the reconstructed frame.

5. The apparatus of claim 1, wherein the apparatus is further caused to perform: receiving an augmented reality object with its scale and pose relating to the updated 3D reconstruction; and rendering the augmented reality object to a wearable display.

6. The apparatus of any preceding claim, further caused to perform:

- receiving, at the local site, a second video frame of the scene captured from a second viewpoint, which is different than the first viewpoint;

- selecting a second predicted frame of one or more predicted frames;

- transmitting an indication on a prediction mode used to obtain the selected second predicted frame;

- determining a prediction error between the second video frame and the second predicted frame;

- encoding the prediction error;

- transmitting the encoded prediction error to the decoder at one or more remote sites;

- transmitting data indicating the second viewpoint to the decoder at one or more remote sites; and

- updating a current 3D reconstruction of the scene based on the data indicating the second viewpoint, the second predicted frame and the prediction error to obtain an updated 3D reconstruction.

7. An apparatus comprising at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least:

- receiving, at a remote site, data indicating a first viewpoint from which a first video frame of a scene has been captured at a local site;

- receiving, from an encoder at the local site, a prediction error between the first video frame captured at the local site and a predicted frame predicted at the local site;

- receiving an indication of a prediction mode;

- selecting a first predicted frame based on the indication of the prediction mode;

- decoding the prediction error; and

- updating a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.

8. The apparatus of claim 7, wherein the updating the current 3D reconstruction based on the data indicating the first viewpoint, the first predicted frame and the prediction error comprises summing the first predicted frame and the prediction error to obtain a reconstructed frame; and updating the current 3D reconstruction using the reconstructed frame.

9. The apparatus of claim 7 or 8, wherein the indication of the prediction mode is for one of an inter frame prediction; an intra frame prediction; a viewpoint dependent prediction, wherein a projection to the current 3D reconstruction is determined based on the data indicating the first viewpoint.

10. The apparatus of any of the claims 7 to 9, further caused to perform: determining an augmented reality object and its scale and pose relating to the updated 3D reconstruction; and transmitting the augmented reality object with its scale and pose to the local site or transmitting instruction to download the augmented reality object.

11. The apparatus of any of the claims 7 to 10, further caused to perform:

- receiving, at the remote site, data indicating a second viewpoint from which a second video frame of the scene has been captured at the local site, wherein the second viewpoint is different than the first viewpoint;

- receiving, from the encoder at the local site, a prediction error between the second video frame captured at the local site and a predicted frame predicted at the local site;

- receiving an indication of a prediction mode;

- selecting a second predicted frame based on the indication of the prediction mode;

- decoding the prediction error; and

- updating a current 3D reconstruction of the scene based on the data indicating the second viewpoint, the second predicted frame and the prediction error to obtain an updated 3D reconstruction.

12. A method comprising

- receiving, at a local site, a first video frame of a scene captured from a first viewpoint;

- selecting a first predicted frame of one or more predicted frames;

- transmitting an indication on a prediction mode used to obtain the selected first predicted frame;

- determining a prediction error between the first video frame and the first predicted frame;

- encoding the prediction error;

- transmitting the encoded prediction error to a decoder at one or more remote sites;

- transmitting data indicating the first viewpoint to the decoder at one or more remote sites; and

- updating a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.

13. The method of claim 12, wherein selecting the first predicted frame of one or more predicted frames comprises selecting of one or more of an inter frame prediction; an intra frame prediction; a viewpoint dependent prediction, wherein a projection to the current 3D reconstruction is determined such that the projection corresponds to the first video frame and a viewpoint of the projection corresponds to the first viewpoint.

14. The method of claim 13, wherein the projection to the current 3D reconstruction is determined by comparing projections to the current 3D reconstruction from different viewpoints to the first video frame; and/or using a known first viewpoint determined based on a tracking algorithm.

15. The method of any of the claims 12 to 14, wherein updating the current 3D reconstruction of the scene based on the first predicted frame and the prediction error comprises summing the first predicted frame and the prediction error to obtain a reconstructed frame; and updating the current 3D reconstruction using the reconstructed frame.

16. The method of any of the claims 12 to 15, further comprising: receiving an augmented reality object with its scale and pose relating to the updated 3D reconstruction; and rendering the augmented reality object to a wearable display.

17. The method of any of the claims 12 to 16, further comprising:

- receiving, at the local site, a second video frame of the scene captured from a second viewpoint, which is different than the first viewpoint;

- selecting a second predicted frame of one or more predicted frames;

- transmitting an indication on a prediction mode used to obtain the selected second predicted frame;

- determining a prediction error between the second video frame and the second predicted frame;

- encoding the prediction error;

- transmitting the encoded prediction error to the decoder at one or more remote sites;

- transmitting data indicating the second viewpoint to the decoder at one or more remote sites; and

- updating a current 3D reconstruction of the scene based on the data indicating the second viewpoint, the second predicted frame and the prediction error to obtain an updated 3D reconstruction.

18. A method comprising

- receiving, at a remote site, data indicating a first viewpoint from which a first video frame of a scene has been captured at a local site;

- receiving, from an encoder at the local site, a prediction error between the first video frame captured at the local site and a predicted frame predicted at the local site;

- receiving an indication of a prediction mode;

- selecting a first predicted frame based on the indication of the prediction mode;

- decoding the prediction error; and

- updating a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.

19. The method of claim 18, wherein the updating the current 3D reconstruction based on the data indicating the first viewpoint, the first predicted frame and the prediction error comprises summing the first predicted frame and the prediction error to obtain a reconstructed frame; and updating the current 3D reconstruction using the reconstructed frame.

20. The method of claim 18 or 19, wherein the indication of the prediction mode is for one of an inter frame prediction; an intra frame prediction; a viewpoint dependent prediction, wherein a projection to the current 3D reconstruction is determined based on the data indicating the first viewpoint.

21. The method of any of the claims 18 to 20, further comprising: determining an augmented reality object and its scale and pose relating to the updated 3D reconstruction; and transmitting the augmented reality object with its scale and pose to the local site or transmitting instruction to download the augmented reality object.

22. The method of any of the claims 18 to 21, further comprising:

- receiving, at the remote site, data indicating a second viewpoint from which a second video frame of the scene has been captured at the local site, wherein the second viewpoint is different than the first viewpoint;

- receiving, from the encoder at the local site, a prediction error between the second video frame captured at the local site and a predicted frame predicted at the local site;

- receiving an indication of a prediction mode;

- selecting a second predicted frame based on the indication of the prediction mode;

- decoding the prediction error; and

- updating a current 3D reconstruction of the scene based on the data indicating the second viewpoint, the second predicted frame and the prediction error to obtain an updated 3D reconstruction.

23. A non-transitory computer readable medium comprising program instructions that, when executed by at least one processor, cause an apparatus to at least perform the method of any of the claims 12 to 17.

24. A non-transitory computer readable medium comprising program instructions that, when executed by at least one processor, cause an apparatus to at least perform the method of any of the claims 18 to 22.

25. A computer program configured to cause the method of any of the claims 12 to 17 to be performed.

26. A computer program configured to cause the method of any of the claims 18 to 22 to be performed.

Description:
3D streaming and reconstruction

FIELD

[0001] Various example embodiments relate to video coding and updating a 3D reconstruction of a space.

BACKGROUND

[0002] Models or reconstructions of physical spaces and/or objects are needed in various digital services, e.g. remote observation, remote maintenance, tele-interaction, augmented reality (AR), mixed reality and/or extended reality services. Forming the models, e.g. three-dimensional (3D) reconstructions, requires a large amount of data. Transmitting the formed 3D reconstructions between multiple sites requires a lot of bandwidth and may hinder the use of 3D reconstructions in real-time applications.

SUMMARY

[0003] According to some aspects, there is provided the subject-matter of the independent claims. Some example embodiments are defined in the dependent claims. The scope of protection sought for various example embodiments is set out by the independent claims. The example embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various example embodiments.

[0004] According to a first aspect, there is provided an apparatus comprising means for: receiving, at a local site, a first video frame of a scene captured from a first viewpoint; selecting a first predicted frame of one or more predicted frames; transmitting an indication on a prediction mode used to obtain the selected first predicted frame; determining a prediction error between the first video frame and the first predicted frame; encoding the prediction error; transmitting the encoded prediction error to a decoder at one or more remote sites; transmitting data indicating the first viewpoint to the decoder at one or more remote sites; updating a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.

[0005] According to an embodiment, selecting the first predicted frame of one or more predicted frames comprises selecting of one or more of an inter frame prediction; an intra frame prediction; a viewpoint dependent prediction, wherein a projection to the current 3D reconstruction is determined such that the projection corresponds to the first video frame and a viewpoint of the projection corresponds to the first viewpoint.

[0006] According to an embodiment, the projection to the current 3D reconstruction is determined by comparing projections to the current 3D reconstruction from different viewpoints to the first video frame; and/or using a known first viewpoint determined based on a tracking algorithm.

[0007] According to an embodiment, updating the current 3D reconstruction of the scene based on the first predicted frame and the prediction error comprises summing the first predicted frame and the prediction error to obtain a reconstructed frame; and updating the current 3D reconstruction using the reconstructed frame.

[0008] According to an embodiment, the apparatus further comprises means for receiving an augmented reality object with its scale and pose relating to the updated 3D reconstruction; and rendering the augmented reality object to a wearable display.

[0009] According to an embodiment, the apparatus further comprises means for receiving, at the local site, a second video frame of the scene captured from a second viewpoint, which is different than the first viewpoint; selecting a second predicted frame of one or more predicted frames; transmitting an indication on a prediction mode used to obtain the selected second predicted frame; determining a prediction error between the second video frame and the second predicted frame; encoding the prediction error; transmitting the encoded prediction error to the decoder at one or more remote sites; transmitting data indicating the second viewpoint to the decoder at one or more remote sites; updating a current 3D reconstruction of the scene based on the data indicating the second viewpoint, the second predicted frame and the prediction error to obtain an updated 3D reconstruction.

[0010] According to a second aspect, there is provided an apparatus comprising means for receiving, at a remote site, data indicating a first viewpoint from which a first video frame of a scene has been captured at a local site; receiving, from an encoder at the local site, a prediction error between the first video frame captured at the local site and a predicted frame predicted at the local site; receiving an indication of a prediction mode; selecting a first predicted frame based on the indication of the prediction mode; decoding the prediction error; updating a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.

[0011] According to an embodiment, the updating the current 3D reconstruction based on the data indicating the first viewpoint, the first predicted frame and the prediction error comprises summing the first predicted frame and the prediction error to obtain a reconstructed frame; updating the current 3D reconstruction using the reconstructed frame.

[0012] According to an embodiment, the indication of the prediction mode is for one of an inter frame prediction; an intra frame prediction; a viewpoint dependent prediction, wherein a projection to the current 3D reconstruction is determined based on the data indicating the first viewpoint.

[0013] According to an embodiment, the apparatus further comprises means for determining an augmented reality object and its scale and pose relating to the updated 3D reconstruction; and transmitting the augmented reality object with its scale and pose to the local site or transmitting instruction to download the augmented reality object.

[0014] According to an embodiment, the apparatus further comprises means for receiving, at the remote site, data indicating a second viewpoint from which a second video frame of the scene has been captured at the local site, wherein the second viewpoint is different than the first viewpoint; receiving, from the encoder at the local site, a prediction error between the second video frame captured at the local site and a predicted frame predicted at the local site; receiving an indication of a prediction mode; selecting a second predicted frame based on the indication of the prediction mode; decoding the prediction error; updating a current 3D reconstruction of the scene based on the data indicating the second viewpoint, the second predicted frame and the prediction error to obtain an updated 3D reconstruction.

[0015] According to an embodiment, the means comprises at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the performance of the apparatus.

[0016] According to a third aspect, there is provided a method comprising receiving, at a local site, a first video frame of a scene captured from a first viewpoint; selecting a first predicted frame of one or more predicted frames; transmitting an indication on a prediction mode used to obtain the selected first predicted frame; determining a prediction error between the first video frame and the first predicted frame; encoding the prediction error; transmitting the encoded prediction error to a decoder at one or more remote sites; transmitting data indicating the first viewpoint to the decoder at one or more remote sites; updating a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.

[0017] According to an embodiment, the method further comprises receiving an augmented reality object with its scale and pose relating to the updated 3D reconstruction; and rendering the augmented reality object to a wearable display.

[0018] According to an embodiment, the method further comprises receiving, at the local site, a second video frame of the scene captured from a second viewpoint, which is different than the first viewpoint; selecting a second predicted frame of one or more predicted frames; transmitting an indication on a prediction mode used to obtain the selected second predicted frame; determining a prediction error between the second video frame and the second predicted frame; encoding the prediction error; transmitting the encoded prediction error to the decoder at one or more remote sites; transmitting data indicating the second viewpoint to the decoder at one or more remote sites; updating a current 3D reconstruction of the scene based on the data indicating the second viewpoint, the second predicted frame and the prediction error to obtain an updated 3D reconstruction.

[0019] According to a fourth aspect, there is provided a method comprising receiving, at a remote site, data indicating a first viewpoint from which a first video frame of a scene has been captured at a local site; receiving, from an encoder at the local site, a prediction error between the first video frame captured at the local site and a predicted frame predicted at the local site; receiving an indication of a prediction mode; selecting a first predicted frame based on the indication of the prediction mode; decoding the prediction error; updating a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.

[0020] According to an embodiment, the method further comprises determining an augmented reality object and its scale and pose relating to the updated 3D reconstruction; and transmitting the augmented reality object with its scale and pose to the local site or transmitting instruction to download the augmented reality object.

[0021] According to an embodiment, the method further comprises receiving, at the remote site, data indicating a second viewpoint from which a second video frame of the scene has been captured at the local site, wherein the second viewpoint is different than the first viewpoint; receiving, from the encoder at the local site, a prediction error between the second video frame captured at the local site and a predicted frame predicted at the local site; receiving an indication of a prediction mode; selecting a second predicted frame based on the indication of the prediction mode; decoding the prediction error; updating a current 3D reconstruction of the scene based on the data indicating the second viewpoint, the second predicted frame and the prediction error to obtain an updated 3D reconstruction.

[0022] According to a fifth aspect, there is provided a non-transitory computer readable medium comprising program instructions that, when executed by at least one processor, cause an apparatus to at least perform a method of the third aspect and its embodiments, or a method of the fourth aspect and its embodiments.

[0023] According to a sixth aspect, there is provided a computer program configured to cause a method in accordance with the third aspect and its embodiments or a method in accordance with the fourth aspect and its embodiments to be performed.

BRIEF DESCRIPTION OF THE DRAWINGS

[0024] Fig. 1 shows, by way of example, a flowchart of a method;

[0025] Fig. 2 shows, by way of example, a block schema of operations at a local site and one or more remote sites;

[0026] Fig. 3 shows, by way of example, principle of viewpoint dependent prediction;

[0027] Fig. 4 shows, by way of example, a flowchart of a method;

[0028] Fig. 5 shows, by way of example, remote monitoring and control service using method(s) disclosed herein;

[0029] Fig. 6 shows, by way of example, supporting remote augmentation of instructions using method(s) disclosed herein;

[0030] Fig. 7 shows, by way of example, a block diagram of an apparatus.

DETAILED DESCRIPTION

[0031] Various digital services are based on acquiring 3D data from physical environments. Such services are for example augmented reality (AR/MR/XR) visualizations, which enable users to see virtual objects as a seamless part of their environment. 3D reconstructions may be formed, for example, based on multiple images captured of a view. Depth sensors, for example time-of-flight (ToF) and/or RGB-D sensors, may be used to capture 3D data for reconstructions.

[0032] In remote applications, 3D data may be transmitted over a network. Further, an identical 3D reconstruction may be needed in both the local and one or more remote environments. For example, when forming 3D reconstructions by an unmanned aircraft, e.g. a flying drone or quadcopter, the flying unmanned aircraft may use the 3D reconstruction for semi-autonomous navigation during its flight, and the same 3D reconstruction is also used by a remote person or an application to analyze the scene and plan for the flight. Thus, an updated copy of a reconstruction is needed in two or more sites. As another example, in typical AR applications, a 3D reconstruction made in a local environment may be first captured at the local site and delivered to a remote site for offline content authoring, and used again locally during runtime visualization.

[0033] However, delivering formed 3D reconstructions with a large amount of 3D data over a network causes delays and/or latencies, which may degrade service quality or usability.

[0034] There is provided a method for efficient delivery of dynamic 3D reconstructions, e.g. in real time or close to real time, even over low bandwidth connections. Dynamic here may mean e.g. semi-static 3D reconstructions, which may vary over time due to, for example, lighting changes.

[0035] Fig. 1 shows, by way of example, a flowchart of a method for updating a 3D reconstruction over a network. The method may be performed by an apparatus at the local site. The apparatus may be e.g. a computing device, a personal computer or a server-based computing device. The apparatus may be e.g. the apparatus of Fig. 7. The method 100 comprises receiving 110, at a local site, a first video frame of a scene captured from a first viewpoint. The method 100 comprises selecting 120 a first predicted frame of one or more predicted frames. The method 100 comprises transmitting 130 an indication on a prediction mode used to obtain the selected first predicted frame. The method 100 comprises determining 140 a prediction error between the first video frame and the first predicted frame. The method 100 comprises encoding 150 the prediction error. The method 100 comprises transmitting 160 the encoded prediction error to a decoder at one or more remote sites. The method 100 comprises transmitting 170 data indicating the first viewpoint to the decoder at one or more remote sites. The method 100 comprises updating 180 a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.
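
As an illustration of the encoder-side flow, a minimal Python sketch of selecting a predicted frame and determining the prediction error is given below. The candidate predictions, the sum-of-absolute-differences selection criterion and all function and variable names are assumptions made for this example; they are not mandated by the method.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences, used here as a simple matching criterion."""
    return float(np.abs(a.astype(np.float32) - b.astype(np.float32)).sum())

def encode_frame(frame, candidates):
    """Select the candidate prediction that best matches the captured frame and
    return the selected prediction mode together with the prediction error."""
    mode = min(candidates, key=lambda m: sad(frame, candidates[m]))
    error = frame.astype(np.int16) - candidates[mode].astype(np.int16)
    return mode, error

# Toy usage: an 8x8 single-channel frame and three stand-in predictions.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, (8, 8), dtype=np.uint8)
candidates = {
    "intra": np.full((8, 8), int(frame.mean()), dtype=np.uint8),  # flat intra-style guess
    "inter": frame.copy(),                                        # previous frame, here identical
    "3d": np.roll(frame, 1, axis=1),                              # stand-in projection of the 3D reconstruction
}
mode, error = encode_frame(frame, candidates)
print(mode, int(np.abs(error).sum()))
```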

[0036] The method disclosed herein enables having identical or substantially identical high-quality copies of 3D reconstructions available for multiple parties and sites participating in a service session, with reduced bitrate and sped-up transmission. The method disclosed herein enables low latency. The method disclosed herein enables efficient use of 3D reconstructions e.g. in AR/MR/XR applications and in remote monitoring and control, e.g. by robots and/or drones.

[0037] The method will be described in the context of Fig. 2 that shows, by way of example, a block schema of operations at a local site 210 and one or more remote sites 250. For simplicity, one predictive coding loop and one predictive decoding loop are shown in Fig. 2 instead of e.g. three parallel coding loops for different components of the video signal, i.e. the luma component (Y), the blue-difference chroma component (Cb), and the red-difference chroma component (Cr). In addition, in case of video-plus-depth data, there may be a coding loop for the depth signal.

[0038] The local site may be considered as an encoder. At the local site, video or video-plus-depth data may be captured. Video data or video-plus-depth data may be captured using a capture sensor 215, e.g. a video camera, and/or a depth camera setup, e.g. with RGB-D sensors, such as a Kinect sensor. The capture sensor may be a moving sensor that is used in the front end for data capture, e.g. for 3D data capture. The local site may be e.g. a working site or home environment. For example, a person or a robot may carry the capture sensor and the video data or video-plus-depth data may be captured while the person or robot moves around the site. The data capture may be on-going while the person or robot does other tasks at the site. For example, the capture sensor may be mounted into AR glasses wearable by a person, or into a robot moving around at the site. Since the capture sensor is able to move around the local site, the viewpoint may change with time.

[0039] Video data or video-plus-depth data comprises a plurality of frames, e.g. a first video frame, a second video frame, etc. Coding of video sequences or video-plus-depth sequences may be performed e.g. on a frame-by-frame basis and/or a block-by-block basis. A block of a frame at time point t may be denoted as x_t. Let us consider that a first video frame of a scene captured from a first viewpoint is received 217. The first video frame may be a video frame or a video-plus-depth frame. Depth sensor based capture enables outputting features also from areas without texture.

[0040] A first predicted frame p' is received 219. The first predicted frame may be selected 225 from one or more predictions 220, 222, 224. The one or more predictions may comprise e.g. 2D intra prediction 220, 2D inter prediction 222 and/or a viewpoint dependent prediction 224 which may be named as a 3D prediction. The 3D prediction is a viewpoint dependent prediction determined based on a current 3D reconstruction.

[0041] 2D intra prediction, or intra-frame coding, may be used in video coding or compression. Intra-frame prediction exploits spatial redundancy, i.e. correlation among pixels within one frame.

[0042] 2D inter prediction, or inter-frame coding, may be used in video coding or compression. Inter-frame prediction exploits temporal redundancy, i.e. correlation between neighboring frames.

[0043] The 3D prediction, or viewpoint dependent prediction 224, is determined based on a current 3D reconstruction. The 3D reconstruction is a semi-static but dynamic 3D model of the captured space or environment and is created based on the information captured so far at the local site 210. Correspondingly, at the remote site 250, the 3D reconstruction is a semi-static but dynamic 3D model of the captured space or environment at the local site and is created based on the information received from the local site. The local site and the one or more remote sites have the same 3D reconstruction available.

[0044] 3D reconstruction may be referred to as Visual Twin, which is a dynamic visual 3D representation of a physical object or an environment captured over time. Visual Twin enables spatial 3D viewing (3D viewing), analysis and retrieval of its constructs. Visual Twin may act as a platform for sharing and visualizing data and building up spatial or situational awareness relating to the corresponding physical entity.

[0045] The 3D reconstruction may be formed or constructed by any feasible dynamic reconstruction algorithm, e.g. Kinect fusion algorithm, or simultaneous localization and mapping (SLAM) algorithm. 3D reconstruction using a moving camera sensor may be referred to as a structure from motion (SfM) technique. Instead of relying on video data captured by a moving sensor, 3D reconstruction may be formed by combining captures of multiple fixed sensors, for example.

[0046] The method as disclosed herein may require a certain ramp-up time, during which enough data is captured for construction of an initial 3D model or 3D reconstruction of the environment. The data may be captured e.g. by sensors which are moved around the site e.g. by a person or a robot. The initial 3D reconstruction may be formed at the local site and at one or more remote sites, or the 3D reconstruction may be formed at the local site and sent to the one or more remote sites, or the captured data may be sent to the one or more remote sites and the one or more remote sites may form the 3D model based on the captured data. It may be that, for example in home telepresence or work post monitoring, the required data for the 3D model is obtained as a by-product without separate, intentional or assisted effort. This may reduce the need for making time-consuming, often costly, advance preparations.

[0047] The current, or so-far obtained, 3D reconstruction may be used to determine a predicted frame, which may be named as 3D prediction or viewpoint dependent prediction. The prediction may use knowledge of the capture viewpoint, which may be obtained e.g. by using a tracking algorithm. The tracking algorithm may use data from motion sensors, such as an inertial measurement unit attached to the capture sensor. Alternatively, or in addition, an exhaustive search may be applied, i.e. the projections to the current 3D reconstruction from different viewpoints may be compared to the first video frame which represents the current real-world view. Based on the comparison, the current viewpoint may be determined. The projection that corresponds, e.g. best corresponds, to the first video frame may be selected as a 3D prediction. The 3D prediction obtained this way is a 2D projection to the 3D reconstruction which is aligned with the real-world view, i.e. is in the same orientation and scale, or there is a known mapping between the orientations and scales of the real-world view and the 2D projection to the 3D reconstruction. The 3D prediction may be calculated once, e.g. only once, per captured frame, which requires less computation resources than e.g. block-based predictions.
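
A minimal sketch of the exhaustive search described above is given below. It compares projections of the current reconstruction from a set of candidate viewpoints to the captured frame and selects the best matching one; the stand-in renderer, the two-dimensional shift used as a "viewpoint" and all names are assumptions made purely for illustration.

```python
import numpy as np

def render_projection(reconstruction, viewpoint):
    """Stand-in renderer: shift a stored reference image by the viewpoint offset.

    A real implementation would project the voxel/TSDF reconstruction from a
    6-DOF camera pose; here the 'viewpoint' is just an integer (dx, dy) shift.
    """
    dx, dy = viewpoint
    return np.roll(reconstruction, shift=(dy, dx), axis=(0, 1))

def find_viewpoint(frame, reconstruction, candidate_viewpoints):
    """Exhaustive search: pick the viewpoint whose projection best matches the frame."""
    errors = {vp: float(np.abs(frame.astype(np.float32)
                               - render_projection(reconstruction, vp)).sum())
              for vp in candidate_viewpoints}
    best = min(errors, key=errors.get)
    return best, render_projection(reconstruction, best)

# Toy usage: the captured frame equals the reconstruction shifted by (2, 1).
reconstruction = np.random.default_rng(1).integers(0, 256, (16, 16)).astype(np.float32)
frame = np.roll(reconstruction, shift=(1, 2), axis=(0, 1))
candidates = [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3)]
viewpoint, prediction_3d = find_viewpoint(frame, reconstruction, candidates)
print(viewpoint)  # expected (2, 1)
```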

[0048] Fig. 3 shows, by way of example, the principle of viewpoint dependent prediction, i.e. the 3D prediction. In this example, the captured data is video-plus-depth data 310 and the reconstruction algorithm is Kinect fusion 320. Frames (either video or video-plus-depth) may be denoted by X'_t. The frames are composed of blocks. The prediction algorithm may apply e.g. a truncated signed distance function (TSDF) 325 for compacting representations of 3D surfaces/volumes. The prediction may comprise errors, e.g. holes that may be caused by not detecting enough scene features, e.g. from transparent window surfaces. In order to reduce possible errors, e.g. hole filling 330 and/or in-painting may be performed on the reconstruction. The suitable viewpoint may be determined e.g. by an exhaustive search as described above, applying e.g. 3D warping 332. Then, a 2D projection may be formed 335 to obtain a new 2D prediction from a 3D viewpoint. Post processing 340 may comprise e.g. further filtering operations, etc. Other intra or inter predictions may, alternatively or additionally, be used to reduce holes.

[0049] The viewpoint may be received 345 via a tracking algorithm, or it may be determined by comparing projections to the current 3D reconstruction from different viewpoints to the first video frame. For captured video frames or video-plus-depth frames, the encoder may find an estimate for the current viewpoint or vantage point w.r.t the 3D reconstruction. The viewpoint may be expressed by a 6 DOF motion vector describing a new viewpoint for a whole frame:

[0050] m'_t = (x, y, z, α, β, γ).
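
Both the encoder and the decoder may convert such a 6 DOF vector into a camera pose used for forming the projection. A possible conversion is sketched below; the Euler-angle convention and the names used are assumptions, and any convention may be used as long as the encoder and the decoder agree on it.

```python
import numpy as np

def pose_from_6dof(x, y, z, alpha, beta, gamma):
    """Build a 4x4 camera pose from the 6 DOF viewpoint vector (x, y, z, alpha, beta, gamma).

    Angles are in radians; the Z-Y-X Euler-angle convention used here is an
    assumption made for the example.
    """
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    pose = np.eye(4)
    pose[:3, :3] = rz @ ry @ rx   # rotation part
    pose[:3, 3] = (x, y, z)       # translation part
    return pose

print(pose_from_6dof(1.0, 0.0, 2.5, 0.0, np.pi / 2, 0.0).round(3))
```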

[0051] Referring back to Fig. 2, the viewpoint may be transmitted 230 to the decoder at the remote site. Thus, both the encoder and the decoder use the same motion vector to form the 3D prediction, i.e. a 2D projection to the 3D reconstruction. This way the 3D prediction formed at the remote site ends up being the same as the 3D prediction performed at the local site.

[0052] The 3D prediction may improve if there have been similar or close-to-similar viewpoints earlier during the frame sequence. The 3D prediction may be considered to work well if the capture sensor, e.g. a video camera or a video-plus-depth camera, moves around in the same local environment, which is common e.g. in people's local living or working environments, e.g. a room at home or a work post. The home and work post may be considered as semi-static spaces. Coding based on the 3D prediction may reduce the required bitrate, e.g. in cases where a capture sensor moves around in a semi-static space, e.g. a work site or a room at home.

[0053] The 3D prediction may be compared with other predictions, e.g. 2D intra and/or 2D inter predictions. The prediction that best corresponds to the current view, i.e. the first video frame, may be selected 225 as the predicted frame p', e.g. the predicted first frame. The best prediction may be selected e.g. based on rate and distortion. The prediction mode used to achieve the predicted first frame, i.e. the prediction that has been selected as the best prediction, may be indicated 235 to a decoder at the remote site. This way the remote site receives information on which prediction mode to use in order to arrive at the same prediction.

[0054] Each prediction produces a certain distortion compared to the original image and/or depth map block. Similarly, a certain number of bits is needed to encode the block. The needed number of bits might not be exactly known, but it may be approximately known based on test sequences when developing a coding method. The selection of the prediction may be based on combined, mutually optimized rate-distortion performance, so that even a greater distortion may be accepted if the prediction produces clearly fewer bits. Policies to select the prediction, e.g. the best prediction, may be chosen by a codec provider.
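
Such a combined selection is commonly expressed as minimizing a rate-distortion cost J = D + λR. The sketch below illustrates the idea with invented distortion and rate figures; the Lagrange multiplier and the numbers are assumptions made for the example, not values mandated by the method.

```python
def select_prediction(modes, lagrange_multiplier=0.1):
    """Pick the prediction mode minimizing the rate-distortion cost J = D + lambda * R.

    `modes` maps a mode name to (distortion, rate_in_bits); the figures below
    are invented purely for illustration.
    """
    costs = {name: d + lagrange_multiplier * r for name, (d, r) in modes.items()}
    return min(costs, key=costs.get), costs

modes = {
    "intra": (1200.0, 5200),   # low distortion, many bits
    "inter": (1500.0, 2500),   # moderate distortion, fewer bits
    "3d":    (1600.0,  900),   # higher distortion, clearly fewer bits
}
best, costs = select_prediction(modes)
print(best, {k: round(v, 1) for k, v in costs.items()})  # '3d' wins despite higher distortion
```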

[0055] A difference is determined 240 between the first video frame and the predicted first frame. The difference represents a prediction error:

[0056] e_t = x_t - p'_t.

[0057] The prediction error may be calculated e.g. based on colour differences between pixels and/or based on differences in depth information. Weights may be used in the determination of the prediction error. For example, the colour differences may be chosen to have a different weight, i.e. either more or less weight, than the depth differences.

[0058] The prediction error may be e.g. quantized 242 to obtain a quantized prediction error e'_t. The prediction error may be encoded 244 and the encoded prediction error c(e)'_t may be transmitted 246 to a decoder at one or more remote sites. The remote site may be e.g. a control or monitoring room.
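
The sketch below illustrates one possible way of weighting colour and depth differences and uniformly quantizing the resulting prediction error; the weights, the quantization step and the names used are assumptions chosen for the example only.

```python
import numpy as np

def prediction_error(frame_rgb, frame_depth, pred_rgb, pred_depth,
                     colour_weight=1.0, depth_weight=0.5):
    """Weighted colour and depth prediction errors (weights are illustrative)."""
    e_rgb = colour_weight * (frame_rgb.astype(np.float32) - pred_rgb.astype(np.float32))
    e_depth = depth_weight * (frame_depth.astype(np.float32) - pred_depth.astype(np.float32))
    return e_rgb, e_depth

def quantize(error, step=4.0):
    """Simple uniform quantization of the prediction error."""
    return np.round(error / step).astype(np.int16)

def dequantize(q, step=4.0):
    """Inverse of the uniform quantizer used by encoder and decoder alike."""
    return q.astype(np.float32) * step

# Toy usage on random data.
rng = np.random.default_rng(0)
frame_rgb = rng.integers(0, 256, (4, 4, 3))
pred_rgb = rng.integers(0, 256, (4, 4, 3))
frame_d = rng.uniform(0.5, 4.0, (4, 4))
pred_d = rng.uniform(0.5, 4.0, (4, 4))
e_rgb, e_d = prediction_error(frame_rgb, frame_d, pred_rgb, pred_d)
print(np.abs(e_rgb - dequantize(quantize(e_rgb))).max())  # bounded by step / 2
```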

[0059] The smaller the prediction error to be coded and transmitted, the more the quantization errors and bits are reduced. For example, a higher number of quantization levels may be allocated to more common prediction error values, which allows for reducing the overall number of bits.

[0060] The current 3D reconstruction may be updated based on the predicted first frame and the prediction error to obtain an updated 3D reconstruction. The updating may comprise summing 248 the predicted first frame and the prediction error to obtain a reconstructed first frame. In Fig. 2, a reconstructed block is denoted by x'_t. The 3D reconstruction may be updated 249 using the reconstructed first frame and knowledge of the current viewpoint. A projection, i.e. a 2D image, may be determined from the current 3D reconstruction from the current viewpoint. This projection image may be compared to the reconstructed first frame. The viewpoint of the reconstructed first frame corresponds to the current viewpoint. If differences in pixel values and/or depth values are detected based on the comparison, the voxel value of the 3D reconstruction may be updated to correspond to the pixel value of the reconstructed first frame. The position of the voxel may be updated accordingly based on the depth value of the corresponding pixel. An example of a 3D reconstruction algorithm is the TSDF algorithm.
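
A strongly simplified sketch of such a voxel update is given below. It uses a running-average colour update instead of a full TSDF integration, and the pixel-to-voxel mapping is assumed to be given; in a real system the mapping would follow from ray casting the reconstruction from the tracked viewpoint. All names are assumptions made for the example.

```python
import numpy as np

def update_reconstruction(voxel_colours, voxel_weights, reconstructed_frame,
                          pixel_to_voxel, learning_weight=1.0):
    """Running-average update of voxel colours from a reconstructed frame.

    `pixel_to_voxel` maps each pixel (i, j) to the index of the voxel it hits
    from the current viewpoint.
    """
    h, w = reconstructed_frame.shape[:2]
    for i in range(h):
        for j in range(w):
            v = pixel_to_voxel[i, j]
            w_old = voxel_weights[v]
            voxel_colours[v] = (w_old * voxel_colours[v]
                                + learning_weight * reconstructed_frame[i, j]) / (w_old + learning_weight)
            voxel_weights[v] = w_old + learning_weight

# Toy usage: 4 voxels observed by a 2x2 reconstructed frame.
colours = np.zeros(4)
weights = np.zeros(4)
frame = np.array([[10.0, 20.0], [30.0, 40.0]])
mapping = np.array([[0, 1], [2, 3]])
update_reconstruction(colours, weights, frame, mapping)
print(colours)  # [10. 20. 30. 40.]
```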

[0061] The remote site 250 may be considered as a decoder. The prediction error, i.e. the difference between the first video frame captured at the local site and the predicted first frame predicted at the local site, is received at the remote site. The difference may be decoded 252 to obtain a decoded prediction error e'_t. Further, the remote site receives 235 the prediction mode that has been selected at the local site, e.g. to give the best prediction for the first video frame. In addition, the remote site may receive 230 the viewpoint data. With knowledge of the viewpoint, the viewpoint dependent prediction 264, i.e. the 3D prediction, at the remote site is identical or substantially identical to the 3D prediction at the local site. The current 3D reconstruction at the remote site is identical or substantially identical to the current 3D reconstruction at the local site.

[0062] Thus, the predicted first frame is determined 265 at the remote site at least based on the selected prediction mode received from the local site. Based on the received prediction mode, the decoder selects one of the one or more predictions 260, 262, 264. In case the selected prediction mode is the 3D prediction, the predicted first frame is determined at the remote site based on the selected prediction mode and the viewpoint data received from the local site. The prediction loops are identical or substantially identical at the local site and at the one or more remote sites.

[0063] The 3D reconstruction at the remote site may be updated 299 based on the predicted first frame and the prediction error to obtain an updated 3D reconstruction. The updating may comprise summing 290 the predicted first frame and the prediction error to obtain a reconstructed frame. In Fig. 2, a reconstructed block is denoted by x'_t. The 3D reconstruction may be updated using the reconstructed first frame and knowledge of the current viewpoint.

[0064] The updated 3D reconstructions at the local site and at the one or more remote sites are identical or substantially identical.

[0065] Fig. 4 shows, by way of example, a flowchart of a method for updating a 3D reconstruction over a network. The method may be performed by an apparatus at the remote site. The apparatus may be e.g. a computing device, a personal computer or a server-based computing device. The apparatus may be e.g. the apparatus of Fig. 7. The method 400 comprises receiving 410, at a remote site, data indicating a first viewpoint from which a first video frame of a scene has been captured at a local site. The method 400 comprises receiving 420, from an encoder at the local site, a prediction error between the first video frame captured at the local site and a predicted frame predicted at the local site. The method 400 comprises receiving 430 an indication of a prediction mode. The method 400 comprises selecting 440 a first predicted frame based on the indication of the prediction mode. The method 400 comprises decoding 450 the prediction error. The method 400 comprises updating 460 a current 3D reconstruction of the scene based on the data indicating the first viewpoint, the first predicted frame and the prediction error to obtain an updated 3D reconstruction.
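
A minimal decoder-side sketch is given below. It mirrors the encoder-side sketch given in connection with Fig. 1: the decoder forms its own candidate predictions, picks the one indicated by the received prediction mode and sums it with the decoded prediction error; all names are assumptions made for the example.

```python
import numpy as np

def decode_frame(mode, error, candidates):
    """Decoder side: rebuild the frame from the signalled mode and the decoded error.

    `candidates` holds the decoder's own predictions; because the decoder uses
    the same 3D reconstruction and viewpoint data as the encoder, its
    predictions match those formed at the local site.
    """
    predicted = candidates[mode]
    return predicted.astype(np.int16) + error   # reconstructed frame

# Toy usage mirroring the encoder example.
prediction_3d = np.full((8, 8), 128, dtype=np.uint8)
original = prediction_3d.astype(np.int16) + 3          # what the encoder captured
error = original - prediction_3d.astype(np.int16)      # what the encoder transmitted
reconstructed = decode_frame("3d", error, {"3d": prediction_3d})
print(bool(np.array_equal(reconstructed, original)))   # True: both sites agree
```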

[0066] Fig. 5 shows, by way of example, a remote monitoring and control service using the method disclosed herein. Let us consider a remote monitoring and control application in the construction industry, where a working robot 520 performs for example unmanned painting operations around the construction site at the local site. The working robot is equipped with a capture sensor, e.g. an RGB-D sensor. While doing its painting task, the robot moves from place to place in the construction site, and collects and delivers dynamic 3D reconstruction data to remote sites according to the method as disclosed herein.

[0067] The 3D reconstruction is being updated at the local site 510 and at the remote site 550 along the robot's maneuvers. The monitoring user 560 may e.g. perform scene analysis 570 based on differences between the reconstructed real-time video stream and the semi-static 3D capture. 3D reconstruction algorithms, e.g. TSDF, may filter and/or average captured frames over time so that a scene feature, e.g. a moving person or object, does not appear instantaneously in the 3D reconstruction. If the features remain long enough in the scene, they will end up in the 3D reconstruction. The speed of making updates to a 3D reconstruction is an adjustable parameter in 3D reconstruction algorithms. The scene features, e.g. a moving person or object, may be segmented from the reconstructed real-time video stream by detecting differences between the reconstructed video stream and the 3D reconstruction. The updated 3D reconstruction may be used for example in tracking locations of building materials and tools, helping in managing of material and task logistics, alarming about items and situations causing safety risks, etc. The 3D reconstruction may be viewed e.g. using a 3D viewer which enables viewing the 3D reconstruction from different viewpoints.
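
A minimal sketch of such difference-based segmentation is given below; the threshold value and the names used are assumptions made for illustration.

```python
import numpy as np

def segment_changes(reconstructed_frame, projection_of_reconstruction, threshold=25.0):
    """Mask pixels where the live reconstructed frame differs from the semi-static
    3D reconstruction projected to the same viewpoint (threshold is illustrative)."""
    diff = np.abs(reconstructed_frame.astype(np.float32)
                  - projection_of_reconstruction.astype(np.float32))
    return diff > threshold

# Toy usage: a "moving object" appears as a bright 3x3 patch not present in the reconstruction.
background = np.full((10, 10), 50, dtype=np.uint8)
live = background.copy()
live[4:7, 4:7] = 200
mask = segment_changes(live, background)
print(int(mask.sum()))  # 9 changed pixels
```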

[0068] Fig. 6 shows, by way of example, supporting remote augmentation of instructions for a local maintenance worker 620 using the method disclosed herein. A local maintenance worker at the local site 610 wears AR glasses embedded with a capture sensor, e.g. RGB and depth cameras. The capture sensor may capture data for the 3D reconstruction in video-plus-depth format. A reconstructed version of the video-plus-depth data, e.g. with the induced quantization error, is formed in both the encoder and the decoder, i.e. at the local site and the remote site, respectively. A remote expert 660 at the remote site 650 may use an application 670 for producing 3D augmentations, bound to the 3D reconstruction of the local environment.

[0069] Data or references to the augmented objects, and/or their scale and pose w.r.t the 3D reconstruction, are sent 680 back to the local site, where the application 630 forwards them further to the maintenance worker's AR glasses. Instead of sending the AR object, an instruction to download the AR object may be transmitted to the local site. The instruction to download may comprise e.g. an address of a server from which the AR object may be downloaded. A tracking algorithm, which is part of the AR application, maintains knowledge of the orientation and scale of the 3D reconstruction w.r.t the local view, i.e. the real world coordinates.

[0070] Knowing the position(s) of the AR objects relating to the 3D reconstruction, and the orientation of the reconstruction w.r.t. real world coordinates, it is possible to render the augmentations correctly into the AR glasses of the maintenance worker.

[0071] The method disclosed herein uses a compress-then-analyze (CTA) approach, which enables easier AR content production and object positioning. This is because AR content production and object positioning are easier using visual 3D reconstructions instead of e.g. point clouds or feature sets.

[0072] Referring back to Fig. 2, the reconstructed video or video-plus-depth stream 298 may be stored in a memory as a function of time. By storing the history of the captured video or video-plus-depth stream, it is possible to form a 3D reconstruction corresponding to a specific point in time. For example, it is possible to check afterwards what the situation at a construction site was e.g. a month ago. Storing the video or video-plus-depth data requires less memory than storing complete 3D reconstructions as a function of time, i.e. a series of 3D reconstructions.
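
A minimal sketch of rebuilding a 3D reconstruction corresponding to a chosen point in time from the stored stream is given below; the data layout and the incremental build function are assumptions made for the example.

```python
def reconstruction_at(stored_frames, target_time, build):
    """Rebuild the 3D reconstruction as it was at `target_time` by replaying
    the stored reconstructed frames captured up to that time.

    `stored_frames` is a list of (timestamp, frame, viewpoint) tuples and
    `build` is any incremental reconstruction routine (e.g. a TSDF update).
    """
    state = None
    for timestamp, frame, viewpoint in stored_frames:
        if timestamp > target_time:
            break
        state = build(state, frame, viewpoint)
    return state

# Toy usage: the "reconstruction" is just the last frame integrated so far.
frames = [(t, f"frame-{t}", None) for t in (0, 10, 20, 30)]
print(reconstruction_at(frames, 15, lambda s, f, v: f))  # frame-10
```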

[0073] Fig. 7 shows, by way of example, a block diagram of an apparatus 700. The apparatus may be an apparatus capable of performing the method(s) as disclosed herein. The apparatus may be an apparatus at the local site or at the remote site. Comprised in apparatus 700 is processor 710, which may comprise, for example, a single- or multi-core processor wherein a single-core processor comprises one processing core and a multi-core processor comprises more than one processing core. Processor 710 may comprise, in general, a control device. Processor 710 may comprise more than one processor. Processor 710 may be a control device. Processor 710 may be means for performing method steps in apparatus 700. Processor 710 may be configured, at least in part by computer instructions, to perform actions.

[0074] Apparatus 700 may comprise memory 720. Memory 720 may comprise random-access memory and/or permanent memory. Memory 720 may comprise at least one RAM chip. Memory 720 may comprise solid-state, magnetic, optical and/or holographic memory, for example. Memory 720 may be at least in part accessible to processor 710. Memory 720 may be at least in part comprised in processor 710. Memory 720 may be means for storing information. Memory 720 may comprise computer instructions that processor 710 is configured to execute. When computer instructions configured to cause processor 710 to perform certain actions are stored in memory 720, and apparatus 700 overall is configured to run under the direction of processor 710 using computer instructions from memory 720, processor 710 and/or its at least one processing core may be considered to be configured to perform said certain actions. Memory 720 may be at least in part external to apparatus 700 but accessible to apparatus 700.

[0075] Apparatus 700 may comprise a transmitter 730. Apparatus 700 may comprise a receiver 740. Transmitter 730 and receiver 740 may be configured to transmit and receive, respectively, information in accordance with at least one wireless or cellular or non-cellular standard. Transmitter 730 may comprise more than one transmitter. Receiver 740 may comprise more than one receiver. Transmitter 730 and/or receiver 740 may be configured to operate in accordance with global system for mobile communication, GSM, wideband code division multiple access, WCDMA, 5G, long term evolution, LTE, IS-95, wireless local area network, WLAN, Ethernet and/or worldwide interoperability for microwave access, WiMAX, standards, for example.

[0076] Apparatus 700 may comprise user interface, UI, 760. UI 760 may comprise at least one of a display, a keyboard, a touchscreen, a mouse. A user may be able to operate apparatus 700 via UI 760.

[0077] According to an embodiment, the viewpoint dependent prediction is used as the prediction method at the local site and at the one or more remote sites. The viewpoint dependent prediction is determined based on a 3D reconstruction of the scene. The prediction error is determined between the received frame and the viewpoint dependent prediction. The prediction error is compressed, e.g. by applying a standard coding method, e.g. H.264 AVC compression. As another example, video and depth may be coded separately. The compressed errors are then coded and transmitted to the one or more remote sites. In coding of the prediction error, other predictions may be used, such as inter and intra predictions. At the remote site, the errors are decoded and summed with the viewpoint dependent prediction determined at the remote site. The local site and the remote site have the same 3D reconstruction available. A predictive video coding method based on viewpoint dependent prediction may result in higher compression and/or a lower bitrate. This kind of predictive video coding method enables new services, such as services which use 3D reconstructions of a scene. Such services may be for example augmented reality visualizations. As the 3D reconstruction improves over time, since the video or video-plus-depth data capture device moves repeatedly in the same environment, the viewpoint dependent predictions improve over time as well.