Applying Optical Flow Techniques to Estimating Image Displacement


In computer vision and image processing systems, the problem often arises of determining the movements of objects in three-dimensional space using an optical sensor, that is, a video camera. Having a sequence of frames as input, it is necessary to recreate the three-dimensional space imprinted on them and the changes that occur to it over time. It sounds complicated, but in practice it is often enough to find the offsets of two-dimensional projections of objects in the frame plane.

If we want to know how much this or that object has shifted relative to its position on the previous frame during the time that has passed between capturing the frames, then most likely we will first of all think of the optical flow. To find the optical flow, you can safely use a ready-made, tested and optimized implementation of one of the algorithms, for example, from the OpenCV library. It does no harm, however, to understand the theory, so I invite everyone interested to look inside one of the popular and well-studied methods. This article contains no code or practical advice, but it does contain formulas and a number of mathematical derivations.

There are several approaches to determining offsets between two adjacent frames. For example, for each small fragment (say, 8 by 8 pixels) of one frame, you can find the most similar fragment in the next frame. In this case, the difference between the coordinates of the original and found fragments will give us the offset. The main difficulty here is how to quickly find the desired fragment without going through the entire frame pixel by pixel. Various implementations of this approach solve the problem of computational complexity in one way or another. Some are so successful that they are used, for example, in common video compression standards. The price of speed is, of course, quality. We will consider another approach, which allows us to obtain offsets not for fragments, but for each individual pixel, and is used when the speed is not so critical. The term “optical flow” is often associated with it in the literature.
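Before moving on, here is a rough sketch of the fragment-based approach just mentioned, searching for an 8 by 8 block of one frame inside a small window of the next frame. This is my own illustration, not something from the article: the variable names `prev_gray`/`next_gray`, the block size and the search radius are all assumptions, and the exhaustive search here is exactly the slow baseline that real codecs avoid.

```python
import cv2
import numpy as np

def block_offset(prev_gray, next_gray, x, y, block=8, search=16):
    """Estimate the shift of the block at (x, y) by exhaustive matching
    inside a limited search window of the next frame."""
    tpl = prev_gray[y:y + block, x:x + block]
    y0, y1 = max(0, y - search), min(next_gray.shape[0], y + block + search)
    x0, x1 = max(0, x - search), min(next_gray.shape[1], x + block + search)
    area = next_gray[y0:y1, x0:x1]
    # Squared difference between the block and every position inside the window
    scores = cv2.matchTemplate(area, tpl, cv2.TM_SQDIFF)
    _, _, best, _ = cv2.minMaxLoc(scores)        # location of the smallest difference
    return best[0] + x0 - x, best[1] + y0 - y    # offset relative to the original block
```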

This approach is often called differential, since it is based on the calculation of partial derivatives in the horizontal and vertical directions of the image. As we will see later, derivatives alone are not enough to determine the offsets. That is why, based on one simple idea, a great many methods have appeared, each of which uses some kind of mathematical dance with a tambourine to achieve the goal. Let's focus on the Lucas-Kanade method, proposed in 1981 by Bruce Lucas and Takeo Kanade.

Lucas-Kanade method
All further reasoning is based on one very important and not entirely fair assumption: assume that pixel values pass from one frame to the next without change. In other words, pixels belonging to the same object can move in any direction, but their value remains unchanged. Of course, this assumption has little to do with reality, because global lighting conditions and the illumination of the moving object itself can change from frame to frame. There are a lot of problems with this assumption, but, oddly enough, despite everything it works quite well in practice.

In mathematical language, this assumption can be written as $I(x + d_x, y + d_y, t + 1) = I(x, y, t)$, where $I$ is the pixel brightness as a function of position in the frame and time. In other words, $x$ and $y$ are the pixel coordinates in the frame plane, $(d_x, d_y)$ is the offset, and $t$ is the frame number in the sequence. Let us agree that a single time interval passes between two adjacent frames.

One-dimensional case
Let us first consider the one-dimensional case. Imagine two one-dimensional frames, 1 pixel high and 20 pixels wide (figure on the right). On the second frame the image is slightly shifted to the right. It is this offset that we want to find. To do this, we represent the same frames as functions $I_1(x)$ and $I_2(x)$ (figure on the left): the input is the position of the pixel, the output is its intensity. In such a representation the required displacement $d$ can be seen even more clearly. According to our assumption, $I_2$ is just $I_1$ shifted by $d$, that is, we can say that $I_2(x) = I_1(x - d)$.

Please note that, if you wish, this can also be written in the general form $I(x + d, y, t + 1) = I(x, y, t)$, where the $y$ coordinate and the offset along $y$ are fixed and equal to zero.

For each coordinate $x$ we know the values $I_1(x)$ and $I_2(x)$ at that point; in addition, we can calculate their derivatives. Let us relate the known values to the offset $d$. To do this, we write the Taylor series expansion for $I_1(x - d)$:

$$I_1(x - d) = I_1(x) - d\,I_1'(x) + \frac{d^2}{2}\,I_1''(x) - \dots$$

Let's make a second important assumption: assume that $I_1(x - d)$ is approximated well enough by the first derivative. Having made this assumption, we discard everything after the first-derivative term:

$$I_1(x - d) \approx I_1(x) - d\,I_1'(x)$$

How correct is this? In general, not very: we lose accuracy here unless our function / image is strictly linear, as in our artificial example. But it greatly simplifies the method, and the required accuracy can be reached by successive approximation, which we will consider later.

We are almost there. The offset $d$ is our target value, so we need to do something with $I_1(x - d)$. As we agreed earlier, $I_2(x) = I_1(x - d)$, so we just rewrite:

$$I_2(x) \approx I_1(x) - d\,I_1'(x), \qquad d \approx \frac{I_1(x) - I_2(x)}{I_1'(x)}$$
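A tiny numerical sketch of this one-dimensional formula; the signal and the shift here are made up purely for illustration:

```python
import numpy as np

x = np.arange(20, dtype=float)
I1 = 0.5 * x                 # a strictly linear "frame", as in the artificial example
d_true = 3.0
I2 = 0.5 * (x - d_true)      # the same frame shifted to the right by d

dI1 = np.gradient(I1)        # numerical derivative of the first frame
d_est = (I1 - I2) / dI1      # d ~= (I1(x) - I2(x)) / I1'(x)
print(d_est[5])              # exactly 3.0, because the signal is linear
```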

2D case
Let us now pass from the one-dimensional case to the two-dimensional one. We write the Taylor series expansion for $I(x + d_x, y + d_y, t + 1)$ and immediately discard all higher-order derivatives. Instead of the first derivative, a gradient appears:

$$I(x + d_x, y + d_y, t + 1) \approx I(x, y, t + 1) + \nabla I \cdot \mathbf{d}$$

where $\mathbf{d} = (d_x, d_y)$ is the displacement vector and $\nabla I = (I_x, I_y)$ is the spatial gradient.
In accordance with the assumption made, $I(x + d_x, y + d_y, t + 1) = I(x, y, t)$. Note that the left-hand side is exactly the expression we have just expanded. This is what we need. Let's rewrite:

$$I(x, y, t + 1) + \nabla I \cdot \mathbf{d} \approx I(x, y, t)$$

Since a single time interval passes between two frames, we can say that $I(x, y, t + 1) - I(x, y, t)$ is nothing more than the derivative with respect to time, $I_t$.
Let's rewrite:

$$\nabla I \cdot \mathbf{d} + I_t \approx 0$$

Let's rewrite it again, expanding the gradient:

$$I_x\,d_x + I_y\,d_y + I_t \approx 0$$

We have an equation telling us that this sum of partial-derivative terms must equal zero. The only problem is that we have one equation with two unknowns, $d_x$ and $d_y$. At this point a flight of fancy and a variety of approaches begin.

Let's make a third assumption: assume that adjacent pixels are shifted by the same distance. Take a fragment of the image, say 5 by 5 pixels, and agree that $d_x$ and $d_y$ are the same for each of the 25 pixels. Then instead of one equation we get 25 equations at once! Obviously, in the general case the system has no exact solution, so we will look for the $d_x$ and $d_y$ that minimize the error:

$$E(d_x, d_y) = \sum_{i,j} g(i, j)\,\bigl(I_x(i, j)\,d_x + I_y(i, j)\,d_y + I_t(i, j)\bigr)^2$$

Here g is a function that determines the weights for the pixels. The most common variant is the 2D Gaussian, which gives the most weight to the central pixel and less and less as you move away from the center.

To find the minimum, we use the least squares method: we take the partial derivatives of $E$ with respect to $d_x$ and $d_y$:

$$\frac{\partial E}{\partial d_x} = \sum_{i,j} 2\,g\,I_x\,(I_x d_x + I_y d_y + I_t), \qquad \frac{\partial E}{\partial d_y} = \sum_{i,j} 2\,g\,I_y\,(I_x d_x + I_y d_y + I_t)$$

We rewrite in a more compact form and equate to zero:

$$\Bigl(\sum g I_x^2\Bigr) d_x + \Bigl(\sum g I_x I_y\Bigr) d_y = -\sum g I_x I_t, \qquad \Bigl(\sum g I_x I_y\Bigr) d_x + \Bigl(\sum g I_y^2\Bigr) d_y = -\sum g I_y I_t$$

Let's rewrite these two equations in matrix form:

$$M \begin{pmatrix} d_x \\ d_y \end{pmatrix} = \mathbf{b}, \qquad M = \begin{pmatrix} \sum g I_x^2 & \sum g I_x I_y \\ \sum g I_x I_y & \sum g I_y^2 \end{pmatrix}, \qquad \mathbf{b} = -\begin{pmatrix} \sum g I_x I_t \\ \sum g I_y I_t \end{pmatrix}$$

If the matrix $M$ is invertible (has rank 2), we can calculate the $d_x$ and $d_y$ that minimize the error $E$:

$$\begin{pmatrix} d_x \\ d_y \end{pmatrix} = M^{-1}\mathbf{b}$$

That's actually all. We know the approximate pixel offset between two adjacent frames.

Since the neighboring pixels are also involved in finding the offset of each pixel, when implementing this method, it is advisable to preliminarily calculate the horizontal and vertical derivatives of the frame.
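A minimal single-level sketch of this least-squares step in NumPy/OpenCV, written by me as an illustration rather than taken from the article. For brevity the derivatives are computed inside the function, although, as noted above, in a real implementation they would be precomputed once for the whole frame; the window size and sigma are arbitrary assumptions, and the frames are assumed to be float32 grayscale images.

```python
import cv2
import numpy as np

def lucas_kanade_point(I1, I2, x, y, win=5, sigma=1.5):
    """Estimate (dx, dy) at pixel (x, y) from two float32 grayscale frames."""
    Ix = cv2.Sobel(I1, cv2.CV_32F, 1, 0, ksize=3)   # horizontal derivative
    Iy = cv2.Sobel(I1, cv2.CV_32F, 0, 1, ksize=3)   # vertical derivative
    It = I2 - I1                                    # temporal derivative (dt = 1)

    r = win // 2
    sl = np.s_[y - r:y + r + 1, x - r:x + r + 1]
    g1 = cv2.getGaussianKernel(win, sigma)
    g = (g1 @ g1.T).ravel()                         # 2D Gaussian weights g over the window

    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    M = np.array([[np.sum(g * ix * ix), np.sum(g * ix * iy)],
                  [np.sum(g * ix * iy), np.sum(g * iy * iy)]])
    b = -np.array([np.sum(g * ix * it), np.sum(g * iy * it)])
    return np.linalg.solve(M, b)                    # (dx, dy) minimizing the error E
```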

Disadvantages of the method
The method described above is based on three significant assumptions, which, on the one hand, give us the fundamental ability to determine the optical flow, but, on the other hand, introduce an error. The good news for perfectionists is that only one of the assumptions was made purely to simplify the method, and its consequences can be dealt with. We assumed that the first derivative would be enough to approximate the displacement. In the general case this is, of course, not so (figure on the left). To achieve the required accuracy, the offset for each pair of frames (let's call them $I_t$ and $I_{t+1}$) can be calculated iteratively. In the literature this is called warping. In practice it means that, having calculated the offsets at the first iteration, we move each pixel of the frame in the opposite direction so that this offset is compensated. At the next iteration, instead of the original frame we use its warped version. And so on, until at some iteration all the resulting offsets are smaller than the specified threshold value. The final offset for each particular pixel is the sum of its offsets over all iterations.
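A hedged sketch of this warping loop. Here `flow_step` stands for any single-step dense estimator returning a per-pixel (dx, dy) field (for example, the per-pixel solver sketched earlier applied at every pixel); the function name, the iteration count and the threshold are my own assumptions.

```python
import cv2
import numpy as np

def refine_by_warping(I1, I2, flow_step, iters=5, eps=0.01):
    """Accumulate per-pixel offsets over several warping iterations."""
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    total = np.zeros((h, w, 2), np.float32)          # accumulated (dx, dy) per pixel
    for _ in range(iters):
        # Warp the first frame by the offsets found so far, then re-estimate
        warped = cv2.remap(I1, xs - total[..., 0], ys - total[..., 1],
                           cv2.INTER_LINEAR)
        d = flow_step(warped, I2)                    # residual offsets at this iteration
        total += d
        if np.abs(d).max() < eps:                    # all residuals below the threshold
            break
    return total
```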

By its nature, this method is local: when determining the offset of a particular pixel, only the area around this pixel, the local neighborhood, is taken into account. As a consequence, it is impossible to determine displacements inside sufficiently large (larger than the local neighborhood) uniformly colored areas of the frame. Fortunately, such areas are not often found in real frames, but this feature still introduces an additional deviation from the true displacement.

Another problem is that some textures in the image produce a degenerate matrix M, for which the inverse matrix cannot be found. Accordingly, for such textures we will not be able to determine the offset: there seems to be movement, but it is not clear in which direction. The considered method is not the only one that suffers from this problem; even the human eye perceives such motion ambiguously (the barber pole illusion).
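OpenCV can report how well-conditioned the matrix M is at every pixel, which gives a practical way to mask out such textures. A small sketch (my own illustration; `gray` is assumed to be a single-channel 8-bit or float32 frame, and the threshold is an arbitrary value to be tuned):

```python
import cv2
import numpy as np

# Smaller eigenvalue of the 2x2 gradient matrix M over a 5x5 neighborhood
min_eig = cv2.cornerMinEigenVal(gray, 5, ksize=3)
reliable = min_eig > 1e-4   # pixels where the system is well-conditioned; threshold is an assumption
```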

Conclusion
We have analyzed the theoretical foundations of one of the differential methods for finding the optical flow. There are many other interesting methods, some of which, as of today, give more reliable results. However, the Lucas-Kanade method, while reasonably effective, remains quite simple to understand and is therefore well suited for getting acquainted with the mathematical foundations.

Although the problem of finding the optical flow has been studied for several decades, the methods are still being improved. Work continues because, upon closer examination, the problem turns out to be very difficult, and the stability and efficiency of many other algorithms in video and image processing depend on the quality of the estimated offsets.

On this lofty note, let me wrap up and move on to sources and useful links.
Lucas-Kanade method

  • Lucas-Kanade

    Optical flow is a technique used in various areas of computer vision to determine shifts, for segmentation, object selection, and video compression. However, if we want to quickly implement it in our project after reading about it on Wikipedia or elsewhere, we will most likely very quickly stumble upon the fact that it works very poorly and fails already at shifts on the order of 1-2 pixels (at least that's how it was for me). Then we turn to ready-made implementations, for example, in OpenCV. There it is implemented by several different methods, and it is not at all clear why the variant labeled PyrLK is better or worse than the one labeled Farneback or something like that, and you will have to deal with the meaning of the parameters, which in some implementations are very numerous. And, interestingly, these algorithms somehow work, unlike what we wrote ourselves. What is the secret?

    What is optical flow

    Optical flow (OF) is the picture of apparent motion: the shift of each point between two images. In fact, it is a field of velocities (because a shift is, up to scale, equivalent to an instantaneous velocity). The essence of the OF is that for each point of the image there is a shift (dx, dy) such that the point on the second image corresponds to the starting point. How to determine the correspondence of points is a separate issue. To do this, we need to take some function of the point that does not change as a result of the displacement. It is usually assumed that a point retains its intensity (i.e., brightness, or color for color images), but points can also be considered identical if the gradient, the Hessian, its magnitude or its determinant, the Laplacian, or other characteristics are preserved. Obviously, the preservation of intensity fails if the illumination or the angle of incidence of light changes. However, if we are talking about a video stream, then most likely the illumination will not change much between two frames, if only because a small period of time passes between them. Therefore, intensity is often used as the function that is conserved at a point.

    According to such a description, one can confuse the OP with the search and comparison of characteristic points. But these are different things, the essence of the optical flow is that it does not look for any special points, but tries to determine from the parameters of the images where an arbitrary point has shifted.

    There are two variants of optical flow computation: dense and sparse. The sparse flow calculates the shift of individual given points (for example, points selected by some feature detector), while the dense flow calculates the shift of all points in the image. Naturally, the sparse flow is calculated faster, but for some algorithms the difference is not so big, and for some tasks the flow is required at all points of the image.

    For degenerate cases, simpler methods for determining the shift can be used. In particular, if all points of the image have the same shift (the image is shifted as a whole), then the phase correlation method can be applied: calculate the Fourier transform of both images, find the convolution of their phases, and determine the shift from it (see en.wikipedia.org/wiki/Phase_correlation). You can also use block matching: find the shift that minimizes the norm of the difference between the images in a window. In its pure form, such an algorithm runs for a long time and is not robust to rotations and other distortions. The English Wikipedia classifies the listed algorithms as various options for calculating the optical flow, but this seems to me not quite correct, since these algorithms can be applied for other purposes and do not completely solve this problem. We will call optical flow the methods based on local characteristics of images (what the English Wikipedia calls differential methods).
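    For the pure-translation case mentioned above, OpenCV already exposes phase correlation. A minimal sketch, assuming `prev_gray` and `next_gray` are single-channel grayscale frames of equal size:

```python
import cv2
import numpy as np

prev_f32 = prev_gray.astype(np.float32)    # phaseCorrelate expects float32/float64, single channel
next_f32 = next_gray.astype(np.float32)
(shift_x, shift_y), response = cv2.phaseCorrelate(prev_f32, next_f32)
print(shift_x, shift_y)                    # global shift of the whole image
```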

    Standard approach (Lucas-Kanade method)

    The mathematical description of the algorithm is given in sufficient detail in this article, but only the theoretical aspects are covered there.

    Let us consider the mathematical model of the optical flow, assuming that the intensity of the point has not changed as a result of the displacement.

    Let $I(x, y, t)$ be the intensity at some point $(x, y)$ in the first image (i.e., at time $t$). In the second image this point has shifted by $(dx, dy)$ while the time $dt$ has passed, so (expanding the intensity function according to Taylor up to the first-order term; later it will be mentioned why only the first)

    $$I(x + dx, y + dy, t + dt) \approx I(x, y, t) + I_x\,dx + I_y\,dy + I_t\,dt.$$

    Here $I_x$, $I_y$, $I_t$ are the partial derivatives with respect to the coordinates and time; $I_t\,dt$ is, in fact, the change in brightness at the point $(x, y)$ between the two frames.

    We believe that the intensity of the point has been preserved, which means
    $$I_x\,dx + I_y\,dy + I_t\,dt = 0.$$
    We get one equation with two unknowns ($dx$ and $dy$), which is not enough to solve it; with this equation alone we will not get far.

    The simplest solution to the problem is the Lucas-Kanade algorithm. Objects in the image are larger than 1 pixel, which means that, most likely, in the vicinity of the current point other points have approximately the same shifts. Therefore, we take a window around this point and minimize (by least squares) the total error in it, with weights distributed according to a Gaussian, that is, so that the pixels closest to the one under study have the greatest weight. After the simplest transformations we get a system of 2 equations with 2 unknowns:

    $$\Bigl(\sum w I_x^2\Bigr) dx + \Bigl(\sum w I_x I_y\Bigr) dy = -\sum w I_x I_t, \qquad \Bigl(\sum w I_x I_y\Bigr) dx + \Bigl(\sum w I_y^2\Bigr) dy = -\sum w I_y I_t$$

    As is known, this system does not always have a unique solution (although it does very often): if the determinant of the system is zero, there are either no solutions or infinitely many. This problem is known as the aperture problem: the ambiguity of the shift under a limited field of view for periodic pictures. It corresponds to the case when a fragment with some cyclicity enters the field of view; here even a person cannot unambiguously determine where the picture has shifted. The problem is that, due to noise, in such ambiguous situations we get not a zero determinant but a very small one, which will most likely lead to very large shift values that bear little relation to reality. So at a certain stage we simply need to check whether the determinant of the system is small enough, and if so, skip such points or mark them as erroneous.

    Why doesn't it work?
    If we stop at this stage and implement this algorithm, it will work successfully, but only if the shift between adjacent images is very small, on the order of 1 pixel, and even then not always. (To analyze the quality, synthetic sequences with different relative shifts were generated; the shift can be a non-integer number of pixels, in which case the resulting image is interpolated accordingly.) Already at a shift of 2 pixels the error becomes large, and at 3 or more the result is generally inadequate. What's the matter?

    Mathematics has set us up here. It instilled in us the feeling that all the functions around us are continuous and differentiable many times over. At the institute we were taught to write the approximation of a function in the neighborhood of a point using the Taylor formula, and we thoughtlessly and joyfully use this everywhere. Now let's think about the physical meaning of the derivatives in this place. We want to use them to determine the change in the value of a function in a finite neighborhood of a point, but the derivative gives an idea of an infinitesimal neighborhood. To expand this neighborhood, one could add higher-order derivatives to the Taylor expansion, but this leads to nonlinearities in the system that make it much harder to solve, and the benefits are doubtful, especially since in practice we are dealing not with continuous, multiply differentiable functions but with discrete functions of a generally unknown shape. Therefore it would be more logical to look for a function $g(x)$ for which, in our discrete case, $f(x) + g(x) = f(x+1)$, $f(x) + 2g(x) = f(x+2)$, $f(x) - g(x) = f(x-1)$, and so on. That is, in this case we need not a derivative but some linear function that is closest to the points of the original function. Simple math (a least-squares fit) leads to the solution $g(x) = \sum_k k\,(f(x+k) - f(x)) \,\big/\, \sum_k k^2$, where the sum runs over the neighbors used. If we build such a "derivative" from one neighboring point on each side, we are lucky: the formula coincides with the formula for the approximate calculation of the derivative, $g(x) = (f(x+1) - f(x-1))\,/\,2$. Tellingly, exactly this formula is used in OpenCV when calculating the Lucas-Kanade optical flow; we will return to this later. But if you take more points, the formula becomes quite different from the classical difference schemes for the first derivative.
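    A small sketch of this idea (my own, with a made-up test signal): fit a single slope g(x) to several neighbors on each side by least squares and compare it with the classical central difference.

```python
import numpy as np

def ls_slope(f, x, k=3):
    """Least-squares slope of f around index x using k neighbors on each side."""
    offsets = np.arange(-k, k + 1)
    values = f[x + offsets] - f[x]
    # Fit values ~= slope * offsets in the least-squares sense
    return np.sum(offsets * values) / np.sum(offsets * offsets)

f = np.array([0, 1, 8, 27, 64, 125, 216, 343, 512], dtype=float)  # arbitrary non-linear signal
print(ls_slope(f, 4, k=1))   # one neighbor per side: reproduces (f[x+1]-f[x-1])/2 = 49.0
print(ls_slope(f, 4, k=3))   # three neighbors per side: a noticeably different slope, 55.0
```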

    Obviously, if we build this function from, say, three neighboring points to the left and right of the original one, it does not depend in any way on points located further away, and accordingly, for shifts larger than three points we will still often get inadequate results. Also, the larger the number of points over which we build this function, the greater the average deviation of the resulting line from the points used, again because our images do not change linearly but rather however they please. In practice, shifts of more than 2 pixels already give an inadequately large error, no matter how many points we take.

    Another weak point of the algorithm is that, again, we are dealing not with smooth continuous functions but with arbitrary, and moreover discrete, ones. Therefore, in some fragments of the image the intensity can "jump" without any obvious pattern, for example at the boundaries of objects or because of noise. In this case no function g(x) will be able to describe the changes in the image in the neighborhood of the point accurately. To overcome this (at least partially), it is proposed to blur the original image, and it is useful to blur it quite strongly; that is, it is better to use not even everyone's favorite Gaussian blur (averaging with weight coefficients) but simply a box filter (uniform averaging over a window), and even several times in a row. Smoothness of the image is now more important to us than detail.
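    A hedged sketch of this pre-smoothing, assuming `gray` is a grayscale frame; the kernel size and the number of repetitions are arbitrary choices:

```python
import cv2

smoothed = gray
for _ in range(3):
    smoothed = cv2.blur(smoothed, (5, 5))   # uniform averaging over a 5x5 window, applied repeatedly
```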

    However, these measures still will not save us from the 2-3 pixel limit on the detectable shift. By the way, OpenCV 1.0 had exactly such an implementation of the optical flow, and it worked only in ideal conditions at very small shifts.

    What to do?
    In total, the usual Lucas-Kanade is good at detecting small shifts, those for which the picture is similar to its linear approximation. To deal with larger ones, we use a standard CV technique, multi-scaling: we build a "pyramid" of images at different scales (almost always scaled by a factor of 2 along each axis, it is easier to count that way) and traverse it with the optical flow from the smaller image to the larger one; a small shift detected on the small image then corresponds to a large shift on the large image. On the smallest image we detect a shift of no more than 1-2 pixels, and, moving from the smaller scale to the larger one, we use the result of the previous step and refine the shift values. In OpenCV this is implemented by the calcOpticalFlowPyrLK function. Using this pyramidal algorithm allows us not to bother with computing the linear approximation over many points: it is easier to take more levels of the pyramid and, at each level, a rather rough approximation of this function. For this reason OpenCV computes the derivative over only two neighboring points, and, as applied to this implementation of the algorithm, our conclusions about the advantage of the approximating function over the derivative turn out to be useless: for such a number of reference points the derivative is the best approximating function.

    And what else is there?

    This algorithm is not the only way to calculate the optical flow. In OpenCV, in addition to the Lucas-Kanade flow, there are also the Farneback and SimpleFlow flows, and the Horn-Schunck algorithm is also often mentioned.

    The Horn-Schunck method is somewhat more global than the Lucas-Kanade method. It relies on the assumption that the optical flow is fairly smooth over the entire image. Starting from the same equation, it is proposed to pass to the functional

    $$E = \iint \Bigl( (I_x u + I_y v + I_t)^2 + \alpha^2 \bigl(\lvert\nabla u\rvert^2 + \lvert\nabla v\rvert^2\bigr) \Bigr)\,dx\,dy,$$

    that is, to add the requirement that the shifts do not change sharply, with a weighting factor $\alpha$. Minimizing this functional leads us to a system of two equations:

    $$I_x (I_x u + I_y v + I_t) - \alpha^2 \Delta u = 0, \qquad I_y (I_x u + I_y v + I_t) - \alpha^2 \Delta v = 0.$$

    In these equations the Laplacian is proposed to be computed approximately as $\Delta u \approx \bar{u} - u$, the difference between the average over the neighbors and the value at the point. We obtain a system of equations written for each pixel and solve the overall system iteratively:

    $$u^{k+1} = \bar{u}^k - \frac{I_x\,(I_x \bar{u}^k + I_y \bar{v}^k + I_t)}{\alpha^2 + I_x^2 + I_y^2}, \qquad v^{k+1} = \bar{v}^k - \frac{I_y\,(I_x \bar{u}^k + I_y \bar{v}^k + I_t)}{\alpha^2 + I_x^2 + I_y^2}.$$
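    A compact NumPy sketch of this iteration (my own illustration; the derivative scheme, the neighbor-averaging kernel, alpha and the iteration count are all assumptions, and the frames are assumed to be float32 grayscale):

```python
import cv2
import numpy as np

def horn_schunck(I1, I2, alpha=1.0, iters=100):
    """Dense flow (u, v) for two float32 grayscale frames of equal size."""
    Ix = cv2.Sobel(I1, cv2.CV_32F, 1, 0, ksize=3)
    Iy = cv2.Sobel(I1, cv2.CV_32F, 0, 1, ksize=3)
    It = I2 - I1
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    # Neighbor averaging: the Laplacian is approximated as (average over neighbors - value)
    k = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]], np.float32) / 12.0
    for _ in range(iters):
        u_avg = cv2.filter2D(u, -1, k)
        v_avg = cv2.filter2D(v, -1, k)
        common = (Ix * u_avg + Iy * v_avg + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_avg - Ix * common
        v = v_avg - Iy * common
    return u, v
```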

    In this algorithm it is also proposed to use multi-scaling, and it is recommended to scale the images not by a factor of 2 but by a factor of 0.65.

    This algorithm was implemented in the first versions of OpenCV, but later abandoned.

    Farneback proposed to approximate the change in intensity in a neighborhood using a quadratic form: $I(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} + \mathbf{b}^T \mathbf{x} + c$ with a symmetric matrix $A$ (in fact, considering the Taylor expansion up to the first term, we previously took the linear approximation $I = \mathbf{b}^T \mathbf{x} + c$; now we have simply decided to improve the accuracy of the approximation). If the image has shifted by $\mathbf{d}$ within this neighborhood, then $I_2(\mathbf{x}) = I_1(\mathbf{x} - \mathbf{d})$; we substitute this into the quadratic expansion and open the brackets, obtaining

    $$I_2(\mathbf{x}) = \mathbf{x}^T A_1 \mathbf{x} + (\mathbf{b}_1 - 2 A_1 \mathbf{d})^T \mathbf{x} + \bigl(\mathbf{d}^T A_1 \mathbf{d} - \mathbf{b}_1^T \mathbf{d} + c_1\bigr).$$

    Now we can estimate the values of $A$, $\mathbf{b}$, $c$ in both pictures; comparing the coefficients, $A_2 = A_1$, $\mathbf{b}_2 = \mathbf{b}_1 - 2 A_1 \mathbf{d}$, $c_2 = \mathbf{d}^T A_1 \mathbf{d} - \mathbf{b}_1^T \mathbf{d} + c_1$, we see that this system is overdetermined with respect to $\mathbf{d}$ (the first equation is especially troubling), and in general $\mathbf{d}$ can be obtained from the second equation alone: $\mathbf{d} = -\tfrac{1}{2} A_1^{-1}(\mathbf{b}_2 - \mathbf{b}_1)$. We have to resort to the following approximation: $A = \tfrac{1}{2}(A_1 + A_2)$. Let us also denote for simplicity $\Delta\mathbf{b} = -\tfrac{1}{2}(\mathbf{b}_2 - \mathbf{b}_1)$. Then we get simply $A\,\mathbf{d} = \Delta\mathbf{b}$.

    To compensate for noise in the calculation, we again turn to the assumption that, in the neighborhood of the point under study, all points have more or less the same shift. Therefore, we again integrate the error over a window with Gaussian weight coefficients $w$ and find the vector $\mathbf{d}$ minimizing this total error. Then we get the optimal value $\mathbf{d} = \bigl(\sum w\,A^T A\bigr)^{-1} \sum w\,A^T \Delta\mathbf{b}$ and the corresponding minimum error $e = \sum w\,\Delta\mathbf{b}^T \Delta\mathbf{b} - \mathbf{d}^T \sum w\,A^T \Delta\mathbf{b}$. That is, we need to compute $A^T A$ and $A^T \Delta\mathbf{b}$ for each point, average them over the window, invert the matrix and get the result. Accordingly, these products can be computed once for the entire picture and the pre-calculated values reused for different points; this is exactly the case where it makes sense to calculate a dense flow.

    As usual, this algorithm has a number of modifications and improvements, primarily allowing the use of known a priori information - a given initial flow approximation - and, again, multi-scaling.

    At the heart of the SimpleFlow method lies the following idea: if we still cannot determine a shift larger than the size of the window over which we computed the derivatives, then why bother with derivatives at all? Let's just find the most similar point in the window! And to resolve ambiguities and compensate for noise, we take into account that the flow is continuous and that, in the vicinity of a given point, all points have almost the same shift. The problem of the window size is again solved by multi-scaling.

    More strictly, the algorithm sounds like this: for every point of the window an "energy" function is defined that is responsible (with an inverse logarithmic dependence) for the probability of the starting point passing into this point: $e(dx, dy) = \bigl(I_1(x, y) - I_2(x + dx, y + dy)\bigr)^2$. Then the convolution of this energy with a Gaussian window is taken, and the values $(dx, dy)$ minimizing this function are found. To obtain sub-pixel accuracy, a small neighborhood of the found optimal point $(dx, dy)$ is considered and the peak of the energy function is sought in it as the peak of a paraboloid. And, as mentioned above, this procedure is performed on the pyramid of scaled images. The authors also propose clever methods for speeding up the calculations in their algorithm, but those who are interested can figure that out for themselves. What matters for us is that, thanks to this, the algorithm is (theoretically) fast enough while having good accuracy, and it does not suffer from the problem of the previous ones, where the larger the shift, the worse it is detected.

    And what if we take something other than intensity?

    It was said above that the correspondence between points can be determined by different quantities, so why do we consider only intensity? Because any other quantity can be reduced to it: we simply filter the images with the appropriate filter and feed the filtered images to the input of the algorithms described above. Accordingly, if you want to use optical flow, first think about which characteristic of the image will be the most stable in your conditions, and perform the appropriate filtering so that the input of the algorithm is not the intensity but that characteristic.
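    A hedged sketch of this idea: compute a filtered characteristic (here the gradient magnitude via Sobel, my own choice) and feed it into a dense-flow call instead of the raw intensity. The variable names `prev_gray`/`next_gray` and the Farneback parameter values are assumptions, the latter being the commonly cited defaults rather than anything prescribed by this article.

```python
import cv2
import numpy as np

def grad_magnitude(gray):
    """Gradient magnitude rescaled to 8-bit, since calcOpticalFlowFarneback expects 8-bit input."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = cv2.magnitude(gx, gy)
    return cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# Compute the flow on the filtered images rather than on raw intensity
flow = cv2.calcOpticalFlowFarneback(grad_magnitude(prev_gray), grad_magnitude(next_gray),
                                    None, 0.5, 3, 15, 3, 5, 1.2, 0)
```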

    Practice

    Let's try out the algorithms that OpenCV offers us in practice.

    Here one could conduct many different studies of each algorithm, varying the parameters and changing the input sequences - with different shifts, rotations, projective transformations, segments, different noise, etc. All this would take a lot of time and exceed the scope of this article, so here I propose to confine ourselves to the simple case of a parallel shift of the image by a fixed distance with small added noise. This will allow us to understand in general terms how to run the algorithms and which of them is cooler.

    The syntax of the procedures is described in detail on the manual page; here I will give a condensed translation with my own comments.

    The classic Lucas-Kanade is implemented with a pyramid in the calcOpticalFlowPyrLK procedure. The algorithm calculates the sparse flow, that is, for a given set of points on the first image it estimates their positions on the second. The input parameters are quite obvious: two input images, the input and output sets of points; status is an output vector indicating whether the corresponding point was successfully found; err is the output vector of estimated errors of the corresponding points; winSize is the size of the window over which the Gaussian averaging is performed, I took 21x21 and it worked well; maxLevel is the number of layers in the pyramid minus one, i.e. the index of the last layer, I took 5; criteria is the exit condition for the iterative shift-estimation process (the error is minimized iteratively), I left this parameter at the default; flags are additional flags, for example you can use an initial flow approximation or choose the error estimation method; minEigThreshold is the threshold value of the gradient below which the matrix is considered degenerate, I left it at the default. Starting with OpenCV 2.4.1, it is possible to use a precomputed pyramid of scaled images when calculating the flow.
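    A sketch of a sparse-flow call with roughly the parameters described above; picking the points with goodFeaturesToTrack, the variable names and the termination criteria values are my own assumptions:

```python
import cv2
import numpy as np

# Points to track on the first frame (an arbitrary choice of detector)
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400, qualityLevel=0.01, minDistance=8)

next_pts, status, err = cv2.calcOpticalFlowPyrLK(
    prev_gray, next_gray, prev_pts, None,
    winSize=(21, 21), maxLevel=5,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

shifts = (next_pts - prev_pts)[status.ravel() == 1]   # keep only successfully tracked points
```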

    The result: both small and large shifts are detected successfully and stably, the method is resistant to fairly strong noise, and the running time is about 10 ms for 400 points with a 5-layer pyramid (on a Core i7 950).

    By the way, this algorithm is also implemented on GPU (CUDA), in both dense and sparse versions.

    The Farneback flow is implemented by the calcOpticalFlowFarneback procedure; the dense flow is calculated, that is, the shift of every point. Parameters: the input images; the output flow in the format of a two-channel matrix of floats; pyr_scale determines the ratio of scales between the layers of the pyramid; levels is the number of levels in the pyramid; winsize is the size of the window over which the averaging is performed; iterations is the number of iterations at each level; poly_n is the size of the polynomial neighborhood from which the values of A and b are estimated; poly_sigma is the sigma of the Gaussian blur used when smoothing the derivatives; the recommended parameter values are given in the manual; flags are additional flags, for example you can use an initial flow approximation or average over the window in a different way.
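    A dense-flow call using commonly cited default parameter values (the exact numbers are my assumption, not something prescribed by this article), with `prev_gray`/`next_gray` assumed to be 8-bit grayscale frames:

```python
import cv2

# flow[y, x] = (dx, dy) for every pixel
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5,   # pyr_scale
                                    3,     # levels
                                    15,    # winsize
                                    3,     # iterations
                                    5,     # poly_n
                                    1.2,   # poly_sigma
                                    0)     # flags
```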

    This algorithm is much less stable (according to my observations), misses more easily on fairly uniform images (apparently the problem is the lack of filtering of bad points), and does not detect large shifts well. For me it ran in 600 ms on a 512x512 image.

    The SimpleFlow flow is implemented by the calcOpticalFlowSF procedure (again, the dense flow is calculated); it has many mysterious parameters without default values, and in general the information on its page is, at the moment, very concise. Let's try to figure it out. The first 3 parameters are the two input images and the output two-channel flow; layers is the number of layers in the pyramid, that is, how many times we scale the original image; averaging_block_size is the size of the window in which the pixel energy function is calculated; max_flow is the maximum shift that we want to be able to determine at each step, which is in fact determined by the size of the window (although it is not entirely clear why it is an int). You can stop there, or you can set a few more parameters, the meaning of some of which eludes me.

    The site offers an example of its use, in which it is launched with the following parameters: calcOpticalFlowSF(frame1, frame2, flow, 3, 2, 4, 4.1, 25.5, 18, 55.0, 25.5, 0.35, 18, 55.0, 25.5, 10);

    For me this algorithm works much slower than the others, about 9-12 seconds per 512x512 image. The result seems more plausible than Farneback's; at least the shift is better determined on uniform images, and it copes much better with large shifts.

    Conclusions

    If you want to use optical flow somewhere, first consider whether you need it at all: you can often get by with simpler methods. Undertaking to implement the flow yourself is worth it only after thinking it over a few times: each algorithm has many tricks, subtleties and optimizations, and whatever you write will most likely work worse than what is already in OpenCV (provided, of course, that it is there). Moreover, the OpenCV implementations use logical and hardware optimizations such as SSE instructions, multithreading, and the ability to compute with CUDA or OpenCL. If you just need to calculate the shift of a certain set of points (i.e., a sparse flow), then you can safely use the calcOpticalFlowPyrLK function: it works well, reliably and fast enough. For calculating the dense flow it is good to use the calcOpticalFlowSF function, but it is very slow. If performance is critical, use calcOpticalFlowFarneback, but first make sure that the results of its work suit you.



    The brightness patterns in the image move along with the objects being observed. The optical flow is the apparent motion of the brightness pattern.

    Fig. 12.2. The optical flow does not always coincide with the motion field. a - a smooth sphere rotates under constant illumination: the image does not change, although the motion field is nonzero; b - a stationary sphere is illuminated by a moving source: the distribution of image brightness changes, although the motion field is zero.

    Consider a perfectly homogeneous sphere rotating in front of the lens of the visual system (Fig. 12.2, a). Since the surface is curved, spatial variations in brightness will be observed in the image of the sphere. However, this brightness pattern does not move with the surface and the image does not change with time. In this case, the optical flow is zero everywhere, although the field of motion is nonzero. Now consider a fixed sphere illuminated by a moving light source (Fig. 12.2, b). The brightness in the image changes with the movement of the source. In this case, the optical flow is obviously non-zero, although the field of motion is zero everywhere. Specular reflections and shadows provide other examples where the optical flow does not match the field of motion.

    Only the optical flow is available to us, and we will proceed from the assumption that in the usual case the optical flow does not differ too much from the field of motion. This allows us to estimate relative motion based on how the image changes over time.

    What do we mean by the apparent motion of the brightness pattern? Consider a point P in the image that has brightness E at time t (Fig. 12.3). Which point P' of the image will correspond to it at time t + δt; in other words, how does the brightness pattern move during this time interval? Usually in the neighborhood of the point P there are many points with the same brightness E. If the brightness changes continuously in the part of the image of interest to us, then the point P lies on a line of constant brightness C. At time t + δt it will correspond to a line of constant brightness C' with the same value E. However, what is the correspondence between the points of the lines C and C'? This question is not easy to answer, since the shape of these lines generally changes with the motion.

    Thus, we note that the optical flow on a changing image is ambiguously determined by local information. Let's explain this with another example. Consider a region of uniform brightness in an image that does not change with time. The "most believable" in this case would be zero optical flow. In fact, inside a homogeneous spot, we can assign any speed we want to the points. Although, apparently, we would prefer the simplest explanation of the observed changes in the image (in this case, the absence of such changes).

    Let $E(x, y, t)$ be the brightness at the point $(x, y)$ of the image at time $t$. Then, if $u(x, y)$ and $v(x, y)$ are the $x$- and $y$-components of the optical flow vector at this point, we can expect that the brightness will be the same at time $t + \delta t$ at the point $(x + u\,\delta t,\; y + v\,\delta t)$; in other words, for a small time interval $\delta t$,
    $$E(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) = E(x, y, t).$$
    This single restriction is not enough to uniquely determine the two unknowns $u$ and $v$. It is also clear that we would like to take advantage of the fact that the field of motion is continuous almost everywhere.

    If the brightness changes smoothly with $x$, $y$ and $t$, then we can expand the left side of the equation in a Taylor series, from which we get
    $$E(x, y, t) + \delta x\,\frac{\partial E}{\partial x} + \delta y\,\frac{\partial E}{\partial y} + \delta t\,\frac{\partial E}{\partial t} + e = E(x, y, t),$$
    where $e$ contains terms of higher order of smallness in $\delta x$, $\delta y$ and $\delta t$.

    Fig. 12.3. The apparent motion of the brightness pattern.

    It is not easy to decide which point P' on the brightness level line C' of the second image corresponds to a given point P on the brightness level line C of the first image.

    Fig. 12.4. Local information about the brightness gradient and the rate of its change in time imposes only one constraint on the components of the optical flow vector. The flow velocity must lie on a straight line perpendicular to the direction of the brightness gradient. We can determine only the component directed along the brightness gradient; nothing is known about the component of the flow velocity in the perpendicular direction.

    We cancel $E(x, y, t)$, divide by $\delta t$ and pass to the limit as $\delta t \to 0$. Then we get
    $$\frac{\partial E}{\partial x}\frac{dx}{dt} + \frac{\partial E}{\partial y}\frac{dy}{dt} + \frac{\partial E}{\partial t} = 0,$$
    that is, $E_x u + E_y v + E_t = 0$.