Processing Trajectory Data for Sequence Generation
Before going into the details of modelling such tasks, we should spend some time on the datasets we will use and the preprocessing routines we will apply.
Datasets Used
In this post we consider four datasets: ETH University and ETH Hotel [5], and Zara1 and Zara2 [7]. Photos of the first two can be seen below.
Data Processing
In these examples we are interested in the exact pixel location of each individual pedestrian (agent) in every frame we consider. All four datasets give us annotated positions but differ slightly in representation. Thus, as a first step we ensure they are aligned.
Such slight misalignments are common when datasets have been built by different groups and projects. All four datasets have been recorded at 25 Hz and consist of 3000 frames on average. The ETH datasets comprise 750 agents each, while Zara has two scenes with 786 agents each. All videos include people walking on their own, as well as pedestrians moving in groups in a nonlinear manner.
However, some of the videos annotate trajectories in mm in a world reference frame, while others record them in pixel coordinates with (0,0) at the centre of each video frame. We will process all of them so that every position is expressed in pixel coordinates with (0,0) placed in the bottom left corner. We will further normalise the positions to lie between 0 and 1, so that the size of the image or of the walkway does not bias our solution.
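To make the origin shift and the normalisation concrete, here is a minimal sketch. The function names, the toy frame size of 720x576 and the assumption that the y axis already points upwards are ours for illustration and are not taken from the datasets.

```python
import numpy as np

def centre_to_bottom_left(xy, width, height):
    """Shift pixel coordinates from a centre origin to a bottom-left origin.

    Assumes (not stated in the post) that the y axis already points upwards.
    """
    xy = np.asarray(xy, dtype=float)
    return xy + np.array([width / 2.0, height / 2.0])

def min_max_normalise(xy):
    """Scale each coordinate axis to the range [0, 1]."""
    xy = np.asarray(xy, dtype=float)
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero on a constant axis
    return (xy - lo) / span

# Toy positions in a hypothetical 720x576 frame with (0, 0) at the centre
centred = np.array([[-360.0, -288.0], [0.0, 0.0], [360.0, 288.0]])
shifted = centre_to_bottom_left(centred, width=720, height=576)
normalised = min_max_normalise(shifted)
```

If a dataset stores y pointing downwards instead, the y coordinate would need to be flipped before the shift.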
Processing examples
To keep this post simple, we will walk through the transformation of only one of the four datasets. The aim is to clarify how such processing is achieved. The remaining datasets are processed in a similar way, and the code can be found in the GitHub repository of this post. This part, however, is not necessary to understand the details of the actual model.
ETH Hotel stores positions in a world reference frame, and we want to convert them to the local pixel reference frame. The dataset provides the homography required for this.
Those interested in the mathematics behind this conversion can read more about it in Taku Komura’s lecture slides. We further normalise the data using the minimum and maximum recorded values, which results in the following method.
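As a rough sketch of how a homography converts ground-plane positions to pixels: each point is lifted to homogeneous coordinates, multiplied by a 3x3 matrix, and divided by the resulting scale. The function name and the toy matrix below are made up for illustration. We also assume here that the supplied matrix maps pixels to world coordinates, so its inverse is applied; if your matrix maps the other way, use it directly.

```python
import numpy as np

def world_to_pixel(world_xy, H_pixel_to_world):
    """Map (N, 2) world-plane positions to pixel positions with a 3x3 homography."""
    world_xy = np.asarray(world_xy, dtype=float)
    H_inv = np.linalg.inv(H_pixel_to_world)  # assumption: H maps pixel -> world
    homogeneous = np.hstack([world_xy, np.ones((len(world_xy), 1))])  # (N, 3)
    projected = homogeneous @ H_inv.T
    return projected[:, :2] / projected[:, 2:3]  # divide by the homogeneous scale

# Toy homography (made up for illustration): 0.02 world units per pixel plus an offset
H = np.array([[0.02, 0.0, -3.0],
              [0.0, 0.02, -2.0],
              [0.0, 0.0, 1.0]])
pixels = world_to_pixel([[0.0, 0.0], [1.0, 2.0]], H)
```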
Lines 8 to 20 find the minimum and maximum values of the agents' x and y positions. The “obsmat.txt” file contains the annotated data: the frame number, the pedestrian id, the positions along the x, y and z axes in the world frame, and the three associated velocities. More information can be found in the README.txt file within the dataset directory. In this post we only use the frame, the pedestrian’s id and their x and y positions.
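A minimal sketch of this loading step, separate from the listing discussed above, might look as follows. The function name load_annotations and the sample rows are hypothetical, and the column indices for x and y are an assumption that should be checked against README.txt.

```python
import io
import numpy as np

# Three made-up rows in an 8-column layout (frame, id, three positions, three velocities)
sample = io.StringIO(
    "780 1 8.46 0 3.59 1.67 0 -0.27\n"
    "786 1 9.57 0 3.79 1.54 0 -0.26\n"
    "786 2 4.10 0 5.20 -1.20 0 0.10\n"
)

def load_annotations(source, x_col=2, y_col=4):
    """Return (frame, ped_id, x, y) rows plus per-axis minima and maxima.

    x_col and y_col are assumptions; confirm them against README.txt.
    """
    raw = np.loadtxt(source, ndmin=2)
    data = raw[:, [0, 1, x_col, y_col]]  # keep only the columns we need
    xy_min = data[:, 2:].min(axis=0)
    xy_max = data[:, 2:].max(axis=0)
    return data, xy_min, xy_max

data, xy_min, xy_max = load_annotations(sample)
```

In practice `source` would be the path to “obsmat.txt” rather than an in-memory string.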
Line 29 calls parse_annotations(), which parses the collected data and converts it to the pixel reference frame.
Lines 34 to 41 handle the ordering of pedestrians across frames by pedestrian id.
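One way to picture the ordering of pedestrians across frames is to group the annotation rows by pedestrian id and sort each group by frame. The helper below is a hypothetical illustration, not the code from the repository.

```python
from collections import defaultdict

def group_by_pedestrian(rows):
    """Collect (frame, ped_id, x, y) rows into per-pedestrian tracks sorted by frame."""
    tracks = defaultdict(list)
    for frame, ped_id, x, y in rows:
        tracks[int(ped_id)].append((int(frame), float(x), float(y)))
    # Sort pedestrians by id and each pedestrian's points by frame number
    return {ped_id: sorted(points) for ped_id, points in sorted(tracks.items())}

rows = [(786, 2, 0.4, 0.5), (786, 1, 0.9, 0.35), (780, 1, 0.8, 0.3)]
tracks = group_by_pedestrian(rows)
```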
We can combine the preprocessing of all datasets in a single function. Ideally, we save the preprocessed data once and load it directly each time we need it. Once this is done, we can call a function that loads the preprocessed data in the format we need. It is useful to split the trajectories into sequences of a length chosen in advance. We can then easily compute the number of batches we will get for a given batch size.
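The splitting and batch counting can be sketched as below. Both helper names are ours, and we assume non-overlapping sequences with any leftover frames dropped, which may differ from the repository's choice.

```python
def split_into_sequences(frames, seq_length):
    """Cut a list of frames into non-overlapping chunks of exactly seq_length."""
    last_start = len(frames) - seq_length
    return [frames[i:i + seq_length] for i in range(0, last_start + 1, seq_length)]

def count_batches(frames_per_dataset, seq_length, batch_size):
    """Number of full batches when each dataset is cut into seq_length-frame chunks."""
    num_sequences = sum(n // seq_length for n in frames_per_dataset)
    return num_sequences // batch_size

chunks = split_into_sequences(list(range(7)), seq_length=3)
# Made-up frame counts for four datasets
num_batches = count_batches([3000, 2900, 3100, 2950], seq_length=20, batch_size=16)
```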
Batching
After obtaining and pre-processing all the information, we need to implement a routine for sampling random batches that ensures every sample consists of an unbroken sequence. One thing to keep in mind is that some of the sampled trajectories might be shorter than the given sequence length and others might be longer. In the former case we want to skip such trajectories, while in the latter we want to split them into multiple samples. We do this by defining a function called “next_batch()” that takes as input the data, the required batch size, a frame pointer that indicates the current starting point, the desired sequence length, the maximum number of pedestrians in a sequence across all datasets, the current dataset pointer, and a flag indicating whether we are sampling at inference or training time.
Now that the data is processed and we have an associated batching function, we can focus on building the actual model.
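A simplified sketch of such a batching routine is given below. The data layout (a list of datasets, each a list of per-frame arrays holding ped_id, x, y), the zero padding and the random stride during training are all our assumptions; the version in the repository may differ.

```python
import random
import numpy as np

def next_batch(data, batch_size, frame_pointer, seq_length,
               max_num_peds, dataset_pointer, infer=False):
    """Return a batch of unbroken sequences plus the updated pointers.

    data: list of datasets; each dataset is a list of per-frame arrays of
    shape (num_peds_in_frame, 3) holding (ped_id, x, y). Assumed layout.
    """
    if all(len(frames) < seq_length for frames in data):
        raise ValueError("no dataset is long enough for one full sequence")
    batch = []
    while len(batch) < batch_size:
        frames = data[dataset_pointer]
        if frame_pointer + seq_length <= len(frames):
            # Zero-pad every frame to a fixed number of pedestrian slots
            seq = np.zeros((seq_length, max_num_peds, 3))
            window = frames[frame_pointer:frame_pointer + seq_length]
            for t, frame in enumerate(window):
                seq[t, :len(frame)] = frame
            batch.append(seq)
            # Random stride while training, fixed stride at inference time
            frame_pointer += seq_length if infer else random.randint(1, seq_length)
        else:
            # Too few frames left for a full sequence: move to the next dataset
            dataset_pointer = (dataset_pointer + 1) % len(data)
            frame_pointer = 0
    return np.stack(batch), frame_pointer, dataset_pointer

# Toy data: one dataset with 5 frames and one with 2, a single pedestrian in each
toy = [[np.array([[1.0, 0.1, 0.2]])] * 5, [np.array([[2.0, 0.5, 0.5]])] * 2]
batch, frame_pointer, dataset_pointer = next_batch(
    toy, 2, 0, 2, max_num_peds=3, dataset_pointer=0, infer=True)
```

The returned pointers are meant to be passed back in on the next call, so that sampling continues where the previous batch stopped.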