Processing Trajectory Data for Sequence Generation

8 minute read

Published:

Before getting into the details of modelling such tasks, we should spend some time on the datasets we will use as well as the preprocessing routines they require.

Datasets Used

In this post we consider four different datasets, namely ETH University and ETH Hotel [5], and Zara1 and Zara2 [7]. Photos of the first two can be seen below.

Data Processing

In these examples we are interested in the exact pixel location of each individual pedestrian (agent) at each of the frames we consider. All four datasets come with annotated positions but differ slightly in how those positions are represented. Thus, as a first step we ensure they are aligned.

Such slight misalignments are common when datasets have been built by different groups and projects. All four datasets have been recorded at 25 Hz and consist on average of 3000 frames. The ETH datasets contain 750 agents each, while Zara has two scenes, each with 786 agents. All videos include people walking on their own, as well as pedestrians moving in groups in a nonlinear manner. However, some of the videos annotate trajectories in millimetres in a world reference frame, while others record them in pixel coordinates with (0, 0) placed at the centre of each video frame. We will process all of them so that every position is expressed in pixel coordinates with (0, 0) placed in the bottom left corner. We will further normalise the positions between 0 and 1 so that the size of the image or of the walkway considered does not bias our solution.
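For the pixel-based scenes, the origin shift can be done with a small helper before the min-max normalisation. The sketch below uses made-up frame dimensions and assumes the usual image convention of the y axis growing downwards; the function name and the example values are ours, purely for illustration.

import numpy as np

def centre_to_bottom_left(points, frame_width, frame_height):
    # points: (N, 2) array of (x, y) pixel positions with (0, 0) at the frame centre.
    shifted = np.empty_like(points, dtype=float)
    shifted[:, 0] = points[:, 0] + frame_width / 2.0   # move the origin to the left edge
    shifted[:, 1] = frame_height / 2.0 - points[:, 1]  # flip y so (0, 0) sits at the bottom left
    return shifted

# Made-up positions from a hypothetical 720x576 frame, scaled into [0, 1] afterwards.
pts = centre_to_bottom_left(np.array([[-120.0, 45.0], [300.0, -200.0], [10.0, 10.0]]), 720, 576)
pts = (pts - pts.min(axis=0)) / (pts.max(axis=0) - pts.min(axis=0))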

Processing examples

To keep this post short we will walk through the transformation of only one of the four datasets. The aim is to clarify how such processing is achieved; the processing of the remaining datasets is similar and can be found in the GitHub repository accompanying this post. This part, however, is not necessary for understanding the details of the actual model.

ETH Hotel stores positions in a world reference frame, and we are interested in converting these to the local, pixel reference frame. To do this, we are given the required homography matrix.

import numpy as np

def world_to_image_frame(loc, Hinv):
    """
    Given H^-1 and a homogeneous (x, y, 1) position in world coordinates,
    returns the corresponding (u, v, 1) position in image frame coordinates.
    """
    loc = np.dot(Hinv, loc)  # project into the camera/image frame
    return loc / loc[2]      # divide by the homogeneous coordinate to get pixels
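As a quick check, the function can be called directly with the homography shipped with the dataset; the world-frame point below is made up purely for illustration.

H = np.loadtxt("./data/ewap_dataset/seq_hotel/H.txt")
Hinv = np.linalg.inv(H)
world_point = np.array([2.31, 4.87, 1.0])  # homogeneous (x, y, 1) world position, made-up values
print(world_to_image_frame(world_point, Hinv))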

Those interested in the mathematics behind this conversion can read more about it in Taku Komura’s lecture slides. Further, we normalise the data using the minimum and maximum recorded values, which results in the following method.

import os

def mil_to_pixels(directory="./data/ewap_dataset/seq_hotel"):
    '''
    Preprocess the frames from the dataset found in `directory`.
    Convert positions from millimetres in the world frame to pixel locations and
    return, for every agent, the actually used frames (as some are skipped)
    together with the normalised positions, the ids of all pedestrians present
    at each of those frames, and the sufficient statistics used for normalisation.
    '''
    def collect_stats(agents):
        # Find the minimum and maximum x and y positions over all agents.
        x_pos = []
        y_pos = []
        for agent_id in range(1, len(agents)):  # pedestrian ids start at 1
            traj = agents[agent_id]
            for step in traj:
                x_pos.append(step[1])
                y_pos.append(step[2])
        x_pos = np.asarray(x_pos)
        y_pos = np.asarray(y_pos)
        return [[np.min(x_pos), np.max(x_pos)], [np.min(y_pos), np.max(y_pos)]]

    Hfile = os.path.join(directory, "H.txt")
    obsfile = os.path.join(directory, "obsmat.txt")
    # Parse homography matrix.
    H = np.loadtxt(Hfile)
    Hinv = np.linalg.inv(H)
    # Parse pedestrian annotations.
    frames, pedsInFrame, agents = parse_annotations(Hinv, obsfile)
    # Collect the minimum and maximum recorded x and y positions.
    statistics = collect_stats(agents)
    # Collect the frame, normalised x and normalised y of each agent's positions.
    norm_agents = []
    for agent in agents:
        norm_traj = []
        for step in agent:
            _x = (step[1] - statistics[0][0]) / (statistics[0][1] - statistics[0][0])
            _y = (step[2] - statistics[1][0]) / (statistics[1][1] - statistics[1][0])
            norm_traj.append([int(frames[int(step[0])]), _x, _y])

        norm_agents.append(np.array(norm_traj))

    # dtype=object because trajectories have different lengths (ragged array).
    return np.array(norm_agents, dtype=object), statistics, pedsInFrame

The nested collect_stats() helper finds the minimum and maximum values for the x and y positions across all agents. The obsmat.txt file contains the annotated data and is comprised of the frame number, the pedestrian id, the pedestrian’s x, y and z positions in the world frame, as well as the three associated velocities. More information can be found in the README.txt file within the dataset directory. In this post we are only interested in the frame, the pedestrian’s id and their x and y positions.

The final loop walks through each agent’s trajectory, normalises the x and y positions using the collected statistics and stores them together with the corresponding frame number, so that the output remains ordered by pedestrian id.

The call to parse_annotations() parses the collected annotations and converts the positions to the image reference frame.

def parse_annotations(Hinv, obsmat_txt):
    '''
    Parse the annotation file and convert the positions to image frame coordinates.
    '''
    mat = np.loadtxt(obsmat_txt)
    num_peds = int(np.max(mat[:, 1])) + 1
    peds = [np.array([]).reshape(0, 4) for _ in range(num_peds)]  # maps ped ID -> (t, u, v, 1) path

    num_unique_frames = np.unique(mat[:, 0]).size
    recorded_frames = [-1] * num_unique_frames               # maps timestep -> (first) frame
    peds_in_frame = [[] for _ in range(num_unique_frames)]   # maps timestep -> ped IDs

    frame = -1  # ensures the first row always starts a new timestep
    time = -1
    for row in mat:
        if row[0] != frame:
            frame = int(row[0])
            time += 1
            recorded_frames[time] = frame

        ped = int(row[1])

        peds_in_frame[time].append(ped)
        loc = np.array([row[2], row[4], 1])    # homogeneous (x, y, 1) world position
        loc = world_to_image_frame(loc, Hinv)  # convert to pixel coordinates
        loc = [time, loc[0], loc[1], loc[2]]
        peds[ped] = np.vstack((peds[ped], loc))

    return recorded_frames, peds_in_frame, peds

We can combine the preprocessing of all datasets in a single function. Ideally, we will save the preprocessed data once and load it directly each time we need it. Once this is done, we can call a function that loads the preprocessed data in the format we need. It will be useful to split the trajectories into sequences of a length chosen in advance; if we also specify a batch size, we can then easily compute the number of batches we will get.

BATCH_SIZE = 50
SEQUENCE_LENGTH = 8
agents_data, dicto, dataset_indices = \
  data_tools.preprocess(training_directories)

loaded_data, num_batches = \
  data_tools.load_preprocessed(agents_data, BATCH_SIZE, SEQUENCE_LENGTH)
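The actual data_tools.load_preprocessed() lives in the repository. A minimal sketch of the idea, assuming we simply keep the trajectories that are long enough and count how many full sequences they contain, could look like this (the function name and details below are ours, not the repository implementation).

def load_preprocessed_sketch(agents_data, batch_size, sequence_length):
    # Keep trajectories long enough for a full source/target pair and
    # derive the number of batches from the total sequence count.
    loaded_data = [traj for traj in agents_data if len(traj) > sequence_length]
    total_sequences = sum(len(traj) // (sequence_length + 1) for traj in loaded_data)
    num_batches = total_sequences // batch_size
    return loaded_data, num_batches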

Batching

After obtaining and pre-processing all the information, we need to implement a routine for sampling random batches while ensuring the samples are comprised of unbroken sequences. One thing to keep in mind is that some trajectories might be shorter than the chosen sequence length and others might be longer. In the former case we would like to skip such trajectories, while in the latter we want to draw multiple samples from the same trajectory. We will do this by defining a function called “next_batch()” that takes as input the preprocessed data, a pointer that indicates the currently considered trajectory, the required batch size, the desired sequence length, and whether we are sampling at inference or training time.

import random

def next_batch(_data, pointer, batch_size, sequence_length, infer=False):
    '''
    Sample the next batch of source and target sequences.
    '''
    # List of source and target data for the current batch
    x_batch = []
    y_batch = []
    # For each sequence in the batch
    for i in range(batch_size):
        # Skip trajectories that are too short to provide a full source/target pair
        while _data[pointer].shape[0] <= sequence_length:
            pointer = tick_batch_pointer(pointer, len(_data))
        # Extract the trajectory of the pedestrian pointed to by pointer
        traj = _data[pointer]
        # Number of full sequences contained in this trajectory
        n_batch = int(traj.shape[0] / (sequence_length + 1))
        # Randomly sample the index from which the trajectory is to be considered
        if not infer:
            idx = random.randint(0, traj.shape[0] - sequence_length - 1)
        else:
            idx = 0

        # Append sequence_length steps starting at idx into the source data and
        # the same window shifted by one step into the target data
        x_batch.append(np.copy(traj[idx:idx + sequence_length, :]))
        y_batch.append(np.copy(traj[idx + 1:idx + sequence_length + 1, :]))

        # Advance the pointer with probability 1/n_batch so that longer
        # trajectories are sampled from more often
        if random.random() < (1.0 / float(n_batch)):
            pointer = tick_batch_pointer(pointer, len(_data))

    return x_batch, y_batch, pointer
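The tick_batch_pointer() helper is not shown above; a minimal version, sufficient for the sampling logic here, would simply advance the pointer and wrap around once it reaches the end of the data.

def tick_batch_pointer(pointer, data_size):
    # Move to the next trajectory, wrapping around at the end of the dataset.
    pointer += 1
    if pointer >= data_size:
        pointer = 0
    return pointer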

Now that the required data is processed and we have an associated batching function, we can focus on building the actual model.