Response to reviewers for paper AURO-D-13-00144R1

We again thank the reviewers for their feedback, and we will address their remaining points in this document.

Reviewer #1: Revised paper is acceptable as is.

Thank you. We are glad we were able to adequately address R1’s concerns.

Reviewer #2: Most of the points I requested in the review are corrected appropriately and I think this paper is ready for the publication. If possible, please add the explanation about how they determine threshold values for detecting the disassociation.

Unfortunately we do not have a formal method for determining the threshold values, and they are manually adjusted for best performance with a given sensor network, environment, and configuration of robots. We have added a new Section 4.4.3 discussing some considerations affecting parameter tuning.

Reviewer #4: This paper proposed the method for localizing a robot. The method uses infrastructure of tracking which is fixed and embedded in the environment. Since the infrastructure's configuration is completely known and fixed, it is possible to localize the robot by estimating which is the trajectory of robot.

If my understanding is correct, the Hungarian method requires that the number of data from odometry is the same as the number of the data from tracking system. How do you adjust the number?

We use an implementation of the Hungarian algorithm that can handle unassigned pairs in m-by-n matrices. Specifically, we use Kevin L. Stern's implementation, although other techniques for solving the association problem with non-square matrices exist.

We have added an explanation of this in sec. 4.4.2.
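
As an illustration of this point (not the implementation used in the paper), the sketch below uses SciPy's linear_sum_assignment, which also accepts rectangular cost matrices; the cost values and the association threshold are invented for the example.

    # Illustrative sketch only: the paper uses Kevin L. Stern's implementation,
    # but SciPy's linear_sum_assignment also accepts rectangular (m x n) cost matrices.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Rows: robots (odometry trajectories); columns: tracked entities.
    # Entries: trajectory-shape error between robot i and entity j (values invented).
    cost = np.array([
        [0.12, 2.30, 1.90, 0.80],
        [1.70, 0.09, 2.10, 1.40],
    ])

    rows, cols = linear_sum_assignment(cost)   # assigns min(m, n) pairs
    MAX_ERROR = 0.5                            # hypothetical association threshold

    for i, j in zip(rows, cols):
        if cost[i, j] <= MAX_ERROR:
            print(f"robot {i} -> tracked entity {j} (error {cost[i, j]:.2f})")
        else:
            print(f"robot {i} left unassociated")
    # Tracked entities not appearing in cols (here two of the four) remain unassigned.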

It is better to clarify the assumption to apply the method in Section 4.5.2. I wonder if it is impossible to apply it to the omni-directional mobile robot.

The technique works with omnidirectional robots as well. The main difference is that the robot's odometry cannot be used directly; instead, the robot's onboard estimate of its motion should be reported to the server. Also, because the heading and direction of motion of an omnidirectional robot are not aligned, the problem should be reframed: the heading correction in Section 4.5.2 corrects the angular offset between the robot's intrinsic coordinate system and the world coordinate system, rather than correcting the robot's heading.

In fact, we have implemented the system for use with an omnidirectional robot, using the robot's onboard motion estimation derived from its motor commands instead of reported odometry, since the lack of encoders on free wheels makes obtaining reliable odometry for such a robot difficult. However, we do not have any empirical evaluation of its accuracy to present; nearly all of our robots are differential-drive, so we used those robots for the evaluation experiments.

We have adjusted the text in Sec. 4.5.2 to explain that this technique can be applied to an omnidirectional robot if “heading correction” is interpreted as aligning its estimated direction of motion with its externally-observed direction of motion.
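
To make this reinterpretation concrete for the reviewer, one possible way to estimate that angular offset is sketched below; the function name, input format, and use of per-step displacement vectors are illustrative assumptions on our part, not text from the paper.

    # Minimal sketch: estimate the angular offset between the robot's intrinsic
    # coordinate frame and the world frame by comparing internally estimated
    # motion directions with externally observed ones, matched by timestamp.
    import numpy as np

    def frame_offset(internal_disp, observed_disp):
        """internal_disp, observed_disp: (N, 2) arrays of per-step displacement
        vectors (dx, dy) from dead reckoning and from the tracking system.
        Returns the mean angular offset in radians."""
        a_int = np.arctan2(internal_disp[:, 1], internal_disp[:, 0])
        a_obs = np.arctan2(observed_disp[:, 1], observed_disp[:, 0])
        diff = a_obs - a_int
        # Circular mean, so offsets near +/- pi do not cancel out.
        return np.arctan2(np.sin(diff).mean(), np.cos(diff).mean())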

The situation in the simulation is quite different from the situation in real field experiment. Why did you not evaluate the case where multiple robots move, sometimes stop for conversation?

The purpose of the evaluations in simulation and in the shopping mall was to quantitatively evaluate the level of accuracy that can be obtained using extrinsic sensors compared with on-board map-matching. The multiple-robot assignment problem is a separate issue, so these two aspects were presented separately. In contrast, the field experiment was intended to complement these controlled evaluations by showing a realistic, empirical demonstration of all elements of the system functioning in context.

There is nothing special about that particular scenario of four robots in that environment except that it represents a real situation in which we actually used the system. System performance would likely vary with different numbers of robots and humans and in different environments (e.g., narrow corridors vs. open spaces), so for the simulation and map-matching evaluations it seemed simplest to use a single robot.

Why did you define 1000 mm of error as tracking failure? Is this threshold too large?

We have rewritten Section 5.2 to make this point clearer. The 1000 mm threshold is essentially arbitrary; its only purpose is to separate the small errors (e.g., 10 cm) that occur while the system is working correctly from the large errors (e.g., 5 m) that can occur when the robot gets lost. Very similar results are obtained for larger or smaller thresholds.

We applied this threshold because a large part of the error observed in the map-matching system occurred at times when the robot was lost and unable to match its laser scans to the map. Because one could argue that techniques exist that would help the robot recover from these tracking failures, we wanted to show that even when such large errors are ignored, the proposed system still yields lower error than the map-matching system.
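
To be explicit about how the threshold is used, it is simply a post-hoc partition of the per-frame error samples, as in the sketch below (variable names and the example usage are illustrative only).

    # Illustrative sketch of the (arbitrary) failure threshold: errors above the
    # threshold are treated as tracking failures and reported separately from
    # the errors observed during normal operation.
    import numpy as np

    def split_errors(errors_mm, threshold_mm=1000.0):
        errors_mm = np.asarray(errors_mm, dtype=float)
        ok = errors_mm[errors_mm < threshold_mm]
        failures = errors_mm[errors_mm >= threshold_mm]
        return ok, failures

    # errors = ...  # per-frame localization error in mm (placeholder)
    # ok, failures = split_errors(errors)
    # print(ok.mean(), len(failures) / len(errors))  # mean error, failure rate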

Though you use student's t-tests to evaluate the difference statistically, what is the theoretical background to use the student's t-tests? Does the distribution follow the Gaussian distribution and is the variance the same as each other?

The reviewer is correct that we should have checked the normality of the data, and we have re-analyzed the data with this point in mind. This analysis shows that the error data with failures removed is normally distributed, whereas the error data including failures is not.

We have corrected the statistical analysis in the text. The results show that there is still a significant difference between conditions.
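
For reference, the general pattern of such a re-analysis is sketched below in Python using SciPy; the particular tests shown (Shapiro-Wilk for normality, Welch's t-test, and Mann-Whitney U as the non-parametric fallback) are common choices and are not necessarily the exact procedures used in the revised manuscript.

    # Sketch of a normality-aware comparison of two error samples.
    from scipy import stats

    def compare_errors(errors_a, errors_b, alpha=0.05):
        # Check normality of each sample first (Shapiro-Wilk).
        _, p_a = stats.shapiro(errors_a)
        _, p_b = stats.shapiro(errors_b)
        if p_a > alpha and p_b > alpha:
            # Welch's t-test: does not assume equal variances.
            return stats.ttest_ind(errors_a, errors_b, equal_var=False)
        # Non-parametric alternative when normality is rejected.
        return stats.mannwhitneyu(errors_a, errors_b, alternative='two-sided')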

In line 39, PP. 5 single-value decomposition -> singular-value decomposition

Thank you. We have corrected this.

Reviewer #5: This article presents a technique for tracking robots and people in a crowded environment.

We would phrase this summary in a slightly different way: this article presents a technique for localizing robots in an environment where a system is already in place which can track people and robots (but cannot distinguish between them).

A previously published article presented how laser range finders fixed in the environment track blobs. These blob trajectories are assigned to the odometry trajectories of the different robots using shape comparisons. It is an interesting technique, but the explanations are most often confusing, as the choice of vocabulary is not consistent, and the style oscillates between too verbose and too concise.

The algorithm should be reviewed to include a proper people tracking, as the one proposed only sees people as "trajectories that do not correspond to robots", and ensures no consistency in their labelling.

We consider the tracking system, which is assumed to exist independently of the proposed system, to be able to handle the task of “people tracking,” and we are making the assumption that robots and people are indistinguishable to that system. While “blob-tracking” does describe part of the basic mechanism by which tracking is handled, the system is quite stable and robust at tracking people. Continuous pedestrian (and/or robot) tracks are assigned a persistent ID within the tracking system, and in some applications we have associated these tracks with information such as personal ID from face recognition.

The task of the proposed system is to determine which of these persistent IDs corresponds to a given robot's odometric track. However, there is always the risk of failure within the tracking system, e.g. the IDs of tracks becoming swapped when two entities are close together, so we consider it safer to dynamically evaluate the mapping of robots to the persistent track IDs rather than permanently labeling a track as a robot.

Again, we would like to emphasize that the assumption of this work is that the tracking system is already present for purposes other than robot localization, and the current technique examines only the robot-relevant aspects of this system.

The proposed experiments do not validate the strong points of the algorithm (multi-robots-multi-users tracking).

In this work, we present three evaluations: First is a simulation experiment to validate the basic advantage of using environmental sensors over on-board sensing for localization. Second is an evaluation in the field, demonstrating multi-robot localization using our technique in a real robot deployment. Third is a further exploration of the robustness of the proposed technique to the changing maps found in real environments.

We feel that these three evaluations validate the important points of our algorithm: improved tracking accuracy, automatic error recovery, and robustness to changing maps. The tracking of people was not evaluated because it was not the focus of this work, and those results have been published elsewhere (Glas et al., Advanced Robotics 2009) and (Brscic et al., IEEE THMS 2013).

A few ceiling-mounted cameras would have solved the same problem in a much easier way (almost no occlusion, detection made easier, etc.). This is briefly suggested by the end of §8.3 and in the first review discussion: maybe a more in-depth comparison of ceiling cameras vs. LRFs would be more relevant than comparing the proposed technique with map-matching (§5 and §7).

Validations of the tracking accuracy of our tracking systems can be found in (Glas et al., Advanced Robotics 2009) and (Brscic et al., IEEE THMS 2013). Again, the proposed technique in the current paper is intended to be tracking-system-independent, compatible with any tracking system that detects entity positions but does not distinguish between robots and humans.

Although this is somewhat off-topic for the current paper, we would like to point out that depth sensors (2D or 3D) do have some concrete advantages over video cameras. Most significant are the simplicity of image segmentation and the robustness to different lighting conditions. The fact that personally identifying data such as face images are not captured by 2D LRFs can also be an advantage when deploying sensor networks in commercial spaces. The long sensing range of 2D LRFs also makes it possible to cover a wide area with a small number of sensors, whereas ceiling-mounted 3D depth cameras or video cameras can require many sensors to provide the same coverage. Certainly there are many valid tracking technologies, each with its own advantages and drawbacks; a few other examples we have used include ceiling-mounted video cameras, omnidirectional video cameras, and floor pressure sensors. The proposed system should be applicable to any such system that can track moving entities, although the system as proposed does not take into account additional information such as color (for video) or footprint shape (for floor pressure sensors), which could help identify more robustly which entities are humans and which are robots.

== Discussion ==

* §1.2: the system is presented as a solution for crowded environments, where people gather around the robot. However, in that case, fixed LRFs cannot see the robot, as it is surrounded. Why not hybridating then the data of the robot with fixed LRFs?

We have thought about this possibility in the past, and it could be an interesting project to explore: the robots could be treated as mobile sensors and incorporated into the sensor network. However, combining these scans is not straightforward, as robots often have floor-level scanners while the environment (in our 2D setup) has waist-level scanners. It might be possible to match human positions extracted from both sets of scanners if a stable human-tracking algorithm were implemented for the on-board scanners. Alternatively, environmental scans from the robot and the fixed sensors could be combined if one assumes the environment does not vary between floor level and waist level, or if 3D point clouds are available. Many possibilities exist, and they would be interesting to explore.

However, the simple solution we present has advantages over complex sensor fusion. The largest is that our proposed system works even with a robot that has no on-board sensors at all; in fact, the robots used in our field experiment did not have any on-board localization capabilities. Furthermore, the proposed technique works with different kinds of tracking systems, such as the method we mention based on 3D depth sensors. Whether sensor fusion is useful would depend heavily on both the sensors available on the robots and the sensors used in the environment.

* §3.2.3: the presented system shows the interesting feature of being able to use any tracking algorithm, but the LRF-based tracking algorithm is actually at the core of the article, see for instance §4.5 or the whole §7, that intends to demonstrate its efficiency.

At least a summary of how it works should be given here.

We have added a summary of the basic points of the tracking algorithm in Sec. 3.2.3.

* §4: the proposed system is supposed to "track people and robots".

However, there is no consistency in people tracking: their trajectories only correspond to the ones not taken by robot-to-blob-trajectories assignments. There is no proper unique ID given to each people - this cannot be called people tracking, but merely people detection.

The tracking of people is handled within the tracking system, including the assignment of persistent IDs to people and the tracking of trajectories over time. Each “blob” (person or robot) is persistently tracked over time with a separate particle filter. A unique “anonymous ID” is assigned to each entity at the time of detection and stays with that entity as long as it is tracked.

Thus, the proposed localization technique is not a mechanism to perform tracking per se, but instead it identifies which of the entities being tracked is likely to correspond to a known robot, and it uses the position data from that entity to update the robot’s position. The association algorithm creates a mapping between the robots and the corresponding anonymous IDs from the tracking system.
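
As a purely hypothetical sketch of this division of labor (all names and structures below are ours, not the paper's): the tracking system owns the anonymous IDs and trajectories, and the proposed technique only maintains and consults a robot-to-anonymous-ID mapping.

    # robot_tracks: {robot_id: recent odometry trajectory}
    # anon_tracks:  {anonymous_id: list of (timestamp, x, y) from the tracking system}
    association = {}   # robot_id -> anonymous_id, re-evaluated as new data arrives

    def corrected_position(robot_id, anon_tracks, association):
        """Return the latest tracked position for this robot if it is currently
        associated with an anonymous track; otherwise None (odometry only)."""
        anon_id = association.get(robot_id)
        if anon_id is not None and anon_id in anon_tracks:
            return anon_tracks[anon_id][-1]
        return None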

* §4.2: the trajectories obtained by integrating odometry are supposed to be similar in shape to the tracked ones, but they can be very unlikely in case of poor wheel adherency for instance. This is a critical importance as the whole technique relies on similarity between tracked trajectories and odometry trajectories.

The reviewer is correct that similarity in shape is of critical importance and is a basic assumption of our technique. Although dead-reckoning methods using internal sensors such as odometry are subject to accumulated error, it has been our empirical experience, and has also been reported elsewhere, that well-calibrated odometry precision is usually within 5% of the distance traveled. That is, if the robot traveled 1 m, the odometry error would be within 0.05 m, which is sufficient to compute a local trajectory. Also, inertial sensor units and gyro sensors are becoming more affordable, enabling increased accuracy in dead reckoning (e.g., “gyrodometry”, Borenstein et al., ICRA 1996) even in cases of poor wheel adherence.

For situations in which odometry is particularly unreliable, it may be appropriate to adjust the error threshold to allow trajectory association despite less-exact matching. Discussion of this point has been added to the new Sec. 4.4.3.
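
As an illustration of how this could interact with the threshold discussed in the new Sec. 4.4.3, one simple option is to scale the allowance with the distance traveled over the comparison window; the sketch below is only a rough illustration of that idea, with the 5% figure taken from the rule of thumb above and all other numbers invented.

    # Rough sketch: scale an association error allowance with distance traveled,
    # following the ~5% drift rule of thumb. The base allowance is a placeholder.
    import numpy as np

    def error_allowance(trajectory_xy, base_allowance_m=0.1, drift_rate=0.05):
        """trajectory_xy: (N, 2) array of odometry positions over the
        comparison window. Returns a distance-proportional error allowance."""
        steps = np.diff(np.asarray(trajectory_xy, dtype=float), axis=0)
        distance = np.linalg.norm(steps, axis=1).sum()
        return base_allowance_m + drift_rate * distance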

* §4.3.2: the geometrical similarity is a good metric for comparing the comparison, but it gets rid of the time dimension. For instance, two robots following the same track but in opposite directions have exactly the same geometrical shape for their evolution but using the timestamps enables a non ambiguous matching.

Our technique does take the temporal sequence into account. Matching is performed over known correspondences between the odometry and tracking data, and these correspondences are determined by timestamp, so the ordering of points in time is preserved. This differs from approaches such as ICP, which perform matching with unknown correspondences.

To make this procedure clearer, we have provided a new Figure 3, which illustrates the process graphically and includes an example like the one the reviewer describes.
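
To illustrate why timestamp-fixed correspondences resolve the reviewer's example (two robots traversing the same path in opposite directions), the sketch below resamples the tracked trajectory at the odometry timestamps and computes a best-fit rigid alignment over those fixed pairs. It is an illustration of the idea using a standard SVD alignment step, not a verbatim transcription of Sec. 4.3.2.

    import numpy as np

    def shape_error(odom_t, odom_xy, track_t, track_xy):
        """odom_t, track_t: 1-D arrays of timestamps (track_t increasing);
        odom_xy, track_xy: (N, 2) and (M, 2) position arrays."""
        # Resample the tracked trajectory at the odometry timestamps, so each
        # odometry point is paired with the tracked point at the same time.
        tx = np.interp(odom_t, track_t, track_xy[:, 0])
        ty = np.interp(odom_t, track_t, track_xy[:, 1])
        P = np.asarray(odom_xy, dtype=float)
        Q = np.column_stack([tx, ty])

        # Best-fit rigid alignment over the timestamp-fixed pairs (SVD / Kabsch).
        Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
        U, _, Vt = np.linalg.svd(Pc.T @ Qc)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:       # keep a proper rotation (no reflection)
            Vt[-1] *= -1
            R = Vt.T @ U.T
        aligned = Pc @ R.T + Q.mean(axis=0)
        return np.linalg.norm(aligned - Q, axis=1).mean()

A trajectory traversed in the opposite direction yields a large residual here, because the pairing is fixed by time rather than chosen to minimize distance.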

* §4.4.2.c)

As soon as there is a better assignment robot-to-tracked-trajectory on one frame, a new association is built. This is a behaviour that will tend to be jumpy, and it would have been more relevant to model these matching with smoother associations (e.g. conditional probabilities or simple particle filters).

The reviewer is correct that a jump is experienced when the system detects and corrects an incorrect association. However, we would argue that a system performing associations based on conditional probabilities or particle filters would also experience such a jump. This phenomenon occurs whenever a system is constrained to output a single maximum-likelihood hypothesis at times when multiple hypotheses are plausible.

To minimize the cases where our system jumps between associations, we leave robots unassociated when confidence is low, i.e., when the matching error is above the threshold. At these times the robot relies only on its odometry for localization, and no jumping is induced by the system. Once the matching error falls below the threshold again, the system associates the robot and resumes sending corrections.
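
A minimal sketch of this gating behavior follows; the threshold value and data structures are invented for the example.

    # Confidence gating: only (re)associate when the matching error is low;
    # otherwise leave the robot unassociated so no position jump is injected.
    ERROR_THRESHOLD = 0.5   # hypothetical

    def update_association(robot_id, best_anon_id, best_error, association):
        """association: dict mapping robot_id -> anonymous track ID."""
        if best_error <= ERROR_THRESHOLD:
            association[robot_id] = best_anon_id   # (re)associate; corrections resume
        else:
            association.pop(robot_id, None)        # unassociate; odometry only
        return association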