Automatic Address Data Cleansing
Data cleansing is gaining importance in data-oriented research, e.g. social media analysis, and in applied areas such as fraud detection, since mining related data requires merging data from different sources. That is, one must retrieve information on the same individuals from records that are written differently.
For example, one may want to find out whether the addresses “5th avenue south west 16”, “5th ave S W 16” and “fifth ave. SW 16” are the same. This is easy for a human, but not for a machine. The problem becomes more complicated when addresses can be written in several languages, as is the case for addresses from Switzerland (French, Italian and German) or Canada (French and English).
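To make the difficulty concrete, the following is a minimal sketch of a rule-based normalization that maps the three example addresses to one canonical string. It is not the approach proposed in this project; the function name `normalize_address` and the tiny `ABBREVIATIONS` table are hypothetical and chosen only for illustration. A hand-written table like this one would have to be maintained separately for every language and region, which is the limitation a learned approach aims to avoid.

```python
import re

# Hypothetical abbreviation table, covering only the example above.
# A real table would be far larger and language-specific.
ABBREVIATIONS = {
    "avenue": "ave",
    "fifth": "5th",
    "south": "s",
    "west": "w",
}

# Fused compass directions that should be split into single letters.
FUSED_DIRECTIONS = {"sw", "se", "nw", "ne"}


def normalize_address(address: str) -> str:
    """Return a crude canonical form of a free-text address."""
    text = address.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    expanded = []
    for tok in tokens:
        if tok in FUSED_DIRECTIONS:
            expanded.extend(tok)  # "sw" becomes "s", "w"
        else:
            expanded.append(tok)
    return " ".join(expanded)


examples = ["5th avenue south west 16", "5th ave S W 16", "fifth ave. SW 16"]
canonical = {normalize_address(a) for a in examples}
print(canonical)  # a single canonical form if all three addresses match
```

With this table the three spellings collapse to the same string, but any spelling or language not listed in the table would not.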
We propose a machine-learning-based algorithm to automatically retrieve and match the information on the same individuals. In particular, we use a combination of cluster analysis and association rules to build our model.
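As a rough illustration of the cluster-analysis idea only, the sketch below groups address strings whose character-level similarity exceeds a threshold. It is an assumption-laden stand-in, not the model described here: the greedy single-pass grouping, the use of `difflib.SequenceMatcher` from the Python standard library, the function names, and the threshold of 0.8 are all illustrative choices, and the association-rule component is not shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_addresses(addresses, threshold=0.8):
    """Greedily assign each address to the first cluster whose
    representative (its first member) is similar enough."""
    clusters = []
    for addr in addresses:
        for cluster in clusters:
            if similarity(addr, cluster[0]) >= threshold:
                cluster.append(addr)
                break
        else:
            # No sufficiently similar cluster found: start a new one.
            clusters.append([addr])
    return clusters


groups = cluster_addresses(
    ["5th ave s w 16", "5th Ave S W 16", "bahnhofstrasse 1 zurich"]
)
print(groups)
```

The result of a greedy pass like this depends on the input order and on the chosen threshold, which is one reason a more principled clustering method is preferable in practice.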
Our partner has been very impressed by this solution, and the development of a software product based on our results is planned.