The recent evolution of computer-based analytics, together with the development of hardware (networks) and software (Web and Cloud) infrastructures, has led to a renewal of information systems in organisations, as well as to a reconsideration of data-processing methodology (for text documents in particular). A new concept, Big Data, is now widespread and radically changes the notion of an information system and the corresponding practice.
Data processing still relies on a number of classical algorithmic paradigms, but it also makes intensive use of newer concepts such as learning, data mining, statistical modelling, and information retrieval.
Furthermore, these methods are commonly used to model experimental phenomena in physics, biology, and medicine, thus providing engineers and scientists with a new set of tools on top of the standard mathematical models.
This course presents a concrete introduction to the methods used daily to retrieve information in huge sets of documents, to design models from training examples, and to find the data that best fit these models in huge databases.
It can be seen as a complement to INF553: Data Bases and Big Data Management.
It develops a number of methods mentioned in BIO552: Computational Biology.
In a first part, we present the algorithmic methods that make data mining possible in very large files and collections of files: frequent associations and highly correlated objects. We show how probabilistic methods based on hashing techniques yield efficient algorithms when every deterministic solution is out of reach. We also show how these static strategies can be adapted to capture information from high-throughput data streams when it is impossible to store all the data, as is the case on communication networks.
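As a flavour of the hashing-based probabilistic techniques mentioned above, the following sketch (an illustrative example, not course material; all names are hypothetical) implements a Count-Min sketch, which estimates item frequencies over a stream in memory sublinear in the number of distinct items, at the cost of possible overestimation:

```python
import hashlib


class CountMinSketch:
    """Probabilistic frequency estimator for data streams.

    Uses a depth x width table of counters; each item is hashed into one
    bucket per row. Estimates never underestimate the true count, and
    collisions can only inflate it.
    """

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # Derive one bucket index per row from independent-looking hashes.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item):
        for row, col in enumerate(self._buckets(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        # Take the minimum across rows to reduce the effect of collisions.
        return min(self.table[row][col]
                   for row, col in enumerate(self._buckets(item)))
```

A single pass over the stream suffices: each element is fed to `add`, and frequencies can be queried at any time with `estimate`, without storing the stream itself.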
In a second part, we study strategies for giving structure to the data: grouping data into homogeneous subsets, or clusters; organizing them into hierarchies; or computing models from training sets of examples -- probabilistic models such as Hidden Markov Models, or deterministic ones such as Support Vector Machines. Such models make it possible to identify, pragmatically, characteristic properties of data that could not be defined by standard syntactic means.
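To make the clustering idea concrete, here is a minimal sketch of Lloyd's algorithm (k-means), one standard way of grouping data into homogeneous subsets; it is an illustrative example only, with a deliberately simple deterministic initialisation:

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm on 2-D points: alternately assign each point to
    its nearest centroid, then move each centroid to its cluster's mean."""
    # Deterministic initialisation (first k points) keeps the sketch
    # reproducible; real implementations pick initial centroids at random.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:  # leave a centroid in place if its cluster is empty
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters
```

Each iteration can only decrease the total within-cluster squared distance, so the procedure converges to a local optimum of that objective.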
We give examples borrowed from text document analysis, Web networks, biological sequence analysis, image processing, etc.
The methods under study are systematically evaluated in terms of efficiency and applicability.
Requirements: None
Evaluation mechanism: Students will be evaluated on the basis of presentations done during the lectures and of a project at the end of the course.
Last Modification: Tuesday 2 April 2013