PostgreSQL: the most sophisticated database in the world.

France Telecom's Search Engine based on a 24 TB database powered by PostgreSQL

This translation is not finished yet! Please do not forward this text until it has been properly reviewed…

Please send your comments to: damien@dalibo.info

In the following text, Severine Aubry and Robin Pouget, respectively Project Manager and Database Manager at France Telecom, describe their use of PostgreSQL for one of their company's major tools.

France Telecom has built a web search engine called “Le Moteur” (http://www.lemoteur.fr/). Behind this engine lies a back office that includes all the machinery needed to store the keywords found on web sites (URLs), analyze them, index them, and so on. This application is highly critical because it determines the quality of the results produced by the search engine. The back office must be refreshed 24 hours a day, 6 days out of 7, which means it can tolerate one day without updates.

This part of the search engine was built between 2001 and 2002, exclusively with PostgreSQL. At that time, the PostgreSQL project was about to release version 7.4. Today the system is based on PostgreSQL 8.2. Of course, the back office has seen some improvements and fixes over the years.

In detail, the engine is composed of a crawler whose job is to browse the Internet, starting from a list of URLs and thousands of key sites and following links automatically. The collected data is then processed by several scripts written in Tcl and stored in a schema with a few thousand tables. Partitioning relies on a technology developed in-house, based on hash keys.
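The in-house partitioning technology is not described in detail, so the following is only a minimal sketch of what hash-key partitioning can look like in general. The table count (`NUM_TABLES`), the `urls_` table prefix, the choice of MD5, and the function name `partition_for` are all assumptions made for illustration, not France Telecom's actual design.

```python
import hashlib

# Hypothetical value: the text only says "a few thousand tables".
NUM_TABLES = 4096


def partition_for(url: str, num_tables: int = NUM_TABLES) -> str:
    """Map a URL to a table name using a stable hash of the URL.

    The same URL always lands in the same table, and URLs are spread
    roughly evenly across all tables.
    """
    # A cryptographic digest is used only because it is stable across
    # processes and machines (unlike Python's built-in hash()).
    digest = hashlib.md5(url.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_tables
    # Hypothetical naming scheme: urls_0000, urls_0001, ...
    return f"urls_{bucket:04d}"


if __name__ == "__main__":
    for url in ("http://www.lemoteur.fr/", "http://www.postgresql.org/"):
        print(url, "->", partition_for(url))
```

With a scheme of this kind, any script that knows the hash function can compute which table holds a given URL without consulting a central directory, which is one reason hash keys suit a system with no dependencies between its elements.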

From the start, the goals of the project were a stable and robust solution that is easy to maintain and able to handle huge data growth. This meant that the project had to be able to scale both its disk space and its number of PostgreSQL servers freely.

Here are some figures that describe the whole system: more than 5 billion tuples are distributed among 160 Linux servers running 800 PostgreSQL instances. The overall data volume is 24 terabytes. Note that PostgreSQL is not the only software running on these machines: applications run alongside the databases. The Linux servers are spread over three separate datacenters, following a logical division called “software blocks”. Data is also exported from these databases to other databases for various uses.
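To give a sense of scale, the figures above can be turned into rough per-server and per-instance averages. This is only back-of-the-envelope arithmetic: it assumes an even distribution across servers and instances (which the text does not state) and decimal units (1 TB = 1000 GB). Since the tuple count is given as "more than 5 billion", the tuple average is a lower bound.

```python
# Figures quoted in the text.
servers = 160
instances = 800
tuples = 5_000_000_000   # "more than 5 billion", so a lower bound
volume_tb = 24

# Assumption: load is spread evenly; real systems rarely are.
instances_per_server = instances / servers        # 5.0
tuples_per_instance = tuples / instances          # 6,250,000
gb_per_instance = volume_tb * 1000 / instances    # 30.0 GB (decimal units)
gb_per_server = volume_tb * 1000 / servers        # 150.0 GB (decimal units)

if __name__ == "__main__":
    print(f"{instances_per_server:.0f} instances per server")
    print(f"{tuples_per_instance:,.0f} tuples per instance (at least)")
    print(f"{gb_per_instance:.0f} GB per instance")
    print(f"{gb_per_server:.0f} GB per server")
```

On average, then, each PostgreSQL instance is fairly small, which is consistent with the stated goal of scaling by adding servers and instances rather than by growing any single database.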

The entire application is extremely flexible: even if a server fails, there is no service outage because the data is replicated. Moreover, there are no dependencies between the elements.

Over the history of the project there have been a few minor issues, which were fixed by the community, whose support was effective. Among these issues:

  • Data fragmentation due to massive updates. This has been fixed by new PostgreSQL features introduced over the years (remember that the project has been running PostgreSQL for 10 years!)
  • Some memory management concerns, which were corrected in version 8
  • VACUUM FULL runs are now almost ancient history. At first, the system needed 3 or 4 VACUUM FULL runs per year. Now 1 is enough.

In conclusion, PostgreSQL has been a source of satisfaction for over 10 years. The few problems encountered were all handled very effectively by the community. New versions of PostgreSQL have resolved these problems through bug fixes, improvements, or simplifications.

[Interview conducted by Jean-Paul Argudo, March-June 2011]

 
en/temoignages/moteur_orange.txt · Last modified: 2011/07/06 23:56 by daamien