HTML predominates on the Catalan web

20-12-2007

The PADICAT project (Digital Heritage of Catalonia), led by the Biblioteca de Catalunya (National Library of Catalonia) with the support of the Centre de Supercomputació de Catalunya (CESCA), has carried out an exhaustive analysis of the formats and technology used on the Catalan web, based on a sample of 1.000 websites of all kinds.

An examination of these 1.000 websites, all held in the project's repository, shows that each website averages 1,33 GB in volume and 33.942 files. Never before has the Catalan web been analysed with such a large sample.

 

Websites included in PADICAT/research sample    1.004
Website captures across different editions      2.720
Total number of files                           34.077.807
Average number of files per website             33.942
Total volume of the PADICAT archive             1.339,24 GB
Average volume per website                      1,33 GB
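The two averages follow directly from the totals in the summary above. As a minimal sketch of that arithmetic (the variable names are illustrative and do not come from the PADICAT project), dividing each total by the 1.004 websites in the sample reproduces the published figures:

```python
# Totals as reported in the summary above (European separators removed).
websites = 1_004
total_files = 34_077_807
total_volume_gb = 1_339.24

# Averages are simply the totals divided by the number of websites.
files_per_website = total_files / websites
volume_per_website_gb = total_volume_gb / websites

print(f"{files_per_website:.0f} files per website")
print(f"{volume_per_website_gb:.2f} GB per website")
```

Rounded to the precision used in the article, this gives 33942 files and 1.33 GB per website.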

 

The research also confirms that the most common formats on the Catalan web are HTML (71,69%), JPEG (7,09%), GIF (2,45%) and PDF (1,32%), followed by other, less common types. For the project's leaders, the predominance of such popular formats, which together account for 82,5% of all the files in the sample, bodes well for the preservation of digital resources on the internet.

Format                         Files       Volume (GB)  % Files  % Volume
text/html                      24.429.679  592,45       71,69%   55,83%
image/jpg                      2.416.055   123,81       7,09%    11,67%
image/gif                      834.019     6,79         2,45%    0,64%
application/pdf                449.983     167,34       1,32%    15,77%
no-type                        75.070      0,16         0,22%    0,02%
image/png                      72.905      1,51         0,21%    0,14%
application/x-shockwave-flash  68.379      5,62         0,20%    0,53%
application/msword             42.150      5,31         0,12%    0,50%
text/plain                     39.962      15,77        0,12%    1,49%
text/css                       35.668      0,17         0,10%    0,02%
text/xml                       35.583      0,46         0,10%    0,04%
application/x-javascript       23.882      0,18         0,07%    0,02%
image/pjpeg                    14.514      0,38         0,04%    0,04%
audio/mpeg                     10.319      41,1         0,03%    3,87%
application/atom+xml           10.264      0,05         0,03%    0,00%
image/bmp                      10.202      2,23         0,03%    0,21%
audio/x-ms-wma                 8.869       25,78        0,03%    2,43%
application/download           8.122       0,3          0,02%    0,03%
application/zip                5.730       11,49        0,02%    1,08%
application/xml                5.396       0,05         0,02%    0,00%
application/vnd.ms-excel       5.222       0,55         0,02%    0,05%
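The file-count percentages, and the combined share of roughly 82,5% quoted for the four most common formats, can be checked against the total of 34.077.807 files. The following is a minimal sketch of that calculation, assuming the percentages are taken over the total number of files in the sample; the variable names are illustrative only:

```python
# File counts for the four most common formats, taken from the table above.
total_files = 34_077_807
top_formats = {
    "text/html": 24_429_679,
    "image/jpg": 2_416_055,
    "image/gif": 834_019,
    "application/pdf": 449_983,
}

# Share of each format as a percentage of all files in the sample.
shares = {fmt: 100 * count / total_files for fmt, count in top_formats.items()}
combined = sum(shares.values())

for fmt, share in shares.items():
    print(f"{fmt}: {share:.2f}%")
print(f"combined: {combined:.1f}%")
```

The individual shares come out at 71.69%, 7.09%, 2.45% and 1.32%, and the combined share at 82.5%, matching the published figures.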

 

Through the PADICAT project, the Biblioteca de Catalunya, which participates in the International Internet Preservation Consortium along with 26 other institutions, aims to preserve Catalan websites and to guarantee open and permanent access to them. The project has the agreement of 287 institutions of all kinds.