Equal but different: a contextual analysis of duplicated videos on YouTube


Videos have become a predominant part of users' daily lives on the Web, especially with the emergence of online video sharing systems such as YouTube. Because users share videos independently in these systems, some videos are duplicates (i.e., identical or very similar videos). Despite having the same content, duplicates can differ in context, for example, in their associated metadata (i.e., tags and title) and in their popularity scores (i.e., number of views and comments). Quantifying these differences is important for understanding how users associate metadata with videos and which factors may influence a video's popularity. Both questions are crucial for video information retrieval mechanisms, for associating advertisements with videos, and for performance issues related to caches and content distribution networks (CDNs). This work presents a broad quantitative characterization of the context differences among identical contents. Using a large video sample collected from YouTube, we construct a dataset of duplicates. Our measurement analysis yields several findings with implications for how videos should be retrieved in video sharing websites, as well as for advertising systems that need to understand the role users play when they create content in services such as YouTube.



Author information

Corresponding author

Correspondence to Virgílio Almeida.

Rights and permissions

Open Access: This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Rodrigues, T., Benevenuto, F., Almeida, V. et al. Equal but different: a contextual analysis of duplicated videos on YouTube. J Braz Comput Soc 16, 201–214 (2010).
