CC Search (https://search.creativecommons.org/) is a media search engine maintained by Creative Commons, which currently indexes metadata for around 500 million images. As with any search engine, one challenge is ranking the results appropriately.
- The ‘authority’ of an image is defined (for us) by the probability that someone would end up looking at a particular image, given that they’re puttering around the internet looking for images in general. For general webpages, the typical example of such a metric is PageRank.
- The ‘popularity’ is simply how many people have viewed/liked/commented on a given image (or some balance of these).
These metrics are likely correlated with each other, but distinct. For example, a particularly shocking image may get passed around a lot, and therefore viewed, but won’t be linked very often from mainstream sites.
Generally, we gather metadata from APIs provided by image hosts, many of which are cultural institutions such as museums. For two of our providers, we have metrics available to help us understand which images should be ranked higher, based on their popularity or authority. Because it’s probable that popularity metrics would not be comparable between different providers, we’ve also developed a standardization procedure to map raw metrics into the same (or at least similar) distributions. For many providers, however, we have no metric available from their API that describes (at least directly) the popularity or authority of a given image. We’re interested in finding ways to measure these metrics for such images.
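As a concrete illustration of the standardization idea, here is a minimal sketch assuming a simple percentile-rank mapping; the actual procedure CC Search uses may differ, and the provider names and raw counts below are hypothetical.

```python
import numpy as np

def normalize_metric(raw_scores):
    """Map one provider's raw popularity scores onto (0, 1] by percentile rank.

    Each image's score becomes the fraction of that provider's images with a
    lower-or-equal raw score, which makes scores comparable across providers
    whose raw metrics live on very different scales.
    """
    raw = np.asarray(raw_scores, dtype=float)
    ranks = raw.argsort().argsort()   # 0-based rank of each score
    return (ranks + 1) / len(raw)     # percentile in (0, 1]

# Hypothetical view counts from two providers on wildly different scales
# map onto the same distribution after normalization.
provider_a = normalize_metric([3, 10, 250, 4000])        # small museum
provider_b = normalize_metric([1e4, 5e4, 2e6, 9e7])      # large photo host
```

Rank-based mappings like this ignore the absolute magnitude of the raw metric, which is the point: an image in the 90th percentile of views at a small museum is treated comparably to one in the 90th percentile at a large photo host.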
Questions we’d love to answer include:
1. Is it possible to develop an authority metric that can be derived from the image data itself? What about EXIF data? For inspiration, one could search for ‘ImageRank’ or ‘VisualRank’ and look at Google papers and patents related to this question.
2. What about inheriting popularity or authority from the site or domain that hosts the image?
3. Failing (1) and (2), perhaps we could try to gather more data from Common Crawl and process it to try to determine an authority metric for the pages containing a given image.
For datasets, we currently have almost 100 million (and growing) images (resized to a max dimension of 640px) in our catalog, and we have metadata (tags, descriptions, etc.) for most of the ~500 million images we have indexed. Common Crawl is also (of course) available if we’d like more raw web data to work with.
Researchers: Tingwei Shen, Licheng Zhong, Michael Ren, Utkarsh Nath
Advisor: Brent Moran
For this project, we researched methods for assigning a popularity metric to images within the Creative Commons dataset. The Creative Commons database contains terabytes’ worth of images and image metadata. We looked at how to aggregate this information and use machine learning models to estimate how popular and authoritative each image is. These metrics would allow Creative Commons to give better search suggestions to users browsing their database for images.
In general, I created a Google Colab notebook, pushed weekly updates to the GitHub repository, and wrote the code for data cleaning, BERT embedding, and tag-list prediction.
Taking the image data transformed by Licheng, I converted the tags of each image into a properly formatted tag list. We tried different NLP embedding models and ultimately decided on BERT, so I fed the transformed tag lists through the BERT model. The result for each tag list is a tensor with 12 dimensions. During this time, we met several difficulties, including updating the BERT model and speed inefficiency. I profiled the model and found that most of the time was consumed by the model.eval expression, so I took out the eval call and successfully sped up our program. Later, we tried to predict popularity by training models on our data. At first, I tried linear regression and logistic regression, but the linear model could not represent our data well, and logistic regression cannot predict continuous values. In the end, with Brent’s help, we used MLPRegressor. I transformed the embeddings into the proper form, did a train-test split, and trained the model with MLPRegressor. I tested various parameters by changing the hidden layers, maximum iterations, training size, and so on. The final score I got by training on the whole image data set was around 0.02 to 0.4.
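The regression step described above can be sketched as follows. This is a hedged illustration, not the project's exact code: the 768-dimensional vectors are random placeholders standing in for the BERT tag-list embeddings, the targets are synthetic, and the hyperparameters (hidden layer sizes, max_iter, train/test split) are illustrative values of the kind that were tuned experimentally.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Placeholder data: random vectors standing in for BERT tag-list
# embeddings, with a synthetic popularity target derived from them.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))                               # fake embeddings
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.1, size=500)   # fake popularity

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Illustrative hyperparameters; the project varied hidden layers,
# maximum iterations, training size, etc.
model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)   # MLPRegressor.score returns R^2
```

Note that `MLPRegressor.score` reports the coefficient of determination (R², at most 1.0), which is one plausible reading of the 0.02–0.4 scores mentioned above.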
I worked on creating a machine learning model that takes an image itself (not its metadata) as input and outputs a popularity metric. To reduce training time, much of my work focused on figuring out how best to conduct transfer learning from an existing model. For the base model, we decided to use [VGG16](https://arxiv.org/abs/1505.06798) due to its overall high accuracy on image-classification tasks. In the process, we also looked into alternative methods of transfer learning, as outlined in this [Amazon paper](https://www.amazon.science/blog/updating-neural-networks-to-recognize-ne...). As of writing, the model has not yet been fully implemented, but I learned a lot about using TensorFlow and training such models.
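Since the model was not fully implemented, the following is only a sketch of one standard way to set up such transfer learning in TensorFlow/Keras: freeze the VGG16 convolutional base and train a small regression head on top. The head architecture (pooling layer, dense sizes) is an assumption, and `weights=None` is used here only so the sketch runs without downloading the pretrained checkpoint; in practice `weights="imagenet"` would be used so the convolutional features are actually pretrained.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_popularity_model(input_shape=(224, 224, 3), weights=None):
    """Transfer-learning sketch: frozen VGG16 base + small regression head.

    Use weights="imagenet" in practice; weights=None here avoids the
    pretrained-checkpoint download in this illustrative example.
    """
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False  # freeze the pretrained convolutional features

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),  # single scalar output: the popularity score
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_popularity_model()
```

Freezing the base means only the small head is trained, which is what makes transfer learning cheap compared with training VGG16 from scratch.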
Thank you to the Discovery Fall’20 program for giving us the opportunity to conduct this research. We all learned a lot from it and had an overall great experience. Finally, thank you to Brent Moran for being a great advisor and mentor to us.