Gathering A Dataset of 33 Million Unlabeled Images from Imgur
As part of a machine learning project to build a text-to-image model, I gathered a large dataset from archives of Imgur. Well, 33 million images in around 11 terabytes sounds big to me, but there are people out there who probably ingest that much in a day. 🤷‍♂️ For my purposes, it’s enough for a proof of concept. Anyway, there’s some interesting stuff about the dataset and about the work I did to gather it, so I thought I’d write it up.
The archives
In April 2023, Imgur changed their terms of service and announced that they would be deleting “old, unused, and inactive content that is not tied to a user account” and “nudity, pornography, & sexually explicit content” on May 15. This prompted Archiveteam to start a project to archive as much of Imgur as possible given the remaining time. Imgur launched in 2009 and hosted a ton of images which are linked all over the web. Turning all of those into 404s would be something of a tragedy. Archiveteam exists to avert tragedies like that. If someone is deleting a public resource, they go and preserve it. Anyway, I heard about what was going on, got to thinking, and ran their archiver software on my machine for a while. Everything grabbed by anyone running the archiver gets sent to Archiveteam for processing, and eventually ends up on the Internet Archive, in a big pile of archives. I think it ends up in the Wayback Machine later, but I don’t really understand the whole process.
To gather my dataset, I downloaded, processed, and stored 1,332 archives from the pile. As of writing there are 49,731 total archives in there, with an average size of 14 GiB, taking up around 680 TiB. A lot of that is redundant - my processed archives are on average 10.24 GiB. That’s after removing downscaled copies of the images, deduplicating by hash, and generating a still from every video. Based on the number of images in my archives, I estimate the total number of unique images in the pile to be around one billion.1 Far more than the 33 million I’m working with. Part of the appeal of this dataset is that it’s effectively infinite. Once I’ve written the processing code, getting more data is a matter of spinning up a VM and running it for a few days. Fetching from IA isn’t fast, but if your alternative is downloading from the web in general like you would if you were grabbing LAION-5B, it’s pretty good.
My title says 33 million, but the actual number of usable images for my project is ~25 million. My preprocessing script (which runs after the processing and before the other processing (with an optional processing step in between)) drops everything smaller than 128x128, everything that’s not RGB, and everything that Pillow can’t load.
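For the curious, the filter amounts to something like the sketch below. This isn’t my actual script, just the shape of it; the function name and the catch-everything error handling are mine for illustration.

```python
from PIL import Image

def usable(path):
    """True if the image is at least 128x128, RGB, and loadable by Pillow."""
    try:
        with Image.open(path) as im:
            im.load()  # force a full decode; truncated or corrupt files fail here
            return im.mode == "RGB" and im.width >= 128 and im.height >= 128
    except Exception:
        # Anything Pillow can't open or decode gets dropped.
        return False
```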
What’s in there?
Tons and tons of screenshots of GTA. Seriously, it’s absurd. Why were people uploading so many screenshots of GTA to Imgur? Someone please tell me. It’s around 9% of the dataset, around as much as all the other video games combined.
I made a little gallery based on a sample of 5,000 images from the dataset. Take a look. I manually tagged all the NSFW images, and they’re filtered out by default. There are radio buttons at the top to turn that off or invert it if you’re curious. That said, there’s still plenty of objectionable stuff in there, including at least one Happy Merchant and a use of the “N word”.
As you can see, it’s far from a representative sample of all the pictures that exist, or all the pictures that were ever uploaded to the internet. Imgur has a particular user base, a particular use case, and the images that Archiveteam’s efforts grabbed are non-representative as well. To speculate, the images archived are probably biased towards images that were more likely to be deleted, more popular among the sorts of people that upload lists of images to be queued for archiving, and more likely to be linked in places that were scraped for links.
Some broad categories of images:
- Screenshots of GTA
- Screenshots of other video games
- Comics
- Memes
- Screenshots of websites
- Anime
- Pornography and titillating non-pornography
- White on black Russian text
  - Labels for Russian standardized exam videos?
  - Just says “download link” with some number
  - As best I can tell there are (were) some Russian people who used Imgur to store images used as titles for stuff?
- Pictures of English text. Outside of screenshots this is less common than Russian.
I don’t mean to imply that these are the only types of images in the dataset, just that their prevalence is striking when you look at it. Spend a few minutes scrolling through the gallery and you’ll see what I mean. If I look at the first 16 images in a shuffled gallery I see: a photo of the box and disc of a video game, a photo of a watch, a nature photo of a beach, something unrecognizable, some weird meme, a picture of Nelson from the Simpsons, a sprite sheet I think, a photo of a guy in a beanie, a nude photo of a man, a pro wrestling promo image, a screenshot of GTA, a screenshot of Google Image Search results, what looks like a screenshot of Minecraft, some microelectronics work in progress, an album cover, and another screenshot of GTA.
I went through 140 images and did some quick stats:
- 32% have at least one woman in them
- 39% have at least one man
- 4% have nudity
- 9% are screenshots of GTA
- 11% are screenshots of other video games
A lot of the images with men are screenshots of games. Excluding those, 30% had at least one woman, and 24% had at least one man. On the other hand, of the images with nudity, 71% had at least one woman and 43% had at least one man. A great deal of the overrepresentation of women is due to pornography as well as pictures of women in titillating poses and outfits. This might lead to an image model using the dataset being more likely to draw women nude, and it would almost certainly lead to one being better at drawing nude women than nude men, but I won’t draw firm conclusions before seeing results from an actual model.
As an aside, ChatGPT-4 is quite good at answering “wtf is this picture?” type questions. The vision model is capable of photo OCR and the language model can do translation, along with understanding a ton of shit.
Data gathering automation
I wrote a bunch of automation, but honestly most of it isn’t too interesting to anyone else. A few highlights:
- Untarring and tarring things can be a bottleneck with a real disk. Getting a machine with a ton of RAM and making a huge tmpfs solves this.
- If you pipeline it so that downloading, processing, and uploading can happen simultaneously, you can get big performance gains, at the cost of having to debug concurrency issues. There’s a sketch of what this looks like after the list.
- Some hash functions are faster than others. Use BLAKE2b (example after the list).
- As always, the right indexes are essential to making SQL fast. You can put the things in the WHERE clause of your query in the WHERE clause of your index (partial-index example after the list).
- Cloudflare’s R2 is substantially cheaper than S3.
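Here’s roughly what the pipelining looks like. This is a toy version, not my actual code: the `download`, `process`, and `upload` functions are stand-ins for the real steps, and the real thing has a lot more error handling.

```python
import queue
import threading
import time

def download(name):
    time.sleep(0.1)  # stand-in: fetch one archive from the Internet Archive
    return name

def process(archive):
    time.sleep(0.1)  # stand-in: untar, dedupe, grab video stills, re-tar
    return archive

def upload(result):
    time.sleep(0.1)  # stand-in: push the processed tarball to object storage

# Bounded queues so a fast stage can't run arbitrarily far ahead of a slow one.
to_process = queue.Queue(maxsize=4)
to_upload = queue.Queue(maxsize=4)

def downloader(names):
    for n in names:
        to_process.put(download(n))
    to_process.put(None)  # sentinel: no more work

def processor():
    while (item := to_process.get()) is not None:
        to_upload.put(process(item))
    to_upload.put(None)

def uploader():
    while (item := to_upload.get()) is not None:
        upload(item)

names = [f"archive-{i:05d}" for i in range(10)]
threads = [
    threading.Thread(target=downloader, args=(names,)),
    threading.Thread(target=processor),
    threading.Thread(target=uploader),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```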
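The hashing is nothing fancy either: dedup just keys on a digest of the raw file bytes, read in chunks. A minimal version with hashlib’s BLAKE2b (again, a sketch, not my actual code) might look like:

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """BLAKE2b digest of a file, read in 1 MiB chunks."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(paths):
    """Yield only the first path seen for each distinct digest."""
    seen = set()
    for path in paths:
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            yield path
```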
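And the index tip is about partial indexes: if the hot query always filters on the same condition, you can bake that condition into the index itself, so the index only covers rows the query can actually return. A toy SQLite example (the table and column names are made up, not my real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (
        hash     TEXT PRIMARY KEY,
        archive  TEXT,
        uploaded INTEGER DEFAULT 0
    );

    -- Partial index: it only covers rows matching its WHERE clause,
    -- which is the same WHERE clause the hot query uses below.
    CREATE INDEX idx_pending ON files (archive) WHERE uploaded = 0;
""")

# This query can be served by idx_pending, and the index stays small
# because it only contains the not-yet-uploaded rows.
rows = conn.execute(
    "SELECT hash FROM files WHERE uploaded = 0 AND archive = ?",
    ("archive-00001",),
).fetchall()
```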
I’ve processed and uploaded 1,332 archives, containing a total of 32,945,613 unique files. That’s ~24,734 files per archive. That times 49,731 makes 1.23 billion files. This is probably an overestimate, since the more archives you’ve seen, the more likely an image in a new archive is to be a duplicate. So I’ll ballpark it at a billion.