Call for DNM-related datasets and scrapes

Since December 2013, I have been scraping all the English-language blackmarkets on a weekly and sometimes daily basis and archiving my copies. The intent, besides helping support my tables of blackmarkets and arrests, has always been to release them publicly, unrestricted, for all uses. Up until now I've shared most of them only privately and a few publicly, but I think a public release of a full archive of all of the scrapes and additional materials I've picked up along the way is past due.

So I've paused my crawls and have begun the tedious process of unpacking and recompressing various partial archives, organizing them, writing up key info on the scrapes, and eventually generating PAR2 files and uploading everything to the Internet Archive; I estimate the whole thing will be somewhere around 50GB compressed, so hosting it myself is not viable.

I would like the archive to be as comprehensive as possible, so I am asking for contributions of any and all datasets relevant to the blackmarkets. I particularly want them uncensored, as censoring removes most of the utility of crawls (to give an example, I believe there are at least two papers based on my uncensored crawls in the past year, in addition to the crawls supporting my work on identifying arrests and others' work incorporating feedback & PGP keys into vendor databases, while I know of zero uses of the public SR1 crawl data, which, while well-intentioned, is so heavily wiped that it cannot even be used to find errors in the analysis).

In particular, while I have excellent coverage of all markets in 2014-2015, I'm completely or mostly missing data on some old markets, some markets which defeated crawling, and some which were too transient for me to capture any crawls of:

Besides crawls, I welcome any parsed or cleaned-up datasets based on an existing crawl, any supporting materials for past papers which were too bulky for you to easily host yourself, any relevant material like the Ulbricht trial documents (I will be including the ~300 evidence-exhibits but it'd also be good to have all the filings and daily transcripts if anyone has those), etc. It's so big already that I might as well include anything relevant.

You can PM me on Reddit, or email me.


Comments


[6 Points] None:

Do you have a day job? The amount of time and dedication you put into this community is incredible. I'm useless when it comes to this stuff, but I have to point out how much I appreciate all of this.


[2 Points] None:

Thank you Gwern, what you do here is of immense value to us all.


[2 Points] bobbiggs69:

/u/gwern I have a high bandwidth server from OVH. I'd be happy to host them for you.

I run a BitTorrent client there so I could seed it.


[2 Points] -El_Presidente-:

On the basis that you will be releasing all of the scrapes unedited, El Presidente is contributing the following unedited scrapes:

There is a PM with a link on your hub account

EP

xx


[1 Points] None:

What information are we talking about here? I'm not sure I understand what benefit there is to publishing this. Isn't it just providing lazy media hacks with more fuel for their ill-informed anti-drug editorials?


[1 Points] The_Grid_Is_Up:

Pardon my fawning, but I think it's incredible how you manage to log all of this information. You're a very intelligent person :)

Oh, and I love the Gwern Branwen thing.


[1 Points] Deku-shrub:

Have you seen this research from Trend Micro?

Seems up your street.

I assume you've seen Ryan Compton's visualisations?


[1 Points] ad92wdj929jd:

You mean market forums too? So did you make copies of forums every so many days, or did you just settle on the latest copy (like, do you have deleted/unedited forum posts)?

