Importing Gwern's DNM Archives

Would anyone mind sharing the setup they use to manage the raw dataset provided by Gwern? I want to get it up and running so I can search the forum archives for keywords and cross-reference different items of interest, but I have zero clue how to interface with it.


Comments


[4 Points] bobbiggs69:

Why don't you ask /u/gwern what he would recommend? He's still around and generally happy to help.


[3 Points] Axaq:

The way I've always done this is to write a batch script that first renames all the files incrementally ("filename1", "filename2", etc.) and makes them txt files too.
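For anyone who would rather not write a batch script, here is a rough sketch of the same renaming step in Python instead. The function name `rename_incrementally` and the "filename" prefix default are just illustrative, and it assumes all the files sit in one flat folder.

```python
from pathlib import Path


def rename_incrementally(folder, prefix="filename"):
    """Rename every regular file in `folder` to prefix1.txt, prefix2.txt, ...

    Files are processed in sorted order so the numbering is repeatable.
    Subfolders are ignored (assumption: the archive dump is flat).
    """
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    renamed = []
    for index, path in enumerate(files, start=1):
        target = path.with_name(f"{prefix}{index}.txt")
        if target == path:
            # Already has the right name, nothing to do.
            renamed.append(target)
            continue
        if target.exists():
            # Refuse to overwrite a file that has not been processed yet.
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        renamed.append(target)
    return renamed
```

Run it on a copy of the archive first, since the original file names are lost after renaming.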

Then I would create a MySQL database to hold all the data from each file, with corresponding tables and columns, so for the forum archives you would have a topics table with columns such as ID, title, content and so on.

If you then set up a PHP environment, you can create a PHP script that connects to your database and runs a loop that extracts the content of each file between certain element names, so you might have

<div class="title">Title here</div>

And you would take the content from each element of interest, place it into an object/array, and insert it into your database tables. Doing it this way is very server intensive and definitely not the best way, but it will get your data stored in a database if done correctly. You will also need to remove the PHP timeout so that the loop can run without crashing.
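To make the extract-and-insert loop concrete, here is a minimal sketch of the same idea. It swaps in Python and SQLite for PHP and MySQL so it runs with nothing but the standard library. The "content" class name, the `topics` table layout, and the helper names `extract_fields` and `store_topic` are assumptions for illustration; the real archive pages will use whatever markup each forum used.

```python
import sqlite3
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first <div> carrying each wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._current = None  # class name currently being captured
        self._depth = 0       # nesting depth of <div> inside the capture

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted and name not in self.fields:
                self._current = name
                self._depth = 1
                self.fields[name] = ""
                break

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data


def extract_fields(html, wanted=("title", "content")):
    """Return {class_name: text} for the wanted div classes (hypothetical names)."""
    parser = ClassTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}


def store_topic(conn, fields):
    """Insert one parsed page into an assumed `topics` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS topics "
        "(id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
    )
    conn.execute(
        "INSERT INTO topics (title, content) VALUES (?, ?)",
        (fields.get("title"), fields.get("content")),
    )
    conn.commit()
```

In use you would loop over the renamed files, call `extract_fields` on each file's text, and pass the result to `store_topic` with a connection from `sqlite3.connect("archive.db")`.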

You can then create another PHP script with a form interface for returning search results from your database.
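The search side can be sketched the same way. This is Python with SQLite rather than a PHP form, and it assumes a `topics` table with `id`, `title` and `content` columns as described above; `search_topics` is a made-up name.

```python
import sqlite3


def search_topics(conn, keyword):
    """Return (id, title) rows whose title or content contains `keyword`.

    The keyword is passed as a bound parameter, never pasted into the SQL.
    Note that % and _ inside the keyword still act as LIKE wildcards.
    """
    pattern = f"%{keyword}%"
    cursor = conn.execute(
        "SELECT id, title FROM topics "
        "WHERE title LIKE ? OR content LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    # Tiny in-memory demo with made-up rows.
    demo = sqlite3.connect(":memory:")
    demo.execute(
        "CREATE TABLE topics (id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
    )
    demo.execute(
        "INSERT INTO topics (title, content) VALUES (?, ?)",
        ("Example thread", "Some example post text"),
    )
    print(search_topics(demo, "example"))
```

A plain LIKE scan gets slow on a big archive, so a full-text index is worth looking into once the basic version works.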

Don't even criticise this, I've done it before and found it easier than methods outside of my skill set of course.


[2 Points] UKSupplies:

The easiest way I found was to rent a VPS (or Amazon provides them for free; search for EC2), put phpMyAdmin on there, and import all the databases. I believe phpMyAdmin has a file size limit, however, so you'll have to import them over SSH directly through the mysql interface. Once that's done, phpMyAdmin provides a clean, easy-to-use interface for viewing all the database records.