Forum:Easy download of GE data?
The GE price information is extremely useful; however, it's optimized for people to view it one item at a time, not for a computer program to parse. Would it be possible for this wiki to provide a way to access all the gathered GE data from a single page that can be downloaded for later processing? This would be extremely useful for projects like calculating economic trends, alchemy losses, etc.
For this to be as useful as possible, the following data would be needed:
- Price histories
- Alchemy values
- Buying limits
- Whether it's P2P or F2P
- Related items
- What items are used to make the item
- What items the item can be used to make
- Anything and everything else currently stored about the item in this Wiki, including historical information from edits
Mostly, only the first couple of items in the list are needed, but I'm sure people will come up with uses for the rest. --MarkGyver 21:37, March 10, 2010 (UTC)
- For example, exporting Category:Grand_Exchange gets you everything in the GEMW. Of course this is a REALLY long list. Hello71 22:27, March 10, 2010 (UTC)
- And, if you want EVERY edit, then uncheck the "latest revision only" checkbox. Of course, that makes for a REALLY big file. Hello71 22:40, March 10, 2010 (UTC)
Thanks for the response! You mention that it would be a "REALLY big file", which I think my system could handle, but would it break/slow the wiki? Also, I was after something more along the lines of just the raw data, not the entire set of pages. I guess I could try making a bot to generate the data from the exported page list, though. --MarkGyver 23:34, March 10, 2010 (UTC)
- You mention that it would be a "REALLY big file", which I think my system could handle, but would it break/slow the wiki?
- I really don't know, but I imagine it would be huge, considering that the GEMW pages have probably accumulated quite a few revisions by now. You also have to account for vandalism when filtering the results.
- I guess I could try making a bot to generate the data from the exported page list, though.
- That's what I was thinking, and most modern programming languages have built-in libraries to deal with XML data.
- Hello71 23:46, March 10, 2010 (UTC)
Size of exported GE data
Assuming that each of the 3,206 pages in Category:Grand Exchange has an average of 500 revisions, each averaging 350 bytes, and that images are excluded from the export, the total comes to 561,050,000 bytes -- slightly more than half a gigabyte. I know my computer could easily handle that much, but I don't trust my internet connection to stay stable long enough for that large a transfer to complete in one go. Is there a way to ensure a download that big completes properly? I really want the data, but I don't want to have to try 10 times to get it to download properly. --MarkGyver 00:00, March 11, 2010 (UTC)
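The arithmetic above can be checked with a quick sketch (the 500-revision and 350-byte figures are the post's own assumptions, not measurements):

```java
public class SizeEstimate {
    // Multiply pages x revisions-per-page x bytes-per-revision.
    static long estimateBytes(long pages, long revsPerPage, long bytesPerRev) {
        return pages * revsPerPage * bytesPerRev;
    }

    public static void main(String[] args) {
        // 3,206 pages in Category:Grand Exchange; the other two
        // numbers are assumed averages from the post above.
        long total = estimateBytes(3206, 500, 350);
        System.out.println(total + " bytes");        // 561050000 bytes
        System.out.printf("%.2f GB%n", total / 1e9); // 0.56 GB
    }
}
```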
- After looking around a bit, I found Help:Database download and checked the link in Special:Statistics for the full database dump, including histories. The total file size is only 924 MB, which I'm now downloading with Wget, which can resume interrupted downloads easily enough. It's a fair bit more than just the GE data would've been, but it'll work. --MarkGyver 00:34, March 11, 2010 (UTC)
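For anyone following along, the resumable download is a one-liner; the URL below is just a placeholder, the real one is whatever Special:Statistics links to:

```shell
# -c (--continue) picks up a partially downloaded file where it left off
# instead of restarting, which is what makes flaky connections survivable.
wget -c http://example.org/pages_full.xml.gz
```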
- I've downloaded the 924 MB gzipped XML file and decompressed it to the full 8.7 GB. Now I just have to figure out how to parse it. In my Java class, they only taught us one way of accessing XML files, which loads the entire document into memory at once. I don't think that's going to work well here.... --MarkGyver 02:25, March 11, 2010 (UTC)
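For what it's worth, Java ships a streaming alternative to the load-everything approach: the StAX API (javax.xml.stream) walks the file one event at a time, so memory use stays constant regardless of file size. A rough sketch, run against a tiny inline sample rather than the real 8.7 GB dump (the element names mirror MediaWiki export format):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamParseDemo {
    // Pull the text of every <title> element out of an export-style
    // XML stream without ever building a full DOM tree.
    static List<String> titles(String xml) throws Exception {
        List<String> found = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        boolean inTitle = false;
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT
                    && r.getLocalName().equals("title")) {
                inTitle = true;
            } else if (ev == XMLStreamConstants.CHARACTERS && inTitle) {
                found.add(r.getText());
            } else if (ev == XMLStreamConstants.END_ELEMENT
                    && r.getLocalName().equals("title")) {
                inTitle = false;
            }
        }
        r.close();
        return found;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<mediawiki><page><title>Exchange:Bucket</title>"
                + "<revision><text>100</text></revision></page></mediawiki>";
        System.out.println(titles(sample)); // [Exchange:Bucket]
    }
}
```

For the real dump, you'd hand createXMLStreamReader a FileReader (or an InputStream through GZIPInputStream) instead of a StringReader; the parsing loop doesn't change.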
Instead of downloading the entire database and trying to parse a really large file, it would be easier to work with smaller files. Download the Exchange namespace XML in batches using optional parameters such as pages, limit, and offset. See Manual:Parameters to Special:Export for more information. If you want, leave a message on my talk page and we can discuss this further. I use Special:Export all the time for AzBot to update the Category:GEMH images. 02:55, March 11, 2010 (UTC)
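As a sketch of what batching could look like: the request URL is just the export page plus those parameters, where offset is the revision timestamp to continue from. The base URL here is an assumption (use whatever this wiki's Special:Export address actually is), and batchUrl is a hypothetical helper, not part of any library:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ExportUrlDemo {
    // Assumed base URL; substitute the wiki's real Special:Export address.
    static final String BASE = "https://example.org/wiki/Special:Export";

    // Build one batched request using the pages/history/limit/offset
    // parameters described in Manual:Parameters to Special:Export.
    static String batchUrl(String page, int limit, String offset) {
        return BASE
                + "?pages=" + URLEncoder.encode(page, StandardCharsets.UTF_8)
                + "&history=1&limit=" + limit
                + "&offset=" + URLEncoder.encode(offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Fetch up to 100 revisions of one Exchange page, then advance the
        // offset to the last timestamp received and request the next batch.
        System.out.println(batchUrl("Exchange:Bucket", 100, "2010-01-01T00:00:00Z"));
    }
}
```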
- From the manual, it seems that the most revisions I could get at once is 100, while I want all of them. The page says:
- The maximum number of revisions to return. If you request more than a site-specific maximum (100 on Wikipedia at present), it will be reduced to this number.
- This limit is cumulative across all the pages specified in the pages parameter. For example, if you request a limit of 100, for two pages with 70 revisions each, you will get 70 from one and 30 from the other.
- Actually, no. I was able to request all 3100+ Exchange pages in one go. You just have to know how to do it. 10:55, March 13, 2010 (UTC)
- "(100 on Wikipedia at present)". Hello71 14:58, March 13, 2010 (UTC)
- Perhaps I should look more into using the JWBF library to get the data more easily. Either way, I think I can learn an API for parsing an XML file in Java without loading the entire file into memory at once, and I already have the file, so I'll just work with that for now. I'll keep the data-importing parts of the program isolated from the rest, though, so I'll always be able to swap them out later if needed. It'll also be good practice to learn another XML API without a teacher to help me. --MarkGyver 06:31, March 11, 2010 (UTC)
Closed - Download obtained.--Degenret01 23:04, March 24, 2010 (UTC)