09-01-2014, 01:32 PM
MODS, I don't know if this should be in Ripping Projects or Requests or what? But uhm this isn't actually any contribution yet so if this needs to be moved, that's okay.
Puggsoy wrote in my previous thread: "How do you go about figuring this stuff out?"
to which I promised to have more transparency in my future rips. As in, make a case study on how I rip. So, here's part one.
Anyway, Trip world (J). What a sweet game. Guaranteed Sunsoft quality.
Shame that if you open the ROM in Tile Molester you only see compressed graphics.
Today's practice: crack open the compression format.
DISCLAIMER: I've never looked into the game's innards except regarding some old level format study, so I'm practically going into this blindly a I type. WITH YOU. I apologise for badly worded text.
First off, I found that there are some uncompressed graphics in the ROM, namely, the protagonist sprites – nothing else. So it's safe to assume that the hedgehog enemy sprites are compressed (why? because they aren't uncompressed.)
So I played the game and found a hedgehog and made a save state.
Great. Now that I opened the save state in Tile Molester, I found the hedgehog's graphics on the first gfx page (0x8040):
so, woo, mission accomplished? Naw, man... Now we'll scour the ROM for anything that looks like that. Apparently, at 0x22E54, there's a graphics page that looks... awfully similar
Don't see the resemblance? Here's a gif (emphasis on the hedgehog)
So, now we know while the ROM contains that seemingly garbled data and the save state contains those pretty clean sprites of the hedgehog, there's some magic inbetween that the game does.
Our job is to figure out how those mangled graphics get transformed into the prettiness that is in the save state.
So, let's compare the bytes. I opened both files from the offset right after where the hedgehog data ends: the save state at 0x9040 and the ROM at 0x23E54. I read around $100 bytes backwards from those points and laid them on top of each other. Find differences and explain them, I thought, while indenting the bytes with spaces so the same bytes are parallel and conflicting bytes isolated. Warning: pretty wide image.. You probably have to open it in a new tab to see it completely.
So, at a quick glance looking at the bytes, it seems that the compressed (top) data is actually prettttty much the same as the decompressed data. In fact, it seems that there are extra bytes on the compressed data with all those $FFs and $FEs thrown in there every now and then.
The major differences hit me right in the eye – I've color-coded some parts for easier analysis. Green is 8-byte-match, purple is 7-byte-match and red is mismatch zone:
Okay.. How the hell does 82 00 E0 in ROM get decompressed into FE 00 00 00 00 00 00 E0 E0? What's the purpose of the $FEs and $FFs?
Well, after some intense staring, apparently, in the ROM:
• There's a 7-byte-match always before a mismatch zone.
• There's a $FF always before an 8-byte-match.
• There's a $FE always before a 7-byte-match followed immediately by a mismatch zone.
• There are never two 7-byte-matches successively.
• There are two kinds of mismatch zones. Some are three bytes and some are four bytes, but in the end they get decompressed to 10 bytes.
———Intermission———
So, like, what is the point of compression anyway? Minimizing redundant data. Instead of na na na na na na na na na na na na na na na na BATMAAAAAAAN it's much more efficient to say [na(16)] BATM[A(7)]N
Since we've only got four colors to work with, can you tell which is the most redundant color just by looking at the graphics pages of the state and the ROM? What would benefit the most from compression? I'll let you think about that.
okay yeah it's black. or transparent, in this case. which is 00. In fact, searching for a string of 00 00 brings.. ZERO RESULTS in the ROM portion. It's very common in the state portion, which is indeed the decompressed data. Write that down.
———End intermission———
• Mismatch zones always begin with the last byte of the previous chunk.
• Mismatch zones always end in either $BF, $C0, $E0.
• Three-byte mismatch zones begin with $80 and four-byte mismatch zones begin with $82.
• The mismatch zones that end in $BF all have the last byte copied from the next chunk, then seven bytes printed as-is.
I'm not sure, but looking at these observations, I think this is how they work:
(At this point I also realised that since there's always a 7-byte match after the special byte $BF, it's safe to assume that $BF is actually an instruction. Never mind it being grouped with the mismatch bytes, should've put it after them...)
Anyway, using this logical set of information we can make some rules for a protoype decompressor:
1. Check instruction.
2. If it's $FF, print the next eight bytes as-is and read next byte as an instruction.
3. If it's $FE, print the next seven bytes as-is, print the seventh byte again and read the eighth and ninth bytes as XY ZZ; print ZZ exactly X times, and if Y is a nonzero value, overwrite the last Y bytes with the byte after ZZ. If Y is zero, never mind that shit. After ZZ (or the byte after ZZ in a case of nonzero Y) read an instruction.
4. If it's $BF, print the next seven bytes as-is, but the first byte twice. Then read the next byte as an instruction.
Now, obviously there has to be more instructions, as these rules alone don't make a too efficient compression scheme. But now that we have a basic grasp of the common methods (instruction, bytes, how the RLEish bytes work), we can continue with relative ease.
So, to see if our conclusions were right, I backtracked the bytes as far as I could until I saw unknown instructions, copied the bytes from 0x23C73 to the end of the graphics page, 0x23E54, and used our prototype decompressor.
Here's the graphics in the ROM at 0x23C73:
Here it is after our decompression:
So far so good! The output was 100% accurate.
Let's go even further back...
Reading backwards from 0x23BD1, I think that's the squirrel thing's sprite set. Again, I laid the bytes parallel to each other, the ROM being above and the state being below, isolating the conflict zones...... And I noticed a pattern. Can you see it?
well, here's the bytes again, color-coded.
As you can see, I was wrong about the "special bytes" and the $FEs function earlier.
I also noticed that the output is always in packs of eight bytes! This helped with the indenting of the state bytes tremendously.
But anyway, we've found some new instructions. If you don't feel like checking the wide images out, I'll list the new findings here:
Instructions, revision 2:
$FF: Same old, same old. Read the following eight bytes and print them all out as-is.
$FE: Read the following seven bytes, but print the last byte twice.
$FA: Read the following six bytes, but print the last and second to last bytes twice.
$F8: Read the following five bytes, but print the last byte four times.
$EF: Read the following seven bytes, but print the third byte twice.
$E8: Read the following four bytes, but print the third byte twice and the last one four times
$D8: Read the following four bytes, but print the second byte twice and the last byte four times
$BF: Read the following seven bytes, but print the first byte twice.
$AF: Read the following six bytes, but print the first and the second byte twice.
$80: Print the following byte eight times.
$82: Read the following two bytes, print the first byte six times and the second byte twice.
$88: Read the following two bytes and print both four times.
$8A: Read the following three bytes, print the first byte four times and the second and third twice.
$8B: Read the following four bytes, print the first byte four times, the second one twice and the rest as-is.
Note: I can't program, but I can do all of the tricks above with just key presses, so I assigned a textmate macro to fourteen keys.
Right. So, at this point, you should have gotten a pretty good grasp on what we're doing here; hunting for unknown instructions in the ROM, figuring out how they convert the bytes to those found in the state, and documenting them.
So what's missing? I don't know, but let's try decompressing some more graphics with out new prototype decompressor that operates by these new rules. So, let's take a snippet all the way from 0x23A196 and run it through the decompressor.
Flawless. But I must admit that I ran into an extra $00 on the way, just before the hedgehog's tiles. I deleted it, and I'm led to believe that instruction $00 alone means "these particular graphics end here."
So, we've made great progress. I used the textmate macro method and tried to decompress some other sprites and every time I stumbled into an unknown instruction, I just skipped it and analyzed it afterwards.
With enough save states and comparisons, I found so many new instructions that I think I could decompress most of the game's graphics now.
Here's a list:
$81: Read the following two bytes, print the first byte seven times and the second byte as-is.
$83: Read the following three bytes, print the first byte six times and the rest as-is.
$87: Read the following four bytes, print the first byte five times and the rest as-is.
$8F: Read the following five bytes, print the first byte four times and the rest as-is.
$97: Read the following five bytes, print the first byte three times, the second byte twice and the rest as-is.
$9F: Read the following six bytes, print the first byte three times and the rest as-is.
$A0: Read the following two bytes, print the first byte twice and the second byte six times.
$A2: Read the following three bytes, print the first byte twice, the second byte four times and the third byte twice.
$A3: Read the following four bytes, print the first byte twice, second byte four times and the rest as-is.
$A8: Read the following three bytes, print the first two bytes twice and the third byte four times.
$B7: Read the following six bytes, print the first byte twice, second byte as-is, third byte twice and the rest as-is.
$B Read the following six bytes, print the first byte twice, the second, third and fourth bytes as-is, the fifth byte twice and the last byte once.
$DA: Read the following five bytes, print the first byte as-is, the second byte twice, third byte once, fourth and fifth bytes twice.
$DE: Read the following six bytes, print the first byte as-is, second byte twice, third, fourth and fifth bytes as-is and finally the last one twice.
$DF: Read the following six bytes, print the first byte as-is, second byte twice and the rest as-is.
$E0: Read the following three bytes, print the first two bytes as-is and the third byte six times.
$EA: Read the following five bytes, print the first two bytes as-is, and the rest twice.
$F2: Read the following five bytes, print the first three bytes as-is, the fourth byte three times and the last one twice.
$F6: Read the following six bytes, print the first three bytes as-is, the fourth byte twice, fifth as-is and finally the last one twice.
$FC: Read the following six bytes, print the first five bytes as-is, and the sixth byte three times.
$F Read the following seven bytes, print the first five bytes as-is, the sixth byte twice and the seventh one as-is.
Also, turns out this compression is actually pretty weak. But what do I know.
So. I think I have enough data. I think I can do this. I'll dive into the ROM and try to decompress everything, starting from the beginning of the first graphics page at 0x8018 (it looks like some tail or something). I'll stop when I read an instruction $00. LET'S GO
here we go
aaaand DONE
as you can see, the output is flawless. With some trial and error, staring at bytes and careful analysis of input and output, we were able to find out what happens in between and created a pseudo-decompressor.
Alas, having to physically push a key every time I see a new instruction is a pain in the ass, I propose that someone who's great at programming makes a decompressor. Doesn't need to be fancy, just needs a ROM, beginning-offset field and a big GO button that starts the engines. I'd also go with a "instruction $00 found, check next byte?" -dialog when it thinks it's done.
And uh, these aren't probably all of the instructions so maybe a "unknown instruction occurred, check log.txt" -dialog would be wise too.
Puggsoy wrote in my previous thread: "How do you go about figuring this stuff out?"
to which I promised to have more transparency in my future rips. As in, make a case study on how I rip. So, here's part one.
Anyway, Trip world (J). What a sweet game. Guaranteed Sunsoft quality.
Shame that if you open the ROM in Tile Molester you only see compressed graphics.
Today's practice: crack open the compression format.
DISCLAIMER: I've never looked into the game's innards except regarding some old level format study, so I'm practically going into this blindly a I type. WITH YOU. I apologise for badly worded text.
First off, I found that there are some uncompressed graphics in the ROM, namely, the protagonist sprites – nothing else. So it's safe to assume that the hedgehog enemy sprites are compressed (why? because they aren't uncompressed.)
So I played the game and found a hedgehog and made a save state.
Great. Now that I opened the save state in Tile Molester, I found the hedgehog's graphics on the first gfx page (0x8040):
so, woo, mission accomplished? Naw, man... Now we'll scour the ROM for anything that looks like that. Apparently, at 0x22E54, there's a graphics page that looks... awfully similar
Don't see the resemblance? Here's a gif (emphasis on the hedgehog)
So, now we know while the ROM contains that seemingly garbled data and the save state contains those pretty clean sprites of the hedgehog, there's some magic inbetween that the game does.
Our job is to figure out how those mangled graphics get transformed into the prettiness that is in the save state.
So, let's compare the bytes. I opened both files from the offset right after where the hedgehog data ends: the save state at 0x9040 and the ROM at 0x23E54. I read around $100 bytes backwards from those points and laid them on top of each other. Find differences and explain them, I thought, while indenting the bytes with spaces so the same bytes are parallel and conflicting bytes isolated. Warning: pretty wide image.. You probably have to open it in a new tab to see it completely.
So, at a quick glance looking at the bytes, it seems that the compressed (top) data is actually prettttty much the same as the decompressed data. In fact, it seems that there are extra bytes on the compressed data with all those $FFs and $FEs thrown in there every now and then.
The major differences hit me right in the eye – I've color-coded some parts for easier analysis. Green is 8-byte-match, purple is 7-byte-match and red is mismatch zone:
Okay.. How the hell does 82 00 E0 in ROM get decompressed into FE 00 00 00 00 00 00 E0 E0? What's the purpose of the $FEs and $FFs?
Well, after some intense staring, apparently, in the ROM:
• There's a 7-byte-match always before a mismatch zone.
• There's a $FF always before an 8-byte-match.
• There's a $FE always before a 7-byte-match followed immediately by a mismatch zone.
• There are never two 7-byte-matches successively.
• There are two kinds of mismatch zones. Some are three bytes and some are four bytes, but in the end they get decompressed to 10 bytes.
———Intermission———
So, like, what is the point of compression anyway? Minimizing redundant data. Instead of na na na na na na na na na na na na na na na na BATMAAAAAAAN it's much more efficient to say [na(16)] BATM[A(7)]N
Since we've only got four colors to work with, can you tell which is the most redundant color just by looking at the graphics pages of the state and the ROM? What would benefit the most from compression? I'll let you think about that.
okay yeah it's black. or transparent, in this case. which is 00. In fact, searching for a string of 00 00 brings.. ZERO RESULTS in the ROM portion. It's very common in the state portion, which is indeed the decompressed data. Write that down.
———End intermission———
• Mismatch zones always begin with the last byte of the previous chunk.
• Mismatch zones always end in either $BF, $C0, $E0.
• Three-byte mismatch zones begin with $80 and four-byte mismatch zones begin with $82.
• The mismatch zones that end in $BF all have the last byte copied from the next chunk, then seven bytes printed as-is.
I'm not sure, but looking at these observations, I think this is how they work:
(At this point I also realised that since there's always a 7-byte match after the special byte $BF, it's safe to assume that $BF is actually an instruction. Never mind it being grouped with the mismatch bytes, should've put it after them...)
Anyway, using this logical set of information we can make some rules for a protoype decompressor:
1. Check instruction.
2. If it's $FF, print the next eight bytes as-is and read next byte as an instruction.
3. If it's $FE, print the next seven bytes as-is, print the seventh byte again and read the eighth and ninth bytes as XY ZZ; print ZZ exactly X times, and if Y is a nonzero value, overwrite the last Y bytes with the byte after ZZ. If Y is zero, never mind that shit. After ZZ (or the byte after ZZ in a case of nonzero Y) read an instruction.
4. If it's $BF, print the next seven bytes as-is, but the first byte twice. Then read the next byte as an instruction.
Now, obviously there has to be more instructions, as these rules alone don't make a too efficient compression scheme. But now that we have a basic grasp of the common methods (instruction, bytes, how the RLEish bytes work), we can continue with relative ease.
So, to see if our conclusions were right, I backtracked the bytes as far as I could until I saw unknown instructions, copied the bytes from 0x23C73 to the end of the graphics page, 0x23E54, and used our prototype decompressor.
Here's the graphics in the ROM at 0x23C73:
Here it is after our decompression:
So far so good! The output was 100% accurate.
Let's go even further back...
Reading backwards from 0x23BD1, I think that's the squirrel thing's sprite set. Again, I laid the bytes parallel to each other, the ROM being above and the state being below, isolating the conflict zones...... And I noticed a pattern. Can you see it?
well, here's the bytes again, color-coded.
As you can see, I was wrong about the "special bytes" and the $FEs function earlier.
I also noticed that the output is always in packs of eight bytes! This helped with the indenting of the state bytes tremendously.
But anyway, we've found some new instructions. If you don't feel like checking the wide images out, I'll list the new findings here:
Instructions, revision 2:
$FF: Same old, same old. Read the following eight bytes and print them all out as-is.
$FE: Read the following seven bytes, but print the last byte twice.
$FA: Read the following six bytes, but print the last and second to last bytes twice.
$F8: Read the following five bytes, but print the last byte four times.
$EF: Read the following seven bytes, but print the third byte twice.
$E8: Read the following four bytes, but print the third byte twice and the last one four times
$D8: Read the following four bytes, but print the second byte twice and the last byte four times
$BF: Read the following seven bytes, but print the first byte twice.
$AF: Read the following six bytes, but print the first and the second byte twice.
$80: Print the following byte eight times.
$82: Read the following two bytes, print the first byte six times and the second byte twice.
$88: Read the following two bytes and print both four times.
$8A: Read the following three bytes, print the first byte four times and the second and third twice.
$8B: Read the following four bytes, print the first byte four times, the second one twice and the rest as-is.
Note: I can't program, but I can do all of the tricks above with just key presses, so I assigned a textmate macro to fourteen keys.
Right. So, at this point, you should have gotten a pretty good grasp on what we're doing here; hunting for unknown instructions in the ROM, figuring out how they convert the bytes to those found in the state, and documenting them.
So what's missing? I don't know, but let's try decompressing some more graphics with out new prototype decompressor that operates by these new rules. So, let's take a snippet all the way from 0x23A196 and run it through the decompressor.
Flawless. But I must admit that I ran into an extra $00 on the way, just before the hedgehog's tiles. I deleted it, and I'm led to believe that instruction $00 alone means "these particular graphics end here."
So, we've made great progress. I used the textmate macro method and tried to decompress some other sprites and every time I stumbled into an unknown instruction, I just skipped it and analyzed it afterwards.
With enough save states and comparisons, I found so many new instructions that I think I could decompress most of the game's graphics now.
Here's a list:
$81: Read the following two bytes, print the first byte seven times and the second byte as-is.
$83: Read the following three bytes, print the first byte six times and the rest as-is.
$87: Read the following four bytes, print the first byte five times and the rest as-is.
$8F: Read the following five bytes, print the first byte four times and the rest as-is.
$97: Read the following five bytes, print the first byte three times, the second byte twice and the rest as-is.
$9F: Read the following six bytes, print the first byte three times and the rest as-is.
$A0: Read the following two bytes, print the first byte twice and the second byte six times.
$A2: Read the following three bytes, print the first byte twice, the second byte four times and the third byte twice.
$A3: Read the following four bytes, print the first byte twice, second byte four times and the rest as-is.
$A8: Read the following three bytes, print the first two bytes twice and the third byte four times.
$B7: Read the following six bytes, print the first byte twice, second byte as-is, third byte twice and the rest as-is.
$B Read the following six bytes, print the first byte twice, the second, third and fourth bytes as-is, the fifth byte twice and the last byte once.
$DA: Read the following five bytes, print the first byte as-is, the second byte twice, third byte once, fourth and fifth bytes twice.
$DE: Read the following six bytes, print the first byte as-is, second byte twice, third, fourth and fifth bytes as-is and finally the last one twice.
$DF: Read the following six bytes, print the first byte as-is, second byte twice and the rest as-is.
$E0: Read the following three bytes, print the first two bytes as-is and the third byte six times.
$EA: Read the following five bytes, print the first two bytes as-is, and the rest twice.
$F2: Read the following five bytes, print the first three bytes as-is, the fourth byte three times and the last one twice.
$F6: Read the following six bytes, print the first three bytes as-is, the fourth byte twice, fifth as-is and finally the last one twice.
$FC: Read the following six bytes, print the first five bytes as-is, and the sixth byte three times.
$F Read the following seven bytes, print the first five bytes as-is, the sixth byte twice and the seventh one as-is.
Also, turns out this compression is actually pretty weak. But what do I know.
So. I think I have enough data. I think I can do this. I'll dive into the ROM and try to decompress everything, starting from the beginning of the first graphics page at 0x8018 (it looks like some tail or something). I'll stop when I read an instruction $00. LET'S GO
here we go
aaaand DONE
as you can see, the output is flawless. With some trial and error, staring at bytes and careful analysis of input and output, we were able to find out what happens in between and created a pseudo-decompressor.
Alas, having to physically push a key every time I see a new instruction is a pain in the ass, I propose that someone who's great at programming makes a decompressor. Doesn't need to be fancy, just needs a ROM, beginning-offset field and a big GO button that starts the engines. I'd also go with a "instruction $00 found, check next byte?" -dialog when it thinks it's done.
And uh, these aren't probably all of the instructions so maybe a "unknown instruction occurred, check log.txt" -dialog would be wise too.