Difference between revisions of "Utility:Accent Removal"

Revision as of 01:15, 10 April 2009

Overview

Replacing accented letters with unaccented ones in the raws fixes this problem.

You can remove the accents from the language files yourself; this works even on existing worlds.

Since the structure of language files might change, it is safest if you remove the accents from the files yourself. Here are two methods to do just that. The first (Jackard's) only works on Windows, but is probably the easiest for novice users. The second (frobnic8's) will work anywhere Python does (i.e. just about anywhere), but requires using the command line a little.

Jackard's InfoRapid Script

Download InfoRapid Search and Replace.

Save the list below to a text file.

Find the following files in DF\raw\objects:

language_DWARF.txt
language_ELF.txt
language_GOBLIN.txt
language_HUMAN.txt

Select them all, right-click and choose 'Search with InfoRapid' from the menu.

Click the Replace tab that shows up in the lower half of the window.

Select your text file from before in the Replace With field, make sure Replace is set to 'Whole Search Expression' and click Start.

A prompt will appear asking for confirmation. Check the Replace All button and click Yes. When the program stops running you are done.

<Command>
	<Search>„</Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search> </Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search>ƒ</Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search>†</Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search>…</Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search>‡</Search>
	<Replace>c</Replace>
</Command>
<Command>
	<Search>‰</Search>
	<Replace>e</Replace>
</Command>
<Command>
	<Search>‚</Search>
	<Replace>e</Replace>
</Command>
<Command>
	<Search>Š</Search>
	<Replace>e</Replace>
</Command>
<Command>
	<Search>ˆ</Search>
	<Replace>e</Replace>
</Command>
<Command>
	<Search>‹</Search>
	<Replace>i</Replace>
</Command>
<Command>
	<Search></Search>
	<Replace>i</Replace>
</Command>
<Command>
	<CaseSensitive>Yes</CaseSensitive>
	<Search>¡</Search>
	<Replace>i</Replace>
</Command>
<Command>
	<Search>Œ</Search>
	<Replace>i</Replace>
</Command>
<Command>
	<Search>¤</Search>
	<Replace>n</Replace>
</Command>
<Command>
	<Search>•</Search>
	<Replace>o</Replace>
</Command>
<Command>
	<Search>”</Search>
	<Replace>o</Replace>
</Command>
<Command>
	<Search>“</Search>
	<Replace>o</Replace>
</Command>
<Command>
	<Search>¢</Search>
	<Replace>o</Replace>
</Command>
<Command>
	<Search>—</Search>
	<Replace>u</Replace>
</Command>
<Command>
	<Search>–</Search>
	<Replace>u</Replace>
</Command>
<Command>
	<Search>£</Search>
	<Replace>u</Replace>
</Command>
<Command>
	<Search>˜</Search>
	<Replace>y</Replace>
</Command>
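
The search characters in this list appear to be the accented letters of the raws as a Windows-1252 editor displays them. For example, `ä` is byte 0x84 in code page 437, and Windows-1252 shows that byte as `„`. Under that assumption, which is inferred from the list and not stated by Jackard, the same replacements can be sketched in Python. The name `strip_accents_cp437` is hypothetical, and the entry with an empty search field is assumed to be `ì` (byte 0x8D), which has no Windows-1252 glyph.

```python
# Sketch only: assumes the language files are encoded in code page 437.
# The mapping mirrors the 23 search/replace pairs in the list above.
ACCENT_MAP = {
    "ä": "a", "á": "a", "â": "a", "å": "a", "à": "a",
    "ç": "c",
    "ë": "e", "é": "e", "è": "e", "ê": "e",
    "ï": "i", "ì": "i", "í": "i", "î": "i",
    "ñ": "n",
    "ò": "o", "ö": "o", "ô": "o", "ó": "o",
    "ù": "u", "û": "u", "ú": "u",
    "ÿ": "y",
}

# str.translate needs a table keyed by code point; build it once.
_TABLE = str.maketrans(ACCENT_MAP)


def strip_accents_cp437(data: bytes) -> bytes:
    """Replace CP437 accented lowercase letters in data with ASCII letters."""
    text = data.decode("cp437")        # every byte value is valid in CP437
    return text.translate(_TABLE).encode("cp437")


if __name__ == "__main__":
    # 0x96 is 'û' in code page 437; the word itself is made up.
    print(strip_accents_cp437(b"k\x96buk"))
```

Working on bytes and decoding explicitly avoids depending on the platform's default encoding, which is where the odd-looking characters come from in the first place.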

frobnic8's Modified Python Script

If you have the programming language Python installed on your machine (or don't mind installing it) and aren't scared of a command prompt, here is an alternate method. Python comes pre-installed on Mac OS X and almost all distributions of Linux. (If you are using Windows, the command line instructions shown will need to be modified slightly.)

Ensure you have Python installed.

Save this modified version of "The Unicode Hammer" as 'unicode_hammer.py' in the 'raw/objects' subdirectory of your Dwarf Fortress directory. (The Unicode Hammer: Is that a name worthy of Dwarf Fortress, or what?)

#!/usr/bin/env python
"""
latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"

This takes a UNICODE string and replaces Latin-1 characters with
something equivalent in 7-bit ASCII. This returns a plain ASCII string. 
This function makes a best effort to convert Latin-1 characters into 
ASCII equivalents. It does not just strip out the Latin1 characters.
All characters in the standard 7-bit ASCII range are preserved. 
In the 8th bit range all the Latin-1 accented letters are converted to 
unaccented equivalents. Most symbol characters are converted to 
something meaningful. Anything not converted is deleted.

Background:

One of my clients gets address data from Europe, but most of their systems 
cannot handle Latin-1 characters. With all due respect to the umlaut,
scharfes s, cedilla, and all the other fine accented characters of Europe, 
all I needed to do was to prepare addresses for a shipping system.
After getting headaches trying to deal with this problem using Python's 
built-in UNICODE support I gave up and decided to use some brute force.
This function converts all accented letters to their unaccented equivalents. 
I realize this is dirty, but for my purposes the mail gets delivered.

Noah Spurrier noah at noah.org
License free and public domain
"""

"""This version has had its translation table abused to produce
better results for the language files of the game Dwarf Fortress by
frobnic8. The original translation table is commented out.
"""

def latin1_to_ascii (unicrap):
    """This takes a UNICODE string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. It returns a plain ASCII string.
    This function makes a best effort to convert Latin-1 characters into
    ASCII equivalents. It does not just strip out the Latin-1 characters.
    All characters in the standard 7-bit ASCII range are preserved.
    In the 8th bit range all the Latin-1 accented letters are converted
    to unaccented equivalents. Most symbol characters are converted to
    something meaningful. Anything not converted is deleted.
    """
    xlate={0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'aa', 0xa2:'cz', 0xa3:'ii', 0xa4:'tz',
        0xa5:'yy', 0xa6:'|', 0xa7:'zz', 0xa8:'"',
        0xa9:'CC', 0xaa:'aa', 0xab:'<<', 0xac:'not',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'o',
        0xb1:'+/-', 0xb2:'^2', 0xb3:'^3', 0xb4:"'",
        0xb5:'uu', 0xb6:'PP', 0xb7:'*', 0xb8:',,',
        0xb9:'^1', 0xba:'^o', 0xbb:'>>',
        0xbc:'1/4', 0xbd:'1/2', 0xbe:'3/4', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
    }
    """ Originals below; the table above is hacked for Dwarf Fortress languages.
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
    }
    """
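    r = ''
    for i in unicrap:
        code = ord(i)
        if code in xlate:
            # Known Latin-1 character: substitute its ASCII stand-in.
            r += xlate[code]
        elif code >= 0x80:
            # Unmapped 8-bit character: drop it.
            pass
        else:
            # Plain 7-bit ASCII passes through unchanged.
            r += i
    return r

# The command-line part below requires Python 3.
if __name__ == '__main__':
    import sys
    if len(sys.argv) == 1 or (len(sys.argv) == 2 and
       sys.argv[1] in ('-h', '-H', '-?', '--help', '/?', '/H', '/h')):
        print('unicode_hammer.py [infile [outfile]]\n')
        # Demo input: every character from 0x20 to 0xff except DEL.
        s = ''.join(chr(c) for c in range(32, 256) if c != 0x7f)
        plain_ascii = latin1_to_ascii(s)

        print('INPUT type:', type(s))
        print('INPUT:')
        print(ascii(s))  # escaped so that any console can display it
        print()
        print('OUTPUT type:', type(plain_ascii))
        print('OUTPUT:')
        print(plain_ascii)
        sys.exit()

    # Read as Latin-1 so that every byte maps to exactly one character.
    # newline='' keeps the file's original line endings intact.
    with open(sys.argv[1], encoding='latin-1', newline='') as infile:
        text = infile.read()
    if len(sys.argv) > 2:
        with open(sys.argv[2], 'w', encoding='ascii', newline='') as outfile:
            outfile.write(latin1_to_ascii(text))
    else:
        sys.stdout.write(latin1_to_ascii(text))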

Open a command prompt and change directory to your 'raw/objects' directory.

Rename the four language files, adding '.orig' to the end of their names:

mv language_DWARF.txt language_DWARF.txt.orig
mv language_ELF.txt language_ELF.txt.orig
mv language_GOBLIN.txt language_GOBLIN.txt.orig
mv language_HUMAN.txt language_HUMAN.txt.orig

Apply the hammer to each of the four language files as follows:

python unicode_hammer.py language_DWARF.txt.orig language_DWARF.txt
python unicode_hammer.py language_ELF.txt.orig language_ELF.txt
python unicode_hammer.py language_GOBLIN.txt.orig language_GOBLIN.txt
python unicode_hammer.py language_HUMAN.txt.orig language_HUMAN.txt
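
The rename step above uses `mv`, which the Windows command prompt does not provide. As one possible alternative, the renaming could be done from Python on any platform. This is a minimal sketch: `backup_language_files` and `LANGUAGE_FILES` are hypothetical names, and it assumes all four files sit in the directory passed in.

```python
import os

# The four language files named in the steps above.
LANGUAGE_FILES = (
    "language_DWARF.txt",
    "language_ELF.txt",
    "language_GOBLIN.txt",
    "language_HUMAN.txt",
)


def backup_language_files(objects_dir="."):
    """Rename each language file to <name>.orig.

    Returns a list of (backup_path, target_path) pairs, one per file.
    """
    pairs = []
    for name in LANGUAGE_FILES:
        target = os.path.join(objects_dir, name)
        backup = target + ".orig"
        if os.path.exists(backup):
            # Never overwrite an earlier backup of the unmodified file.
            raise FileExistsError(backup)
        os.rename(target, backup)
        pairs.append((backup, target))
    return pairs
```

Each returned pair corresponds to one hammer invocation: the first element is the input file and the second is the output file.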

Enjoy!
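
To double-check the result of either method, a small sketch like the following could list any bytes outside 7-bit ASCII that remain in a file. `find_non_ascii` is a hypothetical helper and is not part of either script.

```python
def find_non_ascii(path):
    """Return (line_number, byte_value) for every byte >= 0x80 in the file."""
    hits = []
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            for byte in line:  # iterating over bytes yields integers
                if byte >= 0x80:
                    hits.append((line_number, byte))
    return hits
```

An empty list means the file contains only plain ASCII. Any entries point to the line and byte value of a character that was not replaced.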
