Utility:Accent Removal

From Dwarf Fortress Wiki
Revision as of 22:00, 16 May 2012 by 96.231.221.35 (talk) (Updated Unicode Hammer to cover full range of extended ASCII with more explicit mappings.)

Overview

Some tile sets use the accented characters for additional graphical symbols. This can make racial language text difficult to read. Replacing the accented letters with plain ones in the raws fixes this problem: you can remove the accented characters and symbols from the data files, and this works on existing worlds and saved games.

Since the structure of language files might change, it is safest to remove the problem characters from the files yourself. The methods below do just that. The first (Jackard's) only works on Windows, but is probably the easiest for novice users. The second (frobnic8's) will work anywhere Python does (i.e. just about anywhere), but requires using the command line a little.
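
Before editing anything, it can help to see which accented characters a language file actually contains. The following is a minimal Python 3 sketch and not part of any method on this page. The helper name count_high_bytes is made up for illustration, and it assumes the raws use a single-byte encoding, so that every accented character is one byte with a value of 0x80 or above.

```python
from collections import Counter

def count_high_bytes(data):
    """Count each byte value >= 0x80 found in a bytes object."""
    return Counter(b for b in data if b >= 0x80)

# Made-up sample content. For a real file, read it in binary mode instead:
#   with open("language_DWARF.txt", "rb") as fh:
#       data = fh.read()
sample = b"k\x84l\x82 sample\n"
for value, count in sorted(count_high_bytes(sample).items()):
    print(hex(value), count)
```

An empty result means the file is already plain ASCII and needs no conversion.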

Jackard's InfoRapid Script

Download InfoRapid Search and Replace.

Save the list below to a text file.

Find the following files in DF\raw\objects:

  • language_DWARF.txt
  • language_ELF.txt
  • language_GOBLIN.txt
  • language_HUMAN.txt

Select them all, right-click and choose 'Search with InfoRapid' from the menu.

Click the Replace tab that shows up in the lower half of the window.

Select your text file from before in the Replace With field, make sure Replace is set to 'Whole Search Expression' and click Start.

A prompt will appear asking for confirmation. Check the Replace All button and click Yes. When the program stops running you are done.

<Command>
	<Search>„</Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search> </Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search>ƒ</Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search>†</Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search>…</Search>
	<Replace>a</Replace>
</Command>
<Command>
	<Search>‡</Search>
	<Replace>c</Replace>
</Command>
<Command>
	<Search>‰</Search>
	<Replace>e</Replace>
</Command>
<Command>
	<Search>‚</Search>
	<Replace>e</Replace>
</Command>
<Command>
	<Search>Š</Search>
	<Replace>e</Replace>
</Command>
<Command>
	<Search>ˆ</Search>
	<Replace>e</Replace>
</Command>
<Command>
	<Search>‹</Search>
	<Replace>i</Replace>
</Command>
<Command>
	<Search></Search>
	<Replace>i</Replace>
</Command>
<Command>
	<CaseSensitive>Yes</CaseSensitive>
	<Search>¡</Search>
	<Replace>i</Replace>
</Command>
<Command>
	<Search>Œ</Search>
	<Replace>i</Replace>
</Command>
<Command>
	<Search>¤</Search>
	<Replace>n</Replace>
</Command>
<Command>
	<Search>•</Search>
	<Replace>o</Replace>
</Command>
<Command>
	<Search>”</Search>
	<Replace>o</Replace>
</Command>
<Command>
	<Search>“</Search>
	<Replace>o</Replace>
</Command>
<Command>
	<Search>¢</Search>
	<Replace>o</Replace>
</Command>
<Command>
	<Search>—</Search>
	<Replace>u</Replace>
</Command>
<Command>
	<Search>–</Search>
	<Replace>u</Replace>
</Command>
<Command>
	<Search>£</Search>
	<Replace>u</Replace>
</Command>
<Command>
	<Search>˜</Search>
	<Replace>y</Replace>
</Command>
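
The odd-looking search strings in this list are the accented letters themselves, as a Windows text editor displays their bytes. The Linux command further down this page treats the language files as CP437. Assuming that, and assuming the list was written on a system using Windows-1252, a short Python 3 sketch shows the correspondence. The function name is hypothetical.

```python
def as_shown_in_windows_1252(ch):
    """Return how a CP437-encoded character looks when its byte is read as Windows-1252."""
    # Bytes that Windows-1252 leaves undefined become U+FFFD (the replacement character).
    return ch.encode("cp437").decode("cp1252", errors="replace")

# A few accented letters and the symbols they turn into.
EXAMPLES = {ch: as_shown_in_windows_1252(ch) for ch in "äçéñ"}
```

A byte with no Windows-1252 meaning, such as 0x8D (ì in CP437), comes out as the replacement character in this sketch, which may be why one search entry in the list appears empty.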

frobnic8's Modified Python Script

If you have the programming language Python installed on your machine (or don't mind installing it) and aren't scared of a command prompt, here is an alternate method. Python comes pre-installed on Mac OS X and almost all distributions of Linux. (If you are using Windows, the command line instructions shown will need to be modified slightly.)

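  1. Ensure you have Python installed. (The script below is written for Python 2. If you have Python 3.x installed, you will need to replace the two unicode() calls near the end of the script as described in its comments, change the print statements to functions, and open the input and output files with encoding='latin-1' so that each byte is read as one character.)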
  2. Copy and paste (this modified version of) "The Unicode Hammer" with the name unicode_hammer.py in the raw/objects sub-directory of your Dwarf Fortress directory. (The Unicode Hammer: Is that a name worthy of Dwarf Fortress, or what?)

    #!/usr/bin/env python
    """
    latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"
    
    This takes a UNICODE string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. This returns a plain ASCII string. 
    This function makes a best effort to convert Latin-1 characters into 
    ASCII equivalents. It does not just strip out the Latin1 characters.
    All characters in the standard 7-bit ASCII range are preserved. 
    In the 8th bit range all the Latin-1 accented letters are converted to 
    unaccented equivalents. Most symbol characters are converted to 
    something meaningful. Anything not converted is deleted.
    
    Background:
    
    One of my clients gets address data from Europe, but most of their systems 
    cannot handle Latin-1 characters. With all due respect to the umlaut,
    scharfes s, cedilla, and all the other fine accented characters of Europe, 
    all I needed to do was to prepare addresses for a shipping system.
    After getting headaches trying to deal with this problem using Python's 
    built-in UNICODE support I gave up and decided to use some brute force.
    This function converts all accented letters to their unaccented equivalents. 
    I realize this is dirty, but for my purposes the mail gets delivered.
    
    Noah Spurrier noah at noah.org
    License free and public domain
    """
    
    """This version has had its translation table abused to produce
    better results for the language files of the game Dwarf Fortress by
    frobnic8.
    
    Original here:
    http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/
    """
    
    def latin1_to_ascii (unicrap):
        """This takes a UNICODE string and replaces Latin-1 characters with
        something equivalent in 7-bit ASCII. It returns a plain ASCII string.
        This function makes a best effort to convert Latin-1 characters into
        ASCII equivalents. It does not just strip out the Latin-1 characters.
        All characters in the standard 7-bit ASCII range are preserved.
        In the 8th bit range all the Latin-1 accented letters are converted
        to unaccented equivalents. Most symbol characters are converted to
        something meaningful. Anything not converted is deleted.
        """
        xlate = {
                0x80: 'E',    # Euro sign
                0x81: 'e',    # Blank
                0x82: 'i',    # Single low 9 quote
                0x83: 'f',    # Latin small letter f with hook
                0x84: 'ii',   # Double low 9 quote
                0x85: 'e',    # Horizontal ellipsis
                0x86: 't',    # Dagger
                0x87: 'tt',   # Double dagger
                0x88: 'ea',   # Modified circumflex accent
                0x89: 'oloo', # Per mille sign
                0x8a: 'S',    # Latin capital letter S with caron
                0x8b: '<',    # Single left pointing angle quotation
                0x8c: 'OE',   # Latin capital ligature OE
                0x8d: '-',    # Blank
                0x8e: 'Z',    # Latin capital letter Z with caron
                0x8f: '-',    # Blank
                0x90: '-',    # Blank
                0x91: 'ei',   # Left single quote
                0x92: "ie",   # Right single quote
                0x93: 'ii',   # Left double quote
                0x94: "ee",   # Right double quote
                0x95: 'ao',   # Bullet
                0x96: '-',    # En dash
                0x97: '-',    # Em dash
                0x98: '-',    # Small tilde
                0x99: 'TM',   # Trademark sign
                0x9a: 's',    # Latin small letter s with caron
                0x9b: '>',    # Single right pointing angle quotation
                0x9c: 'oe',   # Latin small ligature oe
                0x9d: '-',    # Blank
                0x9e: 'z',    # Latin small letter z with caron
                0x9f: 'Y',    # Latin capital letter Y with diaeresis
                0xa0: '-',    # Non-breaking space
                0xa1: 'i',    # Inverted exclamation mark
                0xa2: 'c',    # Cent sign
                0xa3: 'E',    # Pound sign
                0xa4: 'o',    # Currency sign
                0xa5: 'Y',    # Yen sign
                0xa6: 'l',    # Pipe, broken vertical bar
                0xa7: 'S',    # Section sign
                0xa8: 'ii',   # Spacing diaeresis
                0xa9: 'c',    # Copyright sign
                0xaa: 'a',    # Feminine ordinal indicator
                0xab: '<<',   # Left double angle quotes
                0xac: 'r',    # Not sign
                0xad: '-',    # Soft hyphen
                0xae: 'R',    # Registered trade mark sign
                0xaf: 'aa',   # Spacing macron
                0xb0: 'o',    # Degree sign
                0xb1: 't',    # Plus or minus sign
                0xb2: '2',    # Superscript 2
                0xb3: '3',    # Superscript 3
                0xb4: "'",    # Acute accent
                0xb5: 'u',    # Micro sign
                0xb6: 'P',    # Pilcrow sign
                0xb7: 'o',    # Middle dot
                0xb8: 'e',    # Cedilla
                0xb9: '1',    # Superscript 1
                0xba: 'o',    # Masculine ordinal indicator
                0xbb: '>>',   # Right double angle quotes
                0xbc: '1/4',  # Fraction one quarter
                0xbd: '1/2',  # Fraction one half
                0xbe: '3/4',  # Fraction three quarters
                0xbf: 'b',    # Inverted question mark
                0xc0: 'A',    # Latin capital letter A with grave
                0xc1: 'A',    # Latin capital letter A with acute
                0xc2: 'A',    # Latin capital letter A with circumflex
                0xc3: 'A',    # Latin capital letter A with tilde
                0xc4: 'A',    # Latin capital letter A with diaeresis
                0xc5: 'A',    # Latin capital letter A with ring above
                0xc6: 'Ae',   # Latin capital letter AE
                0xc7: 'C',    # Latin capital letter C with cedilla
                0xc8: 'E',    # Latin capital letter E with grave
                0xc9: 'E',    # Latin capital letter E with acute
                0xca: 'E',    # Latin capital letter E with circumflex
                0xcb: 'E',    # Latin capital letter E with diaeresis
                0xcc: 'I',    # Latin capital letter I with grave
                0xcd: 'I',    # Latin capital letter I with acute
                0xce: 'I',    # Latin capital letter I with circumflex
                0xcf: 'I',    # Latin capital letter I with diaeresis
                0xd0: 'D',    # Latin capital letter ETH
                0xd1: 'N',    # Latin capital letter N with tilde
                0xd2: 'O',    # Latin capital letter O with grave
                0xd3: 'O',    # Latin capital letter O with acute
                0xd4: 'O',    # Latin capital letter O with circumflex
                0xd5: 'O',    # Latin capital letter O with tilde
                0xd6: 'O',    # Latin capital letter O with diaeresis
                0xd7: 'x',    # Multiplication sign
                0xd8: 'O',    # Latin capital letter O with slash
                0xd9: 'U',    # Latin capital letter U with grave
                0xda: 'U',    # Latin capital letter U with acute
                0xdb: 'U',    # Latin capital letter U with circumflex
                0xdc: 'U',    # Latin capital letter U with diaeresis
                0xdd: 'Y',    # Latin capital letter Y with acute
                0xde: 'P',    # Latin capital letter THORN
                0xdf: 'B',    # Latin small letter sharp s
                0xe0: 'a',    # Latin small letter a with grave
                0xe1: 'a',    # Latin small letter a with acute
                0xe2: 'a',    # Latin small letter a with circumflex
                0xe3: 'a',    # Latin small letter a with tilde
                0xe4: 'a',    # Latin small letter a with diaeresis
                0xe5: 'a',    # Latin small letter a with ring above
                0xe6: 'ae',   # Latin small letter ae
                0xe7: 'c',    # Latin small letter c with cedilla
                0xe8: 'e',    # Latin small letter e with grave
                0xe9: 'e',    # Latin small letter e with acute
                0xea: 'e',    # Latin small letter e with circumflex
                0xeb: 'e',    # Latin small letter e with diaeresis
                0xec: 'i',    # Latin small letter i with grave
                0xed: 'i',    # Latin small letter i with acute
                0xee: 'i',    # Latin small letter i with circumflex
                0xef: 'i',    # Latin small letter i with diaeresis
                0xf0: 'oa',   # Latin small letter eth
                0xf1: 'n',    # Latin small letter n with tilde
                0xf2: 'o',    # Latin small letter o with grave
                0xf3: 'o',    # Latin small letter o with acute
                0xf4: 'o',    # Latin small letter o with circumflex
                0xf5: 'o',    # Latin small letter o with tilde
                0xf6: 'o',    # Latin small letter o with diaeresis
                0xf7: 'l',    # Division sign
                0xf8: 'o',    # Latin small letter o with slash
                0xf9: 'u',    # Latin small letter u with grave
                0xfa: 'u',    # Latin small letter u with acute
                0xfb: 'u',    # Latin small letter u with circumflex
                0xfc: 'u',    # Latin small letter u with diaeresis
                0xfd: 'y',    # Latin small letter y with acute
                0xfe: 'p',    # Latin small letter thorn
                0xff: 'y',    # Latin small letter y with diaeresis
                }
    
    
        r = ''
        for i in unicrap:
            if ord(i) in xlate:
                r += xlate[ord(i)]
            elif ord(i) >= 0x80:
                pass
            else:
                r += str(i)
        return r
    
    if __name__ == '__main__':
        import sys
        input = sys.stdin
        output = sys.stdout
        if len(sys.argv) == 1 or (len(sys.argv) == 2 and \
           sys.argv[1] in ('-h', '-H', '-?', '--help', '/?', '/H', '/h')):
            print 'unicode_hammer.py [infile [outfile]]\n'
            # For Python 3.x, change the following line to s = ''
            s = unicode('','latin-1')
            for c in range(32, 256):
                if c != 0x7f:
                    # For Python 3.x, change the following line to s += str(chr(c))
                    s += unicode(chr(c), 'latin-1')
            # Convert once, after the whole test string has been built.
            plain_ascii = latin1_to_ascii(s)
    
            # For Python 3.x, change all of the following print statements to functions (wrap the entire statement in parentheses)
            print 'INPUT type:', type(s)
            print 'INPUT:'
            print s.encode('latin-1')
            print
            print 'OUTPUT type:', type(plain_ascii)
            print 'OUTPUT:'
            print plain_ascii
            sys.exit()
    
        if len(sys.argv) > 1:
            input = open(sys.argv[1])
        if len(sys.argv) > 2:
            output = open(sys.argv[2], 'w')
        for line in input:
            output.write(latin1_to_ascii(line))
    
    

  3. Open a command prompt and change directory to your raw/objects directory.
  4. Rename the four language files, adding '.orig' to the end of their names:

    mv language_DWARF.txt language_DWARF.txt.orig
    mv language_ELF.txt language_ELF.txt.orig
    mv language_GOBLIN.txt language_GOBLIN.txt.orig
    mv language_HUMAN.txt language_HUMAN.txt.orig
    

  5. Apply the hammer to each of the four language files as follows:

    python unicode_hammer.py language_DWARF.txt.orig language_DWARF.txt
    python unicode_hammer.py language_ELF.txt.orig language_ELF.txt
    python unicode_hammer.py language_GOBLIN.txt.orig language_GOBLIN.txt
    python unicode_hammer.py language_HUMAN.txt.orig language_HUMAN.txt
    

  6. Enjoy!
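
On Python 3 the file handling matters as much as the print statements, because the conversion relies on ord() returning the original byte value. The following is a minimal sketch of one way to arrange that, not the script's own code. The names hammer_file and demo_convert are hypothetical. demo_convert stands in for latin1_to_ascii and copies only two entries from its table.

```python
import os
import tempfile

def demo_convert(text):
    """Hypothetical stand-in for latin1_to_ascii with a two-entry table."""
    table = {0xe9: 'e', 0xf1: 'n'}  # copied from the xlate table in the script
    # Mapped characters are replaced, other high characters are dropped, ASCII is kept.
    return "".join(table.get(ord(ch), ch if ord(ch) < 0x80 else "") for ch in text)

def hammer_file(src_path, dst_path, convert):
    """Apply convert() to every line of src_path and write the result to dst_path."""
    # latin-1 maps each byte 0x00-0xFF to the code point with the same number,
    # so ord(ch) inside convert() sees the original byte value.
    # newline="" leaves the existing line endings untouched.
    with open(src_path, encoding="latin-1", newline="") as src, \
         open(dst_path, "w", encoding="ascii", newline="") as dst:
        for line in src:
            dst.write(convert(line))

# Small self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as fh:
        fh.write(b"caf\xe9 \xf1\r\n")
    hammer_file(src, dst, demo_convert)
    with open(dst, "rb") as fh:
        result = fh.read()
print(result)
```

Passing the conversion function as a parameter keeps the sketch independent of the full translation table; in practice latin1_to_ascii would be passed in place of demo_convert.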

The Linux way

Conversion between character sets is a standard part of Linux. To convert all the files in one go, change to the "raw/objects" directory and run this command:

for f in language_*.txt; do \
  iconv -f CP437 -t ASCII//TRANSLIT "$f" > "$f.new" && \
  mv -fv "$f.new" "$f"; \
done

All accented characters are converted to their normal, non-accented versions. Other characters (if any) are converted to their closest 7-bit ASCII representation.
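
Where iconv is not available, a similar result can be approximated in Python 3 with the standard unicodedata module. This is a sketch of a different technique (Unicode decomposition rather than iconv's transliteration tables), so its output may differ from the command above for symbols that are not accented letters; those are simply dropped here. The function name is hypothetical.

```python
import unicodedata

def strip_accents_cp437(raw):
    """Decode CP437 bytes, remove accents, and return plain ASCII bytes."""
    text = raw.decode("cp437")
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII has no simple equivalent and is discarded.
    return base_only.encode("ascii", errors="ignore")

# 0x84, 0x82 and 0xa4 are a-diaeresis, e-acute and n-tilde in CP437.
print(strip_accents_cp437(b"k\x84l\x82 \xa4\r\n"))  # b'kale n\r\n'
```

Working on bytes read in binary mode keeps the original line endings intact.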

This will overwrite the original language files. If you want them back, you can always unzip them again:

unzip -j path-to-zipfile raw/objects/language_\*.txt

Hermanos small app

For Windows users there is this small application that replaces accented characters in files: just drag and drop a file onto the application icon.