Difference between revisions of "Utility:Accent Removal"
(Added note about shorter/better python alternative.) |
(Updated Unicode Hammer to cover full range of extended ASCII with more explicit mappings.) |
Revision as of 22:00, 16 May 2012
Overview
Some tile sets use the accented characters for additional graphical symbols. This can make racial language text difficult to read. You can remove the accented characters and symbols from the data files. This works on existing worlds and saved games.
Since the structure of language files might change, it is safest to remove the problem characters from the files yourself. Here are several methods to do just that. The first (Jackard's) only works on Windows, but is probably the easiest for novice users. The second (frobnic8's) will work anywhere Python does (i.e. just about anywhere), but requires using the command line a little. Two further options, a Linux one-liner and a small Windows application, are described at the end.
Jackard's InfoRapid Script
Download InfoRapid Search and Replace.
Save the list below to a text file.
Find the following files in DF\raw\objects:
language_DWARF.txt
language_ELF.txt
language_GOBLIN.txt
language_HUMAN.txt
Select them all, right-click and choose 'Search with InfoRapid' from the menu.
Click the Replace tab that shows up in the lower half of the window.
Select your text file from before in the Replace With field, make sure Replace is set to 'Whole Search Expression' and click Start.
A prompt will appear asking for confirmation. Check the Replace All button and click Yes. When the program stops running you are done.
<Command> <Search>„</Search> <Replace>a</Replace> </Command> <Command> <Search> </Search> <Replace>a</Replace> </Command> <Command> <Search>ƒ</Search> <Replace>a</Replace> </Command> <Command> <Search>†</Search> <Replace>a</Replace> </Command> <Command> <Search>…</Search> <Replace>a</Replace> </Command> <Command> <Search>‡</Search> <Replace>c</Replace> </Command> <Command> <Search>‰</Search> <Replace>e</Replace> </Command> <Command> <Search>‚</Search> <Replace>e</Replace> </Command> <Command> <Search>Š</Search> <Replace>e</Replace> </Command> <Command> <Search>ˆ</Search> <Replace>e</Replace> </Command> <Command> <Search>‹</Search> <Replace>i</Replace> </Command> <Command> <Search></Search> <Replace>i</Replace> </Command> <Command> <CaseSensitive>Yes</CaseSensitive> <Search>¡</Search> <Replace>i</Replace> </Command> <Command> <Search>Œ</Search> <Replace>i</Replace> </Command> <Command> <Search>¤</Search> <Replace>n</Replace> </Command> <Command> <Search>•</Search> <Replace>o</Replace> </Command> <Command> <Search>”</Search> <Replace>o</Replace> </Command> <Command> <Search>“</Search> <Replace>o</Replace> </Command> <Command> <Search>¢</Search> <Replace>o</Replace> </Command> <Command> <Search>—</Search> <Replace>u</Replace> </Command> <Command> <Search>–</Search> <Replace>u</Replace> </Command> <Command> <Search>£</Search> <Replace>u</Replace> </Command> <Command> <Search>˜</Search> <Replace>y</Replace> </Command>
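The search characters in the list look odd because they are not the accented letters themselves. A plausible reading, assuming the language files are stored in code page 437 and the list was written in an editor that showed those bytes as Windows-1252, is that each symbol stands for the byte of an accented letter. The short Python sketch below (the helper name `cp437_letter` is made up for illustration) recovers the letter behind a few of the pairs:

```python
# Sketch only. Assumption: the language files are CP437, and the search
# characters in the list are the same bytes displayed as Windows-1252.

def cp437_letter(shown):
    """Return the CP437 character whose byte Windows-1252 displays as `shown`."""
    return shown.encode("cp1252").decode("cp437")


# A few (search, replace) pairs copied from the list above.
pairs = [("„", "a"), ("‡", "c"), ("‚", "e"), ("¡", "i"),
         ("¤", "n"), ("¢", "o"), ("£", "u"), ("˜", "y")]

# Map each search character to the accented letter it represents in the game.
decoded = {shown: cp437_letter(shown) for shown, _ in pairs}
```

Under that assumption, `decoded["„"]` is `ä` and the list replaces it with a plain `a`, which matches the intent of the script.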
frobnic8's Modified Python Script
If you have the programming language Python installed on your machine (or don't mind installing it) and aren't scared of a command prompt, here is an alternate method. Python comes pre-installed on Mac OS X and almost all distributions of Linux. (If you are using Windows, the command line instructions shown will need to be modified slightly.)
- Ensure you have Python installed. The script is written for Python 2. (If you have Python 3.x installed, you will need to remove the two unicode() calls near the end of the script, each of which is marked with a comment, and change the print statements to function calls.)
- Copy and paste (this modified version of) "The Unicode Hammer" into a file named unicode_hammer.py in the raw/objects sub-directory of your Dwarf Fortress directory. (The Unicode Hammer: Is that a name worthy of Dwarf Fortress, or what?)

#!/usr/bin/env python
"""
latin1_to_ascii -- The UNICODE Hammer -- AKA "The Stupid American"

This takes a UNICODE string and replaces Latin-1 characters with something
equivalent in 7-bit ASCII. This returns a plain ASCII string. This function
makes a best effort to convert Latin-1 characters into ASCII equivalents.
It does not just strip out the Latin1 characters. All characters in the
standard 7-bit ASCII range are preserved. In the 8th bit range all the
Latin-1 accented letters are converted to unaccented equivalents. Most
symbol characters are converted to something meaningful. Anything not
converted is deleted.

Background:

One of my clients gets address data from Europe, but most of their systems
cannot handle Latin-1 characters. With all due respect to the umlaut,
scharfes s, cedilla, and all the other fine accented characters of Europe,
all I needed to do was to prepare addresses for a shipping system.
After getting headaches trying to deal with this problem using Python's
built-in UNICODE support I gave up and decided to use some brute force.
This function converts all accented letters to their unaccented equivalents.
I realize this is dirty, but for my purposes the mail gets delivered.

Noah Spurrier noah at noah.org
License free and public domain
"""

"""This version has had its translation table abused to produce
better results for the language files of the game Dwarf Fortress by
frobnic8.
Original here:
http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/
"""

def latin1_to_ascii (unicrap):
    """This takes a UNICODE string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. It returns a plain ASCII string.
    This function makes a best effort to convert Latin-1 characters into
    ASCII equivalents. It does not just strip out the Latin-1 characters.
    All characters in the standard 7-bit ASCII range are preserved.
    In the 8th bit range all the Latin-1 accented letters are converted
    to unaccented equivalents. Most symbol characters are converted to
    something meaningful. Anything not converted is deleted.
    """
    xlate = {
        0x80: 'E', # Euro sign
        0x81: 'e', # Blank
        0x82: 'i', # Single low 9 quote
        0x83: 'f', # Latin small letter f with hook
        0x84: 'ii', # Double low 9 quote
        0x85: 'e', # Horizontal ellipsis
        0x86: 't', # Dagger
        0x87: 'tt', # Double dagger
        0x88: 'ea', # Modified circumflex accent
        0x89: 'oloo', # Per mille sign
        0x8a: 'S', # Latin capital letter S with caron
        0x8b: '<', # Single left pointing angle quotation
        0x8c: 'OE', # Latin capital ligature OE
        0x8d: '-', # Blank
        0x8e: 'Z', # Latin capital letter Z with caron
        0x8f: '-', # Blank
        0x90: '-', # Blank
        0x91: 'ei', # Left single quote
        0x92: "ie", # Right single quote
        0x93: 'ii', # Left double quote
        0x94: "ee", # Right double quote
        0x95: 'ao', # Bullet
        0x96: '-', # En dash
        0x97: '-', # Em dash
        0x98: '-', # Small tilde
        0x99: 'TM', # Trademark sign
        0x9a: 's', # Latin small letter s with caron
        0x9b: '>', # Single right pointing angle quotation
        0x9c: 'oe', # Latin small ligature oe
        0x9d: '-', # Blank
        0x9e: 'z', # Latin small letter z with caron
        0x9f: 'Y', # Latin capital letter Y with diaeresis
        0xa0: '-', # Non-breaking space
        0xa1: 'i', # Inverted exclamation mark
        0xa2: 'c', # Cent sign
        0xa3: 'E', # Pound sign
        0xa4: 'o', # Currency sign
        0xa5: 'Y', # Yen sign
        0xa6: 'l', # Pipe, broken vertical bar
        0xa7: 'S', # Section sign
        0xa8: 'ii', # Spacing diaeresis
        0xa9: 'c', # Copyright sign
        0xaa: 'a', # Feminine ordinal indicator
        0xab: '<<', # Left double angle quotes
        0xac: 'r', # Not sign
        0xad: '-', # Soft hyphen
        0xae: 'R', # Registered trade mark sign
        0xaf: 'aa', # Spacing macron
        0xb0: 'o', # Degree sign
        0xb1: 't', # Plus or minus sign
        0xb2: '2', # Superscript 2
        0xb3: '3', # Superscript 3
        0xb4: "'", # Acute accent
        0xb5: 'u', # Micro sign
        0xb6: 'P', # Pilcrow sign
        0xb7: 'o', # Middle dot
        0xb8: 'e', # Cedilla
        0xb9: '1', # Superscript 1
        0xba: 'o', # Masculine ordinal indicator
        0xbb: '>>', # Right double angle quotes
        0xbc: '1/4', # Fraction one quarter
        0xbd: '1/2', # Fraction one half
        0xbe: '3/4', # Fraction three quarters
        0xbf: 'b', # Inverted question mark
        0xc0: 'A', # Latin capital letter A with grave
        0xc1: 'A', # Latin capital letter A with acute
        0xc2: 'A', # Latin capital letter A with circumflex
        0xc3: 'A', # Latin capital letter A with tilde
        0xc4: 'A', # Latin capital letter A with diaeresis
        0xc5: 'A', # Latin capital letter A with ring above
        0xc6: 'Ae', # Latin capital letter AE
        0xc7: 'C', # Latin capital letter C with cedilla
        0xc8: 'E', # Latin capital letter E with grave
        0xc9: 'E', # Latin capital letter E with acute
        0xca: 'E', # Latin capital letter E with circumflex
        0xcb: 'E', # Latin capital letter E with diaeresis
        0xcc: 'I', # Latin capital letter I with grave
        0xcd: 'I', # Latin capital letter I with acute
        0xce: 'I', # Latin capital letter I with circumflex
        0xcf: 'I', # Latin capital letter I with diaeresis
        0xd0: 'D', # Latin capital letter ETH
        0xd1: 'N', # Latin capital letter N with tilde
        0xd2: 'O', # Latin capital letter O with grave
        0xd3: 'O', # Latin capital letter O with acute
        0xd4: 'O', # Latin capital letter O with circumflex
        0xd5: 'O', # Latin capital letter O with tilde
        0xd6: 'O', # Latin capital letter O with diaeresis
        0xd7: 'x', # Multiplication sign
        0xd8: 'O', # Latin capital letter O with slash
        0xd9: 'U', # Latin capital letter U with grave
        0xda: 'U', # Latin capital letter U with acute
        0xdb: 'U', # Latin capital letter U with circumflex
        0xdc: 'U', # Latin capital letter U with diaeresis
        0xdd: 'Y', # Latin capital letter Y with acute
        0xde: 'P', # Latin capital letter THORN
        0xdf: 'B', # Latin small letter sharp s
        0xe0: 'a', # Latin small letter a with grave
        0xe1: 'a', # Latin small letter a with acute
        0xe2: 'a', # Latin small letter a with circumflex
        0xe3: 'a', # Latin small letter a with tilde
        0xe4: 'a', # Latin small letter a with diaeresis
        0xe5: 'a', # Latin small letter a with ring above
        0xe6: 'ae', # Latin small letter ae
        0xe7: 'c', # Latin small letter c with cedilla
        0xe8: 'e', # Latin small letter e with grave
        0xe9: 'e', # Latin small letter e with acute
        0xea: 'e', # Latin small letter e with circumflex
        0xeb: 'e', # Latin small letter e with diaeresis
        0xec: 'i', # Latin small letter i with grave
        0xed: 'i', # Latin small letter i with acute
        0xee: 'i', # Latin small letter i with circumflex
        0xef: 'i', # Latin small letter i with diaeresis
        0xf0: 'oa', # Latin small letter eth
        0xf1: 'n', # Latin small letter n with tilde
        0xf2: 'o', # Latin small letter o with grave
        0xf3: 'o', # Latin small letter o with acute
        0xf4: 'o', # Latin small letter o with circumflex
        0xf5: 'o', # Latin small letter o with tilde
        0xf6: 'o', # Latin small letter o with diaeresis
        0xf7: 'l', # Division sign
        0xf8: 'o', # Latin small letter o with slash
        0xf9: 'u', # Latin small letter u with grave
        0xfa: 'u', # Latin small letter u with acute
        0xfb: 'u', # Latin small letter u with circumflex
        0xfc: 'u', # Latin small letter u with diaeresis
        0xfd: 'y', # Latin small letter y with acute
        0xfe: 'p', # Latin small letter thorn
        0xff: 'y', # Latin small letter y with diaeresis
    }

    r = ''
    for i in unicrap:
        if ord(i) in xlate:
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += str(i)
    return r

if __name__ == '__main__':
    import sys
    input = sys.stdin
    output = sys.stdout
    if len(sys.argv) == 1 or (len(sys.argv) == 2 and \
            sys.argv[1] in ('-h', '-H', '-?', '--help', '/?', '/H', '/h')):
        print 'unicode_hammer.py [infile [outfile]]\n'
        #for python 3.x, change the following line to s = ''
        s = unicode('','latin-1')
        for c in range(32, 256):
            if c != 0x7f:
                #for python 3.x, change the following line to s += str(chr(c))
                s += unicode(chr(c), 'latin-1')
        plain_ascii = latin1_to_ascii(s)
        #for python 3.x, change all of the following print statements to
        #functions (wrap the entire statement in parentheses)
        print 'INPUT type:', type(s)
        print 'INPUT:'
        print s.encode('latin-1')
        print
        print 'OUTPUT type:', type(plain_ascii)
        print 'OUTPUT:'
        print plain_ascii
        sys.exit()
    if len(sys.argv) > 1:
        input = open(sys.argv[1])
    if len(sys.argv) > 2:
        output = open(sys.argv[2], 'w')
    for line in input:
        output.write(latin1_to_ascii(line))

- Open a command prompt and change directory to your raw/objects directory.
- Rename the four language files, adding '.orig' to the end of their names:
mv language_DWARF.txt language_DWARF.txt.orig
mv language_ELF.txt language_ELF.txt.orig
mv language_GOBLIN.txt language_GOBLIN.txt.orig
mv language_HUMAN.txt language_HUMAN.txt.orig
- Apply the hammer to each of the four language files as follows:
python unicode_hammer.py language_DWARF.txt.orig language_DWARF.txt
python unicode_hammer.py language_ELF.txt.orig language_ELF.txt
python unicode_hammer.py language_GOBLIN.txt.orig language_GOBLIN.txt
python unicode_hammer.py language_HUMAN.txt.orig language_HUMAN.txt
- Enjoy!
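The rename step uses mv, which the Windows command prompt does not provide. As one possible workaround, the same renaming can be sketched in Python 3 so that it runs on any platform. The function name `backup_language_files` is hypothetical, and the sketch only renames files; it does not run the hammer itself.

```python
# Sketch only: a portable version of the "add .orig" rename step.
# The function name and the skip-if-backup-exists policy are assumptions.
from pathlib import Path

LANGUAGE_FILES = [
    "language_DWARF.txt",
    "language_ELF.txt",
    "language_GOBLIN.txt",
    "language_HUMAN.txt",
]


def backup_language_files(objects_dir):
    """Rename each language file to <name>.orig.

    Returns the (backup, target) path pairs that were renamed. A file is
    skipped if it is missing or if its .orig backup already exists, so an
    earlier backup is never overwritten.
    """
    renamed = []
    for name in LANGUAGE_FILES:
        target = Path(objects_dir) / name
        backup = target.with_name(name + ".orig")
        if target.exists() and not backup.exists():
            target.rename(backup)
            renamed.append((backup, target))
    return renamed
```

Calling `backup_language_files(".")` from the raw/objects directory returns the pairs to feed to unicode_hammer.py as its infile and outfile arguments.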
The Linux way
Conversion between character sets is a standard part of Linux. To convert all the files in one go, change to the "raw/objects" directory and run this command:
for f in language_*.txt; do \
    iconv -f CP437 -t ASCII//TRANSLIT "$f" > "$f.new"; \
    mv -fv "$f.new" "$f"; \
done
All accented characters are converted to their normal, non-accented versions. Other characters (if any) are converted to their closest 7-bit ASCII representation.
This will overwrite the original language files. If you want them back, you can always unzip them again:
unzip -j path-to-zipfile raw/objects/language_\*.txt
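For readers without iconv, a rough Python 3 approximation of the same idea is sketched below. It assumes the language files are CP437 (as the iconv command does), decodes them, splits accented letters into base letter plus combining mark, and drops the marks. The function name `strip_accents_cp437` is made up for illustration, and the result is not guaranteed to match iconv's //TRANSLIT output: characters with no decomposition (for example the sharp s) simply become `?` here.

```python
# Sketch only: approximate "iconv -f CP437 -t ASCII//TRANSLIT" in Python 3.
# Assumption: the input bytes are CP437-encoded text.
import unicodedata


def strip_accents_cp437(raw):
    """Decode CP437 bytes and return ASCII bytes with accents removed."""
    text = raw.decode("cp437")
    # NFKD splits e.g. 'e with acute' into 'e' plus a combining acute mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside 7-bit ASCII is replaced with '?'.
    return base_only.encode("ascii", errors="replace")
```

To use it on a file, read the file in binary mode, pass the bytes through the function, and write the result back in binary mode after making a backup.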
Hermanos small app
For Windows users there is this small application, which replaces accented characters in files: just drag and drop a file onto the application icon.