Remember the message.
The future is not set.
Yeah, that's her.
Okay.
I think I'm just going to start.
This is like a couple people, but I don't know if they're actually going to come in, so yeah, let's just get started.
Welcome, everybody.
Today my presentation is Lost in Transliteration, Hidden Passwords in a Multilingual World.
Real quick spiel that everyone does.
My name is Juan Pablo Gomez-Postigo, or just JP.
I'm a senior penetration tester at Sprocket Security.
I've been there for about three and a half years now, which is kind of crazy to say.
This is my first time speaking at CypherCon, so I'm super excited.
I think this is like my fourth CypherCon overall, but when I'm not hacking, I like to do archery, running, rock climbing, 3D printing, and taking care of my plants.
I live in Wisconsin, and I have a balcony apartment, so either the winter kills my plants or my cats kill my plants, but anyway.
Another thing I absolutely love is languages.
I love learning about languages.
I love learning languages, but by no means am I good at it.
I was born in Arequipa, Peru, and I moved to the United States when I was super young, so I still retained a lot of Spanish, but I have like a little accent.
I studied Mandarin for four years, and I got to go to China for a summer in high school, but as I'm sure you know with languages, if you don't use it, you lose it, so a lot of that kind of just disappeared once I stopped taking classes.
I was planning a boys' trip to Japan that has since died, but at the time of writing I had been studying Japanese for about half a year, really hard, and then kind of dropped it. I've always been interested in German, and people always say, or at least my friend in high school said,
if you know English, German's actually super easy.
That's a lie, in my opinion.
It's really hard for me.
I think Mandarin was easier for me than German, but I mean, it's just a skill issue, actually, but even though I love learning about languages, I just want to preface with this, that I'm by no means good at it at all.
That's what my Duolingo app looks like.
I have a negative streak, and I get threatened by the bird on Twitter, so I just wanted to let you guys know, you don't actually need to know any of the languages we're going to talk about, except English, I guess, to actually enjoy this presentation.
But anyway, what are we doing here?
We're here to take a look at passwords for specific languages that use writing systems different than ours.
We're here to see how the technical limitations of non-Latin-script keyboards can influence chosen passwords, or in this case restrict passwords, for people whose native language isn't English.
So you're going to see what happens when a user wants to type their password like that, but they're using a keyboard like that.
There we go.
Super quickly, just to cover the agenda and prove at some point we will talk about cybersecurity.
It might seem like a little long at the beginning, but we're going to cover keyboard layouts, language structures, common passwords in different languages, Unicode, unfortunately, it's like the most boring part of the presentation, but we've got to talk about it,
transliteration, technical limitations, and then we're going to take a look at a lot of examples towards the end.
And by the end of this talk, maybe not by sight alone, but you should be able to understand why I know that first password is Korean and that second password is Russian.
So starting off with keyboard layouts, QWERTY, the one that's familiar to all of us, I hope.
QWERTY is used heavily in English-speaking countries as the default layout.
QWERTY has its keys laid out by frequency, something that it inherited from typewriters.
In the UK and Ireland, they utilize QWERTY as well with some small differences.
The at sign and the double quote sign are swapped just because of frequency.
There's other symbols for currency, and it uses something called the alt graph key.
That's usually in the bottom right on the right side of the space bar.
That's used to input things like diacritics, different symbols, that kind of stuff.
And even though it's labeled that way there, most keyboards have an equivalent; if you use a Mac like I do, it's the Option key.
You can use the Option key to type different symbols for math, currencies, and whatnot.
And so even though both the original QWERTY format and the UK and Ireland ones are for the same language, there's already some small differences.
Moving on to other QWERTY-ish keyboard layouts, we have QWERTZ and AZERTY.
QWERTZ is used in places like Germany and Austria.
The Z and the Y are completely swapped just because of the frequency of the letter Z in those respective languages.
You also get different symbols using that alt graph key.
In AZERTY, A and W are swapped with Q and Z, again, just because of the frequency of those letters and their respective language.
And I don't know if you noticed, but all the symbols decided to just jump and go all over the place.
So we haven't really left any languages that don't use like the Latin alphabet, but already we're getting into some huge differences in layouts.
But what about Russian?
What about a language like Russian that doesn't use the Latin alphabet?
Russian utilizes the Cyrillic alphabet and a keyboard like that probably just looks like QWERTY, but the frequency and placement of the letters would be for the Cyrillic alphabet.
And yeah, that makes perfect sense.
Still, there's another keyboard layout that they would use instead, which is this one.
It has dual alphabets on each key.
There you have the Latin alphabet and also the Cyrillic alphabet in the same spot, even though those spots don't actually correlate to each other.
It's just based off of frequency.
So that works just fine.
They can type in Russian when they need to and use the Latin-script keys when they have to.
We'll talk more about Russian later because it has the same issues as other languages, but let's move on to another language that also doesn't typically use Latin script writing, Mandarin.
So Mandarin is a logographic writing system, meaning that the symbols or characters represent entire words and concepts without relating to the pronunciation.
This doesn't mean that the characters themselves can't give clues as to the pronunciation, but unless you know what each specific character means and how to pronounce it, you can't just figure it out like you can with other languages.
So there, with "password," if you know the English alphabet, you can kind of sound it out.
And that at the bottom is the International Phonetic Alphabet.
It's a way linguists can figure out how to pronounce words without actually learning every language imaginable.
In the middle, that's Contraseña.
In Spanish, that means password.
If you know the alphabet and how each letter is pronounced in Spanish, you can kind of sound that out.
You can't really do that with Mandarin.
If you look at that, unless you know exactly what each of those characters means, you're not going to be able to figure it out on your own.
Real quick, even though it's considered logographic, that doesn't mean that there isn't some type of phonetic element to each character.
In truth, there's like six types of Chinese characters, and I stopped taking classes like seven years ago, so I'm not super confident in this slide, but we'll only be covering these three here.
Type one are pictograms.
They're simplified drawings of objects.
So huo and mu, forgive me for my horrible pronunciation, I'll probably just skip the rest of these, but that one looks like fire, so it means fire.
That one looks like tree, it means tree.
For day and moon, you're kind of like looking out the window, so it kind of gives you a hint as to what that's supposed to be.
Type two are like ideograms or indicatives, things that are considered self-explanatory or represent abstract ideas.
Yi, er, and san are one, two, and three, and each has one line, two line, three lines.
Shang points up and means above; xia points down and means below.
And then this is where I was saying sometimes they can give you a hint as to how to pronounce them.
These are phonosemantic characters.
So here the pronunciation or the meaning can kind of be implied by the individual character that makes up the word as a whole.
So here, those are radicals.
Each of those characters, those six characters have the yang radical, so that can give you a hint as to either the pronunciation or the meaning of the word based off where it's placed.
All of that to say is that there are a ton of just complex characters that make up the Mandarin language.
So how do you build a keyboard to represent all of those?
I'll give you a hint.
It's not that.
That looks horrible.
It actually looks like this.
And this keyboard is entirely Latin script.
The only Mandarin here is on the caps lock.
There it has zhong and ying, which lets you switch between typing in Chinese characters or typing in English.
So just like the Russian keyboard, it has Latin alphabet, and it has to, basically.
Why?
People just need the Latin script in general for things like typing in URLs and programming.
And even though there are things like international domain names and there are programming languages that aren't based entirely in English, at some point you're going to need a Latin character when you're using a computer.
So what do people who speak Mandarin do?
Well, Pinyin is a phonetic writing system that was developed in the 50s to help improve literacy rates in China.
So here, Latin characters are combined with one of four tones to help build up each character in the entire language.
Because we can't make a keyboard with every possible character, we can instead utilize Pinyin and software to help build out each character as we type.
So this is done with something called an IME, or an input method editor.
These are super popular with like the CJK languages, Chinese, Japanese, Korean, but they're used in a lot of writing systems.
IMEs work in two main stages.
First, you type your input phonetically, and then the IME processes what you're typing and uses predictive text to figure out what you're trying to say.
So there I typed out hao, and it's giving me five options for the five most common characters I'm probably trying to type out.
You can either hit space and it will automatically select the first one, or you can use the numbers on your keyboard to select the specific one you're talking about.
I had a demo here, it's super tiny, no one's going to be able to see it, so I'm just going to kind of skip through it, and it's just me typing very slowly in Mandarin.
Yeah, so as I'm typing it out, I can either type out the word as a whole or individually, and then I can use the numbers to spell out specifically what I'm trying to say there.
I'm just going to skip to the end here.
I forgot what I said.
That's true, actually, but that's fine.
Here we go.
Real quick, first of many tangents.
I found this freak of nature on the internet.
There's a guy who works in security, and he uses a Dvorak layout keyboard, but he still uses Pinyin. The default IME for Pinyin doesn't support different keyboard layouts, so he wrote a PowerShell script and ran it on every computer at his company to edit the registry on all the Windows computers so that Dvorak would work on his computer.
So stupid, first of all, sorry, but no one should use Dvorak.
I will stand by that.
Anyway, let's go to a different language, Korean.
Korean is a little different.
If we look at the keyboard there, I'm realizing a white keyboard with very faint text is not the best thing to show here, but we see the Latin alphabet is printed on the top right, and the Korean alphabet is put on the bottom left.
That's called Hangul.
It's a phonetic writing system.
So it's kind of like a Cyrillic, but not as simple.
I will preface this by saying I have no experience with Korean prior to this talk, so feel free to either heckle me or correct me afterward during the questions, but in Korean, letters are stacked in a grid in a specific order to build out each syllable.
These are split up into components called Jamo.
There's an initial Jamo, a medial Jamo, and sometimes a final Jamo.
Not all syllables contain a final Jamo, but what matters is that on the keyboard, each of those Jamo has a specific order it has to be typed in, and it gets placed in a specific spot on the grid based on which one came before it.
All that to say is that I could never write this by hand, and thankfully IMEs exist.
So this takes care of all of that for us at a software level.
You see here in this demo, I'm not hitting space at all.
Once the final Jamo is placed, it knows the next one is starting because we have finished this one.
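The stacking of Jamo into a syllable block can be sketched with the arithmetic the Unicode standard uses for precomposed Hangul syllables. This is a minimal sketch and not how any particular IME is implemented; the function name `compose` and the table names are made up for illustration.

```python
# Jamo tables in the order Unicode uses for precomposed Hangul syllables.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial Jamo
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 medial Jamo
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # "no final" + 27 finals

HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable


def compose(initial: str, medial: str, final: str = "") -> str:
    """Combine an initial, a medial, and an optional final Jamo into one syllable."""
    i = INITIALS.index(initial)
    m = MEDIALS.index(medial)
    f = FINALS.index(final)
    return chr(HANGUL_BASE + (i * len(MEDIALS) + m) * len(FINALS) + f)


print(compose("ㄷ", "ㅏ", "ㄹ"))  # the syllable for "moon"
print(compose("ㅎ", "ㅗ"))        # a syllable with no final Jamo
```

Because the final Jamo is optional, the same function covers both two-part and three-part syllables.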
So I'm not going to show this demo because we just saw the GIF and that's a lot better, but now that we know how people can type out passwords in both Mandarin and Korean, finally we can actually start looking at passwords.
And so here is a list of common Chinese passwords from a data leak.
So when I saw this, my first thought was, um, what the hell, where are the Chinese passwords?
And it's all mostly numbers and English letters, so whatever.
Just taking a look at this, these are all either keyboard walks or default passwords like admin and password, so the vast majority are things that are super easy to crack, but nothing that you could easily tie to the user mainly speaking Chinese.
In fact, if we take a look at the most popular formats for Chinese data leaks, we see that they're all comprised of numbers.
Here, D means digit, L is lowercase, U is uppercase, and S is symbol.
So if we compare the most popular formats from the top to the bottom, we see there aren't even any letters for the most common Chinese passwords.
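The D/L/U/S notation above is easy to reproduce. Here is a small sketch that turns a password into its format string and counts the most common formats in a list; the function names and the sample passwords are made up for illustration and are not taken from any leak.

```python
import string
from collections import Counter


def password_mask(password: str) -> str:
    """Map each character to D (digit), L (lowercase), U (uppercase), or S (symbol)."""
    out = []
    for ch in password:
        if ch in string.digits:
            out.append("D")
        elif ch in string.ascii_lowercase:
            out.append("L")
        elif ch in string.ascii_uppercase:
            out.append("U")
        else:
            out.append("S")  # anything else, including non-ASCII, counts as a symbol here
    return "".join(out)


def top_masks(passwords, n=5):
    """Return the n most common masks with their counts."""
    return Counter(password_mask(p) for p in passwords).most_common(n)


sample = ["123456", "111111", "abc123", "Passw0rd!"]  # hypothetical sample
print(top_masks(sample))
```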
So why aren't we seeing any Chinese characters?
The main reason for that is encoding issues.
Like I said before, warning, this is the most boring slide, and I'm really sorry, but we got to get through it.
So Unicode is a universal encoding standard that allows us to encode almost every single language.
It uses code points to reference every possible character in all of existence, allegedly.
Because of that, implementing Unicode creates so much complexity.
The main challenges with supporting Unicode in password fields are consistency and normalization.
So ASCII is present on almost every international keyboard, like we said before, whereas if you set your password up on your laptop with an IME keyboard, you better only log in from that device, otherwise you're going to have to install that language any time you try to log in from anywhere else.
The other issue is normalization.
ASCII only has a single encoding, so each grapheme, which is a single unit of writing in linguistics, is only one byte.
That's not the case with Unicode.
Unicode has so many different encoding strategies, and it just creates so much complexity, and if that wasn't bad enough, each grapheme can actually have different values depending on how it's typed out.
So here we have E with the acute or the little accent mark.
If you just have that as a built-in key on your keyboard, you can type it, and its value is 233.
But if you typed in E and then use a modifier to add that, you're actually creating a different Unicode value.
So when your password gets hashed, it's going to be two completely different values.
Visually, you'd be looking at the same password, but the way that the computer treats it, it would treat it as two completely different things.
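The accented E example can be checked directly. This sketch uses SHA-256 purely as a stand-in hash (the talk does not say which hash a given system uses) to show that the two spellings hash differently until they are normalized to the same form.

```python
import hashlib
import unicodedata

precomposed = "\u00e9"   # single code point: e with acute, value 233
decomposed = "e\u0301"   # plain "e" followed by a combining acute accent


def digest(text: str) -> str:
    """Hash the UTF-8 bytes of the text (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


print(ord(precomposed))                              # 233
print(digest(precomposed) == digest(decomposed))     # False: different bytes, different hash

# Normalizing both inputs to the same form (NFC here) makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
print(digest(precomposed) == digest(nfc))            # True
```

A system that accepts Unicode passwords would need to apply one normalization form consistently both when the password is set and on every login.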
So this encoding issue is a big enough problem where Microsoft just says, fuck it, we're not allowing any Unicode characters whatsoever.
And so that doesn't mean you can't do it, it's just Microsoft is like, please, please don't make a password with Unicode.
So on a Windows computer, I tried to set the password MIMA, which means password in Mandarin.
Quick side note, I put LOL there because I've never seen a password in the description field in any pen test ever, and then literally two days ago, we found it for the first time in like six years, so that joke doesn't make sense anymore.
But essentially, when I tried to set it to the password MIMA, the computer forced me back into English.
So instead of the two-character word on the left, it typed out those four characters in pinyin.
That doesn't mean I can't just copy-paste it, because if it was completely forbidden, it wouldn't have let me do that.
So I was able to do that, and that creates a huge issue for the user, because now if they don't know how to type out that password, and they're not allowed to type it in the first place, they're basically locked out.
You can still log in with other command-line tools, like pen testing tools. Not CrackMapExec, though.
It doesn't support Unicode for some reason, but NetExec does.
That's why NetExec is way better.
But this isn't like a Windows-exclusive issue.
This is an issue on most websites.
When you highlight a password field on a web form, your computer is going to force you back into the default or English-based keyboard.
So here I had pinyin set as my main keyboard, even for the computer itself, and I also had the Korean keyboard set, and as soon as I started typing into Have I Been Pwned, it forced me back into the English keyboard.
So what happens when you can't use your native tongue for passwords?
Well, I'll give you a hint.
It might have to do with the title of the talk, but we'll cover that in a second.
We're going to first cover low-hanging fruit, and this is a very low-effort slide on my part, but we just got to cover these super quickly.
You get passwords with all numbers.
You kind of see dates sometimes.
If you recognize that date, nothing happened with CrowdStrike on that day.
There's also keyboard walks, where they can get actually super complicated.
We've cracked passwords where people will literally just type all the way on the top row, second row, third row, and then start adding shifts.
So it can get complicated, but those are still mostly documented and pretty easy to crack.
But, yeah, let's jump back into Russian to look at those common passwords.
And if we look at these, for the most part, we're seeing old keyboard walks, mostly numbers, some default passwords, like default and password, I guess.
But one of these does kind of stand out a little bit.
If we look at that, it kind of looks like a keyboard walk.
Like if you started in the middle of the keyboard and started going out a little bit, but then you go to JKM and it kind of doesn't make sense.
So what is that?
That is our first example of a transliterated password.
So here, again, forgive my pronunciation, but assuming this user is a Russian speaker, let's just say they want to make their password "password," which is parol in Russian.
They start typing it out, but for some reason, either they're forced into it or they do it on purpose, their keyboard gets switched back into the English input while they're still typing as if it were the Cyrillic layout.
So instead of parol, they get GFHJKM.
That's transliteration.
So real quick, shout out to Toxic from Team Hashcat.
He mentioned this to me at DEF CON 31.
He was also doing a talk on like complicated passwords.
And when the Have I Been Pwned database got first released and everyone started cracking passwords, he said people are like mashing the keyboard, putting in gibberish, but we can still crack passwords.
That's fine.
And then some random person replied and was like, that's not keyboard smashing, that's actually Russian.
I have no idea how this person saw this tweet and immediately knew that was Russian transliteration, but that's sick.
So transliteration is the process of representing characters from one writing system to another.
In the context of password cracking, there's two types.
There's phonetic-based transliteration and keyboard map transliteration.
Phonetic-based transliteration works pretty intuitively. Right there, that word is bestolkovy, which means stupid, so the translation of that from Russian to English is "stupid."
The phonetic transliteration is how I know to say that word as bestolkovy without knowing any Russian myself.
Keyboard map transliteration is what we just covered: parol in Russian translates to password in English, but if we transliterate it, we have to look at where those keys are mapped on the keyboard itself,
so parol becomes G-F-H-J-K-M.
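Keyboard map transliteration can be sketched as a plain character-for-character lookup. The sketch below assumes the standard Russian (ЙЦУКЕН) layout laid over US QWERTY and only handles lowercase letters; the function names are made up for illustration, and this is not the tool discussed in the talk.

```python
# Lowercase keys of the standard Russian layout, row by row, and the
# US QWERTY characters that sit on the same physical keys.
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбюё"
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,.`"

RU_TO_EN = str.maketrans(RU_KEYS, EN_KEYS)
EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)


def russian_to_qwerty(text: str) -> str:
    """What gets typed if a Russian word is entered with the English layout active."""
    return text.lower().translate(RU_TO_EN)


def qwerty_to_russian(text: str) -> str:
    """Reverse lookup: recover the Russian word from the Latin keystrokes."""
    return text.lower().translate(EN_TO_RU)


print(russian_to_qwerty("пароль"))   # gfhjkm
print(qwerty_to_russian("gfhjkm"))   # пароль
```

The reverse mapping is the useful direction when triaging cracked passwords: a string that looks like gibberish in Latin letters but maps back to a dictionary word is a strong hint that it is a transliterated password.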
This isn't new by any means.
It's a known phenomenon, so anyone here that's super involved in the password cracking community is already aware of this, and there's actually a really good talk.
I just realized this whole slide doesn't really fit up here, but there's a really good YouTube video at the bottom for a talk at a different BSides, I think.
It's called The One with the Foreign Word List.
They go over just weird ways of cracking different passwords, and they talk about transliteration there too, so I really recommend that talk.
It'll show up again at the end.
So transliteration is actually way more common than you would think.
Google does this automatically, technically.
If you type nonsense in one language, it'll actually check the Latin alphabet first to see if maybe you were using the wrong keyboard.
So here I tried typing out Sprocket Security, but I typed it out using the Cyrillic keyboard, and Google automatically knows that I meant Sprocket Security.
It doesn't even ask, it just starts showing me those results.
And the reverse of this is also true.
If I typed out password safe in Russian while using the English keyboard, it immediately says showing results for, I don't know how to say password safe.
I mean, passwords, I don't know how to say the safe part, but it knows I meant password safe, and it actually starts showing me only .ru results after the first one.
So that's super cool.
Transliteration, there's a lot of posts about it on hashcat forums, asking for either direct character mapping results or for people to publish transliterated word lists, and for the most part, I really couldn't find any, and I couldn't find any tools on GitHub either,
so I decided to kind of just go out and build my own.
And the main reason I did it was curiosity.
I think I want to clarify, this is most likely not going to be common during a pen test.
This is more so a nerd that likes password cracking and different languages kind of combining both of those things together.
But yeah, I also wanted to figure out how can you even figure out if this password is transliterated or just straight-up gibberish.
So first, I checked the Russian word for password in the Have I Been Pwned database, and it showed up about 14,000 times, which isn't a lot, but it's still something.
Next, I checked the phonetic transliteration version of it, and it showed up about 6,000 times.
But that's like close enough to a word in any language that I can't even say for sure that's Russian, so I don't even really count that.
But then I looked up the keyboard-based transliteration version of password in Russian.
I can't even say transliteration like more than two times in a row before I mess it up.
But yeah, so that showed up about 478,000 times, which is like 30 to 35 times more than the actual Russian word.
So to me, this is at least happening way more than I initially thought.
So I'll get into the specifics of the tool I wrote when we get to Korean, but let's start off with Russian.
So I thought the best way to quickly get as many transliterated passwords as possible was to convert words that already exist in English word lists to Russian, transliterate them, and then check the Have I Been Pwned database.
So one issue I see right off the bat, and we'll only see this problem with Russian, is names.
A lot of these names can be romanized or converted to Latin pronunciation without any letter change.
So we don't know for sure if it's Russian or just English.
However, some of these are super unique and occur enough for me to call them out.
I tried to remove all the not-safe-for-work ones, but I didn't try really hard.
I'm sorry, Gutsman.
There's things like locomotive and bear and things that stand out enough to me phonetically where I'm like, okay, that has to be Russian, because otherwise that would be gibberish, and these can't just be random passwords either.
And so I did the same thing for the keyboard-based transliteration version.
Same thing here.
The only difference is instead of phonetically translating them, we mapped out each Russian key to its corresponding English key.
Did the same process, and when we did that, we actually got way more results with keyboard transliteration than with the phonetic one, which to me seemed weird, because I guess instinctively I thought it would have been the other way around,
but there's still a lot of cool unique hits here, once again with the names. This time, just due to how they look, we know for sure these have to be Russian, because the odds of 150,000 passwords all being the same and randomly generated from the same source are just not good.
And a really good example of this, something you guys can do right now during this talk, is if you go to Have I Been Pwned and type in any six letters, if it's not a keyboard walk or it's not something that we've already discussed, odds are it's not going to show up,
and even if it does, it's going to show up less than, I don't know, freaking locomotive, you know, so that's a good indicator that that's what's happening here.
If we compare both sets of transliterated results for the random words I selected, what surprised me is that, like I said before, the keyboard one is way more common than the phonetic one.
The keyboard mapped transliterated versions of Alexander and Victoria are more common than the phonetic pronunciations for some reason.
And even though words like Marina and Andrew show up more in the phonetic version, those are so close already to the English version that, again, they're outliers we can't say for sure are Russian.
If we compare all of these to English, obviously English wins, except for locomotive for some reason.
I don't know, maybe trains are built different there and they just care about it, but yeah.
So English wins for the most part just because of number of English speakers and all the other things that we talked about.
So that was cool, but what about things like Mandarin and Korean?
We'll take a look at Mandarin.
Chinese is interesting because the way pinyin works, the phonetic transliteration is the same as the keyboard one because they're using the same alphabet.
Here I typed out wǒ ài nǐ, which means I love you, in Mandarin, and it shows up about 124,000 times, which is a lot, but it's a lot less than Russian.
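Since pinyin is already written in Latin letters, getting from the toned spelling to the keystrokes is mostly a matter of dropping the tone marks. A minimal sketch, assuming tone marks are simply removed and spaces are dropped; the function name is made up for illustration.

```python
import unicodedata


def pinyin_to_keys(text: str) -> str:
    """Strip tone marks and spaces so toned pinyin becomes plain keystrokes.

    Caveat: this also strips the dots from "ü", leaving "u", while pinyin
    input methods commonly expect "v" for that letter. That case is not handled.
    """
    decomposed = unicodedata.normalize("NFD", text)  # split letters from their marks
    letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(letters.split()).lower()


print(pinyin_to_keys("wǒ ài nǐ"))  # woaini
```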
And one of the main factors for that, which we already talked about, is that Mandarin speakers are more likely to use number-based passwords anyway.
Another thing I failed to bring up the first time is we can't treat password practices in English the same for other languages.
With languages like Mandarin, you can't tell them to just use passphrases when entire characters that take up one character space can represent entire ideas and words as a whole, and also you can't enforce uppercase, lowercase password requirements in Unicode-based characters anyway,
so that's a whole problem for a different day.
So running the tool against Mandarin, we see not as great results, in my opinion, but I think the biggest factor for that is that we're generating these passwords from common word lists, like RockYou.
Those are all, I think, culturally relevant to mostly English words.
We need to use data sources that are relevant for Mandarin speakers, and if you're curious as to why Chinese passwords are more digit-based and pinyin-based, I can't recommend this blog more.
It's called A Study of Chinese Passwords.
I love this blog so much.
It's super cool, super interesting.
It's something I read, like, even before I started pen testing, so I'm glad I get to bring it up here, but I was rereading it for this presentation, and when I was looking at the comments, I found someone here mentioned that there isn't really a good study on Korean-speaking passwords,
and they said that they know people who will purposely transliterate their passwords using the English keyboard.
So when they tried to type out password in Korean, it actually gets hashed, or what gets hashed is the English Q-L-A-L-F-Q-J-S-G-H, and so on.
So that's the last language we're going to take a look at.
I saved it for last because of the way Hangul works and the fact that I don't know Korean.
It was, like, way harder than the other two, but we'll also take a look at the tool itself.
So step one, we need to translate a lot of words very quickly and for free.
The Google Translate API is, like, $20 per million characters translated.
RockYou has, like, 1.3 billion characters, so that's absolutely not happening.
Sprocket's not going to give me $25K for this project.
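The cost estimate is simple arithmetic on the two figures quoted above (about $20 per million characters and about 1.3 billion characters). The function name is made up for illustration.

```python
def translation_cost(total_chars: int, price_per_million: float) -> float:
    """Cost of translating total_chars characters at a per-million-character price."""
    return total_chars / 1_000_000 * price_per_million


# Figures as quoted in the talk: ~1.3 billion characters at ~$20 per million.
print(translation_cost(1_300_000_000, 20.0))  # 26000.0
```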
So there's other cloud services that let you translate words, but Google makes enough money.
There's this Go library here that reverse engineers how it creates its API keys, so we can just translate as many words as we want for free, thankfully.
So that's cool.
That doesn't mean that all of our problems are over.
We're still going to have issues.
There's a lot of passwords that we don't want just on principle.
We don't want anything with, like, leetspeak because it's not going to be valid for this attack.
We don't want things with symbols.
We want to create just, like, a base list of transliterated words, because we can expand those with, like, rules and combinator attacks later down the line.
There's also just the fact that we're using Google Translate, we're going to have bad translations.
We have to deal with certain edge cases where words are going to end up in different formats and, like, symbols just aren't going to play nicely.
And then finally, some words are just going to translate too short for me to determine it's from a specific language.
Like, there, moon in Korean is typed out as just E-K-F.
I mean, moon is honestly too short to even fit most password requirements, but something like EKF, that's too short for me to definitively say that has to be Korean.
So I capped it at about six to eight characters.
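The filtering described above can be sketched as two small checks: keep only plain alphabetic source words, and drop transliterations that are too short to attribute to a language. The minimum length of 6 is an assumption based on the "six to eight characters" remark, and both function names are made up for illustration.

```python
import re

PLAIN_WORD = re.compile(r"[A-Za-z]+")


def is_base_word(word: str) -> bool:
    """Accept only plain letters: no digits (leetspeak) and no symbols."""
    return PLAIN_WORD.fullmatch(word) is not None


def long_enough(keystrokes: str, min_len: int = 6) -> bool:
    """Reject transliterations too short to tie to a specific language."""
    return len(keystrokes) >= min_len


words = ["password", "p4ssw0rd", "pass!", "moon"]  # hypothetical input
print([w for w in words if is_base_word(w)])       # ['password', 'moon']
print(long_enough("ekf"), long_enough("qlalfqjsgh"))  # False True
```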
Step two, we need to map every single Hangul Unicode value to a specific key input, and then read the Korean and break it down into the order in which it was typed.
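Step two can be sketched by reversing the Unicode syllable arithmetic and looking up each Jamo's key. This is a Python illustration, not the Go tool from the talk; it assumes the standard two-set Korean layout over US QWERTY, and the table and function names are made up.

```python
# Keystrokes for each Jamo on the two-set Korean layout, listed in the
# same order Unicode uses for initial, medial, and final Jamo indices.
INITIAL_KEYS = ["r", "R", "s", "e", "E", "f", "a", "q", "Q", "t",
                "T", "d", "w", "W", "c", "z", "x", "v", "g"]
MEDIAL_KEYS = ["k", "o", "i", "O", "j", "p", "u", "P", "h", "hk", "ho",
               "hl", "y", "n", "nj", "np", "nl", "b", "m", "ml", "l"]
FINAL_KEYS = ["", "r", "R", "rt", "s", "sw", "sg", "e", "f", "fr", "fa",
              "fq", "ft", "fx", "fv", "fg", "a", "q", "qt", "t", "T",
              "d", "w", "c", "z", "x", "v", "g"]

HANGUL_BASE = 0xAC00
SYLLABLE_COUNT = 19 * 21 * 28  # 11,172 precomposed syllables


def hangul_to_keys(text: str) -> str:
    """Break each Hangul syllable into Jamo and emit the keys in typing order."""
    out = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            initial, rest = divmod(offset, 21 * 28)
            medial, final = divmod(rest, 28)
            out.append(INITIAL_KEYS[initial] + MEDIAL_KEYS[medial] + FINAL_KEYS[final])
        else:
            out.append(ch)  # leave anything that is not a Hangul syllable untouched
    return "".join(out)


print(hangul_to_keys("달"))        # ekf
print(hangul_to_keys("비밀번호"))  # qlalfqjsgh
```

Compound vowels and compound finals map to two keystrokes each, which is why some table entries are two characters long.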
This part sucked, and I'm so glad ChatGPT exists, because I don't know Korean and I didn't know Go, so it kind of helped me out here.
And then finally, as you might have guessed, we're going to be comparing these to the Have I Been Pwned database, just to see, has this password ever existed?
Because if it hasn't, we can still store it in the word list when we're cracking passwords in the future, but it might not be worth checking in the first place.
So for that to work, for every password that we translated and then transliterated, we had to generate an NT hash and then make a request to the Have I Been Pwned API to see how many times this password has been seen.
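The lookup step can be sketched without touching the network. The Pwned Passwords range API takes the first five hex characters of a hash and returns `SUFFIX:COUNT` lines for every hash sharing that prefix. Note the swap: the talk uses NT hashes, but this sketch uses SHA-1 because MD4 (which NT hashes need) is not available in every Python `hashlib` build. The URL and its `mode=ntlm` parameter reflect my understanding of the public API and should be checked against its documentation; the function names and the sample response are made up.

```python
import hashlib


def sha1_hex(password: str) -> str:
    """Uppercase SHA-1 hex digest (stand-in for the NT hash used in the talk)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def split_for_range_query(hex_hash: str):
    """Split a hash into the 5-character prefix that is sent and the suffix kept locally."""
    hex_hash = hex_hash.upper()
    return hex_hash[:5], hex_hash[5:]


def range_url(prefix: str, ntlm: bool = False) -> str:
    """Build the range-query URL (assumed endpoint; verify before relying on it)."""
    url = "https://api.pwnedpasswords.com/range/" + prefix
    return url + "?mode=ntlm" if ntlm else url


def count_from_range_body(body: str, suffix: str) -> int:
    """Find our suffix in a SUFFIX:COUNT response body; 0 means never seen."""
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


prefix, suffix = split_for_range_query(sha1_hex("password"))
fake_body = "0000000000000000000000000000000000A:3\r\n" + suffix + ":42\r\n"  # made-up response
print(range_url(prefix))
print(count_from_range_body(fake_body, suffix))  # 42, from the made-up body above
```

Only the prefix leaves the machine, so the service never sees the full hash being checked.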
In retrospect, I did this very stupidly, because now I'm generating 1.3 billion API requests on my poor internet. What I should have done was download the entire database and then just do local lookups, but that's something I can change in the future.
So after doing all that, this is a really crappy screenshot of the old code, but basically we take the passwords from the word list, translate them, transliterate them, generate the NT hash, and then make a request to the Have I Been Pwned API.
And here are some of the raw results on the left and the table on the right.
We see some pretty unique ones, like the Simpsons.
I guess you can't see the full table there, but the counts are at, like, four to five digits, but yeah, this definitely gave us the fewest results, and a big reason for that is just the number of Korean speakers in the world.
Mandarin and English have like 1.3 billion speakers each.
Russian is 300 to 400 million.
Meanwhile, there's only like 65 to 70 million Korean speakers, so that doesn't really surprise me, but the transliterated words are still unique enough that I think it was worth digging into.
And then the last thing I want to cover is putting all of this together, generating a word list, and then actually testing this out in the wild.
So for that, I used HashMob's API.
HashMob is a super awesome platform and a password-cracking community where members can crack passwords from user-submitted lists and also, like, official data leaks.
They're super cool, and they generate like a gigantic word list that probably gets updated like every month.
It's gigantic.
They also have an API that lets you search for hashes and clear plaintext passwords.
They'll tell you how often they've seen that hash or plaintext password, where they've seen it, and what lists it's in, but they'll also tell you similar credentials they've seen in, like, other word lists and stuff.
So using that API, we can submit every transliterated password that we've generated to pull down similar versions.
This is my lazy way of not wasting compute generating different candidates with rules, but after building that word list, we can apply our own rules to it to try to generate passphrases and more complex passwords.
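As a rough illustration of what rules and combinator attacks do to a base list, here is a small sketch. It is not hashcat's rule engine, and the suffixes and sample words are hypothetical choices for illustration only.

```python
from itertools import product


def simple_rules(word):
    """Yield a few common mangled variants of one base word."""
    yield word
    yield word.capitalize()
    for suffix in ("1", "123", "!"):  # hypothetical suffix set
        yield word + suffix


def combinator(left, right):
    """Yield every left+right concatenation, as a combinator attack would."""
    for a, b in product(left, right):
        yield a + b


base = ["ghbdtn", "gfhjkm"]  # sample transliterated words
variants = [v for w in base for v in simple_rules(w)]
phrases = list(combinator(base, base))
```

Keeping the base list free of digits and symbols means rules like these are applied once, rather than being baked into the list.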
But once we do that, we need to target mostly Russian-speaking hashes.
So the way we can do that is also with HashMob, by downloading the top five .ru website leaks that have ever existed, which is super legal.
But after pulling those down and pulling those hashes specifically, we targeted those with just the transliterated word list we used, and we actually got a lot of very interesting passwords that I'm not sure we would have cracked otherwise.
I'm sure a lot of these exist on those gigantic word lists I was talking about, but this is still super useful when you're generating a targeted word list because, for whatever reason, you know that the main user base you're trying to crack against is Russian speakers.
So a lot of these would have been difficult to crack with just standard lists and rules, if you happen to have a tiny GPU like I do.
But yeah, building out the base list of translated words is kind of the goal for this.
And after that, you can just give it to someone who's actually good at cracking passwords, and then they can go crazy with a bunch of different rules and techniques.
But yeah, that's pretty much it.
So I know that was a lot, but real quick, I wanted to cover sort of like next steps for this project.
I wanted to generate words using more culturally relevant words and names for each specific language, which is a lot harder done than said.
I also wanted to dig deeper into current languages.
There's some things I didn't get to cover, like internet lingo.
Like in Mandarin, a lot of people, when they're playing games, they'll abbreviate phrases.
So, like, that says xiào sǐ wǒ le (笑死我了), which means laughing myself to death.
Instead of typing all that out, they just do like XSWL.
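A tiny sketch of how such abbreviations could be generated for a word list, assuming the phrase is already available as toneless pinyin syllables. The function name is illustrative.

```python
def pinyin_initials(syllables):
    """Abbreviate a phrase to the first letter of each pinyin syllable."""
    return "".join(s[0] for s in syllables if s).upper()


abbreviation = pinyin_initials(["xiao", "si", "wo", "le"])
```

Here `abbreviation` is `XSWL`, the abbreviation mentioned above.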
And I'm really curious if maybe they'll incorporate that into passwords too.
I also wanted to expand into other languages, but like I said, I suck at learning languages.
So I think I'll just stick to the three that I've targeted now and maybe get better at those.
I wanted to give a quick shout out, again, to the blog that I mentioned.
I've linked the slides and the YouTube presentation for the two talks I talked about.
There's also two YouTube videos I added, because I love YouTube.
The first one is Unicode in Friendly Terms.
It explains all the headaches and pains with supporting Unicode that I can't verbalize, and they do it a lot better.
And then the last video has a warning on it, because it's like Zoomer brain rot humor, but it's called The Challenge of Making a Keyboard for Every Language.
It's one of my favorite videos.
It's super cool, and it's something I visited probably like five times while writing this talk.
It's an awesome video.
Also, a quick shout out to everyone at Sprocket.
I'm sorry for putting this picture on YouTube in nine months, but thank you for motivating and helping me basically rabbit hole on something I really like for a whole week.
So that was super cool.
And then finally, quick shout out, if you haven't already signed up for it, check out Crack the Con.
It's the password cracking competition here at CypherCon.
I know it's like 8 p.m., so it's not really too late.
It goes on until tomorrow.
But yeah, give it a shot.
Let me know if you end up on the leaderboard.
But yeah, that's about it.
Thank you.
I guess, yeah, does anyone have any questions?
Sorry.
All right.
Thank you.
Oh, yes, CEO and founder of Sprocket Security, Casey Camilleri.
My personal word lists or the company word lists?
Well, not everything I do.
That's true.
I crack passwords for fun.
A little bit, kind of like what I was saying before, this technique is more of like I'm rabbit holing on an idea rather than useful in the future.
I think just all right, how am I going to phrase this?
A lot of the clients that we hack, personally, I don't think have a large user base of people who speak a language that isn't supported by a Latin alphabet.
And like I said, HashMob already has, like, a billions-and-trillions-long word list, and if I was really desperate, I would just feed that word list in with the biggest rules that we had and just kind of, like, throw compute at the wall with it.
I think when I started cracking the specific Russian ones, that's where I was adding that to existing word lists.
And then what I would do is I would combine them, I would take passphrases, translate those, and try to, like, build those out that way.
So I haven't added it to like my actual workflow just because I'm not sure it'll be useful.
I kind of want to just throw our potfile at it again and see if we can identify any of those.
But I'm not sure that would be relevant for like, I don't know, our customers.
But in terms of, like, for-fun password cracking, we found passwords that weren't found before, which was cool.
That's something worth doing.
But the minute I submit those, those are going to get added into the gigantic HashMob word list anyway.
So I think this is more of like contributing to the community and it'll make its way into these large aggregated word lists anyway.
But still, yeah.
It's a very long-winded answer to like no.
That's my answer.
Yeah.
Yep.
Yeah.
So that's a big factor.
That blog that I linked about Chinese passwords, they go into a deep dive on how they phonetically create passwords using numbers, but based off of how those numbers sound when they speak it and like how that sounds when you speak it out loud.
Like the numbers one through ten are yī, èr, sān, sì, wǔ, liù, qī, bā, jiǔ, shí.
I think Joe's better at Mandarin than me.
Give me the thumbs up.
Thanks, buddy.
Yeah.
And so they, like, build passphrases using words that sound like those numbers.
I should have just added it to the slides.
I'll pull it up later.
But yeah, they have a really good graph showing, like, 19-character-long passwords made out of numbers, but it's technically, like, a phrase because it sounds like something else in Mandarin.
So that's the only specific case that I've personally seen and identified, but I'm sure there's way more.
The blog goes into it a lot more in depth.
Honestly, just looking at that on its own could be a blog in itself or it could be a whole talk in itself, but yeah.
So it's definitely something... something I didn't mention is, like, I would like to talk to people who crack passwords and whose native language is maybe Korean, because they would have just a way better understanding of this than I could get trying to dive into it myself.
Finding these sort of like sources of culturally relevant words, it's a skill on its own that comes with password cracking.
Like when you detect themes, where are you going to build out these custom word lists from?
You can like pull down books or like manuscripts or movies in different languages and then write like a script to generate passphrases off of those, creating like different combinations of that.
That's super popular for like password cracking competitions.
I don't know if I would go to that extent like in the real world trying to crack passwords for work, but most of this is all personal stuff anyway, not really work-related.
So I think that's something that I'll definitely have to dig into, but that means I'm gonna have to communicate more with native speakers.
That's not something I can just like figure out on my own, honestly.
Yeah, so here, let me go back real quick.
So here, I used Russian just because it was the easiest one.
In this case, you can see that maybe... yeah, that second one.
So their alphabet also supports, like, uppercase and lowercase here.
So that second password there has uppercase, lowercase, and numbers.
So that would meet like super weak password requirements.
There's no symbols or stuff there.
But yeah, that's a good case for Russian where that does happen.
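To make the Russian case concrete, here is a sketch that maps Cyrillic text to the keys pressed on a US QWERTY keyboard, assuming the standard Russian (ЙЦУКЕН) layout. Uppercase Cyrillic letters map to the shifted keys, so case survives the transliteration.

```python
# Lowercase Cyrillic letters in keyboard order, with the matching US keys.
CYRILLIC = "йцукенгшщзхъфывапролджэячсмитьбюё"
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,.`"
QWERTY_SHIFT = 'QWERTYUIOP{}ASDFGHJKL:"ZXCVBNM<>~'

LAYOUT = str.maketrans(CYRILLIC + CYRILLIC.upper(), QWERTY + QWERTY_SHIFT)


def russian_to_keys(text: str) -> str:
    """Return the QWERTY keys that would type `text` on a Russian layout."""
    return text.translate(LAYOUT)
```

For example, `russian_to_keys("Привет123")` gives `Ghbdtn123`, which has uppercase, lowercase, and numbers. Letters on punctuation keys (such as б, ю, ж, э, х, ъ) do produce symbols, so not every word comes out purely alphabetic.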
In terms of like Mandarin and Korean, I'm not sure.
I forget what the shift modifier does to each Hangul character.
I'm not sure.
That's something where like I wish I knew Korean, but I barely know the languages I'm trying to study.
So just not enough time in the world for that.
I'd go back in time and, like, try to learn when I was, like, five years old and a sponge.
But it's so hard to learn languages as an adult if you don't have like the attention span and dedication for it.
But yeah, so another long-winded answer to be like, no.
In Mandarin, that's one of the main reasons why they can't even use the native characters, because, like, how do you enforce uppercase requirements when your language doesn't even have uppercase in the first place?
And how do you support things like passphrases when a five-character password could be a 42-character sentence when you translate it back?
So all right.
Thank you guys so much.