Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image

Nan Jiang, Bangjie Sun, and Terence Sim, National University of Singapore; Jun Han, KAIST

Paper summary: Foice is a new deepfake attack that synthesizes a victim's voice from a single face image, with no voice sample of the victim, in order to fool voice authentication systems such as WeChat, Microsoft Azure, and Siri. It exploits the biological correlation between a person's face and voice, and because face images are easy to obtain from social media, the attack is both more scalable and more threatening. Foice trains its neural networks in the Train Phase and generates synthetic voices in the Attack Phase to bypass the authentication system.


Details

1. Train Phase

In the Train Phase, neural networks are trained to build a model that can extract voice features from a face image and synthesize speech from them. Pipeline: Data Processing → Face-dependent Voice Feature Extractor → Face-independent Voice Feature Generator

Data Processing

  • Description: Publicly available online datasets (VoxCeleb1, VoxCeleb2, etc.) are preprocessed.

    1. Face Processing:

      • Face Cropping & Normalization: The face is detected and cropped, then its size and orientation are normalized using the eye positions as reference.

      • Blurriness Assessment: Image sharpness is evaluated with Sobel edge detection, and blurry images with a quality score below 100 are excluded. Note, however, that Foice remains effective on low-resolution images (§5.5.3).

    2. Voice Processing:

      • A Speaker Encoder extracts a voice feature vector (F_GT) from each voice recording.

      • The NISQA model scores audio quality so that only high-quality recordings are kept. Their feature vectors serve as F_GT for training the Face-dependent Voice Feature Extractor and the Face-independent Voice Feature Generator (§5.2.2).

  • Corrections:

    • The original statement that "the feature vector obtained this way serves as the ground truth (f_gt) for the deep-learning model" is correct, but it should state specifically how F_GT is used in the Face-dependent Voice Feature Extractor and the Face-independent Voice Feature Generator.

    • Voice Processing involves not only feature vector extraction but also audio quality filtering (NISQA).

  • Improvements (supplementary):

    • The original suggestion (denoising the voice data) is appropriate, and it ties in with Foice's advantage of avoiding SV2TTS's sensitivity to noise (§5.2.1).

    • In addition, ensuring dataset diversity (e.g., a range of languages and accents) could improve the model's generalization.
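The blurriness filter described under Face Processing can be sketched as follows. This is a minimal illustration, not the authors' code: the summary only says that Sobel edge detection produces a quality score and that images scoring below 100 are dropped, so the score definition used here (variance of the Sobel gradient magnitudes) and the function names are assumptions.

```python
from statistics import pvariance

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitudes(gray):
    """Return the gradient magnitude at every interior pixel of a 2-D grid."""
    height, width = len(gray), len(gray[0])
    magnitudes = []
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = sum(SOBEL_X[i][j] * gray[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * gray[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            magnitudes.append((gx * gx + gy * gy) ** 0.5)
    return magnitudes


def sharpness_score(gray):
    """Assumed score: variance of Sobel magnitudes (higher means sharper)."""
    return pvariance(sobel_magnitudes(gray))


def is_sharp_enough(gray, threshold=100.0):
    """Keep an image only if its score reaches the threshold (100 in the notes)."""
    return sharpness_score(gray) >= threshold


# Toy inputs: a uniform image has no edges, a hard vertical edge has strong ones.
flat = [[128] * 6 for _ in range(6)]
edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
print(is_sharp_enough(flat), is_sharp_enough(edge))
```

A real pipeline would compute the gradients with an image library on the cropped face; the pure-Python loops here only keep the sketch dependency-free.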

Face-dependent Voice Feature Extractor

  • Description:

    • Voice features that correlate with the face (F_dep) are extracted from the face image (Img_face): F_dep = C_f->v(E_face(Img_face)).

    • E_face: A CNN such as ResNet that extracts voice-related facial features (F_face), including gender, age, and lip shape.

    • C_f->v: Converts F_face into voice features (F_dep).

    • Training minimizes a distance-based loss, Err(F_dep, F_GT), between F_dep and the ground-truth voice features (F_GT); E_face and C_f->v are trained jointly.

    • As a result, the extractor captures voice features that correlate with the face, such as gender (e.g., the lower pitch of male voices) and age (e.g., the thinner voice of older speakers) (Observation 1, §A.2.1).

    • Experimentally, Foice extracted gender information with over 90% accuracy (§A.2.1, Figure 17).

    • Specific facial features such as the lips and jawline influence pitch and formant frequencies (§5.6).
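The composition F_dep = C_f->v(E_face(Img_face)) and its training loss can be illustrated with a toy sketch. Everything concrete here is an assumption for illustration: the two networks are replaced by random linear maps, the dimensions are tiny, and mean squared error stands in for the unspecified distance Err.

```python
import random


def linear(weights, x):
    """Apply a bias-free dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error, used here as the assumed distance Err(., .)."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
IMG_DIM, FACE_DIM, VOICE_DIM = 8, 4, 3           # toy sizes, not the paper's
W_face = random_matrix(FACE_DIM, IMG_DIM, rng)   # stand-in for E_face (a CNN)
W_f2v = random_matrix(VOICE_DIM, FACE_DIM, rng)  # stand-in for C_f->v


def extract_f_dep(img_face):
    f_face = linear(W_face, img_face)   # F_face = E_face(Img_face)
    return linear(W_f2v, f_face)        # F_dep  = C_f->v(F_face)


img_face = [rng.random() for _ in range(IMG_DIM)]  # flattened toy "image"
f_gt = [rng.random() for _ in range(VOICE_DIM)]    # toy ground-truth voice vector
f_dep = extract_f_dep(img_face)
loss = mse(f_dep, f_gt)                            # Err(F_dep, F_GT)
print(loss)
```

Joint training means the gradient of this single loss updates both W_face and W_f2v, which is what lets the face encoder specialize in voice-relevant cues.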

Face-independent Voice Feature Generator

  • Description:

    • Generates the voice features that cannot be inferred from the face (F_indep) and combines them with F_dep to reconstruct the complete voice features (F_recon).

    • F_indep generation: The Bottleneck B(·) takes F_GT as input and outputs F_indep (F_indep = B(F_GT)). Because F_indep should contain only features unrelated to the face, such as the structure of the vocal cords and nasal cavity, the bottleneck dimension (optimal: 48) is tuned to filter out F_dep (Observation 2, §A.2.2).

    • F_recon generation: The Reconstructor R(·,·) combines F_indep and F_dep to produce F_recon (F_recon = R(F_indep, F_dep)).

    • Training: Two losses are minimized:

      1. Reconstruction error: Minimizes the difference between F_recon and F_GT, Err(F_recon, F_GT).

      2. KL-divergence loss: Pushes F_indep toward the standard normal distribution 𝒩(0, I), which yields a continuous search space. This space also covers the F_indep of victims who do not appear in the training data.
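The two training losses can be written down compactly. The sketch below assumes a VAE-style bottleneck that outputs a mean and a log-variance per dimension, which the notes do not state explicitly; under that assumption the KL term to 𝒩(0, I) has the well-known closed form used here. Mean squared error again stands in for Err, and kl_weight is a hypothetical balancing factor.

```python
import math


def reconstruction_error(f_recon, f_gt):
    """Assumed Err(F_recon, F_GT): mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(f_recon, f_gt)) / len(f_gt)


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def total_loss(f_recon, f_gt, mu, log_var, kl_weight=1.0):
    """Sum of the two objectives; kl_weight is a hypothetical knob."""
    return (reconstruction_error(f_recon, f_gt)
            + kl_weight * kl_to_standard_normal(mu, log_var))


# A posterior that already equals N(0, I) incurs no KL penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
```

The KL term is what makes sampling from 𝒩(0, I) meaningful at attack time: without it, the bottleneck outputs for training speakers could occupy an arbitrary region that random draws would rarely hit.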


2. Attack Phase

  • Description:

    • A single face image of the victim is used to attack the voice authentication system.

    • F_dep generation: The trained E_face and C_f->v extract F_dep from Img_face.

    • F_indep sampling: N values of F_indep are sampled at random from the standard normal distribution 𝒩(0, I).

    • F_recon generation: The Reconstructor combines F_dep with each F_indep to produce N vectors F_recon_i = R(F_indep_i, F_dep).

    • Voice synthesis: The Voice Synthesizer takes each F_recon_i and a text input (e.g., the WeChat verification digits or "Hey, Google!") and generates N synthetic voices.

    • Attack execution: The N voices are played one after another through an external loudspeaker to bypass the authentication system. The attack abuses low acceptance thresholds (0.5~0.6) and unlimited attempts (§2.2).

    • In experiments, WeChat Voiceprint was bypassed for all 10 participants, with 30% of the synthetic voices succeeding on average (§5.2.1).

    • The attack thus relies on weaknesses of the authentication systems themselves (low thresholds, unlimited attempts).
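The Attack Phase steps amount to a sample-and-try loop, sketched below. The control flow follows the steps listed in these notes; the reconstructor, synthesizer, and authenticator are hypothetical stand-ins passed in as callables (the toy versions exist only so the sketch runs), and the default dimension of 48 simply reuses the bottleneck dimension mentioned earlier.

```python
import random


def sample_f_indep(dim, rng):
    """Draw one face-independent vector from the standard normal N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def run_attack(f_dep, text, n_attempts, reconstruct, synthesize, authenticate,
               indep_dim=48, seed=0):
    """Try up to n_attempts synthetic voices.

    Returns the 1-based index of the first accepted attempt, or None if
    every attempt is rejected.
    """
    rng = random.Random(seed)
    for attempt in range(1, n_attempts + 1):
        f_indep = sample_f_indep(indep_dim, rng)  # F_indep_i ~ N(0, I)
        f_recon = reconstruct(f_indep, f_dep)     # F_recon_i = R(F_indep_i, F_dep)
        audio = synthesize(f_recon, text)         # Voice Synthesizer output
        if authenticate(audio):                   # present to the target system
            return attempt
    return None


# Hypothetical stand-ins so the loop can be exercised end to end.
def toy_reconstruct(f_indep, f_dep):
    return f_dep + f_indep                        # plain list concatenation


def toy_synthesize(f_recon, text):
    return (tuple(f_recon), text)


def make_toy_authenticator(accept_on_call):
    calls = {"n": 0}

    def authenticate(audio):
        calls["n"] += 1
        return calls["n"] == accept_on_call       # accept only the k-th try

    return authenticate


f_dep = [0.1, -0.2, 0.3]
result = run_attack(f_dep, "Hey, Google!", 10, toy_reconstruct, toy_synthesize,
                    make_toy_authenticator(accept_on_call=3))
print(result)
```

Because F_dep is fixed by the face image, only F_indep varies between attempts, so each attempt explores a different point of the face-independent search space.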


3. Why does the model work?

  • Description: Foice works for the following reasons:

    1. Face-voice correlation: Facial features (gender, age, lips, etc.) are biologically correlated with voice features (pitch, formant frequencies, etc.) (§2.3). The Face-dependent Voice Feature Extractor learns this correlation and extracts F_dep accurately. For example, the lips and jawline influence pitch and formant frequencies (§5.6, Figure 15). Experimentally, gender information was extracted with over 90% accuracy (§A.2.1).

    2. Generalization of the search space: Through the bottleneck, the Face-independent Voice Feature Generator makes F_indep follow a standard normal distribution, creating a continuous search space. This space also covers the voices of victims absent from the training data, and the larger N is, the closer the best F_recon gets to the real voice (§4.4, Observation 2). The optimal bottleneck dimension (48) minimizes the F_dep information that remains in F_indep (§A.2.2, Figure 18).

    3. Authentication system weaknesses: Low thresholds (0.5~0.6) and unlimited attempts raise Foice's success rate (§2.2).

    4. Robustness: Foice remains effective with low-resolution images, occluded faces, and a limit of 5 attempts (§5.5).

    • Experimental results: Foice bypassed WeChat (100% success), VGGVox (67.6%), and DeepSpeaker (87.7%) (§5.2).

    • When combined with SV2TTS, Foice tripled the success rate (§5.4, Figure 11).

    • It achieved a success rate 30 to 60 times higher than Face-TTS (§A.3, Table 4).


4. Experimental results

  • Targets: WeChat, Siri, Google Assistant, Bixby (on-device); Microsoft Azure, iFlytek, VGGVox, DeepSpeaker (cloud).

  • Results:

    • WeChat: Bypassed for all 10 participants, with a 30% average success rate (§5.2.1).

    • Cloud: VGGVox (67.6%), DeepSpeaker (87.7%); on Microsoft and iFlytek, comparable to or better than SV2TTS (§5.2.2).

    • Foice uses information beyond gender and age (pitch, formant frequencies) and reaches a 29.7% success rate, about 3 times that of the Descriptive Attack (9.7%) (§5.3).

    • The Augmentation Attack (Foice + SV2TTS) outperforms SV2TTS alone on every system (§5.4).

  • Robustness: Effective even with low resolution, face occlusion, and 5 attempts (§5.5). The lips and jawline have a large effect on pitch and formant frequencies (§5.6).
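To see why repeated attempts matter so much, a back-of-the-envelope calculation helps. It assumes, purely for illustration, that each synthetic voice succeeds independently with the same probability; the paper's measured figures do not depend on this assumption, and real attempts against one victim are likely correlated.

```python
def prob_at_least_one_success(p_single, n_attempts):
    """P(at least one of n independent attempts succeeds) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_single) ** n_attempts


# Illustrative inputs: the 30% average per-voice success rate quoted for
# WeChat and the 5-attempt limit from the robustness experiments.
print(round(prob_at_least_one_success(0.30, 5), 5))
```

Under this simplified model, a 30% per-voice success rate with 5 attempts gives 1 - 0.7^5, or about 0.83, and the value approaches 1 as the number of allowed attempts grows. That is the intuition behind treating unlimited attempts as a weakness.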


5. Improvements and ethical considerations

  • Improvements:

    • Strengthen denoising in the voice data preprocessing.

    • Improve accuracy by incorporating 3D face images or voice samples (§6).

    • Integrate deepfake detection with liveness detection (§6).

  • Ethical considerations:

    • IRB approval, protection of participant privacy, and anonymization/deletion of data (§A.4).

    • Responsible disclosure of the vulnerabilities to prompt vendors to fix them.

    • A warning about a new threat to voice authentication systems and a call to develop countermeasures (§8).
