Why Most Retail Machine Learning Trading Fails — Lessons from Marcos Lopez de Prado

A summary of "Advances in Financial Machine Learning" + "Machine Learning for Asset Managers" — the 5 reasons retail ML trading fails (overfitting, data leakage, multiple testing, non-IID samples, structural breaks) and what actually works.

In 2022 I spent six months learning Python, scikit-learn, and xgboost, then built an ML model to predict EUR/USD direction. The backtest showed a Sharpe of 3.8, and I felt like I had solved the market's puzzle.

I deployed it live with $2,000 and blew the account within 23 days.

The money was gone and my ego took a serious hit, so I sat down and read Marcos Lopez de Prado properly. Page after page, he described what I had done wrong: overfitting, data leakage, multiple testing, non-IID samples, structural breaks. I had failed on all five.

Lopez de Prado is a former head of machine learning at AQR ($140B AUM) and teaches at Cornell; he is not a YouTube guru. His book "Advances in Financial Machine Learning" is a textbook that quant funds read. In this piece I summarize the five reasons 95% of retail ML trading fails, the way a professional trader would understand them rather than an academic. I hope you won't lose $2,000 and six months the way I did.

Reason #1: Overfitting (Backtests That Look Too Good to Be True)

This is the first reason my ML model blew up in 2022. The backtest showed 200% ROI and a Sharpe of 3.8; live, it lost the account in 23 days. Classic overfitting:

The problem: an ML model has hundreds to thousands of parameters. On a small dataset (for example, 5 years of daily EUR/USD is about 1,300 rows), the model memorizes the data instead of learning a pattern, so the backtest shows 200% ROI and live trading returns 0%.

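The rule to remember, an old line usually attributed to the economist Ronald Coase and very much in the spirit of Lopez de Prado: "If you torture the data long enough, it will confess to anything"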

A common example: a Random Forest with 500 trees and 50 features on 1,000 rows of data is almost guaranteed to overfit.

How to fix it:
• Use fewer features than rows × 0.05 (the 5% rule)
• Walk-forward analysis with 30+ rolls
• Monte Carlo cross-validation rather than standard k-fold, because trading data is not IID (keep the splits time-ordered; see Reason #4)
• Set your prior to "my model blows up until proven otherwise"
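The walk-forward idea can be sketched with nothing but index arithmetic. This is a minimal illustration, not a method taken from the book: the function name `walk_forward_splits` and the window sizes (500 training bars, 25 test bars) are assumptions chosen so that roughly five years of daily data produces more than 30 rolls.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_rows: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges that roll forward through time.

    Each test window starts right after its training window, so the model
    is always evaluated on bars that come later than anything it was fit on.
    """
    start = 0
    while start + train_size + test_size <= n_rows:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # slide forward by one test window


# Roughly five years of daily bars (assumed sizes, for illustration only).
splits = list(walk_forward_splits(n_rows=1300, train_size=500, test_size=25))
print(len(splits))               # number of rolls
print(splits[0][1], splits[-1][1])  # first and last test windows
```

Fit the model on each `train` range, score it on the matching `test` range, and look at the spread of results across rolls rather than at one headline number.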

Reason #2: Data Leakage (Knowing the Future Without Realizing It)

This is the silent killer; I made this mistake three times myself. The backtest looks good because a feature quietly "knows the future", and the live account blows up:

The problem: a feature used to predict y at time t contains information from after t, so the backtest looks good but the model does not work live.

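Mistakes retail traders make often:
• Using an EMA(20) computed through today's close to predict today's price: leakage
• Normalizing the whole dataset (mean and std) before the train/test split: leaks the test distribution
• Backward-filling missing data, so that a later value fills an earlier day: leakage (a forward fill, which carries the last known value forward, is safe)
• Backfilling earnings or economic data with revised numbers that were published after the trade time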
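Two of these mistakes have simple leakage-safe counterparts. The sketch below is illustrative only; the helper names `lagged_ema` and `zscore_with_train_stats` are hypothetical, and the z-score helper assumes the training data is not constant.

```python
from statistics import mean, pstdev
from typing import List, Optional, Tuple


def lagged_ema(prices: List[float], span: int) -> List[Optional[float]]:
    """EMA feature for bar t built only from prices up to bar t-1."""
    alpha = 2.0 / (span + 1)
    feature: List[Optional[float]] = [None]  # nothing is known before bar 0
    ema = prices[0]
    for price in prices[1:]:
        feature.append(ema)                      # value available before this bar
        ema = alpha * price + (1 - alpha) * ema  # update only after the bar closes
    return feature


def zscore_with_train_stats(
    train: List[float], test: List[float]
) -> Tuple[List[float], List[float]]:
    """Scale both sets with mean/std estimated on the training set only."""
    mu, sigma = mean(train), pstdev(train)
    scaled_train = [(x - mu) / sigma for x in train]
    scaled_test = [(x - mu) / sigma for x in test]
    return scaled_train, scaled_test


prices = [1.0, 2.0, 3.0, 4.0]
print(lagged_ema(prices, span=3))
print(zscore_with_train_stats([1.0, 2.0, 3.0], [10.0]))
```

A quick sanity check for any feature: change the price at bar t and confirm the feature value at bar t does not move.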

The rule: every feature must be computable before the timestamp of y. For labeling trading data, use Lopez de Prado's "Triple-Barrier Method".
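As a rough illustration of triple-barrier labeling, here is a simplified sketch. It makes assumptions the book does not: the barriers are fixed percentages rather than scaled by volatility, only bar closes are checked, and a trade that reaches the time barrier is labeled 0. The function name and parameters are hypothetical.

```python
from typing import List


def triple_barrier_label(
    prices: List[float],
    entry: int,
    profit_take: float,
    stop_loss: float,
    max_hold: int,
) -> int:
    """Label one trade opened at bar `entry`.

    +1 if the upper (profit) barrier is touched first,
    -1 if the lower (stop) barrier is touched first,
     0 if neither is touched before the time barrier.
    """
    entry_price = prices[entry]
    last_bar = min(entry + max_hold, len(prices) - 1)
    for t in range(entry + 1, last_bar + 1):
        ret = prices[t] / entry_price - 1.0
        if ret >= profit_take:
            return 1
        if ret <= -stop_loss:
            return -1
    return 0


print(triple_barrier_label([100, 101, 103, 99], entry=0,
                           profit_take=0.02, stop_loss=0.02, max_hold=3))
```

Note that the label for bar `entry` depends on prices up to `max_hold` bars later, which is exactly why overlapping labels need special handling in cross-validation.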

A test: remove features one at a time. If the model still shows 50%+ profit in the backtest after a feature is removed, the leakage is in one of the remaining features.

Reason #3: Multiple Testing Bias (Try 1,000 Strategies, Find 50 That "Look Good")

This is a trap most traders fall into without noticing, me included. "Try 100 combinations until one looks good" guarantees overfitting:

The problem: any purely random strategy has a 5% probability of looking "significant" at p < 0.05. Try 1,000 strategies and about 50 will look profitable by chance alone.
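The arithmetic is easy to check by simulation. The sketch below generates 1,000 "strategies" whose daily returns are pure noise with zero mean, then counts how many pass a one-sided 5% test. The sample size (250 days), the volatility (1% per day), the seed, and the normal approximation to the t-distribution are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(7)  # fixed seed so the run is repeatable

N_STRATEGIES = 1000
N_DAYS = 250
# One-sided 5% threshold, using a normal approximation to the t-distribution.
CRITICAL_T = NormalDist().inv_cdf(0.95)


def t_stat(returns):
    """t-statistic of the mean return against a true mean of zero."""
    return mean(returns) / (stdev(returns) / len(returns) ** 0.5)


false_positives = 0
for _ in range(N_STRATEGIES):
    # A strategy with no edge at all: zero-mean noise.
    daily = [random.gauss(0.0, 0.01) for _ in range(N_DAYS)]
    if t_stat(daily) > CRITICAL_T:
        false_positives += 1

print(false_positives)          # strategies that "look" profitable
print(N_STRATEGIES * 0.05)      # what the 5% rate predicts
```

Every strategy counted here has zero true edge, so anything picked from that list is a false discovery by construction.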

Where retail traders go wrong: backtest 1,000 combinations of EMA/RSI/MACD, pick the top 5 with Sharpe > 2, trade them live, and 4 out of 5 fail within 3 months.

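The measure from Lopez de Prado and his co-authors: "Probability of Backtest Overfitting (PBO)", the estimated probability that the configuration that looked best in-sample lands below the median out-of-sample.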

The fixes:
• Bonferroni correction: divide the significance threshold by the number of tests
• Genuine out-of-sample tests (data you have never looked at until you are ready to deploy)
• "Deflated Sharpe Ratio": adjusts the Sharpe down for the number of trials you ran
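Both corrections fit in a few lines. The sketch below follows the Deflated Sharpe Ratio formula as I understand it from Bailey and Lopez de Prado's paper; treat it as an illustration rather than a reference implementation. Inputs are per-period (not annualized) Sharpe ratios, `trial_sharpe_var` (the variance of Sharpe ratios across all the trials you ran) is something you must record yourself, and the example numbers are assumptions. The result is a probability between 0 and 1, not a rescaled Sharpe.

```python
from math import e, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


def expected_max_sharpe(n_trials: int, trial_sharpe_var: float) -> float:
    """Best Sharpe expected from n_trials (>= 2) strategies with no skill."""
    z1 = STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    z2 = STD_NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    return sqrt(trial_sharpe_var) * ((1 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe_ratio(
    sharpe: float,
    n_obs: int,
    n_trials: int,
    trial_sharpe_var: float,
    skew: float = 0.0,
    kurt: float = 3.0,
) -> float:
    """Probability that the true Sharpe beats the best-of-n-trials benchmark."""
    benchmark = expected_max_sharpe(n_trials, trial_sharpe_var)
    denom = sqrt(1 - skew * sharpe + (kurt - 1) / 4 * sharpe ** 2)
    return STD_NORMAL.cdf((sharpe - benchmark) * sqrt(n_obs - 1) / denom)


daily_sharpe = 2.5 / sqrt(252)   # annualized 2.5 expressed per day
trial_var = 1.0 / 252            # assumed spread of daily Sharpes across trials
print(bonferroni_alpha(0.05, 200))
print(deflated_sharpe_ratio(daily_sharpe, n_obs=1300, n_trials=200,
                            trial_sharpe_var=trial_var))
```

The more variations you tried, the higher the benchmark the winning backtest has to clear, so keep an honest count of trials.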

Example: a strategy with a backtest Sharpe of 2.5 comes out at 0.8 after deflation, because the trader tried 200 variations before finding it.

Reason #4: Non-IID Samples (Trading Data Points Are Not Independent)

This one is technical but important. Every ML textbook assumes IID data, and trading data is not IID. Use standard k-fold and the numbers lie:

The problem: ML models assume samples are "Independent and Identically Distributed" (IID). Trading data is not: today's close depends on yesterday's close, volatility clusters, and returns are autocorrelated.

The consequences:
• K-fold cross-validation reports the wrong accuracy (an overestimate)
• A Sharpe ratio computed under the IID assumption overstates the real edge by 30-50%
• A backtest that randomly shuffles samples destroys the time-series structure
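To see how autocorrelation inflates an annualized Sharpe, here is a small sketch based on the adjustment in Andrew Lo's 2002 paper "The Statistics of Sharpe Ratios", which is not one of this article's sources. It assumes returns follow an AR(1) process with first-order autocorrelation `rho`; the function name is hypothetical and the value rho = 0.1 is only an example.

```python
from math import sqrt


def annualization_factor(periods: int, rho: float) -> float:
    """Factor that scales a per-period Sharpe to `periods` periods
    when returns have AR(1) autocorrelation `rho`.

    With rho = 0 this reduces to the familiar sqrt(periods).
    """
    weighted = sum((periods - k) * rho ** k for k in range(1, periods))
    return periods / sqrt(periods + 2 * weighted)


iid_factor = sqrt(252)
adjusted = annualization_factor(252, rho=0.1)
print(iid_factor, adjusted)
print(iid_factor / adjusted - 1)  # fractional overstatement from assuming IID
```

The size of the overstatement depends on how strong the autocorrelation is, so estimate `rho` from your own return series before trusting a sqrt(252) scaling.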

Lopez de Prado's solutions:
• "Purged K-Fold": drop training samples whose labels overlap the test fold and leave a gap (embargo) after it, so autocorrelation cannot leak across the split
• "Combinatorial Purged Cross-Validation" (CPCV): robust for time series
• Sample weights by "uniqueness": give less weight to overlapping events
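The purging idea can be shown with plain index lists. This is a minimal sketch, not the mlfinlab implementation: it assumes every sample's label looks a fixed `label_horizon` bars into the future, and the function name and parameters are hypothetical.

```python
from typing import Iterator, List, Tuple


def purged_kfold(
    n_samples: int, n_splits: int, label_horizon: int, embargo: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for contiguous test folds.

    Sample i is assumed to use information from bar i through bar
    i + label_horizon. Training samples whose span overlaps the test
    fold are purged, and `embargo` extra bars after the fold are dropped.
    """
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        test_start = k * fold_size
        # The last fold absorbs any remainder; test_end is exclusive.
        test_end = n_samples if k == n_splits - 1 else test_start + fold_size
        test = list(range(test_start, test_end))
        last_test_info = test_end - 1 + label_horizon
        train = [
            i for i in range(n_samples)
            if i + label_horizon < test_start    # label ends before the fold
            or i > last_test_info + embargo      # starts after fold + embargo
        ]
        yield train, test


folds = list(purged_kfold(n_samples=100, n_splits=5, label_horizon=3, embargo=2))
train, test = folds[1]
print(test[0], test[-1])
print([i for i in range(100) if i not in train and i not in test])
```

The second print shows the samples that belong to neither set: the bars purged just before the test fold and the bars purged or embargoed just after it.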

Tool: the mlfinlab library (Python), which implements all of these methods.

Reason #5: Structural Breaks (The Market Regime Changes and the Model Dies)

This is why the "AI strategy that made money for 5 years" died in 2020: COVID changed the regime. I saw several quant funds get hit:

The problem: an ML model trained on 2010-2020 data is deployed in 2024, when the market regime differs from the training period (post-COVID volatility, an AI sector bubble, a strong USD), and the model fails.

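Real examples:
• A carry-trade ML model that made money from 2010 to 2018 and blew up in the 2020 crash (the CHF flash crash of January 2015 was an earlier shock of the same kind)
• A mean-reversion model that made money from 2010 to 2020 and failed in 2020-2022 (a trending bull/bear regime)
• A volatility prediction model built on 2018-2019 that failed in March 2020 (a volatility regime shift)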

Detection:
• "Chow test": a statistical test for a structural break
• "CUSUM test": detects a regime shift in real time
• Rolling Sharpe ratio: a drop of more than 50% within 6 months suggests a likely regime change
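Two of these monitors are simple enough to sketch in plain Python. The CUSUM below is a basic one-sided cumulative-sum filter, not the formal CUSUM test statistic, and the baseline mean, threshold, and window are assumptions you would calibrate on your own strategy. All function names are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Optional


def rolling_sharpe(
    returns: List[float], window: int, periods_per_year: int = 252
) -> List[Optional[float]]:
    """Annualized Sharpe over a trailing window; None until the window fills."""
    out: List[Optional[float]] = [None] * (window - 1)
    for end in range(window, len(returns) + 1):
        chunk = returns[end - window:end]
        sd = stdev(chunk)
        out.append(mean(chunk) / sd * periods_per_year ** 0.5 if sd > 0 else None)
    return out


def sharpe_dropped(previous: float, current: float, drop: float = 0.5) -> bool:
    """True when a positive Sharpe has fallen by more than `drop` (50% default)."""
    return previous > 0 and current < previous * (1 - drop)


def cusum_break(
    returns: List[float], baseline_mean: float, threshold: float
) -> Optional[int]:
    """Index of the first bar where the cumulative shortfall versus the
    baseline mean exceeds `threshold`, or None if it never does."""
    shortfall = 0.0
    for t, r in enumerate(returns):
        shortfall = max(0.0, shortfall + (baseline_mean - r))
        if shortfall > threshold:
            return t
    return None


live = [0.01] * 10 + [-0.01] * 10   # toy series: the edge disappears at bar 10
print(cusum_break(live, baseline_mean=0.01, threshold=0.05))
print(sharpe_dropped(previous=2.0, current=0.8))
```

Set `baseline_mean` from the backtest or from an earlier live period, and pick a threshold loose enough that normal drawdowns do not trigger it.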

Adaptive approach:
• Re-train the model every quarter
• Use an ensemble of models trained on different regimes
• Add "regime features" (VIX level, term structure, sector dispersion) to the model

So What Kind of ML Trading Actually Works?

By now you may be thinking "so ML is useless?" Wrong. There are domains where ML works; they just aren't "predict the price directly":

Lopez de Prado is clear in his books that ML in trading works in these areas:

(1) Microstructure / order book modeling: predicting spread, volume, and liquidity at the tick level
• Used by HFT firms and market makers
• Requires tick data and low-latency infrastructure
• Not feasible for retail

(2) Risk management: VaR, CVaR, stress tests
• Use ML to predict tail risk and correlation breakdowns
• Implementable at the retail level
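Before adding any ML, the baseline risk numbers are worth computing directly. The sketch below is a plain historical (non-ML) estimate of VaR and CVaR from a list of returns; the function name, the 95% confidence level, and the simulated return series are assumptions for illustration.

```python
import random
from typing import List, Tuple


def historical_var_cvar(
    returns: List[float], confidence: float = 0.95
) -> Tuple[float, float]:
    """Return (VaR, CVaR) as positive loss fractions.

    VaR is the smallest loss inside the worst (1 - confidence) share of
    outcomes; CVaR is the average loss across that tail.
    """
    losses = sorted((-r for r in returns), reverse=True)  # worst loss first
    n_tail = max(1, round((1 - confidence) * len(losses)))
    tail = losses[:n_tail]
    var = tail[-1]
    cvar = sum(tail) / len(tail)
    return var, cvar


random.seed(1)
# Simulated daily returns, used only to have something to measure.
daily_returns = [random.gauss(0.0005, 0.01) for _ in range(1000)]
var_95, cvar_95 = historical_var_cvar(daily_returns, confidence=0.95)
print(var_95, cvar_95)
```

CVaR is never smaller than VaR at the same confidence level, and the gap between the two tells you how heavy the tail is.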

(3) Feature engineering: building new alpha factors
• For example, sentiment from earnings calls, regime classification, cross-asset signals
• Use ML to find the features, then trade them with ordinary rule-based logic (not end-to-end ML)

(4) Portfolio optimization: Hierarchical Risk Parity (Lopez de Prado's own work)
• Replaces Markowitz mean-variance optimization, which is unstable
• Robust for portfolios of 10-100 assets

Where ML does not work:
• Direct price prediction for liquid assets (FX, large-cap stocks, BTC)
• Making money from simple features everyone already knows (RSI, MACD, EMA crosses)

A Roadmap for Traders Who Really Want to Learn ML

If you still want to pursue ML, there is a realistic roadmap. It is not "study for a month and get rich"; it is 18-24 months of real commitment:

Before you start on ML, you need:
• Proficiency in Python, pandas, and numpy (3-6 months of learning)
• Statistics 101 (regression, hypothesis testing, distributions)
• A trading edge that is already proven (don't start from ML)

Phase 1 (3-6 months): read "Advances in Financial Machine Learning" chapters 1-7 and implement the code for every chapter

Phase 2 (6-12 months): use the mlfinlab library to implement the Triple-Barrier Method, Purged K-Fold, and Meta-Labeling

Phase 3 (12+ months): build a strategy that genuinely passes robustness tests, then paper trade for 6 months and run a live demo for 6 months before going live with real capital

A warning: traders who jump into ML without a foundation end up the same way 95% of the time, with one to two years lost and no profitable strategy. Don't use ML as a substitute for trading skill; use it as a tool once you have the skill.

Frequently Asked Questions

Do you need a PhD to do ML trading?

No, but you need a foundation in statistics, programming, and trading. A PhD helps if you want to work at a quant fund (Renaissance, Two Sigma), but for an ordinary retail trader who wants to use ML, master's-level math, both Lopez de Prado books read cover to cover, and two years of practice are enough.

Can you use ChatGPT or Claude in place of an ML model?

No, but you can use them to augment one. An LLM is useful for feature engineering ideas, code review, and paper summarization, but it is not a replacement for a statistical model. See our separate article "ChatGPT/Claude for Traders: 12 Prompts".

Which libraries do you recommend for ML trading?

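**`mlfinlab`** (Python) for Lopez de Prado's methods. **`scikit-learn`**, **`xgboost`**, and **`lightgbm`** for models. **`backtesting.py`** or **`vectorbt`** for backtesting. **`optuna`** for hyperparameter tuning. Most of these are free and open source; check the current license terms of `mlfinlab` before relying on it, since its distribution model has changed over time.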

Are there Thai-language courses that teach ML trading?

**Coursera**: the Yuan + Stanford "Machine Learning for Trading" course, in English. Our own **คอร์สเทรดตามวาฬ** ("trade with the whales" course), in Thai, does not teach ML in depth but teaches a trading framework that ML can augment well (SMC × ICT). See /courses

What is a realistic ROI for ML trading done well?

AQR, Two Sigma, and Renaissance: roughly 15-30% ROI per year (Sharpe 2-4) on billions of dollars of AUM. Retail traders who do it well: 5-15% ROI per year (Sharpe 1-1.5), which at the upper end beats the S&P 500 over the long run. Retail traders who claim 100%+ ROI are mostly overfit and have no live track record longer than 12 months.

References

Marcos Lopez de Prado · "Advances in Financial Machine Learning" (Wiley 2018)
Marcos Lopez de Prado · "Machine Learning for Asset Managers" (Cambridge UP 2020)
mlfinlab Python library
Bailey, Borwein, Lopez de Prado, Zhu · "The Probability of Backtest Overfitting" (2014)
Cornell ORIE 5751 syllabus — ML in Finance