MCP 편하다고 막 써도 괜찮을까? | Is It Really Okay to Use MCP Just Because It's Convenient?

쟈 미 2025. 4. 24. 01:14

728x90

LLM 정말 핫하긴하다. ~~근데 그래서 개발자 못하려나 걱정이 있다.~~
최근엔 chatgpt, cluad, perprexity 필요에 적극적으로 업무에도 활용하고 공부에도 정말 도움을 많이 받고있다.
Junie, Copliot도 코드 짤때 정말 적극 활용하고 있는 요즘이다.

실제로 linux script 실행할때나 간단한 script 코드들 짤 때. 생산성이 정말 많이 올라갔다.
예를들면 log format이 이 형태인데 grep으로 이 포맷에서 이 필드를 가진 로그가 총 몇개인지, unique 값은 몇개인지 전체 log row 중에서의 비율은 몇개인지 간단한 한줄짜리 linux command 알려달라고 할 때 일회성으로 생각없이 쓰게되는 것 같다.
전반적인 구조를 고려해서 짜야하는 코드는 아직 잘 모르겠다. 구조를 고려한건 아무래도 Junie가 잘 해주는것 같긴한데 그래도 결국 실무 코드에서는 실무자가 배포 부담을 져야하니 쉽지않다.

여튼 이런식으로 그동안은 써보기만하다가 이제는 슬슬 이것들의 동작원리나 조심해서 써야하는 부분들을 찾아봐야하려나 하는 고민이 생겼다. mcp의 등장이후로 token 연동해서 외부 api를 (mcp server) llm으로 활용하는 경우도 점점늘어나고 있어서 그렇다. 특이나 아래글들을 읽고 좀 알아봐야겠다는 생각이 들었는데

llm으로 인해 서버비가 너무 많이나온 개발자의 linkedIn 글

어느 날 웹 서버비가 많이 나왔어요. DDOS인 줄 알고 허겁지겁 가장 큰 트래픽 IP들 방화벽으로 차

어느 날 웹 서버비가 많이 나왔어요. DDOS인 줄 알고 허겁지겁 가장 큰 트래픽 IP들 방화벽으로 차단했는데요. 가만히 살펴보니 User-agent에 claudebot geminibot openai ... 라고 쓰여있네요. 마냥 접속을 허

kr.linkedin.com

mcp 보안에 대한 geek news 뉴스레터 글

MCP에서 발생할 수 있는 모든 문제들 | GeekNews

MCP는 LLM 기반 에이전트에 외부 도구 및 데이터를 통합하는 실질적 표준으로 빠르게 자리잡음보안, UX, LLM 신뢰성 문제 등 다양한 잠재적 취약점과 한계가 존재함프로토콜 자체의 설계와 인증 방

news.hada.io

이제 얕게라도 좀 알아야될때가 됐다. mcp에 대해 찾아보고 나서의 생각을 적어본것이기 때문에, 부정확할 수 있다.
더 알아야할 것들이나 정정이 필요하다면 댓글로..

1. MCP가 뭘까

https://modelcontextprotocol.io/introduction

Introduction - Model Context Protocol

Understand how MCP connects clients, servers, and LLMs

modelcontextprotocol.io

내생각엔 그동안 http api, tcp 등으로 통신규약을 정의해서 서버의 요청이나 응답 등으로 서비스를 제공했다면
이제 통신규약이 아니라 지정해둔 llm 키워드로 서비스를 제공하는 방식으로 세상이 변하고 있구나를 느꼈다.

만약 원하는게 github에서 내가 원하는 repo의 issue를 가져오는게 목표다 하면 그동안은
github에서 제공하는 http api 규약을 한땀한땀 맞춰서 아래와같이 요청포맷을 그들이 원하는대로 직접 넣어줬었다면.

curl -L \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer <YOUR-TOKEN>" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  https://api.github.com/repos/OWNER/REPO/issues

mcp를 사용하면 그냥 아래와 같은 prompt를 입력하면 mcp server가 위 api를 매핑해서 그 응답을 잘 내려주는 방식인 것이다.

gem-api repository의 첫번째 issue가 뭔지 알려줘.

실제로 github mcp server 구현을 보면 우리가 @Controller를 이용해서 endpoint를 뚫듯이 mcp server가 매핑할때 참고할만한 description을 추가해서 mcp server의 endpoint를 뚫은 모양새와 같다

https://github.com/modelcontextprotocol/servers/blob/main/src/github/index.ts

servers/src/github/index.ts at main · modelcontextprotocol/servers

Model Context Protocol Servers. Contribute to modelcontextprotocol/servers development by creating an account on GitHub.

github.com

   {
        name: "get_issue",
        description: "Get details of a specific issue in a GitHub repository.",
        inputSchema: zodToJsonSchema(issues.GetIssueSchema)
      },

실제로 안의 inputSchema의 내용을 따라가면 github api 호출을 하고있음을 알 수 있다.
결국 mcp는 llm이 사용하기 위한 @Controller를 하나 뚫어둔거라고 생각하면 된다.
어떻게? description과 name을 적당히 자연어로 잘 적어서

그래서 이제 llm + mcp를 사용하게되면 서버 프로그래밍 상으로 여러 api요청을 연쇄적으로 그때그때 인자값을 열심히 연결해서 코딩해서 넣던걸 자연어로 원하는 응답을 받을 수 있다는 장점이 생기게 된다.

요구사항이 아래와 같다고하자.

내가 가진 GitHub repository 중에 star가 가장 많은 걸 알려줘.
그리고 그 repository의 최근 커밋 수랑 contributor 수, issue 개수도 알려줘.

예전에 코딩으로 이 요구사항을 해결해야했으면
아래와 같은 수도코드를 작성하기 위해 api 명세를 확인하고.. 틀린지 아닌지 확인하고 올바른 dto 매핑인지 살펴보고 등등 귀찮았다.
사실 아래의 수도코드로는 위에 있는 요구사항을 전부 해결할 수 없다. (더 해야한다)

# 기존 방식
import requests

headers = {
    "Authorization": "Bearer <MY_TOKEN>",
    "Accept": "application/vnd.github+json"
}

# 1. 내 전체 repo 가져오기
repos = requests.get("https://api.github.com/user/repos", headers=headers).json()

# 2. 가장 star 많은 repo 찾기
top_repo = max(repos, key=lambda r: r["stargazers_count"])

# 3. 커밋 정보 가져오기
commits = requests.get(f"https://api.github.com/repos/{top_repo['full_name']}/commits", headers=headers).json()

# 4. 통계 출력
print(f"{top_repo['name']}의 커밋 수: {len(commits)}")

근데 이제 llm과 함께 mcp를 사용하게 되면 그냥 저 요구사항을 입력하면 된다.

이 요구사항을 만족하기위해 필요한 mcp server description을 알아서 판별하고 알아서 인자값을 넣어서 github api 를 호출한다.
실제로 저기 블록에 있는 search_repositories 가 호출한 mcp server 프로토콜 명을 뜻한다.

{
    name: "search_repositories",
    description: "Search for GitHub repositories",
    inputSchema: zodToJsonSchema(repository.SearchRepositoriesSchema),
  },
   case "search_repositories": {
    const args = repository.SearchRepositoriesSchema.parse(request.params.arguments);
    const results = await repository.searchRepositories(
      args.query,
      args.page,
      args.perPage
    );
    return {
      content: [{ type: "text", text: JSON.stringify(results, null, 2) }],
    };
  }

결국 자연어에서 어떤 api를 써야하는지 찾기위한 힌트를 적기만해도 api endpoint가 뚫리는게 MCP이다

근데 이 작은 요구사항을 해결하려고 llm은 api 콜을 9개나 썼는데, 정말 이렇게까지 많이 필요한건가?
엄청 많이 하는거아닌가? 사실 개발자가 직접 코딩을 했다면 이렇게까지 많은 api를 썼을까? 이런 생각이 든다.
~~(근데 편하긴하다)~~

예전 방식은 내가 어떤 API를 호출하고 있는지, 어떤 데이터를 어디로 보내고 있는지를 내가 다 컨트롤할 수 있었다.
MCP 방식은 내 의도를 파악한 LLM과 MCP 서버가 대신 처리해주는 구조이기 때문에, 내가 뭘 보내고 있는지 명확히 보이지 않을 수도 있다.

지금까지 설명한 이 흐름이 mcp 문서에서 설명한 architecture의 MCP Server C <-> Remote Service C 부분이다.
이걸 이해했다면 local data source에 대한것도 금방이해하리라 본다.

2. LLM + MCP가 만들어내는 보이지않는 API Call 폭발

위와같이 실제로 MCP를 통해 LLM이 API를 호출하는 과정을 추적해보면, 단일 프롬프트가 여러 개의 API 호출로 이어지는 경우를 확인할 수 있었다. 이러한 호출은 로그나 네트워크 트래픽을 분석하여 파악할 수 있으며, 예상보다 많은 호출이 발생함을 알 수 있었다.

그렇다면 기존에 서비스들이 본인들이 제공하던 open api에 더불어 mcp server 제공하게되면? 본인 서비스의 호출이 증가하게 되고
llm + mcp가 만들어내는 트래픽까지 감당해야하게 되면서 결국 서버 프로그래머들의 대규모 트래픽 관리 능력이 더더욱 중요해지는게 아닐까? ~~(희망회로..)~~

한편으로는 api 호출수로 과금을 하는. 서비스라면 mcp server 호출을 유도해서 돈을 아주 잘 벌 수 있게 되겠지 싶기도 하다.

1. 캐싱전략

a. mcp server inmemory caching

LLM이 동일한 질문을 여러 번 할 수 있고, API 응답은 보통 몇 초 단위로 바뀌지 않기 때문에
응답 결과를 캐싱해두면 서버 부하를 많이 줄일 수 있을 것으로 예상한다.
이때 mcp server는 본인의 local에 있다는 점을 잘 활용하면 remote service까지 가지 않게 트래픽을 조절할 수 있다.
remote service 입장에서는 사실 기존의 클라이언트에서 local storage에 정보를 가지고 서버에 api를 호출하지 않는것과 같은 맥락

import express from "express"
import NodeCache from "node-cache" //가볍고 직관적인 in-memory 캐시 라이브러리야. TTL 기반으로 자동 만료
import axios from "axios"

const app = express()
const cache = new NodeCache({ stdTTL: 300 }) // 기본 TTL 5분

app.get("/commits/:owner/:repo", async (req, res) => {
  const { owner, repo } = req.params
  const cacheKey = `commits:${owner}/${repo}`

  // 1. 캐시에 있으면 리턴
  const cached = cache.get(cacheKey)
  if (cached) {
    console.log(`[CACHE HIT] ${cacheKey}`)
    return res.json(cached)
  }

  // 2. 외부 API 호출
  const response = await axios.get(
    `https://api.github.com/repos/${owner}/${repo}/commits`,
    {
      headers: {
        Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
        Accept: "application/vnd.github+json"
      }
    }
  )

  const data = response.data

  // 3. 캐시에 저장
  cache.set(cacheKey, data)

  console.log(`[CACHE MISS] ${cacheKey} - 저장 완료`)
  res.json(data)
})

위와 같은 코드로 api를 호출할때 caching 해두는 것 처럼 내가 만든 mcp서버가 외부 api 를 호출하는 서버라면 이 전략을 사용해서 외부 api 호출량을 줄이는 방법이 있을 것으로 보인다.

다만 이렇게 했을때 client에서 "내용이 부정확하다", "잘못된 내용으로 보인다", 등의 프롬프트가 있다면 cache reset 하고 직접 api에 호출한다던지 전략이 필요해보인다.

b. prompt caching / semantic caching

LLM에게 동일한 프롬프트를 반복해서 보냈을 때, 매번 새롭게 생각(=토큰 소모)하지 않도록, 이전 응답을 미리 캐시해두는 방식

“We do not currently cache prompts on our side. However, we recommend client-side caching if you’d like to avoid resending the same prompt multiple times.”

https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#continuing-a-multi-turn-conversation

Prompt caching - Anthropic

Large context caching example This example demonstrates basic prompt caching usage, caching the full text of the legal agreement as a prefix while keeping the user instruction uncached. For the first request: input_tokens: Number of tokens in the user mess

docs.anthropic.com

mcp client라고 볼수 있는 claud가 제공하고 있는 방식이다. claude나 OpenAI 같은 LLM Provider는 사실상 MCP의 client 역할을 하고 있고, 결국 client 입장에서는 llm 사용요금과도 연결되는 (돈을 아끼면서 llm을 쓰고싶은..) 부분이라서 공식적으로 지원하고 있는것으로 보인다.

요약하면 claud 사용시 아래와 내용을 추가하면 prompt cache가 활성화 된다는 이야기이다.

"cache_control": {"type": "ephemeral"}

실제로 model 로 부터 응답을 받는데 더 작은 시간이 소요된다는 예시는 아래에 있다. Example1의 non-cached api call과 cached api call을 비교하면 20s > 2s 로 많이 줄어들었음을 확인할 수 있다.

https://github.com/anthropics/anthropic-cookbook/blob/main/misc/prompt_caching.ipynb

anthropic-cookbook/misc/prompt_caching.ipynb at main · anthropics/anthropic-cookbook

A collection of notebooks/recipes showcasing some fun and effective ways of using Claude. - anthropics/anthropic-cookbook

github.com

Example2에서 응답 시간은 초기 캐시 설정 후 거의 24초에서 단 7-11초로 단축되었고, 응답 전반에 걸쳐 동일한 수준의 품질을 유지한다고한다. 7~11초의 이유는 대부분은 응답을 생성하는 데 걸리는 시간 때문이며, 캐시 breakpoints를 계속 조정하면서 입력 토큰의 거의 100%가 이후에 캐시되었기 때문에, 사용자 메시지를 거의 즉시 읽을 수 있었다고한다.

prompt_caching을 사용하면 mcp server가 효율적이게 될까? 라고하면 그건 또 상황에 따라 다르다.

1. MCP 서버가 단순 API bridge역할만 하고있다면

외부 api 응답 자체를 mcp 서버 내부에서 캐싱하고있는 것이 훨씬 효율적이다. 왜냐면 prompt를 안쓰니까.
즉, MCP 서버가 단순 API bridge역할만 하고있다면 1번과 같이 api요청에 대한 inmemory caching이 더 효과적이다.

2. mcp 서버가 여러가지 역할을 하고있다면?

지금까지 알아본 prompt caching이 효율적이려면 mcp server가 LLM prompt 결과생성까지 담당하는 구조일 때만 효율적이다.

사용자 → LLM 프롬프트 구성 → 외부 API 호출 → 응답 생성 → LLM에 전달

mcp 서버가 중간 로직과 응답 조합까지 처리하는 경우라면, 같은 프롬프트에 대해 응답을 만들 수 있기때문에 mcp 자체에서 캐싱할 수 있다.

이때 같은 프롬프트에 대한 캐싱만 아니라 의미상 비슷한 내용을 캐싱하기 위해 semantic caching을 이용하는 방법도 있는걸로 안다.
의미적 유사도를 계산하여 vector화 시키고 이것을 임베딩한다. 새로운 입력이들어왔을때 이 입력을 마찬가지로 vector화시키고 임베딩된 데이터와 유사하다면 그 응답을 반환하는 방법이라고 알고있다. ~~근데 직접 한다고 생각하면 머리아프다 그만알고싶다~~

여튼 말하고자 하는 바는 기존의 remote server api 제공자(지금의 서버개발자들)가 mcp server까지 제공하게된다면 어떤 캐싱 전략을 취하는지도 중요한 시대가 되어버렸다.
기존의 remote server 단 캐싱을 믿고 몰려드는 트래픽을 멋진 서버구조로 해결하겠어! 라는 마음가짐이 아니라
제공하는 mcp server 단에도 inmemory caching을 달아서 remote server에 몰리는 트래픽을 줄이는 방법을 고려해야한다.

근데 생각해보면 remote server 단 api 호출 수로 유저가 과금하게 만드는 구조라면 일부러 mcp server에 캐싱을 안 달 것 같기도하다.
유저입장에선 api call bridge 역할의 mcp server들의 호출들을 전부 caching해주는 caching mcp server를 사용하는게 나을 수도

2. 요청 제한 설정

위에 말했듯이. MCP를 쓰기 시작하면서, LLM이 단순히 한 줄 프롬프트만 받아들이는 게 아니라, 그 프롬프트를 해석해서 여러 개의 외부 API를 한꺼번에 호출하기 시작한다는 점이었다.

예전에는 사용자가 직접 API를 호출했기 때문에 “한 번에 몇 개 요청 보낼지”, “실행 시간이 얼마일지”를 어느 정도 예측할 수 있었다.
하지만 LLM은 한 문장의 목적을 이루기 위해 5개, 10개 넘는 요청을 연쇄적으로 호출할 수도 있다.

a. rate limiting

문제는 기존 전통적인 remote server api들은 rate limiting 제한이 있다. 1초에 3개이상의 요청을 보내지 말라는 등의 요구사항으로.
고로 mcp server에서 api 콜을 보낼 때 rate limiting을 고려해야한다. ( 기존 전통적인 client들에서 고민하던 것들을 mcp server에 녹이는 느낌이 든다)

https://github.com/jwaxman19/qlik-mcp/blob/main/src/index.ts

qlik-mcp/src/index.ts at main · jwaxman19/qlik-mcp

An MCP server to run qlik. Contribute to jwaxman19/qlik-mcp development by creating an account on GitHub.

github.com

실제로 위 mcp서버는 Qlik Cloud API를 사용해서 시각화하는 목적을 갖고있는데, 실제 호출부의 코드를 보면 rate limiting 적용을 위해 delay를 적용해둔 걸 확인할 수 있었다.

   const data = await withRetry(async () => chartObject.getHyperCubeData('/qHyperCubeDef', [{
          qTop: startRow,
          qLeft: 0,
          qWidth: metadata.totalColumns,
          qHeight: rowCount
        }]));

        if (data?.[0]?.qMatrix) {
          allData.push(...data[0].qMatrix);
        }

        // Add delay between chunks to avoid rate limiting
        if (startRow + pageSize < rowsToFetch) {
          await delay(REQUEST_DELAY_MS);
        }

페이지네이션 하는 forloop 안에 rate limiting 코드가 들어있었음.

외에도 고려하면 좋을 것들로

b. timeout

mcp server에서 외부 api를 계속 호출하는데 응답이 너무 느리게 오는 상황이라면 일부러 강제종료를 시켜서 다른 mcp tools를 이용하여 llm 이 결과를 낼수록 유도하기 때문에 timeout 설정도 잘해주는게 좋다.

c. 병렬처리 제한.

llm이 mcp tools를 이용하여 병렬로 여러 요청을 날리면 그만큼 remote server에 영향이 커지게 된다. a에서의 ratelimiting을 건다고해도 한개의 api요청에 대해서만 ratelimiting이 걸리게하는 방식으로 코드를 작성한걸 볼 수 있다.
그러나 mcp는 동시에 여러개의 tools를 사용하여 api 요청을 하게할 수 있으니 tools를 동시에 여러개 실행하게 되면 remote server에 부하가 동시에 몰릴 수도 있게되는 상황이다.

고로 java 기준은 api호출시 ExecutorService를 이용해서 고정된 쓰레드 풀로 병렬작업을 실행하도록 병렬처리 작업개수를 조절한다거나 하는 방법을 이용하는 것이다.

d. circuit breaker

나의 remote server가 죽었는데도 llm으로 인해 계속 mcp가 retry를 하게된다면? remote server에 오히려 요청이 몰리면서 c에 해둔 병렬처리 제한이 같이 걸려있다면 오히려 리소스를 사용하지 못하는 상황이 될 수 있다. 이런 상황을 막기위해 일정 횟수 이상 실패시 api 호출을 차단하는 로직들이 필요할 수 있다.

결국 써놓고 보니 mcp server를 구현하는 것은 server와 client를 동시에 제공하는것과 같은느낌이 들지 않는가? mcp server를 기존시스템에 녹여서 사용하기 위해서는 기존에 client단에서 성능을 올리기 위한 여러 트릭들을 mcp server에 적용하면 되는 느낌이다.

3. 보안

제일 무섭다.

4. 기술은 진화하지만, 본질은 크게 다르지 않다

llm이 나오고 나서 “이제 개발자는 할 일 없어지는 거 아닌가?“라는 얘기를 자주 듣는다.
우선 mcp 자체만 놓고봤을 땐, 새로운 형태의 api 프로토콜일 뿐이다. api 요청이 더 자연어에 가까워졌을 뿐

그래서 프론트에서 들어오는 요청이 자연어가 되었다고해서 그걸 처리하는 서버의 역할까지 사라지는건 아니다. 오히려 유저 요청을 더. 편하게 쓸 수 있게되었다는 점이고.

결국 서비스를 만들기 위해서는 여전히 특정 플로우를 설계해야 하고, 보안과 성능을 고려해서 캐싱도 걸고, 트래픽도 분산해야 한다.
이건 예전에도 개발자가 하던 일이었다.

이전에 pc만 쓰던시대에서 mobile도 쓰는 시대로 넘어갈때, 원래도 서버라는 개념이 있었다. 다만 mobile로 넘어가면서 그 서버들이 여러 환경에서 요청을 받을수있고 접근이 쉬워졌고 그러면서 서버에서 처리해야할 요청량들이 엄청나게 많아졌다. 따라서 서버에서 이런 요청을 처리하기 위해 많은 기존의 서버개발자들이 머리를 싸매 성능향상을 위해 여러 방법론을 제안하고 기존의 개념들을 활용한 아키텍쳐가 발생하게 된것이 아닌가?

이제 mobile app을 쓰던 시대에서 llm으로 서비스를 제공받는 시대로 넘어감에 따라서. 이전과 거의 비슷하다. 이전과 같이 유저의 서버 요청이 더 쉬워짐에 따라서 서버는 성능향상에 더 몰두하게 될 것이고, 기존의 여러 client, server 통신, 보안등에 대해서 기존의 개념들을 활용한 아키텍처가 생기고 또 서버 성능을 끌어올리기위한 노력들이 더더욱 생길 것 같다.

그래서 개발자가 사라지는게 아니라 오히려 이런 부분을 채워줄 수 있는 개발자로 나아가야할 것 같다.
결국 기존 기술들의 개념을 잘 이해하고 있는 개발자들이 LLM 시대에도 더 필요한 역할을 맡게 되지 않을까?
그래서 결국 개발공부는 해야할것 같다는 결론이 나버렸다..

끗

근데 난 gpt 로 블로그 글은 못쓰겠다. 얘가 써주는 내용은 너무 오글거림

LLMs are really blowing up. ~~But honestly, I'm a bit worried about whether developers will become obsolete.~~
Lately, I've been actively using ChatGPT, Claude, and Perplexity for work and studying — they've been incredibly helpful.
These days, I'm also heavily using Junie and Copilot when writing code.

My productivity has genuinely skyrocketed, especially when running Linux scripts or writing quick script code.
For example, when I have a log format like this and I need a one-liner Linux command to grep for how many logs have a certain field in that format, how many unique values there are, and what percentage of total log rows they represent — I just mindlessly ask for it and use it as a throwaway thing.
For code that requires thinking about the overall architecture, I'm still not so sure. Junie seems to handle structural considerations pretty well, but at the end of the day, in production code, the developer has to bear the deployment risk, so it's not that simple.

Anyway, up until now I've just been casually using these tools, but I'm starting to think it's time to look into how they actually work and what to watch out for. Especially since the arrival of MCP has led to more and more cases where people connect tokens to use external APIs (MCP servers) through LLMs. In particular, reading the articles below made me think I should dig into this a bit more.

A developer's LinkedIn post about server costs skyrocketing because of LLMs

One day, the web server bill was way too high. Thinking it was a DDOS attack, I frantically started blocking the top traffic IPs with the firewall

One day, the web server bill was way too high. Thinking it was a DDOS attack, I frantically blocked the top traffic IPs with the firewall. But when I looked closely, the User-agent said claudebot geminibot openai ... Just blindly allowing access

kr.linkedin.com

A GeekNews newsletter article about MCP security

All the Problems That Can Occur with MCP | GeekNews

MCP is rapidly becoming the de facto standard for integrating external tools and data into LLM-based agents. Various potential vulnerabilities and limitations exist, including security, UX, and LLM reliability issues. The protocol's own design and authentication approach

news.hada.io

I think it's time to learn at least the basics now. This is written after looking into MCP, so it might not be entirely accurate.
If there's anything that needs correcting or more research, let me know in the comments.

1. What Is MCP?

https://modelcontextprotocol.io/introduction

Introduction - Model Context Protocol

Understand how MCP connects clients, servers, and LLMs

modelcontextprotocol.io

The way I see it, until now we've been providing services through communication protocols like HTTP APIs, TCP, etc., defining request and response formats.
But now, the world is shifting toward providing services not through communication protocols, but through designated LLM keywords.

Say your goal is to fetch issues from a specific repo on GitHub. Previously,
you'd have to manually match the HTTP API specifications that GitHub provides, carefully crafting the request format exactly how they want it, like this:

curl -L \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: Bearer <YOUR-TOKEN>" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  https://api.github.com/repos/OWNER/REPO/issues

With MCP, you just type a prompt like the one below, and the MCP server maps it to the API above and returns the response nicely for you.

gem-api repository의 첫번째 issue가 뭔지 알려줘.

If you actually look at the GitHub MCP server implementation, it's structured similarly to how we expose endpoints using @Controller — the MCP server adds descriptions that it can reference for mapping, essentially opening up MCP server endpoints.

https://github.com/modelcontextprotocol/servers/blob/main/src/github/index.ts

servers/src/github/index.ts at main · modelcontextprotocol/servers

Model Context Protocol Servers. Contribute to modelcontextprotocol/servers development by creating an account on GitHub.

github.com

   {
        name: "get_issue",
        description: "Get details of a specific issue in a GitHub repository.",
        inputSchema: zodToJsonSchema(issues.GetIssueSchema)
      },

If you follow the inputSchema inside, you can see that it's actually making GitHub API calls under the hood.
In the end, you can think of MCP as opening up a @Controller for the LLM to use.
How? By writing the description and name appropriately in natural language.

So when you use LLM + MCP, you gain the advantage of receiving the responses you want in natural language, instead of having to chain multiple API requests together in server code, painstakingly passing arguments from one call to the next.

Let's say the requirement is something like this:

내가 가진 GitHub repository 중에 star가 가장 많은 걸 알려줘.
그리고 그 repository의 최근 커밋 수랑 contributor 수, issue 개수도 알려줘.

If you had to solve this requirement with code back in the day,
you'd have to check API specs, verify whether your code is correct, make sure the DTO mapping is right, and so on — all just to write pseudocode like the one below. It was a hassle.
And honestly, the pseudocode below doesn't even fully satisfy the requirements above. (You'd need to do more.)

# 기존 방식
import requests

headers = {
    "Authorization": "Bearer <MY_TOKEN>",
    "Accept": "application/vnd.github+json"
}

# 1. 내 전체 repo 가져오기
repos = requests.get("https://api.github.com/user/repos", headers=headers).json()

# 2. 가장 star 많은 repo 찾기
top_repo = max(repos, key=lambda r: r["stargazers_count"])

# 3. 커밋 정보 가져오기
commits = requests.get(f"https://api.github.com/repos/{top_repo['full_name']}/commits", headers=headers).json()

# 4. 통계 출력
print(f"{top_repo['name']}의 커밋 수: {len(commits)}")

But now with LLM + MCP, you just type in the requirement as-is.

It automatically figures out which MCP server descriptions are needed to fulfill the requirement, fills in the arguments on its own, and calls the GitHub API.
In fact, the search_repositories shown in that block represents the name of the MCP server protocol that was called.

{
    name: "search_repositories",
    description: "Search for GitHub repositories",
    inputSchema: zodToJsonSchema(repository.SearchRepositoriesSchema),
  },
   case "search_repositories": {
    const args = repository.SearchRepositoriesSchema.parse(request.params.arguments);
    const results = await repository.searchRepositories(
      args.query,
      args.page,
      args.perPage
    );
    return {
      content: [{ type: "text", text: JSON.stringify(results, null, 2) }],
    };
  }

Ultimately, MCP is about opening up an API endpoint just by writing hints in natural language so the LLM can figure out which API to use.

But to handle this small requirement, the LLM made 9 API calls — do we really need that many?
Isn't that way too much? Honestly, would a developer have used this many API calls if they coded it themselves? That's what I'm thinking.
~~(But it is convenient, though.)~~

With the old approach, I had full control over which APIs I was calling, what data I was sending, and where it was going.
With the MCP approach, the LLM and MCP server handle things on your behalf based on their interpretation of your intent, which means you might not always have clear visibility into what's being sent.

The flow I've described so far corresponds to the MCP Server C <-> Remote Service C part of the architecture explained in the MCP documentation.
Once you understand this, you should be able to quickly grasp the local data source part as well.

2. The Hidden API Call Explosion Created by LLM + MCP

As shown above, when you actually trace the process of an LLM making API calls through MCP, you can see that a single prompt leads to multiple API calls. These calls can be identified by analyzing logs or network traffic, and it turns out there are far more calls happening than expected.

So what happens when existing services start offering MCP servers on top of the open APIs they already provide? Their service call volume will increase,
and they'll have to handle the additional traffic generated by LLM + MCP — which means server programmers' ability to manage large-scale traffic becomes even more important, doesn't it? ~~(Hopeful thinking...)~~

On the other hand, for services that charge based on API call volume, incentivizing MCP server usage could be a great way to rake in money.

1. Caching Strategies

a. MCP Server In-Memory Caching

An LLM can ask the same question multiple times, and API responses typically don't change within a few seconds,
so caching response results should significantly reduce server load.
If you take advantage of the fact that the MCP server lives on your local machine, you can control traffic so it never even reaches the remote service.
From the remote service's perspective, it's essentially the same concept as a traditional client holding information in local storage and not making API calls to the server.

import express from "express"
import NodeCache from "node-cache" //가볍고 직관적인 in-memory 캐시 라이브러리야. TTL 기반으로 자동 만료
import axios from "axios"

const app = express()
const cache = new NodeCache({ stdTTL: 300 }) // 기본 TTL 5분

app.get("/commits/:owner/:repo", async (req, res) => {
  const { owner, repo } = req.params
  const cacheKey = `commits:${owner}/${repo}`

  // 1. 캐시에 있으면 리턴
  const cached = cache.get(cacheKey)
  if (cached) {
    console.log(`[CACHE HIT] ${cacheKey}`)
    return res.json(cached)
  }

  // 2. 외부 API 호출
  const response = await axios.get(
    `https://api.github.com/repos/${owner}/${repo}/commits`,
    {
      headers: {
        Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
        Accept: "application/vnd.github+json"
      }
    }
  )

  const data = response.data

  // 3. 캐시에 저장
  cache.set(cacheKey, data)

  console.log(`[CACHE MISS] ${cacheKey} - 저장 완료`)
  res.json(data)
})

Like the code above that caches API call results, if the MCP server you built is one that calls external APIs, you could use this strategy to reduce the number of external API calls.

However, if the client sends prompts like "the information seems inaccurate" or "this looks wrong," you'd need a strategy like resetting the cache and calling the API directly.

b. Prompt Caching / Semantic Caching

When the same prompt is repeatedly sent to an LLM, this approach pre-caches previous responses so it doesn't have to think from scratch (= consume tokens) every time.

“We do not currently cache prompts on our side. However, we recommend client-side caching if you’d like to avoid resending the same prompt multiple times.”

https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#continuing-a-multi-turn-conversation

Prompt caching - Anthropic

docs.anthropic.com

This is an approach provided by Claude, which can be considered an MCP client. LLM providers like Claude or OpenAI essentially play the role of MCP clients, and since from the client's perspective this directly ties into LLM usage costs (wanting to use LLMs while saving money..), they seem to officially support it.

In short, when using Claude, adding the following activates prompt caching.

"cache_control": {"type": "ephemeral"}

An example showing that it actually takes less time to get a response from the model is below. Comparing the non-cached API call and cached API call in Example 1, the time dropped significantly from 20s to 2s.

https://github.com/anthropics/anthropic-cookbook/blob/main/misc/prompt_caching.ipynb

anthropic-cookbook/misc/prompt_caching.ipynb at main · anthropics/anthropic-cookbook

A collection of notebooks/recipes showcasing some fun and effective ways of using Claude. - anthropics/anthropic-cookbook

github.com

In Example 2, the response time dropped from nearly 24 seconds to just 7-11 seconds after the initial cache setup, while maintaining the same level of quality across responses. The 7-11 seconds is mostly due to the time needed to generate the response, and by continuously adjusting the cache breakpoints, nearly 100% of input tokens were cached afterwards, which means the user message could be read almost instantly.

Does using prompt_caching make MCP servers more efficient? Well, that depends on the situation.

1. If the MCP server is only acting as a simple API bridge

It's much more efficient to cache external API responses internally within the MCP server. Because you're not using prompts at all.
In other words, if the MCP server is only acting as a simple API bridge, in-memory caching for API requests as described in option 1 is more effective.

2. What if the MCP server handles multiple responsibilities?

The prompt caching we've looked at so far is only efficient when the MCP server is structured to handle LLM prompt result generation as well.

User → LLM prompt composition → External API call → Response generation → Pass to LLM

If the MCP server handles intermediate logic and response composition, it can generate responses for the same prompt, so caching can be done at the MCP level itself.

At this point, it's not just about caching for identical prompts — I understand there's also an approach using semantic caching to cache semantically similar content.
It calculates semantic similarity, vectorizes it, and embeds it. When new input comes in, it's similarly vectorized, and if it's similar to the embedded data, the corresponding response is returned. ~~But thinking about implementing this myself gives me a headache. I don't want to know anymore.~~

Anyway, the point I'm trying to make is that if existing remote server API providers (today's server developers) start providing MCP servers as well, choosing the right caching strategy has become important in this new era.
Rather than the mindset of "I'll trust the remote server-side caching and handle the flood of traffic with a fancy server architecture!",
you need to consider adding in-memory caching at the MCP server level to reduce the traffic hitting the remote server.

But then again, if the business model charges users based on remote server API call volume, they might intentionally not add caching to the MCP server.
From the user's perspective, it might be better to use a caching MCP server that caches all the calls from MCP servers acting as API call bridges.

2. Request Throttling

As I mentioned above, once you start using MCP, the LLM doesn't just take in a single line of prompt — it interprets that prompt and starts calling multiple external APIs all at once.

Before, users called APIs directly, so you could somewhat predict "how many requests they'd send at once" and "how long execution would take."
But an LLM might chain 5, 10, or even more requests just to fulfill a single sentence's objective.

a. Rate Limiting

The problem is that traditional remote server APIs have rate limiting restrictions — things like "don't send more than 3 requests per second."
So when making API calls from the MCP server, you need to account for rate limiting. (It feels like we're taking the concerns that traditional clients used to deal with and baking them into the MCP server.)

https://github.com/jwaxman19/qlik-mcp/blob/main/src/index.ts

qlik-mcp/src/index.ts at main · jwaxman19/qlik-mcp

An MCP server to run qlik. Contribute to jwaxman19/qlik-mcp development by creating an account on GitHub.

github.com

The MCP server above is actually designed to visualize using the Qlik Cloud API, and if you look at the actual call code, you can see a delay applied for rate limiting.

   const data = await withRetry(async () => chartObject.getHyperCubeData('/qHyperCubeDef', [{
          qTop: startRow,
          qLeft: 0,
          qWidth: metadata.totalColumns,
          qHeight: rowCount
        }]));

        if (data?.[0]?.qMatrix) {
          allData.push(...data[0].qMatrix);
        }

        // Add delay between chunks to avoid rate limiting
        if (startRow + pageSize < rowsToFetch) {
          await delay(REQUEST_DELAY_MS);
        }

The rate limiting code was inside the pagination for-loop.

Other things worth considering include:

b. Timeout

If the MCP server keeps calling external APIs but the responses are coming back too slowly, it's good to set proper timeouts to force-terminate and guide the LLM to produce results using other MCP tools instead.

c. Concurrency Limits

When the LLM fires off multiple requests in parallel using MCP tools, the impact on the remote server grows accordingly. Even with the rate limiting from section (a), you can see the code only applies rate limiting to individual API requests.
However, since MCP can use multiple tools simultaneously to make API requests, running several tools at once could cause a burst of load on the remote server all at once.

So in Java, for example, you'd use an ExecutorService with a fixed thread pool to control the number of concurrent tasks when making API calls.

d. Circuit Breaker

What if your remote server is down but the LLM keeps making the MCP retry? Requests pile up on the remote server, and if the concurrency limits from section (c) are also in place, you could end up in a situation where resources can't be utilized at all. To prevent this, you may need logic that blocks API calls after a certain number of failures.

When I step back and look at what I've written, doesn't implementing an MCP server feel like providing both a server and a client at the same time? To integrate an MCP server into an existing system, it feels like you just need to apply all the performance tricks that used to live on the client side to the MCP server instead.

3. Security

This one scares me the most.

4. Technology Evolves, but the Fundamentals Stay the Same

Ever since LLMs came out, I keep hearing "aren't developers going to be out of a job?"
First of all, looking at MCP by itself, it's just a new form of API protocol. API requests just got closer to natural language, that's all.

So just because the requests coming from the frontend are now in natural language doesn't mean the server's role in processing them disappears. If anything, it means users can now make requests more conveniently.

At the end of the day, to build a service you still need to design specific flows, add caching for security and performance, and distribute traffic.
This is the same work developers have always done.

Back when we transitioned from the PC-only era to the mobile era, the concept of servers already existed. But with mobile, those servers started receiving requests from multiple environments, access became easier, and the volume of requests servers had to handle skyrocketed. So server developers racked their brains to propose various methodologies for performance improvement and came up with architectures leveraging existing concepts — isn't that what happened?

Now, as we transition from the mobile app era to the era of receiving services through LLMs, it's almost identical to before. Just like before, as it becomes easier for users to make server requests, servers will focus even more on performance improvements, and architectures leveraging existing concepts around client-server communication and security will emerge, along with even more efforts to push server performance further.

So developers aren't disappearing — rather, we should be growing into developers who can fill these gaps.
Ultimately, won't developers who deeply understand the fundamentals of existing technologies be the ones needed even more in the LLM era?
So I've arrived at the conclusion that... we still need to study development after all..

The end.

But honestly, I can't write blog posts with GPT. The stuff it writes is just too cringe.

'Develop > AI,LLM' 카테고리의 다른 글

Claude Code 티스토리 블로그 스킨 커스텀하기 \| Claude Code Customizing a Tistory Blog Skin (1)	2026.03.29