MongoDB querying performance for over 5 million records
We recently hit more than 2 million records in one of our main collections, and we are now starting to suffer from major performance issues on that collection.
The documents in the collection have about 8 fields that can be filtered through the UI, and the results are supposed to be sorted by a timestamp field recording when the record was processed.
We have added several compound indexes containing the filtered fields and the timestamp, e.g.:
db.events.ensureIndex({somefield: 1, timestamp:-1})
We have also added several indexes covering multiple filters at once, to get better performance. But some filters still take an awfully long time to run.
I have used explain() to verify that the queries do use the indexes I created, but performance is still not good enough.
I am wondering whether sharding is the way to go now... but we will soon have about 1 million new records per day in that collection, so I'm not sure it will scale well...
EDIT: example query:
> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['nickey@acme.com']}}).sort({timestamp: -1}).limit(25).explain()
{
"cursor" : "BtreeCursor user.userName_1_timestamp_-1",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 30060,
"nscanned" : 30060,
"nscannedObjectsAllPlans" : 120241,
"nscannedAllPlans" : 120241,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 26495,
"indexBounds" : {
"user.userName" : [
[
"nickey@acme.com",
"nickey@acme.com"
]
],
"timestamp" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
},
"server" : "yarin:27017"
}
deviceType has only 2 values in my collection.
This is searching for a needle in a haystack. We'd need some explain() output for the queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:
- Ensure it's not because of insufficient RAM and excessive paging
- Enable the DB profiler using db.setProfilingLevel(1, timeout), where timeout is the threshold in milliseconds for how long a query or command may take; anything slower will be logged (a minimal sketch of this workflow follows the list)
- Inspect the slow queries in db.system.profile and run them manually using explain()
- Try to identify the slow operations in the explain() output, such as scanAndOrder or a large nscanned, etc.
- Reason about the selectivity of the query and whether it's possible to improve the query using an index at all. If not, consider disallowing that filter setting for the end-user, or show them a warning dialog that the operation might be slow.
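A minimal sketch of that profiling workflow, assuming a 100 ms threshold (the threshold value and the collection used in the final check are just examples):
// log every operation slower than 100 ms
db.setProfilingLevel(1, 100)
// ...let the slow filters run, then look at the worst offenders
db.system.profile.find().sort({millis: -1}).limit(5).pretty()
// re-run one of the logged queries by hand to inspect its plan
db.audit.find({'userAgent.deviceType': 'MOBILE'}).sort({timestamp: -1}).explain()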
A key problem is that you're apparently allowing your users to combine filters at will. Without index intersection, that will blow up the number of required indexes dramatically.
Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.
Let's say you have a query for all users with status "active" and some other criteria. But of the 5 million users, 3 million are active and 2 million aren't, so across those 5 million entries there are only two different values. Such an index doesn't usually help. It's better to search for the other criteria first, then scan the results. On average, when returning 100 documents, you'll have to scan 167 documents (since only about 3 out of every 5 scanned documents are active), which won't hurt performance too badly. But it's not that simple. If the primary criterion is the joined_at date of the user and the likelihood of users discontinuing use over time is high, you might end up having to scan thousands of documents before finding a hundred matches.
So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.
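As a rough way to gauge that selectivity before adding an index, you can count values directly in the shell; the users collection and status field below are just the hypothetical example from the paragraph above, not the asker's schema:
// how many distinct values does the candidate field have?
db.users.distinct('status').length
// how evenly are the documents spread across those values?
db.users.count({status: 'active'})
db.users.count({status: {$ne: 'active'}})
A field with only a handful of values, each covering millions of documents, is a poor leading index key.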
Things get worse when the data is too big for the RAM, because then, having an index is great, but scanning (or even simply returning) the results might require fetching a lot of data from disk randomly which takes a lot of time.
The best way to control this is to limit the number of different query types, disallow queries on low selectivity information and try to prevent random access to old data.
If all else fails and if you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the mongo ids from there and then get the results from mongo using $in
. But that is fraught with its own perils.
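A rough sketch of that two-step pattern; the search-engine call is purely hypothetical and stands in for whatever external system returns the matching MongoDB ids:
// ids of matching documents, as returned by the external search engine (hypothetical helper)
var ids = searchEngine.findMatchingIds({deviceType: 'MOBILE', userName: 'nickey@acme.com'});
// fetch the actual documents from MongoDB and let it apply the sort and limit
db.audit.find({_id: {$in: ids}}).sort({timestamp: -1}).limit(25)
The $in list itself can get very large, which is one of the perils mentioned above.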
-- EDIT --
The explain you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "nickey@acme.com". Now, finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by high-selectivity indexes. Unfortunately, since there are only two device types, mongo needs to scan 30060 documents to find the first one that matches 'MOBILE'.
I assume this is some kind of web tracking, and the user's usage pattern makes the query slow (if they switched between mobile and web on a daily basis, the query would be fast).
Making this particular query faster could be done using a compound index that contains the device type, e.g. using
a) ensureIndex({'user.userName': 1, 'userAgent.deviceType': 1, 'timestamp': -1})
or
b) ensureIndex({'userAgent.deviceType': 1, 'user.userName': 1, 'timestamp': -1})
Unfortunately, that means that queries like find({"user.userName": "foo"}).sort({"timestamp": -1})
can't use the same index anymore, so, as described, the number of indexes will grow very quickly.
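In other words, to keep both query shapes fast you end up maintaining both compound indexes side by side; the second one below is the index the asker's explain output already shows, and this is exactly how the index count grows:
// serves the deviceType + userName filter sorted by timestamp (option a above)
db.audit.ensureIndex({'user.userName': 1, 'userAgent.deviceType': 1, 'timestamp': -1})
// still needed so that a userName-only query can sort by timestamp through an index
db.audit.ensureIndex({'user.userName': 1, 'timestamp': -1})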
I'm afraid there's no very good solution for this using mongodb at this time.
Mongo only uses 1 index per query. So if you want to filter on 2 fields, mongo will use the index with one of the fields, but still needs to scan the entire subset.
This means that basically you'll need an index for every type of query in order to achieve the best performance.
Depending on your data, it might not be a bad idea to have one query per field and process the results in your app. This way you only need an index on each field, but it may be too much data to process.
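A minimal sketch of that per-field approach, intersecting the results in application code; the filter values are taken from the example query in the question:
// one indexed query per filter field, keeping only the _ids
var userIdSet = {};
db.audit.find({'user.userName': 'nickey@acme.com'}, {_id: 1}).forEach(function (doc) {
    userIdSet[doc._id.str] = true;
});
var commonIds = [];
db.audit.find({'userAgent.deviceType': 'MOBILE'}, {_id: 1}).forEach(function (doc) {
    if (userIdSet[doc._id.str]) commonIds.push(doc._id);
});
// fetch the final, sorted page
db.audit.find({_id: {$in: commonIds}}).sort({timestamp: -1}).limit(25)
As noted, this can mean pulling a very large number of ids into the application only to discard most of them.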
If you are using $in, MongoDB never uses the index. Change your query by removing the $in; it should then use the index and give better performance than what you got earlier.
http://docs.mongodb.org/manual/core/query-optimization/
Reference URL: https://stackoverflow.com/questions/19559405/mongodb-querying-performance-for-over-5-million-records